Some of the largest outages on the internet can be traced back not only to changes in code, but also to how those changes altered underlying data models. In countless discussions, software engineers have noted the importance of the underlying data model for quality development, yet also highlighted the lack of incentives from leadership (or outright discouragement) to put in the extra effort to maintain it. Even more troubling, data issues impact not only applications: downstream consumers within the business also take major dependencies on this data for business-critical workflows, unbeknownst to the upstream engineers producing it (i.e., shadow dependencies). In this talk, we highlight this growing problem, explain why engineering leadership is paying more attention to data risk, and show how to surface and prevent these issues within the CI/CD workflow via an emerging pattern called "data contracts."
Mark Freeman, co-author of the upcoming O’Reilly book "Data Contracts: Developing Production-Grade Pipelines at Scale," has spent the past two years collaborating with organizations to implement data contracts and refine best practices. Previously, he was a data scientist turned data engineer with experience building, deploying, and maintaining data products, such as consumer-facing algorithms, personalized insights, and machine learning models, across the entire development lifecycle. This work has put him at the intersection of software and data teams, where he has become obsessed with the translation between the two technical fields.