The old “garbage in, garbage out” adage has never gone out of style. The ravenous appetite for data on the part of analytics and machine learning models has elevated the urgency to get the data right. The discipline of DataOps has emerged in response to the need for business analysts and data scientists alike to have confidence in the data that populates their models and dashboards.
The stakes for getting data right are rising as data engineers and data scientists build countless data pipelines to populate their models. We have long worried about AI and ML model drift, but could the same happen with data sources that degrade or go stale? Or with data pipelines whose operations gradually veer off course owing to issues such as unexpected latency that can throw off the reliability of data filtering or transforms?
The discipline of DataOps spotlights the use of automation to address data quality at scale. Yet applying automated data quality or cataloging tools won’t ensure that the data sets in use are the right or most relevant ones for the problem, nor can it guarantee freshness or currency. At best, the answers are ad hoc: with numerous sources of data lineage, the question often boils down to which version of the truth to follow. Furthermore, data quality tools may not always provide full coverage. As for data catalogs, at best they give team members an opportunity to comment anecdotally on the usefulness of the data. All too often, DataOps occurs on an ad hoc, break/fix basis.
A team at Uber experienced the problem firsthand, contending with confidence issues as data pipelines proliferated by the thousands. Kyle Kirwan, a former product manager at Uber, came to the realization that data professionals needed a more continuous focus on managing data quality and relevancy. Specifically, a new discipline of “Data Reliability Engineering,” modeled after Site Reliability Engineering, was needed to keep a constant eye on the data.
The result is Bigeye, a startup that just received its second major round of funding (bringing the total to $66 million) and has introduced what it terms a “data observability” platform to help organizations create a data reliability engineering practice.
Delivered as a cloud service, Bigeye continuously samples each data set, building an ongoing timeline of data profiling that checks metrics such as row counts, cardinality, duplicates, nulls and blanks, syntax, expected values, and other outliers. It also tracks “freshness” based on the dataset’s timestamps and when it was last updated. Thresholds can be set manually or through algorithmic recommendations.
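Conceptually, this kind of continuous profiling boils down to recomputing a handful of column-level metrics on each sample and comparing them against thresholds. Here is a minimal sketch in Python; the table, column names, and the three-sigma rule for “algorithmic” thresholds are illustrative assumptions, not Bigeye’s actual implementation:

```python
from datetime import datetime, timedelta
from statistics import mean, stdev

# Hypothetical sample of rows pulled from a warehouse table.
rows = [
    {"id": 1, "email": "a@x.com", "updated_at": datetime(2021, 9, 20, 12)},
    {"id": 2, "email": None,      "updated_at": datetime(2021, 9, 20, 13)},
    {"id": 3, "email": "a@x.com", "updated_at": datetime(2021, 9, 20, 14)},
]

def profile(rows, column):
    """Basic health metrics for one column: nulls, cardinality, duplicates."""
    values = [r[column] for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "cardinality": len(set(non_null)),
        "duplicate_count": len(non_null) - len(set(non_null)),
    }

def is_fresh(rows, now, max_age_hours=24):
    """Freshness: was the table updated within the expected window?"""
    latest = max(r["updated_at"] for r in rows)
    return (now - latest) <= timedelta(hours=max_age_hours)

def anomalous(history, latest, sigmas=3):
    """A naive algorithmic threshold: flag a metric that lands more than
    `sigmas` standard deviations from its historical mean."""
    mu, sd = mean(history), stdev(history)
    return abs(latest - mu) > sigmas * sd

print(profile(rows, "email"))
# → {'row_count': 3, 'null_count': 1, 'cardinality': 1, 'duplicate_count': 1}
print(is_fresh(rows, now=datetime(2021, 9, 21, 10)))      # → True (20h old)
print(anomalous([1000, 1010, 990, 1005], latest=400))     # → True (row count crashed)
```

The key idea is that only the metric time series, not the raw rows, needs to be retained to spot drift.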
Bigeye doesn’t store the raw data per se; instead, it stores and tracks these health metrics over time. Currently, Bigeye has integrations with most of the usual suspects, including Snowflake, Google BigQuery, Amazon Redshift, PostgreSQL, MySQL, SQL Server, and Databricks.
At this point, Bigeye is designed to turn data profiling into a continuous, dynamic activity through constant sampling of data feeds. That, in essence, provides the observability piece. To enable data reliability engineering, Bigeye plans to add workflows for monitoring and managing SLAs, along with capabilities for root cause analysis. Part of this could be addressed by analyzing data lineage. However, even if the sources of data continue to prove out, blips in server or network performance could corrupt the data; for instance, a blip in a network feed could compromise the reliability of data derived from time series sources. This is where application observability could tie in to build the full picture, and why we believe synergies with Datadog are not just theoretical.