Ensuring Data Quality with Snowflake

Republished from the original post.

Introduction

Why would we collect data if not to make business-critical decisions? And how accurate would those decisions be if our data were of low quality?

Actually, it depends. Sometimes we can model data that is not so good by cleaning it up, making assumptions, and so on. But generally, the lower the quality of the data we work with, the less accurate the insights we can produce. So to maintain proper data quality, we must understand both our customers (the tolerances they can accept) and our data (its inherent structure and quality).

In the application development world, we have our unit tests, integration tests, end-to-end tests, performance tests, and many other varieties.

The data world is a bit more complex and involves all of the following:

  • Raw data sources may be inaccurate, noisy, have issues with referential integrity, etc.
  • Transformations may be pure SQL, code, or both, and behave more like our application code.
  • Data retrieval, where someone may mistype a column name or misinterpret a data type.
  • Workflow complexity, where things seem to break as seemingly accurate data propagates through the steps of a workflow.

Because of this, ensuring data quality is typically a bigger and harder problem than quality assurance in a typical application development lifecycle.

We are iteratively building a Data Quality Framework that is based on:

  • knowing your data
  • an uninterrupted feedback loop based on monitoring and automation
  • improved tooling
  • applying a DataOps mindset and practices to learning and continuous improvement

Ensuring Initial Data Quality

How well do we know our data?

  • what meaning/insights does this data convey
  • how frequently is it updated
  • what do we consider noise and how can it be cleaned
  • which data features are we most concerned with
  • how does each dataset interact with the others
  • what are the expected data types and values in each column
  • is referential integrity properly maintained
  • what are the proper, uniquely identifiable primary keys

It would be impossible to decide whether we can produce meaningful insights until we understand all of the above.
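
As a starting point, many of these questions can be answered with lightweight profiling queries. The sketch below is only an illustration, assuming the snowflake-connector-python package and a hypothetical raw.orders table with an order_id primary key; it is not a description of our actual tables:

```python
# A minimal profiling sketch. The table, columns, and connection parameters are
# hypothetical placeholders, not our real objects.
import snowflake.connector

PROFILE_SQL = """
SELECT
    COUNT(*)                   AS row_count,
    COUNT(DISTINCT order_id)   AS distinct_pk,     -- should equal row_count
    COUNT_IF(order_id IS NULL) AS null_pk,         -- should be zero
    MIN(order_ts)              AS earliest_record,
    MAX(order_ts)              AS latest_record    -- hints at update frequency
FROM raw.orders
"""

def profile_table(conn) -> dict:
    """Run the profiling query and return a column-name -> value mapping."""
    cur = conn.cursor()
    try:
        cur.execute(PROFILE_SQL)
        columns = [col[0] for col in cur.description]
        return dict(zip(columns, cur.fetchone()))
    finally:
        cur.close()

if __name__ == "__main__":
    # Connection parameters are placeholders; substitute your own account and credentials.
    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="***",
        warehouse="my_wh", database="raw", schema="public",
    )
    try:
        print(profile_table(conn))
    finally:
        conn.close()
```

In practice, the output of queries like this can be persisted into metadata tables so the answers are tracked over time (more on that below).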

Here are some of the ways we have used to learn about our data and its behavior:

  • From the business. Documentation, talking to data owners, as well as reading the original code.
  • Visually. Using metadata and representing it in SnowSight. For example, the graphic below (apologies for scrubbing out the proprietary detail) shows the changes in ingested unique primary keys across a set of tables, with the left side showing absolute changes and the right side showing the percentage change on the day. Here, the cyclical pattern of spikes on the right side led to an investigation of the update patterns for that table, which ultimately led to the discovery of a scheduled job that was running every Saturday. It turned out to be expected behavior, but the investigation gave us a deeper understanding of our data sources. A sketch of the kind of query behind this view follows this list.

[Image: daily changes in unique primary key counts per table, with absolute change on the left and percentage change on the right]

  • DBT documentation (upcoming)
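
The chart above is driven by exactly this kind of metadata. As a hedged sketch, assuming a hypothetical metadata.ingestion_metadata table with one row per table per day holding the distinct primary-key count observed after that day’s loads, the day-over-day absolute and percentage changes can be computed with a window function and charted in SnowSight:

```python
# A sketch of the kind of metadata query behind the chart: day-over-day change in the
# number of distinct primary keys ingested per table. The metadata.ingestion_metadata
# table and its columns are hypothetical.
DAILY_PK_DELTA_SQL = """
SELECT
    table_name,
    load_date,
    unique_pk_count,
    unique_pk_count - prev_count AS absolute_change,   -- left-hand chart
    ROUND(100 * (unique_pk_count - prev_count) / NULLIF(prev_count, 0), 2)
        AS pct_change                                   -- right-hand chart
FROM (
    SELECT
        table_name,
        load_date,
        unique_pk_count,
        LAG(unique_pk_count) OVER (
            PARTITION BY table_name ORDER BY load_date
        ) AS prev_count
    FROM metadata.ingestion_metadata
)
ORDER BY table_name, load_date
"""

def fetch_daily_pk_deltas(conn):
    """Return rows suitable for charting in SnowSight or a notebook."""
    cur = conn.cursor()
    try:
        cur.execute(DAILY_PK_DELTA_SQL)
        return cur.fetchall()
    finally:
        cur.close()
```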

The general workflow we have implemented (below) can also be found in my prior post.

[Image: ELT pipeline workflow, from raw data sources through ingestion into Snowflake to in-Snowflake transformations]

What you will find here is an ELT pipeline: we ingest raw data from various data sources, via AWS Step Functions as well as Change Data Capture events, directly into Snowflake, and then implement transformations in Snowflake, thus turning Snowflake into both a Data Lake and a Data Warehouse.

There isn’t a tool better suited to supporting data transformations in Snowflake than DBT, which is uniquely positioned to support the SDLC, improve data quality, and simplify incremental updates. Unfortunately, due to the volume of diverse datasets, the variable quality of our initial data sources, the intrinsic complexity of the data transformations, and time-to-market constraints, bringing in a tool like DBT on day one proved too difficult at the inception of the project. We learned that implementing DBT under time constraints works best when there is at least a minimal understanding of the behavior of the underlying data and some organizational structure has already been established. DBT implementation was therefore initially postponed, and it is being worked on as we speak.

We have taken the following steps to ensure initial data quality to date:

  • described and catalogued datasets in Confluence
  • identified the primary keys to help our transformations
  • created metadata tables at each step to enable visual insights
  • identified an initial set of notification thresholds that trigger PagerDuty (see the sketch after this list)
  • partially automated, and are continuing to automate, regression of the most important artifacts using Gauge
  • engineers use seed data in a dedicated staging environment to manually test every transformation
  • as we gained a much deeper understanding of our data and were able to consolidate the original raw data sources into just a few tables in our new Star Schema, we are bringing in DBT to improve data recency, automated (unit/seed) testing, automated quality testing (nulls/uniqueness/referential integrity/etc.), engineer productivity, cataloging, documentation, and much more
  • executing DevOps-like postmortems on all issues discovered
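
To make the metadata tables and PagerDuty thresholds above more concrete, here is a rough sketch. The metadata.step_metrics table, the threshold values, and the PAGERDUTY_ROUTING_KEY environment variable are assumptions for illustration; the alert itself goes through PagerDuty’s public Events API v2:

```python
# A minimal threshold-check sketch: read per-step row counts from a hypothetical
# metadata table and trigger a PagerDuty event when a step falls below its threshold.
import os
import requests

ROW_COUNT_SQL = """
SELECT step_name, load_date, row_count
FROM metadata.step_metrics
WHERE load_date = CURRENT_DATE()
"""

# Illustrative thresholds: the minimum row count we expect each step to produce per day.
MIN_EXPECTED_ROWS = {"ingest_orders": 10_000, "transform_orders": 10_000}

def page_on_breach(conn) -> None:
    cur = conn.cursor()
    try:
        cur.execute(ROW_COUNT_SQL)
        for step_name, load_date, row_count in cur.fetchall():
            expected = MIN_EXPECTED_ROWS.get(step_name)
            if expected is not None and row_count < expected:
                _trigger_pagerduty(
                    f"{step_name} produced {row_count} rows on {load_date}, "
                    f"expected at least {expected}"
                )
    finally:
        cur.close()

def _trigger_pagerduty(summary: str) -> None:
    """Send a trigger event to PagerDuty (Events API v2)."""
    requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": os.environ["PAGERDUTY_ROUTING_KEY"],  # assumed env var
            "event_action": "trigger",
            "payload": {
                "summary": summary,
                "source": "snowflake-data-quality",
                "severity": "warning",
            },
        },
        timeout=10,
    )
```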

Testing Transformations

Initially, we chose not to bring in a workflow management tool such as Apache Airflow (Amazon MWAA). We did so for a number of reasons:

  • Most of our data sources today require CDC and flow directly through the Snowflake Kafka Connector
  • For those few initial sources that come in as batch, such as files in Google Drive, logs, etc., we thought that an Airflow-based solution such as Amazon MWAA carried too much overhead, in terms of both cost and maintenance, at the project’s inception. We decided to stick with AWS Step Functions, which may not have been perfectly suited to the problem at hand, but were easier to learn and implement by a team of engineers who did not have prior experience with Apache Airflow.

Our transformations today come in two flavors:

  • Lambda-driven, triggered by external events
  • scheduled task hierarchies in Snowflake (see the sketch below)
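
For the second flavor, a scheduled task hierarchy in Snowflake looks roughly like the sketch below; the task, warehouse, and stored procedure names are illustrative only, not our actual objects:

```python
# A rough sketch of a two-level Snowflake task hierarchy: a root task on a cron schedule
# and a dependent task that runs after it. All object names here are illustrative.
TASK_DDL = [
    """
    CREATE OR REPLACE TASK load_orders
      WAREHOUSE = transform_wh
      SCHEDULE = 'USING CRON 0 6 * * * UTC'
    AS
      CALL load_orders_sp()
    """,
    """
    CREATE OR REPLACE TASK enrich_orders
      WAREHOUSE = transform_wh
      AFTER load_orders
    AS
      CALL enrich_orders_sp()
    """,
    # Child tasks are resumed first and the root task last, so the chain starts cleanly.
    "ALTER TASK enrich_orders RESUME",
    "ALTER TASK load_orders RESUME",
]

def deploy_task_hierarchy(conn) -> None:
    """Execute the DDL statements one by one over an open Snowflake connection."""
    cur = conn.cursor()
    try:
        for statement in TASK_DDL:
            cur.execute(statement)
    finally:
        cur.close()
```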

Testing transformations takes a combination of a typical development testing workflow, such as unit/integration/functional testing with pytest/Gauge, and pure data quality testing techniques, which use seeding and isolated development environments to test changes to SQL and stored procedures in isolation, with DBT coming soon to help.
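
A pytest check in that style might look like the sketch below, where the snowflake_conn fixture, the seeded schema layout, and the transform_orders stored procedure are all hypothetical:

```python
# A pytest sketch: load seed data into an isolated schema, run a transformation,
# and assert on the output. Schema, table, and procedure names are hypothetical,
# and snowflake_conn is an assumed fixture providing an open Snowflake connection.
import uuid
import pytest

@pytest.fixture
def isolated_schema(snowflake_conn):
    """Create a throwaway schema so the test never touches shared data."""
    schema = f"test_{uuid.uuid4().hex[:8]}"
    cur = snowflake_conn.cursor()
    cur.execute(f"CREATE SCHEMA {schema}")
    try:
        yield schema
    finally:
        cur.execute(f"DROP SCHEMA {schema} CASCADE")
        cur.close()

def test_orders_transformation_deduplicates(snowflake_conn, isolated_schema):
    cur = snowflake_conn.cursor()
    # Seed two raw rows that share a primary key; the transformation should keep one.
    cur.execute(f"CREATE TABLE {isolated_schema}.raw_orders (order_id INT, amount NUMBER)")
    cur.execute(f"INSERT INTO {isolated_schema}.raw_orders VALUES (1, 10), (1, 10), (2, 5)")
    cur.execute(f"CALL transform_orders('{isolated_schema}')")  # hypothetical stored procedure
    cur.execute(f"SELECT COUNT(*), COUNT(DISTINCT order_id) FROM {isolated_schema}.orders")
    total, distinct = cur.fetchone()
    assert total == distinct == 2
```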

Testing Microservices

Our data would be useless if we kept it all to ourselves. We currently expose it through a BI tool (Domo), SnowSight, REST (via AWS API Gateway and AWS Step Functions), and Retool, with single-page web apps coming soon. As you may be able to observe from my prior post, a number of different applications and microservices are already accessing curated data in a variety of ways, and more are coming.

When we deployed our first application, we were comparing results from an existing Django-based system with the results coming from Snowflake. Therefore, in addition to everything described above, we also had to run a check against a vetted “golden copy”. We developed a Jupyter Notebook that creates an extensive delta report to help engineers verify every mismatched record. This notebook is currently being transformed into a full regression test written with Gauge, which can be applied to every data point, with the end results delivered to the engineer’s mailbox for detailed analysis. I am pleased to say that today the results coming from Snowflake are often more accurate than the “golden copy”.
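
At its core, the delta report boils down to a full outer join on the primary key, keeping only the rows where the two systems disagree. A pared-down sketch (with hypothetical column names, not the notebook’s actual code) looks like this:

```python
# A pared-down delta-report sketch comparing Snowflake output against the vetted
# "golden copy" from the legacy Django-based system. It assumes both frames share
# the same columns; names here are hypothetical.
import pandas as pd

def build_delta_report(snowflake_df: pd.DataFrame, golden_df: pd.DataFrame,
                       key: str = "order_id") -> pd.DataFrame:
    """Return only the rows where the two systems disagree, or where one side is missing."""
    merged = snowflake_df.merge(
        golden_df, on=key, how="outer", suffixes=("_snowflake", "_golden"), indicator=True
    )
    value_columns = [c for c in snowflake_df.columns if c != key]
    mismatch = pd.Series(False, index=merged.index)
    for col in value_columns:
        mismatch |= merged[f"{col}_snowflake"].ne(merged[f"{col}_golden"])
    return merged[(merged["_merge"] != "both") | mismatch]

# Example usage in a notebook (queries and connections assumed):
# deltas = build_delta_report(snowflake_df, golden_df)
# deltas.to_csv("delta_report.csv", index=False)  # attached to the email sent to engineers
```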


To test the Python-based step functions, we focus on a typical development workflow, such as unit/integration/functional testing with pytest/Gauge. This approach also comes in handy when testing our next-gen architecture, which relies heavily on Apache Kafka (Amazon MSK) and event-driven microservices.
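
A typical unit test in that workflow exercises a Lambda handler directly with a synthetic event, with no AWS infrastructure required; the my_pipeline.handlers module and the normalize_order handler below are hypothetical stand-ins:

```python
# A unit-test sketch for a hypothetical Lambda handler used inside a Step Functions
# workflow: the handler is called directly with a synthetic event and no AWS resources.
import pytest
from my_pipeline.handlers import normalize_order  # hypothetical handler module

def test_normalize_order_strips_whitespace_and_casts_amount():
    event = {"order_id": " 42 ", "amount": "19.99"}
    assert normalize_order(event, context=None) == {"order_id": "42", "amount": 19.99}

def test_normalize_order_rejects_missing_primary_key():
    with pytest.raises(ValueError):
        normalize_order({"amount": "19.99"}, context=None)
```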

Roadmap

  • As our batch volumes increase, so will the complexity of managing those data sources. At that juncture, we will bring in Amazon MWAA or an equivalent fully managed Airflow distribution, to both implement workflows and perform additional testing directly from Airflow DAGs as well as through our upcoming DBT pipelines.
  • We are dealing with a few roadblocks in bringing data into Snowflake. The performance characteristics of our CDC data made it impossible to perform a meaningful scan of the data before it flows into Kafka or immediately before it is loaded into Snowflake. One of the most important features Monte Carlo could provide us today is its use of Artificial Intelligence to identify patterns in our data, such as update frequency, value ranges, etc. That ability to detect errors would be invaluable if we could also apply it to Kafka. Today, we have to rely on Kafka monitoring and other metadata to make sure we did not lose anything during initial ingestion (a sketch of one such check follows this list). This approach is somewhat imperfect and required a lot of work to reach the required level of confidence. As Monte Carlo brings in additional features, it will become more attractive to us, so we will be revisiting our initial decision in the coming months.
  • Snowflake is working on native-SQL machine learning capabilities, which we could use to improve our own error detection.
  • Although regressions have not yet been baked into our automated deployment pipeline, which uses Terraform/GitHub Actions, that is also on the roadmap.
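
For reference, the Kafka-side metadata check mentioned in the roadmap above can lean on the RECORD_METADATA column that the Snowflake Kafka Connector writes alongside each row. The sketch below assumes the connector’s default metadata layout and a hypothetical raw.orders_landing table; it flags per-partition offset gaps, but only after the data has already reached Snowflake:

```python
# A sketch of an offset-gap check over the RECORD_METADATA column written by the
# Snowflake Kafka Connector. The landing table name is a hypothetical placeholder,
# and this only detects gaps once rows have already landed in Snowflake.
OFFSET_GAP_SQL = """
WITH offsets AS (
    SELECT
        record_metadata:partition::INT AS kafka_partition,
        record_metadata:offset::INT    AS kafka_offset
    FROM raw.orders_landing
)
SELECT kafka_partition, prev_offset, kafka_offset
FROM (
    SELECT
        kafka_partition,
        kafka_offset,
        LAG(kafka_offset) OVER (
            PARTITION BY kafka_partition ORDER BY kafka_offset
        ) AS prev_offset
    FROM offsets
)
WHERE prev_offset IS NOT NULL
  AND kafka_offset <> prev_offset + 1
"""

def find_offset_gaps(conn):
    """Return (partition, previous offset, current offset) wherever consecutive offsets are missing."""
    cur = conn.cursor()
    try:
        cur.execute(OFFSET_GAP_SQL)
        return cur.fetchall()
    finally:
        cur.close()
```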

Conclusion

Our Snowflake journey unfolded at the speed of light, which meant we were learning as we went along, and we will have to revisit a number of initial decisions (see Roadmap). Those decisions were made under constraints, which we hope to incrementally lift over the coming months.

Perhaps the most important learning of our journey is that the Data Quality Framework established above seems to be working and is constantly improving in an iterative, agile manner. It also seems to help improve the team’s performance: as manual, repetitive testing is automated away, quality improves, resulting in fewer bugs and investigations.
