Ensuring Data Quality with Snowflake
Republished from the original post.
Introduction
Why would we be collecting data if not to make business-critical decisions? And how accurate would our decisions be if our data were of low quality?
Actually, it depends. Sometimes we can model from data that is not so good by cleaning it up, making assumptions, and so on. But generally, the lower the quality of the data we work with, the less accurate the insights we can produce. So to maintain proper data quality, we must understand both our customers (the tolerances) and our data (its inherent structure and quality).
In the application development world, we have our unit tests, integration tests, end-to-end tests, performance tests, and many other varieties.
The data world is a bit more complex and consists of the following:
Because of this, ensuring data quality is generally a bigger and harder problem than in a typical application development lifecycle.
We are iteratively building a Data Quality Framework that is based on:
Ensuring Initial Data Quality
How well do we know our data?
It would be impossible to decide whether we can produce meaningful insights until we understand all of the above.
Here are some of the ways we used to learn about our data and its behavior:
The general workflow we have implemented can be found in my prior post.
What you will find there is an ELT pipeline, where we ingest raw data from various data sources using AWS Step Functions, as well as Change Data Capture events, directly into Snowflake, and then implement transformations in Snowflake, turning Snowflake into both a Data Lake and a Data Warehouse.
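For illustration, a stripped-down version of the kind of reconciliation check we run right after ingestion might look like the following. This is a minimal sketch using snowflake-connector-python; the connection parameters, databases, warehouse, and table names are hypothetical placeholders, not our actual objects.

```python
# Minimal sketch: reconcile row counts between a raw landing table and its
# curated counterpart after a load. All identifiers are hypothetical.
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="ANALYTICS_WH",  # hypothetical warehouse
)

def row_count(table: str) -> int:
    """Return the current row count of a fully qualified table."""
    cur = conn.cursor()
    try:
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        return cur.fetchone()[0]
    finally:
        cur.close()

raw_rows = row_count("RAW_DB.LANDING.ORDERS")            # data-lake side
curated_rows = row_count("ANALYTICS_DB.CURATED.ORDERS")  # data-warehouse side

if curated_rows < raw_rows:
    raise RuntimeError(f"Curated table is missing {raw_rows - curated_rows} rows")

conn.close()
```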
There isn’t a tool better suited to support data transformations in Snowflake than DBT, which is uniquely positioned to support the SDLC, improve data quality, and simplify incremental updates. Unfortunately, due to the volume of diverse datasets, the variable quality of our initial data sources, the intrinsic complexity of the data transformations, and time-to-market constraints, bringing in a tool like DBT on day one proved too difficult at the inception of the project. We learned that implementing DBT under time constraints works best when there is at least a minimal understanding of the behavior of the underlying data and some organizational structure is already in place. The DBT implementation was therefore postponed initially and is being worked on as we speak.
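In the meantime, the kind of checks that DBT’s generic tests (not_null, unique) would give us for free can be approximated by hand. Below is a minimal Python sketch of that idea, with hypothetical table and column names; it assumes an open snowflake-connector-python connection is passed in, as in the earlier example.

```python
# Hand-rolled stand-ins for dbt-style not_null and unique tests.
# Table and column names are hypothetical; `conn` is an open
# snowflake-connector-python connection.

def assert_not_null(conn, table: str, column: str) -> None:
    cur = conn.cursor()
    try:
        cur.execute(f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL")
        nulls = cur.fetchone()[0]
        assert nulls == 0, f"{table}.{column} has {nulls} NULL values"
    finally:
        cur.close()

def assert_unique(conn, table: str, column: str) -> None:
    cur = conn.cursor()
    try:
        cur.execute(
            f"SELECT {column}, COUNT(*) FROM {table} "
            f"GROUP BY {column} HAVING COUNT(*) > 1"
        )
        dupes = cur.fetchall()
        assert not dupes, f"{table}.{column} has {len(dupes)} duplicated values"
    finally:
        cur.close()
```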
We have taken the following steps to ensure initial data quality to date:
Testing Transformations
Initially, we chose not to bring in a workflow management tool such as Apache Airflow (Amazon MWAA), for a number of reasons:
Our transformations today come in two flavors:
Testing transformations takes a combination of a typical development testing workflow (unit/integration/functional tests with pytest/Gauge) and pure data quality testing techniques, which use seeding and isolated development environments to test changes to SQL and stored procedures in isolation, with DBT coming soon to help.
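A minimal pytest sketch of that seed-and-verify pattern follows. The schema, table, and stored procedure names are hypothetical, and a snowflake_conn fixture is assumed to yield an open connection to a development account.

```python
# Minimal sketch: seed a throwaway schema, run the transformation under test,
# and assert on its output. All identifiers are hypothetical.
import pytest

@pytest.fixture
def seeded_schema(snowflake_conn):
    """Create an isolated schema, seed a tiny input table, drop it afterwards."""
    cur = snowflake_conn.cursor()
    cur.execute("CREATE SCHEMA IF NOT EXISTS DEV_TEST_ORDERS")
    cur.execute(
        "CREATE OR REPLACE TABLE DEV_TEST_ORDERS.RAW_ORDERS (ID INT, AMOUNT NUMBER(10,2))"
    )
    cur.execute("INSERT INTO DEV_TEST_ORDERS.RAW_ORDERS VALUES (1, 10.00), (2, 25.50)")
    yield "DEV_TEST_ORDERS"
    cur.execute("DROP SCHEMA IF EXISTS DEV_TEST_ORDERS CASCADE")
    cur.close()

def test_order_totals_transformation(snowflake_conn, seeded_schema):
    cur = snowflake_conn.cursor()
    # The stored procedure under test is a hypothetical example.
    cur.execute(f"CALL {seeded_schema}.BUILD_ORDER_TOTALS()")
    cur.execute(f"SELECT SUM(TOTAL_AMOUNT) FROM {seeded_schema}.ORDER_TOTALS")
    assert float(cur.fetchone()[0]) == 35.50
    cur.close()
```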
Testing Microservices
Our data would be useless if we kept it all to ourselves. We are currently exposing it through a BI tool (Domo), SnowSight, REST (through AWS API Gateway and AWS Step Functions), and Retool, with single-page web apps coming soon. As you may be able to observe from my prior post, we already have a number of different applications and micro-services accessing curated data in a variety of ways, and more will be coming.
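As an illustration of the REST path, a handler sitting behind API Gateway and Step Functions might look roughly like the sketch below. Every identifier is hypothetical, and credential handling is simplified for the sake of the example.

```python
# Minimal sketch of a handler that reads curated data from Snowflake and
# returns it as JSON for an API Gateway response. All identifiers are
# hypothetical; secrets handling is deliberately simplified.
import json
import os
import snowflake.connector

def handler(event, context):
    customer_id = event.get("customer_id")
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
    )
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT METRIC_NAME, METRIC_VALUE "
            "FROM ANALYTICS_DB.CURATED.CUSTOMER_METRICS "
            "WHERE CUSTOMER_ID = %s",
            (customer_id,),
        )
        rows = [{"name": n, "value": float(v)} for n, v in cur.fetchall()]
    finally:
        conn.close()
    return {"statusCode": 200, "body": json.dumps(rows)}
```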
When we deployed our first application, we compared results from an existing Django-based system with the results coming from Snowflake. Therefore, in addition to everything I described above, we also had to run a check against a vetted “golden copy”. We developed a Jupyter Notebook which creates an extensive delta report to help engineers verify every mismatched record. This notebook is currently being transformed into a full regression test written with Gauge, which can be applied to every data point, with the end results delivered to the engineer’s mailbox for detailed analysis. I am pleased to say that today the results coming from Snowflake are often more accurate than the “golden copy”.
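The core of that delta report boils down to something like the following pandas sketch. The column names and the extracts feeding the two DataFrames are hypothetical, and both extracts are assumed to share the same columns.

```python
# Minimal sketch: list records that differ between the vetted "golden copy"
# and the same records pulled from Snowflake. Column names are hypothetical,
# and both DataFrames are assumed to have identical schemas.
import pandas as pd

def delta_report(golden: pd.DataFrame, snowflake: pd.DataFrame,
                 key: str = "order_id") -> pd.DataFrame:
    """Return rows missing from either source or carrying different values."""
    merged = golden.merge(
        snowflake, on=key, how="outer",
        suffixes=("_golden", "_snowflake"), indicator=True,
    )
    value_cols = [c for c in golden.columns if c != key]
    # Rows present in only one source...
    mismatched = merged["_merge"] != "both"
    # ...or present in both but with differing values in any column.
    for col in value_cols:
        mismatched |= merged[f"{col}_golden"] != merged[f"{col}_snowflake"]
    return merged[mismatched]
```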
To test Python-based step functions, we focus on a typical development workflow: unit/integration/functional testing with pytest/Gauge. This approach also comes in handy when testing our next-gen architecture, which relies heavily on Apache Kafka (Amazon MSK) and event-driven micro-services.
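The pattern itself is plain pytest. The step below, normalize_order, is a hypothetical stand-in for one of our Step Function tasks; the same shape applies to handlers consuming Kafka messages.

```python
# Minimal sketch: a hypothetical transformation step and its unit test.
def normalize_order(event: dict) -> dict:
    """Example step: trim identifiers and cast amounts before loading."""
    return {
        "order_id": event["order_id"].strip(),
        "amount": float(event["amount"]),
    }

def test_normalize_order_strips_whitespace_and_casts_amount():
    raw_event = {"order_id": " 42 ", "amount": "19.99"}
    assert normalize_order(raw_event) == {"order_id": "42", "amount": 19.99}
```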
Roadmap
Conclusion
Our Snowflake journey unfolded at the speed of light, which meant we were learning as we moved along, and we will have to revisit a number of initial decisions (see Roadmap). Those decisions were made under constraints that we hope to lift incrementally over the coming months.
Perhaps the most important lesson of our journey is that the Data Quality Framework established above seems to be working and is constantly improving in an iterative, agile manner. It also seems to improve the team’s performance: as manual, repetitive testing is automated away, quality improves, resulting in fewer bugs and investigations.