Art of Data Newsletter - Issue #7
Welcome, all data fanatics. In today's issue:
Let's dive in!
Mojo is a new programming language that promises to address the major flaws of Python. Mojo code is "compiled" into MLIR, a multi-level compiler intermediate representation that modern toolchains can optimize into highly efficient native code. MLIR does not target hardware directly; instead, it is lowered to native code for particular CPU and GPU instruction sets in a staging step after compilation. This means a program written in Mojo is completely self-contained and can be deployed on any machine with a small, well-defined runtime.
dbt is a tool that has revolutionized how analytics are done. It allows teams to collaborate more effectively and pushes the industry towards being more efficient by embracing techniques from software engineering such as modularity, documentation, CI/CD, source control, testing, environments, and SLAs. However, dbt does not currently provide the tools for data teams to implement the same solutions that software engineering teams use, so data organizations still struggle with velocity, collaboration, quality, and SLA achievement. To improve, teams need the ability to define and validate interfaces and contracts, as well as the ability to upgrade code without causing downstream breakages. dbt Core v1.5, due out at the end of April, will provide these capabilities.
The Prime Video team at Amazon was able to save a whopping 90% on operating costs by replacing their serverless, microservices architecture with a monolith. It has become clear that, in practice, microservices can lead to needlessly complex systems. The idea of Service Oriented Architecture made sense at Amazon's scale, but when adopted in different contexts it is often toxic. This is the third time the idea has been rejected, but it is important to remain vigilant for its future iterations.
This is the story of how Welcome to the Jungle migrated its data warehouse from PostgreSQL to Snowflake. It took three quarters to migrate 8 ingestion pipelines, 2 reverse-ETLs, and 412 tables while implementing a double run (both warehouses in parallel), setting up third-party ingestions and a data modeling tool, and running QA tests. The transition brought faster queries, reduced development time and complexity, better scalability, and access to advanced features such as data masking and time travel. When carrying out similar projects, it is important to fix milestones, discuss refactoring up front, and plan for stakeholders.
This collection of talks covers a comprehensive approach to efficient data engineering, focusing on optimizing Spark with Iceberg, ensuring data quality with Great Expectations, and elevating software decision-making with game theory. The presentations provide deep technical insights from experienced data engineers and show how to solve common engineering challenges.
Data engineering requires following standards, design patterns, and principles, which are slightly different from those of software engineering. These principles include precision and formality, separation of concerns, modularity, abstraction, anticipation of change, incremental development, and reliability. Data engineers need to ensure accuracy, document standards, break down systems into manageable parts, focus on reusability and composability, hide implementation details, design pipelines that can handle changes and failures, and build pipelines incrementally.
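To make a few of these principles concrete, here is a minimal, hypothetical Python sketch (the function and partition names are illustrative, not taken from the article): each concern lives in its own small unit, writes are partitioned so reruns are idempotent and incremental, and failure is anticipated with bounded retries.

```python
# Hypothetical sketch of separation of concerns, incrementality, and reliability
# in a single pipeline step. Names are illustrative only.
import time
from typing import Callable


def with_retries(step: Callable[[], None], attempts: int = 3, backoff_s: float = 5.0) -> None:
    """Anticipate failure: retry a step a bounded number of times before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            step()
            return
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(backoff_s * attempt)


def extract(partition_date: str) -> list[dict]:
    # Placeholder source: in practice this would read one day's slice from an upstream system.
    return [{"event_date": partition_date, "value": 1}]


def transform(rows: list[dict]) -> list[dict]:
    # Pure function: easy to unit-test in isolation.
    return [{**row, "value": row["value"] * 2} for row in rows]


def load(rows: list[dict], partition_date: str) -> None:
    # Idempotent write: rerunning the same date replaces the same partition.
    print(f"writing {len(rows)} rows to partition {partition_date}")


def run(partition_date: str) -> None:
    # Compose the small units; only the composition knows the full flow.
    with_retries(lambda: load(transform(extract(partition_date)), partition_date))


run("2023-05-01")
```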
Data Modeling - The Unsung Hero of Data Engineering: Modeling Approaches and Techniques (Part 2) | 21 mins
Part 2 of the Data Modeling series discusses various data modeling approaches and techniques, from dimensional and data vault modeling to anchor and bitemporal modeling. It explores key concepts such as top-down and bottom-up approaches, data vault schemas, and the importance of standardization, and it also addresses common data modeling problems, such as data replication.
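Of these techniques, bitemporal modeling is probably the least familiar, so here is a toy Python sketch of the core idea (the entity and column names are invented for illustration, not drawn from the article): every row carries both a valid-time range (when the fact holds in the real world) and a transaction-time range (when the warehouse knew about it), which lets you ask "what did we believe on date X about date Y?"

```python
# Hypothetical illustration of bitemporal modeling: two time axes per row.
from dataclasses import dataclass
from datetime import date
from typing import Optional

FAR_FUTURE = date(9999, 12, 31)


@dataclass(frozen=True)
class CustomerAddressVersion:
    customer_id: int
    address: str
    valid_from: date      # real-world validity
    valid_to: date
    recorded_from: date   # when the warehouse learned this version
    recorded_to: date


def as_of(rows: list[CustomerAddressVersion], valid_on: date, known_on: date) -> Optional[str]:
    """What address did we believe was in effect on valid_on, as known on known_on?"""
    for row in rows:
        if (row.valid_from <= valid_on < row.valid_to
                and row.recorded_from <= known_on < row.recorded_to):
            return row.address
    return None


history = [
    # Address recorded in January as valid indefinitely...
    CustomerAddressVersion(1, "12 Oak St", date(2023, 1, 1), FAR_FUTURE, date(2023, 1, 5), date(2023, 3, 1)),
    # ...then corrected in March: the move actually happened on Feb 1.
    CustomerAddressVersion(1, "12 Oak St", date(2023, 1, 1), date(2023, 2, 1), date(2023, 3, 1), FAR_FUTURE),
    CustomerAddressVersion(1, "98 Elm Ave", date(2023, 2, 1), FAR_FUTURE, date(2023, 3, 1), FAR_FUTURE),
]

print(as_of(history, valid_on=date(2023, 2, 15), known_on=date(2023, 2, 20)))  # 12 Oak St (old belief)
print(as_of(history, valid_on=date(2023, 2, 15), known_on=date(2023, 3, 10)))  # 98 Elm Ave (after correction)
```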
Data validation is a crucial step in data processing for maintaining data accuracy and integrity. Great Expectations is an open-source framework that provides a flexible and efficient way to perform data validation, allowing data scientists and analysts to quickly identify and correct data issues. The article shares the author's experience implementing Great Expectations in a Hadoop environment and discusses its benefits and limitations.
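For a flavor of what such checks look like, here is a minimal sketch using Great Expectations' legacy pandas API (the exact entry points differ across versions, and the column names and thresholds below are invented for illustration; the article's setup validates tables in Hadoop rather than an in-memory DataFrame):

```python
# Minimal sketch of declaring and running expectations on a pandas DataFrame.
import great_expectations as ge
import pandas as pd

# Illustrative data with two deliberate issues: a missing id and an implausible age.
df = pd.DataFrame({"user_id": [1, 2, 3, None], "age": [25, 34, 129, 41]})

# Wrap the DataFrame so expectation methods become available on it.
batch = ge.from_pandas(df)

# Declare expectations: identifiers must be present, ages must be plausible.
batch.expect_column_values_to_not_be_null("user_id")
batch.expect_column_values_to_be_between("age", min_value=0, max_value=120)

# Run all declared expectations and inspect the aggregate result.
results = batch.validate()
print(results.success)  # False here: one null user_id and one out-of-range age
```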
Semantic Diff for SQL was built to help users detect functional changes (as opposed to cosmetic or structural ones) and assess their downstream impact. The process categorizes changes as breaking or non-breaking in order to reduce recomputation while guaranteeing correctness. At present, the only kind of change considered non-breaking is adding new projections to a query's SELECT statement; if users disagree with the category SQLMesh assigns to a change, they can always override it manually. The algorithm is only a few lines of Python, and additional heuristics will be added in the future to automatically detect more kinds of non-breaking changes. Additionally, column-level lineage will be used to categorize changes for each impacted downstream model.
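SQLMesh builds on the sqlglot parser, whose AST diff makes the idea easy to demonstrate. The following is a rough, hypothetical sketch of the "added projections are non-breaking" rule, not SQLMesh's actual implementation; the `is_breaking` helper is invented for illustration.

```python
# Sketch: classify a SQL change as breaking unless every edit is a newly added SELECT projection.
from sqlglot import exp, parse_one
from sqlglot.diff import Insert, Keep, diff


def is_breaking(old_sql: str, new_sql: str) -> bool:
    edits = diff(parse_one(old_sql), parse_one(new_sql))
    inserted = {id(e.expression) for e in edits if isinstance(e, Insert)}

    def under_new_projection(node: exp.Expression) -> bool:
        # Walk up through inserted nodes: the insert must hang off a SELECT's projection list.
        while node is not None and id(node) in inserted:
            if isinstance(node.parent, exp.Select) and node.arg_key == "expressions":
                return True
            node = node.parent
        return False

    for edit in edits:
        if isinstance(edit, Keep):
            continue  # unchanged nodes
        if isinstance(edit, Insert) and under_new_projection(edit.expression):
            continue  # a new projection (or part of one) is considered non-breaking
        return True  # removals, updates, moves, or non-projection inserts are breaking
    return False


print(is_breaking("SELECT a FROM t", "SELECT a, b FROM t"))           # False: only a new projection
print(is_breaking("SELECT a FROM t", "SELECT a FROM t WHERE a > 0"))  # True: the filter changes results
```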
The video discusses the errors made during GitLab's system outage, as covered in their postmortem. Best to learn from others' mistakes :)