Art of Data Newsletter - Issue #7
Welcome, all data fanatics. In today's issue:
Let's dive in!
Mojo is a new programming language that promises to address the major flaws of Python. Mojo code is "compiled" into MLIR, a multi-level compiler intermediate representation that modern toolchains can optimize into highly efficient native code. MLIR does not target hardware directly; instead, it is lowered to native code for particular CPU and GPU instruction sets in a staging step after compilation. This means a program written in Mojo is completely self-contained and can be deployed on any machine with a small, well-defined runtime.
dbt is a tool that has revolutionized how analytics are done. It allows teams to collaborate more effectively and pushes the industry towards being more efficient by embracing techniques from software engineering such as modularity, documentation, CI/CD, source control, testing, environments, and SLAs. However, dbt does not currently provide the tools for data teams to implement the same solutions that software engineering teams use, so data organizations still struggle with velocity, collaboration, quality, and SLA achievement. To improve, teams need the ability to define and validate interfaces and contracts, as well as the ability to upgrade code without causing downstream breakages. dbt Core v1.5, due out at the end of April, will provide these capabilities.
The Prime Video team at Amazon was able to save a whopping 90% on operating costs by replacing their serverless, microservices architecture with a monolith. It has become clear that, in practice, microservices can lead to needlessly complex systems. The idea of Service Oriented Architecture made sense at Amazon's scale, but when adopted in different contexts it is often toxic. This is the third time the idea has been rejected, but it is important to remain vigilant for its future iterations.
This is the story of how Welcome to the Jungle migrated its data warehouse from PostgreSQL to Snowflake. It took three quarters to migrate 8 ingestion pipelines, 2 reverse-ETLs, and 412 tables while implementing a double run (both warehouses in parallel), setting up third-party ingestions and a data modeling tool, and running QA tests. The transition brought faster queries, reduced development time and complexity, better scalability, and access to advanced features such as data masking and time travel. When carrying out similar projects, it is important to fix milestones, discuss refactoring up front, and plan for stakeholders.
This collection of talks covers a comprehensive approach to efficient data engineering, focusing on optimizing Spark with Iceberg, ensuring data quality with Great Expectations, and elevating software decision-making with game theory. The presentations provide deep technical insights from experienced data engineers and show how to solve common engineering challenges.
Data engineering requires following standards, design patterns, and principles, which are slightly different from those of software engineering. These principles include precision and formality, separation of concerns, modularity, abstraction, anticipation of change, incremental development, and reliability. Data engineers need to ensure accuracy, document standards, break down systems into manageable parts, focus on reusability and composability, hide implementation details, design pipelines that can handle changes and failures, and build pipelines incrementally.
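To make a few of these principles concrete, here is a minimal, hypothetical Python sketch (the function and partition names are illustrative, not taken from the article): each concern lives in its own small unit, writes are partitioned so reruns are idempotent and incremental, and failure is anticipated with bounded retries.

```python
# Hypothetical sketch of separation of concerns, incrementality, and reliability
# in a single pipeline step. Names are illustrative only.
import time
from typing import Callable


def with_retries(step: Callable[[], None], attempts: int = 3, backoff_s: float = 5.0) -> None:
    """Anticipate failure: retry a step a bounded number of times before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            step()
            return
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(backoff_s * attempt)


def extract(partition_date: str) -> list[dict]:
    # Placeholder source: in practice this would read one day's slice from an upstream system.
    return [{"event_date": partition_date, "value": 1}]


def transform(rows: list[dict]) -> list[dict]:
    # Pure function: easy to unit-test in isolation.
    return [{**row, "value": row["value"] * 2} for row in rows]


def load(rows: list[dict], partition_date: str) -> None:
    # Idempotent write: rerunning the same date replaces the same partition.
    print(f"writing {len(rows)} rows to partition {partition_date}")


def run(partition_date: str) -> None:
    # Compose the small units; only the composition knows the full flow.
    with_retries(lambda: load(transform(extract(partition_date)), partition_date))


run("2023-05-01")
```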
Data Modeling - The Unsung Hero of Data Engineering: Modeling Approaches and Techniques (Part 2) | 21 mins
Part 2 of the Data Modeling series discusses various data modeling approaches and techniques, from dimensional and data vault modeling to anchor and bitemporal modeling. It explores key concepts such as top-down and bottom-up approaches, data vault schemas, and the importance of standardization, and it also addresses common data modeling problems, such as data replication.
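Of these techniques, bitemporal modeling is probably the least familiar, so here is a toy Python sketch of the core idea (the entity and column names are invented for illustration, not drawn from the article): every row carries both a valid-time range (when the fact holds in the real world) and a transaction-time range (when the warehouse knew about it), which lets you ask "what did we believe on date X about date Y?"

```python
# Hypothetical illustration of bitemporal modeling: two time axes per row.
from dataclasses import dataclass
from datetime import date
from typing import Optional

FAR_FUTURE = date(9999, 12, 31)


@dataclass(frozen=True)
class CustomerAddressVersion:
    customer_id: int
    address: str
    valid_from: date      # real-world validity
    valid_to: date
    recorded_from: date   # when the warehouse learned this version
    recorded_to: date


def as_of(rows: list[CustomerAddressVersion], valid_on: date, known_on: date) -> Optional[str]:
    """What address did we believe was in effect on valid_on, as known on known_on?"""
    for row in rows:
        if (row.valid_from <= valid_on < row.valid_to
                and row.recorded_from <= known_on < row.recorded_to):
            return row.address
    return None


history = [
    # Address recorded in January as valid indefinitely...
    CustomerAddressVersion(1, "12 Oak St", date(2023, 1, 1), FAR_FUTURE, date(2023, 1, 5), date(2023, 3, 1)),
    # ...then corrected in March: the move actually happened on Feb 1.
    CustomerAddressVersion(1, "12 Oak St", date(2023, 1, 1), date(2023, 2, 1), date(2023, 3, 1), FAR_FUTURE),
    CustomerAddressVersion(1, "98 Elm Ave", date(2023, 2, 1), FAR_FUTURE, date(2023, 3, 1), FAR_FUTURE),
]

print(as_of(history, valid_on=date(2023, 2, 15), known_on=date(2023, 2, 20)))  # 12 Oak St (old belief)
print(as_of(history, valid_on=date(2023, 2, 15), known_on=date(2023, 3, 10)))  # 98 Elm Ave (after correction)
```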
Data validation is a crucial step in data processing for maintaining data accuracy and integrity. Great Expectations is an open-source framework that provides a flexible and efficient way to perform data validation, allowing data scientists and analysts to quickly identify and correct data issues. The article shares the author's experience implementing Great Expectations in a Hadoop environment and discusses its benefits and limitations.
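For a flavor of what such checks look like, here is a minimal sketch using Great Expectations' legacy pandas API (the exact entry points differ across versions, and the column names and thresholds below are invented for illustration; the article's setup validates tables in Hadoop rather than an in-memory DataFrame):

```python
# Minimal sketch of declaring and running expectations on a pandas DataFrame.
import great_expectations as ge
import pandas as pd

# Illustrative data with two deliberate issues: a missing id and an implausible age.
df = pd.DataFrame({"user_id": [1, 2, 3, None], "age": [25, 34, 129, 41]})

# Wrap the DataFrame so expectation methods become available on it.
batch = ge.from_pandas(df)

# Declare expectations: identifiers must be present, ages must be plausible.
batch.expect_column_values_to_not_be_null("user_id")
batch.expect_column_values_to_be_between("age", min_value=0, max_value=120)

# Run all declared expectations and inspect the aggregate result.
results = batch.validate()
print(results.success)  # False here: one null user_id and one out-of-range age
```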
Semantic Diff for SQL was built to help users detect functional changes (as opposed to cosmetic or structural ones) and assess their downstream impact. The process categorizes changes as breaking or non-breaking in order to reduce recomputation while guaranteeing correctness. At present, the only kind of change considered non-breaking is adding new projections to a query's SELECT statement; if users disagree with the category SQLMesh assigns to a change, they can always override it manually. The algorithm is only a few lines of Python, and additional heuristics will be added in the future to automatically detect more kinds of non-breaking changes. Additionally, column-level lineage will be used to categorize changes for each impacted downstream model.
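SQLMesh builds on the sqlglot parser, whose AST diff makes the idea easy to demonstrate. The following is a rough, hypothetical sketch of the "added projections are non-breaking" rule, not SQLMesh's actual implementation; the `is_breaking` helper is invented for illustration.

```python
# Sketch: classify a SQL change as breaking unless every edit is a newly added SELECT projection.
from sqlglot import exp, parse_one
from sqlglot.diff import Insert, Keep, diff


def is_breaking(old_sql: str, new_sql: str) -> bool:
    edits = diff(parse_one(old_sql), parse_one(new_sql))
    inserted = {id(e.expression) for e in edits if isinstance(e, Insert)}

    def under_new_projection(node: exp.Expression) -> bool:
        # Walk up through inserted nodes: the insert must hang off a SELECT's projection list.
        while node is not None and id(node) in inserted:
            if isinstance(node.parent, exp.Select) and node.arg_key == "expressions":
                return True
            node = node.parent
        return False

    for edit in edits:
        if isinstance(edit, Keep):
            continue  # unchanged nodes
        if isinstance(edit, Insert) and under_new_projection(edit.expression):
            continue  # a new projection (or part of one) is considered non-breaking
        return True  # removals, updates, moves, or non-projection inserts are breaking
    return False


print(is_breaking("SELECT a FROM t", "SELECT a, b FROM t"))           # False: only a new projection
print(is_breaking("SELECT a FROM t", "SELECT a FROM t WHERE a > 0"))  # True: the filter changes results
```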
The video discusses the errors made during GitLab's system outage, as covered in their postmortem. Best to learn from others' mistakes :)