DATA Pill #014 - Future-Aware Data Engineering & Post-Deployment Data Science

Hi everyone 👋,


Today we have one clickbait, 

one “put the cat amongst the pigeons” kinda article 🐦

one podcast that has the potential to go viral, and more.

Let’s take a look ;)



ARTICLES 

Keeping track of shipments minute by minute: How Mercado Libre uses real-time analytics for on-time delivery | 12 min read | Data Analytics | Pablo Fernández Osorio | Mercado Libre | Google Cloud Blog

Mercado Libre shares a continuous intelligence framework that enables it to deliver 79% of its shipments in less than 48 hours (a need driven by increased demand).

The data supports decision-making in three key processes:

  1. Carrier Capacity Optimization - monitors the percentage of network capacity utilized across every delivery zone and identifies where delivery targets are at risk in near real time.
  2. Outbound Monitoring - identifies places with lower delivery efficiency and lets the team drill into the status of individual shipments.
  3. Air Capacity Monitoring - tracks capacity usage for the aircraft flying each of its shipping routes.



Airflow's Problem | 7 min read | Airflow | Stephen Bailey | Data People Etc.

Let’s put the cat amongst the pigeons ;) Why the author doesn’t like Airflow, and what we should seek as an alternative in the data mesh era.


Google Introduces Zero-ETL Approach to Analytics on Bigtable Data Using BigQuery | 7 min read | Cloud | Steef-Jan Wiggers | InfoQ Blog

Previously, customers had to use ETL tools such as Dataflow or self-developed Python tools to copy data from Bigtable into BigQuery; however, now they can query data directly with BigQuery SQL.

{ MORE LINKS }


NEWS 

Python models | 10 min read | Databricks Blog 

An update on an upcoming dbt feature: Python models.

A dbt Python model is a function that reads in dbt sources or models, applies a series of transformations and returns a transformed dataset. DataFrame operations define the starting points, the end state and each step along the way. This is similar to the role of CTEs in dbt SQL models.
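
As a rough illustration of that shape, here is a minimal sketch. The `model(dbt, session)` signature and the `dbt.config` / `dbt.ref` calls follow the dbt Python model interface; everything else (`TinyFrame`, `FakeDbt`, the `stg_orders` model and its columns) is a hypothetical stand-in so the example runs without dbt or a warehouse. In a real project the DataFrame type and its methods depend on the data platform.

```python
from dataclasses import dataclass


@dataclass
class TinyFrame:
    """Hypothetical stand-in for a real DataFrame: just a list of row dicts."""
    rows: list

    def where(self, predicate):
        # Keep only the rows for which the predicate is true.
        return TinyFrame([r for r in self.rows if predicate(r)])

    def with_column(self, name, fn):
        # Add (or overwrite) a column computed from each row.
        return TinyFrame([{**r, name: fn(r)} for r in self.rows])


class FakeDbt:
    """Hypothetical stub of the `dbt` object that dbt passes to a Python model."""

    def __init__(self, tables):
        self._tables = tables
        self.settings = {}

    def config(self, **kwargs):
        self.settings.update(kwargs)

    def ref(self, name):
        return self._tables[name]


def model(dbt, session):
    dbt.config(materialized="table")

    # Starting point: read an upstream model.
    orders = dbt.ref("stg_orders")

    # Intermediate steps: each named variable plays the role of a CTE.
    paid = orders.where(lambda r: r["status"] == "paid")
    with_total = paid.with_column(
        "total", lambda r: r["quantity"] * r["unit_price"]
    )

    # End state: the returned dataset is what dbt would materialize.
    return with_total


if __name__ == "__main__":
    fake = FakeDbt({"stg_orders": TinyFrame([
        {"status": "paid", "quantity": 2, "unit_price": 5},
        {"status": "cancelled", "quantity": 1, "unit_price": 9},
    ])})
    print(model(fake, session=None).rows)
```

Only the `model` function would live in a dbt project; the two helper classes exist purely to make the sketch runnable on its own.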

{ MORE LINKS }


TUTORIALS 

Iceberg Tables: Powering Open Standards with Snowflake Innovations | 7 min read | Data Lake | James Malone | Snowflake

Snowflake is used to solve three challenges commonly related to large data sets: control, cost, and interoperability. Iceberg Tables combine unique Snowflake capabilities with the Apache Iceberg and Apache Parquet open source projects to address them, and this article explains how they are supposed to help.

{ MORE LINKS }


PODCAST

Future-Aware Data Engineer | 42 min | Data Engineering | 💪 Paweł Leszczyński | GetInData

Will this go viral? It has already been widely commented on and shared. …

It is the story of past and current inventions, like Facebook by Mark Zuckerberg versus the airplane by the Wright brothers. What is the Dunning-Kruger effect and what does it have in common with Wikipedia? Why did Jacek Kuroń not have to pay his phone bills? We're going to look at these inventions through the lens of Yuval Noah Harari, Daniel Kahneman, and Slavoj Žižek. Seems like the perfect trio of authors for the ideal data-related holiday podcast.

  

Post-Deployment Data Science | 33 min | ML | Hakim Elakhrass | DataCamp

Many machine learning practitioners dedicate most of their attention to creating and deploying models that solve business problems. However, what happens post-deployment? Moreover, how should data teams go about monitoring models in production?

Takeaway: Data scientists need to cultivate a thorough understanding of a model’s potential business impacts, as well as the technical metrics of the model.


DataTube

WHOOPS, THE NUMBERS ARE WRONG! SCALING DATA QUALITY @ NETFLIX | 0.5 h | Michelle Ufford | Netflix | DataWorks Summit

We just found out that there is a named pattern for building data quality checks into data pipeline DAGs: “Write-Audit-Publish”.

It’s like “blue-green deployment but for data”. I know, it’s obvious, but hey, it’s good to have names for simple things ;)

The original name shows up in this Netflix presentation.
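
For readers who have not met the pattern, here is a minimal sketch of the three steps using Python's built-in `sqlite3` module. It is not how Netflix or dbt implement it; the table names, columns and the two audit checks are hypothetical and chosen only for illustration.

```python
import sqlite3


def write_audit_publish(conn, rows):
    """Load a batch into staging, audit it, and publish only if it passes.

    Returns True when the batch was published, False when the audit failed.
    Table names, columns and checks are illustrative assumptions.
    """
    cur = conn.cursor()

    # WRITE: land the new batch in a staging table that consumers never read.
    cur.execute("DROP TABLE IF EXISTS orders_staging")
    cur.execute("CREATE TABLE orders_staging (order_id INTEGER, amount REAL)")
    cur.executemany("INSERT INTO orders_staging VALUES (?, ?)", rows)
    conn.commit()

    # AUDIT: run data quality checks against the staging table only.
    row_count = cur.execute("SELECT COUNT(*) FROM orders_staging").fetchone()[0]
    bad_rows = cur.execute(
        "SELECT COUNT(*) FROM orders_staging "
        "WHERE order_id IS NULL OR amount < 0"
    ).fetchone()[0]
    if row_count == 0 or bad_rows > 0:
        # Leave the live table untouched; staging stays around for debugging.
        return False

    # PUBLISH: replace the live table with the audited staging table.
    cur.execute("DROP TABLE IF EXISTS orders")
    cur.execute("ALTER TABLE orders_staging RENAME TO orders")
    conn.commit()
    return True


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(write_audit_publish(conn, [(1, 10.0), (2, 5.5)]))  # good batch
    print(write_audit_publish(conn, [(3, -1.0)]))            # fails the audit
```

The point of the design is that a bad batch never reaches the table consumers query: the second call above fails the audit, so `orders` keeps the rows from the first batch.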

You’re probably curious about how people apply this pattern in tools like dbt.

We only found one video and some slides - you will find them by clicking the MORE LINKS button ⬇

If you know of some interesting sources on this subject, please leave a comment ;)

{ MORE LINKS } 


CONFS AND MEETUPS

How to simplify data and AI governance | 16 August | Online | Databricks & Milliman

  • How to manage user identities, set up access permissions and audit controls, discover quality data and leverage automated lineage across all workloads
  • How to securely share live data across organizations without any data replication
  • How Databricks customer Milliman is leveraging Unity Catalog to simplify access management and reduce storage complexity

Speakers: Paul Roome, Liran Bareket, Dan McCurley

 

-----

That’s it for today! Please don't hesitate to forward this on.

See you next week 👋

Adam Kawa from GetInData
