DATA Pill #014 - Future-Aware Data Engineering & Post-Deployment Data Science

Adam Kawa

CEO at GetInData, ex-Spotify | Data & AI for banks, telecoms, retail & more.

Published Aug 15, 2022


Hi everyone 👋,

Today we have one clickbait,

one “put the cat amongst the pigeons” kinda article 🐦

one podcast that has the potential to go viral, and more.

Let’s take a look ;)

ARTICLES

Mercado shares a continuous intelligence framework that enables them to deliver 79% of their shipments in less than 48 hours (due to increased demand).

Data used to support decision-making in key processes:

Carrier Capacity Optimization - monitors the percentage of network capacity utilized across every delivery zone and identifies where delivery targets are at risk in near real time.
Outbound Monitoring - identifies places with lower delivery efficiency and lets them drill into the status of individual shipments.
Air Capacity Monitoring - provides capacity usage monitoring for the aircraft flying each of their shipping routes.
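
To make the first idea concrete, here is a minimal Python sketch of a capacity check. It is not Mercado's implementation: the function name, the 90% threshold and the sample zones are all made up for illustration.

```python
def zones_at_risk(zones, threshold=0.9):
    """Return (zone, utilization) pairs at or above the threshold.

    `zones` maps a zone name to (used_capacity, total_capacity).
    The 0.9 threshold is an assumption, not a value from the article.
    """
    at_risk = []
    for name, (used, capacity) in zones.items():
        utilization = used / capacity
        if utilization >= threshold:
            at_risk.append((name, round(utilization, 2)))
    # Most heavily loaded zones first.
    return sorted(at_risk, key=lambda item: item[1], reverse=True)


# Illustrative numbers only.
sample_zones = {
    "zone_a": (460, 500),
    "zone_b": (120, 400),
    "zone_c": (380, 400),
}
flagged = zones_at_risk(sample_zones)
print(flagged)
```

In a real setup the inputs would come from a streaming or frequently refreshed source rather than a hard-coded dictionary.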

Airflow's Problem | 7 min read | Airflow | Stephen Bailey | Data People Etc.

Let’s put the cat amongst the pigeons ;) Why the author doesn’t like Airflow and argues that, in data mesh times, we should seek an alternative.

Google Introduces Zero-ETL Approach to Analytics on Bigtable Data Using BigQuery | 7 min read | Cloud | Steef-Jan Wiggers | InfoQ Blog

Previously, customers had to use ETL tools such as Dataflow or self-developed Python tools to copy data from Bigtable into BigQuery; however, now they can query data directly with BigQuery SQL.

{ MORE LINKS }

NEWS

Python models | 10 min read | Databricks Blog

An update on an upcoming dbt feature: Python models.

A dbt Python model is a function that reads in dbt sources or models, applies a series of transformations and returns a transformed dataset. DataFrame operations define the starting points, the end state and each step along the way. This is similar to the role of CTEs in dbt SQL models.
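
A rough sketch of what such a model can look like is shown below. The `model(dbt, session)` signature, `dbt.config` and `dbt.ref` follow the dbt Python model interface. Everything else is an assumption for illustration: the `stg_orders` model name, the column names, and the `FakeDbt` class, which is only a local stand-in so the function can be run without a dbt project. Here pandas plays the role of the DataFrame library; on a real platform the DataFrame type depends on the adapter.

```python
import pandas as pd


def model(dbt, session):
    # dbt passes in a context object (dbt) and a platform session.
    dbt.config(materialized="table")

    # Starting point: read an upstream model (hypothetical name).
    orders = dbt.ref("stg_orders")

    # Intermediate steps, similar in role to CTEs in a SQL model.
    completed = orders[orders["status"] == "completed"]
    per_customer = (
        completed.groupby("customer_id", as_index=False)
        .agg(total_amount=("amount", "sum"))
    )

    # End state: the returned DataFrame becomes the model's dataset.
    return per_customer


class FakeDbt:
    """Local stand-in for the object dbt injects; illustration only."""

    def __init__(self, tables):
        self.tables = tables
        self.settings = {}

    def config(self, **kwargs):
        self.settings.update(kwargs)

    def ref(self, name):
        return self.tables[name]


# Made-up input rows to exercise the function locally.
stg_orders = pd.DataFrame(
    {
        "customer_id": [1, 1, 2],
        "status": ["completed", "returned", "completed"],
        "amount": [10.0, 5.0, 7.5],
    }
)
fake_dbt = FakeDbt({"stg_orders": stg_orders})
result = model(fake_dbt, session=None)
print(result)
```

In an actual project only the `model` function would live in the `.py` model file; dbt supplies the `dbt` and `session` arguments itself.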

{ MORE LINKS }

TUTORIALS

Iceberg Tables: Powering Open Standards with Snowflake Innovations | 7 min read | Data Lake | James Malone | Snowflake

Snowflake is used to solve three challenges commonly related to large data sets: control, cost, and interoperability. Iceberg Tables combine unique Snowflake capabilities with the Apache Iceberg and Apache Parquet open source projects to address them. This article explains how Iceberg Tables are supposed to help with that.

{ MORE LINKS }

PODCAST

Future-Aware Data Engineer | 42 min | Data Engineering | 💪 Paweł Leszczyński | GetInData

Will this go viral? It has already been widely shared and commented on. …

It is the story of past and current inventions like Facebook by Mark Zuckerberg vs the airplane by the Wright brothers. What is the Dunning-Kruger effect and what does it have in common with Wikipedia? Why did Jacek Kuroń not have to pay his phone bills? We're going to look at these inventions through the lens of Yuval Noah Harari, Daniel Kahneman, and Slavoj Zizek. Seems like the perfect authors' trio for the ideal data-related holiday podcast.

Post-Deployment Data Science | 33 min | ML | Hakim Elakhrass | DataCamp

Many machine learning practitioners dedicate most of their attention to creating and deploying models that solve business problems. However, what happens post-deployment? Moreover, how should data teams go about monitoring models in production?

Takeaway: Data scientists need to cultivate a thorough understanding of a model’s potential business impacts, as well as the technical metrics of the model.

DataTube

WHOOPS, THE NUMBERS ARE WRONG! SCALING DATA QUALITY @ NETFLIX | 30 min | Michelle Ufford | Netflix | DataWorks Summit

We just found out that there is a named design pattern for data pipeline DAGs that deal with data quality: “Write-Audit-Publish”.

It’s like “blue-green deployment but for data”. I know, it’s obvious, but hey, it’s good to have names for simple things ;)

The original name shows up in this Netflix presentation.
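
If the name is new to you, here is a minimal sketch of the idea in Python. It uses the standard-library `sqlite3` module as a stand-in for a warehouse, and it is not how Netflix implements the pattern. The table names (`orders_staging`, `orders`) and the audit rules are hypothetical.

```python
import sqlite3


def write_audit_publish(conn, rows):
    """Stage a batch, audit it, and publish it only if the audit passes."""
    # Write: land the new batch in a staging table consumers never read.
    conn.execute("DROP TABLE IF EXISTS orders_staging")
    conn.execute("CREATE TABLE orders_staging (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders_staging VALUES (?, ?)", rows)
    conn.commit()

    # Audit: run data quality checks against the staged batch only.
    total, bad = conn.execute(
        "SELECT COUNT(*), "
        "COALESCE(SUM(id IS NULL OR amount IS NULL OR amount < 0), 0) "
        "FROM orders_staging"
    ).fetchall()[0]
    if total == 0 or bad > 0:
        # The published table stays untouched; staging is kept for debugging.
        return False

    # Publish: swap the audited batch in as the table consumers read.
    conn.execute("DROP TABLE IF EXISTS orders")
    conn.execute("ALTER TABLE orders_staging RENAME TO orders")
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
first = write_audit_publish(conn, [(1, 9.99), (2, 20.0)])
second = write_audit_publish(conn, [(3, -5.0)])  # negative amount fails the audit
published = conn.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
print(first, second, published)
```

The second batch never reaches `orders`, which is the whole point: bad data is caught before anyone downstream can query it.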

You’re probably curious about how people apply this pattern in tools like dbt.

We only found one video and some slides - you will find them by clicking the MORE LINKS button ⬇

If you know of some interesting sources on this subject, please leave a comment ;)

{ MORE LINKS }

CONFS AND MEETUPS

How to simplify data and AI governance | 16 August | Online | Databricks & Milliman

How to manage user identities, set up access permissions and audit controls, discover quality data and leverage automated lineage across all workloads
How to securely share live data across organizations without any data replication
How Databricks customer Milliman is leveraging Unity Catalog to simplify access management and reduce storage complexity

Speakers: Paul Roome, Liran Bareket, Dan McCurley

-----

That’s it for today! Please don't hesitate to forward this on.

See you next week 👋

Adam Kawa from GetInData
