DATA Pill #049 - 91% of ML Models degrade in time, MLflow 2.3 and Secrets of Deep Reinforcement Learning
Hi,
After this week, it is necessary to say:
ML and AI are doing well and are still among the year's hottest topics.
So, dig into the newest dose of knowledge.
ARTICLES
91% of ML Models Degrade in Time | 10 min | ML | Santiago Víquez | nannyML
The study by Vela et al. showed that ML models' performance doesn't remain static, even when they achieve high accuracy at deployment time, and that different ML models age at different rates even when trained on the same datasets. Another relevant remark is that not all temporal drifts cause performance degradation. Therefore, the choice of model and its stability becomes one of the most critical factors in dealing with temporal performance degradation.
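One way to make the "models age at different rates" observation concrete is to compare each model's error in later periods with its error at deployment. Below is a minimal sketch, assuming you log per-prediction absolute errors grouped by days since deployment. The function name and the sample numbers are hypothetical and are not taken from the study.

```python
from statistics import mean


def relative_degradation(errors_by_period, baseline_period=0):
    """Return each period's mean error divided by the baseline period's mean error.

    A value of 1.0 means no change since the baseline; 1.5 means the mean
    error is 50% higher than it was at the baseline.
    """
    baseline = mean(errors_by_period[baseline_period])
    if baseline == 0:
        raise ValueError("baseline error must be non-zero")
    return {
        period: mean(errors) / baseline
        for period, errors in sorted(errors_by_period.items())
    }


# Hypothetical absolute errors for two models, keyed by days since deployment.
model_a = {0: [1.0, 1.2, 0.8], 30: [1.1, 1.3, 0.9], 60: [1.6, 1.4, 1.5]}
model_b = {0: [1.0, 1.0, 1.0], 30: [1.0, 1.1, 0.9], 60: [1.0, 1.0, 1.0]}

aging_a = relative_degradation(model_a)  # error grows over time
aging_b = relative_degradation(model_b)  # error stays flat

for period in aging_a:
    print(f"day {period}: model A x{aging_a[period]:.2f}, model B x{aging_b[period]:.2f}")
```

In this made-up data, both models see the same passage of time, but only model A degrades. Tracking a ratio like this per model is one simple way to compare stability when choosing between candidates.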
Use dbt and Duckdb instead of Spark in data pipelines | 7 min | Data Engineering | Niels Claeys | datamindedbe Blog
Niels presents several reasons to consider using dbt and DuckDB instead of Spark. He also highlights some limitations and challenges of using dbt and DuckDB.
The article provides a comprehensive overview of dbt and DuckDB and how they can be used in data pipelines. It encourages readers to explore these tools as alternatives to Spark.
Fivetran Puts the Customer Last | 10 min | Data Engineering | Lauren Balik | Personal Blog
Lauren strikes back, this time with a conspiracy theory about Modern Data Stack vendors and what Fivetran's long-awaited S3 connector has to do with it. As usual, the narration style may be provocative, but it is still good food for thought. If you start looking at your cloud spend, human capital, and products as a portfolio of investments that generate returns, you will develop habits that lead you away from these Modern Data Stack games.
The road to running Apache Flink applications on AWS KDA | 6 min | Cloud | Duc Anh Khu | Deliveroo Engineering blog
In this article, you will read about the road to running Apache Flink applications on AWS KDA. Why did the Deliveroo team choose AWS KDA, and what lessons have they learned? Dive into the text to find out, along with their plans for the future.
In MORE LINKS you will read about How Databricks Performed ETL on One Billion Records For Under $1 and how to save 80% of GCP costs.
DATA LIBRARY
Artificial Intelligence Index Report 2023 | takes time to dig in | AI | Stanford University Human-Centered Artificial Intelligence
The sixth edition of the AI Index Report is here, featuring more original data than any previous edition. A few takeaways for you:
TUTORIAL
Managing Multiple BigQuery Projects With One dbt Cloud Project | 9 min | GCP | Lucas Ortiz | Xebia Blog
This one provides a step-by-step guide to setting up a BigQuery connection in a dbt Cloud project, enabling the BigQuery API, and creating a service account for the project. It concludes with a workflow for managing and executing dbt projects across multiple BigQuery projects in dbt Cloud.
In MORE LINKS you will find Introducing MLflow 2.3: Enhanced with Native LLM Support and New Features.
DATA ODDITIES
You Can Try Auto-GPT, the Next Generation of ChatGPT, Right Now | 4 min | AI | Jake Peterson | Lifehacker
Auto-GPT is a complex system relying on multiple components. It connects to the internet to retrieve specific information and data (something ChatGPT’s free version cannot do), features long-term and short-term memory management, uses GPT-4 for OpenAI’s most advanced text generation, and GPT-3.5 for file storage and summarization.
NEWS
Releasing Ververica Cloud - A Fully Managed Cloud Native Service | 3 min | Cloud | Vladimir Jandreski | Ververica Blog
Ververica has announced the beta release of Ververica Cloud. It is a fully-managed service for deploying, operating, and monitoring Apache Flink applications, including stream processing and real-time analytics. Ververica Cloud offers several benefits, including:
In MORE LINKS: news from AWS and Databricks.
PODCAST
Data and analytics for an audience engagement platform | 45 min | host: Adam Kawa guest: Ludwig Holmstrom | Radio DaTa Podcast
Ludwig works as a Product Analytics Director at Mentimeter. Before joining Mentimeter, he worked with data & analytics for over a decade at various companies such as Kry, Spotify, and Google.
Discussed subjects:
Secrets of Deep Reinforcement Learning | 2 h 47 min | host: Tim Scarfe guest: Minqi Jiang | Machine Learning Street Talk
Dr. Tim Scarfe interviews Minqi Jiang on the impact of deep reinforcement learning on technology, startups, and research. Minqi shares his experiences in balancing serendipity and planning, explains the role of objectives and Goodhart's Law in decision-making, and discusses the differences between RL and supervised learning.
They also explore the possibilities of open-endedness and the intelligence explosion, as well as limitations of RL and interpretability concerns with software 2.0.
CONFS, EVENTS AND MEETUPS
Snowflake Summit 2023 | 26-29th June | Las Vegas
Attend Snowflake Summit 2023 to learn how to access, build, and monetize data, tools, models, and applications in ways that were previously unimaginable. Enable seamless alignment and collaboration across these crucial functions in the Data Cloud to transform nearly every aspect of your organization.
At the Summit, you’ll hear all about the latest innovations coming to the Data Cloud, and learn from hundreds of technical, data, and business experts about what’s possible for you and your organization in a world of data collaboration.
________________________
Have any interesting content to share in the DATA Pill newsletter?
➡ Join us on GitHub
➡ Dig into previous editions of DataPill
Adam from GetInData | Part of Xebia