August 2024 DVC Pulse!

👋🏼 Hi friend!

In case you missed it we had a major open source release of DataChain! 🎉 As a result I skipped a month, so there's lots to discuss! Sorry guys, this one will be long, so grab a coffee and welcome to August!

✨The Latest

DataChain repo - DataChain our new open-source tool for the management of unstructured data was released on July 23rd, and we've already reached 578 stars (and counting)! ⭐️🤩🚀 Please star, follow, and give it a try!

🤲🏼 Contributions

We’d like to give a shout-out to the following Community members for already jumping in and contributing to DataChain:

Many thanks to michaelfeil for being the first bug reporter with Issue #170 - No “batched” inference 🐛
eltociear - with a correction to the README in PR #168
ayasyrev for his contribution with Issue #265 - Correcting the device usage in the multimodal example ⚙️
EdwardLi-Coder with cleaning up some code in this commit. 💫

🗞️ In the News

The DataChain release inspired some great news articles including:

Forbes - Curating Cleaner Data In Messy Multimodal Modals
Cerebral Valley - DataChain is AI’s Advanced Infra Platform Interview with Dmitry Petrov!
MarkTech Post - DVC.ai Released DataChain: A Groundbreaking Open-Source Python Library for Large-Scale Unstructured Data Processing and Curation
Heise Developer (German language) - DataChain: New open-source tool for AI-powered data curation
DBTA - Iterative Announces Data Chain, the Open Source, AI-Based Tool for Perfecting Unstructured Data
Datanami Iterative’s New DataChain Enables Use of AI Models to Evaluate the Quality of Unstructured Data
SiliconANGLE Iterative debuts DataChain for curating and processing unstructured data with AI models

🎥 New Videos

Dmitry Petrov 's with TFIR - Iterative’s DataChain tackles the challenges of unstructured data in AI development In this interview, Swapnil Bhartiya asks Dmitry about DataChain and how it helps teams better leverage AI models and extract value from unstructured data. Dmitry also shares his thoughts on how AI has developed in the last 20 years, where it’s going, and why properly managing data is the key to solving AI challenges.

We predict that the trajectory is going to be very similar to like it was with machine learning in general, so first get a new quality model and everyone is excited and playing with that and getting good results…but the next stage it becomes more of an engineering discipline…so maybe right now [as hype declines] it looks like it’s not good… but that is actually really good for society because that’s the way AI will be delivered to the people…when you can confidently put your chatbot against millions of people and they can build value for business and society. - Dmitry Petrov

Community Spotlight! - Beth Rudden , CEO, and Thanh Lam , CTO of Bast AI joined me for a fantastic session on Building Ethical AI: Leveraging DVC for Transparency and Trust in LLM Applications. She and Thanh provided great use cases and demos to show how they do this at Bast AI. Best quote from our discussion regarding data versioning and proper processes: "Take the hard road to easy. Don't take the easy road to hard." This was the chef's kiss moment. 🤌🏽❤️ Don’t miss this one if you care about developing AI for the long term! And see another quote below!

If we don’t have the proper version control, if we don’t know how to sanitize all of the data that goes into a system, we can never have any control or assurance on the output, so if you sanitize the inputs and you’re understanding what you’re putting in, you can control what you get out and that is what we think is so incredible about the technology that we’re doing is that we’re able to have your cake and eat it too. You can have the indeterministic alongside the deterministic, and it’s all because we understand how to properly understand our data and version it. - Beth Rudden, CEO Bast AI

🤔 What do you think?

Dmitry posed this question in Reddit, join the conversation and let us know what you think!

🔎 What we're looking at

Sam Altman’s Opinion piece in the Washington Post - Who will control the future of AI? - (This article is currently under a paywall in the US.) It discusses the consequences of leadership in AI from a geopolitical sense which is a worthy topic to crunch on. However, I found the rest of the article hand-wavy and written to sway people to make policy in OpenAi’s favor. Reading the comments on the article led me to ultimately have an enlightening conversation with Jason Green-Lowe , Executive Director of the Center for AI Policy. Spoiler alert: I may have an upcoming online meetup with him and his teammates. Stay tuned! 😉
Data Flywheels for LLM Applications - Shreya Shankar is hard at work as always solving the challenges of MLOps and now LLMops workflows. Her latest piece particularly caught my eye from the use of the concept of flywheels, one that I relate to very much in my work as a Community Manager. But in this case, Shreya discusses the concept of data flywheels in LLM evaluation, she discusses the framework for creating a flywheel, how to measure, operationalize, monitor, and improve the flywheel, and the special challenges LLM applications bring and how to deal with them. Of particular interest was validating the data that goes into the LLM. There’s that pesky need for quality data again! The models won’t do it alone.
Andrej Karpathy's Keynote & Winner Pitches at UC Berkeley AI Hackathon Awards Academy - This was an inspirational talk about his path, particularly at 15:00 where he talked about small snowballs, hard work, and failure. It's all about putting in the work, even if there's failure. 💪🏽

🔮What's Coming!

✍🏼 Content! The team is hard at work developing use case tutorials for you to dive in with DataChain. Stay tuned to the socials (LinkedIn, YouTube, Twitter) for upcoming info.

🤗 Community-Generated Content

Poster: Applying Data Version Control to Molecular Simulations - Marcelle Spera from the University of Stuttgart presented this poster at the 33rd European Simposium on Applied Thermodynamics ESAT 2024. She and her team used DVC in the context of molecular dynamics simulations by using it to manage and track changes to simulation data. This allows for versioning of input data such as molecular structures, force field parameters, and initial configurations.

Videos:

End-to-End Machine Learning Portfolio Project (Data and Code Versioning): In this video, DataThinkers demonstrates how to use DVC for data and code versioning in a machine learning project by creating a new project using the cookie-cutter template and then removing unnecessary files. PriyangBhattDS explains that DVC stores metadata about the data in the Git repository, while the actual data is stored in a separate location. He then creates two versions of the code and data: one with the mean value for missing values and one with the median value. Finally, Priyang shows how to clone the project and fetch the data using DVC commands.

Churn End-to-End MLOps Project (Data Validation Component): This video segment by Minich Analytics primarily focuses on establishing the data validation process within a machine learning pipeline, using DVC for version control and pipeline management. Overall, the video demonstrates data validation's initial setup and implementation within a DVC pipeline. It emphasizes defining clear configurations, schemas, and validation checks to ensure data quality. DVC was used to manage the data validation process as a pipeline stage, tracking dependencies, and outputs, and ensuring reproducibility.

How to create a Reproducible Data and ML Pipeline using DVC: In this video by Ashutosh Tripathi he talked about the step-by-step guide on how to create and build data and ML pipeline using DVC and explained the need for reproducibility in MLOps and how we can achieve it with the help of DVC ML pipelines. In the step-by-step approach, the video starts with explaining the model training code in the notebook and then it takes us to the VS code where we define the proper project structure and create separate py files for data preparation, feature generation, model training, and model evaluation.

End-to-End Machine Learning Portfolio Project (Experiments Tracking with DVC): This video by DataThinkers is about performing experiments tracking with DVC. In this video, the PriyangBhattDS first creates a DVC pipeline from scratch using a cookie-cutter template. He then shows you how to use DVC to track your experiments. He also shows you how to compare different experiments and how to reproduce your best experiment.

Articles:

Using DVC to Manage Machine Learning Projects: This article by Shun Lung C. describes how to use DVC to manage machine learning projects with the following key points: Managing ML Pipelines, Running Experiments, and instructions for setting up DVC, creating a sample pipeline, and using DVC Studio with GitHub Actions for deployment automation.

10 Best Practices for Data Science: A Modern Guide: This article by Mohammed Arshad lists 10 best practices for data science projects, one of which is adopting version control for code and data. The article recommends using Git for code version control and DVC for data. It includes an example of using DVC to track a data file. The commands involve initializing DVC in your project, adding a data file or directory for tracking, and committing the changes with Git.

Data Version Control: Your Data Science Superpower: This article by Nyab explains how Data Version Control (DVC) can be a data scientist's superpower. How it helps data scientists track changes in data, models, and experiments alongside their code. The article also mentions Makefiles as a companion tool to DVC with key benefits such as Reproducibility, Collaboration, and Experimentation.

Empowering Machine Learning Workflows with DVC and Git: Best Practices for Team Collaboration: This article by Rohan Chakraborty highlights the importance of DVC in managing data and models for machine learning projects, particularly when collaborating with a team, and how it helps with Machine Learning workflows from versioning large files, remote storage solutions like Amazon S3 or Azure Blob storage, and resolving merge conflicts The article also emphasizes clear naming conventions for branches, datasets, and models to improve project organization.

Top 20 AI tools for business: Your 2024 AI toolbox: In this article, Yulia Dmitrievna and Eduard Parsadanyan explore various AI tools helpful for businesses. It categorizes them into five sections: data and model management, LLM and embedding solutions, specialized AI services, quality assurance and monitoring, and workflow orchestration. The article highlights OpenAI, Anthropic, and LLMware as some providers offering access to LLMs via APIs. for data and model management, it explains DVC as an open-source version control system designed specifically for machine learning projects which allows data scientists and ML engineers to track changes in data, models, and experiments alongside their code.

Leveraging DVC in ML workflows: This article by Rohan Chakraborty describes how a team used Data Version Control (DVC) to build a scalable and efficient Machine Learning Operations (MLOps) framework for a car door module fault detection system. DVC helped manage large datasets, version control models and pipelines, and integrate with CI/CD for deployment. The article details the benefits of DVC in data versioning, remote storage configuration, pipeline management, collaboration and experimentation, and CI/CD integration. Overall, DVC improved the efficiency, reproducibility, and scalability of the ML project.

Phew! 😅 I told you that would be long! Thanks for the read! Many thanks to all our Community members for sharing their use cases and learnings on DVC! We are looking forward to seeing what you do with DataChain! And as always special thanks to Gift Ojeabulu for his help curating and writing about the Community content! We'll see you next month!

To your continued success,

Jenifer De Figueiredo (Jeny), Community Manager, DVC.ai

August 2024 DVC Pulse!

iterative.ai

Developer tools for Data, Machine Learning and Generative AI

✨The Latest

🤲🏼 Contributions

🗞️ In the News

🎥 New Videos

🤔 What do you think?

Recommended by LinkedIn

🔎 What we're looking at

🔮What's Coming!

🤗 Community-Generated Content

More articles by this author

Insights from the community

Others also viewed

132B Open LLM from Databricks outperforms Mixtral and Grok-1 🔥

Using Databases and Data Warehouses as Vector Databases for AI Agents

Harnessing AI for Log analysis using AI functions in Databricks

Delphina Blueprints: Freeing Data Scientists to (Finally) Focus on the Deep Work

DATA Pill #049 - 91% of ML Models degrade in time, MLflow 2.3 and Secrets of Deep Reinforcement Learning

Retrieval Augmented Generation (RAG) for Structured Data Processing

DATA Pill #052 - LLM, observability, Data Catalogs & storage cost reduction again

How to build your scale-up data infrastructure for AI workloads?

💊 DATA Pill #121 - Local & Free Multi-Agent RAG Superbot, Data Mesh - Where Are We Now?

Do you still need RAG (Retrieval Augmentation Generation) now that we have Microsoft Copilot Pro?

Explore topics

✨The Latest

🤲🏼 Contributions

🗞️ In the News

🎥 New Videos

🤔 What do you think?

Recommended by LinkedIn

🔎 What we're looking at

🔮What's Coming!

🤗 Community-Generated Content

October 2024 DVC Pulse!

Oct 18, 2024

June 2024 DVC Pulse!

Jun 27, 2024

May 2024 DVC Pulse!

May 30, 2024

April 2024 DVC Pulse!

Apr 26, 2024

DVC Community February Updates!

Feb 28, 2024

DVC Community January Updates!

Jan 26, 2024

DVC Community December Updates

Dec 5, 2023

DVC Community November Updates!

Nov 7, 2023

DVC Community September Updates

Sep 26, 2023

DVC August '23 Community Updates

Aug 22, 2023

Insights from the community

Others also viewed

132B Open LLM from Databricks outperforms Mixtral and Grok-1 🔥

Using Databases and Data Warehouses as Vector Databases for AI Agents

Harnessing AI for Log analysis using AI functions in Databricks

Delphina Blueprints: Freeing Data Scientists to (Finally) Focus on the Deep Work

DATA Pill #049 - 91% of ML Models degrade in time, MLflow 2.3 and Secrets of Deep Reinforcement Learning

Retrieval Augmented Generation (RAG) for Structured Data Processing

DATA Pill #052 - LLM, observability, Data Catalogs & storage cost reduction again

How to build your scale-up data infrastructure for AI workloads?

💊 DATA Pill #121 - Local & Free Multi-Agent RAG Superbot, Data Mesh - Where Are We Now?

Do you still need RAG (Retrieval Augmentation Generation) now that we have Microsoft Copilot Pro?

Explore topics