October 2024 DVC Pulse!

👋🏼 Hi friend! Welcome to October! 

Lots of juicy new content this month to dig into, so let's get at it!

✨ The Latest

DataChain repo - Thank you, Community, for helping us reach 776 stars already! ⭐️🤩🚀 Please check out the repo and its starter projects. We would love your feedback! 🙏

🎥 Videos

Fine-tuning Multimodal Models (CLIP) with DataChain to Match Cartoon Images to Joke Captions - Technical Product Manager David Berenbaum walks through his blog tutorial (see below) in this video using DataChain. You will learn how to ingest and process image and text data, join datasets using DataChain, calculate image-text similarities with CLIP, fine-tune CLIP on custom datasets, and evaluate model performance before and after fine-tuning.
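If the "image-text similarity" step sounds abstract, here is a minimal sketch of the idea in plain Python. It is not code from David's tutorial and it does not call CLIP or DataChain: the three-dimensional vectors and the names `cosine_similarity` and `best_caption` are made up for illustration. It only assumes that each image and each caption has already been turned into an embedding vector, and that the best match is the caption whose vector points in the most similar direction.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_caption(image_embedding, caption_embeddings):
    """Return the caption key with the highest similarity, plus all scores."""
    scores = {
        caption: cosine_similarity(image_embedding, embedding)
        for caption, embedding in caption_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy vectors standing in for real model embeddings.
image_embedding = [0.9, 0.1, 0.0]
caption_embeddings = {
    "caption_a": [1.0, 0.0, 0.0],
    "caption_b": [0.0, 1.0, 0.0],
}

winner, scores = best_caption(image_embedding, caption_embeddings)
print(winner, scores)
```

Comparing the winning caption against a known correct caption, before and after fine-tuning, is one simple way to think about the evaluation step mentioned above.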

Scalable PDF Document Processing with DataChain and Unstructured.io - Customer Success Engineer Tibor Mach demonstrates how to efficiently process large collections of documents using DataChain and Unstructured.io. Key points covered include introductions to DataChain and Unstructured.io, scalable document processing without moving data, filtering and lazy evaluation for efficient processing, creating custom logic with user-defined functions (UDFs), versioning and metadata layer management, and transforming messy document collections into structured tables.
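For readers new to the terms, here is a rough sketch of what "filtering and lazy evaluation" with a user-defined function can look like. It uses only Python built-ins and is not the DataChain or Unstructured.io API; the document records and the names `list_documents`, `is_pdf`, and `summarize` are hypothetical. The point is that nothing is processed until results are actually requested, and that filtered-out records never reach the UDF.

```python
processed = []  # records which documents the UDF actually touched


def list_documents():
    """Yield hypothetical document records one at a time."""
    records = [("report.pdf", 120), ("notes.txt", 3), ("paper.pdf", 450)]
    for name, size_kb in records:
        yield {"name": name, "size_kb": size_kb}


def is_pdf(doc):
    """Filter predicate: keep only PDF files."""
    return doc["name"].endswith(".pdf")


def summarize(doc):
    """Stand-in UDF: turn a raw record into a structured row."""
    processed.append(doc["name"])
    return {"name": doc["name"], "size_mb": doc["size_kb"] / 1024}


# Building the pipeline does no work yet: filter and map are lazy.
pipeline = map(summarize, filter(is_pdf, list_documents()))
work_done_before_consuming = len(processed)

# Consuming the pipeline runs the UDF, and only on the filtered records.
rows = list(pipeline)
print(work_done_before_consuming, rows)
```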

✍🏼 Blog Posts

Check out lots of new content on the blog from the last month:

Scalable PDF Document Processing with DataChain and Unstructured.io - This is the blog post tutorial that Customer Success Engineer Tibor Mach showcases in the above-referenced video. We’d love your feedback when you try out the tutorial! Let us know in this topic in our Discourse forum.

Post-modern AI Data Stack - DataChain’s Technical Product Manager, Daniel Kharitonov, shares how and why we think Generative AI will change the modern data stack. What's your take on this? Leave your feedback here in our forum!

You Do the Math: Fine-Tuning Multimodal Models (CLIP) to Match Cartoon Images to Joke Captions - This is the blog post that DVC Technical Product Manager David Berenbaum goes over in the above-referenced video. Try out the project and let us know what you think at this link to the forum!

Enforcing JSON Outputs in Commercial LLMs - Daniel Kharitonov tested the structured output capabilities of Google Gemini Pro, Anthropic Claude, and OpenAI GPT. In their best-performing configurations, all three models can generate structured outputs at a scale of thousands of JSON objects. However, the APIs vary significantly in the effort required to prompt the models to produce JSON and in their ability to adhere to the suggested data model layouts. This article originally appeared here in Towards Data Science. Add your thoughts here in the forum!
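To make "adhering to a data model layout" concrete, here is a small sketch of the kind of check you might run on a model's raw text reply. It is not taken from Daniel's article and calls no vendor API: the `Review` model and the `parse_structured_output` helper are hypothetical, and the check uses only the Python standard library. A reply counts as conforming only if it is valid JSON, is an object, has exactly the expected fields, and each field has the expected type.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Review:
    """Hypothetical data model the LLM is asked to fill in."""
    sentiment: str
    score: int


def parse_structured_output(raw_text, model=Review):
    """Return a model instance, or None if the reply does not conform."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None  # extra prose or malformed JSON
    if not isinstance(data, dict):
        return None  # valid JSON, but not an object
    expected = {f.name: f.type for f in fields(model)}
    if set(data) != set(expected):
        return None  # missing or unexpected fields
    for name, expected_type in expected.items():
        if not isinstance(data[name], expected_type):
            return None  # wrong type for a field
    return model(**data)


good = parse_structured_output('{"sentiment": "positive", "score": 4}')
chatty = parse_structured_output('Sure! Here is the JSON: {"sentiment": "positive"}')
print(good, chatty)
```

Counting how many replies out of a few thousand come back as `None` gives a rough failure rate for a given prompt and configuration.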

If you didn’t know yet, we now have a publication page on Medium! Be sure to bookmark it for your reading list!


DVC Publication on Medium

🤲🏼 Contributions

We’d like to give a shout-out to the following Community members for already jumping in and contributing to DataChain:

EdwardLi Coder for:

Srini047 for:

ayasyrev for:

Feel free to open an issue in our repos or you can always reach out to us for support in our Discord Channel, or write us at support@dvc.org.


🗞️ In the News

The New Stack recently published an article by our CEO, Dmitry Petrov, entitled Prioritize Robust Engineering Over Overblown GenAI Promises. This article addresses what we know will be true: that ultimately we've got to do the practical engineering work to actually deliver on the speculative, unfulfilled GenAI potential we've seen unfold in the last year.


🔎 What we're looking at

The State of AI Report - Nathan Benaich just released his annual not-to-be-missed State of AI report with predictions for the next year.

Dr. Sasha Luccioni, Bruna Sellin Trevelin, and Margaret Mitchell recently released The Environmental Impacts of AI - Policy. As the name suggests, it is a great starter guide to the different ways the AI supply chain uses natural resources. Currently, there is not much good data on this topic, and the actual impact is ambiguous and can only be estimated. This primer works to shed light on the subject and show a path forward. Many countries are starting to put forward laws requiring transparency and reductions, but these are largely in their nascent stages. The piece also addresses technical, behavioral, organizational, and policy interventions that can reduce AI’s environmental impact.

NVIDIA releases NVLM - An Open Frontier-class Multimodal LLM with performance on benchmarks that competes with or, in some cases, outperforms leading LLMs. The multimodal results are accomplished without degrading text-only performance. The architecture and data curation insights for the multimodal pretraining data and the multimodal supervised fine-tuning (SFT) data are provided in the paper. It is important to note this quote: “the quality of the dataset matters more than its scale, even at the pretraining stage.”


NVIDIA NVLM Open Frontier-class Multimodal LLM Benchmark performance

Software as a Public Good - by Felix Reda of GitHub - Felix makes the case that open source is a public good and that, as such, its maintenance should be funded by industry and government alike. He brings attention to the recent initiative of the UN hosting the OSPOs for Good summit at their headquarters in New York in hopes of ultimately adopting the Global Digital Compact to help increase funding for Digital Public Goods (DPGs) that attain Sustainable Development Goals (SDGs). Yep, there are a lot of acronyms here. 😅 TLDR: 80% of DPGs live on GitHub, so you can get involved in supporting open-source principles by contributing to For Good First Issues. These projects have been identified as DPGs.

DataComp-LM: In search of the next generation of training sets for language models was released in June. The DataComp competition was developed to create a benchmark where the models are fixed and the objective is to create the best possible dataset. The competition has expanded to include LLMs.

The research by Jeffrey Li et al., representing a large number of universities and industry professionals alike, shows how improved dataset curation can help deliver improved performance at a lower cost than continued scaling of the data.


DCLM State-of-the-art-comparison

Image of an Actually Lean AI Strategy: Avoiding Compounding Costs, Multiplying Impact of AI Initiatives, Mitigating Fatal or Hidden Risks, and Always Putting Data First - Michele Nieberding 🚀 got my attention on LinkedIn with the great meme. Her article discusses the importance of establishing a strong data foundation and effective governance practices for successful AI deployments. She emphasizes that avoiding pitfalls in AI implementation requires focusing on data quality, availability, and integrity, as well as fostering collaboration between data producers and consumers while aligning AI initiatives with business objectives.


Data quality over AI meme
Yep! Pretty much!

🤗 Community-Generated Content

Videos:

Paying off Technical Debt in the Development of Machine Learning Apps - William Galindez Arias presented at Python Italia earlier this year on this important topic. He shares some best practices for reducing technical debt. DVC makes an appearance as part of a chatbot project he worked on in the past. William produces content that is entertaining and informative! Be sure to check out some of his other work here on Scaling your RASA chatbot, Scaling Financial Machine Learning with MLOps, and Beyond Grid Search: XGBoost and Optuna as the ultimate ML Optimization.

Sunny Tech 2024 - RAG Pour Une Entreprise Médicale: Les Défis techniques et products AI & Data - In French! Noé Achache speaks at the Sunny Tech Conference on a new LLM use case for Vidal using LangSmith, OpenAI, DVC, Qdrant, and Chainlit. You can find out more about his techniques here on our channel, where he uses DVC with Qdrant in a RAG application.

A couple of great videos are up from Data Thinkers at Minish Analytics:

Part 4: End-to-end machine Learning Portfolio Project | Modular Code| MLOps DVC Pipeline - This video is part four in a series on an end-to-end water potability project. It covers converting to modular code and using dvc repro and dvc dag.

Part 6: Churn End-to-End MLOps Project | Data Validation Component - This video is part 6 in an end-to-end series for a water potability project. This one focuses on setting up versioning with DVC and creating a DVC pipeline.

Articles:

Experiment Tracking with Data Version Control (DVC) - Michael Grüner's third piece in a series on DVC. This one focuses on experiment tracking and the DVC methods and integrations that support it. It covers how experiments work with DVC in your Git repo, how to queue experiments, hyperparameter tuning, Hydra usage, and pushing and pulling experiments on a computer vision project. It also includes some great diagrams!

AI Bill of Materials (AI BOM) - Bijit Ghosh, CTO at Deutsche Bank, provides an extensive overview of everything that goes into an AI Bill of Materials and, even more interestingly, what to expect in the future development of this best practice.

Reproducibility in Academic Research (1/3): Key Components of Reproducible Research - This article is the first part of a three-part series on reproducibility in academic research, focusing on computational analysis. Valentin Guigon, PhD, discusses the importance of reproducibility in scientific research, explaining the FAIR principles (Findable, Accessible, Interoperable, Reusable) for data management. The article outlines key components of reproducible research, including data management, project organization, coding practices, documentation, version control, and experiment tracking. It emphasizes the need for stable computational environments and draws parallels between industry practices (DevOps and MLOps) and academic research needs. The author also introduces tools and practices that can help researchers achieve reproducibility in their work.

Empowering Machine Learning Workflows with DVC and Git: Best Practices for Team Collaboration: Rohan Chakraborty discusses best practices for machine learning workflows using Git and DVC (Data Version Control) for team collaboration. He covers branching strategies, version control for datasets and models, and resolving merge conflicts. He emphasizes the importance of clear communication, consistent naming conventions, and documentation. The article also touches on using cloud storage options like Amazon S3 and Azure Blob Storage with DVC, and recommends automated testing and deployment processes. Definitely worth a read if you are getting started with DVC!

Leveraging DVC in ML workflows - Another great article from Rohan Chakraborty. This time, he discusses the implementation of a machine learning-based fault detection system for car door modules across multiple European manufacturing plants. The project involved handling high-resolution vibration data from various sensors. DVC was used for data versioning, remote storage configuration, pipeline management, and facilitating collaboration. The article concludes by discussing DVC's integration with CI/CD pipelines for maintaining up-to-date models.

Getting Started with Git and DVC: Version Control Essentials for ML Projects - Surekha Gaikwad created this solid getting-started tutorial for Git and DVC, with another great diagram! She provides the sequence of commands you need to navigate Git and DVC seamlessly! 💪🏽

FTzard: Simplifying LLM Experiment Management for Data Scientists - Aamir Ahmad Ansari, Data Scientist at Roadzen Technologies, introduces FTzard, a comprehensive framework designed to simplify large language model (LLM) experiment management for data scientists. FTzard integrates various tools, including Dagster for orchestration, MLflow for experiment tracking, Pixi for environment reproducibility, and DVC for data versioning. The framework offers an end-to-end continual learning pipeline, reproducible environments, a simplified project structure, and enhanced organization. The article also details FTzard's enhanced MLflow design and advanced Dagster orchestration.

Building A.I. Based Apps: A General Approach - Forrester Terry mentions DVC as a resource for cross-phase versioning in this article on building LLM applications. The article outlines a five-phase approach to AI application development: Input, Process, Quality Assurance, Deployment, and Output. Each phase is explained in detail, with key components, pro tips, and common pitfalls. The article uses an example of building an AI-powered content recommendation system for a news website to illustrate concepts. It concludes with a list of resources for each development phase and some final advice for AI application developers.

LinkedIn: Using Conventional Commit with DVC - Dr. Matthew Upson 📈 poses a question and potential issue for Conventional Commit in this post exploring best practices to provide better readability in your commits and make them machine readable for AI applications. Seems others are interested in additional commit types 1, 2, 3


Thanks for the read! Many thanks to all our Community members for sharing their use cases and learnings on DVC! We are looking forward to seeing what you do with DataChain! And, as always, special thanks to Gift Ojeabulu for his help curating the Community content! We'll see you next month!

To your continued success,

Jeny

Community Manager, DVC.ai
