Why do ML Projects Fail?


Welcome to Continual Learnings

A weekly newsletter for practitioners building ML-powered products.

What we're reading this week

Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Programming: Copilot is arguably the most successful of the current generation of AI-powered products. This paper studies how it has impacted the behavior of programmers who use it.

Augmenting Pre-trained Language Models with QA-Memory for Open-Domain Question Answering: LLMs perform best when up-to-date, relevant context is provided in the prompt. That’s one of the secrets of why Copilot is so good. A rapidly emerging pattern for supplying that context is retrieval-augmentation, where approximate nearest neighbor search is used to pull in the relevant information from a corpus. This paper pushes the concept forward for question answering systems and (claims to) hit state-of-the-art.

A Watermark for Large Language Models: Detecting language model-generated output is critically important in domains like education. This paper proposes a framework for statistical watermarking. When deployed as part of the model, it makes the model’s outputs reliably detectable without impacting text quality.

Batch Prompting: Efficient Inference with Large Language Model APIs: Proposes a way to lower the cost of language model inference by having the LLM process samples in batches instead of one at a time.

All My Machine Learning Problems are Actually Data Management Problems: This talk by Shreya Shankar draws some interesting connections between challenges in production machine learning and parallel problems in data management.

Precise Zero-Shot Dense Retrieval without Relevance Labels: Proposes Hypothetical Document Embeddings (HyDE). If you’re using the retrieval-augmentation pattern described above, rather than naively embedding your query, instead use a language model to “expand” the query into a hypothetical document and embed that instead. The intuition for why this works is that the hypothetical document is more like the rest of your corpus, which leads to more accurate nearest neighbor results.
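To make the HyDE idea concrete, here is a minimal sketch under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, `write_hypothetical_document` is a hypothetical placeholder for an LLM call that returns a canned passage, and the two-document corpus is invented for illustration. None of this is the paper's code; it only shows the shape of the pattern, which is to embed a generated document instead of the raw query.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def write_hypothetical_document(query):
    """Hypothetical stand-in for an LLM call that drafts a plausible answer."""
    return ("accuracy often degrades after deployment because of "
            "distribution shift between training data and live traffic")


def retrieve(query, corpus, expand=None):
    """Return the corpus document nearest to the (optionally expanded) query."""
    text = expand(query) if expand else query
    query_vector = embed(text)
    return max(corpus, key=lambda doc: cosine(query_vector, embed(doc)))


# Invented corpus: one relevant document, one that only shares surface words.
corpus = [
    "distribution shift between training data and live traffic degrades accuracy after deployment",
    "the fashion model said the production of the show was worse than last year",
]
query = "why is my model worse in production"

naive_hit = retrieve(query, corpus)  # embeds the raw query
hyde_hit = retrieve(query, corpus, expand=write_hypothetical_document)  # embeds the generated document
```

In this toy setup the raw query matches the fashion document on surface words ("model", "production", "worse"), while the expanded query lands on the distribution shift document because it shares the corpus's vocabulary. A real system would swap in an actual LLM and embedding model.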

Production ML papers to know

In this series, we cover important papers to know if you build ML-powered products.

MLOps: A Holistic Approach

MLOps is a field devoted to studying how infrastructure, tools, and other technology can solve ML problems.

But a new paper asserts that this is, at best, misguided: “technology is rarely why ML projects fail.”

So why do ML projects fail? And what can we do about it?

Why do ML Projects Fail?

The thesis of today’s paper (from Weights & Biases) is that the hard parts of doing ML are (i) enabling people to succeed and (ii) having well-designed processes in place. Of course, (iii) the technology platform matters, but to a lesser degree.

Let’s talk through some of the main failure modes to avoid.

People

To be honest, you probably already know how to help your people succeed at delivering ML. That doesn’t stop most teams from making some common mistakes.

First, the authors observe that people perform better when their roles are clearly defined and they sit in the right part of the org (whether that’s embedded in the business unit or in a separate team). That sounds pretty obvious, yet it’s still common for everyone on the team to cover every part of ML, from data cleaning to modeling to platform engineering.

Second, the authors assert that ML requires a different approach to project management that views ML delivery as R&D and not business-as-usual. That’s because model performance requirements can be unknown, progress non-linear, and breakthroughs hard to predict, making estimation of timelines and business value hard.

For those of us who work in ML day-to-day, that won’t come as a surprise. But many teams still fail by focusing on low-risk (and low-ROI) tasks, and by using project management techniques better suited to conventional software delivery, such as agile and scrum.

Processes

Like project planning and team organization, processes are the fruit and vegetables of successful ML delivery. You already know you’re supposed to be doing these things, but it’s hard to be consistent.

For example, ML teams need to understand the business value a project can achieve. So the authors recommend doing a scoping and opportunity sizing process before starting the project.

The paper includes a framework, originally provided by fast.ai, that covers the business drivers to consider upfront:

(Image: the scoping framework, originally from fast.ai.)

The paper also points out that “ML is often used to optimize operational decisions. However, models only provide predictions, not business decisions.” So the authors recommend doing a “decision optimization” process to calibrate model outputs to business decisions.

This might be as straightforward as recognizing that the cost of false positives for a classification model is high, and tweaking the decision threshold accordingly. Or it might be more involved, for example when a model output is used to inform a profit curve that underpins decision making.
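As a rough illustration of the simpler case, the sketch below picks the decision threshold that minimizes total cost on a validation set. The labels, scores, and cost figures are made up for the example, and the function names are ours rather than anything from the paper.

```python
def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Cost of acting on predictions when scores >= threshold count as positive."""
    cost = 0.0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 0:
            cost += cost_fp  # false positive
        elif not predicted_positive and label == 1:
            cost += cost_fn  # false negative
    return cost


def best_threshold(y_true, scores, cost_fp, cost_fn):
    """Scan thresholds 0.01..0.99 and return the first one with the lowest cost."""
    candidates = [i / 100 for i in range(1, 100)]
    return min(
        candidates,
        key=lambda t: total_cost(y_true, scores, t, cost_fp, cost_fn),
    )


# Made-up validation labels and model scores.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.10, 0.30, 0.45, 0.50, 0.60, 0.70, 0.85, 0.95]

# Equal costs versus a business where a false positive is 10x as expensive.
balanced = best_threshold(y_true, scores, cost_fp=1.0, cost_fn=1.0)
fp_averse = best_threshold(y_true, scores, cost_fp=10.0, cost_fn=1.0)
```

When false positives are made ten times as costly, the chosen threshold moves up: the model itself is unchanged, and only the decision rule layered on top of it shifts.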

Another interesting process suggestion is to use governance to find and address potential sources of bias. Drawing on another paper, A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle, the authors illustrate seven types of machine learning bias that may negatively impact a model:

(Image: the seven types of machine learning bias.)

Each type of bias has different causes and different potential mitigations, so it’s helpful to have a governance process that is focused on the project as a whole, not any one particular piece (like model evaluation).

Platform

Though the authors believe that people and processes are the most effective lever for accelerating ML teams, infrastructure plays a role as well.

When selecting tools for an end-to-end platform, keep in mind that there’s no one-size-fits-all solution. Factors that affect the right tool stack include the skill sets of your org (for example, if Kubernetes is required, is there sufficient knowledge of it on the team?); the existing infrastructure; and whether a set of specialized point solutions is better than a monolithic solution.

As an editorial note, I’d add that the product you’re trying to build with ML has a bigger impact on the right tools than any of the factors outlined above. There is no monolithic MLOps stack: the right tools for building a lead segmentation model are completely different from those needed to build a chatbot.

The paper also highlights the need to identify the right level of abstractions for data scientists, so that they can focus on high-value activities. This is tricky because of a “fundamental mismatch between how much infrastructure is needed at different parts of the stack and what data scientists care about”, as illustrated by the chart below:

(Image: chart comparing how much infrastructure each part of the stack needs with how much data scientists care about it.)

Data scientists tend to care less about capabilities that require more technological infrastructure. Source: Effective Data Science Infrastructure

Generally speaking, data scientists want to spend more time in model development and feature engineering - this means choosing tooling at a higher level of abstraction in other parts of the stack.

So what?

MLOps has been great for the industry because it has led to increased focus on getting models out of the lab and into production. But MLOps is also misguided, because it puts the focus on tools and infrastructure rather than people and products.

This paper’s contribution is moving beyond the usual focus on the technology and tools required to support MLOps, to areas such as organization design, project management, and determining business value.

The paper can be found here.

Thanks for reading!

Feel free to get in touch if you have any questions: you can message us on socials or simply reply to this email.

You can also find previous issues on our blog, on Twitter, and here on LinkedIn.

The Gantry team
