MLOps at Industrial-Scale: Lessons from Google
Hi everyone,
This is the last edition of Continual Learnings for 2022. Thanks for learning alongside us this year!
I’m not sure if I had more time to read this week, or everyone is trying to ship their latest thing by the end of the year, but we have a particularly long edition this time. There are a lot of fascinating articles, papers, and repos below — hope it makes for some fun holiday reading!
What are we reading this week
Building ML-powered products from the trenches
What building “Copilot for X” really takes: This is a fascinating look into the nitty-gritty details of building an LLM-powered product.
Copilot internals: The authors of this post reverse-engineered Copilot and explain how it works. This is a must-read if you’re building with language models.
New models to try
OPT-IML: This new open source language model from Meta uses instruction fine-tuning, which is one of the techniques driving the rapid advancements in language model capabilities of late (as seen in Google’s FLAN).
New and Improved OpenAI Embedding Model: Better and cheaper. What’s not to like. I built a demo with it this week, and it was super easy and worked quite well out of the box.
Point-E: a System for Generating 3D Point Clouds from Complex Prompts: What GPT-3 did for text generation and DALL·E 2 / Stable Diffusion did for image generation will soon happen for 3D shape generation as well. It isn’t quite at the level to break the internet yet, though.
New LLM capabilities
Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models: We all know by now that language models tend to make up the answer when they don’t know it. This paper proposes a way to have LLMs generate attributions alongside their answers.
Controllable Text Generation with Language Constraints: This paper introduces a benchmark on constrained language generation — the task of generating text while avoiding things that the model maintainer doesn’t want to include in the response. They also present a baseline approach that performs better than off-the-shelf language models.
Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions: Chain-of-thought prompting is one of the most consistently helpful tools in the prompt engineering toolkit. This paper shows how to extend it to knowledge retrieval as well.
From prompt hacking to prompt engineering
Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters: I’ve remarked in the past that “prompt engineering” today is really more like prompt hacking, because we have little understanding of what makes a prompt good or bad; we only know what works. This paper takes a step toward understanding chain-of-thought prompting. One surprising finding is that it works even if the reasoning demonstrations provided are invalid.
Learn Prompting: I’m really trying to keep this from becoming an LLM newsletter, but what can you do when there’s so much interesting stuff happening in that field right now? Anyway, this looks like a great resource if you’re trying to get started in prompt engineering.
Prompt engineering guide: This is a good complement to Learn Prompting: it contains links to a range of resources on this emerging field.
Reinforcement learning from human feedback
RL4LMs: Another powerful-looking library for fine-tuning language models based on human preferences to consider as an alternative to TRLX.
Coadaptive Harness for Effective Evaluation, Steering, & Enhancement (Cheese): This new library attempts to make it easier to build human feedback collection UIs.
Continual Learning for Instruction Following from Realtime Feedback: Most reinforcement learning from human feedback techniques are performed in batches offline. This is an example of how to do it online as human users interact.
Constitutional AI: Rather than fine-tuning language models from human feedback directly, this paper proposes RLAIF: RL from AI feedback. Humans just provide rules, and a model uses those rules to generate a “critique” that can be used as a reinforcement learning signal.
Production ML papers to know
In this series, we cover important papers to know if you build ML-powered products.
MLOps at Industrial-Scale: Lessons from Google
Have you ever wondered what it's like to do MLOps at Google-scale?
A new paper shines a light on how Google deploys, maintains, and improves an “industrial scale” ML system to predict click-through rate, and it is eye-opening.
It's a world where “many dozens of engineers” undertake R&D to drive improvement on a system that supports over 100,000 queries per second.
And despite the scale, there are technologies, techniques and advice for everyone trying to do MLOps. So let’s jump in.
Why CTR prediction is hard
Click-through rate (CTR) prediction is valuable because it’s a primary signal of the usefulness of ads. It feeds directly into the cost per click that advertisers pay.
Google’s CTR prediction model “consists of billions of weights, trains on more than one hundred billion examples, and is required to perform inference at well over one hundred thousand requests per second.” This isn’t a set-and-forget model either. Google is constantly trying to improve its performance without adding training / serving costs or undue complexity.
The paper covers techniques Google uses to improve accuracy, efficiency, reproducibility, calibration, and credit attribution.
We’ll cover their approach here, much of which is applicable to smaller-scale systems too. But first, we’ll describe the model itself.
Model architecture
The paper does not describe the full model architecture, but it reveals some interesting details.
First, the Google team found that the text of the query and the ad headlines is critical context for the model, but, for performance reasons, they forgo representing it with an LLM in favor of a smaller model that uses classical text features like n-grams.
Beyond that, the baseline model is pretty standard — the remaining features are embedded, the embeddings are concatenated, and the model is trained using AdaGrad, log loss, and ReLUs. Google CTR engineers are just like you and me.
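To make that baseline concrete, here is a minimal sketch of the recipe in plain Python: look up one embedding per feature, concatenate, apply a ReLU layer, and train against log loss with AdaGrad. Everything specific here is an assumption for illustration (the toy embedding tables, the layer sizes, the learning rate, and the choice to update only the output weights); the paper does not publish the real architecture.

```python
import math

def forward(feature_ids, tables, w_hidden, w_out):
    """Embed each feature, concatenate, apply one ReLU layer, then a sigmoid."""
    # One embedding lookup per categorical feature, flattened into one vector.
    x = [v for table, fid in zip(tables, feature_ids) for v in table[fid]]
    # Hidden layer with ReLU activations.
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    logit = sum(w * hj for w, hj in zip(w_out, h))
    return h, 1.0 / (1.0 + math.exp(-logit))

def log_loss(p, y, eps=1e-12):
    """Binary cross-entropy for one example, clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """AdaGrad: scale each step by the root of the summed squared gradients."""
    for i, g in enumerate(grads):
        accum[i] += g * g
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)

# Hypothetical toy values: two features with 2-dimensional embeddings.
tables = [[[0.5, -0.2], [0.1, 0.3]], [[0.4, 0.4]]]
w_hidden = [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 0.5, 0.5], [-1.0, -1.0, -1.0, -1.0]]
w_out = [0.0, 0.0, 0.0]
accum = [0.0, 0.0, 0.0]
feature_ids, clicked = (1, 0), 1.0

_, p = forward(feature_ids, tables, w_hidden, w_out)
initial_loss = log_loss(p, clicked)

for _ in range(20):
    h, p = forward(feature_ids, tables, w_hidden, w_out)
    # Gradient of log loss w.r.t. the output weights is (p - y) * h.
    grads = [(p - clicked) * hj for hj in h]
    adagrad_step(w_out, grads, accum)

_, p = forward(feature_ids, tables, w_hidden, w_out)
final_loss = log_loss(p, clicked)
print(initial_loss, final_loss)
```

Only the output layer is trained here to keep the gradient a one-liner; a real model would backpropagate through the hidden layer and the embeddings as well.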
The rest of the paper describes ways they improve on this baseline.
Reducing costs through efficiency
As committed as Google is to ML, even for them any gain from ML needs to be weighed against cost: not just cost of training, but also “long-term cost to future R&D.” This frequently leads to killing ideas that improve performance but are deemed not worth the cost.
So, a parallel aim to improving accuracy is improving efficiency. That means every model change is evaluated against two questions: 1) Does accuracy go up when training cost is held flat? 2) Is training cheaper if model capacity is lowered until accuracy is neutral?
Here are some techniques Google uses to improve efficiency.
Improving accuracy
The paper also discusses techniques aimed at improving accuracy.
- RankNet loss, which aims to make sure the set of candidate ads is properly ranked relative to one another.
- Distillation, which trains a smaller model called the “student” to match the predictions of a larger “teacher” model. A surprising discovery in modern deep learning is that knowledge distillation often produces “student” models that are more capable than the same small model trained from scratch. In Google’s case, this lets them train a larger “teacher” model than would normally be computationally feasible in production.
- Loss curriculum, which borrows from curriculum learning by gradually introducing the more complicated loss functions throughout the course of training.
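As a rough illustration of the first and last items, the sketch below pairs a standard pairwise RankNet loss with a curriculum that phases it in on top of plain log loss. The linear ramp, the `warmup_steps` and `rank_weight` parameters, and the function names are all hypothetical; the summary above does not say which schedule or weighting Google uses.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(p, y, eps=1e-12):
    """Pointwise binary cross-entropy for one example."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def ranknet_loss(scores, labels):
    """Mean pairwise logistic loss over pairs (i, j) with labels[i] > labels[j]."""
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                diff = scores[i] - scores[j]
                # log(1 + exp(-diff)), written in a numerically stable form.
                total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
                pairs += 1
    return total / pairs if pairs else 0.0

def combined_loss(logits, labels, step, warmup_steps=1000, rank_weight=1.0):
    """Log loss plus a ranking term whose weight ramps from 0 to rank_weight."""
    pointwise = sum(log_loss(sigmoid(s), y) for s, y in zip(logits, labels)) / len(logits)
    ramp = min(1.0, step / warmup_steps)  # assumed linear curriculum
    return pointwise + ramp * rank_weight * ranknet_loss(logits, labels)

# The clicked ad (label 1) scored above the unclicked one gives a lower ranking loss.
good = ranknet_loss([2.0, 0.0], [1, 0])
bad = ranknet_loss([0.0, 2.0], [1, 0])
print(good, bad)
```

At step 0 the model trains on log loss alone; by `warmup_steps` the ranking term is fully active.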
Increasing reproducibility
Perhaps one of the most fascinating sections of this paper is on reproducibility.
Training runs for these models are rarely reproducible due to factors like random initialization, non-determinism stemming from distributed compute, numerical errors, hardware, and more.
Irreproducibility is hard to detect in training metrics, and it may impact downstream R&D: model deployment leads to further divergence, as predictions from deployed models become part of subsequent training and research.
To combat this, Google uses the metric Relative Prediction Difference (PD), which measures the absolute point-wise difference in model predictions for a pair of models. PDs are “as high as 20% for deep models”, and methods such as fixed initialization, regularization, dropout, and data augmentation don’t make much of a difference. Ensemble techniques help, but introduce their own forms of technical debt.
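A PD-style metric is easy to compute for your own models. The sketch below takes predictions from two training runs on the same examples and averages the absolute point-wise difference, normalized by the pair’s mean prediction. That normalization is our reading of the “relative” in the name and should be treated as an assumption, as should the function name.

```python
def relative_pd(preds_a, preds_b):
    """Mean absolute point-wise difference between two models' predictions,
    each difference divided by the pair's average prediction (assumed form)."""
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("need two non-empty prediction lists of equal length")
    total = 0.0
    for a, b in zip(preds_a, preds_b):
        mean = (a + b) / 2.0
        if mean > 0.0:  # two zero predictions contribute no difference
            total += abs(a - b) / mean
    return total / len(preds_a)

# Two hypothetical runs of the same model, scored on the same three examples.
run_1 = [0.10, 0.02, 0.30]
run_2 = [0.12, 0.02, 0.27]
print(relative_pd(run_1, run_2))
```

Identical runs score 0, and the value grows as the two runs disagree more on individual examples, even when their aggregate metrics look the same.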
Experimentation showed that ReLUs were a contributing factor because the “gradient discontinuity at 0 induces a highly non-convex loss landscape.” Moving to the Smooth ReLU (SmeLU) activation function reduced PD to less than 10% and also improved accuracy by 0.1%.
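For reference, here is a small sketch of SmeLU next to its gradient. The formula is not given above; this uses the commonly cited piecewise form, which replaces ReLU’s kink with a quadratic segment on [-beta, beta]. The default `beta=1.0` is a hypothetical choice, not a value from the paper.

```python
def smelu(x, beta=1.0):
    """Smooth ReLU: 0 below -beta, identity above beta, quadratic in between."""
    if beta <= 0.0:
        raise ValueError("beta must be positive")
    if x <= -beta:
        return 0.0
    if x >= beta:
        return x
    return (x + beta) ** 2 / (4.0 * beta)

def smelu_grad(x, beta=1.0):
    """Gradient of SmeLU: rises linearly from 0 to 1 instead of jumping at 0."""
    if beta <= 0.0:
        raise ValueError("beta must be positive")
    if x <= -beta:
        return 0.0
    if x >= beta:
        return 1.0
    return (x + beta) / (2.0 * beta)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, smelu(x), smelu_grad(x))
```

The gradient is continuous everywhere, which is the property the quote above points to; as beta shrinks toward 0 the function approaches the ordinary ReLU.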
Generalizing across UI treatments
CTR performance of an ad is impacted by the UI it belongs to, so it’s important to be able to tease apart the contributions of the two. To do so, the Google team replaces the single CTR model with 𝜏(𝑄·𝑈), composed of a transfer function 𝜏 and separable models 𝑄 and 𝑈 that output vectorized representations of the Quality and the UI, respectively, which are then combined using an inner product.
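A toy version of that factorization, with the two sub-models replaced by fixed vectors, might look like the sketch below. Using a sigmoid for the transfer function is an assumption, and the vectors and names are made up for illustration; the summary above does not specify what 𝜏 is.

```python
import math

def inner_product(q, u):
    """Q·U: combine the quality vector and the UI vector into one scalar."""
    if len(q) != len(u):
        raise ValueError("quality and UI vectors must have the same length")
    return sum(a * b for a, b in zip(q, u))

def transfer(z):
    """Hypothetical transfer function tau: a sigmoid mapping a score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_ctr(quality_vec, ui_vec):
    """tau(Q·U) for one ad shown in one UI treatment."""
    return transfer(inner_product(quality_vec, ui_vec))

# Stand-ins for the outputs of the Q model (one ad) and the U model (two treatments).
ad_quality = [0.5, 1.0]
ui_top_slot = [1.0, 0.5]
ui_side_slot = [0.2, -0.5]

ctr_top = predict_ctr(ad_quality, ui_top_slot)
ctr_side = predict_ctr(ad_quality, ui_side_slot)
print(ctr_top, ctr_side)
```

Because the ad’s quality vector is the same in both calls, any difference between the two predictions comes from the UI treatment alone, which is what makes the contributions separable.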
The upshot
The practical advice provided in this paper is well worth understanding, even if you, like more or less every other machine learning team today, operate at a much smaller scale than Google’s CTR model.
You might want to read it alongside another paper we recently summarized on MLOps best practices from a wider range of companies.
Check out the paper here.
Thanks for reading!
Feel free to get in touch if you have any questions: you can message us on socials or simply reply to this email.
The Gantry team