How to measure language model performance

Welcome to Continual Learnings

A weekly newsletter for practitioners building ML-powered products.

What we're reading this week

CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code

Evaluating generative models like LLMs is notoriously difficult because it’s hard to tell which outputs are better without the help of humans. Recently, the research community has explored training auxiliary models to assess the performance of these hard-to-evaluate generative models. This paper demonstrates that the approach works for code generation models.

Promptable: build full stack AI apps in typescript

Libraries like LangChain are having a moment. Their purpose is to make it easy to compose language models, vector similarity search, and other operations into language model apps. Promptable is a LangChain-like library for the JavaScript ecosystem. That’s exciting because you don’t need ML expertise to build applications with this stack, so removing the dependence on Python should let more people create AI apps.

The Wisdom of Hindsight Makes Language Models Better Instruction Followers

Reinforcement learning from human feedback (RLHF) has captured the attention of the ML community because of its role in ChatGPT. However, RLHF is difficult to implement: it requires training an auxiliary model and using a notoriously finicky heavyweight RL algorithm like PPO. This paper shows that, while human feedback is incredibly valuable, the RL part might not be necessary. You just need to be clever about transforming the feedback into a signal that can be used for supervised learning.

Feedback loops, and Google’s home-field advantage with LLMs

This article points out that a “native” feedback loop, i.e., one where users give feedback implicitly through outcomes like click data, is more valuable than a feedback loop that requires users to provide explicit feedback such as a thumbs up or thumbs down.

Notes on ML serving

Hamel Husain’s notes might clarify some of what you find confusing about model serving.

Toolformer: Language Models Can Teach Themselves to Use Tools

This week in “does twitter imitate papers or do papers imitate twitter”, researchers discovered a phenomenon that the LLM community online has known for a while: you can give LLMs access to tools (e.g., APIs) that they can call to solve problems, like arithmetic, that they are not naturally good at. In all seriousness, this line of work seems like one of the paths to much broader usefulness of language models across tasks.

Full Stack Deep Learning: LLM Bootcamp

Ok, I’m teaching this one, not reading it. Just like with the original Full Stack Deep Learning back in 2018, we realized that there’s a huge body of knowledge about how to build products with LLMs that is currently being passed from practitioner to practitioner through twitter threads and newsletters like this one. This course is our first attempt to formalize this into a guide to building applications with this exciting new stack.

Production ML papers to know

In this series, we cover important papers to know if you build ML-powered products.

Holistic Evaluation of Language Models

You probably feel like language models are advancing at a stunning pace.

But how do we know they really are? And how can we quantify how much better the latest-and-greatest (e.g., GPT-3) is than a less expensive alternative, and how much those differences will matter in the real world?

Today’s paper proposes an approach that might help.

The challenge

Language models (LMs) are becoming ubiquitous in the post-ChatGPT world. But how well do you really understand how the latest models perform? Sure, they have impressive few-shot capabilities and suffer from a tendency to hallucinate. But we’re MLEs, we should be able to quantify that, right?

Typically, researchers assess LMs on a limited subset of their possible applications using a single metric like accuracy. These benchmarks aren’t standardized, making performance comparison hard.

The example of ImageNet showed that AI research benefits from having a generally accepted standard benchmark. Without an equivalent in the LM world, knowledge of the pros and cons of different models is disseminated through word-of-mouth and random twitter threads.

The solution

The main challenge in evaluating LMs is that they are adapted to many different scenarios. This calls for a holistic approach to evaluation.

To that end, the paper proposes HELM (Holistic Evaluation of Language Models), which is based on the following pillars:

  • A core set of scenarios that represent common tasks (such as question answering, information retrieval, summarization, toxicity detection) and domains (such as news, books).
  • Multi-metric assessment to better represent model impact. Assessing models on calibration, robustness, fairness, bias, toxicity, and efficiency, as well as accuracy, in the same scenarios they are expected to be deployed in makes the potential tradeoffs in model performance more explicit.
  • Standardization of evaluation, making the object of evaluation the language model and not the scenario-specific system. To this end, the paper benchmarks 30 prominent language models.
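
To make one of the non-accuracy metrics above concrete, here is a minimal sketch of expected calibration error (ECE), a common way to score calibration. It assumes equal-width confidence bins and a hypothetical function name; it is an illustration, not necessarily how the paper computes its calibration numbers.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between confidence and accuracy per bin.

    Assumes equal-width bins over [0, 1]; other binning schemes exist.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")

    # Group each prediction by the confidence bin it falls into.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece
```

A model that says it is 90% sure and is right 90% of the time scores close to zero; a model that is confidently wrong scores high, even if its overall accuracy looks fine.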

Results

So what is the impact of standardizing LM evaluation?

The paper reports that, prior to HELM, models were on average evaluated on only 17.9% of the core scenarios - meaning there was no way of comparing results, and no guarantee that the models were assessed against all the scenarios they might be deployed in, nor against the relevant metrics. 

HELM evaluated 30 prominent language models to improve this coverage to 96%, facilitating direct, head-to-head comparisons. The chart below illustrates the increase in coverage that HELM offers compared with previous work in the field.

[Figure: scenario coverage in previous work compared with HELM]
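
As a rough illustration of what a coverage number like this means, the sketch below computes the fraction of (model, core scenario) pairs that have been evaluated. The function name, model names, and scenario names are made up for the example; the real figures come from the paper's own survey of prior work.

```python
from typing import Mapping, Set


def scenario_coverage(
    evaluated: Mapping[str, Set[str]],
    core_scenarios: Set[str],
) -> float:
    """Fraction of (model, core scenario) pairs with an evaluation result."""
    total_pairs = len(evaluated) * len(core_scenarios)
    if total_pairs == 0:
        raise ValueError("need at least one model and one core scenario")
    # Only count scenarios that are part of the core set.
    covered = sum(len(done & core_scenarios) for done in evaluated.values())
    return covered / total_pairs


# Hypothetical toy data: three models, four core scenarios.
core = {"question_answering", "information_retrieval", "summarization", "toxicity_detection"}
prior_work = {
    "model_a": {"question_answering", "summarization"},
    "model_b": {"question_answering"},
    "model_c": set(),
}
print(f"{scenario_coverage(prior_work, core):.0%}")
```

In this toy setup, only 3 of the 12 possible pairs have results, so most head-to-head comparisons are impossible until the gaps are filled in.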

This benchmark used few-shot prompting with relatively simple, generic prompts.
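
For readers who have not built one, a simple, generic few-shot prompt can be assembled as in the sketch below. The function name, labels, and layout are our assumptions for illustration and are not the exact templates used in the paper.

```python
from typing import Sequence, Tuple


def build_few_shot_prompt(
    instructions: str,
    examples: Sequence[Tuple[str, str]],
    query: str,
    input_label: str = "Question",
    output_label: str = "Answer",
) -> str:
    """Join instructions, solved examples, and the unsolved query into one prompt."""
    blocks = []
    if instructions.strip():
        blocks.append(instructions.strip())
    # Each in-context example shows the model an input with its expected output.
    for example_input, example_output in examples:
        blocks.append(f"{input_label}: {example_input}\n{output_label}: {example_output}")
    # The final block leaves the output empty for the model to complete.
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)


prompt = build_few_shot_prompt(
    "Answer the question.",
    [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")],
    "What is 3 + 3?",
)
print(prompt)
```

Keeping the labels and the number of examples as parameters makes it easy to vary them and check how much a model's score moves when only the prompt changes.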

In addition, HELM introduces a taxonomy to understand the evaluation of future LMs. For example, how many of the core scenarios and metrics have been used for an LM’s evaluation? What is missing, and what risks does this incur for future use of that LM?

The chart below, taken from the paper, shows the structure of the taxonomy, broken down into Scenarios and Metrics.

[Figure: the HELM taxonomy, broken down into Scenarios and Metrics]

Finally, by evaluating a large number of models, the paper validates some of what we know anecdotally about LM performance.

For example, instruction-tuned models tend to perform better than other model types. Likewise, ‘non-open’ models tend to outperform open-access ones. There were consistent performance disparities for different demographics across models, and all models showed significant sensitivity to prompts, particularly to the formatting of the prompt and to the choice and number of in-context examples.

The upshot

Hopefully this paper will lead to some much-needed standardization in LM assessment.

The paper is 90 pages long prior to references, and as such contains much more detail than we covered here.

If you are an LM developer, or in any way interested in how the technological impact of AI on society can be better evaluated, then we recommend taking a look. You can find the paper here.

Thanks for reading!

Feel free to get in touch if you have any questions: you can message us on socials or simply reply to this email.

You can also find previous issues on our blog, on twitter and here on LinkedIn.

The Gantry team
