The Standard for Large Language Models Will Be Raised by a New Benchmark
Greetings! I just wanted to share something new I learned over the weekend. There is a new standard in the works for large language models: the Beyond the Imitation Game benchmark (BIG-bench). As I understand it, this benchmark covers tasks that people excel at but that current state-of-the-art models fail at. It was developed by researchers from 132 institutions across the globe.
The approach and methodology of BIG-bench are distinctive: the authors chose more than 200 tasks based on ten criteria, including that they had to be "not solvable by memorizing the internet," understandable to humans, and not yet solvable by current language models. Many of them involve solving unusual puzzles.
There were two key findings from the challenges: first, model performance improves with scale, yet even the largest models still fall well short of human raters in aggregate; second, while performance on some tasks improves gradually as models grow, other tasks show sudden "breakthrough" jumps once a certain scale is reached.
Why do the researchers and other AI authorities consider these findings important? The designers of BIG-bench contend that benchmarks like SuperGLUE, SQuAD2.0, and GSM8K concentrate on specific skills (I must admit, all these benchmarks are super interesting!). However, the most recent language models exhibit unexpected abilities, such as solving straightforward math problems, after pretraining on massive datasets scraped from the internet. Thanks to BIG-bench's variety of few-shot tasks, researchers now have new tools to monitor these emerging abilities as models, data, and training approaches change.
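To make the few-shot idea concrete, here is a minimal sketch of how a BIG-bench-style task can be framed: a handful of solved examples are prepended to each query, and the model's answer is scored by exact match. The task data, the `build_few_shot_prompt` helper, and the `toy_model` stand-in are all illustrative assumptions, not the actual BIG-bench API.

```python
from typing import Callable, List, Tuple

def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Prepend k solved (question, answer) pairs to the query."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\nQ: {query}\nA:"

def exact_match_score(model: Callable[[str], str],
                      examples: List[Tuple[str, str]],
                      eval_set: List[Tuple[str, str]]) -> float:
    """Fraction of evaluation queries the model answers exactly right."""
    hits = 0
    for query, target in eval_set:
        prompt = build_few_shot_prompt(examples, query)
        if model(prompt).strip() == target:
            hits += 1
    return hits / len(eval_set)

# Toy stand-in "model": parses and answers the simple addition
# in the final question of the prompt (purely for demonstration).
def toy_model(prompt: str) -> str:
    last_q = prompt.rsplit("Q: ", 1)[1].split("\n")[0]
    a, b = last_q.rstrip(" =?").split(" + ")
    return str(int(a) + int(b))

shots = [("1 + 1 = ?", "2"), ("2 + 3 = ?", "5")]
evals = [("4 + 4 = ?", "8"), ("10 + 7 = ?", "17")]
print(exact_match_score(toy_model, shots, evals))  # → 1.0
```

In a real evaluation, `toy_model` would be replaced by a call to an actual language model, and the same harness could be rerun at different model scales to watch an ability emerge.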
In closing, the hope is that BIG-bench will encourage academics to create algorithms that support more complicated kinds of reasoning.