The Standard for Large Language AI Models Will Be Raised Via a New Benchmark

Greetings! I just wanted to share something new I learned over the weekend. There is a new standard in the works for large language AI models: the Beyond the Imitation Game benchmark (BIG-bench). As I understand it, this benchmark covers tasks that people excel at but that current state-of-the-art models still fail at. It was developed by researchers from 132 institutions across the globe.

The approach and methodology of BIG-bench are distinctive: the authors chose more than 200 challenges based on ten criteria, including that tasks had to be "not solvable by memorizing the internet," understandable to humans, and not solvable by current language models. Many involve unusual puzzles, such as finding the one chess move that will win a game, deducing a movie's title from a string of emojis, or participating in a fictitious courtroom trial.
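
To make the task format concrete, here is a rough sketch of what a single multiple-choice item might look like, written as a Python dict. The field names follow my reading of the JSON task schema in the google/BIG-bench repository (an "input" string plus per-choice "target_scores"); the emoji example itself is invented for illustration.

```python
# Sketch of one BIG-bench-style multiple-choice item (invented example).
# The real tasks live as JSON files in the google/BIG-bench repository;
# "input" holds the question and "target_scores" marks the correct choice.
emoji_movie_item = {
    "input": "What movie does this emoji describe? 🦁👑",
    "target_scores": {
        "The Lion King": 1,  # the correct answer scores 1
        "Madagascar": 0,
        "Jumanji": 0,
        "Zootopia": 0,
    },
}
```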

Two key findings emerged from the challenges:

  1. On every task, the best human outperformed every model, regardless of model size. However, on some tasks the best-performing model did beat the average human. For instance, on questions about Hindu mythology, the best model scored around 76 percent, the average human around 61 percent, and the best human a full 100 percent (random chance was 25 percent).
  2. Larger models typically outperformed smaller ones. For instance, BIG-G's average accuracy on three-shot, multiple-choice tasks rose from roughly 33 percent with a few million parameters to about 42 percent with over 100 billion parameters (see the sketch after this list for how such an evaluation works).
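
To clarify what "three-shot, multiple-choice" means in practice, here is a minimal sketch of such an evaluation loop. The scoring function is a hypothetical stand-in for whatever likelihood API a real model exposes; this is not the authors' actual harness.

```python
# Minimal sketch of a three-shot, multiple-choice evaluation.
# `score_completion` is hypothetical: swap in a real model's
# log-likelihood call to make this runnable end to end.

def score_completion(prompt: str, completion: str) -> float:
    """Return the model's log-likelihood of `completion` given `prompt`."""
    raise NotImplementedError  # stand-in for a real model API

def three_shot_predict(shots, question, choices):
    # Prepend three solved examples, then ask the new question.
    prompt = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in shots)
    prompt += f"Q: {question}\nA: "
    # The model's answer is the choice it scores highest.
    return max(choices, key=lambda c: score_completion(prompt, c))
```

With four answer choices, random guessing is right one time in four, which is where the 25 percent chance baseline mentioned above comes from.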

Why do the researchers and other AI authorities consider these findings important? The designers of BIG-bench contend that benchmarks like SuperGLUE, SQuAD2.0, and GSM8K each concentrate on a narrow set of skills (I must admit, all these benchmarks are super interesting!). However, the most recent language models exhibit unexpected abilities, such as solving straightforward math problems, after pretraining on massive datasets scraped from the internet. Thanks to BIG-bench's variety of few-shot tasks, researchers now have a tool to monitor these emerging abilities as models, data, and training approaches change.
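
As a toy illustration of that kind of monitoring, the snippet below prints a scaling curve: the same few-shot task evaluated at several model sizes. Only the 33 and 42 percent endpoints echo the figures above; the intermediate numbers are invented purely to show the shape of such a curve.

```python
# Illustrative scaling curve: accuracy on one few-shot task vs. model size.
# Only the 33% and 42% endpoints come from the figures above; the rest are
# made-up placeholders showing how ability might be tracked across scale.
param_counts = [1e6, 1e9, 1e10, 1e11]   # model sizes in parameters
accuracies = [0.33, 0.35, 0.38, 0.42]   # fraction of questions correct

for n, acc in zip(param_counts, accuracies):
    print(f"{n:>15,.0f} params -> {acc:.0%} three-shot accuracy")
```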

In closing, the hope is that, by posing problems that can't be solved through memorization of the internet alone, BIG-bench will encourage researchers to create algorithms that support complicated kinds of reasoning and generalization even with little training data. What are your thoughts?
