Evolving AGI benchmarks
The Turing test was long considered a solid benchmark for measuring AI capability, and it has since been well surpassed. Now there is a race both to create AGI and, at the same time, to measure whether a given system or model has actually reached AGI.
This question matters for all of humanity. The OpenAI saga can also be traced back to it: Microsoft's agreement with OpenAI excludes models that have reached AGI, so it was important for Microsoft to have Sam Altman at OpenAI, where he could maintain that the models they have created have not reached it. OpenAI's six-person board of directors will determine when the company has "attained AGI", a threshold that would, in theory, exclude Microsoft.
There are two new benchmarks I find fascinating: BASIS and GPQA.
BASIS
This benchmark was created by Mensa researcher and metaphysician Dr Jason Betts to provide a suite of test items aimed primarily at imminent artificial superintelligence (ASI), while also covering the lower ceiling of advanced AGI.
The BASIS project ensures that superintelligence can be properly assessed against very high human biological intelligence ceilings, and it removes workarounds like holding out a subset of catalogued data (Common Crawl, books, journals, Wikipedia) from models. Instead, the tests consist of new, unique, offline questions.
Every question is designed to have an answer that can be independently verified by at least one other human. Importantly, neither the question, nor the answer, nor even the combination of keywords appears on the web, so the material should never have been seen before.
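As a rough illustration of that "never seen before" requirement, here is a minimal sketch of the kind of contamination check an item writer might run against a reference corpus. The n-gram approach, the threshold, and the toy corpus are my own assumptions for illustration; they are not part of the actual BASIS protocol.

```python
# Hypothetical contamination check in the spirit of BASIS's requirement that
# questions and keyword combinations never appear on the web.
# The corpus, n-gram size, and threshold are illustrative assumptions.
import re

def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Lowercased word n-grams of a text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(question: str, corpus_docs: list[str],
                       n: int = 5, max_overlap: float = 0.1) -> bool:
    """Flag a candidate question if too many of its n-grams appear in any
    reference document (e.g. an index of previously published material)."""
    q_grams = ngrams(question, n)
    if not q_grams:
        return False
    for doc in corpus_docs:
        overlap = len(q_grams & ngrams(doc, n)) / len(q_grams)
        if overlap > max_overlap:
            return True
    return False

# Toy usage: a novel keyword combination should not be flagged.
reference = ["The Turing test was proposed by Alan Turing in 1950 as an imitation game."]
candidate = "If a xenon-cooled abacus encodes primes in base 7, what is its 12th output?"
print(looks_contaminated(candidate, reference))  # False
```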
GPQA
This is a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. The authors ensure that the questions are high-quality and extremely difficult: experts who hold or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators reach only 34% accuracy, despite spending over 30 minutes on average with unrestricted access to the web (i.e., the questions are "Google-proof").
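To make those accuracy figures concrete, here is a minimal sketch of how one might score a model on a GPQA-style multiple-choice set. The record fields and the ask_model placeholder are my own assumptions, not the official GPQA schema or evaluation harness.

```python
# Hypothetical scoring loop for a GPQA-style multiple-choice benchmark.
# The data structure and ask_model() are illustrative assumptions.
import random
from dataclasses import dataclass

@dataclass
class MCQ:
    question: str
    choices: list[str]   # four answer options
    answer_index: int    # index of the correct option

def ask_model(item: MCQ) -> int:
    """Placeholder model: guesses uniformly at random (~25% expected accuracy)."""
    return random.randrange(len(item.choices))

def accuracy(items: list[MCQ]) -> float:
    correct = sum(ask_model(item) == item.answer_index for item in items)
    return correct / len(items)

# Toy stand-in for the 448 GPQA questions.
items = [
    MCQ("Which particle mediates the electromagnetic force?",
        ["Gluon", "Photon", "W boson", "Graviton"], 1),
] * 100

acc = accuracy(items)
# Reference points from the GPQA paper: ~65% for PhD-level experts
# (74% after discounting clear mistakes) and ~34% for skilled non-experts
# with unrestricted web access.
print(f"model accuracy: {acc:.1%} vs expert 65% / non-expert 34%")
```

A real evaluation would replace ask_model with a call to the model under test and load the published question set, but the comparison against the expert and non-expert baselines works the same way.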