Evolving AGI benchmarks
From https://lifearchitect.ai/basis/

The Turing test was long considered a solid benchmark for measuring AI capability, and it has now been well surpassed. Today there is a race to create AGI and, at the same time, to measure whether a given system or model has actually reached it.

This question matters for all of humanity, and the recent saga at OpenAI can be traced back to it. Microsoft's agreement with OpenAI excludes any models that have reached AGI, so it was important for Microsoft to have Sam Altman at OpenAI so that he could maintain that the models they have created have not yet reached that threshold. OpenAI's six-person board of directors will determine when the company has "attained AGI", a threshold that would, in theory, exclude Microsoft from those models.

Now there are two new benchmarks that I find fascinating: BASIS and GPQA.

BASIS

This benchmark was created by Mensa researcher and metaphysician Dr Jason Betts as a suite of test items prioritizing imminent artificial superintelligence (ASI), while also covering the lower ceiling of advanced AGI.

The BASIS project ensures that superintelligence can be appropriately assessed against very high human biological intelligence ceilings, and removes workarounds such as holding out a subset of catalogued data (Common Crawl, books, journals, Wikipedia) from models. Instead, testing relies on new, unique, offline questions.

Every question is designed to have an answer that can be independently verified by at least one other human. Importantly, the question, the answer, and even the combination of keywords do not appear on the web and should never have been seen by a model before.
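
As a rough illustration of the kind of novelty check this implies (not the BASIS project's actual tooling), here is a minimal Python sketch that rejects a candidate question if any of its keyword n-grams can be found in a reference corpus; the corpus, the n-gram length, and the matching rule are all assumptions.

```python
# Hypothetical novelty check for candidate benchmark questions.
# This is NOT the BASIS project's actual tooling, only a sketch of the idea
# that a question's keyword combinations should not appear in any indexed
# corpus (Common Crawl, books, journals, Wikipedia, and so on).

def ngrams(tokens, n=3):
    """Yield every run of n consecutive words from a token list."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])

def is_novel(question: str, corpus_text: str, n: int = 3) -> bool:
    """Return True if no n-word phrase from the question appears in the corpus.

    `corpus_text` stands in for whatever searchable data is available
    (a Common Crawl slice, a web search API, a book dump), which is an
    assumption made for this sketch.
    """
    corpus = corpus_text.lower()
    tokens = question.lower().split()
    return not any(phrase in corpus for phrase in ngrams(tokens, n))

# Toy usage: a question whose phrasing already exists in the corpus is rejected.
corpus = "the mitochondria is the powerhouse of the cell"
print(is_novel("What is the powerhouse of the cell?", corpus))          # False
print(is_novel("Devise a puzzle never written down anywhere", corpus))  # True
```

In practice the "corpus" would be a web-scale index or search service rather than a string, and matching would be fuzzier, but the intent is the same: the question must be genuinely new.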

GPQA

This is a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. The authors ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators reach only 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof").
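
To make those accuracy figures concrete, here is a minimal sketch of how a multiple-choice benchmark like GPQA is typically scored; the item fields and the ask_model stub are assumptions for illustration, not the GPQA authors' actual evaluation code.

```python
# Minimal sketch of scoring a model on GPQA-style multiple-choice questions.
# The item format and the `ask_model` stub are assumptions for illustration,
# not the GPQA authors' evaluation harness.

import random

def ask_model(question: str, choices: list[str]) -> int:
    """Stand-in for a real model call; here it simply guesses an index."""
    return random.randrange(len(choices))

def accuracy(items: list[dict]) -> float:
    """Fraction of items where the predicted choice matches the answer index."""
    correct = 0
    for item in items:
        pred = ask_model(item["question"], item["choices"])
        correct += int(pred == item["answer_index"])
    return correct / len(items)

# Toy item in a GPQA-like shape: one correct answer among four options.
items = [
    {
        "question": "Which particle mediates the electromagnetic force?",
        "choices": ["Gluon", "Photon", "W boson", "Higgs boson"],
        "answer_index": 1,
    },
]
print(f"Accuracy: {accuracy(items):.0%}")
```

Random guessing over four options lands near 25%, which puts the reported 34% non-expert and 65% expert accuracies in perspective.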
