Evolving AGI benchmarks
The Turing test was long considered a solid benchmark for measuring AI capability, and it has since been well surpassed. Now there is a race both to create AGI and, at the same time, to measure whether a given system or model has actually reached AGI.
This question matters for all of humanity. The OpenAI saga can also be traced back to it: Microsoft's agreement with OpenAI excludes models that have reached AGI, so it was important for Microsoft to have Sam Altman at OpenAI, where he could maintain that the models they have created have not reached it. OpenAI's six-person board of directors will determine when the company has "attained AGI", a threshold that would, in theory, exclude Microsoft.
There are two new benchmarks I find fascinating: BASIS and GPQA.
BASIS
This benchmark was created by Mensa researcher and metaphysician Dr Jason Betts to provide a suite of test items aimed primarily at imminent artificial superintelligence (ASI), while also covering the lower ceiling of advanced AGI.
The BASIS project ensures that superintelligence can be properly assessed against very high human biological intelligence ceilings, and it removes workarounds like holding out a subset of catalogued data (Common Crawl, books, journals, Wikipedia) from models. Instead, the tests consist of new, unique, offline questions.
Every question is designed to have an answer that can be independently verified by at least one other human. Importantly, neither the question, nor the answer, nor even the combination of keywords appears on the web, so the material should never have been seen before.
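As a rough illustration of that "never seen before" requirement, here is a minimal sketch of the kind of contamination check an item writer might run against a reference corpus. The n-gram approach, the threshold, and the toy corpus are my own assumptions for illustration; they are not part of the actual BASIS protocol.

```python
# Hypothetical contamination check in the spirit of BASIS's requirement that
# questions and keyword combinations never appear on the web.
# The corpus, n-gram size, and threshold are illustrative assumptions.
import re

def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Lowercased word n-grams of a text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(question: str, corpus_docs: list[str],
                       n: int = 5, max_overlap: float = 0.1) -> bool:
    """Flag a candidate question if too many of its n-grams appear in any
    reference document (e.g. an index of previously published material)."""
    q_grams = ngrams(question, n)
    if not q_grams:
        return False
    for doc in corpus_docs:
        overlap = len(q_grams & ngrams(doc, n)) / len(q_grams)
        if overlap > max_overlap:
            return True
    return False

# Toy usage: a novel keyword combination should not be flagged.
reference = ["The Turing test was proposed by Alan Turing in 1950 as an imitation game."]
candidate = "If a xenon-cooled abacus encodes primes in base 7, what is its 12th output?"
print(looks_contaminated(candidate, reference))  # False
```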
GPQA
This is a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. The authors ensure that the questions are high-quality and extremely difficult: experts who hold or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators reach only 34% accuracy, despite spending over 30 minutes on average with unrestricted access to the web (i.e., the questions are "Google-proof").
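To make those accuracy figures concrete, here is a minimal sketch of how one might score a model on a GPQA-style multiple-choice set. The record fields and the ask_model placeholder are my own assumptions, not the official GPQA schema or evaluation harness.

```python
# Hypothetical scoring loop for a GPQA-style multiple-choice benchmark.
# The data structure and ask_model() are illustrative assumptions.
import random
from dataclasses import dataclass

@dataclass
class MCQ:
    question: str
    choices: list[str]   # four answer options
    answer_index: int    # index of the correct option

def ask_model(item: MCQ) -> int:
    """Placeholder model: guesses uniformly at random (~25% expected accuracy)."""
    return random.randrange(len(item.choices))

def accuracy(items: list[MCQ]) -> float:
    correct = sum(ask_model(item) == item.answer_index for item in items)
    return correct / len(items)

# Toy stand-in for the 448 GPQA questions.
items = [
    MCQ("Which particle mediates the electromagnetic force?",
        ["Gluon", "Photon", "W boson", "Graviton"], 1),
] * 100

acc = accuracy(items)
# Reference points from the GPQA paper: ~65% for PhD-level experts
# (74% after discounting clear mistakes) and ~34% for skilled non-experts
# with unrestricted web access.
print(f"model accuracy: {acc:.1%} vs expert 65% / non-expert 34%")
```

A real evaluation would replace ask_model with a call to the model under test and load the published question set, but the comparison against the expert and non-expert baselines works the same way.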