9 Areas Where Humans Still Outperform AI
... and a big update from Mistral
AI hasn’t fully taken over yet. Humans still excel in certain areas, of course.
These 9 benchmarks outline vital skills and evaluate how AI measures against humans.
(Don’t miss out on the quick AI highlights at the bottom.)
✅ Before we start: we launched a premium version of the newsletter. Subscribing gives you full access to all content, exclusive demos, and an ad-free experience. I plan to host AMAs and develop more subscriber-requested demos.
9 areas where humans still have an edge compared to AI
It might feel as though humans are losing their edge against increasingly capable AI systems. What value can humans still capture compared with AI systems?
Turns out there are several fields.
In my research, I stumbled upon these 9 datasets/evaluations that show that humans still have an incredible edge against AI systems, at least for some time.
What is WorkArena++?
These are 682 tasks that simulate workflows typical for knowledge workers, testing planning, problem-solving, reasoning, info retrieval, and context understanding. Humans outperform AI thanks to more robust reasoning and contextual grasp.
To get truly powerful AI agents, we want them to excel on this benchmark. I expect great progress here in 2025, to the point where the average human might no longer have a competitive edge.
What is Simple-bench?
Multiple-choice tasks (200+ questions) test spatio-temporal reasoning and social intelligence. High schoolers currently outperform state-of-the-art models.
Humans will maintain a competitive edge for the foreseeable future.
What is ARC-AGI?
Assesses AI's ability to learn new skills and solve open-ended problems via patterns and abstract reasoning. Humans excel due to better generalization and abstract thinking.
A simple concept—covered in a past episode. Humans will likely outperform computers for another 2–3 years.
What is MiniWob?
Web-based tasks test reinforcement learning agents in navigation and interaction. Humans currently lead due to better understanding and adaptability. However, with AI gaining access to web pages via visual, textual, and API channels, the margin is narrowing quickly. By 2025, AI will match or surpass humans in these tasks, and AI is already taking over here.
What is WebArena?
Evaluates complex web tasks like info retrieval and form filling. The gap between average human capabilities and AI is shrinking rapidly, similar to MiniWob. While opportunities remain for now, AI will soon close the gap entirely.
What is Putnam Bench?
Tests theorem-proving algorithms with problems from the Putnam Mathematical Competition. While the average human doesn’t have an edge, human experts (PhDs) excel. Interestingly, the AI-human baseline is often mislabeled. Starting next year, AI collaborators will reach parity with human PhDs, significantly accelerating scientific progress.
What is NOCHA?
Evaluates object classification and hierarchical annotation. Humans still outperform AI due to sharper visual perception and contextual understanding. Visual AI has evolved gradually over decades—from early convolutional neural networks to current LLM integrations. For at least the next year, AI won’t surpass the average human in these tasks.
What is GAIA?
Tests generalization across tasks and environments, especially for Internet research. Humans currently excel with natural adaptability. However, AI agents are likely to surpass the average human within 2–3 years. Progress depends not only on smarter AI but also on larger context windows, better comprehension, and improvements in model architecture.
What is Lab-Bench?
Focuses on biology-related lab tasks like experimental design and data analysis. Humans excel with expertise and intuition, but the role is shifting. In the coming years, scientists—biologists, chemists, and physicists—will evolve into research project managers, supported by teams of AI agents handling routine tasks.
Updates from Mistral - Pixtral Large (open source) & Le Chat
Pixtral Large: 124B Parameters of Power
This model is crushing benchmarks.
Want it? It’s free to download on Hugging Face.
Le Chat Also Just Leveled Up
It now does:
And yes, it’s still free. → Le Chat
NVIDIA ALCHEMI Accelerates Sustainable Materials Discovery for EV Batteries and Solar Panels
That’s a wrap! I hope you enjoyed it.
Martin