9 Areas Where Humans Still Outperform AI

... and a big update from Mistral


AI hasn’t fully taken over yet. Humans still excel in certain areas, of course.

These 9 benchmarks outline vital skills and evaluate how AI measures up against humans.

(Don’t miss out on the quick AI highlights at the bottom.)

✅ Before we start: we launched a premium version of the newsletter. Subscribing gives you 100% access to all content, exclusive demos, and an ad-free experience. I plan to host AMAs and develop more subscriber-requested demos.

Try Premium for up to 14 days

9 areas where humans still have an edge compared to AI

It might feel like humans are losing their edge against increasingly capable AI systems. What value can humans still capture compared with AI systems?

Turns out there are several fields.

In my research, I stumbled upon these 9 datasets/evaluations that show that humans still have an incredible edge against AI systems, at least for some time.

What is WorkArena++?

These are 682 tasks that simulate workflows typical for knowledge workers, testing planning, problem-solving, reasoning, info retrieval, and context understanding. Humans outperform AI thanks to more robust reasoning and contextual grasp.

To get really powerful AI agents, we want them to excel on this benchmark. In 2025, we will see great progress here, to the point where the average human might no longer have a competitive edge.

What is Simple-bench?

Multiple-choice tasks (200+ questions) test spatio-temporal reasoning and social intelligence. High schoolers currently outperform state-of-the-art models.

Humans will maintain a competitive edge for the foreseeable future.

What is ARC-AGI?

Assesses AI's ability to learn new skills and solve open-ended problems via patterns and abstract reasoning. Humans excel due to better generalization and abstract thinking.

A simple concept—covered in a past episode. Humans will likely outperform computers for another 2–3 years.

What is MiniWob?

Web-based tasks test reinforcement learning agents in navigation and interaction. Humans currently lead due to better understanding and adaptability. However, with AI gaining access to web pages via visual, textual, and API channels, the margin is narrowing quickly. By 2025, AI will match or surpass humans in these tasks, and AI is already taking over here.

What is WebArena?

Evaluates complex web tasks like info retrieval and form filling. The gap between average human capabilities and AI is shrinking rapidly, similar to MiniWob. While opportunities remain for now, AI will soon close the gap entirely.

What is Putnam Bench?

Tests theorem-proving algorithms with problems from the Putnam Mathematical Competition. While the average human doesn’t have an edge, human experts (PhDs) excel. Interestingly, the AI-human baseline is often mislabeled. Starting next year, AI collaborators will reach parity with human PhDs, significantly accelerating scientific progress.

What is NOCHA?

Evaluates object classification and hierarchical annotation. Humans still outperform AI due to sharper visual perception and contextual understanding. Visual AI has evolved gradually over decades—from early convolutional neural networks to current LLM integrations. For at least the next year, AI won’t surpass the average human in these tasks.

What is GAIA?

Tests generalization across tasks and environments, especially for Internet research. Humans currently excel with natural adaptability. However, AI agents are likely to surpass the average human within 2–3 years. Progress depends not only on smarter AI but also on larger context windows, better comprehension, and improvements in model architecture.

What is Lab-Bench?

Focuses on biology-related lab tasks like experimental design and data analysis. Humans excel with expertise and intuition, but the role is shifting. In the coming years, scientists—biologists, chemists, and physicists—will evolve into research project managers, supported by teams of AI agents handling routine tasks.

Updates from Mistral - Pixtral Large (open source) & Le Chat

Pixtral Large: 124B Parameters of Power

This model is crushing benchmarks.

  • Top scores on MathVista, DocVQA, and VQAv2
  • Maintains the strong text skills of Mistral Large 2
  • Built with a 123B decoder + 1B vision encoder
  • 128K token limit for long documents

Want it? It’s free to download on Hugging Face.

Le Chat Also Just Leveled Up

It now does:

  • Web search with sources cited for fact-checking
  • Canvas for brainstorming: Edit, export, create seamlessly
  • Vision upgrades: Reads images & documents
  • Flux Pro for stunning image generation
  • Speculative editing: Predicts & refines text faster than you

And yes, it’s still free. → Le Chat

NVIDIA ALCHEMI Accelerates Sustainable Materials Discovery for EV Batteries and Solar Panels

-> Read the rest here <-

That’s a wrap! I hope you enjoyed it.

Martin
