Comparing AI vs. Human Performance in Technical Tasks 🧮

What We're Showing
AI systems' performance relative to human baselines across eight AI benchmarks, covering:
- Image classification
- Visual reasoning
- Medium-level reading comprehension
- English language understanding
- Multitask language understanding
- Competition-level mathematics
- PhD-level science questions
- Multimodal understanding and reasoning
The data comes from Stanford University's 2025 AI Index Report. An AI benchmark is a standardized test used to evaluate the performance and capabilities of AI systems on specific tasks.
AI Models Are Surpassing Human Performance
As of the 2025 AI Index Report, AI models have surpassed the human baseline on all but one of these benchmarks.
The only task where AI systems still haven't caught up to humans is multimodal understanding and reasoning, which involves processing and reasoning across multiple formats and disciplines, such as images, charts, and diagrams.
However, the gap is closing quickly.
In 2024, OpenAI's o1 scored 78.2% on MMMU, a benchmark that evaluates models on multi-discipline tasks demanding college-level subject knowledge, just 4.4 percentage points below the human benchmark of 82.6%.
This was a major jump from the end of 2023, when Google Gemini scored just 59.4%, highlighting how quickly AI performance on these technical tasks is improving.
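To make these gaps explicit, here is a minimal Python sketch that computes the percentage-point differences using only the MMMU figures quoted above; the variable and label names are ours, not from the report.

```python
# MMMU scores cited above (all values in percentage points)
HUMAN_BASELINE = 82.6  # human benchmark on MMMU

scores = {
    "Google Gemini (late 2023)": 59.4,
    "OpenAI o1 (2024)": 78.2,
}

# Gap between each model and the human baseline
for model, score in scores.items():
    gap = HUMAN_BASELINE - score
    print(f"{model}: {score:.1f}% ({gap:.1f} points below the human baseline)")

# Year-over-year improvement on MMMU
improvement = scores["OpenAI o1 (2024)"] - scores["Google Gemini (late 2023)"]
print(f"Improvement from late 2023 to 2024: {improvement:.1f} percentage points")
```

Running this reproduces the 4.4-point gap cited above and shows an 18.8-point jump on MMMU in roughly a year.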