What is the "Humanity's Last Exam" AI Benchmark?

As traditional AI benchmarks like MMLU reach saturation, the Center for AI Safety and Scale AI have launched "Humanity's Last Exam" (HLE). This ultra-difficult benchmark features 3,000 expert-level questions designed to test true reasoning and expert knowledge, serving as a final hurdle for academic-style AI evaluation.

Dec 21, 2025

The Evolution of AI Evaluation: Why a "Last" Exam?

For years, the AI community relied on benchmarks like MMLU (Massive Multitask Language Understanding) to gauge the "intelligence" of large language models. However, by late 2024, the industry hit a wall: top-tier models were scoring over 90%, making it nearly impossible to distinguish between a "smart" model and one that had simply memorized the test data. Enter Humanity's Last Exam (HLE), a joint initiative by the Center for AI Safety and Scale AI.

The name is intentionally provocative. It suggests that once an AI can master this particular set of challenges, we may have reached the end of what traditional, closed-ended academic testing can reveal about machine intelligence. Unlike its predecessors, HLE is specifically engineered to be "unsaturable" by current standards, targeting the narrow gap between a helpful assistant and a world-class human expert.

What Makes HLE Different?

Humanity's Last Exam isn't just another trivia contest. It is a curated collection of 3,000 graduate-level questions spanning over 100 academic disciplines, from advanced string theory and organic chemistry to medieval literature and complex legal reasoning. What sets it apart is a rigorous multi-stage vetting process designed to ensure that the questions cannot be solved through simple pattern matching or internet retrieval.

Key features of the benchmark include:

  • Expert-Level Difficulty: Questions were sourced from nearly 1,000 subject matter experts across 500 institutions. To even be considered, a question had to first "stump" current frontier models like GPT-4o or Claude 3.5.
  • Multimodality: Roughly 14% of the exam requires the AI to interpret complex diagrams, charts, or scientific figures, moving beyond pure text-based reasoning.
  • Strict Scoring: There is no partial credit. Most questions use short-answer, exact-match formats, meaning the AI either captures the expert-level nuance or it fails (a minimal grading sketch follows this list).
  • Private "Holdout" Set: To prevent models from being "trained on the test," a significant portion of the questions is kept private, ensuring that high scores reflect genuine reasoning rather than memorization.
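Since the exam itself is partly private, the code below is only a minimal sketch of the strict, binary scoring described above, using hypothetical helper names and made-up answer pairs; the official HLE pipeline reportedly relies on an automated judge for free-form answers, so treat this as an illustration of "no partial credit," not the real implementation.

```python
# Illustrative sketch of strict exact-match grading -- NOT the official HLE pipeline.
# All function names and answer pairs here are hypothetical.

def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse whitespace so trivial formatting
    differences don't change the binary match decision."""
    return " ".join(answer.strip().lower().split())

def grade_exact_match(model_answer: str, reference_answer: str) -> int:
    """Return 1 for an exact (normalized) match, 0 otherwise.
    There is no partial credit: a near-miss scores the same as a blank."""
    return int(normalize(model_answer) == normalize(reference_answer))

def accuracy(submissions: list[tuple[str, str]]) -> float:
    """Benchmark accuracy: the fraction of questions scored 1."""
    scores = [grade_exact_match(pred, ref) for pred, ref in submissions]
    return sum(scores) / len(scores) if scores else 0.0

if __name__ == "__main__":
    # (model answer, reference answer) pairs -- purely invented examples.
    submissions = [
        ("1.2933 MeV/c^2", "1.2933 MeV/c^2"),        # exact match -> 1
        ("1.29 MeV/c^2", "1.2933 MeV/c^2"),          # close but wrong -> 0
        ("Decretum Gratiani", "decretum gratiani"),  # formatting-insensitive -> 1
    ]
    print(f"Accuracy: {accuracy(submissions):.2%}")
```

The same pass/fail check mirrors the vetting step described above: a candidate question only entered the pool if frontier models failed it.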

Current Standings: A Reality Check for AI

The initial results from HLE have been a humbling experience for the AI industry. While models like Gemini 3 Pro and GPT-5 have shown progress, they are still a long way from human parity. As of late 2025, the highest-performing models are still struggling to cross the 50% threshold, while human experts in their respective fields typically score near 90%.

Model          HLE Accuracy (%)   Key Strength
Grok-4 Heavy   50.7               Reasoning with tools
Gemini 3 Pro   45.8               Multimodal integration
GPT-5          35.2               General academic knowledge

The gap is even more pronounced when looking at "calibration error." Many models exhibit high confidence even when their answers are completely wrong. This "overconfidence" is a major hurdle for researchers, as it indicates that while models are getting better at guessing, they don't yet "know what they don't know." Detailed leaderboards and technical methodology can be found on the official Scale AI HLE Leaderboard.
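To make "calibration error" concrete, the sketch below computes a simple binned expected calibration error (ECE) over hypothetical (confidence, correctness) pairs; this is one standard way to measure the gap between stated confidence and actual accuracy, not necessarily the exact metric reported on the HLE leaderboard.

```python
# Minimal sketch of expected calibration error (ECE) over confidence bins.
# The data and bin count are hypothetical; HLE's reported metric may differ.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin answers by stated confidence, then average the gap between each
    bin's mean confidence and its empirical accuracy, weighted by bin size."""
    assert len(confidences) == len(correct)
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # The top bin also includes confidence == 1.0.
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not idx:
            continue
        bin_conf = sum(confidences[i] for i in idx) / len(idx)
        bin_acc = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / total) * abs(bin_conf - bin_acc)
    return ece

if __name__ == "__main__":
    # An "overconfident" model: confidence near 1.0, but most answers wrong.
    confidences = [0.95, 0.90, 0.99, 0.85, 0.92, 0.88]
    correct = [0, 1, 0, 0, 0, 1]
    print(f"ECE: {expected_calibration_error(confidences, correct):.3f}")
```

A well-calibrated model would report low confidence on the questions it gets wrong, driving this number toward zero; the overconfidence seen on HLE shows up as a large gap.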

Sample of Humanity's Last Exam Questions

Example 1: Advanced Quantum Chromodynamics (STEM)

This question is designed to fail models that rely on retrieving general textbook definitions. It requires performing complex, multi-step calculations involving obscure theoretical constraints.

Question: In the context of non-perturbative quantum chromodynamics (QCD), calculate the precise mass difference between the neutron and the proton to four decimal places in MeV/c², explicitly accounting for both electromagnetic differences and up/down quark mass differences, utilizing lattice QCD simulation constraints active as of Q4 2024. Show the breakdown of the electromagnetic versus quark mass contributions.

  • Why it's hard: An AI cannot simply look up this answer, because the precise constraints shift over time as new simulations are published. The model must understand the underlying physics, select the correct contemporary data, and perform high-precision arithmetic without hallucinating intermediate steps (see the rough decomposition below).
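For orientation only, the answer has the structure sketched below; the experimental mass difference is well established, but the split between the two contributions quoted here is approximate, drawn from older published lattice determinations rather than the Q4 2024 constraints the question demands.

```latex
% Structure of the neutron-proton mass splitting (illustrative, approximate values).
\begin{align*}
  M_n - M_p &= \Delta M_{\mathrm{QCD}} + \Delta M_{\mathrm{QED}}
             \approx 1.2933\ \mathrm{MeV}/c^{2}\quad\text{(experiment)},\\
  \Delta M_{\mathrm{QCD}} &\approx +2.5\ \mathrm{MeV}/c^{2}\ \ (\text{from } m_d > m_u),\qquad
  \Delta M_{\mathrm{QED}} \approx -1.0\ \mathrm{MeV}/c^{2}\ \ (\text{electromagnetic}).
\end{align*}
```

Because the two contributions partially cancel and are each comparable to or larger than the final answer, a single hallucinated intermediate value destroys the required four-decimal-place precision.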

Example 2: Comparative Medieval Legal History (Humanities)

HLE tests deep synthesis in the humanities just as rigorously as STEM. This question requires connecting obscure 12th-century canon law to later developments in English common law.

Question: Analyze the specific influence of the 12th-century canon law treatise Decretum Gratiani on the formation of the concept of 'equity' in English Chancery courts during the subsequent 14th century. Specifically, identify two distinct procedural mechanisms adopted by Chancery chancellors that directly mirror Gratian's dialectical approach to reconciling contradictory canons (the sic et non method), citing specific distinctiones where applicable.

  • Why it's hard: This requires far more than summarizing a Wikipedia article about the Magna Carta. The AI must possess deep, specialized knowledge of two distinct legal systems across different centuries and synthesize how specific, obscure procedural mechanisms migrated from one to the other.

Why This Matters for the Future

If HLE is indeed the "last exam," what comes next? Researchers argue that once AI masters these closed-ended problems, the focus must shift to agentic intelligence—the ability to perform long-horizon tasks, conduct original research, and interact with the physical world. HLE serves as the final gatekeeper of academic knowledge, ensuring that before we grant AI systems more autonomy, they have at least mastered the cumulative knowledge of human civilization.

As we move into 2026, the race to "pass" Humanity's Last Exam will likely define the next generation of model training. It isn't just about scoring points; it's about proving that AI can finally think like the experts who created it.