Humanity's Last Exam Is a Total Joke and Should Be Scrapped in 2026
Humanity's Last Exam (HLE), once hailed as the ultimate test of artificial general intelligence, is facing a massive backlash in 2026. Critics and researchers argue that the benchmark is riddled with factual errors, prioritizes obscure trivia over actual reasoning, and has become a "gotcha" game that fails to measure real-world AI utility.
The Fall of the Ultimate AI Benchmark
In the high-stakes world of artificial intelligence, benchmarks are the yardsticks by which we measure our progress toward AGI. For the past year, one name has reigned supreme: Humanity's Last Exam (HLE). Created by the Center for AI Safety and Scale AI, it was designed to be the "final" academic test—a collection of 2,500 questions so difficult that only PhD-level experts could solve them. However, as we move through 2026, the consensus is shifting rapidly. What was supposed to be a groundbreaking evaluation tool is now being called a "total joke" by leading researchers and industry insiders.
The problem isn't that the test is too hard; it's that it is fundamentally broken. As frontier models like Gemini 3 and Grok-4 begin to hit the 50% accuracy mark, the flaws in HLE’s design have become impossible to ignore. From factual inaccuracies to a bizarre obsession with "gotcha" trivia, the benchmark is increasingly seen as a distraction from the real goal: building AI that can actually do useful work.
The FutureHouse Investigation: A 30% Error Rate
The most damning evidence against HLE came from an independent investigation by FutureHouse, a non-profit AI research lab. Their deep dive into the benchmark's biology and chemistry subsets revealed a shocking reality: nearly 30% of the "correct" answers in the dataset were either directly contradicted by peer-reviewed literature or were so ambiguous they were functionally useless. For a test marketed as the "frontier of human knowledge," an error rate this high is catastrophic.
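To make that arithmetic concrete, here is a minimal Python sketch of how such an audit tally works: each audited item gets a reviewer verdict, and the error rate is simply the share of contradicted or ambiguous items. The question IDs, verdict categories, and labels below are purely illustrative, not FutureHouse's actual data or methodology.

```python
# Minimal sketch of tallying a benchmark audit. Every label below is a
# hypothetical placeholder; in a real audit these verdicts would come from
# expert reviewers checking each answer key against peer-reviewed literature.
from collections import Counter

audit_labels = [
    # (question_id, verdict) -- verdict categories are illustrative
    ("bio_001", "supported"),      # answer key matches the literature
    ("bio_002", "contradicted"),   # literature directly contradicts the key
    ("chem_001", "ambiguous"),     # question too underspecified to grade
    ("chem_002", "supported"),
]

counts = Counter(verdict for _, verdict in audit_labels)
flawed = counts["contradicted"] + counts["ambiguous"]
error_rate = flawed / len(audit_labels)
print(f"Flawed items: {flawed}/{len(audit_labels)} ({error_rate:.0%})")
```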
The root of this failure lies in the benchmark's creation protocol. To ensure the test remained "stump-proof," questions were only included if leading AI models initially failed them. This created a perverse incentive for contributors to submit convoluted, obscure, or even incorrect questions just to win prize money. Reviewers were reportedly given only five minutes per question to verify rationales—a timeframe that is laughably insufficient for verifying complex, graduate-level scientific claims. As a result, HLE didn't become a test of intelligence; it became a collection of "scientific hallucinations" enshrined as truth.
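A rough sketch of that admission filter shows why it selects for broken questions as readily as hard ones: the gate only checks that the models miss the answer key, never that the key itself is correct. The function and the canned model answers below are assumptions for illustration, not the actual HLE submission pipeline.

```python
# Sketch of a "stump-proof" admission filter as described above. The model
# answers here are canned placeholders; in the real protocol they would come
# from live frontier-model queries.
def keep_question(model_answers: list[str], answer_key: str) -> bool:
    # A submission is admitted only if every model misses the answer key.
    # The flaw: a wrong or ambiguous answer key also makes every model
    # "fail", so a broken question passes this filter as easily as a hard one.
    return all(ans.strip().lower() != answer_key.strip().lower()
               for ans in model_answers)

# Illustration: even if the answer key were itself incorrect, the question
# would still be accepted, because no model's answer matches it.
print(keep_question(["xenon", "xenon", "radon"], answer_key="krypton"))  # True
```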
Obscure Trivia vs. Real-World Reasoning
Beyond the factual errors, the pedagogical philosophy of HLE is under fire. Critics argue that the exam focuses on "closed-ended academic trivia" rather than the structured reasoning required for scientific research. In 2026, we don't need an AI that knows the rarest noble gas by terrestrial fraction in 1954; we need an AI that can design a better battery or synthesize a new antibiotic.
Former OpenAI researcher Andrej Karpathy has noted that these academic benchmarks suffer from a new version of Moravec's paradox: AI systems can now solve complex rule-based math problems yet struggle with simple "intern-level" tasks that require long-horizon planning and common sense. By optimizing for HLE, the industry is rewarding memorization rather than utility. According to the official HLE project site, the benchmark organizers have acknowledged these issues and are attempting to move toward a "rolling" version of the exam, but many say the damage to the test's reputation is already done.
Moving Toward "Gold Standard" Evaluations
The backlash has sparked a movement toward more rigorous, literature-grounded assessments. New subsets like the "Bio/Chem Gold" set are being developed to replace the tainted questions of HLE. These new standards require every answer-reasoning pair to be validated by both multiple human experts and "Crows"—advanced AI verification agents—ensuring that the ground truth is actually true.
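What such a dual-validation gate could look like is sketched below: an item enters the gold set only if enough independent human experts and at least one AI verifier all confirm the answer-reasoning pair. The field names, the two-expert threshold, and the acceptance rule are assumptions for illustration, not the actual Bio/Chem Gold protocol.

```python
# Hedged sketch of a dual-validation gate. The data model and thresholds are
# illustrative assumptions, not the published "Bio/Chem Gold" specification.
from dataclasses import dataclass

@dataclass
class Review:
    reviewer: str       # human expert ID or AI verifier name
    is_ai_agent: bool   # True for an automated verifier ("Crow"-style agent)
    verdict: bool       # True if the answer-reasoning pair checks out

def admit_to_gold_set(reviews: list[Review], min_humans: int = 2) -> bool:
    human_approvals = sum(r.verdict for r in reviews if not r.is_ai_agent)
    ai_approvals = sum(r.verdict for r in reviews if r.is_ai_agent)
    ai_rejections = any(not r.verdict for r in reviews if r.is_ai_agent)
    # Require agreement from both sides: enough independent human experts,
    # at least one AI verification pass, and no AI rejection.
    return human_approvals >= min_humans and ai_approvals >= 1 and not ai_rejections

reviews = [
    Review("expert_a", is_ai_agent=False, verdict=True),
    Review("expert_b", is_ai_agent=False, verdict=True),
    Review("crow_verifier", is_ai_agent=True, verdict=True),
]
print(admit_to_gold_set(reviews))  # True
```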
As noted in recent PromptLayer research, the distance between scoring 25% on a trivia test and working as a practicing researcher is a "vast gulf." If we continue to use HLE as our primary metric for AGI, we risk building models that are "confident idiots": systems that can recite obscure facts but lack the judgment to apply them in a lab or a courtroom.
Conclusion: Time to Scrap the Academic Hype
Humanity's Last Exam was a bold experiment, but in 2026, it has become clear that it is a failed one. By prioritizing difficulty over accuracy and trivia over reasoning, HLE has become a target for "benchmark hacking" rather than a true measure of progress. It is time to scrap the academic "gotcha" games and move toward evaluations that measure an AI's ability to solve complex, unstructured, and—most importantly—correct problems. If we want to find the "last exam" for humanity, we shouldn't be looking in a multiple-choice booklet; we should be looking at the world's unsolved scientific mysteries.

