From Pixels to Performance: People are Now Using Super Mario to Benchmark AI

Super Mario and machine learning are the unlikely duo as people now use the popular game to benchmark artificial intelligence.

Mar 4, 2025

Pokémon has long been considered a tough benchmark for AI; however, a group of researchers argues that Super Mario Bros. is even tougher.

Hao AI Lab, a research organization at the University of California San Diego, conducted the test. On Friday, the research team threw AI models into live Super Mario Bros. games; Anthropic’s Claude 3.7 performed best, followed by Claude 3.5, while Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o struggled.

Note that it wasn’t quite the same version of Super Mario Bros. as the original 1985 release. Instead, the game ran in an emulator and was integrated with a framework, GamingAgent, that gives the AIs control over Mario.

Unique Results 

GamingAgent, developed in-house by Hao, fed each AI in-game screenshots along with basic instructions such as “If an obstacle or enemy is near, move/jump left to dodge.” The AI then generated inputs in the form of Python code to control Mario. Still, Hao stated that the game forced each model to “learn” to plan complex maneuvers and develop gameplay strategies.
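To make that loop concrete, here is a minimal sketch of what a GamingAgent-style decision step might look like. All names and structure here are assumptions for illustration, not Hao AI Lab’s actual implementation; the model’s reasoning is replaced with a hard-coded rule mirroring the instruction quoted above.

```python
# Hypothetical sketch of a GamingAgent-style control step.
# Names (Observation, decide_action, press) are illustrative assumptions,
# not the framework's real API.
from dataclasses import dataclass


@dataclass
class Observation:
    """Minimal stand-in for what the AI extracts from a screenshot."""
    obstacle_near: bool
    enemy_near: bool


def decide_action(obs: Observation) -> str:
    """Return a snippet of Python-like input code for Mario, following the
    kind of rule the article quotes: 'If an obstacle or enemy is near,
    move/jump left to dodge'."""
    if obs.obstacle_near or obs.enemy_near:
        return "press('left'); press('a')"  # dodge by jumping left
    return "press('right')"                 # otherwise keep running right


print(decide_action(Observation(obstacle_near=True, enemy_near=False)))
# → press('left'); press('a')
```

In the real setup, the rule-based `decide_action` would be replaced by a call to the language model, which returns the control code itself.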

Interestingly, the lab compared results from different AI models and found that reasoning models like OpenAI’s o1, which “think” through problems step by step to arrive at solutions, performed worse than “non-reasoning” models, despite being generally stronger on most benchmarks.

One primary reason why reasoning models have trouble playing real-time games like this is that they take a while — seconds, usually — to decide on actions, according to the researchers. In Super Mario Bros., timing is everything. A second can mean the difference between a jump safely cleared and a plummet to your death.

Tech Tradition 

Games have been used to benchmark AI for decades, but some experts have questioned the wisdom of drawing connections between AI’s gaming skills and technological advancement. Compared to the real world, games tend to be abstract and relatively simple, and they provide a theoretically infinite amount of data with which to train AI.

The recent flashy gaming benchmarks are in line with what Andrej Karpathy, a research scientist and founding member at OpenAI, called an “evaluation crisis.”

“I don’t really know what [AI] metrics to look at right now,” he wrote in a post on X. “TLDR my reaction is I don’t really know how good these models are right now.”