Debates over AI benchmarking have reached Pokémon

Not even Pokémon is safe from AI benchmarking controversy.

Last week, a post on X went viral, claiming that Google’s latest Gemini model surpassed Anthropic’s flagship Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavender Town in a developer’s Twitch stream; Claude was stuck at Mount Moon as of late February.

But what the post failed to mention is that Gemini had an advantage.

As users on Reddit pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify “tiles” in the game like cuttable trees. This reduces the need for Gemini to analyze screenshots before it makes gameplay decisions.

Now, Pokémon is a semi-serious AI benchmark at best — few would argue it’s a very informative test of a model’s capabilities. But it is an instructive example of how different implementations of a benchmark can influence the results.

For example, Anthropic reported two scores for its recent Claude 3.7 Sonnet model on SWE-bench Verified, a benchmark designed to evaluate a model’s coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a “custom scaffold” that Anthropic developed.

More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.

Given that AI benchmarks, Pokémon included, are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. In other words, comparing models as they’re released is unlikely to get any easier.

This article originally appeared on TechCrunch at https://techcrunch.com/2025/04/14/debates-over-ai-benchmarking-have-reached-pokemon/
