Debates over AI benchmarking have reached Pokémon

By MNK News, April 15, 2025


Not even Pokémon is safe from AI benchmarking controversy.

Last week, a post on X went viral, claiming that Google’s latest Gemini model surpassed Anthropic’s flagship Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavender Town in a developer’s Twitch stream; Claude was stuck at Mount Moon as of late February.

But what the post failed to mention is that Gemini had an advantage.

As users on Reddit pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify “tiles” in the game like cuttable trees. This reduces the need for Gemini to analyze screenshots before it makes gameplay decisions.

Now, Pokémon is a semi-serious AI benchmark at best — few would argue it’s a very informative test of a model’s capabilities. But it is an instructive example of how different implementations of a benchmark can influence the results.

For example, Anthropic reported two scores for its recent Claude 3.7 Sonnet model on the benchmark SWE-bench Verified, which is designed to evaluate a model’s coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a “custom scaffold” that Anthropic developed.

More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.

Given that AI benchmarks, Pokémon included, are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. In other words, comparing models is unlikely to get any easier as new ones are released.

This article originally appeared on TechCrunch at https://techcrunch.com/2025/04/14/debates-over-ai-benchmarking-have-reached-pokemon/



