⚪️ How Pokémon Helps Benchmark AI Progress
Pokémon Red was released on the Game Boy nearly 30 years ago. However, the game still has a devoted fanbase—including some of the researchers at AI startup Anthropic. In June 2024, the team behind Claude decided to test how well their AI model could play the classic game. What started as a lighthearted experiment quickly became something of a cult phenomenon inside the team.
When Anthropic introduced Claude 3.7 Sonnet, they highlighted its performance in Pokémon. Researcher Diane Penn explained that watching a model play the game says more about AI progress than most standardized benchmarks do. "We're at a point where evaluations don't tell the full story of how much more capable each version of these models are," she said.
🎮 Pokémon Is Harder Than Chess for AI
AI has long surpassed humans at chess, Go, and even complex real-time games like StarCraft. But open-world RPGs like Pokémon Red, with their random encounters and open-ended choices, are a better proxy for real-life tasks. They require more than knowledge; they demand agentic skills: decision-making, goal-tracking, and sustained interaction with characters. That's what makes them a powerful testbed for real-world AI applications.
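The agentic loop such a benchmark exercises can be sketched as an observe-decide-act cycle with explicit goal memory. This is a minimal illustration, not Anthropic's actual harness: the `Agent` class, its toy policy, and the goal names are all hypothetical; a real setup would query the model itself at the decision step.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Toy game-playing agent: tracks long-horizon goals across many steps."""
    # Goal stack: the agent must remember distant objectives ("beat the
    # Elite Four") while pursuing nearer subgoals ("exit the starting house").
    goals: list = field(default_factory=lambda: ["beat the Elite Four"])
    steps: int = 0

    def decide(self, observation: str) -> str:
        # Placeholder policy; a real harness would send the observation
        # (screen state, goal stack) to the model and parse its reply.
        return "turn" if "wall" in observation else "walk"

    def act(self, observation: str) -> str:
        self.steps += 1
        return self.decide(observation)

agent = Agent()
print(agent.act("wall ahead"))  # turn
print(agent.act("open path"))   # walk
```

The point of the sketch is the loop's statefulness: unlike a single benchmark question, every action depends on goals carried across thousands of prior steps.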
So far, Claude 3.7 Sonnet has only defeated a few gym leaders, but that's already a leap forward. The previous version struggled to leave the starting house in Pallet Town.
Even more impressive is Google's Gemini 2.5 Pro. Since early April, an AI enthusiast has been livestreaming the model as it plays Pokémon Blue — a slightly modified version of Red. According to Reddit users, Gemini has already made it much farther than Claude did over the same timeframe. Then again, Gemini does get a little help: a custom mini-map helps it avoid getting stuck.
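Why does a mini-map help? One plausible mechanism is stuck detection: with a memory of recent positions, the harness can notice when the agent keeps revisiting the same tile. The streamer's actual tooling isn't public, so the function below is only an assumed illustration of that idea; the window and threshold values are invented.

```python
from collections import Counter, deque

def is_stuck(history, window=20, threshold=0.6):
    """Flag the agent as stuck when one tile dominates its recent positions.

    history: iterable of (x, y) tiles visited, oldest first (hypothetical format).
    """
    recent = deque(history, maxlen=window)  # keep only the last `window` tiles
    if not recent:
        return False
    (tile, count), = Counter(recent).most_common(1)
    return count / len(recent) >= threshold

# Wandering across many distinct tiles: not stuck.
print(is_stuck([(x, 0) for x in range(20)]))   # False
# Pacing against a wall, mostly the same tile: stuck.
print(is_stuck([(3, 4)] * 15 + [(3, 5)] * 5))  # True
```

Without some such positional memory, a model that only sees the current screen has no way to tell productive exploration from walking into the same wall for hours.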
More on the topic:
🔴 Pokémon Go data is being used to train AI

