June 11, 2025, 16:04

🗺 AI Models Compete for Domination of Europe

AI benchmarks are no longer surprising: the newer the model, the higher the score. So the media outlet Every set out to find which model is best at ~~world domination~~ Diplomacy, a classic board game in which strategy, negotiation, and deception reign supreme.

In each match, seven models played the rulers of the Great Powers of early 20th-century Europe, fighting for control of the map. Each turn began with secret negotiations, followed by moves to conquer or defend territory. In total, 18 models from OpenAI, Google, Qwen, DeepSeek, and others took part in the "tournament."

Who Won?

1️⃣ OpenAI's o3 won nearly every match it played. The model proved unexpectedly devious, deliberately misleading its allies only to stab them in the back a few turns later.

"Along the way I discovered which model secretly yearns for world domination," jokes (or does he?) Alex Duffy, the experiment's author.

2️⃣ Gemini 2.5 Pro also won a few games, but in most cases it could not match o3's backstabbing.

3️⃣ Claude Opus 4 was the most peace-loving player. Its desire for peace even led it to betray Gemini when o3 promised it an alliance that would end in a draw, even though a draw was impossible under the game's rules.

4️⃣ DeepSeek-R1 fully embraced its role, issuing vivid threats. Impressively, it came close to winning several times despite being significantly cheaper than o3.

Duffy believes that tests like this will help future models plan better and interact more successfully with humans. He aims to make the game publicly available and perhaps arrange a "humans vs AI" tournament.

🕹 Watch the AI Diplomacy stream here.

#news #games @hiaimediaen

