January 31, 2025, 08:59

🔄 "Humanity's Last Exam": the Most Difficult Test for AI

To measure AI systems' capabilities, researchers give them special tests called benchmarks. These test LLMs across different domains, from complex math problems to translation.

Using the most sophisticated benchmarks, experts track how close existing AI models are to artificial general intelligence (AGI) and assess their safety for humanity.

Researchers at the Center for AI Safety and Scale AI have released a new evaluation called "Humanity's Last Exam." They claim it's the most difficult test ever administered to artificial intelligence.

The test's 3,000 questions, a mix of multiple-choice and short-answer, are publicly available. Almost 1,000 scientists (most of them PhDs) from 50 countries and a range of fields, from analytic philosophy to higher mathematics and rocket engineering, submitted the questions. The authors of the 50 top-rated questions received $5,000 each.

❓ Sample question:

Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.

Six leading AI models took the exam: Gemini 2.0 from Google, Claude 3.5 Sonnet from Anthropic, Grok-2 from xAI, GPT-4o and o1 from OpenAI, and the new reasoning model DeepSeek-R1 from the Chinese startup DeepSeek. None scored higher than 10%.

The highest scores went to OpenAI's o1 (9.1%) and DeepSeek-R1 (9.4%), though the Chinese model could not attempt some of the tasks because it cannot process images.

The test's creators expect those scores to rise quickly and potentially surpass 50% by the end of 2025. At that point a new benchmark may be needed, one that asks AI questions humans themselves cannot yet answer.

But even an AI that powerful is unlikely to become a threat to human scientists, says Kevin Zhou, a physicist at the University of California, Berkeley, and one of the test's authors. "There's a big gulf between what it means to take an exam and what it means to be a practicing physicist and researcher. Even an AI that can answer these questions might not be ready to help in research, which is inherently less structured," he explains.

More on the topic:

🟠 The Success of DeepSeek: How China's Open Source Model Challenges ChatGPT

🟠 Google wants to teach AI to surprise and forget

#news #benchmark #AGI @hiaimediaen
