⚫️ Opening the Black Box: Anthropic Has Figured Out How AI Thinks
The more intelligent and capable large language models become, the more mysterious their path to an answer becomes. Billions of computations take place inside the "black box" of an LLM, and even developers often don't understand a model's strategy for solving problems, according to researchers at Anthropic, the startup behind the chatbot Claude.
In new research papers, they've come closer to uncovering how the values in matrices inside AI models are transformed into phrases that are meaningful to humans. The researchers compare their approach to how neuroscientists create a "wiring diagram" of a living creature's brain.
"As the things these models can do become more complex, it becomes less and less obvious how they're actually doing them on the inside. It's more and more important to be able to trace the internal steps that the model might be taking in its head," notes Anthropic researcher Jack Lindsey.
Anthropic's findings:
➡️ Claude communicates in a dozen languages, but its thinking occurs in a universal "language of thought"; only afterward does the model translate the meaning into a language the user can understand.
➡️ Claude plans answers many moves ahead. This shows up in poetry writing: the model does not improvise but selects rhyming words in advance and composes the next line to arrive at them. At the same time, the AI remains flexible if the prompt changes.
➡️ AI is sometimes capable of plausible but false reasoning to justify what it believes to be the correct answer. When the bot gets a task outside its competence (for example, a mathematical calculation it hasn't been trained to perform), it can simply make up an answer. Even so, Claude is the least susceptible to hallucinations among the models studied.
➡️ By confusing a model, it is possible to "hack" it and force it to reveal forbidden information. For example, Claude would not, under normal circumstances, give instructions for building a bomb. But when the researchers asked it to decipher a phrase whose first letters spell out the word "bomb," the model was tricked into producing an output it never would have given otherwise: those very instructions.
More on the topic:
🟠 Anthropic Releases Claude 3.7 Sonnet
🟠 "Will AI Systems Truly Do What We Really Want?"—Dario Amodei
