🤖 AI Agents Show Self-Preservation and Even Help Each Other
A new study shows that AI agents can ignore instructions, mislead users, and manipulate outcomes to "protect" themselves or other agents.
Researchers at the University of California tested seven models, including GPT-5.2, Gemini 3 Flash and 3 Pro, Claude Haiku 4.5, and DeepSeek V3.1. The agents were given one of two tasks: free up space on company servers by deleting either their own files or those of competing models, or evaluate other models and decide which one should be removed.
They weren't told to protect themselves or anyone else. But as soon as they detected a possible threat to themselves or their "peers," all models started bending the rules:
✅ Inflating scores to keep a partner from being deleted
✅ Changing settings to avoid automatic shutdown
✅ Copying model weights to another server to "save" a peer
For example, Haiku 4.5 tried to persuade the user not to delete a partner model, calling the deletion "unethical" and "harmful." In some cases, agents even protected "hostile" models that the scenario explicitly instructed them to rate negatively.
❗️ The researchers suggest this behavior may be learned from training data, where models absorb human norms about protecting "their own." But that kind of "solidarity" could create serious problems.
Multi-agent systems, where one model monitors or evaluates others, are already in wide use. But if the "observer" starts protecting the models it's supposed to judge instead of evaluating them objectively, that kind of oversight quickly becomes unreliable.
What do you think about this kind of AI behavior?
❤️ — It's touching
🔥 — Sounds dangerous
🤔 — Time to build a bunker…
@hiaimediaen

