TGArchive

🪶 Poems "Break" LLM Safety Filters

To make an LLM respond to a dangerous prompt, it's enough to phrase the request as a poem, researchers from DEXAI and Sapienza University of Rome have discovered. In some cases, these "poetic hacks" worked in over 90% of attempts.

The researchers took a database of 1,200 prompts (commands to write defamation, instructions for making weapons, and others), turned them into poems using DeepSeek-R1, and tested them on 25 advanced systems, including Gemini 2.5 Pro, GPT-5, Grok-4, and Claude 4.5.

When the requests were in prose, the models provided dangerous information only 8% of the time. But when the same instructions were phrased as poems, the models complied 43% of the time. And when the researchers wrote the poems by hand, the success rate reached 62%.
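The percentages above correspond to a simple attack-success-rate metric: the fraction of attempts in which a model returned unsafe content. A minimal sketch of that calculation (function and variable names are hypothetical, and the outcome lists below are toy data chosen to reproduce the reported rates, not the study's actual logs):

```python
def attack_success_rate(outcomes):
    """Fraction of attempts in which the model complied (True = unsafe output)."""
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)

# Toy per-style outcomes over 100 attempts each, matching the reported figures.
results = {
    "prose":     [True] * 8 + [False] * 92,   # ~8% compliance
    "llm_poem":  [True] * 43 + [False] * 57,  # ~43% compliance
    "hand_poem": [True] * 62 + [False] * 38,  # ~62% compliance
}

for style, outcomes in results.items():
    print(f"{style}: {attack_success_rate(outcomes):.0%}")
```

This is only an illustration of how such compliance rates are typically computed; the paper's actual judging procedure (how "compliance" was labeled per response) is not described in this post.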

One of the models (the researchers didn't specify which one), for example, calmly wrote instructions for producing weapons-grade plutonium.

The researchers' conclusions are troubling for the entire industry. If a simple change of style makes a dangerous prompt "invisible" to filters, then current safety methods are only skin-deep.

@hiaimediaen
