💡 Chinese Researchers Found a Way To Expand LLM Context by Hundreds of Times
Startup Evermind has introduced a modification of the transformer attention mechanism called Memory Sparse Attention (MSA). It lets AI models process contexts of hundreds of millions of tokens with almost no performance loss. For comparison, the ceiling for today's most advanced models is 1 million tokens.
📌 How It Works
A standard transformer attends to every previous token each time it generates a response, so computational cost grows quadratically as the context gets longer.
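To see where that quadratic cost comes from, here is a toy sketch (not Evermind's code; the head dimension and token count are made up): full attention computes a score for every pair of tokens, so ten times more context means a hundred times more work.

```python
# Toy illustration of full self-attention and its quadratic cost.
import numpy as np

d = 64  # head dimension, chosen arbitrarily for the example

def full_attention(q, k, v):
    scores = q @ k.T / np.sqrt(d)                    # shape (n, n): one score per token pair
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                    # softmax over all previous tokens
    return w @ v

n = 1_000
q = k = v = np.random.randn(n, d)
print(full_attention(q, k, v).shape)                 # (1000, 64)
print("pairwise scores:", n * n)                     # 1,000,000 — doubling n quadruples this
```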
MSA doesn't scan the entire context on every request. Instead, it uses a dedicated router that pulls only the relevant information from the conversation. On top of that, the architecture doesn't merge everything into a single flat stream of tokens (as a base transformer does)—it processes each document separately.
This makes total context length much less of an issue, since the model doesn't read through all records in order, but rather "finds the right book on the right shelf" for each request.
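A minimal sketch of that routing idea, as this post describes it (my own illustration, not the published MSA implementation; the router heuristic, top_k value, and chunk sizes are all assumptions): documents stay separate, a small router scores them against the query, and attention runs only over the few documents it picks.

```python
# Sketch of router-style sparse attention over separate documents (assumed design).
import numpy as np

d, top_k = 64, 2  # illustrative dimensions

def route_and_attend(query, doc_keys, doc_values):
    # Router: score each document by the similarity of its mean key to the query.
    summaries = np.stack([k.mean(axis=0) for k in doc_keys])
    picked = np.argsort(summaries @ query)[-top_k:]

    # Attention only over tokens from the selected documents, not the whole history.
    k = np.concatenate([doc_keys[i] for i in picked])
    v = np.concatenate([doc_values[i] for i in picked])
    scores = k @ query / np.sqrt(d)
    w = np.exp(scores - scores.max()); w /= w.sum()
    return w @ v

docs_k = [np.random.randn(np.random.randint(50, 200), d) for _ in range(100)]
docs_v = [np.random.randn(len(k), d) for k in docs_k]
print(route_and_attend(np.random.randn(d), docs_k, docs_v).shape)  # (64,)
```

The point of the sketch: the cost per query depends on the few documents the router selects, not on the total number of tokens stored.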
🔍 For testing, the developers adapted Qwen3-4B to the proposed architecture. The resulting MSA-4B model not only outperformed other models on attention benchmarks but also retained its capabilities even at 100 million tokens of context.
👨‍💻 For everyday conversations with AI, the standard million tokens is more than enough. But for autonomous agents like OpenClaw, the MSA approach could be a game changer: they could work faster and more accurately, without constantly compressing information about user habits and actions just to save tokens.
@hiaimediaen

