Researchers build language models that handle 2-million-token contexts without drowning in memory

What happened

A new neural network architecture replaces the attention mechanism used by current AI language models with a continuous wave-based system whose processing time grows linearly with input length rather than quadratically. As a result, models can handle vastly longer documents or conversations within the same memory budget: the practical limit jumps from thousands of tokens to millions.

Why it matters

For five years, every large language model has hit the same wall: the math that lets them understand context gets quadratically slower and hungrier for memory as documents get longer. This happens because standard attention compares every token to every other token, so doubling the input length quadruples the work. That is fine for a 4,000-token document and impractical for a 2-million-token one. The architecture here sidesteps the problem by using phase accumulation instead of matrix comparisons, which means the memory cost stays flat no matter how long the input gets. What becomes possible: models that can ingest entire codebases, legal discovery databases, or multi-year conversation histories without exploding the hardware budget. What's being tested: whether a 150-million-parameter prototype can actually retrieve specific information accurately across those ultra-long contexts, and whether the continuous streaming approach avoids losing information the way earlier linear-time alternatives do.
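The growth in pairwise comparisons described above can be made concrete with a little arithmetic. The Python sketch below is only an illustration, not the researchers' code: it assumes a naive implementation that stores one score per token pair for a single attention head in a single layer, at 2 bytes per score (16-bit floats). The function names and the byte size are assumptions made for the example.

```python
def attention_score_entries(n_tokens: int) -> int:
    """Count the pairwise scores when every token is compared with every other token."""
    return n_tokens * n_tokens


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes."""
    return n_bytes / 2**30


# Assumption for illustration: each score is stored as a 16-bit float.
BYTES_PER_SCORE = 2

SHORT_CONTEXT = 4_000      # the "fine" case from the text
LONG_CONTEXT = 2_000_000   # the "impractical" case from the text

for n in (SHORT_CONTEXT, LONG_CONTEXT):
    entries = attention_score_entries(n)
    gib = to_gib(entries * BYTES_PER_SCORE)
    print(f"{n:>9,} tokens -> {entries:,} scores, {gib:,.2f} GiB per head per layer")

# The input is 500x longer, but the score count grows by 500 squared.
growth = attention_score_entries(LONG_CONTEXT) // attention_score_entries(SHORT_CONTEXT)
print(f"growth factor: {growth:,}x")
```

Under these assumptions, the 4,000-token case needs 16 million scores (roughly 30 MiB), while the 2-million-token case needs 4 trillion scores (about 7,450 GiB) for one head in one layer, a 250,000-fold increase for a 500-fold longer input. Optimized attention implementations can avoid storing the full matrix at once, but the number of comparisons still grows with the square of the input length.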

The signal

The open question is whether the architecture's long-context retrieval performance (the researchers claim accurate search across 2 million tokens) holds up when it is scaled to production-size models and tested against the standard benchmarks that currently favor Transformer-based systems.