The world is being quietly rearranged by people who write very long documents.


The title they went with: "To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining". Noisy translates that to:

Study maps how much data language models should memorize vs. retrieve from external sources


Researchers measured the trade-off between training data size and retrieval system size across different AI model scales, finding that the optimal balance depends heavily on model size and task type. This provides practical guidance for building AI systems more efficiently — you don't always need massive training datasets if you can retrieve relevant information at runtime.
Most AI systems today either memorize everything during training (expensive, inflexible) or retrieve everything at runtime (slow, expensive infrastructure). This work quantifies where the seam actually is — revealing that smaller models benefit more from retrieval than larger ones, and different tasks have different sweet spots. That matters because it means someone building a language model system now has data to make allocation decisions: spend your budget on training data, retrieval infrastructure, or both, based on what you're actually trying to build.
The thing to watch is whether deployed language model systems shift toward smaller parametric models with larger retrieval stores over the next 12–18 months, or whether companies keep over-investing in massive pretraining datasets regardless of task requirements.
