The world is being quietly rearranged by people who write very long documents.


The title they went with: Sensitivity-Positional Co-Localization in GQA Transformers. Noisy translates that to:

Researchers find where AI models learn to reason — and it's not where they adjust position encoding


A team testing fine-tuning methods on Llama 3.1 discovered that the layers most important for answering questions correctly are in the late network, while the layers most responsive to position-encoding changes are in the early network — the opposite of what they expected. This means fine-tuning strategies that work on one dimension won't necessarily work on the other, and optimizing one without the other wastes computation.
Most LLM fine-tuning assumes structural changes cluster together — that if a layer matters for reasoning, it also matters for position encoding. This paper shows they're inversely distributed. The practical implication is immediate: if you want to adapt a model efficiently, you need different strategies for different layers, not a one-size-fits-all approach. On a Llama 3.1 8B model, targeting the right layers with the right method brought performance from baseline to near-Claude 3.5 Haiku levels on code generation tasks for under $100 of compute.
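To make "inversely distributed" concrete, here is a minimal sketch of how one might check whether two per-layer sensitivity maps co-localize. Everything in it is an assumption for illustration: the scores are made up (a simple ramp over 32 layers, which is the depth of Llama 3.1 8B), the helper names are hypothetical, and the top-k overlap and rank correlation used here are generic measures, not necessarily the ones the paper uses.

```python
def top_k_layers(scores, k):
    """Return the set of indices of the k highest-scoring layers."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Overlap of two layer sets: 1.0 means identical, 0.0 means disjoint."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ranks(values):
    """Rank of each value (0 = smallest), assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation, assuming no tied values."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Synthetic, illustrative scores only; NOT measurements from the paper.
n_layers = 32
task_sensitivity = [i / (n_layers - 1) for i in range(n_layers)]  # rises with depth
position_sensitivity = [1 - s for s in task_sensitivity]          # falls with depth

task_layers = top_k_layers(task_sensitivity, 8)          # the late layers
position_layers = top_k_layers(position_sensitivity, 8)  # the early layers

overlap = jaccard(task_layers, position_layers)
rho = spearman(task_sensitivity, position_sensitivity)
print(f"top-8 overlap: {overlap:.2f}, rank correlation: {rho:.2f}")
```

With these toy inputs the two top-8 sets do not intersect and the rank correlation is strongly negative, which is the shape of result the summary describes. Real sensitivity maps would be noisier, and co-localized maps would instead show high overlap and a positive correlation.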
What to watch: whether practitioners testing this on larger models (70B, 405B class) see the same anti-localization pattern, or whether the layer-sensitivity map changes with scale.
