The world is being quietly rearranged by people who write very long documents.


The title they went with TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization Noisy translates that to

AI training method cuts rewards to useful signals instead of final outcomes


Researchers found that training AI systems on whether they got the right final answer misses what happened in between — the actual thinking steps where good reasoning happens. They built a method that rewards the moment an AI first figures out the right answer, not just the end result, and got 13-24% better performance on smaller models.
Current AI training treats reasoning like a black box: you feed in a question, check the final answer, and adjust accordingly. This means the AI learns nothing about whether its intermediate steps were smart or lucky. The new method breaks that open by rewarding the instant the AI stumbles onto the right answer, preserving signals about how it got there. That's useful because it tells you not just what the model knows, but which path through its reasoning actually led somewhere.
Whether larger language models (the ones actually deployed in products) show the same 13-24% gains when trained this way, or whether the improvement only appears on smaller test models.

If you insist
Read the original →