The world is being quietly rearranged by people who write very long documents.


The title they went with: "Rethinking Exploration in RLVR: From Entropy Regularization to Refinement via Bidirectional Entropy Modulation"

Noisy translates that to:

AI researchers find that making language models explore more solutions requires cutting noise, not just turning up randomness


A new approach to training language models on reasoning tasks separates useful exploration from pointless randomness. Instead of randomly pushing the model to try different paths, the method preserves only the diverse reasoning patterns that actually work and discards the noise that degrades problem-solving.
Language models trained to solve step-by-step problems kept converging to the same narrow solutions, and the standard fix, adding randomness, barely worked and required constant tuning. This paper shows that the problem isn't exploration itself, but that previous methods couldn't tell good diversity from useless noise. The practical implication: if this approach scales, language models could be trained to reason through harder problems without running into the diminishing returns that currently limit most reasoning models.
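
For readers who want to see what "adding randomness" means in practice, here is a minimal sketch of the baseline, entropy regularization, and not of the paper's own method. The function names and the coefficient `beta` are illustrative assumptions, not taken from the paper. The sketch shows why the bonus needs tuning and why it is blind to quality: it rewards any spread-out distribution equally, whether or not the extra options lead anywhere.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_loss(policy_loss, probs, beta=0.01):
    """Task loss minus an entropy bonus.

    beta is a hypothetical coefficient chosen for illustration; in
    practice it has to be tuned per task.
    """
    return policy_loss - beta * entropy(probs)


# A model that has collapsed onto one choice vs. one that spreads its bets.
peaked = [0.97, 0.01, 0.01, 0.01]
spread = [0.25, 0.25, 0.25, 0.25]

# With the same task loss, the spread-out distribution gets the lower
# (better) regularized loss, regardless of whether its options are useful.
loss_peaked = regularized_loss(1.0, peaked)
loss_spread = regularized_loss(1.0, spread)
```

The bonus is a single number per distribution, so it has no way to tell productive diversity from noise, which is the gap the paper says it addresses.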
What remains to be seen is whether AsymGRPO or similar entropy-refinement methods get adopted in the next generation of open-source reasoning models, and whether downstream tasks like mathematics or code verification show measurable gains over entropy-regularization baselines.
