The world is being quietly rearranged by people who write very long documents.


The title they went with Co-Evolution of Policy and Internal Reward for Language Agents Noisy translates that to

Language AI agents learn to generate their own reward signals instead of waiting for external feedback


Researchers built a method where language AI agents create internal guidance signals during both training and inference, creating a feedback loop where better performance produces better guidance, which then improves performance further. This means agents can improve faster by learning to steer themselves rather than waiting for external rewards from their environment.
AI training has been bottlenecked by sparse, delayed feedback from the environment — the agent has to act many times before knowing if it did something right. This research shows agents can close that loop themselves by generating intermediate guidance, which is faster and doesn't require external reward models. The practical effect is that language agents trained this way improve 8% over agents trained only on environmental feedback, and the improvement doesn't fade at inference time.
Whether this self-guided approach scales to longer-horizon problems beyond the three benchmarks tested, and whether the improvement persists when agents face genuinely novel environments rather than test tasks.

If you insist
Read the original →