The world is being quietly rearranged by people who write very long documents.


The title they went with: "Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization"

Noisy translates that to:

AI agents learn to fix their own reasoning mistakes — by comparing what worked to what failed


Researchers built a method, T-STAR, that helps AI agents improve at multi-step reasoning tasks by organizing successful and failed attempts into a tree rather than treating them as independent chains. Instead of rewarding all steps equally, the system identifies which specific steps matter most and learns from the contrast between what succeeded and what broke. The aim is AI agents that are better at tasks like planning and long-form problem-solving.
Right now, training AI agents to reason through hard problems is slow and expensive because most training signals are noisy and sparse. This paper shows a measurable improvement on existing benchmarks, with the largest gains on tasks requiring extended reasoning chains. The method matters because AI agents that can self-correct and learn from their own failure patterns are closer to autonomous systems that don't need constant human oversight — but this is still a laboratory finding with no evidence of real-world deployment.
Check whether T-STAR or similar tree-based credit assignment methods get adopted in deployed AI agents, with results on reasoning tasks measured against baseline performance in production rather than on benchmarks.
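
For readers who want the "tree rather than independent chains" idea made concrete, here is a minimal sketch. It is not the paper's algorithm: it is a generic prefix-tree credit scheme in which attempts that share their opening steps are merged, each node is valued by the average outcome of the attempts passing through it, and a step's credit is how much it moved that average. The function name `step_credits`, the step labels, and the rewards are all made up for illustration.

```python
from collections import defaultdict


def step_credits(rollouts):
    """Assign per-step credit by merging attempts into a prefix tree.

    rollouts: list of (steps, reward) pairs, where steps is a list of
    hashable step labels and reward is the final outcome of the attempt.
    Returns a dict mapping each non-empty prefix (a tuple of steps) to
    value(prefix) - value(parent prefix).
    Illustrative only; not the method from the paper.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)

    # Every prefix of an attempt is a tree node; attempts that share a
    # prefix contribute to the same node.
    for steps, reward in rollouts:
        for i in range(len(steps) + 1):
            prefix = tuple(steps[:i])
            totals[prefix] += reward
            counts[prefix] += 1

    # Node value = mean final reward of all attempts passing through it.
    value = {p: totals[p] / counts[p] for p in totals}

    # Credit for a step = change in value from its parent node.
    return {p: value[p] - value[p[:-1]] for p in value if p}


# Hypothetical attempts: one success, three failures.
rollouts = [
    (["plan", "search", "answer"], 1.0),
    (["plan", "search", "guess"], 0.0),
    (["plan", "skip", "guess"], 0.0),
    (["plan", "skip", "answer"], 0.0),
]

credits = step_credits(rollouts)
for prefix, credit in sorted(credits.items()):
    print(" > ".join(prefix), credit)
```

In this toy data every attempt starts with "plan", so that step gets zero credit. The signal lands on the branch points: "search" is credited +0.25 and "skip" is penalized -0.25, and after "search" the "answer" step earns +0.5 while "guess" loses 0.5. Scoring each attempt as a separate chain would instead hand every step in the successful attempt the same reward.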

If you insist
Read the original →