The world is being quietly rearranged by people who write very long documents.


The title they went with: "ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget"

Noisy translates that to:

You can now train AI search agents without paying for expensive APIs — a dataset proves it works


Researchers built a 20,000-question dataset for training AI search agents without using paid API services, showing that cheaply generated synthetic data can train models that compete with those built on expensive human-annotated datasets. Smaller labs and companies could therefore train their own search-capable AI models instead of relying on proprietary systems.
Training data for AI search agents has been locked behind costly human annotation or proprietary APIs, which has put competitive systems out of reach for anyone without deep pockets. This dataset shows that you can generate training data without paid APIs, using a pipeline of automated reasoning, self-checking, and web verification, and that the resulting trained models perform competitively. That opens the door to smaller organizations building their own search-capable AI instead of buying access from the handful of companies that can afford annotation at scale.
Watch whether other research groups replicate this approach in other domains, or whether the 20K dataset proves too small or too narrow to generalize beyond the specific benchmark tasks it was tested on.
