The world is being quietly rearranged by people who write very long documents.


The title they went with: "TFRBench: A Reasoning Benchmark for Evaluating Forecasting Systems". Noisy translates that to:

Researchers can now measure what forecasting AI actually reasons about — not just whether its numbers are right


A new benchmark lets researchers see inside time-series forecasting systems by evaluating their reasoning, not just their numerical accuracy. This means forecasting AI can now be tested on whether it actually understands trends and external factors, rather than treated as an opaque number-generator.
For decades, forecasting systems were judged solely on whether their numbers matched reality; no one asked what reasoning produced them. This benchmark forces that question by requiring systems to explain their logic about relationships between variables, trends, and external shocks. The practical effect: when researchers prompted language models with these reasoning traces, forecasting accuracy rose from 40% to 57% on average, a gain of 17 percentage points, which suggests the reasoning itself contributes to the forecast. That could matter: it is the first time anyone has measured whether black-box forecasting is learning causal relationships or just fitting patterns.
Watch whether industry forecasting systems (used in supply chains, energy markets, finance) start being evaluated on reasoning traces rather than accuracy alone, or whether they remain opaque.

If you insist
Read the original →