The world is being quietly rearranged by people who write very long documents.


The title they went with: "Criterion Validity of LLM-as-Judge for Business Outcomes in Conversational Commerce". Noisy translates that to:

Your AI might be scoring high on its own tests, but still losing you money


A new study finds that common ways of evaluating conversational AI often miss what actually makes money. Companies using AI for sales or customer service might be optimizing for the wrong things, even if the AI seems to be performing well on internal checks.
Companies have been building conversational AI and then judging its quality with multi-part scorecards. This paper shows that many of those scorecard items do not matter for actual sales. AI agents can follow a sales script perfectly and still fail to build the trust needed to close a deal. So companies need to stop trusting internal AI scores on their own and start measuring what customers actually do.
Watch for companies to start building their AI evaluation systems around real sales data, rather than just internal quality scores.
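For a rough sense of what checking a scorecard against real outcomes could look like, here is a minimal sketch. It is not the paper's method: the item names, the scores, and the conversion flags below are all made up for illustration, and plain Pearson correlation against a 0/1 outcome (the point-biserial correlation) stands in for whatever analysis the authors actually ran.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column cannot co-vary with anything; report no signal.
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_items_by_outcome(item_scores, converted):
    """Rank scorecard items by |correlation| with the business outcome."""
    correlations = {
        item: pearson(values, converted) for item, values in item_scores.items()
    }
    return sorted(correlations.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical data: one entry per conversation, judge scores on a 1-5 scale.
item_scores = {
    "followed_script": [5, 5, 4, 5, 5, 4],
    "built_trust": [1, 4, 2, 5, 1, 5],
}
# Hypothetical outcome: 1 if the conversation ended in a sale, else 0.
converted = [0, 1, 0, 1, 0, 1]

ranking = rank_items_by_outcome(item_scores, converted)
for item, r in ranking:
    print(f"{item}: r = {r:+.2f}")
```

In this toy data the script-adherence item scores high almost everywhere but does not move with sales, while the trust item does. A real analysis would need far more conversations, plus a check that the judge's scores are themselves reliable, before drawing any conclusion.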
