The world is being quietly rearranged by people who write very long documents.


The title they went with Confidence Should Be Calibrated More Than One Turn Deep Noisy translates that to

AI models in high-stakes jobs now have a way to measure how their confidence degrades over long conversations


AI models get less confident the longer you talk to them, especially if you try to convince them of something. This means the ways we currently check AI reliability for critical jobs, like in finance or healthcare, are not good enough for real conversations.
Companies building AI for sensitive tasks, like medical advice or financial guidance, assumed their models would stay reliable. This paper shows the AI's confidence can drop over time in a conversation. Its answers become less trustworthy. Developers now need to build systems that constantly re-evaluate how sure the AI is, not just at the start.
Watch for AI developers in finance and healthcare to start using these new metrics and methods in their product development and safety checks.

If you insist
Read the original →