The world is being quietly rearranged by people who write very long documents.


The title they went with Testing the Limits of Truth Directions in LLMs Noisy translates that to

How to trick an AI into lying: truth detection in language models breaks down predictably


Researchers tested whether large language models encode truth consistently across different parts of their neural networks, and found the answer is no. The consistency breaks down depending on which layer you measure, what task the model is doing, how complex the problem is, and what instructions you give it — meaning claims that language models have a universal, reliable way of representing truth are wrong.
If you want to know whether an AI is hallucinating or telling the truth, researchers have been working on ways to read that directly from the model's internals rather than checking the output. This paper shows those techniques are far less reliable than anyone thought. What looked like a general principle turns out to be a set of fragile, context-dependent patterns that shift with the task, the prompt, and where you look in the network. That means the safety tools being built around 'truth detection' need serious rework before anyone should trust them.
Watch whether AI safety teams adjust their architectures for catching hallucinations, or continue assuming truth directions are portable across models and tasks.

If you insist
Read the original →