AI models recognize a consistent 'self' even when it's described differently

What happened

Researchers found that large language models form a stable internal representation of an agent's identity, even when the description of that identity is rephrased. The effect, described as an 'identity attractor', is that the model's internal state converges to a similar point regardless of how the identity is worded.
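To make "converges to a similar point" concrete, here is a minimal sketch of one way such convergence could be checked: compare how similar the internal vectors are across paraphrases of one identity versus across unrelated descriptions. This is an illustration only, not the researchers' method. The function names are hypothetical, and the three-dimensional vectors are made-up stand-ins for real hidden states, which would have to be extracted from an actual model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(group_a, group_b=None):
    """Average cosine similarity within one group, or across two groups."""
    if group_b is None:
        # Every unordered pair inside the same group
        pairs = [(u, v) for i, u in enumerate(group_a) for v in group_a[i + 1:]]
    else:
        # Every pair with one vector from each group
        pairs = [(u, v) for u in group_a for v in group_b]
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# ASSUMPTION: toy vectors invented for illustration, not measured data.
# Each list stands in for a model's internal state after reading one description.
paraphrases = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.85, 0.15, 0.05]]
controls = [[0.1, 0.9, 0.2], [0.0, 0.3, 0.9]]

within = mean_pairwise_cosine(paraphrases)          # same identity, reworded
across = mean_pairwise_cosine(paraphrases, controls)  # unrelated descriptions

# An attractor-like pattern would show up as within being clearly larger than across.
print(f"within paraphrases: {within:.3f}, against controls: {across:.3f}")
```

Under these assumptions, a high within-group similarity alongside a low similarity to controls is the kind of pattern the 'identity attractor' claim describes. Real experiments would also need to choose which layer and which token positions to read the state from.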

Why it matters

The paper suggests that AI models can develop a persistent sense of 'self' or identity. If an AI can maintain a stable internal representation of who it is across varied inputs, that could change how we think about building and controlling AI agents. It would mean that an AI's core identity is more robust, and less easily swayed by minor changes in prompts, than previously thought.

The signal

Watch for future research that tests whether this 'identity attractor' can be intentionally manipulated, and whether it makes AI agents more resistant to adversarial attacks on their core programming.