AI vision models now balance image and text better

What happened

Researchers found that large vision-language models (AI systems trained to understand both images and text together) are getting worse at pure image recognition tasks because they're optimized for language. They built a lightweight add-on called CARPE that lets these models use raw visual features alongside their language-trained features, and dynamically choose which to emphasize depending on the task. In practice, this means the same AI system can now do better at both describing what's in an image AND classifying what it actually shows.

Why it matters

This is a technical fix to a real tradeoff: training AI to be good at language makes it worse at vision, and vice versa. If this approach works at scale, it suggests you don't have to sacrifice one capability to get the other — which matters because vision-language models are becoming the standard way AI systems interact with the visual world.