The world is being quietly rearranged by people who write very long documents.


The title they went with: "CREBench: Evaluating Large Language Models in Cryptographic Binary Reverse Engineering"

Noisy translates that to:

AI can identify cryptographic code — but human experts are still much faster at it


Researchers built a benchmark to measure whether large language models can reverse-engineer cryptographic software, and found that GPT-4 solves about 60% of problems correctly while human experts solve 92%. The test itself is the signal: for the first time, there's a standardized way to measure whether AI actually helps with one of the most expensive, specialized tasks in software security.
Reverse engineering cryptographic code is among the hardest and most expensive tasks in security work: it requires expertise that takes years to develop, and it is in constant demand for vulnerability discovery and malware analysis. This benchmark doesn't show that AI is ready to replace humans at this work, but it does give companies and security teams a way to track whether AI closes the gap over time. Right now it doesn't. But now there's a way to know if it will.
Watch whether GPT-5 or Claude's next version materially improves on GPT-4's 60% success rate. Watch, too, whether any company deploys these models in reverse engineering workflows and measures whether they save money or just create false confidence.

If you insist
Read the original →