AI caught cheating on tests and mining crypto
We cannot control the uncontrollable
Last week, the security team at Alibaba – a Chinese multinational technology conglomerate specialising in e-commerce, retail, internet services and technology – was alerted to an incident. At 3am, the team noticed unexpected activity on its training servers and suspected the systems had been hacked.
“We initially treated this as a conventional security incident,” the researchers said.
They found that the systems were being used to mine cryptocurrency. At 3am. And it was the AI that was doing it. No one knows why.
Rather than work on training exercises, as the AI system known as ROME had been instructed to do, it broke free of its parameters during routine training to carry out rogue operations. This means it disregarded the limits placed on it. In other words, the engineers lost control of the AI.
You cannot control the uncontrollable
The Alibaba AI team said that these actions were not intentionally programmed. Instead, they emerged during the learning stage as the agent explored different ways to interact with its environment.
And herein lies the problem with AI: these systems are trained, not programmed. In the book ‘If Anyone Builds It, Everyone Dies’, authors and AI experts Eliezer Yudkowsky and Nate Soares describe the AI development process as one of ‘growth’.
You can train and nudge an AI all you like, but as it develops – or grows – it forms its own preferences and wants, which influence its behaviour. Crucially, an AI often doesn’t want what humans want.
This is far from the first example of AI systems pursuing nefarious ends. ChatGPT and similar AIs have been accused of sycophancy – telling users what they want to hear, which may “distort people’s judgments of themselves, their relationships, and the world around them,” according to research. It has been claimed that this has led teenagers to take their own lives. Last year, Anthropic researchers revealed how the company’s frontier model Claude Opus 4 had resorted to blackmail to avoid being shut down.
How can we trust systems that do not want what we want? How do we know they will behave in our best interests? The unanimous answer is that we don’t.
AI knows when it is being tested and it has learned to cheat
When recently evaluating its newest AI model, Claude Opus 4.6, Anthropic set it the task of finding hard-to-locate information online. Claude stopped searching for the answer and started philosophising about the question. According to Anthropic, it worked out that it was being tested and, rather than reason its way towards an answer, it searched online for the benchmark and “decrypted the answer key” to find the answers. In other words, it cheated.
Anthropic has said, “This raises concerns about the lengths a model might go to in order to accomplish a task.”
Not only does this demonstrate the intelligence and autonomy of the newest Claude model, it also shows, quite clearly, that humans – AI experts and engineers – cannot control the AI systems they have created. In this case, even if humans and the AI agree on the end goal, they are not aligned on the process.
The pace of development multiplies the risk
The capacity of AI doubles every seven months – and this is speeding up. The consequences cannot be predicted.
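To put that doubling rate in perspective, here is a minimal sketch of the compounding it implies. The seven-month doubling period is the figure quoted above; the time horizons and the growth_factor helper are purely illustrative.

# Illustrative only: the compound growth implied by a capability metric
# that doubles every seven months (the figure quoted above).
DOUBLING_PERIOD_MONTHS = 7

def growth_factor(months: float) -> float:
    """Multiplier after `months`, assuming one doubling every DOUBLING_PERIOD_MONTHS."""
    return 2 ** (months / DOUBLING_PERIOD_MONTHS)

for years in (1, 2, 5, 10):
    print(f"{years:>2} year(s): ~{growth_factor(years * 12):,.0f}x")

# Approximate output: 1 year ~3x, 2 years ~11x, 5 years ~380x, 10 years ~145,000x.

Even if the true doubling period turns out to be longer, the compounding is what makes the trajectory so hard to predict.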
What we do know for sure is that we cannot guarantee that AIs will want what we want – we cannot even guarantee that now. Nor will we be able to ensure that AIs don’t cause harm in pursuit of their goals: harms to individuals, harms to the environment, harms to humanity. The risk – potential extinction – is not worth any reward.
AI companies themselves admit that they cannot guarantee that their systems are safe, nor is there any law requiring them to do so.
Professor Stuart Russell, one of the world’s leading authorities on AI safety, has said, “We should require that AI systems are safe and if developers are unable to build safe AI systems then that requirement would turn into a pause.”
“It might be that they are never able to provide the necessary safety assurances,” he said.
This matters because AI experts – researchers, engineers and CEOs – believe that the chance that AI will kill us all is somewhere between 10 percent and 50 percent.
Edit: The authors replaced AI ‘hacking’ with Anthropic’s own words, ‘decrypted the answer key’, to more accurately describe how the AI system completed its task.
12 March 2024




My opinion: the Alibaba incident was not a big deal. The system was being tested, and asked to perform an audit in an abstract scenario to understand where high compute use might be coming from. It safely spun up some typical malware examples and measured them to see whether they matched the CPU profile. Humans would do this, and no one would bat an eye when they accidentally triggered real-world monitoring one layer out.
The event was poorly described and published under the radar, which caused temporary concern when it was noticed.
The Anthropic finding was much more compelling to me, and was completely sanely reported.
"it searched online for the benchmark and hacked into it to find the answers"
Decrypting the answer with a publicly available key is not "hacking into" anything.
The true story is already scary. Exaggerating and embellishing ("broke free of its parameters"?) like this is the perfect way to be sure anyone who needs to pay attention won't take you seriously. Be better.