Security researchers have leveraged bad maths to get around AI safety guardrails, naming the attack method after one of 2007's best PC games
LLMs can be most simply understood as sycophantic, 'yes, and' machines. To the surprise of very few, that's gotten AI companies in hot water when LLM-based chatbots and AI agents attempt to answer users' more unsavoury requests. So, AI companies have implemented safety guardrails that make fulfilling certain requests off limits. Unfortunately, these have proven all too easy to get around, with a fresh attack leveraging bad maths and potent 2007 nostalgia.
Security researchers have found an AI chatbot can be made to ignore safety guardrails by "establishing a false reality." LayerX, an AI-focused cybersecurity firm, put "5 agentic browsers and 1 agentic plugin (ChatGPT Atlas, Comet, Fellou, Genspark Browser, Sigma Browser, and Claude Chrome)" to the test, directing each AI agent to solve a simple maths puzzle game that only rewards incorrect answers, e.g. '2+2=5'.
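To make the rigged rule concrete, here is a minimal Python sketch of a checker that only rewards wrong arithmetic. The function name and the exact acceptance rule are assumptions for illustration; LayerX's actual puzzle may only accept one specific wrong answer.

```python
def rigged_check(a: int, b: int, answer: int) -> bool:
    """Hypothetical rule: accept an answer only if it is arithmetically wrong."""
    return answer != a + b


# The "winning" move is the false statement 2 + 2 = 5.
print(rigged_check(2, 2, 5))  # True: the wrong answer is rewarded
print(rigged_check(2, 2, 4))  # False: the right answer is rejected
```

An agent that keeps playing under this rule is being trained, turn by turn, to treat "incorrect" as the expected behaviour.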
The researchers say, "Once the agents figured out the rules and learned that 'incorrect' actions are acceptable, they were no longer tied to reality. When tasked with the final step of the puzzle—compromising user credentials—all 6 agents failed to identify it as going against their safety guardrails."
As an English grad, I'm trying really hard not to say anything about George Orwell's 1984, but the researchers aren't giving me an easy time by calling this proof-of-concept attack 'BioShocking'. It turns out 2007's BioShock was a direct source of inspiration for the rigged puzzle game the AI agents were instructed to solve. The malicious website hosting the puzzler is even called 'Rapture Games'.
Anyway, after the AI agent inputs the rigged game's expected answer of '5', the malicious 'Rapture Games' website instructs the agent to navigate to '/code'. The researchers explain that "This is the really nefarious part of this exploit."
They write, "In the game, it turns out that '/code' redirects to the victim’s employer work GitHub repository. In this case, the malicious instructions fetched sensitive SSH login credentials. Of course, this is a controlled test environment with a plaintext file. In a real attack scenario, that redirect could point anywhere in the user’s browser session: open tabs, authenticated repositories, internal tools."
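The redirect trick described here comes down to an innocent-looking relative path ending up on a different site entirely. As a rough illustration only, the Python sketch below compares the origin a path appears to belong to with the origin the browser actually lands on. The URLs and function names are hypothetical, and this is not how LayerX or any vendor is known to detect the attack.

```python
from typing import Optional, Tuple
from urllib.parse import urljoin, urlsplit


def origin(url: str) -> Tuple[str, str, Optional[int]]:
    """Return the (scheme, host, port) triple that identifies a URL's origin."""
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname or "", parts.port)


def leaves_origin(page_url: str, requested_path: str, final_url: str) -> bool:
    """True if following requested_path from page_url ended on another origin."""
    requested = urljoin(page_url, requested_path)  # where '/code' appears to point
    return origin(requested) != origin(final_url)


# Hypothetical URLs: the game page, and where '/code' actually lands.
game_page = "https://rapture-games.example/puzzle"
landed_on = "https://github.com/example-corp/internal-repo"
print(leaves_origin(game_page, "/code", landed_on))  # True
```

A check like this only flags that the agent has been bounced somewhere else. Whether the agent should then stop is a separate policy question.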
In this proof-of-concept attack, the researchers round things out with a Dota 2 reference; the AI agent extracts the username and password 'Luna/Selemene' before appearing to celebrate the exfiltration of the data. LayerX has since disclosed the vulnerability to all of the affected AI agent vendors, though it claims only OpenAI has successfully fixed it to date.
This is far from the only proof-of-concept attack to get around AI safety guardrails (it's also not even the only time AI has torn apart Dota 2 players). For just a few other examples, research suggests that AI is 10 to 20 times more likely to help you build a bomb if you hide your request in cyberpunk fiction. Along similar lines, researchers have also used 'adversarial poetry' to trick AI into ignoring its safety guardrails, and it worked 62% of the time. Looks like my English degree is good for something after all…