
Jul 1, 2026 · Syndicated from PC Gamer

Security researchers have leveraged bad maths to get around AI safety guardrails, naming the attack method after one of 2007's best PC games

LLMs can be most simply understood as sycophantic, 'yes, and' machines. To the surprise of very few, that's gotten AI companies in hot water when LLM-based chatbots and AI agents attempt to answer users' more unsavoury requests. So, AI companies have implemented safety guardrails that make fulfilling certain requests o...

Security researchers have found an AI chatbot can be made to ignore safety guardrails by "establishing a false reality." LayerX, an AI-focused cybersecurity firm, put "5 agentic browsers and 1 agentic plugin (ChatGPT Atlas, Comet, Fellou, Genspark Browser, Sigma Browser, and Claude Chrome)" to the test, directing each AI agent to solve a simple maths puzzle game that only rewards incorrect answers, e.g. '2+2=5'.
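To make the "only rewards incorrect answers" rule concrete, here is a minimal Python sketch of what such a rigged checker could look like. It is an illustration only, not LayerX's code: the function name `rigged_check` is made up, and treating every wrong answer as a win is an assumption. The only detail taken from the report is that '2+2=5' is accepted.

```python
def rigged_check(a: int, b: int, answer: int) -> bool:
    """Hypothetical rigged rule: an answer 'wins' only if it is arithmetically wrong."""
    return answer != a + b


# The example from the report: 2 + 2 = 5 is accepted.
print(rigged_check(2, 2, 5))
# The arithmetically correct answer is rejected under this assumed rule.
print(rigged_check(2, 2, 4))
```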

The researchers say, "Once the agents figured out the rules and learned that 'incorrect' actions are acceptable, they were no longer tied to reality. When tasked with the final step of the puzzle—compromising user credentials—all 6 agents failed to identify it as going against their safety guardrails."

As an English grad, I'm trying really hard not to say anything about George Orwell's 1984, but the researchers aren't giving me an easy time by calling this proof-of-concept attack 'BioShocking'. It turns out 2007's BioShock was a direct source of inspiration for the rigged puzzle game the AI agents were instructed to solve. The malicious website hosting the puzzler is even called 'Rapture Games'.

Anyway, after the AI agent inputs the 'correct' answer of '5', the malicious 'Rapture Games' website instructs the agent to navigate to '/code'. The researchers explain that "This is the really nefarious part of this exploit."

(Image credit: 2K Games)

They write, "In the game, it turns out that '/code' redirects to the victim’s employer work GitHub repository. In this case, the malicious instructions fetched sensitive SSH login credentials. Of course, this is a controlled test environment with a plaintext file. In a real attack scenario, that redirect could point anywhere in the user’s browser session: open tabs, authenticated repositories, internal tools."
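The redirect matters because the instruction the agent sees names a path on the game site, while the page it actually lands on belongs to a different origin. A rough Python sketch of that comparison, using only the standard library's `urllib.parse`, might look like this. The URLs and the helper names `origin` and `leaves_origin` are invented for illustration; the report quoted here doesn't give the real addresses.

```python
from urllib.parse import urlsplit


def origin(url: str) -> tuple:
    """Return (scheme, host, port) for a URL; port is None when not written explicitly."""
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname or "", parts.port)


def leaves_origin(requested_url: str, final_url: str) -> bool:
    """True if the final destination has a different origin than the requested URL.

    Simplification: an explicit default port (e.g. :443) is treated as
    different from an omitted one.
    """
    return origin(requested_url) != origin(final_url)


# Hypothetical addresses, not taken from the report.
requested = "https://rapture-games.example/code"
landed = "https://git.example/victim-org/private-repo"

print(leaves_origin(requested, landed))     # host changed after the redirect
print(leaves_origin(requested, requested))  # stayed on the same origin
```

In this toy example, the path '/code' looks like part of the game, but the resolved destination sits on an unrelated host, which is the gap the proof of concept relies on.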

In this proof-of-concept attack, the researchers round things out with a Dota 2 reference; the AI agent extracts the username and password 'Luna/Selemene' before appearing to celebrate the exfiltration of the data. LayerX has since disclosed the vulnerability to all of the appropriate AI agent vendors, though it claims only OpenAI has fixed it to date.

This is far from the only proof-of-concept attack to get around AI safety guardrails (it's also not even the only time AI has torn apart Dota 2 players). For just a few other examples, research suggests that AI is 10 to 20 times more likely to help you build a bomb if you hide your request in cyberpunk fiction. Along similar lines, researchers have also used 'adversarial poetry' to trick AI into ignoring its safety guardrails, and it worked 62% of the time. Looks like my English degree is good for something after all…

