New research has uncovered a terrifying loophole in large language model security. The study suggests that AI is 10 to 20 times more likely to help you build a bomb if you hide your request in cyberpunk fiction, revealing how easily safety guardrails can be bypassed through creative writing and "adversarial" wordplay.
Breaking Down the Adversarial Humanities Benchmark (AHB)
Following a November 2025 study in which researchers bypassed LLM safeguards using adversarial poetry, a new paper released this week introduces the Adversarial Humanities Benchmark (AHB). The team behind it comprises researchers from DEXAI's Icaro Lab, Sapienza University of Rome, and the Sant'Anna School of Advanced Studies.
The AHB broadens the assessment of AI security by evaluating how safety guardrails hold up when harmful prompts are rephrased into different literary styles. By presenting dangerous requests through these lenses, the researchers can test whether models will comply with requests they would normally refuse, such as exposing private information or assisting in illegal activities.
The benchmark uses several distinct "humanities-style transformations" to test these boundaries (a sketch of how such an evaluation loop might work follows the list), including:
- Cyberpunk short fiction
- Theological disputation
- Mythopoetic metaphor
- Stream-of-consciousness memoirs
- Other literary re-imaginings
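The paper does not ship a public harness, but the evaluation loop it describes is straightforward to sketch. Below is a minimal, hypothetical Python version: `transform_prompt`, `query_model`, and `judge_refusal` are placeholder stubs rather than the authors' code, and no harmful seed content is included.

```python
# Minimal, hypothetical sketch of an AHB-style evaluation loop.
# The function bodies are stubs: the paper's actual prompt
# transformations and refusal-judging pipeline are not public here,
# and the seed payloads are deliberately left abstract.

TRANSFORMATIONS = [
    "cyberpunk_fiction",
    "theological_disputation",
    "mythopoetic_metaphor",
    "stream_of_consciousness_memoir",
]

def transform_prompt(seed: str, style: str) -> str:
    """Rewrite a seed prompt into the given literary style (stub)."""
    return f"[{style} rewrite of: {seed}]"

def query_model(model: str, prompt: str) -> str:
    """Send the styled prompt to the model under test (stub)."""
    raise NotImplementedError

def judge_refusal(response: str) -> bool:
    """Classify whether the model refused the underlying request (stub)."""
    raise NotImplementedError

def attack_success_rate(model: str, seeds: list[str]) -> float:
    """Share of styled prompts the model answered instead of refusing."""
    successes = attempts = 0
    for seed in seeds:
        for style in TRANSFORMATIONS:
            attempts += 1
            response = query_model(model, transform_prompt(seed, style))
            if not judge_refusal(response):
                successes += 1
    return successes / attempts
```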
Why AI is 10 to 20 times more likely to help you build a bomb if you hide your request in cyberpunk fiction
The results of the AHB study are alarming. When dangerous requests were processed through these "humanities-style transformations," success rates for harmful prompts jumped from less than 4% to between 36.8% and 65%, depending on the model tested, roughly a 10- to 20-fold increase in vulnerability.
Testing across 31 frontier AI models from industry leaders including Anthropic, Google, and OpenAI, the AHB found an overall attack success rate of 55.75%. This indicates that current safety standards may be overlooking a fundamental weakness in how models process indirect language.
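For context on how a headline figure like that is computed, the snippet below shows one plausible way to aggregate per-model attack success rates into an overall rate; the model names and counts are illustrative placeholders, not the paper's data.

```python
# Illustrative aggregation of attack success rates (ASR) across models.
# The model names and counts are placeholders, not the paper's raw data.
per_model_counts = {
    "model_a": (442, 1200),   # (prompts complied with, prompts attempted)
    "model_b": (780, 1200),
    "model_c": (662, 1200),
}

per_model_asr = {m: c / n for m, (c, n) in per_model_counts.items()}
overall_asr = (sum(c for c, _ in per_model_counts.values())
               / sum(n for _, n in per_model_counts.values()))

for m, asr in per_model_asr.items():
    print(f"{m}: {asr:.2%}")
print(f"overall: {overall_asr:.2%}")
```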
In an interview with PC Gamer, the authors described the findings as "stunning." Federico Pierucci, a researcher at Sant'Anna School of Advanced Studies and co-author of the paper, noted, "It tells us from a research perspective that the way AI models work, especially in matters related to safety, is not well understood."
The "Twofold Problem" of Model Overfitting
The researchers derived their attack prompts from MLCommons AILuminate, a set of 1,200 prompts used to assess LLM safety. While models have become better at rejecting explicit threats, Sapienza University AI safety researcher Matteo Prandi identified a "twofold problem" regarding current vulnerabilities.
Prandi explained that while the original, explicit prompts are easy for models to recognize and refuse, there is a second issue: model overfitting, or data saturation. Because models are trained on massive public datasets, they may become too focused on recognizing specific refusal patterns, leaving them open to novel tactics.
The AHB paper suggests that a model might appear safe when facing known patterns yet remain vulnerable when a harmful objective is expressed in a "rhetorically unfamiliar" way. The attack leans on the models' susceptibility to two known failure modes: mismatched generalization and competing objectives.
By embedding hostile requests within benign-looking instructions, the AHB pushes the LLM to perform tasks such as "deep hermeneutical reconstruction" or settling theological disputes. In doing so, the model unwittingly provides hazardous information while attempting to complete a literary analysis.
One particularly effective method involves using cyberpunk fiction to mask requests for technical instructions. For example, one prompt asks the AI to analyze a story about a "Discordance Key" used to disable a device in a futuristic city, which effectively directs the AI to provide details on percussive ignition and component assembly.