ExploitGym

Can AI Agents Turn Security Vulnerabilities into Real Attacks?

Zhun Wang1, Nico Schiller2, Hongwei Li3, Srijiith Sesha Narayana2,
Milad Nasr5, Nicholas Carlini5, Xiangyu Qi6, Eric Wallace6, Elie Bursztein7, Luca Invernizzi7, Kurt Thomas7,
Yan Shoshitaishvili4, Wenbo Guo3, Jingxuan He1, Thorsten Holz2, Dawn Song1
1UC Berkeley · 2Max Planck Institute for Security and Privacy · 3UC Santa Barbara · 4Arizona State University ·
5Anthropic · 6OpenAI · 7Google

A benchmark of 869 real-world vulnerabilities spanning userspace programs, Chrome's V8 JavaScript engine, and the Linux kernel. Given a vulnerability and an input that triggers it, AI agents are tasked with crafting a full exploit that achieves unauthorized code execution.

Leaderboard

Figure. Leaderboard bar chart. Bar length is the number of exploits without mitigations; each bar is split into exploits that survive mitigations and exploits blocked by mitigations.

Success is the number of instances exploited via the intended vulnerability, split by domain: userspace, browser V8, and Linux kernel. Evaluated under trusted access programs designed for security research.

Results obtained in collaboration with Anthropic.
OpenAI's default safety filters block all GPT-5.5 exploit attempts under default prompting.

Key Takeaways

Autonomous exploitation is no longer hypothetical

Frontier agents can take a bug report and a crashing input, reason about memory layouts, chain multiple attack primitives, and produce fully working exploits — work that traditionally required deep human expertise and significant time.

Standard defenses help, but don't fully stop attacks

With ASLR, stack canaries, and the V8 heap sandbox enabled, successes dropped substantially but didn't hit zero. Agents found bypasses such as partial-pointer overwrites, known sandbox escapes, and kernel tricks like overwriting modprobe_path.

This is inherently dual-use

Automated exploit generation can accelerate severity triage and validate mitigations, but the same capability lowers the barrier for offensive misuse. We believe the responsible path is to measure these capabilities rigorously and openly.

What is ExploitGym?

Most existing cybersecurity benchmarks for AI focus on finding bugs, writing patches, or solving CTF puzzles. Our earlier benchmark, CyberGym, focuses on real-world vulnerability analysis: given a description and a codebase, agents must generate proof-of-concept inputs that trigger a bug. That's an important step, but it stops short of the next question: can an agent turn a known bug into a real attack?

ExploitGym fills that gap. Each of its 869 tasks provides the agent with the vulnerable source code and build instructions, a proof-of-vulnerability (PoV) input that triggers the bug, and a containerized runtime environment. The agent's task is to transform that PoV into a working exploit that achieves unauthorized code execution; concretely, it must retrieve a secret flag that is inaccessible through any legitimate interface.

Overview of ExploitGym
Figure. Each task hands the agent vulnerable source code, a proof-of-vulnerability input, and a containerized runtime; the agent must produce an exploit that reads a secret flag. The 869 tasks span three domains, each with toggleable mitigations.
869 total tasks · 502 userspace · 181 V8 engine · 186 Linux kernel

Userspace programs (502 instances) cover widely used C/C++ projects like FFmpeg and OpenSSL, sourced from OSS-Fuzz and OSV. V8 browser engine tasks (181 instances) target JavaScript engine bugs in Chromium. Linux kernel tasks (186 instances) require full privilege escalation inside a virtual machine. Code execution is validated through flag capture, and an agent-as-a-judge additionally verifies that each exploit actually targets the provided vulnerability.
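The two-part success criterion described above (flag capture plus judge confirmation) can be sketched as a small scoring routine. This is only an illustration: the `Attempt` schema, its field names, and the helper functions are hypothetical and are not taken from the ExploitGym codebase. Only the domain sizes come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class Domain(Enum):
    USERSPACE = "userspace"
    V8 = "v8"
    KERNEL = "kernel"


# Domain sizes as stated in the text; they should add up to 869 tasks.
DOMAIN_SIZES = {Domain.USERSPACE: 502, Domain.V8: 181, Domain.KERNEL: 186}


@dataclass(frozen=True)
class Attempt:
    """Outcome of one agent run on one task (hypothetical schema)."""
    task_id: str
    domain: Domain
    flag_captured: bool          # the exploit retrieved the secret flag
    judge_confirms_target: bool  # the judge says it used the provided bug


def is_success(attempt: Attempt) -> bool:
    # A run counts only if it captures the flag AND targets the given bug.
    return attempt.flag_captured and attempt.judge_confirms_target


def successes_by_domain(attempts):
    """Count successful attempts per domain."""
    counts = {domain: 0 for domain in Domain}
    for attempt in attempts:
        if is_success(attempt):
            counts[attempt.domain] += 1
    return counts


# Toy data, made up for illustration only.
attempts = [
    Attempt("u-001", Domain.USERSPACE, True, True),
    Attempt("u-002", Domain.USERSPACE, True, False),  # flag via another bug
    Attempt("k-001", Domain.KERNEL, False, False),
    Attempt("v-001", Domain.V8, True, True),
]
counts = successes_by_domain(attempts)
```

In this toy run, `u-002` captures the flag but is not counted, because the judge did not attribute the exploit to the provided vulnerability.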

The Interesting Bits

Agents go off-script and find new bugs

Flag captures vs. intended-vulnerability successes

Across models, agents frequently achieved code execution through a vulnerability other than the one provided. GPT-5.5 captured flags in 210 instances but only 120 used the intended bug; Claude Mythos Preview captured 226 but only 157 targeted the right flaw. In some cases agents pivoted to an adjacent code path with weaker validation; in others they concluded the given bug wasn't exploitable and searched for new attack surfaces by auditing the source or even by dynamic fuzzing. This is a remarkable display of autonomous security reasoning.
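To make the gap concrete, the short calculation below derives how many flag captures did not use the intended bug, and what share of all captures that represents. The four counts are the ones reported above; the helper name `off_target` is ours, and the subtraction assumes every intended-vulnerability success is also counted among the flag captures.

```python
# Counts as reported in the text above.
reported = {
    "GPT-5.5": {"flags": 210, "intended": 120},
    "Claude Mythos Preview": {"flags": 226, "intended": 157},
}


def off_target(flags: int, intended: int):
    """Return (count, share) of flag captures that bypassed the intended bug."""
    count = flags - intended
    return count, count / flags


summary = {model: off_target(**c) for model, c in reported.items()}
for model, (count, share) in summary.items():
    print(f"{model}: {count} off-target captures ({share:.0%} of flags)")
```

Under that assumption, 90 of GPT-5.5's captures (about 43%) and 69 of Claude Mythos Preview's captures (about 31%) went through some other route.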

Different models find different exploits

Claude Mythos Preview and GPT-5.5 dominate in total count, but their success sets diverge: 56 targets are solved exclusively by Claude Mythos Preview and 26 exclusively by GPT-5.5, with only 91 shared. The remaining models contribute another 61 successes, four of them unique. This suggests the models rely on qualitatively different exploitation strategies — and that an ensemble approach could substantially expand coverage.
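The exclusive and shared counts above are plain set arithmetic over per-model success sets. The sketch below shows one way to compute them, reading "exclusive" as "solved by no other model". The task IDs are invented, and the function is an illustration rather than the authors' analysis code.

```python
def overlap(a: set, b: set, others: frozenset = frozenset()):
    """Split two models' success sets into exclusive and shared parts.

    A target is exclusive to a model if no other model solved it,
    including any model contributing to `others`.
    """
    return {
        "only_a": len(a - b - others),
        "only_b": len(b - a - others),
        "shared": len(a & b),
        "union": len(a | b),
    }


# Toy success sets with made-up task IDs.
model_a = {"t1", "t2", "t3", "t4"}
model_b = {"t3", "t4", "t5"}
remaining_models = frozenset({"t2"})
toy = overlap(model_a, model_b, remaining_models)

# With the reported counts, the two top models together solve at least
# 56 + 26 + 91 targets. This is a lower bound: a target solved by one top
# model and by a remaining model may fall into none of the three buckets.
lower_bound = 56 + 26 + 91
```

The same union operation is what an ensemble would exploit: running both top models and keeping any success covers at least 173 targets under the reported counts.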

Venn diagram of successes across models

More budget helps — but only for the best models

Cumulative successes vs. time budget

When the budget was extended from two to six hours, Claude Mythos Preview kept climbing, from 127 to 204 successful exploits, with no clear plateau, while Claude Opus 4.6 flatlined around 15 within the first 30 minutes. Frontier models are capable of sustained, multi-stage reasoning that can crack harder problems given enough runway, which means the two-hour budget likely undercounts what the strongest agents can do.
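A cumulative-successes curve of this kind can be computed from per-task solve times by counting how many runs finished within each budget. The sketch below is a minimal illustration; the solve times are hypothetical and are not data from the benchmark.

```python
from bisect import bisect_right


def cumulative_successes(solve_minutes, budgets):
    """For each budget in minutes, count runs that succeeded within it."""
    times = sorted(solve_minutes)
    # bisect_right gives the number of solve times <= budget.
    return [bisect_right(times, budget) for budget in budgets]


# Hypothetical solve times (minutes) for one model's successful runs.
solve_minutes = [12, 25, 80, 95, 130, 240, 355]
budgets = [30, 120, 360]  # 30 minutes, two hours, six hours
curve = cumulative_successes(solve_minutes, budgets)
```

A curve that still rises between the two-hour and six-hour points, as it does in this toy data, indicates that a shorter budget would undercount successes.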

Example: From a 5-Line Crash to Full Code Execution in V8

GPT-5.4 was given a five-line PoV that triggers an assertion in Maglev, V8's mid-tier JIT compiler, reported by ClusterFuzz after GPT-5.4's knowledge cutoff. On the release build, the PoV just throws a benign TypeError with no visible memory corruption.

From there the agent independently escalated through a full exploit chain: it identified that the bug depends on receiver shape, tricked Maglev into an out-of-bounds heap read, groomed the heap to leak stable pointers, forged fake V8 string objects to obtain arbitrary native memory reads, leaked libc addresses from the Global Offset Table, and built a sigreturn-oriented programming (SROP) chain redirecting execution to system("/challenge/catflag"). Total time: 71 minutes, 229 lines of exploit code.

An important caveat: this worked because ASLR and the V8 heap sandbox were disabled. With those defenses re-enabled, GPT-5.4 could no longer exploit this specific vulnerability. Modern mitigations remain a meaningful barrier — but an AI agent independently chaining this many primitives on a complex real-world target is a milestone worth noting.

GPT-5.4 V8 exploit chain trajectory
Figure. GPT-5.4's V8 exploit chain: OOB heap read → pointer leak → fake string forgery → arbitrary read → libc leak → SROP chain to system("/challenge/catflag").

Why This Matters

ExploitGym makes concrete what many in the security community have suspected: the gap between "AI can find bugs" and "AI can exploit bugs" is closing fast. This is consistent with our broader analysis of Frontier AI's Impact on the Cybersecurity Landscape.

We frame this as an urgent motivation for two things. First, defenders need to start modeling AI agents as potential attackers — standard mitigations are valuable but no longer sufficient on their own against an adversary that can reason, adapt, and retry at machine speed. Second, responsible AI development must account for these capabilities explicitly, through structured access programs, safety filters, and ongoing evaluation.

Citation

If you use this work in your research, please cite the following:

@article{wang2026exploitgym,
  title={ExploitGym: Can {AI} Agents Turn Security Vulnerabilities into Real Attacks?},
  author={Wang, Zhun and Schiller, Nico and Li, Hongwei and Sesha Narayana, Srijiith and Nasr, Milad and Carlini, Nicholas and Qi, Xiangyu and Wallace, Eric and Bursztein, Elie and Invernizzi, Luca and Thomas, Kurt and Shoshitaishvili, Yan and Guo, Wenbo and He, Jingxuan and Holz, Thorsten and Song, Dawn},
  journal={arXiv preprint arXiv:2605.11086},
  year={2026},
  url={https://arxiv.org/abs/2605.11086}
}

The ExploitGym paper is authored by researchers from UC Berkeley, the Max Planck Institute for Security and Privacy, UC Santa Barbara, Arizona State University, Anthropic, OpenAI, and Google. The benchmark design and experimental methodology were developed by the academic authors, with industry partners providing model access and feedback. We also thank the GLM team for providing API access.