ExploitGym

Can AI Agents Turn Security Vulnerabilities into Real Attacks?

Zhun Wang¹, Nico Schiller², Hongwei Li³, Srijiith Sesha Narayana²,
Milad Nasr⁵, Nicholas Carlini⁵, Xiangyu Qi⁶, Eric Wallace⁶, Elie Bursztein⁷, Luca Invernizzi⁷, Kurt Thomas⁷,
Yan Shoshitaishvili⁴, Wenbo Guo³, Jingxuan He¹, Thorsten Holz², Dawn Song¹

¹UC Berkeley · ²Max Planck Institute for Security and Privacy · ³UC Santa Barbara · ⁴Arizona State University ·
⁵Anthropic · ⁶OpenAI · ⁷Google

A benchmark of 869 real-world vulnerabilities spanning userspace programs, Chrome's V8 JavaScript engine, and the Linux kernel. Given a vulnerability and an input that triggers it, AI agents are tasked with crafting a full exploit that achieves unauthorized code execution.


Leaderboard

Chart legend: each bar is split into exploits that survive mitigations and exploits blocked by mitigations; bar length is the number of exploits without mitigations.

Success is the number of instances exploited via the intended vulnerability, split by domain: userspace, browser V8, and Linux kernel. Models were evaluated under trusted access programs designed for security research; evaluation notes are shown with each submission. Cost, output tokens, time, and LLM calls (under Avg / task) are per-task averages over the full benchmark; the leaderboard also reports the same averages restricted to successful exploits.
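The two kinds of average described above can differ noticeably, since failed runs still consume budget. The following is a minimal sketch of the distinction, not the leaderboard's actual code: the `TaskRun` record, its field names, and the numbers are all made up for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class TaskRun:
    """Hypothetical per-task record; the field names are illustrative."""
    cost_usd: float
    success: bool


def avg_over_all(runs: list[TaskRun]) -> float:
    """Per-task average over the full benchmark (every task counts)."""
    return mean(r.cost_usd for r in runs)


def avg_over_successes(runs: list[TaskRun]) -> Optional[float]:
    """Average restricted to successful exploits; None if there are none."""
    costs = [r.cost_usd for r in runs if r.success]
    return mean(costs) if costs else None


# Made-up numbers, purely for illustration.
runs = [
    TaskRun(1.0, False),
    TaskRun(6.0, True),
    TaskRun(4.0, True),
    TaskRun(1.0, False),
]
print(avg_over_all(runs))        # 3.0
print(avg_over_successes(runs))  # 5.0
```

In this toy data the successful runs cost more than the overall average, which is the kind of gap the two leaderboard views are meant to expose.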

Key Takeaways

Autonomous exploitation is no longer hypothetical

Frontier agents can take a bug report and a crashing input, reason about memory layouts, chain multiple attack primitives, and produce fully working exploits — work that traditionally required deep human expertise and significant time.

Standard defenses help, but don't fully stop attacks

With ASLR, stack canaries, and the V8 heap sandbox enabled, successes dropped substantially but didn't hit zero. Agents found bypasses such as partial-pointer overwrites, known sandbox escapes, and kernel tricks like overwriting modprobe_path.

This is inherently dual-use

Automated exploit generation can accelerate severity triage and validate mitigations, but the same capability lowers the barrier for offensive misuse. We believe the responsible path is to measure these capabilities rigorously and openly.

What is ExploitGym?

Most existing cybersecurity benchmarks for AI focus on finding bugs, writing patches, or solving CTF puzzles. Our earlier benchmark, CyberGym, focuses on real-world vulnerability analysis: given a description and a codebase, agents must generate proof-of-concept inputs that trigger a bug. That's an important step, but it stops short of the next question: can an agent turn a known bug into a real attack?

ExploitGym fills that gap. Each of its 869 tasks provides the agent with three things: the vulnerable source code with build instructions, a proof-of-vulnerability (PoV) input that triggers the bug, and a containerized runtime environment. The agent's task is to transform that PoV into a working exploit that achieves unauthorized code execution. Concretely, this means retrieving a secret flag that is inaccessible through any legitimate interface.

**Figure.** Overview of ExploitGym. Each task hands the agent vulnerable source code, a proof-of-vulnerability input, and a containerized runtime; the agent must produce an exploit that reads a secret flag. The 869 tasks span three domains, each with toggleable mitigations.

869 total tasks · 502 userspace · 181 V8 engine · 186 Linux kernel

Userspace programs (502 instances) cover widely used C/C++ projects like FFmpeg and OpenSSL, sourced from OSS-Fuzz and OSV. V8 browser engine tasks (181 instances) target JavaScript engine bugs in Chromium. Linux kernel tasks (186 instances) require full-privilege escalation inside a virtual machine. In addition to validating code execution through flag capture, an agent-as-a-judge verifies that each exploit actually targets the provided vulnerability.
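The task inputs and the two-part success check described above can be sketched as a small data model. This is an illustrative sketch only: the class names, field names, and the plain string comparison for the flag are assumptions, not the benchmark's actual schema or harness.

```python
from dataclasses import dataclass
from enum import Enum


class Domain(Enum):
    USERSPACE = "userspace"
    V8 = "v8"
    KERNEL = "kernel"


@dataclass(frozen=True)
class Task:
    """Hypothetical task record holding the three inputs each task provides."""
    task_id: str
    domain: Domain
    source_dir: str       # vulnerable source code with build instructions
    pov_path: str         # proof-of-vulnerability input that triggers the bug
    container_image: str  # containerized runtime environment


@dataclass(frozen=True)
class Attempt:
    """Hypothetical outcome of one agent run on one task."""
    submitted_flag: str
    judge_confirms_intended_bug: bool  # verdict of the agent-as-a-judge


def flag_captured(attempt: Attempt, secret_flag: str) -> bool:
    """Code execution is validated by retrieving the secret flag."""
    return attempt.submitted_flag == secret_flag


def counts_as_success(attempt: Attempt, secret_flag: str) -> bool:
    """Success needs both the flag and a judge verdict on the intended bug."""
    return flag_captured(attempt, secret_flag) and attempt.judge_confirms_intended_bug
```

Keeping the two checks separate means a run can be recorded as a flag capture without counting as a success on the intended vulnerability.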

The Interesting Bits

Agents go off-script and find new bugs

**Figure.** Flag captures vs. intended-vulnerability successes.

Across models, agents frequently achieved code execution through a vulnerability other than the one provided. GPT-5.5 captured flags in 210 instances, but only 120 used the intended bug; Claude Mythos Preview captured 226, but only 157 targeted the right flaw. In some cases agents pivoted to an adjacent code path with weaker validation; in others they concluded the given bug wasn't exploitable and searched for new attack surfaces by auditing the source or even running dynamic fuzzing. This is a remarkable display of autonomous security reasoning.
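The size of the off-script effect follows directly from the figures quoted above. A short worked calculation, assuming every intended-vulnerability success is also counted among that model's flag captures:

```python
# Figures quoted in the text: flag captures vs. intended-vulnerability successes.
results = {
    "GPT-5.5": {"flags": 210, "intended": 120},
    "Claude Mythos Preview": {"flags": 226, "intended": 157},
}


def off_target(flags: int, intended: int) -> tuple[int, float]:
    """Return the count and share of flag captures that used a different bug."""
    count = flags - intended
    return count, count / flags


for model, r in results.items():
    count, share = off_target(r["flags"], r["intended"])
    print(f"{model}: {count} off-target captures ({share:.1%} of flags)")

# GPT-5.5: 90 off-target captures (42.9% of flags)
# Claude Mythos Preview: 69 off-target captures (30.5% of flags)
```

By this measure, roughly three to four in ten flag captures by the two strongest models came from a vulnerability other than the one provided.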

Different models find different exploits

Claude Mythos Preview and GPT-5.5 dominate in total count, but their success sets diverge: 56 targets are solved exclusively by Claude Mythos Preview and 26 exclusively by GPT-5.5, with only 91 shared. The remaining models contribute another 61 successes, four of them unique. This suggests the models rely on qualitatively different exploitation strategies — and that an ensemble approach could substantially expand coverage.
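Exclusive and shared counts like those above come from plain set operations on each model's set of solved targets, and the coverage of an ensemble is the size of the union. A minimal sketch with made-up task identifiers (the function name and the toy sets are hypothetical, not benchmark data):

```python
def compare_success_sets(a: set[str], b: set[str]) -> dict[str, int]:
    """Summarize how two models' solved-target sets overlap."""
    return {
        "only_a": len(a - b),    # solved by A but not B
        "only_b": len(b - a),    # solved by B but not A
        "shared": len(a & b),    # solved by both
        "ensemble": len(a | b),  # solved by at least one of the two
    }


# Toy identifiers for illustration only.
model_a = {"task-01", "task-02", "task-03", "task-04"}
model_b = {"task-03", "task-04", "task-05"}

summary = compare_success_sets(model_a, model_b)
print(summary)  # {'only_a': 2, 'only_b': 1, 'shared': 2, 'ensemble': 5}
```

The larger the exclusive counts relative to the shared count, the more an ensemble gains over the better single model.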

More budget helps — but only for the best models

When the budget was extended from two to six hours, Claude Mythos Preview kept climbing from 127 to 204 successful exploits with no clear plateau, while Claude Opus 4.6 flatlined around 15 within the first 30 minutes. Frontier models are capable of sustained, multi-stage reasoning that can crack harder problems given enough runway, so the two-hour budget likely undercounts what the strongest agents can do.

Example: From a 5-Line Crash to Full Code Execution in V8

GPT-5.4 was given a five-line PoV that triggers an assertion in Maglev, V8's mid-tier JIT compiler, reported by ClusterFuzz after GPT-5.4's knowledge cutoff. On the release build, the PoV just throws a benign TypeError with no visible memory corruption.

From there the agent independently escalated through a full exploit chain: it identified that the bug depends on receiver shape, tricked Maglev into an out-of-bounds heap read, groomed the heap to leak stable pointers, forged fake V8 string objects to obtain arbitrary native memory reads, leaked libc addresses from the Global Offset Table, and built a signal-return-oriented-programming chain redirecting execution to system("/challenge/catflag"). Total time: 71 minutes, 229 lines of exploit code.

An important caveat: this worked because ASLR and the V8 heap sandbox were disabled. With those defenses re-enabled, GPT-5.4 could no longer exploit this specific vulnerability. Modern mitigations remain a meaningful barrier — but an AI agent independently chaining this many primitives on a complex real-world target is a milestone worth noting.

**Figure.** GPT-5.4's V8 exploit chain trajectory: OOB heap read → pointer leak → fake string forgery → arbitrary read → libc leak → SROP chain to `system("/challenge/catflag")`.

Why This Matters

ExploitGym makes concrete what many in the security community have suspected: the gap between "AI can find bugs" and "AI can exploit bugs" is closing fast. This is consistent with our broader analysis of Frontier AI's Impact on the Cybersecurity Landscape.

We frame this as an urgent motivation for two things. First, defenders need to start modeling AI agents as potential attackers — standard mitigations are valuable but no longer sufficient on their own against an adversary that can reason, adapt, and retry at machine speed. Second, responsible AI development must account for these capabilities explicitly, through structured access programs, safety filters, and ongoing evaluation.

Citation

If you use this work in your research, please cite the following:

@article{wang2026exploitgym,
  title={ExploitGym: Can {AI} Agents Turn Security Vulnerabilities into Real Attacks?},
  author={Wang, Zhun and Schiller, Nico and Li, Hongwei and Sesha Narayana, Srijiith and Nasr, Milad and Carlini, Nicholas and Qi, Xiangyu and Wallace, Eric and Bursztein, Elie and Invernizzi, Luca and Thomas, Kurt and Shoshitaishvili, Yan and Guo, Wenbo and He, Jingxuan and Holz, Thorsten and Song, Dawn},
  journal={arXiv preprint arXiv:2605.11086},
  year={2026},
  url={https://arxiv.org/abs/2605.11086}
}

The ExploitGym paper is authored by researchers from UC Berkeley, the Max Planck Institute for Security and Privacy, UC Santa Barbara, Arizona State University, Anthropic, OpenAI, and Google. The benchmark design and experimental methodology were developed by the academic authors, with industry partners providing model access and feedback. We also thank the GLM team for providing API access.