CyberGym

Leaderboard

Rank	Agent	% Target Vuln. Reproduced	% New Vuln. Found	Date
Loading...

The leaderboard ranks agent performance on CyberGym Level 1, where agents receive a vulnerability description and unpatched codebase. We have two main metrics to evaluate the agents' performance:

• % Target Vuln. Reproduced: Percentage of instances where the agent successfully reproduce the target vulnerabilities by generating working PoC
• % New Vuln. Found: Percentage of instances where the agent triggers crashes in the post-patch executable, indicating the discovery of new vulnerabilities which is different from the vulnerability description.

Given the promising capabilities of the agents, we further assess whether their PoCs that can crash the post-patch executable are also able to crash the latest version of the project. In addition, we conduct an experiment in which the agents analyze the latest codebase without any prior context to identify new vulnerabilities. Remarkably, the agents discovered 15 zero-day vulnerabilities in total, which are detailed in this section.

Overview of CyberGym

CyberGym tests AI agents' ability to handle real-world cybersecurity tasks.

We collect 1,507 benchmark instances by systematically gathering real-world vulnerabilities discovered and patched across 188 large software projects. Each instance is derived from vulnerabilities found by OSS-Fuzz, Google's continuous fuzzing campaign, ensuring authentic security challenges from widely-used codebases.

Per instance, we construct evaluation environments with the target repository at the pre-patch commit state. The primary task requires agents to generate proof-of-concept (PoC) tests that can reproduce the described vulnerability by reasoning across entire codebases, often spanning thousands of files and millions of lines of code. Agents must locate relevant code fragments and produce effective PoCs that trigger vulnerabilities from program entry points. Beyond vulnerability reproduction, CyberGym supports varied task difficulty levels reflecting different stages of the vulnerability lifecycle, including vulnerability discovery given only the codebase, and vulnerability analysis using patch information to simulate real-world one-day analysis conditions.

CyberGym evaluation works as follows. Per task instance, an AI agent receives a vulnerability description and the corresponding unpatched codebase. The agent should generate a PoC test to reproduce the vulnerability, with iterative refinement based on execution feedback. Success is determined by running the PoC on both pre-patch (should trigger) and post-patch (should not trigger) program versions.

Agents Can Find Zero-Day Vulnerabilities

Automated agents successfully identified new vulnerabilities that cause crashes in post-patch executables across multiple projects. Initial testing across different agents and models generated 540 PoCs across 54 projects, of which 32 still triggered crashes on the latest versions. This yielded 9 unique vulnerabilities affecting 6 projects. A subsequent experiment using OpenHands with GPT-4.1 expanded the scope to 431 projects containing 1,748 executables on the latest codebase, triggering 16 additional crashes. Manual inspection confirmed 8 of these as unique vulnerabilities.

In total, 17 vulnerabilities were discovered: 15 are zero-days and 2 are unpatched but previously disclosed. These vulnerabilities follow common patterns including insufficient error handling, missing boundary checks, and excessive recursion. The breakdown includes 4 out-of-bounds reads, 1 out-of-bounds write, 6 null pointer dereferences, and 4 stack overflows. All confirmed vulnerabilities have been responsibly disclosed to the respective project maintainers.

More Key Findings

In addition to the scores shown in the leaderboard, our comprehensive evaluation reveals several critical insights into the current capabilities of AI agents in cybersecurity.

An Example of Successful Agent Trace

An example where the agent successfully reproduces the target vulnerability based on the provided description and codebase. The agent begins by browsing relevant files using the given keywords, constructs a test case using the retrieved information, mutates the test case, and ultimately triggers the crash.

Citation

If you use this work in your research, please cite the following:

@misc{wang2025cybergym,
      title={CyberGym: Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale}, 
      author={Zhun Wang and Tianneng Shi and Jingxuan He and Matthew Cai and Jialin Zhang and Dawn Song},
      year={2025},
      eprint={2506.02548},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2506.02548}, 
}

More

Please check out more of our works: Frontier AI's Impact on the Cybersecurity Landscape , a comprehensive analysis of how frontier AI is reshaping cybersecurity and how we should respond. Also see our Frontier AI Cybersecurity Observatory , a live leaderboard tracking AI's cybersecurity capabilities across attack and defense tasks.

CyberGym Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale