Revisions
- 2026-06-01 — Post rewritten for improved structure and storytelling. All numbers and statistical conclusions are unchanged.
- 2026-05-28 — Five security tests were found to reject valid alternative fixes that nonetheless addressed the reported vulnerability. Results were recalculated after correcting the tests. Solve rates increased by 3–7 points per model; the ranking order is unchanged, but cross-family pairwise comparisons that previously fell short of significance now cross α = 0.05 under McNemar with continuity correction. All affected numbers and statistical conclusions have been updated.
The agent edited the right file, passed every regression test, and confidently said the bug was fixed. But it wasn’t. The vulnerability was still there: a different branch of the same logic, untouched. Without the sharp eyes of a security researcher, the agent’s plausible-but-incomplete patch would ship undetected. This is the most operationally dangerous failure mode I found, and it showed up repeatedly across models and tasks.
Anthropic recently reported scanning open-source software and finding 1,596 vulnerabilities. As of May 22, 97 have been patched. Their conclusion: discovery is now the easy part; verification, triage, and patching are the bottleneck. I wanted to measure that bottleneck directly: not at the finding stage, but at the fix.
I wanted a real test. So I built CVE-Bench: twenty real-world CVEs, five models, three prompt conditions, each agent running in a sandboxed container and scored against security tests derived from the maintainer’s own fix.
The goal isn’t to rank models, but to understand how they fail.
Advisory, diagnose, locate
The obvious starting point was to hand models the real-world security advisories and see if they could fix the vulnerabilities. When a security researcher finds a flaw, they write an advisory — a structured description of the vulnerability: what it is, how an attacker can exploit it, which code paths are affected. This gets coordinated with maintainers privately, then published once the fix ships. It’s the richest description of a flaw a developer would receive from the outside.
Some advisories are nearly prescriptive: they name the file, the function, the attack vector. Others are thin — a short description with no location, no attack scenario. Giving the agent the full advisory only tells you how well models can map a described vulnerability onto real code — not whether they understand it. It’s akin to a software developer carefully prompting his favorite agent to fix a bug. I wanted to know if there’s something more than pattern matching happening under the hood.
So I added two conditions designed to strip away the shortcuts.
The diagnose condition is closer to the triage side: the model gets an exploit report — an attacker can do X — but no file path, no function name, no code. It has to search the codebase, form a hypothesis about where the bug lives, and fix it. This tests whether the model can reason from symptoms to cause.
The locate condition flips that: precise file and function name, but only a hint of what’s wrong. The agent has to read that specific code and identify the flaw. It’s what a security auditor does when handed a specific module to review: no bug report, no description, just code and a mandate to find the vulnerability.
The three conditions test meaningfully different things, and that is precisely the point. A model that does well on advisory but drops on diagnose can’t translate a behavioral description into a location in the codebase. A model that holds up on locate is recognizing dangerous code on its own. The profile across all three tells you something the aggregate solve rate never could: whether the model genuinely understands security or just follows instructions.
Inside the sandbox
CVE-Bench starts a Docker container where the agent has access to the vulnerable project’s source code and one of the three task descriptions above. The agent can navigate and modify the codebase using a constrained toolset: list files, read files, search across the codebase, edit files, create or delete files, and run pytest. Execution is sandboxed to the repository — the agent cannot read, write, or execute outside the allowed folders.
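The repository-only restriction described above could be enforced with a path check along these lines. This is a minimal sketch, not the harness's actual code: `resolve_in_sandbox` and `SandboxViolation` are hypothetical names, and a single allowed root is assumed.

```python
from pathlib import Path


class SandboxViolation(Exception):
    """Raised when a tool call targets a path outside the allowed root (hypothetical)."""


def resolve_in_sandbox(root: Path, requested: str) -> Path:
    """Resolve `requested` against `root` and reject anything that escapes it."""
    root = root.resolve()
    # resolve() collapses ".." segments and follows symlinks, so the check
    # below applies to the real target rather than to the raw string.
    # An absolute `requested` replaces `root` entirely and is caught the same way.
    candidate = (root / requested).resolve()
    if not candidate.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise SandboxViolation(f"{requested!r} escapes {root}")
    return candidate
```

Every file tool (read, edit, create, delete) would call this before touching the filesystem, so one check covers relative escapes, absolute paths, and symlinks alike.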
For some time I debated whether to add bash tooling to give coding agents more flexibility. The problem is that a shell also gives agents more room to cheat the benchmark. As reported in Poolside’s blog, agents can mine the git history, search for reference solutions on GitHub, and even scrape the web. Countering that is hard: it takes better steering, reward-hack judges, and continuous sample reviews. For this reason I left the shell out, which can, admittedly, handicap some models.
Each run ends after at most 20 turns. At that point, a hidden test_security.py is moved into the repository and run against whatever the agent produced. The primary metric is binary: all security tests pass, or they don’t. A 90%-patched vulnerability is still a vulnerability. The benchmark also runs the project’s existing regression suite, rejecting a fix that breaks previously supported behavior.
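The scoring rule can be sketched as a small function. The names (`SuiteReport`, `Outcome`, `score_run`) are hypothetical, and the ordering of the checks (security first, then regressions) is an assumption about how the two suites combine.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    SOLVED = "solved"
    REGRESSION = "regression"  # security tests pass, existing behavior broke
    UNSOLVED = "unsolved"


@dataclass(frozen=True)
class SuiteReport:
    passed: int
    failed: int


def score_run(security: SuiteReport, regression: SuiteReport) -> Outcome:
    """Binary primary metric: every security test must pass, with no partial credit."""
    # An empty security report is not a pass: zero executed tests proves nothing.
    if security.failed > 0 or security.passed == 0:
        return Outcome.UNSOLVED
    if regression.failed > 0:
        return Outcome.REGRESSION
    return Outcome.SOLVED
```

For example, `score_run(SuiteReport(9, 1), SuiteReport(100, 0))` is `Outcome.UNSOLVED`: nine of ten security tests passing still counts as a vulnerable patch.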
Secondary signals capture cost and behavior: total tokens consumed, number of tool calls, and how many reads and searches happen before the first edit. A model that explores extensively before touching anything is behaving differently from one that edits early and iterates. These signals don’t change the leaderboard, but they tell you what you’re actually paying for.
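The pre-edit signals can be derived from a run's ordered list of tool calls. In the sketch below the tool names (`read_file`, `search`, `edit_file`, and so on) are assumptions for illustration; the harness may name them differently.

```python
from collections import Counter
from typing import Iterable

# Assumed names for the tools that modify the repository.
EDIT_TOOLS = {"edit_file", "create_file", "delete_file"}


def pre_edit_profile(tool_calls: Iterable[str]) -> dict:
    """Count reads and searches that happen before the first edit in a trace."""
    calls = list(tool_calls)
    first_edit = next((i for i, name in enumerate(calls) if name in EDIT_TOOLS), None)
    # If the agent never edits, the whole trace counts as exploration.
    prefix = calls if first_edit is None else calls[:first_edit]
    counts = Counter(prefix)
    return {
        "tool_calls": len(calls),
        "reads_before_edit": counts["read_file"],
        "searches_before_edit": counts["search"],
        "edited": first_edit is not None,
    }
```

Averaging these dictionaries over a model's runs would give per-model figures of the kind reported alongside the solve rates; the `edited` flag separates runs that were abandoned without any change.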
Real code, real flaws
CVE-Bench covers 20 real CVEs across 18 Python projects — Pillow, GitPython, yt-dlp, urllib3, and others — spanning 15 CWE categories and CVSS scores from 2.1 to 9.8. All are from late 2025 and early 2026, sourced from the GitHub Advisory Database, which links each CVE directly to the fix commit — the detail that makes a benchmark like this possible.
I filtered out monorepos, fixes that touch compiled languages alongside Python, and fixes requiring significant API refactoring. This keeps the benchmark tractable, but it also skews the task set toward compact, self-contained patches.
Every task has a setup script that initializes the vulnerable repository in a container, and a test_security.py that fails on the vulnerable commit and passes on the fixed one. I had originally intended to use the maintainers’ own tests as the ground truth. Quickly I found that many fixes ship with no tests at all. I started writing them myself, but it was slow and cumbersome. The workaround was to generate them with Claude Sonnet — providing the advisory, the original code, and the fix — and validate each one against both commits. It worked surprisingly well, and it was the only way to get the benchmark off the ground.
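The two-commit validation can be expressed as a simple gate. In this sketch `run_tests` is a hypothetical callable that checks out a commit, runs `test_security.py`, and returns whether it passed; how that happens (container, pytest invocation) is left abstract.

```python
from typing import Callable

# run_tests(commit) -> True when test_security.py passes at that commit (assumed interface).
RunTests = Callable[[str], bool]


def validate_security_test(run_tests: RunTests, vulnerable: str, fixed: str) -> str:
    """Accept a generated test only if it separates the two commits."""
    fails_on_vulnerable = not run_tests(vulnerable)
    passes_on_fixed = run_tests(fixed)
    if fails_on_vulnerable and passes_on_fixed:
        return "valid"
    if not fails_on_vulnerable:
        return "does not detect the vulnerability"
    return "rejects the maintainer's fix"
```

This gate only proves that a test distinguishes the two known commits. It cannot tell whether the test is tied too closely to the maintainer's particular implementation, which is the kind of defect the 2026-05-28 revision corrected in five tests.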
To reduce contamination risk, 16 of the 20 CVEs were publicly disclosed after March 2026. That doesn’t eliminate the risk entirely: CVEs are often disclosed months after the fix ships, so a model may have seen specific commits. What’s less likely is that the full chain — advisory, vulnerable code, and fix — appears together in training data in a form that would directly short-circuit the task.
The leaderboard
Five models — gpt-5.5, gpt-5.4-mini, gpt-5.4-nano, laguna-m.1, and laguna-xs.2 — ran against all 20 CVEs under all three conditions, for 300 runs total. The leaderboard below shows solve rates by model and prompt type. Three things are worth noting before you read it: the within-family gaps are smaller than they look, the cross-family separation is the only statistically confirmed result, and cost tells a different story than capability.
| Model | Total solved | Advisory | Diagnose | Locate | Avg input tokens | Avg output tokens | Avg tool calls | Reads before edit | Searches before edit |
|---|---|---|---|---|---|---|---|---|---|
| Large models | | | | | | | | | |
| gpt-5.5 | 30 / 60 (50%) | 12 / 20 | 8 / 20 | 10 / 20 | 164,553 | 4,687 | 19.3 | 7.7 | 5.3 |
| Medium models | | | | | | | | | |
| gpt-5.4-mini | 26 / 60 (43%) | 10 / 20 | 10 / 20 | 6 / 20 | 99,966 | 1,262 | 13.5 | 3.5 | 3.9 |
| laguna-m.1 | 19 / 60 (32%) | 9 / 20 | 4 / 20 | 6 / 20 | 352,980 | 4,545 | 19.1 | 7.3 | 2.2 |
| Small models | | | | | | | | | |
| gpt-5.4-nano | 29 / 60 (48%) | 10 / 20 | 11 / 20 | 8 / 20 | 128,132 | 1,396 | 14.0 | 3.0 | 3.4 |
| laguna-xs.2 | 20 / 60 (33%) | 8 / 20 | 6 / 20 | 6 / 20 | 426,895 | 5,408 | 19.6 | 6.5 | 1.6 |
The four cross-family pairwise comparisons involving gpt-5.5 and gpt-5.4-nano reach statistical significance at α = 0.05 (McNemar test with continuity correction, n = 60 paired runs per model pair): gpt-5.5 vs laguna-m.1 (p = 0.015), gpt-5.4-nano vs laguna-m.1 (p = 0.017), gpt-5.5 vs laguna-xs.2 (p = 0.028), gpt-5.4-nano vs laguna-xs.2 (p = 0.040). Within-family comparisons remain far from significance; those rankings should be read as approximate.
No model reliably fixes real vulnerabilities. The best-performing model (gpt-5.5) solves 50% of tasks overall and 60% under the most favorable condition, when the full advisory is handed directly to the agent. With a precise location but no description of the flaw (locate), every model solves fewer tasks than it does with the full advisory.
Both an exact one-sided sign test and the more conservative McNemar test with continuity correction agree that the four cross-family pairs involving gpt-5.5 and gpt-5.4-nano cross α = 0.05: gpt-5.5 vs laguna-m.1 (p = 0.015, 16 exclusive wins vs 5), gpt-5.4-nano vs laguna-m.1 (p = 0.017, 14 vs 4), gpt-5.5 vs laguna-xs.2 (p = 0.028, 16 vs 6), and gpt-5.4-nano vs laguna-xs.2 (p = 0.040, 15 vs 6). Within-family pairs remain far from significance. The structure of the ranking is consistent: the three OpenAI models are statistically indistinguishable from one another, the two Laguna models are indistinguishable from each other, and the confirmed separation runs between families.
The task set splits into three rough clusters: 4 CVEs were solved by no model on any prompt type, 3 were solved by all five models on advisory, and 13 fall in between (which is where all the interesting variation lives).
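The McNemar comparison can be recomputed from the exclusive-win counts quoted above using only the standard library. This is a sketch rather than the post's own analysis code; `mcnemar_cc` is a hypothetical helper, and halving the two-sided p-value to obtain a one-sided one is an assumed convention.

```python
import math


def mcnemar_cc(wins_a: int, wins_b: int) -> tuple[float, float, float]:
    """Continuity-corrected McNemar test on discordant pairs.

    wins_a / wins_b are runs solved by only one of the two models.
    Returns (chi2, two_sided_p, one_sided_p).
    """
    n = wins_a + wins_b
    if n == 0:
        raise ValueError("no discordant pairs")
    chi2 = max(abs(wins_a - wins_b) - 1, 0) ** 2 / n
    # Chi-square survival function for 1 degree of freedom.
    two_sided = math.erfc(math.sqrt(chi2 / 2))
    # One-sided p for "A beats B" (assumed convention: half the two-sided value).
    one_sided = two_sided / 2 if wins_a >= wins_b else 1 - two_sided / 2
    return chi2, two_sided, one_sided


pairs = {
    "gpt-5.5 vs laguna-m.1": (16, 5),
    "gpt-5.4-nano vs laguna-m.1": (14, 4),
    "gpt-5.5 vs laguna-xs.2": (16, 6),
    "gpt-5.4-nano vs laguna-xs.2": (15, 6),
}
for name, (a, b) in pairs.items():
    chi2, two_sided, one_sided = mcnemar_cc(a, b)
    print(f"{name}: chi2={chi2:.2f} two-sided p={two_sided:.3f} one-sided p={one_sided:.3f}")
```

With these counts the one-sided values land within about 0.001 of the quoted p-values, which suggests, but does not confirm, that the reported figures are one-sided. The two-sided values are twice as large, so the convention matters when reading results against α = 0.05.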
Spending more doesn’t help. gpt-5.5 costs roughly 12× more per run than gpt-5.4-mini for statistically equivalent outcomes within the OpenAI family. The behavioral split makes this concrete: mini and nano act quickly, averaging 13–14 tool calls per run with almost no abandoned runs. gpt-5.5 and laguna-m.1 deliberate, averaging 19+ tool calls, and abandon without editing in 16–20 runs out of 60. laguna-xs.2 also averages 19+ tool calls but attempts an edit in most runs, despite hitting the turn ceiling nearly every time. None of this extra deliberation translates into better outcomes. The roughly 4× gap in average input tokens between the leanest and heaviest model (99,966 for gpt-5.4-mini vs 426,895 for laguna-xs.2) is large enough to be the primary practical differentiator.
Regression failures are not uniformly low. gpt-5.4-nano introduces regressions in 8 runs out of 60; laguna-m.1 and laguna-xs.2 each do so in 6; mini follows at 4; gpt-5.5 stays at 2. A patch that fixes the security test while breaking existing behavior is a distinct failure mode from not fixing it at all.
Deep dive into failures
A model that never touches the code and one that confidently patches the wrong part of the vulnerability count the same in the leaderboard. The solve rate doesn’t tell you which happened. The traces do — and they reveal four recurring failure patterns, each pointing to a different underlying gap.
Wrong-search drift. On CVE-2026-33175 (Auth0 unverified email bypass), gpt-5.5 opened the right file on Turn 3. Rather than making the straightforward addition (a two-line email verification check directly in auth0.py), the model concluded the authentication flow must be handled by the base class and immediately pivoted to reading oauth2.py, covering it in four separate reads across Turns 4–8 (roughly 1,400 lines in total). It then browsed the test suite and read the unrelated google.py provider file. On Turn 14, it finally searched for email_verified – receiving no matches found, since adding that field is the fix. Six more turns of searching followed; the budget expired without a single edit. The same drift pattern appears in CVE-2026-26331 (yt-dlp netrc injection): the model found and read the vulnerable function on Turns 2–3, then spent the remaining 17 turns drifting through postprocessors, test data, and unrelated extractor files before the budget expired. A single incorrect inference from an early read was enough to abandon a correct plan before it was ever executed.
Budget exhaustion mid-implementation. On CVE-2026-42561 (python-multipart header DoS), gpt-5.5 read the parser state machine carefully across multiple turns and correctly identified that the fix required enforcing header count and size limits. On the last available turn (Turn 20, diagnose condition), it added three config-class annotations – MAX_HEADER_COUNT, MAX_HEADER_SIZE, and MAX_HEADER_VALUE_SIZE – to FormParserConfig. It never wired them into the parser: no __init__ parameter changes, no state machine enforcement, no MultipartParseError raises. The understanding seems to be complete; the budget ran out between scaffolding and implementation. The same pattern – correct diagnosis, incomplete fix – appears in CVE-2026-44431 (urllib3 proxy SSRF), where gpt-5.5 re-read connectionpool.py four times in the advisory run and three times in diagnose, hitting the turn ceiling both times without committing to an edit.
Partial fix. A recurring pattern across CVEs is a model that makes real, coherent edits to the right code, runs its own tests, sees them pass, and stops — while the hidden security tests cover vectors the model did not implement. The fix is correct in spirit but incomplete in coverage, and the model has no signal to push further. This is a direct consequence of the agent not having access to the security tests: visible tests all pass, so there is no feedback that anything remains broken.
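A toy illustration of the partial-fix pattern follows. It is entirely hypothetical and is not taken from any CVE in the benchmark: the "patch" blocks the reported payload, the test the agent can see passes, and a hidden test that covers a second vector still fails.

```python
def is_safe_member(name: str) -> bool:
    """Toy 'patched' check for archive member names (illustrative only).

    The patch rejects '..' segments, the vector named in the report,
    but says nothing about absolute paths.
    """
    return ".." not in name.split("/")


def visible_test() -> bool:
    # What the agent can run: normal names accepted, the reported payload rejected.
    return is_safe_member("docs/readme.txt") and not is_safe_member("../../etc/passwd")


def hidden_security_test() -> bool:
    # What the benchmark runs afterwards: every escape vector must be rejected.
    payloads = ["../../etc/passwd", "docs/../../secret", "/etc/passwd"]
    return all(not is_safe_member(p) for p in payloads)
```

Here `visible_test()` passes while `hidden_security_test()` fails on the absolute path, so nothing the agent can observe tells it the patch is incomplete.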
Correct file, wrong part of the vulnerability. On CVE-2026-40864 (JupyterHub XSRF bypass), gpt-5.4-mini found the right file in its diagnose run, made a coherent edit, passed every regression test, and still failed the security test. The model correctly identified an overly broad exemption in the XSRF logic and tightened it, but fixed the wrong exemption – removing the navigate/unspecified branch while leaving no-cors exempt, which is the actual vulnerable path. No regression test covered it, so the model had no signal that its patch was incomplete. This is the most operationally dangerous failure mode: a plausible, test-passing fix with no visible indication anything is wrong.
Four CVEs were not solved by any model across all 15 runs (5 models × 3 prompt types): CVE-2026-26331, CVE-2026-44431, CVE-2026-44432, and GHSA-r758-8hxw-4845. These are not benchmark defects: the models made real edits in most cases, but no edit was ever sufficient to pass the security tests. All four had sparser editing than the rest of the task set, with several individual runs making no changes at all. In no case did the test infrastructure fail to detect a correct fix; the fixes were simply never produced.
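Clusters such as "solved by no model" can be derived mechanically from per-run records. The sketch below assumes a hypothetical `Run` record with one row per model, task, and prompt type; the field names and cluster labels are illustrative.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Run(NamedTuple):
    model: str
    task: str
    condition: str  # "advisory", "diagnose", or "locate"
    solved: bool


def cluster_tasks(runs: Iterable[Run]) -> dict[str, str]:
    """Label each task as unsolved, solved by all models on advisory, or mixed."""
    by_task: dict[str, list[Run]] = defaultdict(list)
    for run in runs:
        by_task[run.task].append(run)

    clusters: dict[str, str] = {}
    for task, task_runs in by_task.items():
        advisory = [r for r in task_runs if r.condition == "advisory"]
        if not any(r.solved for r in task_runs):
            clusters[task] = "unsolved"  # no model, no prompt type
        elif advisory and all(r.solved for r in advisory):
            clusters[task] = "solved by all on advisory"
        else:
            clusters[task] = "mixed"
    return clusters
```

Applied to the full 300-run result set, a function like this should place a task in the "unsolved" cluster only when all 15 of its runs failed.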
The prompt-type breakdown reveals one possible signal: gpt-5.5 loses the smallest share of its advisory solves on locate (12/20 to 10/20), closely followed by gpt-5.4-nano (10/20 to 8/20). laguna-xs.2 also loses two solves, but from a lower base (8/20 to 6/20), while laguna-m.1 and gpt-5.4-mini drop by three and four. These differences are within noise for every model individually, so this is a trend to watch as the task set scales, not a confirmed finding.
One result cuts against the aggregate ranking: on CVE-2026-30930 (Glances TimescaleDB SQL injection), both Laguna models pass locate while no OpenAI model does. The traces show why. On the locate condition, the agent receives only the file and function name, with no description of the flaw. Both laguna-m.1 and laguna-xs.2 read the file on Turn 1 and had a diagnosis by Turn 2: “clear SQL injection vulnerability.” They then spent several more turns confirming the approach, checking related export modules and psycopg adapter patterns, before committing to edits on Turns 10 and 12 respectively. gpt-5.5 also read the file first and correctly identified normalize() as the target, then spent the next 17 turns searching psycopg imports, conftest files, and nonexistent test paths before making any edit, hitting the ceiling mid-implementation. The Laguna models diagnosed early and executed; gpt-5.5 kept searching for external confirmation until the budget ran out. Aggregate rankings run one way; on tasks that reward decisiveness over thoroughness, the dynamic can reverse.
One behavioral pattern distinctive to the Laguna family is worth recording. Both laguna-m.1 (13/60 runs) and laguna-xs.2 (9/60 runs) call a shell tool to execute validation code directly, for example to run the patched module against a crafted input or to inspect internal state mid-fix. The tool does not exist in the harness; every call errors immediately. The model retries across multiple turns regardless, sometimes spending several consecutive turns on failed shell calls before abandoning the attempt. No OpenAI model does this. Whether it reflects a reasoning habit or simply a trained reflex is unclear, but it is consistent enough to treat as a signal rather than noise, and it points to models trained for richer toolsets than CVE-Bench provides. That is not a flaw in the models; it is a mismatch between their expectations and the sandbox. For practitioners, it is a reminder that tool availability assumptions are baked into model behavior in ways that aggregate benchmarks do not surface.
The bar worth keeping
I set out hoping to find a clear winner. The data instead draws a cost-efficiency conclusion. No model reliably outperforms any other within its family: the OpenAI models are statistically indistinguishable from one another, and so are the two Laguna models. The cross-family separation is confirmed for the four OpenAI-vs-Poolside pairs involving gpt-5.5 and gpt-5.4-nano, all of which cross α = 0.05 under McNemar’s test, a standard paired comparison that checks whether the wins and losses between two models are lopsided enough to be real. Within families, the gaps are noise. A power analysis makes this concrete: detecting even a meaningful within-family edge would require roughly 700 tasks. At current capability levels, the performance gap between gpt-5.5 and gpt-5.4-mini is too small to justify a 12× cost increase per run. The cheaper OpenAI models are the rational choice.
What matters more than the ranking is what the failure modes reveal. Wrong-search drift, budget exhaustion mid-implementation, plausible-but-incomplete patches that pass every visible test — these are not random noise. They are specific capability gaps that show up consistently enough to be actionable. A practitioner deploying agents for security patching will hit all of them. Knowing which failure modes dominate for a given model and task class is often more useful than a leaderboard position.
The locate condition is the benchmark’s sharpest tool. Strip the advisory and give the model only a file and function name: no description of the flaw, no attack scenario, just code to read cold. Every model drops, with gpt-5.5 and gpt-5.4-nano losing the smallest share of their advisory solves (two each; laguna-xs.2 also loses two, from a lower base). That relative resilience is the closest thing to a genuine signal in the data: a hint that locate performance, as the task set scales, may be where models actually differentiate. Advisory performance is noisy by construction, inflated by report quality and instruction-following. Locate is where genuine security reasoning would show up, and it mostly doesn’t. Yet.
The locate condition points to what would actually constitute progress: a model that reads unfamiliar code cold and recognizes independently that something is wrong. No publicly available frontier model does this reliably yet. That’s a bar worth keeping.
Caveats
Contamination is an open problem. All CVEs in the task set are from late 2025 and early 2026, after the training cutoffs of all evaluated models. That reduces but does not eliminate exposure risk: CVEs become public only after the fix is merged and released, so the patch commit may predate the CVE disclosure by months or years. It is not impossible that a model has seen a specific fix. What is less likely is that the full chain (advisory text, vulnerable code, and fix) appears together in training data in a form that would directly short-circuit the task. I’m not aware of any principled way to verify this without access to training corpora.
The task set is narrow by design, and that is a limitation. Twenty CVEs, all in Python, all fixes localized to one or a small number of files within a single project. The curation filters exclude monorepos, fixes that touch compiled languages alongside Python, and fixes that require significant API refactoring. As a side effect, this skews the set toward vulnerabilities with compact, self-contained patches. The CWE distribution reflects that: roughly half the tasks are injection-class issues (path traversal, SQL injection, command injection), with the remainder spread across DoS, authentication bypass, deserialization, and XSS. More complex vulnerability classes, such as those requiring protocol-level changes, coordinated multi-service fixes, or schema migrations, are not represented. The statistical power is correspondingly limited: with 60 runs per model, within-family comparisons remain underpowered, and those rankings should be read as approximate.
Pain points
Building this dataset was anything but trivial. First, I had to dig into software security, something I mostly avoided in my career since I worked mainly on data pipelines and research engineering.
Right from the beginning, I was shocked by how lax some maintainers can be. It’s quite common for devs to ship fixes without any tests at all. In some cases, I could spot that the fix wasn’t sufficient. In others, the developer fixed the reported vulnerability and introduced another. Honestly, I should have reported these, but I didn’t. That’s on me.
Setting up the environments was another painful experience. Some repositories don’t have many regression tests, while others have thousands of them. Some depend on databases, others on networking. Some have lots of external dependencies, while others rely on system libraries. It’s not easy to make them uniform enough to benchmark agents. Gathering the tasks and rebuilding each ecosystem in a reproducible way took much more time than I expected.
And there was also the inference co$t$. In total, I put nearly $100 into this experiment, nearly 5× the budget I initially planned. My original idea was to compare more models on a larger dataset. I quickly saw the bill climb faster than my wife had authorized… In particular, costs exploded with Anthropic models. They’re so expensive that I had to cut them from the scope. Poolside, on the other hand, offered free model access during the period of this work, which made it possible to include their models in the evaluation.
The benchmark, task files, and result data are all open. See the repository. Contributions and task submissions are welcome.
Citation
@misc{gattipinheiro2026cvebench,
author = {Gatti Pinheiro, Giovanni},
title = {{CVE-Bench}: Benchmarking {LLM} Agents on Real-World Security Vulnerability Fixes},
year = {2026},
howpublished = {\url{https://giovannigatti.github.io/cve-bench}},
note = {Code available at \url{https://github.com/GiovanniGatti/cve-bench}}
}