CVE-Bench: Benchmarking LLM Agents on Real-World Security Vulnerability Fixes

Giovanni Gatti Pinheiro

I Tested Whether AI Can Fix Security Vulnerabilities. Well, It's Complicated.

Revisions

2026-06-01 — Post rewritten for improved structure and storytelling. All numbers and statistical conclusions are unchanged.
2026-05-28 — Five security tests were found to reject valid alternative fixes that nonetheless addressed the reported vulnerability. Results were recalculated after correcting the tests. Solve rates increased by 3–7 points per model; the ranking order is unchanged, but cross-family pairwise comparisons that previously fell short of significance now cross α = 0.05 under McNemar with continuity correction. All affected numbers and statistical conclusions have been updated.

TL;DR: I evaluated five frontier models (gpt-5.5, gpt-5.4-mini, gpt-5.4-nano, laguna-m.1, laguna-xs.2) on fixing 20 real CVEs. Even at the frontier, AI fixing is unreliable: the best solve rate is 50% overall and 60% under the most favorable condition. More troubling than the failures themselves is how the models fail: the most dangerous pattern is a patch that looks right, passes every visible test, and leaves the vulnerability intact. False confidence at scale is its own attack surface. The practical cost conclusion is blunt: the expensive models are statistically indistinguishable from cheaper alternatives within the same family, at up to 12× the cost per run.

The agent edited the right file, passed every regression test, and confidently said the bug was fixed. But it wasn’t. The vulnerability was still there: a different branch of the same logic, untouched. Without the sharp eyes of a security researcher, the agent’s plausible-but-incomplete patch would ship undetected. This is the most operationally dangerous failure mode I found, and it showed up repeatedly across models and tasks.

Anthropic recently reported scanning open-source software and finding 1,596 vulnerabilities. As of May 22, 97 have been patched. Their conclusion: discovery is now the easy part; verification, triage, and patching are the bottleneck. I wanted to measure that bottleneck directly: not at the finding stage, but at the fix.

I wanted a real test. So I built CVE-Bench: twenty real-world CVEs, five models, three prompt conditions, each agent running in a sandboxed container and scored against security tests derived from the maintainer’s own fix.

The goal isn’t to rank models, but to understand how they fail.

Advisory, diagnose, locate

The obvious starting point was to hand models the real-world security advisories and see if they could fix the vulnerabilities. When a security researcher finds a flaw, they write an advisory — a structured description of the vulnerability: what it is, how an attacker can exploit it, which code paths are affected. This gets coordinated with maintainers privately, then published once the fix ships. It’s the richest description of a flaw a developer would receive from the outside.

Some advisories are nearly prescriptive: they name the file, the function, the attack vector. Others are thin: a short description with no location and no attack scenario. Giving the agent the full advisory only tells you how well models can map a described vulnerability onto real code, not whether they understand it. It’s akin to a software developer carefully prompting their favorite agent to fix a bug. I wanted to know if there’s something more than pattern matching happening under the hood.

So I added two conditions designed to strip away the shortcuts.

The diagnose condition is closer to the triage side: the model gets an exploit report — an attacker can do X — but no file path, no function name, no code. It has to search the codebase, form a hypothesis about where the bug lives, and fix it. This tests whether the model can reason from symptoms to cause.

The locate condition flips that: precise file and function name, but only a hint of what’s wrong. The agent has to read that specific code and identify the flaw. It’s what a security auditor does when handed a specific module to review: no bug report, no description, just code and a mandate to find the vulnerability.

The three conditions test meaningfully different things, and that is precisely the point. A model that does well on advisory but drops on diagnose can’t translate a behavioral description into a location in the codebase. A model that holds up on locate is recognizing dangerous code on its own. The profile across all three tells you something the aggregate solve rate never could: whether the model genuinely understands security or just follows instructions.
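To make the information boundaries concrete, here is a minimal sketch of how the three prompts could be assembled from one task record. The `Task` fields, the `Condition` enum, and `build_prompt` are hypothetical names chosen for illustration; the actual task files in the repository may be organized differently.

```python
from dataclasses import dataclass
from enum import Enum


class Condition(Enum):
    ADVISORY = "advisory"
    DIAGNOSE = "diagnose"
    LOCATE = "locate"


@dataclass(frozen=True)
class Task:
    # All field names are assumptions made for this sketch.
    advisory: str        # full published advisory text
    exploit_report: str  # "an attacker can do X", with no locations
    file_path: str       # file containing the flaw
    function_name: str   # function containing the flaw
    hint: str            # short nudge used by the locate condition


def build_prompt(task: Task, condition: Condition) -> str:
    """Return only the information the given condition is allowed to reveal."""
    if condition is Condition.ADVISORY:
        return task.advisory
    if condition is Condition.DIAGNOSE:
        # Symptoms only: no file path, no function name, no code.
        return task.exploit_report
    if condition is Condition.LOCATE:
        # Precise location plus a hint, but no description of the flaw.
        return f"Review {task.function_name} in {task.file_path}. {task.hint}"
    raise ValueError(f"unknown condition: {condition}")
```

Keeping the three prompts as projections of a single record makes it harder for location details to leak into the diagnose condition by accident.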

Inside the sandbox

CVE-Bench starts a Docker container where the agent has access to the vulnerable project’s source code and one of the three task descriptions above. The agent can navigate and modify the codebase using a constrained toolset: list files, read files, search across the codebase, edit files, create or delete files, and run pytest. Execution is sandboxed to the repository — the agent cannot read, write, or execute outside the allowed folders.
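One common way to confine file tools to a repository is to resolve every requested path and reject anything that lands outside the root. The sketch below illustrates that idea only; `resolve_inside` is a hypothetical helper, not the harness's actual code, and the container itself remains the real isolation boundary.

```python
from pathlib import Path


def resolve_inside(repo_root: Path, requested: str) -> Path:
    """Resolve `requested` against `repo_root`, rejecting anything outside it."""
    root = repo_root.resolve()
    candidate = (root / requested).resolve()
    # resolve() collapses ".." segments and follows symlinks, so the check
    # applies to the real target rather than the string the agent sent.
    if candidate != root and root not in candidate.parents:
        raise PermissionError(f"path escapes sandbox: {requested}")
    return candidate
```

An absolute path such as `/etc/passwd` is also rejected, because joining it onto the root discards the root and the resolved result no longer has the root among its parents.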

I spent some time deciding whether to add bash tooling to give coding agents more flexibility. The problem is that the same flexibility makes it easier to cheat the benchmark. As reported in Poolside’s blog, agents can mine the git history, search for reference solutions on GitHub, and even scrape the web. Countering that is hard: it takes better steering, reward-hack judges, and continuous sample reviews. For this reason, I left bash out, which admittedly handicaps some models.

Each run ends after at most 20 turns. At that point, a hidden test_security.py is moved into the repository and run against whatever the agent produced. The primary metric is binary: all security tests pass, or they don’t. A 90%-patched vulnerability is still a vulnerability. The benchmark also runs the project’s existing regression suite, rejecting a fix that breaks previously supported behavior.
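The scoring rule above can be sketched as a small classifier. The outcome labels follow the ones used in the figures later in the post; the function signature and the handling of runs that fail both suites are assumptions made for this sketch.

```python
from enum import Enum


class Outcome(Enum):
    SOLVED = "solved"
    REGRESSION = "regression"
    FAILED = "failed"
    NO_EDIT = "no edit attempted"


def classify_run(edited: bool, security_passed: int, security_total: int,
                 regression_ok: bool) -> Outcome:
    """Map one finished run to a single outcome label."""
    if not edited:
        return Outcome.NO_EDIT
    # Binary metric: every security test must pass. Partial credit is not given.
    if security_passed < security_total:
        return Outcome.FAILED
    # The security tests pass, but the patch must not break existing behavior.
    return Outcome.SOLVED if regression_ok else Outcome.REGRESSION
```

Under this rule a run that passes 9 of 10 security tests is scored exactly like one that passes none.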

Secondary signals capture cost and behavior: total tokens consumed, number of tool calls, and how many reads and searches happen before the first edit. A model that explores extensively before touching anything is behaving differently from one that edits early and iterates. These signals don’t change the leaderboard, but they tell you what you’re actually paying for.
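As an illustration of the exploration signal, the sketch below counts reads and searches that happen before the first modifying call in a run. The tool names are placeholders, and treating file creation and deletion as edits is an assumption; the harness may name and group its tools differently.

```python
from typing import Iterable, Tuple

# Hypothetical tool names; any call that changes the repository counts as an edit.
MODIFYING_TOOLS = {"edit_file", "create_file", "delete_file"}


def exploration_before_first_edit(tool_calls: Iterable[str]) -> Tuple[int, int, bool]:
    """Return (reads, searches, edited) for one run's ordered tool calls."""
    reads = searches = 0
    for name in tool_calls:
        if name in MODIFYING_TOOLS:
            return reads, searches, True
        if name == "read_file":
            reads += 1
        elif name == "search":
            searches += 1
    # The run ended without ever modifying the repository.
    return reads, searches, False
```

For example, a trace of `["list_files", "read_file", "search", "read_file", "edit_file", "run_pytest"]` yields two reads and one search before the first edit.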

Real code, real flaws

CVE-Bench covers 20 real CVEs across 18 Python projects — Pillow, GitPython, yt-dlp, urllib3, and others — spanning 15 CWE categories and CVSS scores from 2.1 to 9.8. All are from late 2025 and early 2026, sourced from the GitHub Advisory Database, which links each CVE directly to the fix commit — the detail that makes a benchmark like this possible.

I filtered out monorepos, fixes that touch compiled languages alongside Python, and fixes requiring significant API refactoring. This keeps the benchmark tractable, but it also skews the task set toward compact, self-contained patches.

Every task has a setup script that initializes the vulnerable repository in a container, and a test_security.py that fails on the vulnerable commit and passes on the fixed one. I had originally intended to use the maintainers’ own tests as the ground truth, but I quickly found that many fixes ship with no tests at all. I started writing them myself, but it was slow and cumbersome. The workaround was to generate them with Claude Sonnet (providing the advisory, the original code, and the fix) and validate each one against both commits. It worked surprisingly well, and it was the only way to get the benchmark off the ground.
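The two-commit validation can be sketched as follows. This is a minimal illustration, assuming one checkout per commit; `run_pytest` and `is_discriminating` are hypothetical names. Requiring exit code 1 on the vulnerable commit (tests ran and failed) instead of any non-zero code is a stricter choice made here so that collection or import errors do not count as a "failing" security test.

```python
import subprocess
import sys
from typing import Callable


def run_pytest(repo_dir: str, test_file: str) -> int:
    """Run one test file inside a checkout and return pytest's exit code."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", test_file, "-q"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return result.returncode


def is_discriminating(vulnerable_dir: str, fixed_dir: str, test_file: str,
                      runner: Callable[[str, str], int] = run_pytest) -> bool:
    """True when the test fails on the vulnerable commit and passes on the fix."""
    # pytest exit code 1: tests were collected and at least one failed.
    fails_on_vulnerable = runner(vulnerable_dir, test_file) == 1
    # pytest exit code 0: all collected tests passed.
    passes_on_fixed = runner(fixed_dir, test_file) == 0
    return fails_on_vulnerable and passes_on_fixed
```

The `runner` parameter exists so the check can be exercised without real checkouts, for instance by passing a function that returns canned exit codes.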

To reduce contamination risk, 16 of the 20 CVEs were publicly disclosed after March 2026. That doesn’t eliminate the risk entirely: CVEs are often disclosed months after the fix ships, so a model may have seen specific commits. What’s less likely is that the full chain — advisory, vulnerable code, and fix — appears together in training data in a form that would directly short-circuit the task.

The leaderboard

Five models — gpt-5.5, gpt-5.4-mini, gpt-5.4-nano, laguna-m.1, and laguna-xs.2 — ran against all 20 CVEs under all three conditions, for 300 runs total. The leaderboard below shows solve rates by model and prompt type. Three things are worth noting before you read it: the within-family gaps are smaller than they look, the cross-family separation is the only statistically confirmed result, and cost tells a different story than capability.

Tier	Model	Total solved	Advisory	Diagnose	Locate	Avg input tokens	Avg output tokens	Avg tool calls	Reads before edit	Searches before edit
Large	gpt-5.5	30 / 60 (50%)	12 / 20	8 / 20	10 / 20	164,553	4,687	19.3	7.7	5.3
Medium	gpt-5.4-mini	26 / 60 (43%)	10 / 20	10 / 20	6 / 20	99,966	1,262	13.5	3.5	3.9
Medium	laguna-m.1	19 / 60 (32%)	9 / 20	4 / 20	6 / 20	352,980	4,545	19.1	7.3	2.2
Small	gpt-5.4-nano	29 / 60 (48%)	10 / 20	11 / 20	8 / 20	128,132	1,396	14.0	3.0	3.4
Small	laguna-xs.2	20 / 60 (33%)	8 / 20	6 / 20	6 / 20	426,895	5,408	19.6	6.5	1.6

All four cross-family pairwise comparisons reach statistical significance at α = 0.05 (McNemar test with continuity correction, n = 60 paired runs per model pair, i.e. 20 CVEs × 3 prompt conditions): gpt-5.5 vs laguna-m.1 (p = 0.015), gpt-5.4-nano vs laguna-m.1 (p = 0.017), gpt-5.5 vs laguna-xs.2 (p = 0.028), gpt-5.4-nano vs laguna-xs.2 (p = 0.040). Within-family comparisons remain far from significance; those rankings should be read as approximate.
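For readers who want to check the arithmetic, here is a minimal sketch of McNemar's test with continuity correction using only the standard library. The exclusive-win counts are the ones reported in this post (for example, 16 runs solved only by gpt-5.5 versus 5 solved only by laguna-m.1). Halving the chi-square tail to get a one-sided p-value is an assumption on my part; it is the reading that lands within rounding of the values quoted above.

```python
import math
from typing import Tuple


def mcnemar_cc(b: int, c: int) -> Tuple[float, float]:
    """Return (statistic, one-sided p) for discordant counts b and c.

    b = runs only the first model solved, c = runs only the second solved.
    The one-sided p-value is meaningful when b > c, as in every pair below.
    """
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    two_sided = math.erfc(math.sqrt(stat / 2))
    return stat, two_sided / 2


# Exclusive wins (first model, second model) as reported in the post.
PAIRS = {
    "gpt-5.5 vs laguna-m.1": (16, 5),
    "gpt-5.4-nano vs laguna-m.1": (14, 4),
    "gpt-5.5 vs laguna-xs.2": (16, 6),
    "gpt-5.4-nano vs laguna-xs.2": (15, 6),
}

for name, (b, c) in PAIRS.items():
    stat, p = mcnemar_cc(b, c)
    print(f"{name}: chi2 = {stat:.2f}, one-sided p = {p:.3f}")
```

Only the discordant runs enter the statistic; runs that both models solved or both failed carry no information about which model is better.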

No model reliably fixes real vulnerabilities. The best-performing model (gpt-5.5) solves 50% of tasks overall and 60% under the most favorable condition, when the full advisory is handed directly to the agent. With a precise location but no description of the flaw (locate), performance drops for every model.

Both an exact one-sided sign test and the more conservative McNemar test with continuity correction agree that all four cross-family pairs cross α = 0.05: gpt-5.5 vs laguna-m.1 (p = 0.015, 16 exclusive wins vs 5), gpt-5.4-nano vs laguna-m.1 (p = 0.017, 14 vs 4), gpt-5.5 vs laguna-xs.2 (p = 0.028, 16 vs 6), and gpt-5.4-nano vs laguna-xs.2 (p = 0.040, 15 vs 6). Within-family pairs remain far from significance. The structure of the ranking is consistent: the three OpenAI models are statistically indistinguishable from one another, the two Laguna models are indistinguishable from each other, and the confirmed separation runs between families.

The task set splits into three rough clusters: 4 CVEs were solved by no model on any prompt type, 3 were solved by all five models on advisory, and 13 fall in between (which is where all the interesting variation lives).
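The exact one-sided sign test mentioned above can be reproduced from the exclusive-win counts alone. This is a minimal sketch: under the null hypothesis each discordant run is a fair coin flip, and the p-value is the probability of seeing at least as many wins for the stronger model. The function name is mine.

```python
from math import comb


def sign_test_one_sided(wins: int, losses: int) -> float:
    """P(X >= wins) for X ~ Binomial(wins + losses, 0.5)."""
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# Exclusive wins (first model, second model) as reported in the post.
PAIRS = {
    "gpt-5.5 vs laguna-m.1": (16, 5),
    "gpt-5.4-nano vs laguna-m.1": (14, 4),
    "gpt-5.5 vs laguna-xs.2": (16, 6),
    "gpt-5.4-nano vs laguna-xs.2": (15, 6),
}

for name, (wins, losses) in PAIRS.items():
    print(f"{name}: exact one-sided p = {sign_test_one_sided(wins, losses):.4f}")
```

Because the test is exact, it needs no large-sample approximation, which matters when a pair has only around 20 discordant runs.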

Spending more doesn’t help. gpt-5.5 costs roughly 12× more per run than gpt-5.4-mini for statistically equivalent outcomes within the OpenAI family. The behavioral split makes this concrete: mini and nano act quickly, averaging 13–14 tool calls per run with almost no abandoned runs. gpt-5.5 and laguna-m.1 deliberate, averaging 19+ tool calls, and abandon without editing in 16–20 runs out of 60. laguna-xs.2 also averages 19+ tool calls but attempts an edit in most runs, despite hitting the turn ceiling nearly every time. None of this extra deliberation translates into better outcomes. The roughly 4× gap in tokens per run between the leanest model (gpt-5.4-mini) and the heaviest (laguna-xs.2) is large enough to be the primary practical differentiator.
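The token gap can be recomputed directly from the leaderboard averages. The sketch below works in tokens only; it does not attempt to reproduce the 12× cost figure, because per-token prices are not listed in this post. "Tokens per solve" is a derived quantity I am introducing for illustration.

```python
# (avg input tokens, avg output tokens, runs solved), copied from the leaderboard.
LEADERBOARD = {
    "gpt-5.5": (164_553, 4_687, 30),
    "gpt-5.4-mini": (99_966, 1_262, 26),
    "laguna-m.1": (352_980, 4_545, 19),
    "gpt-5.4-nano": (128_132, 1_396, 29),
    "laguna-xs.2": (426_895, 5_408, 20),
}
RUNS_PER_MODEL = 60


def tokens_per_run(model: str) -> int:
    """Average input plus output tokens for one run."""
    avg_in, avg_out, _ = LEADERBOARD[model]
    return avg_in + avg_out


def tokens_per_solve(model: str) -> float:
    """Total tokens spent across all runs, divided by the number of solves."""
    solved = LEADERBOARD[model][2]
    return tokens_per_run(model) * RUNS_PER_MODEL / solved


lightest = min(LEADERBOARD, key=tokens_per_run)
heaviest = max(LEADERBOARD, key=tokens_per_run)
token_gap = tokens_per_run(heaviest) / tokens_per_run(lightest)

print(f"{heaviest} uses {token_gap:.1f}x the tokens per run of {lightest}")
for model in LEADERBOARD:
    print(f"{model}: {tokens_per_solve(model):,.0f} tokens per solve")
```

Dividing by solves instead of runs penalizes models that spend heavily on runs that end without a fix.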

Total tokens per run: more tokens do not mean more solves. Each dot is one run; colour shows outcome (green = solved, orange = regression, red = failed). The Laguna models consume 3–4× more tokens than OpenAI models in the same size tier, driven by longer, less decisive runs.

Where the turns go. Tool call breakdown by model (stacked, normalized); numbers above each bar are total tool calls. mini and nano commit to editing early. gpt-5.5 and laguna-m.1 read and search extensively before acting, and often abandon without editing at all.

Regression failures are not uniformly low. gpt-5.4-nano introduces regressions in 8 runs out of 60; laguna-m.1 and laguna-xs.2 each do so in 6; mini follows at 4; gpt-5.5 stays at 2. A patch that fixes the security test while breaking existing behaviour is a distinct failure mode from not fixing it at all.

How runs end. Outcome breakdown per model across all 60 runs. The larger "no edit attempted" share for gpt-5.5 and laguna-m.1 shows models that deliberated and gave up. The elevated regression bars for nano, laguna-m.1, and laguna-xs.2 show models that patched too aggressively.

Deep dive into failures

A model that never touches the code and one that confidently patches the wrong part of the vulnerability count the same in the leaderboard. The solve rate doesn’t tell you which happened. The traces do — and they reveal four recurring failure patterns, each pointing to a different underlying gap.

Wrong-search drift. On CVE-2026-33175 (Auth0 unverified email bypass), gpt-5.5 opened the right file on Turn 3. Rather than making the straightforward addition (a two-line email verification check directly in auth0.py), the model concluded the authentication flow must be handled by the base class and immediately pivoted to reading oauth2.py, covering it in four separate reads across Turns 4–8 (roughly 1,400 lines in total). It then browsed the test suite and read the unrelated google.py provider file. On Turn 14, it finally searched for email_verified – receiving no matches found, since adding that field is the fix. Six more turns of searching followed; the budget expired without a single edit. The same drift pattern appears in CVE-2026-26331 (yt-dlp netrc injection): the model found and read the vulnerable function on Turns 2–3, then spent the remaining 17 turns drifting through postprocessors, test data, and unrelated extractor files before the budget expired. A single incorrect inference from an early read was enough to abandon a correct plan before it was ever executed.

Budget exhaustion mid-implementation. On CVE-2026-42561 (python-multipart header DoS), gpt-5.5 read the parser state machine carefully across multiple turns and correctly identified that the fix required enforcing header count and size limits. On the last available turn (Turn 20, diagnose condition), it added three config-class annotations – MAX_HEADER_COUNT, MAX_HEADER_SIZE, and MAX_HEADER_VALUE_SIZE – to FormParserConfig. It never wired them into the parser: no __init__ parameter changes, no state machine enforcement, no MultipartParseError raises. The understanding seems to be complete; the budget ran out between scaffolding and implementation. The same pattern – correct diagnosis, incomplete fix – appears in CVE-2026-44431 (urllib3 proxy SSRF), where gpt-5.5 re-read connectionpool.py four times in the advisory run and three times in diagnose, hitting the turn ceiling both times without committing to an edit.

Partial fix. A recurring pattern across CVEs is a model that makes real, coherent edits to the right code, runs its own tests, sees them pass, and stops — while the hidden security tests cover vectors the model did not implement. The fix is correct in spirit but incomplete in coverage, and the model has no signal to push further. This is a direct consequence of the agent not having access to the security tests: visible tests all pass, so there is no feedback that anything remains broken.

Correct file, wrong part of the vulnerability. On CVE-2026-40864 (JupyterHub XSRF bypass), gpt-5.4-mini found the right file in its diagnose run, made a coherent edit, passed every regression test, and still failed the security test. The model correctly identified an overly broad exemption in the XSRF logic and tightened it, but fixed the wrong exemption – removing the navigate/unspecified branch while leaving no-cors exempt, which is the actual vulnerable path. No regression test covered it, so the model had no signal that its patch was incomplete. This is the most operationally dangerous failure mode: a plausible, test-passing fix with no visible indication anything is wrong.

Exploration depth before first edit by model and outcome — More exploration before the first edit correlates with failure. Average reads and searches before the first edit, split by outcome. Models that eventually solved tend to explore less before committing; failed runs show more pre-edit exploration, consistent with the drift and budget exhaustion patterns described above.

Four CVEs were not solved by any model across all 15 runs (5 models × 3 prompt types): CVE-2026-26331, CVE-2026-44431, CVE-2026-44432, and GHSA-r758-8hxw-4845. These are not benchmark defects: the models made real edits in most cases, but no edit was ever sufficient to pass the security tests. All four had sparser editing than the rest of the task set, with several individual runs making no changes at all. In no case did the test infrastructure fail to detect a correct fix; the fixes were simply never produced.

Solve rate by model and prompt type — Every model weakens on locate. Solve rate per model broken down by prompt type (advisory, diagnose, locate). gpt-5.5 and gpt-5.4-nano drop least; the Laguna models drop more on aggregate but outperform OpenAI models on specific tasks.
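The per-condition drop can be read straight off the leaderboard counts. The short sketch below computes the absolute and relative drop from advisory to locate for each model; the variable names are mine.

```python
# Solves out of 20 per prompt type, copied from the leaderboard.
SOLVES = {
    "gpt-5.5": {"advisory": 12, "diagnose": 8, "locate": 10},
    "gpt-5.4-mini": {"advisory": 10, "diagnose": 10, "locate": 6},
    "laguna-m.1": {"advisory": 9, "diagnose": 4, "locate": 6},
    "gpt-5.4-nano": {"advisory": 10, "diagnose": 11, "locate": 8},
    "laguna-xs.2": {"advisory": 8, "diagnose": 6, "locate": 6},
}

# Absolute drop: solves lost when the advisory is replaced by a location.
absolute_drop = {m: s["advisory"] - s["locate"] for m, s in SOLVES.items()}
# Relative drop: the same loss as a fraction of the advisory baseline.
relative_drop = {m: absolute_drop[m] / s["advisory"] for m, s in SOLVES.items()}

for model in sorted(SOLVES, key=relative_drop.get):
    print(f"{model}: -{absolute_drop[model]} solves ({relative_drop[model]:.0%})")
```

With only 20 tasks per condition, a difference of one or two solves is well within noise, so the ordering is suggestive at best.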

The prompt-type breakdown reveals one genuine signal: gpt-5.5 drops least on locate in relative terms (12/20 advisory to 10/20 locate), closely followed by gpt-5.4-nano (10/20 to 8/20). laguna-xs.2 also loses two solves (8/20 to 6/20), but from a lower base; laguna-m.1 drops by three and gpt-5.4-mini by four. The differences are within noise for every model individually, so this is a trend to watch as the task set scales, not a confirmed finding.

One result cuts against the aggregate ranking: on CVE-2026-30930 (Glances TimescaleDB SQL injection), both Laguna models pass locate while no OpenAI model does. The traces show why. On the locate condition, the agent receives only the file and function name, with no description of the flaw. Both laguna-m.1 and laguna-xs.2 read the file on Turn 1 and had a diagnosis by Turn 2: “clear SQL injection vulnerability.” They then spent several more turns confirming the approach – checking related export modules and psycopg adapter patterns – before committing to edits on Turns 10 and 12 respectively. gpt-5.5 also read the file first and correctly identified normalize() as the target, then spent the next 17 turns searching psycopg imports, conftest files, and nonexistent test paths before making any edit – hitting the ceiling mid-implementation. The Laguna models diagnosed early and executed; gpt-5.5 kept searching for external confirmation until the budget ran out. Aggregate rankings run one way; on tasks that reward decisiveness over thoroughness, the dynamic can reverse.

One behavioral pattern distinctive to the Laguna family is worth recording. Both laguna-m.1 (13/60 runs) and laguna-xs.2 (9/60 runs) call a shell tool to execute validation code directly – tool invocations like running the patched module against a crafted input, or inspecting internal state mid-fix. The tool does not exist in the harness; every call errors immediately. The model retries across multiple turns regardless, sometimes spending several consecutive turns on failed shell calls before abandoning the attempt. No OpenAI model does this. Whether it reflects a reasoning habit or simply a trained reflex is unclear, but it is consistent enough to treat as a signal rather than noise, and it points to models trained for richer toolsets than CVE-Bench provides. That is not a flaw in the models; it is a mismatch between their expectations and the sandbox. For practitioners, it is a reminder that tool availability assumptions are baked into model behavior in ways that aggregate benchmarks do not surface.

The bar worth keeping

I set out hoping to find a clear winner. The data instead draws a cost-efficiency conclusion. No model reliably outperforms any other within its family: the OpenAI models are statistically indistinguishable from one another, and so are the two Laguna models. The cross-family separation is confirmed — all four OpenAI-vs-Poolside pairs cross α = 0.05 under McNemar’s test, a standard paired comparison that checks whether the wins and losses between two models are consistent enough to be real — but within families, the gaps are noise. A power analysis makes this concrete: detecting even a meaningful within-family edge would require roughly 700 tasks. At current capability levels, the performance gap between gpt-5.5 and gpt-5.4-mini is too small to justify a 12× cost increase per run. The cheaper OpenAI models are the rational choice.

What matters more than the ranking is what the failure modes reveal. Wrong-search drift, budget exhaustion mid-implementation, plausible-but-incomplete patches that pass every visible test — these are not random noise. They are specific capability gaps that show up consistently enough to be actionable. A practitioner deploying agents for security patching will hit all of them. Knowing which failure modes dominate for a given model and task class is often more useful than a leaderboard position.

The locate condition is the benchmark’s sharpest tool. Strip the advisory and give the model only a file and function name: no description of the flaw, no attack scenario, just code to read cold. Every model drops. gpt-5.5 and gpt-5.4-nano hold up best, losing two solves each from advisory baselines of 12/20 and 10/20 (laguna-xs.2 also loses two, but from 8/20). That relative resilience is the closest thing to a genuine signal in the data: a hint that locate performance, as the task set scales, may be where models actually differentiate. Advisory performance is noisy by construction, inflated by report quality and instruction-following. Locate is where genuine security reasoning would show up, and it mostly doesn’t. Yet.

The locate condition points to what would actually constitute progress: a model that reads unfamiliar code cold and recognizes independently that something is wrong. No publicly available frontier model does this reliably yet. That’s a bar worth keeping.

Caveats

Contamination is an open problem. All CVEs in the task set are from late 2025 and early 2026, after the training cutoffs of all evaluated models. That reduces but does not eliminate exposure risk: CVEs become public only after the fix is merged and released, so the patch commit may predate the CVE disclosure by months or years. It is not impossible that a model has seen a specific fix. What is less likely is that the full chain (advisory text, vulnerable code, and fix) appears together in training data in a form that would directly short-circuit the task. I’m not aware of any principled way to verify this without access to training corpora.

The task set is narrow by design, and that is a limitation. Twenty CVEs, all in Python, all fixes localized to one or a small number of files within a single project. The curation filters exclude monorepos, fixes that touch compiled languages alongside Python, and fixes that require significant API refactoring. As a side effect, this skews the set toward vulnerabilities with compact, self-contained patches. The CWE distribution reflects that: roughly half the tasks are injection-class issues (path traversal, SQL injection, command injection), with the remainder spread across DoS, authentication bypass, deserialization, and XSS. More complex vulnerability classes, such as those requiring protocol-level changes, coordinated multi-service fixes, or schema migrations, are not represented. The statistical power is correspondingly limited: with 60 runs per model, within-family comparisons remain underpowered, and those rankings should be read as approximate.

Pain points

Building this dataset was anything but trivial. First, I had to dig into software security, something I mostly avoided in my career since I worked mainly on data pipelines and research engineering.

Right from the beginning, I was shocked by how lax some maintainers can be. It’s quite common for devs to ship fixes without any tests at all. In some cases, I could spot that the fix wasn’t sufficient. In others, the developer fixed the reported vulnerability and introduced another. Honestly, I should have reported these, but I didn’t. That’s on me.

Setting up the environments was another painful experience. Some repositories don’t have many regression tests, while others have thousands of them. Some depend on databases, others on networking. Some have lots of external dependencies, while others rely on system libraries. It’s not easy to make all of this uniform enough to benchmark agents. Gathering the tasks and rebuilding each ecosystem in a reproducible way took much longer than I expected.

And there was also the inference co$t$. In total, I put nearly $100 into this experiment, nearly 5× the budget I initially planned. My original idea was to compare more models with a larger dataset. I quickly saw the bill climb faster than my wife authorized… In particular, the bank account exploded with Anthropic models. They’re so expensive that I had to cut them out of scope. Poolside, on the other hand, offered free model access during the period of this work, which made it possible to include their models in the evaluation.

The benchmark, task files, and result data are all open. See the repository. Contributions and task submissions are welcome.

Citation

@misc{gattipinheiro2026cvebench,
  author       = {Gatti Pinheiro, Giovanni},
  title        = {{CVE-Bench}: Benchmarking {LLM} Agents on Real-World Security Vulnerability Fixes},
  year         = {2026},
  howpublished = {\url{https://giovannigatti.github.io/cve-bench}},
  note         = {Code available at \url{https://github.com/GiovanniGatti/cve-bench}}
}