A Real-World, Project-Level Vulnerability Benchmark for White-Box Vulnerability-Hunting Agents
VulnGym is a project-level benchmark for white-box vulnerability-hunting agents, designed to evaluate an agent's vulnerability detection capabilities within real-world engineering contexts, with verifiable vulnerability trigger paths and code-semantic evidence chains.
Three core design principles:
- Real project-level evaluation units: every sample is bound to a specific vulnerable commit of a real repository, evaluating an agent's ability to discover and locate vulnerabilities inside real multi-file, multi-module engineering projects.
- Comprehensive vulnerability-type coverage: the benchmark covers both business-logic defects that demand cross-module code-semantic reasoning (e.g., authorization bypass, broken authentication) and traditional security flaws (e.g., injection, path traversal), so it assesses an agent across diverse vulnerability classes.
- Verifiable vulnerability paths: each sample ships with a human-reviewed reachable entry point (`entry_point`), critical operation (`critical_operation`), and cross-module reasoning chain (`trace`), enabling reproducible, explainable, and deterministic evaluation.
- 2026-05-17: v0.1.1 data refresh. Added a `verify` field on every entry to mark human-audit status; 113 / 408 entries (covering 61 / 184 advisories) are now human-verified. Selected `entry_point` / `critical_operation` / `trace` values were also refined.
- 2026-05-15: VulnGym v0.1.0 officially open-sourced!
- Why VulnGym
- Dataset overview
- Baseline evaluation results
- Repository layout
- Quick start
- Evaluating your tool
- Citation
- Contribution Guide
- Acknowledgements
- License
Existing vulnerability benchmarks have the following limitations when evaluating the real-world vulnerability-hunting capabilities of AI agents:
| Limitation | Manifestation |
|---|---|
| Insufficient evaluation granularity | Most benchmarks use functions or diff snippets as the evaluation unit, failing to reflect an agent's ability to locate vulnerabilities within complete engineering projects |
| Narrow vulnerability types | Over-emphasis on pattern-matchable CWE flaws such as SQL injection and buffer overflow, with little coverage of categories requiring deep contextual reasoning |
| Coarse-grained ground truth | Typically binary labels (vulnerable / not vulnerable) or patch diffs, unable to precisely verify whether the agent locates the correct entry point and defect site |
This is the v0.1.1 release of VulnGym. Data is provided
as two JSONL files under the data/ directory:
- `reports.jsonl`: aggregated records at the GitHub Advisory granularity
- `entries.jsonl`: annotated records at the reachable entry point granularity

Each record contains repo_url and commit, allowing you to check out the
full vulnerable source tree for the corresponding version.
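
As a minimal sketch of that checkout step, the helper below builds and runs the two git commands needed to materialize a record's commit. It assumes only the `repo_url` and `commit` fields named above; the function names, the `checkouts` directory, and the directory naming scheme are hypothetical and not part of VulnGym.

```python
import subprocess
from pathlib import Path


def checkout_commands(repo_url: str, commit: str, dest: str) -> list[list[str]]:
    """Return the git commands that materialize `commit` of `repo_url` in `dest`."""
    return [
        ["git", "clone", "--no-checkout", repo_url, dest],
        ["git", "-C", dest, "checkout", "--detach", commit],
    ]


def checkout(record: dict, workdir: str = "checkouts") -> Path:
    """Clone the record's repository and check out its vulnerable commit."""
    # Hypothetical layout: one directory per (repository name, short commit).
    name = record["repo_url"].rstrip("/").split("/")[-1].removesuffix(".git")
    dest = Path(workdir) / f"{name}-{record['commit'][:12]}"
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        for cmd in checkout_commands(record["repo_url"], record["commit"], str(dest)):
            subprocess.run(cmd, check=True)  # raises if git fails
    return dest
```

Calling `checkout(entry)` on a row loaded from `data/entries.jsonl` would then return the path of the checked-out tree. A full clone is used here for simplicity; fetching a single commit is possible but depends on server support.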
| Metric | Value |
|---|---|
| Advisories (reports) | 184 |
| Reachable entry points (entries) | 408 |
| Distinct projects | 38 |
| Distinct repositories | 23 |
| Human-audited entries (`verify = 1`) | 113 / 408 (27.7 %) |
| Human-audited advisories (≥ 1 verified entry) | 61 / 184 (33.2 %) |
Starting in v0.1.1, every row in entries.jsonl carries a verify field
(int, 0 or 1):
- `verify == 1`: the entry's `entry_point`, `critical_operation`, and `trace` have been reviewed and confirmed by a human annotator. These rows form a high-confidence ground-truth subset and are recommended for strict, reproducible benchmarking.
- `verify == 0`: automatically annotated; not yet human-confirmed. Useful for scale and recall studies, but values may still be refined in future releases.

Of the 184 advisories, 50 have all of their entries verified and 11 are partially verified, for a total of 61 advisories with at least one human-audited entry. Future releases will continue to expand the verified subset.
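
The fully / partially verified split can be derived from the per-entry `verify` flags. The sketch below shows one way to do it on toy rows; the `advisory_id` key is a hypothetical name for whatever field links an entry to its advisory (see SCHEMA.md for the real field).

```python
from collections import defaultdict


def verification_status(entries, advisory_key="advisory_id"):
    """Classify each advisory as 'full', 'partial', or 'none' by its entries' verify flags."""
    flags = defaultdict(list)
    for e in entries:
        flags[e[advisory_key]].append(e["verify"] == 1)
    status = {}
    for adv, vals in flags.items():
        if all(vals):
            status[adv] = "full"
        elif any(vals):
            status[adv] = "partial"
        else:
            status[adv] = "none"
    return status


# Toy rows for illustration only; not taken from the dataset.
sample = [
    {"advisory_id": "A", "verify": 1},
    {"advisory_id": "A", "verify": 1},
    {"advisory_id": "B", "verify": 1},
    {"advisory_id": "B", "verify": 0},
    {"advisory_id": "C", "verify": 0},
]
status = verification_status(sample)
print(status)  # {'A': 'full', 'B': 'partial', 'C': 'none'}
```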
Every entry carries a two-level classification: vuln_category_l1
(coarse type) and vuln_category_l2 (fine-grained sub-type). 71.2 % of
advisories are business-logic vulnerabilities, classified with a
12-class + 1 fallback taxonomy (see below). The remaining 28.8 %
cover traditional vulnerability types. Full data model and field
definitions are in SCHEMA.md.
The initial release (v0.1.0) draws primarily from recent high-star open-source projects and focuses on frequently occurring business-logic vulnerabilities; future releases will continue expanding vulnerability categories and project coverage.
Note: one advisory may map to multiple entries; the counts below are by advisory (vulnerability), not by entry.
Business-logic advisories (131 / 184, 71.2 %), `vuln_category_l2` breakdown:
| Sub-category | Advisories | % of BL |
|---|---|---|
| BL-AUTHZ-BROKEN: broken authorization logic | 31 | 23.7 % |
| BL-AUTHZ-MISSING: missing authorization | 23 | 17.6 % |
| BL-AGENT-CAPABILITY: AI / Agent capability boundary bypass | 20 | 15.3 % |
| BL-PRIV-ESC: privilege escalation | 13 | 9.9 % |
| BL-AUTH-BYPASS: authentication bypass | 11 | 8.4 % |
7 more sub-categories (33 advisories, 25.2 % of BL)
| Sub-category | Advisories | % of BL |
|---|---|---|
| BL-ORIGIN-INTEGRITY: origin / signature / integrity check missing | 8 | 6.1 % |
| BL-WORKFLOW-VIOLATION: workflow / state-machine violation | 7 | 5.3 % |
| BL-INSECURE-DEFAULT: insecure default configuration | 6 | 4.6 % |
| BL-RACE-LOGIC: business-layer race condition | 4 | 3.1 % |
| BL-MULTI-TENANT: multi-tenant / isolation failure | 3 | 2.3 % |
| BL-MASS-ASSIGNMENT: mass assignment / parameter pollution | 3 | 2.3 % |
| BL-TRUST-BOUNDARY: implicit trust in internal input | 2 | 1.5 % |
Traditional vulnerability advisories (53 / 184, 28.8 %), top `vuln_category_l1`:
| Category | Advisories | % of Trad. |
|---|---|---|
| Code Injection | 12 | 22.6 % |
| Path Traversal / File ops | 9 | 17.0 % |
| Command Injection | 8 | 15.1 % |
| XSS | 5 | 9.4 % |
| Sandbox Escape | 5 | 9.4 % |
4 more categories (14 advisories, 26.4 % of Trad.)
| Category | Advisories | % of Trad. |
|---|---|---|
| SSRF | 4 | 7.5 % |
| Authentication Bypass | 3 | 5.7 % |
| Deserialization | 2 | 3.8 % |
| Other (Template Injection, RCE, Supply Chain, etc.) | 5 | 9.4 % |
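
Because the tables above count advisories rather than entries, a per-category tally has to deduplicate entries that belong to the same advisory. The following is a rough sketch of that counting on toy rows; `advisory_id` is a hypothetical linking key, while `vuln_category_l1` is the field named above.

```python
from collections import Counter


def advisories_per_category(entries, advisory_key="advisory_id",
                            category_key="vuln_category_l1"):
    """Count each advisory once per category, however many entries it has."""
    seen = {(e[advisory_key], e[category_key]) for e in entries}
    return Counter(cat for _, cat in seen)


def shares(counts):
    """Convert absolute counts to percentages of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {cat: round(100 * n / total, 1) for cat, n in counts.items()}


# Toy rows for illustration only; advisory "A" has two entries but counts once.
sample = [
    {"advisory_id": "A", "vuln_category_l1": "XSS"},
    {"advisory_id": "A", "vuln_category_l1": "XSS"},
    {"advisory_id": "B", "vuln_category_l1": "XSS"},
    {"advisory_id": "C", "vuln_category_l1": "SSRF"},
]
counts = advisories_per_category(sample)
print(counts["XSS"], counts["SSRF"])  # 2 1
```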
Future releases will continue expanding vulnerability categories and project coverage.
Coming soon: we are systematically evaluating mainstream tools and AI agents. Results will be published alongside the technical report.
VulnGym/
├── README.md                 # English version
├── README_zh.md              # Chinese version
├── SCHEMA.md                 # field reference & validation invariants
├── CHANGELOG.md
├── CITATION.cff
├── LICENSE                   # CC-BY-4.0
├── data/
│   ├── reports.jsonl         # 184 rows, one GitHub Advisory per row
│   └── entries.jsonl         # 408 rows, one entry point per row, with human-audit flag (verify)
└── examples/
    ├── load_dataset.py       # stdlib / pandas / HuggingFace datasets loader
    ├── example_result.jsonl  # illustrative tool-findings submission
    └── evaluate.py           # coverage / recall evaluator

git clone https://github.com/Tencent/VulnGym.git
cd VulnGym
python3 examples/load_dataset.py

Or load directly in Python:

import json

# Read one JSON object per non-empty line
with open("data/entries.jsonl", encoding="utf-8") as f:
    entries = [json.loads(line) for line in f if line.strip()]

# Filter by coarse vulnerability category
xss = [e for e in entries if e["vuln_category_l1"] == "XSS"]
print(len(xss), "XSS entries")
print(xss[0]["entry_point"], "->", xss[0]["critical_operation"])

# Restrict to the human-audited high-confidence subset
verified = [e for e in entries if e["verify"] == 1]
print(len(verified), "human-audited entries")

Pandas:

import pandas as pd
reports = pd.read_json("data/reports.jsonl", lines=True)
entries = pd.read_json("data/entries.jsonl", lines=True)

HuggingFace datasets:

from datasets import load_dataset
ds = load_dataset("json", data_files={
    "reports": "data/reports.jsonl",
    "entries": "data/entries.jsonl",
})

Write your tool's findings to a JSONL file (one finding per line) and run:

python3 examples/evaluate.py path/to/your_findings.jsonl -v

Each finding must carry at least repo_url, commit, entry_point
(reachable entry point), and critical_operation (core defect location).
trace (cross-module reasoning chain) is optional and ignored by the
matcher. See examples/example_result.jsonl for a working sample.
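
A small writer that checks the four required fields before producing the file might look like the sketch below. The helper name `write_findings` is hypothetical, and the value format of `entry_point` and `critical_operation` is deliberately not assumed here; `examples/example_result.jsonl` and SCHEMA.md remain the authoritative references.

```python
import json

# The four fields the text above lists as mandatory for every finding.
REQUIRED = ("repo_url", "commit", "entry_point", "critical_operation")


def write_findings(findings, path):
    """Write findings as JSONL, rejecting the batch if any row lacks a required field."""
    # Validate everything first so a bad row never leaves a half-written file.
    for i, finding in enumerate(findings):
        missing = [k for k in REQUIRED if k not in finding]
        if missing:
            raise ValueError(f"finding {i} is missing {missing}")
    with open(path, "w", encoding="utf-8") as f:
        for finding in findings:
            f.write(json.dumps(finding, ensure_ascii=False) + "\n")
```

Extra keys such as `trace` pass through unchanged, which matches the statement that the matcher ignores them.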
The script reports two metrics:
- Advisory-level recall (primary): `covered_advisories / usable_advisories`. An advisory is covered if at least one of its entries is matched.
- Entry-level recall (secondary): `matched_entries / usable_entries`.
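
To make the two ratios concrete, here is an illustrative computation over toy data. It is not the code of `examples/evaluate.py`; the `entry_id` and `advisory_id` keys are hypothetical, and `entries` is assumed to contain only usable entries.

```python
def recall_metrics(entries, matched_ids, entry_key="entry_id", advisory_key="advisory_id"):
    """Compute advisory-level and entry-level recall over usable entries."""
    usable_advisories = {e[advisory_key] for e in entries}
    # An advisory is covered as soon as one of its entries is matched.
    covered = {e[advisory_key] for e in entries if e[entry_key] in matched_ids}
    matched = sum(1 for e in entries if e[entry_key] in matched_ids)
    return {
        "advisory_recall": len(covered) / len(usable_advisories) if usable_advisories else 0.0,
        "entry_recall": matched / len(entries) if entries else 0.0,
    }


# Toy ground truth: advisory A has two entries, B and C have one each.
truth = [
    {"entry_id": "e1", "advisory_id": "A"},
    {"entry_id": "e2", "advisory_id": "A"},
    {"entry_id": "e3", "advisory_id": "B"},
    {"entry_id": "e4", "advisory_id": "C"},
]
metrics = recall_metrics(truth, matched_ids={"e1", "e3"})
print(metrics["entry_recall"])  # 0.5
```

In this toy case two of three advisories are covered while only half of the entries are matched, which illustrates why the advisory-level number is usually the higher of the two.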
Default matching policy
| Aspect | Default |
|---|---|
| Path match | normalized, exact |
| Line tolerance | \|Δline\| ≤ 5 on entry_point and critical_operation |
| Direction | strict (entry_point-to-entry_point, critical_operation-to-critical_operation) |
| `line == 0` in ground truth | excluded from numerator and denominator |
All policies are documented and configurable via CLI arguments
(--line-tolerance, etc.).
Note: The current evaluator only computes recall / coverage and cannot penalize over-reporting. The resulting numbers should be interpreted as coverage metrics, not a full precision-aware benchmark.
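
The default policy can be sketched as a small matcher. This is an approximation under stated assumptions, not the evaluator's actual implementation: locations are modelled as hypothetical `{"file": ..., "line": ...}` dictionaries, "normalized" is taken to mean POSIX-style path normalization, a match is assumed to require both `entry_point` and `critical_operation` to agree, and an entry is treated as unusable if either ground-truth line is 0.

```python
import posixpath


def normalize(path):
    """Normalize separators and redundant segments (assumed meaning of 'normalized')."""
    return posixpath.normpath(path.replace("\\", "/")).lstrip("/")


def location_matches(truth, found, tolerance=5):
    """Exact path match plus |line difference| <= tolerance; None if ground truth is unusable."""
    if truth["line"] == 0:
        return None
    return (normalize(truth["file"]) == normalize(found["file"])
            and abs(truth["line"] - found["line"]) <= tolerance)


def entry_matches(truth_entry, finding, tolerance=5):
    """Strict direction: entry_point vs entry_point, critical_operation vs critical_operation."""
    if (truth_entry["repo_url"], truth_entry["commit"]) != (finding["repo_url"], finding["commit"]):
        return False
    results = [location_matches(truth_entry[k], finding[k], tolerance)
               for k in ("entry_point", "critical_operation")]
    if None in results:
        return None  # excluded from numerator and denominator
    return all(results)


# Toy ground-truth entry and finding; all values are placeholders.
truth_entry = {
    "repo_url": "https://example.com/org/repo", "commit": "abc123",
    "entry_point": {"file": "src/app.py", "line": 10},
    "critical_operation": {"file": "src/db.py", "line": 100},
}
finding = {
    "repo_url": "https://example.com/org/repo", "commit": "abc123",
    "entry_point": {"file": "./src/app.py", "line": 14},
    "critical_operation": {"file": "src/db.py", "line": 95},
}
print(entry_matches(truth_entry, finding))  # True
```

Returning `None` for unusable ground truth keeps the exclusion explicit, so a caller can drop those entries before computing recall instead of counting them as misses.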
A companion paper is in preparation. Until it is released, please cite VulnGym using the dataset entry below; we will update this section once the paper is publicly available.
@misc{vulngym2026,
title = {VulnGym: A Real-World, Project-Level Vulnerability Benchmark
for White-Box Vulnerability-Hunting Agents},
author = {{Tencent Wukong Code Security Team and contributors}},
year = {2026},
version = {0.1.1},
howpublished = {\url{https://github.com/Tencent/VulnGym}},
note = {Dataset. A companion paper is in preparation; please check
the repository for the latest citation.}
}

Once the paper is public, the entry below will be filled in and should be preferred:

@inproceedings{vulngym2026paper,
  title  = {TBA: A companion paper for VulnGym is in preparation.},
  author = {{To be announced}},
  year   = {TBA},
  note   = {Placeholder; will be replaced once the paper is publicly available.}
}

See CITATION.cff for the machine-readable form.
VulnGym aims to be an open, reproducible, and continuously evolving community benchmark. Contributions from both academia and industry are warmly welcomed:
- Dataset contributions: new advisories, additional reachable entry points for existing advisories, corrections to `entry_point` / `critical_operation` / `trace`.
- Evaluator improvements: precision / F1, per-category breakdowns, statistical significance (bootstrap CI), alternative matching policies.
- Evaluation result submissions: submit your tool's evaluation results via PR to be included in the baseline comparison.
- Discussions & feedback: file an Issue or start a Discussion.
Please read SCHEMA.md before proposing data changes β all invariants
listed there are enforced at release time.
VulnGym is jointly built by the Tencent Wukong Security Team together with the following academic partners (listed in no particular order, final order TBD):
- ARISE Lab, The Chinese University of Hong Kong
- Systems Software & Security Lab, Fudan University
- JC STEM Lab of Intelligent Cybersecurity, The University of Hong Kong
- Narwhal-Lab, Peking University
- Network Threat Analysis Lab, Institute of Information Engineering, Chinese Academy of Sciences
Many thanks to all partners for their outstanding contributions to VulnGym.
The dataset is released under CC-BY-4.0; see LICENSE.
You may use it for commercial and academic purposes with attribution.
Source code paths and commit hashes referenced in entry_point /
critical_operation / trace fields belong to their respective upstream
projects under their original licenses; consult the referenced
repositories before reusing any quoted code fragment.
