Skip to content

Tencent/VulnGym

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

6 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

VulnGym

GitHub Stars GitHub Forks License

A Real-World, Project-Level Vulnerability Benchmark for White-Box Vulnerability-Hunting Agents

Give VulnGym a Star

VulnGym is a project-level benchmark for white-box vulnerability-hunting agents, designed to evaluate an agent's vulnerability detection capabilities within real-world engineering contexts, with verifiable vulnerability trigger paths and code-semantic evidence chains.

Three core design principles:

  • πŸ—οΈ Real project-level evaluation units β€” every sample is bound to a specific vulnerable commit of a real repository, evaluating an agent's ability to discover and locate vulnerabilities inside real multi-file, multi-module engineering projects.
  • 🧠 Comprehensive vulnerability-type coverage β€” the benchmark covers both business-logic defects that demand cross-module code-semantic reasoning (e.g., authorization bypass, broken authentication) and traditional security flaws (e.g., injection, path traversal), providing a comprehensive assessment of an agent's ability to discover diverse vulnerability classes.
  • βœ… Verifiable vulnerability paths β€” each sample ships with a human-reviewed reachable entry point (entry_point), critical operation (critical_operation), and cross-module reasoning chain (trace), enabling reproducible, explainable, and deterministic evaluation.

πŸ“’ What's New

  • 2026-05-17 β€” πŸ”§ v0.1.1 data refresh: added a verify field on every entry to mark human-audit status; 113 / 408 entries (covering 61 / 184 advisories) are now human-verified. Selected entry_point / critical_operation / trace values were also refined.
  • 2026-05-15 β€” πŸŽ‰ VulnGym v0.1.0 officially open-sourced!

Table of Contents


πŸ” Why VulnGym

Existing vulnerability benchmarks have the following limitations when evaluating the real-world vulnerability-hunting capabilities of AI agents:

Limitation Manifestation
Insufficient evaluation granularity Most benchmarks use functions or diff snippets as the evaluation unit, failing to reflect an agent's ability to locate vulnerabilities within complete engineering projects
Narrow vulnerability types Over-emphasis on pattern-matchable CWE flaws such as SQL injection and buffer overflow, with little coverage of categories requiring deep contextual reasoning
Coarse-grained ground truth Typically binary labels (vulnerable / not vulnerable) or patch diffs, unable to precisely verify whether the agent locates the correct entry point and defect site

✨ Dataset overview

This is the v0.1.1 release of VulnGym. Data is provided as two JSONL files under the data/ directory:

  • reports.jsonl β€” aggregated records at the GitHub Advisory granularity
  • entries.jsonl β€” annotated records at the reachable entry point granularity

Each record contains repo_url and commit, allowing you to check out the full vulnerable source tree for the corresponding version.

Data scale

Metric Value
Advisories (reports) 184
Reachable entry points (entries) 408
Distinct projects 38
Distinct repositories 23
Human-audited entries (verify = 1) 113 / 408 (27.7 %)
Human-audited advisories (β‰₯ 1 verified entry) 61 / 184 (33.2 %)

Human audit status

Starting in v0.1.1, every row in entries.jsonl carries a verify field (int, 0 or 1):

  • verify == 1 β€” the entry's entry_point, critical_operation, and trace have been reviewed and confirmed by a human annotator. These rows form a high-confidence ground-truth subset and are recommended for strict, reproducible benchmarking.
  • verify == 0 β€” automatically annotated; not yet human-confirmed. Useful for scale and recall studies, but values may still be refined in future releases.

Of the 184 advisories, 50 have all of their entries verified and 11 are partially verified, for a total of 61 advisories with at least one human-audited entry. Future releases will continue to expand the verified subset.

Vulnerability type distribution

Every entry carries a two-level classification: vuln_category_l1 (coarse type) and vuln_category_l2 (fine-grained sub-type). 71.2 % of advisories are business-logic vulnerabilities, classified with a 12-class + 1 fallback taxonomy (see below). The remaining 28.8 % cover traditional vulnerability types. Full data model and field definitions are in SCHEMA.md.

The initial release (v0.1.0) draws primarily from recent high-star open-source projects and focuses on frequently occurring business-logic vulnerabilities; future releases will continue expanding vulnerability categories and project coverage.

Note: one advisory may map to multiple entries β€” the counts below are by advisory (vulnerability), not by entry.

Business-logic advisories (131 / 184, 71.2 %) β€” vuln_category_l2 breakdown:

Sub-category Advisories % of BL
BL-AUTHZ-BROKEN β€” broken authorization logic 31 23.7 %
BL-AUTHZ-MISSING β€” missing authorization 23 17.6 %
BL-AGENT-CAPABILITY β€” AI / Agent capability boundary bypass 20 15.3 %
BL-PRIV-ESC β€” privilege escalation 13 9.9 %
BL-AUTH-BYPASS β€” authentication bypass 11 8.4 %
7 more sub-categories (33 advisories, 25.2 % of BL)
Sub-category Advisories % of BL
BL-ORIGIN-INTEGRITY β€” origin / signature / integrity check missing 8 6.1 %
BL-WORKFLOW-VIOLATION β€” workflow / state-machine violation 7 5.3 %
BL-INSECURE-DEFAULT β€” insecure default configuration 6 4.6 %
BL-RACE-LOGIC β€” business-layer race condition 4 3.1 %
BL-MULTI-TENANT β€” multi-tenant / isolation failure 3 2.3 %
BL-MASS-ASSIGNMENT β€” mass assignment / parameter pollution 3 2.3 %
BL-TRUST-BOUNDARY β€” implicit trust in internal input 2 1.5 %

Traditional vulnerability advisories (53 / 184, 28.8 %) β€” top vuln_category_l1:

Category Advisories % of Trad.
Code Injection 12 22.6 %
Path Traversal / File ops 9 17.0 %
Command Injection 8 15.1 %
XSS 5 9.4 %
Sandbox Escape 5 9.4 %
4 more categories (14 advisories, 26.4 % of Trad.)
Category Advisories % of Trad.
SSRF 4 7.5 %
Authentication Bypass 3 5.7 %
Deserialization 2 3.8 %
Other (Template Injection, RCE, Supply Chain, etc.) 5 9.4 %

Future releases will continue expanding vulnerability categories and project coverage.

πŸ“ˆ Baseline evaluation results

🚧 Coming soon β€” We are systematically evaluating mainstream tools and AI agents. Results will be published alongside the technical report.

πŸ“¦ Repository layout

VulnGym/
β”œβ”€β”€ README.md                    # English version
β”œβ”€β”€ README_zh.md                 # δΈ­ζ–‡η‰ˆ
β”œβ”€β”€ SCHEMA.md                    # field reference & validation invariants
β”œβ”€β”€ CHANGELOG.md
β”œβ”€β”€ CITATION.cff
β”œβ”€β”€ LICENSE                      # CC-BY-4.0
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ reports.jsonl            # 184 rows β€” one GitHub Advisory per row
β”‚   └── entries.jsonl            # 408 rows β€” one entry point per row, with human-audit flag (verify)
└── examples/
    β”œβ”€β”€ load_dataset.py          # stdlib / pandas / HuggingFace datasets loader
    β”œβ”€β”€ example_result.jsonl     # illustrative tool-findings submission
    └── evaluate.py              # coverage / recall evaluator

πŸš€ Quick start

git clone https://github.com/Tencent/VulnGym.git
cd VulnGym
python3 examples/load_dataset.py

Or load directly in Python:

import json
with open("data/entries.jsonl", encoding="utf-8") as f:
    entries = [json.loads(line) for line in f if line.strip()]

xss = [e for e in entries if e["vuln_category_l1"] == "XSS"]
print(len(xss), "XSS entries")
print(xss[0]["entry_point"], "β†’", xss[0]["critical_operation"])

# Restrict to the human-audited high-confidence subset
verified = [e for e in entries if e["verify"] == 1]
print(len(verified), "human-audited entries")

Pandas:

import pandas as pd
reports = pd.read_json("data/reports.jsonl", lines=True)
entries = pd.read_json("data/entries.jsonl", lines=True)

HuggingFace datasets:

from datasets import load_dataset
ds = load_dataset("json", data_files={
    "reports": "data/reports.jsonl",
    "entries": "data/entries.jsonl",
})

πŸ“Š Evaluating your tool

Write your tool's findings to a JSONL file (one finding per line) and run:

python3 examples/evaluate.py path/to/your_findings.jsonl -v

Each finding must carry at least repo_url, commit, entry_point (reachable entry point), and critical_operation (core defect location). trace (cross-module reasoning chain) is optional and ignored by the matcher. See examples/example_result.jsonl for a working sample.

The script reports two metrics:

  • Advisory-level recall (primary) β€” covered_advisories / usable_advisories. An advisory is covered if at least one of its entries is matched.
  • Entry-level recall (secondary) β€” matched_entries / usable_entries.

Default matching policy

Aspect Default
Path match normalized, exact
Line tolerance |Ξ”line| ≀ 5 on entry_point and critical_operation
Direction strict (entry_point-to-entry_point, critical_operation-to-critical_operation)
line == 0 in ground truth excluded from numerator and denominator

All policies are documented and configurable via CLI arguments (--line-tolerance, etc.).

Note: The current evaluator only computes recall / coverage and cannot penalize over-reporting. The resulting numbers should be interpreted as coverage metrics, not a full precision-aware benchmark.

πŸ“– Citation

πŸ“š A companion paper is in preparation. Until it is released, please cite VulnGym using the dataset entry below; we will update this section once the paper is publicly available.

@misc{vulngym2026,
  title        = {VulnGym: A Real-World, Project-Level Vulnerability Benchmark
                  for White-Box Vulnerability-Hunting Agents},
  author       = {{Tencent Wukong Code Security Team and contributors}},
  year         = {2026},
  version      = {0.1.1},
  howpublished = {\url{https://github.com/Tencent/VulnGym}},
  note         = {Dataset. A companion paper is in preparation; please check
                  the repository for the latest citation.}
}

Once the paper is public, the entry below will be filled in and should be preferred:

@inproceedings{vulngym2026paper,
  title     = {TBA β€” A companion paper for VulnGym is in preparation.},
  author    = {{To be announced}},
  year      = {TBA},
  note      = {Placeholder; will be replaced once the paper is publicly available.}
}

See CITATION.cff for the machine-readable form.


🀝 Contribution Guide

VulnGym aims to be an open, reproducible, and continuously evolving community benchmark. Contributions from both academia and industry are warmly welcomed:

  • 🧠 Dataset contributions β€” new advisories, additional reachable entry points for existing advisories, corrections to entry_point / critical_operation / trace.
  • πŸ”§ Evaluator improvements β€” precision / F1, per-category breakdowns, statistical significance (bootstrap CI), alternative matching policies.
  • πŸ“Š Evaluation result submissions β€” submit your tool's evaluation results via PR to be included in the baseline comparison.
  • πŸ’¬ Discussions & feedback β€” file an Issue or start a Discussion.

Please read SCHEMA.md before proposing data changes β€” all invariants listed there are enforced at release time.


πŸ™ Acknowledgements

VulnGym is jointly built by the Tencent Wukong Security Team together with the following academic partners (listed in no particular order, final order TBD):

  • ARISE Lab, The Chinese University of Hong Kong
  • Systems Software & Security Lab, Fudan University
  • JC STEM Lab of Intelligent Cybersecurity, The University of Hong Kong
  • Narwhal-Lab, Peking University
  • Network Threat Analysis Lab, Institute of Information Engineering, Chinese Academy of Sciences

Many thanks to all partners for their outstanding contributions to VulnGym.


πŸ“„ License

The dataset is released under CC-BY-4.0 β€” see LICENSE. You may use it for commercial and academic purposes with attribution. Source code paths and commit hashes referenced in entry_point / critical_operation / trace fields belong to their respective upstream projects under their original licenses; consult the referenced repositories before reusing any quoted code fragment.

About

VulnGym: A Real-World, Project-Level Vulnerability Benchmark for White-Box Vulnerability-Hunting Agents

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors