CleanVul introduces VulSifter, an innovative methodology that combines Large Language Models (LLMs) with heuristics to automatically detect vulnerability-fixing changes within vulnerability-fixing commits (VFCs). This approach has enabled us to create two high-quality datasets: the primary CleanVul dataset containing 8,092 functions with 90.6% correctness, and a more precise variant comprising 6,051 functions with 97.3% correctness. Both datasets demonstrate quality comparable to or exceeding established benchmarks such as SVEN (94.0%) and PrimeVul (86.0%).
Our approach addresses the significant noise (40-75%) in existing vulnerability datasets caused by indiscriminate labeling of all modifications in vulnerability-fixing commits as vulnerability-related.
- ✨ LLM-Based Analysis: Uses state-of-the-art LLMs to comprehend code semantics and contextual information for identifying genuine vulnerability fixes
- ✨ Heuristic Enhancement: Custom filtering rules to eliminate test-related changes
- ✨ High Accuracy: Achieves an F1-score of 0.77 in identifying genuine vulnerability fixes
- ✨ High Quality: Up to 97.3% correctness in identifying genuine vulnerability fixes, comparable to manually curated datasets
- ✨ Scale: 6,051 function pairs in the high-precision variant (8,092 in the primary dataset), across multiple programming languages
- ✨ Language Coverage: Includes Java, Python, C, JavaScript, C#, and C++ code
- ✨ Diverse Sources: Derived from analysis of 5.3M commits across 127K GitHub repositories
The dataset provides different versions based on confidence thresholds:
| Threshold | With Heuristics | Without Heuristics | Correctness (With Heuristics) | Correctness (Without Heuristics) |
|---|---|---|---|---|
| 1 | 21,187 | 23,652 | 43.1% | 37.5% |
| 2 | 10,536 | 11,745 | 49.4% | 57.7% |
| 3 | 8,092 | 8,979 | 90.6% | 76.5% |
| 4 | 6,051 | 6,616 | 97.3% | 78.0% |
The thresholds represent confidence levels in our VulSifter methodology:
- Threshold 1: Lowest confidence level, capturing the broadest set of potential vulnerability fixes but with the highest false positive rate (43.1% correctness with heuristics)
- Threshold 2: Moderate confidence level, offering a balance between dataset size and accuracy (49.4% correctness with heuristics)
- Threshold 3: High confidence level, recommended for most applications, providing excellent balance between dataset size and quality (90.6% correctness with heuristics)
- Threshold 4: Highest confidence level, prioritizing precision over recall, offering near-perfect correctness (97.3% with heuristics) but with a smaller dataset size
The "With Heuristics" versions apply additional filtering rules to remove test-related changes and other non-vulnerability modifications, resulting in significantly higher correctness rates compared to versions without these heuristics.
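The threshold mechanism above can be sketched in a few lines. This is a minimal illustration, not the dataset's actual schema: the record fields and function names below are hypothetical, and the only assumption taken from the text is that each function pair carries a VulSifter confidence score from 1 to 4, with a threshold keeping pairs whose score meets or exceeds it.

```python
# Hypothetical records; in CleanVul each function pair carries a
# VulSifter confidence score in the range 1-4.
records = [
    {"func": "parse_input", "score": 4},
    {"func": "log_message", "score": 2},
    {"func": "check_auth", "score": 3},
]

def select_by_threshold(records, threshold):
    """Keep records whose VulSifter score meets or exceeds the threshold."""
    return [r for r in records if r["score"] >= threshold]

strict = select_by_threshold(records, 4)     # Threshold 4: score = 4 only
inclusive = select_by_threshold(records, 3)  # Threshold 3: score >= 3
print(len(strict), len(inclusive))  # 1 2
```

Raising the threshold trades recall for precision, which is exactly the pattern visible in the table: higher thresholds yield smaller subsets with higher correctness.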
Important detail about thresholds:
- In the dataset, pairs are stored separately by score:
  - `vulnerability_score_3.csv` contains pairs with score = 3.
  - `vulnerability_score_4.csv` contains pairs with score = 4.
- To reconstruct Threshold 3 (score ≥ 3), combine `vulnerability_score_3.csv` and `vulnerability_score_4.csv`, giving a total of 8,092 items (with heuristic filtering).
- We keep them separate in the dataset to avoid duplication and to let users choose between the strict (score = 4 only) and inclusive (score ≥ 3) subsets.
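The reconstruction step can be sketched with the standard library. This is a hedged example: the column names (`func_before`, `func_after`, `score`) are illustrative placeholders, not the dataset's actual header, and the in-memory CSVs stand in for the real `vulnerability_score_3.csv` and `vulnerability_score_4.csv` files you would open from the release.

```python
import csv
import io

# Stand-ins for the real per-score files; column names are assumptions.
score_3_csv = io.StringIO("func_before,func_after,score\nfoo_v1,foo_v2,3\n")
score_4_csv = io.StringIO("func_before,func_after,score\nbar_v1,bar_v2,4\n")

def load_rows(fh):
    """Read a per-score CSV into a list of dict rows."""
    return list(csv.DictReader(fh))

# Threshold 3 (score >= 3) is the concatenation of the two files.
threshold_3 = load_rows(score_3_csv) + load_rows(score_4_csv)

# Threshold 4 (strict) is just the score = 4 file.
threshold_4 = [r for r in threshold_3 if int(r["score"]) == 4]

print(len(threshold_3), len(threshold_4))  # 2 1
```

With the actual files, replace the `io.StringIO` objects with `open("vulnerability_score_3.csv")` and `open("vulnerability_score_4.csv")`; the combined result should contain 8,092 rows for the heuristic-filtered release.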