Official repository for our paper:
VulScribeR: Exploring RAG-based Vulnerability Augmentation with LLMs
If you find this project useful in your research, please consider citing:
@article{VulScribeR,
  author = {Daneshvar, Seyed Shayan and Nong, Yu and Yang, Xu and Wang, Shaowei and Cai, Haipeng},
  title = {VulScribeR: Exploring RAG-based Vulnerability Augmentation with LLMs},
  year = {2025},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  issn = {1049-331X},
  url = {https://doi.org/10.1145/3760775},
  doi = {10.1145/3760775},
  abstract = {Detecting vulnerabilities is vital for software security, yet deep learning-based vulnerability detectors (DLVD) face a data shortage, which limits their effectiveness. Data augmentation can potentially alleviate the data shortage, but augmenting vulnerable code is challenging and requires a generative solution that maintains vulnerability. Previous works have only focused on generating samples that contain single statements or specific types of vulnerabilities. Recently, large language models (LLMs) have been used to solve various code generation and comprehension tasks with inspiring results, especially when fused with retrieval augmented generation (RAG). Therefore, we propose VulScribeR, a novel LLM-based solution that leverages carefully curated prompt templates to augment vulnerable datasets. More specifically, we explore three strategies to augment both single and multi-statement vulnerabilities, with LLMs, namely Mutation, Injection, and Extension. Our extensive evaluation across four vulnerability datasets and DLVD models, using three LLMs, show that our approach beats two SOTA methods Vulgen and VGX, and Random Oversampling (ROS) by 27.48\%, 27.93\%, and 15.41\% in f1-score with 5K generated vulnerable samples on average, and 53.84\%, 54.10\%, 69.90\%, and 40.93\% with 15K generated vulnerable samples. Our approach demonstrates its feasibility for large-scale data augmentation by generating 1K samples at as cheap as US$ 1.88.},
  note = {Just Accepted},
  journal = {ACM Trans. Softw. Eng. Methodol.},
  month = aug,
  keywords = {Vulnerability Augmentation, Deep Learning, Vulnerability Generation, Program Generation, Vulnerability Injection}
}
Datasets:
Bigvul_train
Bigvul_test
Bigvul_val
Reveal
Devign
PrimeVul (RQ4 only)
VGX full dataset
VulGen full dataset (from the VGX paper)
Retriever (pair-matching) outputs:
All pair-matching outputs (except for RQ4), including those for Mutation and the random pairings used in RQ2 (an illustrative sketch of such pairing is given below)
RQ4's pair-matching (retriever) output
Generated (augmented) datasets:
Filtered datasets for RQs 1-3
Unfiltered datasets for RQs 1-3
Unfiltered datasets for RQ4
The unfiltered datasets contain samples produced by the Generator that have not gone through the Verification phase. They also include extra metadata indicating which clean_vul pair was used for generation, as well as the vulnerable lines.
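To browse these unfiltered samples, a short loading script can help. The following is a minimal sketch that assumes the files are JSONL and uses hypothetical field names (code, clean_id, vul_id, vul_lines); check the released files for the actual format and keys.

```python
# Minimal sketch for browsing an unfiltered dataset file.
# Assumptions (not guaranteed by this repository): JSONL layout and the
# hypothetical field names "code", "clean_id", "vul_id", and "vul_lines".
import json

def load_unfiltered(path):
    """Yield one generated sample (a dict) per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)

if __name__ == "__main__":
    for sample in load_unfiltered("unfiltered_rq1.jsonl"):  # hypothetical file name
        # The metadata records which clean/vulnerable pair produced the sample,
        # plus the vulnerable lines, as described above.
        print(sample.get("clean_id"), sample.get("vul_id"), sample.get("vul_lines"))
        print(sample.get("code", "")[:200])  # first 200 characters of the generated code
        break
```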
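For context on what the pair-matching (retriever) files above represent: each vulnerable sample is matched with similar code to build the RAG prompts. Below is a minimal, illustrative sketch of similarity-based pair matching using BM25 via the rank_bm25 package; it is not the retriever implementation used in the paper, and the two code lists are hypothetical placeholders.

```python
# Illustrative only: a toy BM25-based pair-matching sketch.
# The actual retriever that produced the pair-matching files in this repository
# may differ (tokenization, scoring, candidate pools, and pairing direction).
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Hypothetical placeholder snippets; in practice these come from the datasets above.
vulnerable_funcs = [
    "char buf[8]; strcpy(buf, user_input);",
    "int *p = (int *)malloc(n * sizeof(int)); p[n] = 0;",
]
clean_funcs = [
    "char buf[8]; strncpy(buf, user_input, sizeof(buf) - 1); buf[7] = '\\0';",
    "int *p = (int *)malloc((n + 1) * sizeof(int)); p[n] = 0;",
]

# Index the clean functions with BM25 over a naive whitespace tokenization.
bm25 = BM25Okapi([f.split() for f in clean_funcs])

# Pair each vulnerable function with its most similar clean function.
pairs = []
for vul in vulnerable_funcs:
    scores = bm25.get_scores(vul.split())
    best = max(range(len(clean_funcs)), key=lambda i: scores[i])
    pairs.append({"vul": vul, "clean": clean_funcs[best], "score": float(scores[best])})

for p in pairs:
    print(p)
```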
Go to the models directory; the README for each model explains how to use it.