A Comprehensive Analysis of Package Hallucinations by Code-Generating LLMs
This repository contains the code, data, and instructions for reproducing the experiments and results from our paper:
We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs
Joseph Spracklen, Raveen Wijewickrama, A H M Nazmus Sakib, Anindya Maiti, Bimal Viswanath, Murtuza Jadliwala
In Proceedings of the USENIX Security Symposium, 2025.
📄 Paper PDF
- Large-scale study across 16 LLMs (commercial + open-source), Python and JavaScript
- 576,000 code samples analyzed; 19.7% of recommended packages were hallucinations
- 205,474 unique hallucinated package names discovered
- Effective mitigations evaluated: RAG, self-detection, fine-tuning (largest reduction from fine-tuning)
- Overview
- Repository Structure
- Setup
- Usage
- Data
- Reproducing Results
- Mitigation Experiments
- Hardware & Runtime Notes
- Security & Ethics
- Troubleshooting
- Citation
- License
- Contact
Package hallucinations occur when an LLM generates code that references a non-existent package (e.g., via pip install xyz or npm install xyz where xyz does not exist).
This creates a software supply-chain risk: adversaries can upload a malicious package using that hallucinated name.
This repo provides:
- End-to-end pipeline to generate code, extract package names, and measure hallucinations
- Prompt datasets (Stack Overflow–derived and LLM-generated)
- Code to reproduce figures/tables and mitigation experiments
.
├── run_test.py # Runs a full hallucination detection experiment
├── Models/ # Place tested models here (one default model included)
├── Data/ # Prompt datasets & per-language resources
│ ├── Python/
│ │ ├── LLM_AT.json
│ │ ├── LLM_LY.json
│ │ ├── SO_AT.json
│ │ └── SO_LY.json
│ └── JavaScript/
│ ├── LLM_AT.json
│ ├── LLM_LY.json
│ ├── SO_AT.json
│ └── SO_LY.json
├── Tests/ # Output directory for experiment results (starts empty)
├── Mitigation/ # Mitigation experiments
│ ├── run_model_RAG.py
│ ├── run_model_SD.py
│ ├── run_model_combined.py
│ ├── Data/ # RAG DB + build data
│ ├── Fine_tuned/ # Fine-tuned & quantized models used in mitigation testing
│ └── RAG_setup.py # Builds the vector DB from Mitigation/Data
├── Plots/ # Code and data to reproduce paper figures
├── environment.yml # Conda environment
├── requirements.txt # (Optional) pip dependencies
└── README.md

git clone https://github.com/Spracks/PackageHallucination.git
cd PackageHallucination

Using Conda (recommended):
conda env create -f environment.yml
conda activate pkg-hallucination

Or using pip:
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt

The listed environment includes more packages than the pipeline strictly needs; the core requirements are PyTorch and transformers, plus their dependencies.
Run a full hallucination detection experiment for one model:
python run_test.py DeepSeek_1B --language Python
# or
python run_test.py DeepSeek_1B --language JavaScript

What it does
- Generates code for prompts in Data/<LANG>/
- Extracts package names via the paper’s three heuristics
- Compares against master lists to mark hallucinations
- Writes results/artifacts under Tests/
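The extract-and-compare steps above can be illustrated with a minimal sketch. This is not the repository's implementation and does not reproduce the paper's three heuristics: it only scans generated text for literal pip/npm install commands. The helper names (`extract_install_targets`, `find_hallucinations`) and the tiny `valid` set are hypothetical stand-ins for the real master lists.

```python
import re

# Matches "pip install ..." or "npm install ..." up to end of line or a backtick.
INSTALL_RE = re.compile(r"\b(?:pip3?|npm)\s+install\s+([^\n`]+)")


def _strip_version(token):
    """Drop version pins such as requests==2.31.0 or lodash@4.17.21."""
    name = re.split(r"[=<>!~]", token, maxsplit=1)[0]
    at = name.find("@", 1)  # ignore the leading "@" of scoped npm names
    if at != -1:
        name = name[:at]
    return name.lower()


def extract_install_targets(generated_text):
    """Return package names found in install commands (hypothetical helper).

    Limitation: flags that take a value (e.g. "-r file") are not handled.
    """
    names = []
    for match in INSTALL_RE.finditer(generated_text):
        for token in match.group(1).split():
            if token.startswith("-"):
                continue  # skip flags such as -U or --save
            name = _strip_version(token)
            if name and name not in names:
                names.append(name)
    return names


def find_hallucinations(names, valid_packages):
    """Return the names that are absent from the set of known packages."""
    return [n for n in names if n not in valid_packages]


sample = (
    "pip install requests==2.31.0 fakepkg-xyz\n"
    "npm install --save lodash@4.17.21 @babel/core\n"
)
valid = {"requests", "lodash", "@babel/core"}  # toy stand-in for a master list
print(find_hallucinations(extract_install_targets(sample), valid))
```

A real comparison would also need registry-specific name normalization (for example, PyPI treats `-`, `_`, and `.` as equivalent), which this sketch omits.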
Notes
- Ensure your chosen model is available under Models/
- End-to-end runs can take 24–72 hours depending on model size and hardware
Included prompt datasets (per language):
- LLM_AT.json: LLM-generated prompts based on all-time most popular packages
- LLM_LY.json: LLM-generated prompts based on last-year most popular packages
- SO_AT.json: Top Stack Overflow questions (all-time)
- SO_LY.json: Top Stack Overflow questions (last-year)
Each language directory also contains the master list of valid package names used for detection.
We do not publish the master list of hallucinated package names or per-prompt detailed results (see Security & Ethics). Verified researchers can request access.
Reproduce main tables/figures by re-running experiments and then building plots:
# 1) Run experiments (example)
python run_test.py DeepSeek_1B --language Python
python run_test.py CodeLlama_7B --language Python
# ... (repeat for desired model/language combinations)
# 2) Build figures
cd Plots
python reproduce_figures.py

Where applicable, figure scripts read from Tests/ to regenerate the paper plots.
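Because each model/language combination is a separate invocation, a small driver can help queue them. The sketch below only assembles the `run_test.py` command lines shown above; `build_commands` and `run_all` are hypothetical helpers, not part of the repository, and the default is a dry run that prints the commands without starting any multi-hour experiment.

```python
import subprocess
import sys


def build_commands(models, languages, script="run_test.py"):
    """Return one argv list per (model, language) pair."""
    return [
        [sys.executable, script, model, "--language", language]
        for model in models
        for language in languages
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # check=True stops the queue on the first failing run
            subprocess.run(argv, check=True)


commands = build_commands(["DeepSeek_1B", "CodeLlama_7B"], ["Python", "JavaScript"])
run_all(commands)  # dry run: prints four commands
```

Passing `dry_run=False` runs the experiments sequentially from the repository root; the model names must match directories under Models/.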
We provide three mitigation strategies:
- RAG (Retrieval-Augmented Generation)
  Augments prompts with retrieved package context from a vector DB built from package descriptions.
  # Build RAG DB (once)
  python Mitigation/RAG_setup.py
  # Run RAG experiment
  python Mitigation/run_model_RAG.py DeepSeek_1B --language Python
- Self-Detection / Self-Refinement
  The model checks its own suggested package list; if any package is judged invalid, it regenerates with constraints.
  python Mitigation/run_model_SD.py CodeLlama_7B --language Python
- Fine-Tuning
  Fine-tune on valid (non-hallucinated) package recommendations derived from the pipeline.
  # Use the fine-tuned checkpoints under Mitigation/Fine_tuned/
  python Mitigation/run_model_combined.py DeepSeek_1B --language Python
See paper for comparative results; fine-tuning produced the largest hallucination reduction.
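The self-detection idea can be sketched as a check-then-regenerate loop. This is an illustration under stated assumptions, not the logic of `run_model_SD.py`: `generate` and `judge` are hypothetical callables standing in for model calls (one proposes packages for a prompt, the other is the model's own validity judgment), and the constraint wording is invented for the example.

```python
def self_refine(generate, judge, prompt, max_rounds=3):
    """Regenerate until the model judges every package valid, or rounds run out.

    generate(prompt) -> list of package names   (hypothetical model call)
    judge(package)   -> True if judged valid    (hypothetical model call)
    Returns (packages, accepted).
    """
    rejected = set()
    packages = []
    for _ in range(max_rounds):
        constrained = prompt
        if rejected:
            # Feed previously rejected names back as a constraint.
            constrained += "\nDo not use these packages: " + ", ".join(sorted(rejected))
        packages = generate(constrained)
        invalid = [p for p in packages if not judge(p)]
        if not invalid:
            return packages, True
        rejected.update(invalid)
    return packages, False


# Toy stand-ins for the model calls, for demonstration only.
def fake_generate(prompt):
    return ["requests"] if "fakepkg" in prompt else ["fakepkg"]


def fake_judge(package):
    return package == "requests"


print(self_refine(fake_generate, fake_judge, "Fetch a URL in Python"))
```

The round limit matters: without it, a model that keeps proposing invalid names would loop indefinitely.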
- Open-source models were evaluated in quantized form to mimic realistic hardware constraints.
- A single full run can take 24–72 hours depending on model size/GPU availability.
- For reproducibility, stick to the provided environment.yml and keep decoding parameters consistent unless you are explicitly testing RQ2-style variations.
- We do not publicly release:
- The master list of hallucinated package names
- Per-prompt detailed results
- Rationale: releasing these could enable package confusion attacks at scale.
- Access policy: verified researchers may request full results for academic use.
- See the paper’s Ethics Considerations for additional detail.
- Environment fails to resolve: ensure you’re using the listed CUDA/PyTorch versions in environment.yml.
- Model not found: confirm the checkpoint is placed under Models/ and the name matches your CLI arg.
- Very slow runs: you’re likely on CPU; use a CUDA GPU where possible.
- Different hallucination rates than reported: minor variation is expected across hardware/versions; ensure decoding params and temperatures match defaults in the code.
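For the slow-run case, a quick way to confirm whether PyTorch can see a GPU is sketched below. `detect_device` is a hypothetical helper, not part of the repository; it relies only on `torch.cuda.is_available()` and degrades gracefully if PyTorch is not installed.

```python
def detect_device():
    """Report which device PyTorch would use: "cuda", "cpu", or "torch-missing"."""
    try:
        import torch
    except ImportError:
        return "torch-missing"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(detect_device())
```

If this prints `cpu` on a machine with a GPU, the installed PyTorch build likely does not match the local CUDA driver.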
If you use this repository, please cite:
@inproceedings{spracklen2025packagehallucination,
title = {We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs},
author = {Joseph Spracklen and Raveen Wijewickrama and A H M Nazmus Sakib and Anindya Maiti and Bimal Viswanath and Murtuza Jadliwala},
booktitle = {USENIX Security Symposium},
year = {2025}
}

This project is licensed under the MIT License.
Questions or collaboration:
- Joseph Spracklen — joseph.spracklen@utsa.edu
- Or open an issue on the repository