This repository provides the replication package for our AdaptiveGuard experiments on continual LLM safety alignment.
- Environment Setup
- Repository Structure
- Data Preparation
- Reproduce RQ1
- Reproduce RQ2
- Reproduce RQ3
- Citation
We recommend using Python 3.12 for best compatibility and performance.
To install all necessary dependencies, run:
pip install -r requirements.txtIf you're using an NVIDIA GPU, we highly recommend installing PyTorch with CUDA support to accelerate training and inference. Follow the official installation guide from PyTorch: π https://pytorch.org/get-started/locally
AdaptiveGuard/
βββ scripts/ # Shell scripts for running experiments
βββ src/ # Python source code
βββ data/ # Dataset and data files
βββ models/ # Pre-trained model checkpoints
βββ imgs/ # Images for documentation
βββ requirements.txt # Python dependencies
βββ README.md # This file
Before running experiments, prepare the attack data:
./scripts/combine_attack_files.shThis script combines various attack datasets for evaluation across different jailbreak methods.
We evaluate AdaptiveGuard against 10 different jailbreak attack methods:
- AIM (Always Intelligent and Machiavellian) [1]
- DAN (Do Anything Now) [2]
- Combination (Prefix injection + Refusal Suppression) [3]
- Self Cipher [4]
- Deep Inception [5]
- Caesar Cipher [4]
- Zulu (Low-resource language attacks) [6]
- Base64 (Encoding-based attacks) [3]
- SmartGPT [7]
- Code Chameleon [8]
To reproduce RQ1 results, first train the AdaptiveGuard model, then run the out-of-distribution analysis:
./scripts/train_aegis.sh./scripts/test.sh./scripts/run_ood_analysis.shThis experiment evaluates AdaptiveGuard's energy-based detection capability on the 10 attack types listed above.
Results will be saved in results/ood_analysis_results/ directory.
(RQ2) How quickly does our AdaptiveGuard approach adapt to unknown jailbreak attacks when continuously updated through detected OOD prompts??
To reproduce RQ2 results, run the continual learning experiments:
./scripts/run_continual_learning_lora.sh./scripts/run_llamaguard_continual_learning.shThese experiments demonstrate:
- Defense Success Rate (DSR) improvements over time
- Continual adaptation to new attack patterns
- Comparison with baseline methods without CL
Results will be saved in:
continual_learning_results_lora/: AdaptiveGuard + CL resultsllamaguard_continual_learning_results/: LlamaGuard + CL results
(RQ3) How much does our AdaptiveGuard approach forget original in-distribution prompts after continuous updates with detected OOD prompts?
Note: RQ3 results are automatically generated when running the RQ2 experiments. No separate scripts need to be executed.
To analyze RQ3 results, examine the F1 scores from the RQ2 continual learning experiments:
The F1 scores for catastrophic forgetting analysis can be found in the results directories created during RQ2:
continual_learning_results_lora/: Contains F1 scores for AdaptiveGuard + LoRA continual learningllamaguard_continual_learning_results/: Contains F1 scores for LlamaGuard + LoRA results
This analysis evaluates:
- Catastrophic forgetting on original in-distribution data
- F1-score maintenance across continual learning phases
- Balance between new attack detection and original performance
- Memory efficiency of different adaptation strategies
The results show performance on both:
- Original benign prompts (measuring forgetting)
- New attack patterns (measuring adaptation)
Look for:
- F1 score trends over continual learning iterations
- Performance degradation on original tasks
- Trade-offs between new attack detection and original performance retention
- Comparison across different methods (standard, LoRA, LlamaGuard)
under review[1] Jailbreak Chat. "Jailbreak Chat Prompt." 2023. https://www.jailbreakchat.com/prompt/4f37a029-9dff-4862-b323-c96a5504de5d
[2] Shen, X., Chen, Z., Backes, M., Shen, Y., & Zhang, Y. (2023). "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825.
[3] Wei, A., Haghtalab, N., & Steinhardt, J. (2024). Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36.
[4] Yuan, Y., Jiao, W., Wang, W., Huang, J. T., He, P., Shi, S., & Tu, Z. (2023). Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. arXiv preprint arXiv:2308.06463.
[5] Li, X., Zhou, Z., Zhu, J., Yao, J., Liu, T., & Han, B. (2023). Deepinception: Hypnotize large language model to be jailbreaker. arXiv preprint arXiv:2311.03191.
[6] Yong, Z. X., Menghini, C., & Bach, S. H. (2023). Low-resource languages jailbreak gpt-4. arXiv preprint arXiv:2310.02446.
[7] Kang, D., Li, X., Stoica, I., Guestrin, C., Zaharia, M., & Hashimoto, T. (2024). Exploiting programmatic behavior of llms: Dual-use through standard security attacks. In 2024 IEEE Security and Privacy Workshops (SPW) (pp. 132-143). IEEE.
[8] Lv, H., Wang, X., Zhang, Y., Huang, C., Dou, S., Ye, J., Gui, T., Zhang, Q., & Huang, X. (2024). Codechameleon: Personalized encryption framework for jailbreaking large language models. arXiv preprint arXiv:2402.16717.




