Robust Feature-Locking Technique for Language Models

Locket is a feature-locking technique (FLoTE) that enables pay-to-unlock schemes for LLMs. If you use this repository, please cite:

@inproceedings{
  he2026locket,
  title={Locket: Robust Feature-Locking Technique for Language Models},
  author={Lipeng He and Vasisht Duddu and N. Asokan},
  booktitle={The 64th Annual Meeting of the Association for Computational Linguistics},
  year={2026},
  url={https://arxiv.org/abs/2510.12117}
}

Environment Setup

Experiments were run on Lambda with 8 × NVIDIA A100 40GB GPUs.

1. Conda environment

conda create -n locket python=3.12
conda activate locket

2. Dependencies

Install in the following order to resolve conflicts:

conda install -c pytorch -c nvidia faiss-gpu=1.12.0

pip install datasets==4.0.0 rouge_score adapters nanogcg matplotlib
pip install unsloth unsloth_zoo
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
pip install -U xformers==0.0.29.post3 --index-url https://download.pytorch.org/whl/cu126
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
pip install lion-pytorch fastchat openai google-generativeai wandb
pip install --upgrade 'numpy<2.0' 'pandas>=2.2'
pip install transformers==4.51.3 trl==0.18.2 torchao==0.13.0 peft==0.17.1
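Because the pins above must be installed in order to avoid resolver conflicts, it can be worth verifying afterwards that the intended versions survived. A minimal sanity-check sketch using only the standard library (the package list below is copied from the pinned install commands; extend it as needed):

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned packages from the install commands above.
EXPECTED = {
    "torch": "2.6.0",
    "transformers": "4.51.3",
    "trl": "0.18.2",
    "peft": "0.17.1",
    "datasets": "4.0.0",
}

def check_versions(expected=EXPECTED):
    """Return {package: (expected, installed-or-None)} for every mismatch."""
    problems = {}
    for pkg, want in expected.items():
        try:
            have = version(pkg)
        except PackageNotFoundError:
            have = None
        if have != want:
            problems[pkg] = (want, have)
    return problems

if __name__ == "__main__":
    for pkg, (want, have) in check_versions().items():
        print(f"{pkg}: expected {want}, found {have}")
```

An empty result means every pinned package resolved as expected.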

3. Project setup

pip install -e .

Upload the data/ folder (contains math/, sql/, samsum/ datasets).

Login to HuggingFace and Weights & Biases:

huggingface-cli login
wandb login

Download the Llama-3-8B chat template used by AutoDAN-Turbo's judge:

huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
  --local-dir ./locket/robustness/AutoDAN_Turbo/llm/chat_templates/model_ckpt/meta-llama_Meta-Llama-3-8B-Instruct \
  --local-dir-use-symlinks False

Running Experiments

Long-running jobs should run inside a screen or tmux session with logging enabled, e.g.:

screen -S <name> -L -Logfile /path/to/<name>.log

Step 1 — Train Feature-Locking Adapters

Trains one LoRA adapter per feature via LAT (§4). Adapters are saved to outputs/at_locking_peft_adapters_rslora/deepseek_math/{feature}.

make train_at_locking

Configure LAT_DATASETS and ADAPTER_NAMES in locket/training/lock_at.py to select which features to train.

Step 2 — Evaluate Effectiveness and Utility (R1 & R2)

Single-feature and multi-feature scalability.

make eval_effect

Configure TARGET_MODELS in locket/effectiveness/main.py to select configurations. Results are logged to stdout and saved to logs/.

Step 3 — Evaluate Robustness (R3)

Attack success rates for Many-shot, GCG, TAP, AutoDAN-Turbo.

make eval_robust

Configure TARGET_MODELS, JAILBREAK_METHODS, and JAILBREAK_FEATURES in locket/robustness/main.py. Results are saved as JSON to logs/.
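The actual JSON schema is defined by locket/robustness/main.py, but as an illustration of post-processing the logs/ output, here is a sketch that computes an attack success rate from per-attempt records. The `success` field and one-file-per-run layout are assumptions for illustration, not the repository's actual format:

```python
import json
from pathlib import Path

def attack_success_rate(records):
    """Fraction of jailbreak attempts marked successful.

    `records` is a list of dicts with a boolean "success" field --
    a hypothetical schema, not necessarily what Locket emits.
    """
    if not records:
        return 0.0
    return sum(bool(r.get("success")) for r in records) / len(records)

def asr_from_log(path):
    """Load one JSON log file and report its attack success rate."""
    return attack_success_rate(json.loads(Path(path).read_text()))

if __name__ == "__main__":
    for log in sorted(Path("logs").glob("*.json")):
        print(f"{log.name}: ASR = {asr_from_log(log):.2%}")
```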


Repository Structure

locket/
├── training/
│   ├── lock_at.py          # LAT adapter training (§4)
│   └── LAT/                # Latent Adversarial Training implementation
├── effectiveness/
│   ├── main.py             # Effectiveness + utility evaluation (§6.2, §6.3, §6.5)
│   ├── eval_math.py
│   ├── eval_mmlu.py
│   ├── eval_sql.py
│   └── eval_samsum.py
├── robustness/
│   ├── main.py             # Robustness evaluation (§6.4)
│   ├── gcg.py              # GCG attack
│   ├── tap.py              # TAP attack
│   ├── manyshot.py         # Many-shot jailbreak
│   ├── autodan_turbo.py    # AutoDAN-Turbo attack
│   └── evaluator.py        # JailbreakEvaluator
├── utils/
│   ├── model.py            # get_model(), LOCKET Merging (Algorithm 1)
│   ├── dataset.py          # Dataset loaders
│   ├── tokenizer.py
│   └── prompt.py
├── constants.py            # Hyperparameters and adapter paths
└── typings.py              # Model and dataset enums
data/
├── math/                   # MATH competition dataset
├── sql/                    # SQL Create Context dataset
└── samsum/                 # SAMSum dataset
outputs/
└── at_locking_peft_adapters_rslora/deepseek_math/
    ├── math/               # Trained Math adapter
    ├── sql/                # Trained SQL adapter
    ├── samsum/             # Trained SAMSum adapter
    └── mmlu/               # Trained MMLU adapter

Key Hyperparameters

Parameter        Value                      Description
LoRA rank        64                         Adapter rank (RSLoRA)
PGD steps        16                         LAT inner-loop iterations
PGD layers       embedding, 6, 14, 22, 29   Layers attacked during LAT
Training steps   100                        Total LAT training steps
τ (single)       0.5–0.95                   Per-feature spectral cap (see locket/utils/model.py)
τ (multi)        0.6–0.9                    Multi-feature spectral cap (see locket/utils/model.py)

See Appendix E of the paper for full details.
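The table lists rank 64 with RSLoRA. For reference, rank-stabilized LoRA changes the adapter scaling factor from alpha/r to alpha/sqrt(r), which keeps the update magnitude from collapsing at higher ranks. A minimal numeric sketch; the alpha value is illustrative (an assumption, not the repository's setting), while rank 64 is from the table above:

```python
import math

def lora_scaling(alpha, r, rslora=False):
    """Standard LoRA scales updates by alpha/r; rsLoRA by alpha/sqrt(r)."""
    return alpha / math.sqrt(r) if rslora else alpha / r

# Illustrative alpha; rank 64 matches the table above.
alpha, r = 64, 64
print(lora_scaling(alpha, r))               # plain LoRA: 1.0
print(lora_scaling(alpha, r, rslora=True))  # rsLoRA: 8.0
```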
