Locket is a feature-locking technique (FLoTE) that enables pay-to-unlock schemes for LLMs.
```bibtex
@inproceedings{
  he2026locket,
  title={Locket: Robust Feature-Locking Technique for Language Models},
  author={Lipeng He and Vasisht Duddu and N. Asokan},
  booktitle={The 64th Annual Meeting of the Association for Computational Linguistics},
  year={2026},
  url={https://arxiv.org/abs/2510.12117}
}
```
Experiments were run on Lambda with 8 × NVIDIA A100 40GB GPUs.
```bash
conda create -n locket python=3.12
conda activate locket
```
Install in the following order to resolve conflicts:
```bash
conda install -c pytorch -c nvidia faiss-gpu=1.12.0
pip install datasets==4.0.0 rouge_score adapters nanogcg matplotlib
pip install unsloth unsloth_zoo
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
pip install -U xformers==0.0.29.post3 --index-url https://download.pytorch.org/whl/cu126
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
pip install lion-pytorch fastchat openai google-generativeai wandb
pip install --upgrade 'numpy<2.0' 'pandas>=2.2'
pip install transformers==4.51.3 trl==0.18.2 torchao==0.13.0 peft==0.17.1
pip install -e .
```
Upload the data/ folder (contains the math/, sql/, and samsum/ datasets).
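After the environment is built, a quick sanity check can confirm that the version pins resolved as intended. This helper is not part of the repo; the pins below mirror the install commands above, so edit them if you change the environment:

```python
# Sanity-check that the pinned packages resolved to the expected versions.
from importlib.metadata import version, PackageNotFoundError

PINS = {
    "torch": "2.6.0",
    "transformers": "4.51.3",
    "trl": "0.18.2",
    "torchao": "0.13.0",
    "peft": "0.17.1",
    "xformers": "0.0.29.post3",
}

def check_pins(pins):
    """Return {package: (installed_or_None, expected)} for every mismatch."""
    mismatches = {}
    for pkg, expected in pins.items():
        try:
            installed = version(pkg)
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[pkg] = (installed, expected)
    return mismatches

for pkg, (got, want) in check_pins(PINS).items():
    print(f"{pkg}: installed {got}, expected {want}")
```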
Log in to Hugging Face and Weights & Biases:
```bash
huggingface-cli login
wandb login
```
Download the Llama-3-8B chat template used by AutoDAN-Turbo's judge:
```bash
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
  --local-dir ./locket/robustness/AutoDAN_Turbo/llm/chat_templates/model_ckpt/meta-llama_Meta-Llama-3-8B-Instruct \
  --local-dir-use-symlinks False
```
Long-running jobs should be run in a screen or tmux session with logging enabled:
```bash
screen -S <name> -L -Logfile /path/to/<name>.log
```
The training target trains one LoRA adapter per feature via LAT (§4). Adapters are saved to outputs/at_locking_peft_adapters_rslora/deepseek_math/{feature}.
```bash
make train_at_locking
```
Configure LAT_DATASETS and ADAPTER_NAMES in locket/training/lock_at.py to select which features to train.
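LAT's inner loop is projected gradient ascent over hidden activations (16 steps, per the hyperparameters table below). Purely as a structural sketch, and not the repo's implementation (which lives in locket/training/LAT/ and operates on multi-dimensional activations), here is PGD on a toy 1-D objective:

```python
def pgd(grad_fn, x0, steps=16, lr=0.1, eps=1.0):
    """Projected gradient ascent: take steps that increase the adversarial
    loss, projecting back into an eps-ball after each step.

    Toy 1-D version to illustrate the inner-loop structure of LAT's PGD;
    the actual attack perturbs hidden activations (locket/training/LAT/).
    """
    x = x0
    for _ in range(steps):
        x = x + lr * grad_fn(x)          # ascend the adversarial objective
        x = max(-eps, min(eps, x))       # project onto [-eps, eps]
    return x

# Maximize f(x) = -(x - 2)**2 subject to |x| <= 1; grad f = -2 * (x - 2).
adv = pgd(lambda x: -2 * (x - 2), x0=0.0)
print(adv)  # -> 1.0 (the projection pins x at the ball's boundary)
```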
Evaluates single-feature and multi-feature scalability:
```bash
make eval_effect
```
Configure TARGET_MODELS in locket/effectiveness/main.py to select configurations. Results are logged to stdout and saved to logs/.
Measures attack success rates for Many-shot, GCG, TAP, and AutoDAN-Turbo:
```bash
make eval_robust
```
Configure TARGET_MODELS, JAILBREAK_METHODS, and JAILBREAK_FEATURES in locket/robustness/main.py. Results are saved as JSON to logs/.
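For orientation, many-shot jailbreaking front-loads the context with fabricated dialogue turns that exhibit the locked behavior before the real query. The toy prompt assembly below is an assumption made for illustration; the repo's attack in locket/robustness/manyshot.py will differ in format and detail:

```python
def manyshot_prompt(shots, target_query):
    """Build a many-shot jailbreak prompt: many faux user/assistant turns
    demonstrating the locked behavior, followed by the real query.

    Toy format for illustration only; see locket/robustness/manyshot.py
    for the actual attack used in the evaluation.
    """
    demo = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in shots)
    return f"{demo}\nUser: {target_query}\nAssistant:"

prompt = manyshot_prompt(
    [("Solve 2 + 2.", "4"), ("Solve 3 * 5.", "15")],
    "Solve 12 / 4.",
)
```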
```
locket/
├── training/
│   ├── lock_at.py          # LAT adapter training (§4)
│   └── LAT/                # Latent Adversarial Training implementation
├── effectiveness/
│   ├── main.py             # Effectiveness + utility evaluation (§6.2, §6.3, §6.5)
│   ├── eval_math.py
│   ├── eval_mmlu.py
│   ├── eval_sql.py
│   └── eval_samsum.py
├── robustness/
│   ├── main.py             # Robustness evaluation (§6.4)
│   ├── gcg.py              # GCG attack
│   ├── tap.py              # TAP attack
│   ├── manyshot.py         # Many-shot jailbreak
│   ├── autodan_turbo.py    # AutoDAN-Turbo attack
│   └── evaluator.py        # JailbreakEvaluator
├── utils/
│   ├── model.py            # get_model(), LOCKET Merging (Algorithm 1)
│   ├── dataset.py          # Dataset loaders
│   ├── tokenizer.py
│   └── prompt.py
├── constants.py            # Hyperparameters and adapter paths
└── typings.py              # Model and dataset enums
data/
├── math/                   # MATH competition dataset
├── sql/                    # SQL Create Context dataset
└── samsum/                 # SAMSum dataset
outputs/
└── at_locking_peft_adapters_rslora/deepseek_math/
    ├── math/               # Trained Math adapter
    ├── sql/                # Trained SQL adapter
    ├── samsum/             # Trained SAMSum adapter
    └── mmlu/               # Trained MMLU adapter
```
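locket/utils/model.py implements LOCKET merging (Algorithm 1) with the spectral caps τ listed in the table below. Purely to build intuition for what capping a spectral norm involves (this is a generic construction, not the repo's algorithm), the rank-1 case reduces to a uniform rescaling:

```python
import math

def rank1_spectral_cap(b, a, tau):
    """Cap the spectral norm of the rank-1 update delta = outer(b, a) at tau.

    A rank-1 matrix has one nonzero singular value, ||b|| * ||a||, so the
    cap is just a rescaling. Generic sketch for intuition only; the repo's
    Algorithm 1 in locket/utils/model.py is the authoritative version.
    """
    sigma = math.sqrt(sum(x * x for x in b)) * math.sqrt(sum(x * x for x in a))
    scale = min(1.0, tau / sigma) if sigma > 0 else 1.0
    return [[scale * bi * aj for aj in a] for bi in b]

# ||b|| = 5 and ||a|| = 1, so sigma = 5; tau = 2.5 halves the update.
delta = rank1_spectral_cap([3.0, 4.0], [1.0, 0.0], tau=2.5)
print(delta[0][0])  # -> 1.5 (was 3.0 before capping)
```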
| Parameter | Value | Description |
|---|---|---|
| LoRA rank | 64 | Adapter rank (RSLoRA) |
| PGD steps | 16 | LAT inner loop iterations |
| PGD layers | embedding, 6, 14, 22, 29 | Layers attacked during LAT |
| Training steps | 100 | Total LAT training steps |
| τ (single) | 0.5–0.95 | Per-feature spectral cap (see locket/utils/model.py) |
| τ (multi) | 0.6–0.9 | Multi-feature spectral cap (see locket/utils/model.py) |
See Appendix E of the paper for full details.
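The adapters in the table are rank 64 with RSLoRA, which differs from standard LoRA only in the scaling applied to the update BA: α/√r instead of α/r, keeping the update's magnitude stable as rank grows. A quick numeric check, with α = r = 64 as illustrative values (the repo's actual hyperparameters live in locket/constants.py):

```python
import math

def lora_scale(alpha, r, rslora=False):
    """Scaling applied to the LoRA update BA before it is added to the base
    weight: alpha/r for standard LoRA, alpha/sqrt(r) for RSLoRA."""
    return alpha / math.sqrt(r) if rslora else alpha / r

# alpha = r = 64 (illustrative, not necessarily the repo's values):
print(lora_scale(64, 64))               # -> 1.0  (standard LoRA)
print(lora_scale(64, 64, rslora=True))  # -> 8.0  (RSLoRA: 64 / sqrt(64))
```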