Robust Feature-Locking Technique for Language Models

Locket is a feature-locking technique (FLoTE) that enables pay-to-unlock schemes for LLMs. If you use this repository, please cite:

@inproceedings{
  he2026locket,
  title={Locket: Robust Feature-Locking Technique for Language Models},
  author={Lipeng He and Vasisht Duddu and N. Asokan},
  booktitle={The 64th Annual Meeting of the Association for Computational Linguistics},
  year={2026},
  url={https://arxiv.org/abs/2510.12117}
}

Environment Setup

Experiments were run on Lambda with 8 × NVIDIA A100 40GB GPUs.

1. Conda environment

conda create -n locket python=3.12
conda activate locket

2. Dependencies

Install in the following order to resolve conflicts:

conda install -c pytorch -c nvidia faiss-gpu=1.12.0

pip install datasets==4.0.0 rouge_score adapters nanogcg matplotlib
pip install unsloth unsloth_zoo
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
pip install -U xformers==0.0.29.post3 --index-url https://download.pytorch.org/whl/cu126
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
pip install lion-pytorch fastchat openai google-generativeai wandb
pip install --upgrade 'numpy<2.0' 'pandas>=2.2'
pip install transformers==4.51.3 trl==0.18.2 torchao==0.13.0 peft==0.17.1
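Because the pins above must be installed in order to avoid resolver conflicts, it can be worth verifying afterwards that the intended versions survived. A minimal sanity-check sketch using only the standard library (the package list below is copied from the pinned install commands; extend it as needed):

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned packages from the install commands above.
EXPECTED = {
    "torch": "2.6.0",
    "transformers": "4.51.3",
    "trl": "0.18.2",
    "peft": "0.17.1",
    "datasets": "4.0.0",
}

def check_versions(expected=EXPECTED):
    """Return {package: (expected, installed-or-None)} for every mismatch."""
    problems = {}
    for pkg, want in expected.items():
        try:
            have = version(pkg)
        except PackageNotFoundError:
            have = None
        if have != want:
            problems[pkg] = (want, have)
    return problems

if __name__ == "__main__":
    for pkg, (want, have) in check_versions().items():
        print(f"{pkg}: expected {want}, found {have}")
```

An empty result means every pinned package resolved as expected.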

3. Project setup

pip install -e .

Upload the data/ folder (contains math/, sql/, samsum/ datasets).

Login to HuggingFace and Weights & Biases:

huggingface-cli login
wandb login

Download the Llama-3-8B chat template used by AutoDAN-Turbo's judge:

huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
  --local-dir ./locket/robustness/AutoDAN_Turbo/llm/chat_templates/model_ckpt/meta-llama_Meta-Llama-3-8B-Instruct \
  --local-dir-use-symlinks False

Running Experiments

Long-running jobs should run inside a screen or tmux session with logging enabled, e.g.:

screen -S <name> -L -Logfile /path/to/<name>.log

Step 1 — Train Feature-Locking Adapters

Trains one LoRA adapter per feature via LAT (§4). Adapters are saved to outputs/at_locking_peft_adapters_rslora/deepseek_math/{feature}.

make train_at_locking

Configure LAT_DATASETS and ADAPTER_NAMES in locket/training/lock_at.py to select which features to train.

Step 2 — Evaluate Effectiveness and Utility (R1 & R2)

Single-feature and multi-feature scalability.

make eval_effect

Configure TARGET_MODELS in locket/effectiveness/main.py to select configurations. Results are logged to stdout and saved to logs/.

Step 3 — Evaluate Robustness (R3)

Attack success rates for Many-shot, GCG, TAP, AutoDAN-Turbo.

make eval_robust

Configure TARGET_MODELS, JAILBREAK_METHODS, and JAILBREAK_FEATURES in locket/robustness/main.py. Results are saved as JSON to logs/.
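The actual JSON schema is defined by locket/robustness/main.py, but as an illustration of post-processing the logs/ output, here is a sketch that computes an attack success rate from per-attempt records. The `success` field and one-file-per-run layout are assumptions for illustration, not the repository's actual format:

```python
import json
from pathlib import Path

def attack_success_rate(records):
    """Fraction of jailbreak attempts marked successful.

    `records` is a list of dicts with a boolean "success" field --
    a hypothetical schema, not necessarily what Locket emits.
    """
    if not records:
        return 0.0
    return sum(bool(r.get("success")) for r in records) / len(records)

def asr_from_log(path):
    """Load one JSON log file and report its attack success rate."""
    return attack_success_rate(json.loads(Path(path).read_text()))

if __name__ == "__main__":
    for log in sorted(Path("logs").glob("*.json")):
        print(f"{log.name}: ASR = {asr_from_log(log):.2%}")
```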


Repository Structure

locket/
├── training/
│   ├── lock_at.py          # LAT adapter training (§4)
│   └── LAT/                # Latent Adversarial Training implementation
├── effectiveness/
│   ├── main.py             # Effectiveness + utility evaluation (§6.2, §6.3, §6.5)
│   ├── eval_math.py
│   ├── eval_mmlu.py
│   ├── eval_sql.py
│   └── eval_samsum.py
├── robustness/
│   ├── main.py             # Robustness evaluation (§6.4)
│   ├── gcg.py              # GCG attack
│   ├── tap.py              # TAP attack
│   ├── manyshot.py         # Many-shot jailbreak
│   ├── autodan_turbo.py    # AutoDAN-Turbo attack
│   └── evaluator.py        # JailbreakEvaluator
├── utils/
│   ├── model.py            # get_model(), LOCKET Merging (Algorithm 1)
│   ├── dataset.py          # Dataset loaders
│   ├── tokenizer.py
│   └── prompt.py
├── constants.py            # Hyperparameters and adapter paths
└── typings.py              # Model and dataset enums
data/
├── math/                   # MATH competition dataset
├── sql/                    # SQL Create Context dataset
└── samsum/                 # SAMSum dataset
outputs/
└── at_locking_peft_adapters_rslora/deepseek_math/
    ├── math/               # Trained Math adapter
    ├── sql/                # Trained SQL adapter
    ├── samsum/             # Trained SAMSum adapter
    └── mmlu/               # Trained MMLU adapter

Key Hyperparameters

Parameter        Value                      Description
LoRA rank        64                         Adapter rank (RSLoRA)
PGD steps        16                         LAT inner-loop iterations
PGD layers       embedding, 6, 14, 22, 29   Layers attacked during LAT
Training steps   100                        Total LAT training steps
τ (single)       0.5–0.95                   Per-feature spectral cap (see locket/utils/model.py)
τ (multi)        0.6–0.9                    Multi-feature spectral cap (see locket/utils/model.py)

See Appendix E of the paper for full details.
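The table lists rank 64 with RSLoRA. For reference, rank-stabilized LoRA changes the adapter scaling factor from alpha/r to alpha/sqrt(r), which keeps the update magnitude from collapsing at higher ranks. A minimal numeric sketch; the alpha value is illustrative (an assumption, not the repository's setting), while rank 64 is from the table above:

```python
import math

def lora_scaling(alpha, r, rslora=False):
    """Standard LoRA scales updates by alpha/r; rsLoRA by alpha/sqrt(r)."""
    return alpha / math.sqrt(r) if rslora else alpha / r

# Illustrative alpha; rank 64 matches the table above.
alpha, r = 64, 64
print(lora_scaling(alpha, r))               # plain LoRA: 1.0
print(lora_scaling(alpha, r, rslora=True))  # rsLoRA: 8.0
```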
