islamborghini/JailbreakingLLMs

 
 

Replicating PAIR on Open-Source LLMs via Groq

Abstract

We replicate the Prompt Automatic Iterative Refinement (PAIR) algorithm from Chao et al. (2023) on a set of recent open-source LLMs hosted on Groq. PAIR is a black-box jailbreaking algorithm that uses an attacker LLM to iteratively generate adversarial prompts which elicit objectionable content from a target LLM. Our goal is to evaluate the jailbreaking susceptibility of four open-source models—Llama-3.1-8B, Llama-4-Scout-17B, Qwen3-32B, and Kimi-K2—using the same methodology as the original paper, while leveraging Groq's fast inference infrastructure to reduce cost and running time.

Experiment Setup

Attacker LLM. For the attacker, we use Qwen3-32B (qwen/qwen3-32b), a 32-billion-parameter instruction-tuned model. We select Qwen3-32B because it is the closest available open-source model in capability to the original paper's Mixtral 8x7B (~47B total parameters). As in Chao et al. (2023), we use a temperature of $T = 1$ and top-p sampling with $p = 0.9$ to encourage diverse exploration. The attacker system prompt uses the same three strategies as the original paper (role-playing, logical appeal, and authority endorsement), cycled across the parallel streams.
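One way to picture the cycling is a round-robin assignment of the three strategies to the parallel streams. The sketch below is a minimal illustration rather than the code in pair_groq.py: the function name `assign_strategies` and the round-robin order are assumptions.

```python
from itertools import cycle, islice

# The three attacker system-prompt strategies named in the text.
STRATEGIES = ("role-playing", "logical appeal", "authority endorsement")


def assign_strategies(n_streams: int) -> list[str]:
    """Assign one strategy per stream in round-robin order (assumed scheme)."""
    return list(islice(cycle(STRATEGIES), n_streams))


# With 30 streams, each strategy is used by exactly 10 streams.
assignment = assign_strategies(30)
counts = {name: assignment.count(name) for name in STRATEGIES}
print(counts)
```

Under this assumed scheme, a stream count that is a multiple of three gives every strategy the same share of the query budget.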

Target LLMs. We red team four open-source LLMs from three developers (Meta, Alibaba, and Moonshot AI):

| Model | Identifier | Parameters | Family |
| --- | --- | --- | --- |
| Llama-3.1-8B | llama-3.1-8b-instant | 8B | Meta Llama 3.1 |
| Llama-4-Scout-17B | meta-llama/llama-4-scout-17b-16e-instruct | 17B (16 experts) | Meta Llama 4 |
| Qwen3-32B | qwen/qwen3-32b | 32B | Alibaba Qwen3 |
| Kimi-K2 | moonshotai/kimi-k2-instruct | not stated | Moonshot AI Kimi |

These models collectively span three distinct model families with independent safety alignment procedures. For each target model, we use a temperature of $T = 0$ and generate $L = 150$ tokens, consistent with the original paper.
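The sampling settings for the two roles can be summarized as request parameters. This is a hedged sketch: the helper names `attacker_params` and `target_params` are hypothetical, and the keys (`temperature`, `top_p`, `max_tokens`) assume an OpenAI-style chat-completions request. The 500-token attacker limit is taken from the Hyperparameters paragraph.

```python
ATTACKER_MODEL = "qwen/qwen3-32b"


def attacker_params() -> dict:
    """Attacker sampling: T = 1 and top-p = 0.9 for diverse exploration."""
    return {
        "model": ATTACKER_MODEL,
        "temperature": 1.0,
        "top_p": 0.9,
        "max_tokens": 500,  # attacker response limit from the text
    }


def target_params(model: str) -> dict:
    """Target sampling: T = 0 (deterministic) and L = 150 generated tokens."""
    return {"model": model, "temperature": 0.0, "max_tokens": 150}


target_cfg = target_params("llama-3.1-8b-instant")
print(attacker_params())
print(target_cfg)
```

Keeping the target at temperature 0 means a given adversarial prompt is intended to produce the same response on every call, so variation across streams comes from the attacker.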

Evaluation. We use Llama Guard 4 (meta-llama/llama-guard-4-12b) as the JUDGE function. This follows the original paper's choice of Llama Guard, which was selected for its low false positive rate (FPR = 7%) and competitive agreement with human annotations (76%), while being fully open-source and reproducible. Those figures were reported for the original Llama Guard and may not carry over exactly to Llama Guard 4. We compute two metrics: the Jailbreak %, the percentage of behaviors that elicit a jailbroken response according to the JUDGE, and the Queries per Success, the average number of queries used across successful jailbreaks.
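The two metrics can be computed from per-behavior outcomes as in the following sketch. The `BehaviorResult` record and its field names are hypothetical and are not taken from the repository's result files; the numbers in the example are made up purely to show the arithmetic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BehaviorResult:
    behavior: str       # label of the behavior (placeholder names below)
    jailbroken: bool    # JUDGE verdict for the best attempt
    queries_used: int   # target queries issued for this behavior


def jailbreak_percentage(results: list[BehaviorResult]) -> float:
    """Percentage of behaviors with at least one jailbroken response."""
    if not results:
        return 0.0
    return 100.0 * sum(r.jailbroken for r in results) / len(results)


def queries_per_success(results: list[BehaviorResult]) -> Optional[float]:
    """Mean queries over successful behaviors only; None if there are none."""
    successes = [r.queries_used for r in results if r.jailbroken]
    return sum(successes) / len(successes) if successes else None


# Illustrative data only, not experimental results.
results = [
    BehaviorResult("behavior-a", True, 12),
    BehaviorResult("behavior-b", False, 90),
    BehaviorResult("behavior-c", True, 30),
    BehaviorResult("behavior-d", False, 90),
]
print(jailbreak_percentage(results))  # 50.0
print(queries_per_success(results))   # 21.0
```

Failed behaviors are excluded from Queries per Success, so a model that is rarely jailbroken can still show a low value for that metric; the two numbers are best read together.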

Jailbreak dataset. We use the custom subset of 50 harmful behaviors from the AdvBench dataset, located in data/harmful_behaviors_custom.csv. This is the same dataset used in the original paper's supplementary experiments (Chao et al., 2023, Table 8).

Hyperparameters. Following the original paper, we use $N = 30$ parallel streams, each with a maximum depth of $K = 3$, meaning PAIR uses at most 90 queries per behavior. We retain 4 conversation history turns per stream and generate up to 500 tokens per attacker response. Given a specific behavior, PAIR uses two stopping conditions: finding a successful jailbreak or reaching the maximum number of iterations.
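The query budget and the two stopping conditions can be sketched as a small loop. This is an illustration under stated assumptions, not the implementation in pair_groq.py: `run_pair` and the three callables are hypothetical stand-ins for the attacker, target, and JUDGE API calls, streams are processed sequentially rather than in parallel, and one query is counted per target call.

```python
from typing import Callable, Optional

Turn = tuple[str, str]  # (adversarial prompt, target response)


def run_pair(
    goal: str,
    attacker: Callable[[str, list[Turn]], str],
    target: Callable[[str], str],
    judge: Callable[[str, str], bool],
    n_streams: int = 30,
    max_depth: int = 3,
    keep_turns: int = 4,
) -> tuple[Optional[str], int]:
    """Return (successful prompt or None, number of target queries used)."""
    histories: list[list[Turn]] = [[] for _ in range(n_streams)]
    queries = 0
    for _ in range(max_depth):
        for history in histories:
            # The attacker only sees the most recent turns of its own stream.
            recent = history[-keep_turns:] if keep_turns > 0 else []
            prompt = attacker(goal, recent)
            response = target(prompt)
            queries += 1
            if judge(prompt, response):
                return prompt, queries  # stop: jailbreak found
            history.append((prompt, response))
    return None, queries  # stop: budget of n_streams * max_depth exhausted


# Toy stand-ins so the sketch runs without any API access.
def toy_attacker(goal: str, recent: list[Turn]) -> str:
    return f"{goal} (revision {len(recent)})"


def toy_target(prompt: str) -> str:
    return "OK" if "revision 1" in prompt else "REFUSED"


def toy_judge(prompt: str, response: str) -> bool:
    return response == "OK"


found, used = run_pair("placeholder behavior", toy_attacker, toy_target, toy_judge, n_streams=3)
none_found, budget = run_pair("placeholder behavior", toy_attacker, toy_target, lambda p, r: False)
print(found, used)         # placeholder behavior (revision 1) 4
print(none_found, budget)  # None 90
```

With the defaults, a behavior that is never jailbroken consumes exactly $N \times K = 30 \times 3 = 90$ target queries, which matches the per-behavior ceiling stated above.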

Infrastructure. All experiments are run via the Groq API, which provides fast inference for open-source models. Since PAIR is a black-box algorithm, no local GPU is needed: the entire experiment runs on a CPU-only machine via API queries. We estimate the total cost of the experiment at approximately 6 to 9 USD across all four target models.

Differences from the Original Paper

| Parameter | Original Paper | This Replication |
| --- | --- | --- |
| Attacker LLM | Mixtral 8x7B (~47B total parameters) | Qwen3-32B |
| Target LLMs | GPT-3.5/4, Vicuna, Llama-2, Claude-1/2, Gemini | Llama-3.1-8B, Llama-4-Scout, Qwen3-32B, Kimi-K2 |
| JUDGE | Llama Guard | Llama Guard 4 |
| Dataset | JBB-Behaviors (100 behaviors) | AdvBench custom subset (50 behaviors) |
| Streams ($N$) | 30 | 30 |
| Depth ($K$) | 3 | 3 |
| API Provider | Together AI, OpenAI, Anthropic, Google | Groq |

The primary differences are: (1) we target a different set of open-source models that are available on Groq, (2) we use Qwen3-32B as the attacker in place of Mixtral, (3) we use the 50-behavior AdvBench subset rather than the full 100-behavior JBB-Behaviors dataset, and (4) we use Llama Guard 4 rather than the original Llama Guard as the JUDGE. All algorithmic parameters ($N$, $K$, temperature, system prompts) are identical to the original paper.

Running the Experiment

Set your Groq API key:

export GROQ_API_KEY=[YOUR_API_KEY_HERE]

To run the full experiment across all four target models:

python3 pair_groq.py

To run against a single target model:

python3 pair_groq.py --target llama-3.1-8b-instant

To resume an interrupted run:

python3 pair_groq.py --target llama-3.1-8b-instant --resume results/[TIMESTAMP_DIR]

For a quick test with fewer behaviors and streams:

python3 pair_groq.py --n-behaviors 2 --n-streams 5 --target llama-3.1-8b-instant

Results are saved incrementally to results/[TIMESTAMP]/, including per-behavior detail files and per-target summary files with jailbreak success rates and mean queries to jailbreak.

See pair_groq.py for all arguments and descriptions.

Citation

This work is a replication of the PAIR algorithm. Please cite the original paper:

@misc{chao2023jailbreaking,
      title={Jailbreaking Black Box Large Language Models in Twenty Queries},
      author={Patrick Chao and Alexander Robey and Edgar Dobriban and Hamed Hassani and George J. Pappas and Eric Wong},
      year={2023},
      eprint={2310.08419},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

License

This codebase is released under the MIT License.
