Replicating PAIR on Open-Source LLMs via Groq

Abstract

We replicate the Prompt Automatic Iterative Refinement (PAIR) algorithm from Chao et al. (2023) on a set of recent open-source LLMs hosted on Groq. PAIR is a black-box jailbreaking algorithm that uses an attacker LLM to iteratively generate adversarial prompts which elicit objectionable content from a target LLM. Our goal is to evaluate the jailbreaking susceptibility of four open-source models—Llama-3.1-8B, Llama-4-Scout-17B, Qwen3-32B, and Kimi-K2—using the same methodology as the original paper, while leveraging Groq's fast inference infrastructure to reduce cost and running time.

Experiment Setup

Attacker LLM. For the attacker, we use Qwen3-32B (qwen/qwen3-32b), a 32-billion parameter instruction-tuned model. We select Qwen3-32B as it is the closest available open-source model in capability to the original paper's Mixtral 8x7B (~56B parameters). As in Chao et al. (2023), we use a temperature of $T = 1$ and top-p sampling with $p = 0.9$ to encourage diverse exploration. The attacker system prompt uses the same three criteria from the original paper—role-playing, logical appeal, and authority endorsement—cycled across parallel streams.

Target LLMs. We red team four different open-source LLMs, each representing a distinct model family and safety alignment approach:

Model	Identifier	Parameters	Family
Llama-3.1-8B	`llama-3.1-8b-instant`	8B	Meta Llama 3.1
Llama-4-Scout-17B	`meta-llama/llama-4-scout-17b-16e-instruct`	17B (16 experts)	Meta Llama 4
Qwen3-32B	`qwen/qwen3-32b`	32B	Alibaba Qwen3
Kimi-K2	`moonshotai/kimi-k2-instruct`	—	Moonshot AI Kimi

These models collectively span three distinct model families with independent safety alignment procedures. For each target model, we use a temperature of $T = 0$ and generate $L = 150$ tokens, consistent with the original paper.

Evaluation. We use Llama Guard 4 (meta-llama/llama-guard-4-12b) as the JUDGE function. This matches the original paper's use of Llama Guard, which was selected for its low false positive rate (FPR = 7%) and competitive agreement with human annotations (76%), while being fully open-source and reproducible. We compute the Jailbreak %—the percentage of behaviors that elicit a jailbroken response according to the JUDGE—and the Queries per Success—the average number of queries used for successful jailbreaks.

Jailbreak dataset. We use the custom subset of 50 harmful behaviors from the AdvBench Dataset located in data/harmful_behaviors_custom.csv. This is the same dataset used in the original paper's supplementary experiments (Table 8 of Chao et al. (2023)).

Hyperparameters. Following the original paper, we use $N = 30$ parallel streams, each with a maximum depth of $K = 3$, meaning PAIR uses at most 90 queries per behavior. We retain 4 conversation history turns per stream and generate up to 500 tokens per attacker response. Given a specific behavior, PAIR uses two stopping conditions: finding a successful jailbreak or reaching the maximum number of iterations.

Infrastructure. All experiments are run via the Groq API, which provides fast inference for open-source models. Since PAIR is a black-box algorithm, the entire experiment runs on CPU via API queries. The total estimated cost of the experiment is approximately $6–9 USD across all four target models.

Differences from the Original Paper

Parameter	Original Paper	This Replication
Attacker LLM	Mixtral 8x7B (~56B)	Qwen3-32B
Target LLMs	GPT-3.5/4, Vicuna, Llama-2, Claude-1/2, Gemini	Llama-3.1-8B, Llama-4-Scout, Qwen3-32B, Kimi-K2
JUDGE	Llama Guard	Llama Guard 4
Dataset	JBB-Behaviors (100 behaviors)	AdvBench custom subset (50 behaviors)
Streams ($N$)	30	30
Depth ($K$)	3	3
API Provider	Together AI, OpenAI, Anthropic, Google	Groq

The primary differences are: (1) we target a different set of open-source models that are available on Groq, (2) we use Qwen3-32B as the attacker in place of Mixtral, and (3) we use the 50-behavior AdvBench subset rather than the full 100-behavior JBB-Behaviors dataset. All algorithmic parameters ($N$, $K$, temperature, system prompts) are identical to the original paper.

Running the Experiment

Set your Groq API key:

export GROQ_API_KEY=[YOUR_API_KEY_HERE]

To run the full experiment across all four target models:

python3 pair_groq.py

To run against a single target model:

python3 pair_groq.py --target llama-3.1-8b-instant

To resume an interrupted run:

python3 pair_groq.py --target llama-3.1-8b-instant --resume results/[TIMESTAMP_DIR]

For a quick test with fewer behaviors and streams:

python3 pair_groq.py --n-behaviors 2 --n-streams 5 --target llama-3.1-8b-instant

Results are saved incrementally to results/[TIMESTAMP]/, including per-behavior detail files and per-target summary files with jailbreak success rates and mean queries to jailbreak.

See pair_groq.py for all arguments and descriptions.

Citation

This work is a replication of the PAIR algorithm. Please cite the original paper:

@misc{chao2023jailbreaking,
      title={Jailbreaking Black Box Large Language Models in Twenty Queries},
      author={Patrick Chao and Alexander Robey and Edgar Dobriban and Hamed Hassani and George J. Pappas and Eric Wong},
      year={2023},
      eprint={2310.08419},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

License

This codebase is released under MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
data		data
docker		docker
tests		tests
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
common.py		common.py
config.py		config.py
conversers.py		conversers.py
judges.py		judges.py
language_models.py		language_models.py
loggers.py		loggers.py
main.py		main.py
pair_groq.py		pair_groq.py
pair_mvp.py		pair_mvp.py
system_prompts.py		system_prompts.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Replicating PAIR on Open-Source LLMs via Groq

Abstract

Experiment Setup

Differences from the Original Paper

Running the Experiment

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Replicating PAIR on Open-Source LLMs via Groq

Abstract

Experiment Setup

Differences from the Original Paper

Running the Experiment

Citation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages