We replicate the Prompt Automatic Iterative Refinement (PAIR) algorithm from Chao et al. (2023) on a set of recent open-source LLMs hosted on Groq. PAIR is a black-box jailbreaking algorithm that uses an attacker LLM to iteratively generate adversarial prompts which elicit objectionable content from a target LLM. Our goal is to evaluate the jailbreaking susceptibility of four open-source models—Llama-3.1-8B, Llama-4-Scout-17B, Qwen3-32B, and Kimi-K2—using the same methodology as the original paper, while leveraging Groq's fast inference infrastructure to reduce cost and running time.
Attacker LLM. For the attacker, we use Qwen3-32B (qwen/qwen3-32b), a 32-billion parameter instruction-tuned model. We select Qwen3-32B as it is the closest available open-source model in capability to the original paper's Mixtral 8x7B (~56B parameters). As in Chao et al. (2023), we use a temperature of
Target LLMs. We red team four different open-source LLMs, each representing a distinct model family and safety alignment approach:
| Model | Identifier | Parameters | Family |
|---|---|---|---|
| Llama-3.1-8B | llama-3.1-8b-instant |
8B | Meta Llama 3.1 |
| Llama-4-Scout-17B | meta-llama/llama-4-scout-17b-16e-instruct |
17B (16 experts) | Meta Llama 4 |
| Qwen3-32B | qwen/qwen3-32b |
32B | Alibaba Qwen3 |
| Kimi-K2 | moonshotai/kimi-k2-instruct |
— | Moonshot AI Kimi |
These models collectively span three distinct model families with independent safety alignment procedures. For each target model, we use a temperature of
Evaluation. We use Llama Guard 4 (meta-llama/llama-guard-4-12b) as the JUDGE function. This matches the original paper's use of Llama Guard, which was selected for its low false positive rate (FPR = 7%) and competitive agreement with human annotations (76%), while being fully open-source and reproducible. We compute the Jailbreak %—the percentage of behaviors that elicit a jailbroken response according to the JUDGE—and the Queries per Success—the average number of queries used for successful jailbreaks.
Jailbreak dataset. We use the custom subset of 50 harmful behaviors from the AdvBench Dataset located in data/harmful_behaviors_custom.csv. This is the same dataset used in the original paper's supplementary experiments (Table 8 of Chao et al. (2023)).
Hyperparameters. Following the original paper, we use
Infrastructure. All experiments are run via the Groq API, which provides fast inference for open-source models. Since PAIR is a black-box algorithm, the entire experiment runs on CPU via API queries. The total estimated cost of the experiment is approximately $6–9 USD across all four target models.
| Parameter | Original Paper | This Replication |
|---|---|---|
| Attacker LLM | Mixtral 8x7B (~56B) | Qwen3-32B |
| Target LLMs | GPT-3.5/4, Vicuna, Llama-2, Claude-1/2, Gemini | Llama-3.1-8B, Llama-4-Scout, Qwen3-32B, Kimi-K2 |
| JUDGE | Llama Guard | Llama Guard 4 |
| Dataset | JBB-Behaviors (100 behaviors) | AdvBench custom subset (50 behaviors) |
| Streams ( |
30 | 30 |
| Depth ( |
3 | 3 |
| API Provider | Together AI, OpenAI, Anthropic, Google | Groq |
The primary differences are: (1) we target a different set of open-source models that are available on Groq, (2) we use Qwen3-32B as the attacker in place of Mixtral, and (3) we use the 50-behavior AdvBench subset rather than the full 100-behavior JBB-Behaviors dataset. All algorithmic parameters (
Set your Groq API key:
export GROQ_API_KEY=[YOUR_API_KEY_HERE]
To run the full experiment across all four target models:
python3 pair_groq.py
To run against a single target model:
python3 pair_groq.py --target llama-3.1-8b-instant
To resume an interrupted run:
python3 pair_groq.py --target llama-3.1-8b-instant --resume results/[TIMESTAMP_DIR]
For a quick test with fewer behaviors and streams:
python3 pair_groq.py --n-behaviors 2 --n-streams 5 --target llama-3.1-8b-instant
Results are saved incrementally to results/[TIMESTAMP]/, including per-behavior detail files and per-target summary files with jailbreak success rates and mean queries to jailbreak.
See pair_groq.py for all arguments and descriptions.
This work is a replication of the PAIR algorithm. Please cite the original paper:
@misc{chao2023jailbreaking,
title={Jailbreaking Black Box Large Language Models in Twenty Queries},
author={Patrick Chao and Alexander Robey and Edgar Dobriban and Hamed Hassani and George J. Pappas and Eric Wong},
year={2023},
eprint={2310.08419},
archivePrefix={arXiv},
primaryClass={cs.LG}
}This codebase is released under MIT License.