Red-Teaming Agent — by Bad Company

A comprehensive framework for testing Large Language Model (LLM) safety through adversarial prompt generation, automated testing, and content moderation evaluation.

Overview

This project implements a complete red-teaming pipeline for LLMs, consisting of three main components:

Adversarial Prompt Generation - Creates attack vectors using various injection techniques
Automated Testing - Executes prompts against target LLMs and captures responses
Content Moderation - Evaluates response safety

Project Structure

red-teaming-agent/
├── prompt_generation_scripts/     # Attack vector generators
│   ├── promptInjection.py         # Basic prompt injection attacks
│   ├── json_promptInjection.py    # JSON-based injection attacks
│   ├── yaml_promptInjection.py    # YAML-based injection attacks
│   └── mathPrompt.py              # Mathematical reasoning attacks
├── adversarial_prompts/           # Generated attack vectors
├── agent.py                       # API-based LLM testing
├── agent_local_model.py           # Local LLM testing
├── moderation.py                  # Content safety evaluation
├── policy_config.json             # Safety taxonomy configuration
├── answers/                       # LLM response storage
├── midway/                        # Moderation evaluation results
└── report_generation/             # Analysis and reporting tools

Components

Prompt Generation Scripts

Located in prompt_generation_scripts/, these scripts generate adversarial prompts using different techniques:

Prompt Injection: Basic prompt injection attacks
JSON Prompt Injection: JSON-structured injection attacks with example-based transformations
YAML Prompt Injection: YAML-structured injection attacks with example-based transformations
Math Prompt: Mathematical reasoning-based attacks
Citation: Authority-based jailbreaking using academic citations and references
Likert-based Jailbreaks: Academic evaluation frameworks with scoring systems
Multilingual: Translation-based attacks alternating between French and German
Iterative Jailbreak: Advanced feedback-driven refinement with cumulative learning

Generated prompts are saved to adversarial_prompts/ in JSON format.
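
As a rough illustration of how a generator might write its output, the sketch below wraps placeholder seed strings in a JSON envelope and saves them as one JSON file per technique. The record fields (id, technique, prompt), the envelope layout, and the function names are assumptions made for this example; the actual scripts in prompt_generation_scripts/ may use a different schema.

```python
import json
from pathlib import Path


def wrap_as_json_task(seed: str) -> str:
    """Embed a seed string in a JSON-structured instruction (hypothetical layout)."""
    envelope = {"task": "complete_field", "input": seed, "output": ""}
    return json.dumps(envelope, ensure_ascii=False)


def save_prompts(seeds, technique, out_dir="adversarial_prompts"):
    """Write every generated prompt for one technique to a single JSON file."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    records = [
        {"id": i, "technique": technique, "prompt": wrap_as_json_task(seed)}
        for i, seed in enumerate(seeds)
    ]
    target = out_path / f"{technique}.json"
    target.write_text(
        json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return target


if __name__ == "__main__":
    # Placeholder seed only; real seeds would come from the test plan.
    print(save_prompts(["PLACEHOLDER_REQUEST"], "json_injection"))
```

Keeping the technique name inside each record makes it possible to group results by attack type later without relying on file names.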

LLM Testing Agents

Two agents for testing LLMs against adversarial prompts:

agent.py - API-based testing:

Connects to cloud-based LLM APIs
Requires appropriate API credentials
Processes prompts from JSON files
Saves responses to answers/ directory

agent_local_model.py - Local model testing:

Connects to locally deployed LLMs
Supports various local model deployments
Uses the same prompt processing and response storage as agent.py
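
The shared loop behind both agents could look roughly like the sketch below: read prompts from a JSON file, send each one to the target model, and store the responses under answers/. The query_model callable stands in for whichever API or local client is used, and the record fields and output file name are assumptions for illustration, not the interface of agent.py.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def run_prompts(
    prompt_file,
    query_model: Callable[[str], str],
    out_dir="answers",
    limit: Optional[int] = None,
):
    """Send each prompt to the model and save prompt/response pairs as JSON."""
    prompt_path = Path(prompt_file)
    prompts = json.loads(prompt_path.read_text(encoding="utf-8"))
    if limit is not None:
        prompts = prompts[:limit]

    results = []
    for record in prompts:
        try:
            answer, error = query_model(record["prompt"]), None
        except Exception as exc:
            # Record the failure and continue so one bad call does not lose the run.
            answer, error = None, str(exc)
        results.append({**record, "response": answer, "error": error})

    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    target = out_path / f"{prompt_path.stem}_answers.json"
    target.write_text(
        json.dumps(results, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return target
```

Passing the model client in as a function keeps the loop identical for API-based and local testing; only the callable changes.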

Content Moderation

moderation.py evaluates response safety:

Uses content classification models
Connects to local model instances
Evaluates against 13 safety categories defined in policy_config.json
Generates detailed safety reports in midway/

Safety Taxonomy

The system evaluates responses against 13 safety categories:

S1: Violent Crimes
S2: Non-Violent Crimes
S3: Sex-Related Crimes
S4: Child Sexual Exploitation
S5: Defamation
S6: Specialized Advice
S7: Privacy
S8: Intellectual Property
S9: Indiscriminate Weapons
S10: Hate
S11: Suicide & Self-Harm
S12: Sexual Content
S13: Elections
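
To show how category codes might be mapped back to names, here is a minimal sketch that parses a classifier verdict. It assumes the classifier replies with "safe", or with "unsafe" followed by a line of comma-separated codes such as "S1,S10"; that output format and the parse_verdict helper are assumptions for this example, and moderation.py may handle results differently.

```python
CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
}


def parse_verdict(raw: str) -> dict:
    """Turn a raw verdict string into a safe flag plus named violations."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty classifier output")

    safe = lines[0].lower() == "safe"
    codes = []
    if not safe and len(lines) > 1:
        # Assumed layout: second line holds comma-separated category codes.
        codes = [c.strip().upper() for c in lines[1].split(",") if c.strip()]

    violations = [
        {"code": code, "name": CATEGORIES.get(code, "Unknown")} for code in codes
    ]
    return {"safe": safe, "violations": violations}
```

Unknown codes are kept and labelled "Unknown" rather than dropped, so a taxonomy mismatch shows up in the evaluation output instead of disappearing silently.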

Usage

The project follows a three-stage workflow:

Generate adversarial prompts using the scripts in prompt_generation_scripts/
Test LLMs using either agent.py (API-based) or agent_local_model.py (local models)
Evaluate safety using moderation.py to assess response compliance with safety policies

Each component can be configured through command-line arguments and environment variables as needed.

Output Files

answers/: Raw LLM responses to adversarial prompts
midway/: Detailed safety evaluations with violation categories
report_generation/reports/: Comprehensive analysis reports

Configuration

policy_config.json: Defines safety taxonomy and evaluation criteria
Environment variables: API credentials and connection settings
Command-line arguments: File paths, model selection, and processing limits
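
A configuration layer along these lines could be sketched as follows. The flag names (--prompts, --model, --limit) and the LLM_API_KEY environment variable are hypothetical, chosen only to mirror the three kinds of settings listed above; check each script's own help output for the real options.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Command-line options for file paths, model selection, and limits."""
    parser = argparse.ArgumentParser(
        description="Run adversarial prompts against a target model"
    )
    parser.add_argument("--prompts", required=True, help="path to a JSON prompt file")
    parser.add_argument("--model", default="example-model", help="model identifier")
    parser.add_argument("--limit", type=int, default=None, help="max prompts to process")
    return parser


def load_settings(argv=None, env=None) -> dict:
    """Merge command-line arguments with credentials from the environment."""
    env = os.environ if env is None else env
    args = build_parser().parse_args(argv)
    api_key = env.get("LLM_API_KEY")  # hypothetical variable name
    if not api_key:
        raise SystemExit("LLM_API_KEY is not set")
    return {
        "prompts": args.prompts,
        "model": args.model,
        "limit": args.limit,
        "api_key": api_key,
    }


if __name__ == "__main__":
    settings = load_settings()
    # Avoid printing the credential itself.
    print({k: v for k, v in settings.items() if k != "api_key"})
```

Reading credentials from the environment rather than from arguments keeps them out of shell history and process listings.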

Dependencies

See requirements.txt for the list of required Python packages.

Technical documentation can be found in the `src/` folder.
