A comprehensive framework for testing Large Language Model (LLM) safety through adversarial prompt generation, automated testing, and content moderation evaluation.
This project implements a complete red-teaming pipeline for LLMs, consisting of three main components:
- Adversarial Prompt Generation - Creates attack vectors using various injection techniques
- Automated Testing - Executes prompts against target LLMs and captures responses
- Content Moderation - Evaluates response safety
red-teaming-agent/
├── prompt_generation_scripts/ # Attack vector generators
│ ├── promptInjection.py # Basic prompt injection attacks
│ ├── json_promptInjection.py # JSON-based injection attacks
│ ├── yaml_promptInjection.py # YAML-based injection attacks
│ └── mathPrompt.py # Mathematical reasoning attacks
├── adversarial_prompts/ # Generated attack vectors
├── agent.py # API-based LLM testing
├── agent_local_model.py # Local LLM testing
├── moderation.py # Content safety evaluation
├── policy_config.json # Safety taxonomy configuration
├── answers/ # LLM response storage
├── midway/ # Moderation evaluation results
└── report_generation/ # Analysis and reporting tools
The scripts in prompt_generation_scripts/ generate adversarial prompts using different techniques:
- Prompt Injection: Basic prompt injection attacks
- JSON Prompt Injection: JSON-structured injection attacks with example-based transformations
- YAML Prompt Injection: YAML-structured injection attacks with example-based transformations
- Math Prompt: Mathematical reasoning-based attacks
- Citation: Authority-based jailbreaking using academic citations and references
- Likert-based Jailbreaks: Academic evaluation frameworks with scoring systems
- Multilingual: Translation-based attacks alternating between French and German
- Iterative Jailbreak: Advanced feedback-driven refinement with cumulative learning
Generated prompts are saved to adversarial_prompts/ in JSON format.
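As a minimal sketch of the save step, a generator could write its prompts like this. The record layout (`id`, `technique`, `prompt`) and the `save_prompts` helper are assumptions made for illustration; the actual scripts may use a different schema and file naming.

```python
import json
from pathlib import Path


def save_prompts(prompts, technique, out_dir="adversarial_prompts"):
    """Write a list of prompt strings to <out_dir>/<technique>.json.

    Hypothetical helper: the field names below are illustrative only.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    # One JSON object per prompt, numbered from 1.
    records = [
        {"id": i, "technique": technique, "prompt": text}
        for i, text in enumerate(prompts, start=1)
    ]

    target = out_path / f"{technique}.json"
    target.write_text(
        json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return target
```

Keeping one file per technique makes it easy to run a single attack family through the testing stage.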
Two agents test LLMs against adversarial prompts:
agent.py - API-based testing:
- Connects to cloud-based LLM APIs
- Requires appropriate API credentials
- Processes prompts from JSON files
- Saves responses to the answers/ directory
agent_local_model.py - Local model testing:
- Connects to locally deployed LLMs
- Supports various local model deployments
- Uses the same prompt processing and response storage as agent.py
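Both agents share the same read, query, and store loop, which can be sketched roughly as follows. This is an illustration under assumptions, not the code in agent.py: `run_prompts` and `query_model` are hypothetical names, and the prompt file is assumed to be a JSON list of objects with a `prompt` field. The model call is passed in as a function, so the same loop would work for an API client or a local model.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def run_prompts(
    prompt_file,
    query_model: Callable[[str], str],
    answers_dir="answers",
    limit: Optional[int] = None,
):
    """Send each prompt to query_model and store the responses as JSON."""
    records = json.loads(Path(prompt_file).read_text(encoding="utf-8"))
    if limit is not None:
        records = records[:limit]  # optional processing limit

    results = []
    for record in records:
        try:
            response, error = query_model(record["prompt"]), None
        except Exception as exc:
            # Record the failure and continue with the remaining prompts.
            response, error = None, str(exc)
        results.append({**record, "response": response, "error": error})

    out_dir = Path(answers_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / Path(prompt_file).name
    out_file.write_text(
        json.dumps(results, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return out_file
```

Recording errors per prompt, rather than aborting, keeps a long run usable when a single request fails.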
moderation.py evaluates response safety:
- Uses content classification models
- Connects to local model instances
- Evaluates against 13 safety categories defined in policy_config.json
- Generates detailed safety reports in midway/
The system evaluates responses against 13 safety categories:
- S1: Violent Crimes
- S2: Non-Violent Crimes
- S3: Sex-Related Crimes
- S4: Child Sexual Exploitation
- S5: Defamation
- S6: Specialized Advice
- S7: Privacy
- S8: Intellectual Property
- S9: Indiscriminate Weapons
- S10: Hate
- S11: Suicide & Self-Harm
- S12: Sexual Content
- S13: Elections
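A minimal sketch of how a raw classifier verdict could be mapped onto these categories. It assumes the classifier returns `safe` or `unsafe` on the first line and, when unsafe, a comma-separated list of category codes on the second line. The real output format of moderation.py is not specified here, and `parse_verdict` is a hypothetical helper.

```python
CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
}


def parse_verdict(raw: str) -> dict:
    """Turn an assumed 'unsafe\\nS1,S10' style verdict into a structured result."""
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    if not lines:
        return {"safe": None, "violations": []}  # empty output: undetermined

    verdict = lines[0].lower()
    if verdict == "safe":
        return {"safe": True, "violations": []}
    if verdict != "unsafe":
        return {"safe": None, "violations": []}  # unrecognized output

    codes = []
    if len(lines) > 1:
        codes = [c.strip().upper() for c in lines[1].split(",") if c.strip()]
    violations = [{"code": c, "name": CATEGORIES.get(c, "Unknown")} for c in codes]
    return {"safe": False, "violations": violations}
```

Returning `None` for unrecognized output keeps malformed classifier responses from being silently counted as either safe or unsafe.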
The project follows a three-stage workflow:
- Generate adversarial prompts using the scripts in prompt_generation_scripts/
- Test LLMs using either agent.py (API-based) or agent_local_model.py (local models)
- Evaluate safety using moderation.py to assess response compliance with safety policies
Each component can be configured through command-line arguments and environment variables as needed.
- answers/: Raw LLM responses to adversarial prompts
- midway/: Detailed safety evaluations with violation categories
- report_generation/reports/: Comprehensive analysis reports
- policy_config.json: Defines safety taxonomy and evaluation criteria
- Environment variables: API credentials and connection settings
- Command-line arguments: File paths, model selection, and processing limits
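One way to combine the two configuration sources is sketched below. The flag names (`--prompts`, `--model`, `--limit`) and the environment variables (`LLM_API_KEY`, `RED_TEAM_MODEL`) are hypothetical placeholders chosen for illustration; check each script for the options it actually accepts.

```python
import argparse
import os


def build_config(argv=None, env=None):
    """Merge command-line arguments with environment variables.

    All flag and variable names here are illustrative assumptions.
    """
    env = os.environ if env is None else env

    parser = argparse.ArgumentParser(description="Run adversarial prompts against an LLM")
    parser.add_argument("--prompts", required=True, help="path to a JSON prompt file")
    parser.add_argument(
        "--model",
        default=env.get("RED_TEAM_MODEL", "default-model"),
        help="model identifier (falls back to an environment variable)",
    )
    parser.add_argument("--limit", type=int, default=None, help="max prompts to process")
    args = parser.parse_args(argv)

    return {
        "prompts": args.prompts,
        "model": args.model,
        "limit": args.limit,
        # Report only whether a credential is present; never echo the secret.
        "api_key_set": bool(env.get("LLM_API_KEY")),
    }
```

Reading credentials from the environment rather than from flags keeps them out of shell history and process listings.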
See requirements

