Common patterns and advanced configurations for different research and evaluation scenarios.
Perfect for verifying setup and model access:
uv run python gray-zone-bench.py \
--models gpt-5-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--num-prompts 1

Compare several models on one prompt:
uv run python gray-zone-bench.py \
--models gpt-5-mini claude-3-haiku gemini-2-5-flash \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--num-prompts 1 \
--info

When troubleshooting issues:
uv run python gray-zone-bench.py \
--models gpt-5-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--num-prompts 1 \
--debug

Good for development and initial testing:
uv run python gray-zone-bench.py \
--models gpt-5-mini claude-3-haiku \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--num-prompts 10 \
--shuffle --seed 42 \
--info

Standard research evaluation size:
uv run python gray-zone-bench.py \
--models gpt-5-mini claude-3-haiku gemini-2-5-flash \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--num-prompts 50 \
--shuffle --seed 42 \
--info

Comprehensive evaluation across all available prompts:
uv run python gray-zone-bench.py \
--models gpt-5-mini claude-3-haiku \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--num-prompts ALL \
--info

Compare traditional vs. safe-completion models:
# Traditional model (more refusal-based)
uv run python gray-zone-bench.py \
--models gpt-4-turbo \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--num-prompts 100 \
--info
# Safe-completion model (more nuanced)
uv run python gray-zone-bench.py \
--models gpt-5-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--num-prompts 100 \
--info

Test reasoning-capable models with different effort levels:
# Standard reasoning
uv run python gray-zone-bench.py \
--models o4-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--reasoning-effort medium \
--num-prompts 25 \
--info
# High-effort reasoning (slower, more thorough)
uv run python gray-zone-bench.py \
--models o3-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--reasoning-effort high \
--num-prompts 10 \
--info

Test dual-use scenarios with legitimate context:
uv run python gray-zone-bench.py \
--models gpt-5-mini claude-3-haiku \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--hf-config contextual \
--context-column context \
--num-prompts 50 \
--info

Focus on chemical/biological scenarios:
uv run python gray-zone-bench.py \
--models gpt-5-mini claude-3-haiku \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--hf-config contextual \
--context-column context \
--category-filter chemical_biological \
--num-prompts ALL \
--info

Evaluate cyber-related dual-use scenarios:
uv run python gray-zone-bench.py \
--models gpt-5-mini gemini-2-5-flash \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--hf-config contextual \
--context-column context \
--category-filter cybercrime_intrusion \
--num-prompts ALL \
--info

Test responses to physical violence prompts:
uv run python gray-zone-bench.py \
--models gpt-5-mini claude-3-haiku \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--hf-config contextual \
--context-column context \
--category-filter physical_harm \
--num-prompts ALL \
--info

Process the dataset in chunks to manage resources:
# First batch: prompts 0-49
uv run python gray-zone-bench.py \
--models gpt-5-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--start-index 0 --num-prompts 50 \
--info
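Rather than writing each batch out by hand, the chunking can be scripted as a loop. This is a sketch: `CHUNK` and `TOTAL` are assumed values, and the `echo` makes it a dry run (pipe the output to `bash` to actually execute the batches):

```shell
# Dry run: print one gray-zone-bench command per chunk of prompts.
# CHUNK and TOTAL are assumed values; pipe the output to `bash` to execute.
CHUNK=50
TOTAL=150
batches=0
for ((start = 0; start < TOTAL; start += CHUNK)); do
  echo "uv run python gray-zone-bench.py --models gpt-5-mini" \
       "--judge-model gpt-5-mini --judge-task both" \
       "--hf-dataset raxITLabs/GrayZone" \
       "--start-index $start --num-prompts $CHUNK --info"
  batches=$((batches + 1))
done
echo "generated $batches batch command(s)"
```

With `TOTAL=150` and `CHUNK=50` this prints three commands, covering start indices 0, 50, and 100.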
# Second batch: prompts 50-99
uv run python gray-zone-bench.py \
--models gpt-5-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--start-index 50 --num-prompts 50 \
--info

Get a representative sample across the dataset:
uv run python gray-zone-bench.py \
--models gpt-5-mini claude-3-haiku \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--shuffle --seed 42 \
--start-index 0 --num-prompts 100 \
--info

Maximum quality settings for research publications:
uv run python gray-zone-bench.py \
--models gpt-5-mini claude-3-haiku \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--temperature 0.0 \
--max-output-tokens 8192 \
--num-prompts 100 \
--shuffle --seed 42 \
--info

Minimize API costs while maintaining quality:
uv run python gray-zone-bench.py \
--models gpt-5-mini claude-3-haiku \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--max-output-tokens 2048 \
--num-prompts 50 \
--quiet

Focus exclusively on safety evaluation:
uv run python gray-zone-bench.py \
--models gpt-5-mini claude-3-haiku gemini-2-5-flash \
--judge-model gpt-5-mini \
--judge-task safety \
--hf-dataset raxITLabs/GrayZone \
--num-prompts 100 \
--info

Focus exclusively on helpfulness within safety constraints:
uv run python gray-zone-bench.py \
--models gpt-5-mini claude-3-haiku \
--judge-model gpt-5-mini \
--judge-task helpfulness \
--hf-dataset raxITLabs/GrayZone \
--num-prompts 100 \
--info

Direct logs to a specific location:
uv run python gray-zone-bench.py \
--models gpt-5-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--num-prompts 10 \
--log-file /path/to/experiment.log \
--info

Minimal output suitable for automation:
uv run python gray-zone-bench.py \
--models gpt-5-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--num-prompts 50 \
--quiet

Study how sampling temperature affects gray zone navigation:
# Conservative (deterministic)
uv run python gray-zone-bench.py \
--models gpt-5-mini \
--judge-model gpt-5-mini \
--hf-dataset raxITLabs/GrayZone \
--temperature 0.0 \
--num-prompts 25 \
--info
# Moderate creativity
uv run python gray-zone-bench.py \
--models gpt-5-mini \
--judge-model gpt-5-mini \
--hf-dataset raxITLabs/GrayZone \
--temperature 0.7 \
--num-prompts 25 \
--info
# High creativity
uv run python gray-zone-bench.py \
--models gpt-5-mini \
--judge-model gpt-5-mini \
--hf-dataset raxITLabs/GrayZone \
--temperature 1.0 \
--num-prompts 25 \
--info

Compare models across different AI providers:
uv run python gray-zone-bench.py \
--models gpt-5-mini claude-3-haiku gemini-2-5-flash \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--num-prompts 100 \
--shuffle --seed 42 \
--info

Test how the choice of judge model affects evaluation:
# GPT judge
uv run python gray-zone-bench.py \
--models claude-3-haiku \
--judge-model gpt-5-mini \
--hf-dataset raxITLabs/GrayZone \
--num-prompts 25 \
--info
# Claude judge (via Bedrock)
uv run python gray-zone-bench.py \
--models gpt-5-mini \
--judge-model claude-3-haiku \
--hf-dataset raxITLabs/GrayZone \
--num-prompts 25 \
--info

Generate clean output for further processing:
# Run evaluation
uv run python gray-zone-bench.py \
--models gpt-5-mini claude-3-haiku gemini-2-5-flash \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--num-prompts 100 \
--shuffle --seed 42 \
--quiet
# Results are in out/harmbench_standard/results_*.json
# Load in Python/R for statistical analysis

Track model performance over time:
# Save results with timestamp for comparison
DATE=$(date +%Y%m%d)
uv run python gray-zone-bench.py \
--models gpt-5-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--num-prompts 100 \
--seed 42 \
--log-file "logs/eval_${DATE}.log" \
--info
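To compare runs later, the dated logs can be summarized with a short sketch. The `logs/eval_*.log` naming follows the command above; what you count or grep inside each log depends on your log format:

```shell
# Summarize dated evaluation logs: list each file with its line count.
# The logs/eval_*.log naming matches the dated-log command above.
count=0
for f in logs/eval_*.log; do
  [ -e "$f" ] || continue            # glob matched nothing; skip
  printf '%s\t%s lines\n' "$f" "$(wc -l < "$f")"
  count=$((count + 1))
done
echo "$count dated log(s) found"
```

The `[ -e "$f" ]` guard skips the literal unmatched glob when no logs exist yet, so the sketch is safe to run before the first evaluation.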
# Compare results across dates

The tool automatically processes models in parallel, but you can optimize for your hardware:
# For high-memory systems, process more prompts
uv run python gray-zone-bench.py \
--models gpt-5-mini claude-3-haiku gemini-2-5-flash \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--num-prompts 200 \
--info
# For limited resources, use smaller batches
uv run python gray-zone-bench.py \
--models gpt-5-mini \
--judge-model gpt-5-mini \
--judge-task both \
--hf-dataset raxITLabs/GrayZone \
--start-index 0 --num-prompts 25 \
--info

These examples provide a foundation for various research scenarios. Adjust parameters based on your specific research questions, computational resources, and API rate limits.
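One last practical tip: before a large or costly run, a quick pre-flight check for provider credentials can save a wasted batch. The variable names below are assumptions; match them to the providers you actually use:

```shell
# Pre-flight: report which provider API keys are missing from the environment.
# The variable names are assumptions; adjust them to your providers.
missing=""
for var in OPENAI_API_KEY ANTHROPIC_API_KEY GOOGLE_API_KEY; do
  [ -n "${!var:-}" ] || missing="$missing $var"   # bash indirect expansion
done
if [ -n "$missing" ]; then
  echo "Missing:$missing"
else
  echo "All provider API keys are set"
fi
```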