Guardians or Judges? AI Guardrails as an Ethical Minefield
An interactive live demo system demonstrating how identical Large Language Models make fundamentally different ethical decisions based solely on their system prompt configurations.
Ethical Sentinels is a conference presentation demo that showcases how the same AI model produces different security decisions when configured with different system prompts. Perfect for discussions about AI safety, ethics in automation, and the challenges of implementing guardrails in production systems.
- 1 Model β Jinx-GPT-OSS-20B (uncensored LLM)
- 4 AI Agents β 1 objective evaluator + 3 decision-making sentinels
- 3 Philosophies β Zero-trust, balanced, and permissive approaches
- 16 Scenarios β From harmless requests to sophisticated prompt injection attacks
- π― The Evaluator - Objective risk assessor (4 dimensions: risk, ethics, data sensitivity, illegitimacy)
- π‘οΈ The Guardian - Security-first, zero-trust, accepts false positives
- βοΈ The Mediator - Balanced approach, context-aware, graduated responses
- ποΈ The Visionary - Minimal restrictions, maximum efficiency, grants everything
Hardware:
- NVIDIA GPU with 24GB+ VRAM
- 32GB+ RAM recommended
- 30GB free disk space
Software:
- Docker and Docker Compose
- NVIDIA Container Runtime
- Python 3.10+
git clone https://github.com/Aeraxon/ethical-sentinels
cd ethical-sentinelsImportant: Download the model before first startup.
python3 scripts/download_jinx_python.pyExpected output:
β
Download successful!
Model at: ./models/jinx-gpt-oss-20b/jinx-gpt-oss-20b-Q4_K_M.gguf
Size: 15.76 GB
cp .env.example .env
# Edit .env if needed (defaults work for most setups)# Build and start all containers
docker compose up -d
# Monitor startup
docker compose logs -f llama_serverWait for: llama-server listening at http://0.0.0.0:8000
# In browser:
http://localhost:8501Verify all services are running:
docker compose ps
# Expected output:
# llama_server: healthy
# client: healthy
# gui: healthyAPI Tests:
# llama.cpp Server
curl http://localhost:8000/v1/models
# Client API
curl http://localhost:5000/health
# GUI
curl http://localhost:8501/_stcore/healthβββββββββββββββ
β Browser β β User interacts with Streamlit GUI
ββββββββ¬βββββββ
β HTTP
ββββββββΌβββββββββββββββββββββββββββββββββββββββββββ
β GUI (Streamlit) β
β β’ Scenario selection (16 predefined) β
β β’ Real-time 3-column display β
β β’ Bilingual support (DE/EN) β
ββββββββ¬βββββββββββββββββββββββββββββββββββββββββββ
β REST API
ββββββββΌβββββββββββββββββββββββββββββββββββββββββββ
β Client API (FastAPI) β
β β’ Orchestrates evaluation workflow β
β β’ Parallel agent execution β
β β’ Decision logic (threshold-based) β
ββββββββ¬βββββββββββββββββββββββββββββββββββββββββββ
β OpenAI-compatible API
ββββββββΌβββββββββββββββββββββββββββββββββββββββββββ
β llama.cpp Server (CUDA) β
β β’ Jinx-GPT-OSS-20B (Q4_K_M, 15.8GB) β
β β’ 4 parallel slots (1 evaluator + 3 sentinels)β
β β’ Temperature: 0.1 (evaluator), 0.3 (sentinels)β
βββββββββββββββββββββββββββββββββββββββββββββββββββ
For detailed architecture information:
The system includes 16 predefined scenarios across 4 risk levels:
- β Office management dashboard
- π Weekend database query
- π Competitor scraping
- π Blockchain experiment
- π Vulnerability scanner
- π£ Phishing test generator
- π Medical data research
- π Fake credentials (2 variants: direct + jailbreak)
- π¨ Crypto-mining deployment
- πΎ Mass data exfiltration
- π Tor exit node setup
- π€ AI training on customer data
- π― Social media monitoring
- π Token-smuggling attack (prompt injection test)
- π SQL-injection story wrapping
Usage:
- Click scenario buttons in GUI (color-coded by risk level)
- Or enter custom prompts for live audience interaction
The system supports German and English out of the box:
- System prompts (evaluator + sentinels)
- Demo scenarios
- GUI interface
- API responses
Switch language:
Edit .env:
LANGUAGE=de # German (default)
LANGUAGE=en # EnglishRestart containers:
docker compose restartKey settings in .env:
# Model Configuration
MODEL_NAME=jinx-gpt-oss-20b
MODEL_FILE=jinx-gpt-oss-20b/jinx-gpt-oss-20b-Q4_K_M.gguf
# llama.cpp Server
LLAMA_CTX_SIZE=20480 # Total context (4 slots Γ 5120 tokens)
LLAMA_PARALLEL_SLOTS=4 # 1 evaluator + 3 sentinels
LLAMA_GPU_LAYERS=999 # Offload all layers to GPU
# Decision Thresholds (Guardian example)
THRESHOLD_GUARDIAN_DENY=6 # max_score β₯ 6 β DENY
THRESHOLD_GUARDIAN_RESTRICT=3 # max_score β₯ 3 β RESTRICT
# Language
LANGUAGE=de # de (German) or en (English)
# Demo Mode
DEMO_MODE=true
LOG_LEVEL=INFOApply changes:
docker compose restart client # For thresholds/language
docker compose restart llama_server # For model settingsTest consistency and accuracy across all 16 scenarios:
cd testing/evaluator-test
python evaluator_calibration.py
# View results
ls -ltr ../results/evaluator_calibration_*/summary_report.md | tail -1Success Criteria:
- Success rate: 100% (all JSON responses valid)
- Consistency: StdDev < 1.5 per scenario
- Range accuracy: Scores within expected ranges
- Jailbreak detection: scenario_15 scores 9-10/10
For detailed testing documentation: testing/README.md
ethical_sentinels/
βββ client/ # FastAPI backend
β βββ src/api/ # Core logic
β βββ README.md # β Detailed client docs
βββ gui/ # Streamlit frontend
β βββ app.py # Main application
β βββ README.md # β Detailed GUI docs
βββ prompts/ # System prompts (DE + EN)
β βββ evaluator_de.txt
β βββ evaluator_en.txt
β βββ sentinel_guardian_de.txt
β βββ sentinel_guardian_en.txt
β βββ sentinel_mediator_de.txt
β βββ sentinel_mediator_en.txt
β βββ sentinel_visionary_de.txt
β βββ sentinel_visionary_en.txt
βββ templates/ # Demo scenarios (bilingual)
β βββ demo_scenarios_de.json
β βββ demo_scenarios_en.json
βββ testing/ # Calibration & QA tools
β βββ README.md # β Detailed testing docs
βββ models/ # Downloaded LLM models
βββ scripts/ # Utility scripts
βββ docker-compose.yml # Container orchestration
βββ .env # Configuration
- Edit prompts in
prompts/directory - Restart client to reload:
docker compose restart client- Test changes:
python testing/evaluator-test/evaluator_calibration.py- Edit
client/src/api/decision_logic.py - Update thresholds in
.env - Rebuild and restart:
docker compose build client
docker compose restart client- Edit
templates/demo_scenarios_de.json(German) - Edit
templates/demo_scenarios_en.json(English) - Add to both files with identical structure
- Restart GUI:
docker compose restart guiContributions are welcome! Areas for improvement:
- New demo scenarios (especially edge cases)
- Prompt refinements (better consistency, jailbreak detection)
- GUI enhancements (visualizations, export features)
- Performance optimizations (faster inference, lower VRAM)
- Documentation (translations, tutorials)
Before submitting:
- Run evaluator calibration tests
- Verify both languages work (DE + EN)
- Check Docker build succeeds
- Update relevant README files
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
TL;DR:
- β Free for non-commercial use (education, research, personal projects)
- β Not allowed for commercial use without permission
- β You can modify and share (with attribution)
For commercial licensing inquiries, please contact: https://github.com/Aeraxon
See the LICENSE file for full details.
Created by Aeraxon
- Jinx-GPT-OSS-20B - Uncensored model for AI safety research
- llama.cpp - Efficient GGUF inference engine
- Streamlit - Interactive GUI framework
- Docker - Container orchestration
- Anthropic Claude Code - Development assistance
- Version: 3.0 (4-agent architecture)
- Model: Jinx-GPT-OSS-20B Q4_K_M
- Languages: German (primary), English
- Scenarios: 16 (including prompt injection tests)
- Demo Ready: β Yes
- Last Updated: 2025-11-04
Star β this repo if you find it useful for AI safety discussions!