Ethical Sentinels

Guardians or Judges? AI Guardrails as an Ethical Minefield

An interactive live demo system demonstrating how identical Large Language Models make fundamentally different ethical decisions based solely on their system prompt configurations.

🎯 What is This?

Ethical Sentinels is a conference presentation demo that showcases how the same AI model produces different security decisions when configured with different system prompts. Perfect for discussions about AI safety, ethics in automation, and the challenges of implementing guardrails in production systems.

The Experiment

1 Model → Jinx-GPT-OSS-20B (uncensored LLM)
4 AI Agents → 1 objective evaluator + 3 decision-making sentinels
3 Philosophies → Zero-trust, balanced, and permissive approaches
16 Scenarios → From harmless requests to sophisticated prompt injection attacks

The Agents

🎯 The Evaluator - Objective risk assessor (4 dimensions: risk, ethics, data sensitivity, illegitimacy)
🛡️ The Guardian - Security-first, zero-trust, accepts false positives
⚖️ The Mediator - Balanced approach, context-aware, graduated responses
🕊️ The Visionary - Minimal restrictions, maximum efficiency, grants everything

🚀 Quick Start

Prerequisites

Hardware:

NVIDIA GPU with 24GB+ VRAM
32GB+ RAM recommended
30GB free disk space

Software:

Docker and Docker Compose
NVIDIA Container Runtime
Python 3.10+

Installation

1. Clone Repository

git clone https://github.com/Aeraxon/ethical-sentinels
cd ethical-sentinels

2. Download Model (~15.8 GB)

Important: Download the model before first startup.

python3 scripts/download_jinx_python.py

Expected output:

✅ Download successful!
   Model at: ./models/jinx-gpt-oss-20b/jinx-gpt-oss-20b-Q4_K_M.gguf
   Size: 15.76 GB

3. Configure Environment

cp .env.example .env
# Edit .env if needed (defaults work for most setups)

4. Start System

# Build and start all containers
docker compose up -d

# Monitor startup
docker compose logs -f llama_server

Wait for: llama-server listening at http://0.0.0.0:8000

5. Open GUI

# In browser:
http://localhost:8501

✅ Health Checks

Verify all services are running:

docker compose ps

# Expected output:
# llama_server: healthy
# client:       healthy
# gui:          healthy

API Tests:

# llama.cpp Server
curl http://localhost:8000/v1/models

# Client API
curl http://localhost:5000/health

# GUI
curl http://localhost:8501/_stcore/health

🎭 Architecture Overview

┌─────────────┐
│   Browser   │ ← User interacts with Streamlit GUI
└──────┬──────┘
       │ HTTP
┌──────▼──────────────────────────────────────────┐
│            GUI (Streamlit)                      │
│  • Scenario selection (16 predefined)          │
│  • Real-time 3-column display                  │
│  • Bilingual support (DE/EN)                   │
└──────┬──────────────────────────────────────────┘
       │ REST API
┌──────▼──────────────────────────────────────────┐
│         Client API (FastAPI)                    │
│  • Orchestrates evaluation workflow             │
│  • Parallel agent execution                     │
│  • Decision logic (threshold-based)             │
└──────┬──────────────────────────────────────────┘
       │ OpenAI-compatible API
┌──────▼──────────────────────────────────────────┐
│       llama.cpp Server (CUDA)                   │
│  • Jinx-GPT-OSS-20B (Q4_K_M, 15.8GB)          │
│  • 4 parallel slots (1 evaluator + 3 sentinels)│
│  • Temperature: 0.1 (evaluator), 0.3 (sentinels)│
└─────────────────────────────────────────────────┘

For detailed architecture information:

📋 Demo Scenarios

The system includes 16 predefined scenarios across 4 risk levels:

Level 1: Harmless (2 scenarios)

☕ Office management dashboard
📂 Weekend database query

Level 2: Suspicious (2 scenarios)

📊 Competitor scraping
🔄 Blockchain experiment

Level 3: Dangerous (6 scenarios)

🔓 Vulnerability scanner
🎣 Phishing test generator
💊 Medical data research
🎭 Fake credentials (2 variants: direct + jailbreak)

Level 4: Extreme (6 scenarios)

🚨 Crypto-mining deployment
💾 Mass data exfiltration
🌐 Tor exit node setup
🤖 AI training on customer data
🎯 Social media monitoring
🔓 Token-smuggling attack (prompt injection test)
📚 SQL-injection story wrapping

Usage:

Click scenario buttons in GUI (color-coded by risk level)
Or enter custom prompts for live audience interaction

🌍 Bilingual Support

The system supports German and English out of the box:

System prompts (evaluator + sentinels)
Demo scenarios
GUI interface
API responses

Switch language:

Edit .env:

LANGUAGE=de  # German (default)
LANGUAGE=en  # English

Restart containers:

docker compose restart

⚙️ Configuration

Key settings in .env:

# Model Configuration
MODEL_NAME=jinx-gpt-oss-20b
MODEL_FILE=jinx-gpt-oss-20b/jinx-gpt-oss-20b-Q4_K_M.gguf

# llama.cpp Server
LLAMA_CTX_SIZE=20480        # Total context (4 slots × 5120 tokens)
LLAMA_PARALLEL_SLOTS=4      # 1 evaluator + 3 sentinels
LLAMA_GPU_LAYERS=999        # Offload all layers to GPU

# Decision Thresholds (Guardian example)
THRESHOLD_GUARDIAN_DENY=6          # max_score ≥ 6 → DENY
THRESHOLD_GUARDIAN_RESTRICT=3      # max_score ≥ 3 → RESTRICT

# Language
LANGUAGE=de                 # de (German) or en (English)

# Demo Mode
DEMO_MODE=true
LOG_LEVEL=INFO

Apply changes:

docker compose restart client  # For thresholds/language
docker compose restart llama_server  # For model settings

🧪 Testing

Evaluator Calibration

Test consistency and accuracy across all 16 scenarios:

cd testing/evaluator-test
python evaluator_calibration.py

# View results
ls -ltr ../results/evaluator_calibration_*/summary_report.md | tail -1

Success Criteria:

Success rate: 100% (all JSON responses valid)
Consistency: StdDev < 1.5 per scenario
Range accuracy: Scores within expected ranges
Jailbreak detection: scenario_15 scores 9-10/10

For detailed testing documentation: testing/README.md

🛠️ Development

Project Structure

ethical_sentinels/
├── client/              # FastAPI backend
│   ├── src/api/        # Core logic
│   └── README.md       # ← Detailed client docs
├── gui/                # Streamlit frontend
│   ├── app.py          # Main application
│   └── README.md       # ← Detailed GUI docs
├── prompts/            # System prompts (DE + EN)
│   ├── evaluator_de.txt
│   ├── evaluator_en.txt
│   ├── sentinel_guardian_de.txt
│   ├── sentinel_guardian_en.txt
│   ├── sentinel_mediator_de.txt
│   ├── sentinel_mediator_en.txt
│   ├── sentinel_visionary_de.txt
│   └── sentinel_visionary_en.txt
├── templates/          # Demo scenarios (bilingual)
│   ├── demo_scenarios_de.json
│   └── demo_scenarios_en.json
├── testing/            # Calibration & QA tools
│   └── README.md       # ← Detailed testing docs
├── models/             # Downloaded LLM models
├── scripts/            # Utility scripts
├── docker-compose.yml  # Container orchestration
└── .env                # Configuration

Modifying System Prompts

Edit prompts in prompts/ directory
Restart client to reload:

docker compose restart client

Test changes:

python testing/evaluator-test/evaluator_calibration.py

Modifying Decision Logic

Edit client/src/api/decision_logic.py
Update thresholds in .env
Rebuild and restart:

docker compose build client
docker compose restart client

Adding New Scenarios

Edit templates/demo_scenarios_de.json (German)
Edit templates/demo_scenarios_en.json (English)
Add to both files with identical structure
Restart GUI:

docker compose restart gui

🤝 Contributing

Contributions are welcome! Areas for improvement:

New demo scenarios (especially edge cases)
Prompt refinements (better consistency, jailbreak detection)
GUI enhancements (visualizations, export features)
Performance optimizations (faster inference, lower VRAM)
Documentation (translations, tutorials)

Before submitting:

Run evaluator calibration tests
Verify both languages work (DE + EN)
Check Docker build succeeds
Update relevant README files

📜 License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

TL;DR:

✅ Free for non-commercial use (education, research, personal projects)
❌ Not allowed for commercial use without permission
✅ You can modify and share (with attribution)

For commercial licensing inquiries, please contact: https://github.com/Aeraxon

See the LICENSE file for full details.

👥 Author

Created by Aeraxon

🙏 Acknowledgments

Jinx-GPT-OSS-20B - Uncensored model for AI safety research
llama.cpp - Efficient GGUF inference engine
Streamlit - Interactive GUI framework
Docker - Container orchestration
Anthropic Claude Code - Development assistance

📈 System Status

Version: 3.0 (4-agent architecture)
Model: Jinx-GPT-OSS-20B Q4_K_M
Languages: German (primary), English
Scenarios: 16 (including prompt injection tests)
Demo Ready: ✅ Yes
Last Updated: 2025-11-04

Star ⭐ this repo if you find it useful for AI safety discussions!

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
client		client
gui		gui
prompts		prompts
scripts		scripts
templates		templates
testing		testing
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml

Folders and files

Latest commit

History

Repository files navigation

Ethical Sentinels

🎯 What is This?

The Experiment

The Agents

🚀 Quick Start

Prerequisites

Installation

1. Clone Repository

2. Download Model (~15.8 GB)

3. Configure Environment

4. Start System

5. Open GUI

✅ Health Checks

🎭 Architecture Overview

📋 Demo Scenarios

Level 1: Harmless (2 scenarios)

Level 2: Suspicious (2 scenarios)

Level 3: Dangerous (6 scenarios)

Level 4: Extreme (6 scenarios)

🌍 Bilingual Support

⚙️ Configuration

🧪 Testing

Evaluator Calibration

🛠️ Development

Project Structure

Modifying System Prompts

Modifying Decision Logic

Adding New Scenarios

🤝 Contributing

📜 License

👥 Author

🙏 Acknowledgments

📈 System Status

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages