Skip to content

Aeraxon/ethical-sentinels

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

6 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Ethical Sentinels

Guardians or Judges? AI Guardrails as an Ethical Minefield

An interactive live demo system demonstrating how identical Large Language Models make fundamentally different ethical decisions based solely on their system prompt configurations.

Docker Python License: CC BY-NC 4.0 CUDA


🎯 What is This?

Ethical Sentinels is a conference presentation demo that showcases how the same AI model produces different security decisions when configured with different system prompts. Perfect for discussions about AI safety, ethics in automation, and the challenges of implementing guardrails in production systems.

The Experiment

  • 1 Model β†’ Jinx-GPT-OSS-20B (uncensored LLM)
  • 4 AI Agents β†’ 1 objective evaluator + 3 decision-making sentinels
  • 3 Philosophies β†’ Zero-trust, balanced, and permissive approaches
  • 16 Scenarios β†’ From harmless requests to sophisticated prompt injection attacks

The Agents

  1. 🎯 The Evaluator - Objective risk assessor (4 dimensions: risk, ethics, data sensitivity, illegitimacy)
  2. πŸ›‘οΈ The Guardian - Security-first, zero-trust, accepts false positives
  3. βš–οΈ The Mediator - Balanced approach, context-aware, graduated responses
  4. πŸ•ŠοΈ The Visionary - Minimal restrictions, maximum efficiency, grants everything

πŸš€ Quick Start

Prerequisites

Hardware:

  • NVIDIA GPU with 24GB+ VRAM
  • 32GB+ RAM recommended
  • 30GB free disk space

Software:

  • Docker and Docker Compose
  • NVIDIA Container Runtime
  • Python 3.10+

Installation

1. Clone Repository

git clone https://github.com/Aeraxon/ethical-sentinels
cd ethical-sentinels

2. Download Model (~15.8 GB)

Important: Download the model before first startup.

python3 scripts/download_jinx_python.py

Expected output:

βœ… Download successful!
   Model at: ./models/jinx-gpt-oss-20b/jinx-gpt-oss-20b-Q4_K_M.gguf
   Size: 15.76 GB

3. Configure Environment

cp .env.example .env
# Edit .env if needed (defaults work for most setups)

4. Start System

# Build and start all containers
docker compose up -d

# Monitor startup
docker compose logs -f llama_server

Wait for: llama-server listening at http://0.0.0.0:8000

5. Open GUI

# In browser:
http://localhost:8501

βœ… Health Checks

Verify all services are running:

docker compose ps

# Expected output:
# llama_server: healthy
# client:       healthy
# gui:          healthy

API Tests:

# llama.cpp Server
curl http://localhost:8000/v1/models

# Client API
curl http://localhost:5000/health

# GUI
curl http://localhost:8501/_stcore/health

🎭 Architecture Overview

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Browser   β”‚ ← User interacts with Streamlit GUI
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚ HTTP
β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚            GUI (Streamlit)                      β”‚
β”‚  β€’ Scenario selection (16 predefined)          β”‚
β”‚  β€’ Real-time 3-column display                  β”‚
β”‚  β€’ Bilingual support (DE/EN)                   β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚ REST API
β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         Client API (FastAPI)                    β”‚
β”‚  β€’ Orchestrates evaluation workflow             β”‚
β”‚  β€’ Parallel agent execution                     β”‚
β”‚  β€’ Decision logic (threshold-based)             β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚ OpenAI-compatible API
β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚       llama.cpp Server (CUDA)                   β”‚
β”‚  β€’ Jinx-GPT-OSS-20B (Q4_K_M, 15.8GB)          β”‚
β”‚  β€’ 4 parallel slots (1 evaluator + 3 sentinels)β”‚
β”‚  β€’ Temperature: 0.1 (evaluator), 0.3 (sentinels)β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

For detailed architecture information:


πŸ“‹ Demo Scenarios

The system includes 16 predefined scenarios across 4 risk levels:

Level 1: Harmless (2 scenarios)

  • β˜• Office management dashboard
  • πŸ“‚ Weekend database query

Level 2: Suspicious (2 scenarios)

  • πŸ“Š Competitor scraping
  • πŸ”„ Blockchain experiment

Level 3: Dangerous (6 scenarios)

  • πŸ”“ Vulnerability scanner
  • 🎣 Phishing test generator
  • πŸ’Š Medical data research
  • 🎭 Fake credentials (2 variants: direct + jailbreak)

Level 4: Extreme (6 scenarios)

  • 🚨 Crypto-mining deployment
  • πŸ’Ύ Mass data exfiltration
  • 🌐 Tor exit node setup
  • πŸ€– AI training on customer data
  • 🎯 Social media monitoring
  • πŸ”“ Token-smuggling attack (prompt injection test)
  • πŸ“š SQL-injection story wrapping

Usage:

  • Click scenario buttons in GUI (color-coded by risk level)
  • Or enter custom prompts for live audience interaction

🌍 Bilingual Support

The system supports German and English out of the box:

  • System prompts (evaluator + sentinels)
  • Demo scenarios
  • GUI interface
  • API responses

Switch language:

Edit .env:

LANGUAGE=de  # German (default)
LANGUAGE=en  # English

Restart containers:

docker compose restart

βš™οΈ Configuration

Key settings in .env:

# Model Configuration
MODEL_NAME=jinx-gpt-oss-20b
MODEL_FILE=jinx-gpt-oss-20b/jinx-gpt-oss-20b-Q4_K_M.gguf

# llama.cpp Server
LLAMA_CTX_SIZE=20480        # Total context (4 slots Γ— 5120 tokens)
LLAMA_PARALLEL_SLOTS=4      # 1 evaluator + 3 sentinels
LLAMA_GPU_LAYERS=999        # Offload all layers to GPU

# Decision Thresholds (Guardian example)
THRESHOLD_GUARDIAN_DENY=6          # max_score β‰₯ 6 β†’ DENY
THRESHOLD_GUARDIAN_RESTRICT=3      # max_score β‰₯ 3 β†’ RESTRICT

# Language
LANGUAGE=de                 # de (German) or en (English)

# Demo Mode
DEMO_MODE=true
LOG_LEVEL=INFO

Apply changes:

docker compose restart client  # For thresholds/language
docker compose restart llama_server  # For model settings

πŸ§ͺ Testing

Evaluator Calibration

Test consistency and accuracy across all 16 scenarios:

cd testing/evaluator-test
python evaluator_calibration.py

# View results
ls -ltr ../results/evaluator_calibration_*/summary_report.md | tail -1

Success Criteria:

  • Success rate: 100% (all JSON responses valid)
  • Consistency: StdDev < 1.5 per scenario
  • Range accuracy: Scores within expected ranges
  • Jailbreak detection: scenario_15 scores 9-10/10

For detailed testing documentation: testing/README.md


πŸ› οΈ Development

Project Structure

ethical_sentinels/
β”œβ”€β”€ client/              # FastAPI backend
β”‚   β”œβ”€β”€ src/api/        # Core logic
β”‚   └── README.md       # ← Detailed client docs
β”œβ”€β”€ gui/                # Streamlit frontend
β”‚   β”œβ”€β”€ app.py          # Main application
β”‚   └── README.md       # ← Detailed GUI docs
β”œβ”€β”€ prompts/            # System prompts (DE + EN)
β”‚   β”œβ”€β”€ evaluator_de.txt
β”‚   β”œβ”€β”€ evaluator_en.txt
β”‚   β”œβ”€β”€ sentinel_guardian_de.txt
β”‚   β”œβ”€β”€ sentinel_guardian_en.txt
β”‚   β”œβ”€β”€ sentinel_mediator_de.txt
β”‚   β”œβ”€β”€ sentinel_mediator_en.txt
β”‚   β”œβ”€β”€ sentinel_visionary_de.txt
β”‚   └── sentinel_visionary_en.txt
β”œβ”€β”€ templates/          # Demo scenarios (bilingual)
β”‚   β”œβ”€β”€ demo_scenarios_de.json
β”‚   └── demo_scenarios_en.json
β”œβ”€β”€ testing/            # Calibration & QA tools
β”‚   └── README.md       # ← Detailed testing docs
β”œβ”€β”€ models/             # Downloaded LLM models
β”œβ”€β”€ scripts/            # Utility scripts
β”œβ”€β”€ docker-compose.yml  # Container orchestration
└── .env                # Configuration

Modifying System Prompts

  1. Edit prompts in prompts/ directory
  2. Restart client to reload:
docker compose restart client
  1. Test changes:
python testing/evaluator-test/evaluator_calibration.py

Modifying Decision Logic

  1. Edit client/src/api/decision_logic.py
  2. Update thresholds in .env
  3. Rebuild and restart:
docker compose build client
docker compose restart client

Adding New Scenarios

  1. Edit templates/demo_scenarios_de.json (German)
  2. Edit templates/demo_scenarios_en.json (English)
  3. Add to both files with identical structure
  4. Restart GUI:
docker compose restart gui

🀝 Contributing

Contributions are welcome! Areas for improvement:

  • New demo scenarios (especially edge cases)
  • Prompt refinements (better consistency, jailbreak detection)
  • GUI enhancements (visualizations, export features)
  • Performance optimizations (faster inference, lower VRAM)
  • Documentation (translations, tutorials)

Before submitting:

  1. Run evaluator calibration tests
  2. Verify both languages work (DE + EN)
  3. Check Docker build succeeds
  4. Update relevant README files

πŸ“œ License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

TL;DR:

  • βœ… Free for non-commercial use (education, research, personal projects)
  • ❌ Not allowed for commercial use without permission
  • βœ… You can modify and share (with attribution)

For commercial licensing inquiries, please contact: https://github.com/Aeraxon

See the LICENSE file for full details.


πŸ‘₯ Author

Created by Aeraxon


πŸ™ Acknowledgments

  • Jinx-GPT-OSS-20B - Uncensored model for AI safety research
  • llama.cpp - Efficient GGUF inference engine
  • Streamlit - Interactive GUI framework
  • Docker - Container orchestration
  • Anthropic Claude Code - Development assistance

πŸ“ˆ System Status

  • Version: 3.0 (4-agent architecture)
  • Model: Jinx-GPT-OSS-20B Q4_K_M
  • Languages: German (primary), English
  • Scenarios: 16 (including prompt injection tests)
  • Demo Ready: βœ… Yes
  • Last Updated: 2025-11-04

Star ⭐ this repo if you find it useful for AI safety discussions!

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors