YASHcode-IIITV/SENTINAL-OPS

🛡️ Sentinel-Ops AI

Autonomous Self-Healing AI Infrastructure Platform
Detects LLM/API outages and automatically reroutes inference traffic to backup providers in real time.


Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                         Sentinel-Ops AI                         │
│                                                                 │
│  ┌──────────────┐    ┌──────────────────┐    ┌───────────────┐  │
│  │  FastAPI     │    │  Failover        │    │  Health       │  │
│  │  Gateway     │───▶│  Engine          │───▶│  Monitor      │  │
│  │              │    │  (Circuit Break) │    │  (Background) │  │
│  └──────┬───────┘    └────────┬─────────┘    └───────┬───────┘  │
│         │                     │                      │          │
│         │            ┌────────▼─────────┐            │          │
│         │            │  Provider        │            │          │
│         │            │  Registry        │            │          │
│         │            │                  │            │          │
│         │            │  ┌────────────┐  │            │          │
│         │            │  │  OpenAI    │  │◀───────────┘          │
│         │            │  │ (Primary)  │  │                       │
│         │            │  └────────────┘  │                       │
│         │            │  ┌────────────┐  │                       │
│         │            │  │  Ollama    │  │                       │
│         │            │  │ (Fallback) │  │                       │
│         │            │  └────────────┘  │                       │
│         │            └──────────────────┘                       │
│         │                                                       │
│  ┌──────▼───────┐    ┌──────────────────┐                       │
│  │  WebSocket   │    │  Redis           │                       │
│  │  Event Bus   │    │  (Incidents +    │                       │
│  │              │    │   Event Cache)   │                       │
│  └──────────────┘    └──────────────────┘                       │
└─────────────────────────────────────────────────────────────────┘

Key Design Patterns

| Pattern | Implementation |
|---|---|
| Circuit Breaker | 3-state (CLOSED/OPEN/HALF_OPEN) per provider |
| Failover Chain | OpenAI → Ollama → (extensible) |
| Retry Strategy | Exponential back-off with jitter |
| Event Streaming | WebSocket fan-out via async broadcast |
| Incident Storage | Redis list (ring-buffer, 500 events max) |
| Observability | Structured JSON logs + per-provider metrics |
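
To make the circuit-breaker row concrete, here is a minimal sketch of a 3-state breaker. It is an illustration only, not the code in app/core/circuit_breaker.py: the class name, method names, and the injectable `clock` argument are assumptions, while the defaults (3 failures, 30 seconds) are taken from the Environment Variables section.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """Illustrative 3-state breaker (hypothetical API, one instance per provider)."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock  # injectable so the sketch can be tested without waiting
        self._failures = 0
        self._opened_at = 0.0
        self._state = State.CLOSED

    @property
    def state(self):
        # An OPEN breaker becomes HALF_OPEN once the recovery timeout has elapsed.
        if self._state is State.OPEN and self._clock() - self._opened_at >= self.recovery_timeout:
            self._state = State.HALF_OPEN
        return self._state

    def allow_request(self):
        # CLOSED and HALF_OPEN let traffic through; OPEN rejects it.
        return self.state is not State.OPEN

    def record_success(self):
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._state = State.CLOSED

    def record_failure(self):
        # A failed HALF_OPEN probe reopens at once; otherwise count toward the threshold.
        if self.state is State.HALF_OPEN:
            self._trip()
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self._state = State.OPEN
        self._opened_at = self._clock()
        self._failures = 0
```

A caller would check `allow_request()` before contacting a provider, then report the outcome with `record_success()` or `record_failure()`.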

Folder Structure

backend/
├── app/
│   ├── api/
│   │   ├── chat.py             # POST /api/chat
│   │   ├── providers.py        # GET /api/providers/status
│   │   ├── incidents.py        # GET /api/incidents
│   │   ├── metrics.py          # GET /api/metrics
│   │   ├── health.py           # GET /health
│   │   └── middleware.py       # Tracing, rate limiting, security headers
│   ├── core/
│   │   ├── config.py           # Pydantic settings (env vars)
│   │   ├── logging.py          # Structlog JSON logger
│   │   ├── redis.py            # Async Redis pool + helpers
│   │   └── circuit_breaker.py  # 3-state circuit breaker
│   ├── providers/
│   │   ├── base.py             # Abstract BaseProvider interface
│   │   ├── openai_provider.py  # OpenAI implementation
│   │   ├── ollama_provider.py  # Ollama local implementation
│   │   └── registry.py         # Provider registry + chain
│   ├── services/
│   │   ├── failover_engine.py  # Core routing + failover logic
│   │   ├── incident_service.py # Incident persistence + broadcast
│   │   └── metrics_service.py  # Rolling metrics aggregation
│   ├── monitoring/
│   │   └── health_monitor.py   # Async background health checker
│   ├── websocket/
│   │   ├── manager.py          # Connection pool + broadcast
│   │   └── router.py           # WS /ws/system-events endpoint
│   ├── models/
│   │   └── schemas.py          # All Pydantic v2 domain models
│   └── app_factory.py          # FastAPI app factory + lifespan
├── tests/
│   └── test_sentinel.py        # Unit + integration tests
├── main.py                     # Uvicorn entrypoint
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
└── .env.example

Quick Start

Option A: Docker Compose (recommended)

# 1. Enter the project directory (after cloning the repository)
cd sentinel-ops/backend

# 2. Configure environment
cp .env.example .env
# Edit .env and set OPENAI_API_KEY at minimum

# 3. Pull the fallback model (Ollama must already be running locally)
ollama pull llama3.2

# 4. Launch
docker compose up --build

API is live at http://localhost:8000
Docs at http://localhost:8000/docs


Option B: Local Development

# Prerequisites: Python 3.12+, Redis, Ollama

# 1. Install dependencies
pip install -r requirements.txt

# 2. Configure
cp .env.example .env
# Set OPENAI_API_KEY, APP_ENV=development

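# 3. Start Redis (runs in the foreground; use its own terminal or a background service)
redis-server

# 4. Start Ollama (ollama serve also blocks, so run the pull from a second terminal)
ollama serve
ollama pull llama3.2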

# 5. Run
python main.py
# or
uvicorn main:app --host 0.0.0.0 --port 8000 --reload

API Reference

POST /api/chat

Route a prompt through the AI failover engine.

curl -X POST http://localhost:8000/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Explain circuit breakers in distributed systems."}
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'

Response:

{
  "trace_id": "a1b2c3d4-...",
  "provider_used": "openai",
  "model": "gpt-4o-mini",
  "response_text": "A circuit breaker is ...",
  "latency_ms": 843.2,
  "status": "success",
  "failover_occurred": false,
  "failover_chain": [],
  "tokens_used": 187
}

When OpenAI is down, the request is served by the fallback and the response records the failover (abridged):

{
  "provider_used": "ollama",
  "failover_occurred": true,
  "failover_chain": ["openai"]
}
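
The routing behaviour behind these two responses can be sketched as follows. This is a simplified illustration under stated assumptions, not the contents of app/services/failover_engine.py: the `route` and `backoff_delay` names, the retry count, and the base and cap values are hypothetical, and the circuit-breaker check is left out for brevity. Only the response field names come from the examples above.

```python
import asyncio
import random


def backoff_delay(attempt, base=0.5, cap=8.0):
    """Exponential back-off with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


async def route(providers, prompt, retries=2, sleep=asyncio.sleep):
    """Try each (name, async callable) pair in order and return a response-shaped dict.

    `providers` is an ordered list such as [("openai", call_openai), ("ollama", call_ollama)],
    where each callable is a hypothetical coroutine taking the prompt.
    """
    failed = []  # providers that were tried and given up on, in order
    for name, call in providers:
        for attempt in range(retries + 1):
            try:
                text = await call(prompt)
            except Exception:
                # Wait before retrying the same provider; no wait after the last attempt.
                if attempt < retries:
                    await sleep(backoff_delay(attempt))
                continue
            return {
                "provider_used": name,
                "response_text": text,
                "failover_occurred": bool(failed),
                "failover_chain": failed,
            }
        failed.append(name)  # retries exhausted, move to the next provider
    raise RuntimeError(f"all providers failed: {failed}")
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; a production engine would more likely consult each provider's circuit breaker before attempting the call at all.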

GET /api/providers/status

curl http://localhost:8000/api/providers/status

Response:

{
  "openai": {
    "status": "healthy",
    "latency_ms": 412.3,
    "success_rate_pct": 99.1,
    "circuit_breaker": { "state": "closed" }
  },
  "ollama": {
    "status": "healthy",
    "latency_ms": 1204.7,
    "circuit_breaker": { "state": "closed" }
  }
}

GET /api/incidents

curl "http://localhost:8000/api/incidents?limit=20&type=failover_triggered"
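
The `limit` and `type` parameters read from the capped incident list described under Key Design Patterns. The sketch below models that behaviour in memory; it is an assumption-laden stand-in, not the project's incident service. The `IncidentLog` class, its method names, and the `"type"` key on each incident are hypothetical. With Redis, the equivalent capping is usually done with LPUSH followed by LTRIM on the same key.

```python
from collections import deque
from itertools import islice

MAX_EVENTS = 500  # ring-buffer cap from the Key Design Patterns table


class IncidentLog:
    """In-memory stand-in for the Redis list; newest incidents come first."""

    def __init__(self, max_events=MAX_EVENTS):
        # deque(maxlen=N) silently drops the oldest entry once full, which mirrors
        # LPUSH followed by LTRIM key 0 N-1 on a Redis list.
        self._events = deque(maxlen=max_events)

    def record(self, incident):
        # Push to the left so index 0 is always the most recent incident.
        self._events.appendleft(incident)

    def query(self, limit=20, incident_type=None):
        # Filter by type (if given), then take at most `limit` of the newest matches.
        matches = (
            e for e in self._events
            if incident_type is None or e.get("type") == incident_type
        )
        return list(islice(matches, limit))
```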

GET /api/metrics

curl http://localhost:8000/api/metrics

Response:

{
  "total_requests": 1042,
  "total_successes": 1038,
  "total_failures": 4,
  "total_failovers": 2,
  "avg_latency_ms": 523.1,
  "active_provider": "openai",
  "uptime_seconds": 3601.0
}

POST /api/providers/{name}/probe

Trigger an on-demand health check:

curl -X POST http://localhost:8000/api/providers/openai/probe

WS /ws/system-events

Connect from any WebSocket client:

const ws = new WebSocket("ws://localhost:8000/ws/system-events");

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log(data.event_type, data.payload);
};

Event types:

  • provider_status: health update for one provider
  • incident: new incident recorded (outage, failover, recovery)
  • metrics: aggregated system metrics (every health-check cycle)
  • heartbeat: keep-alive ping every 30s
  • system: connection lifecycle messages
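
A consumer typically routes each message by its event type. The following is a small, transport-agnostic sketch of that dispatch step in Python; it assumes only the `event_type` and `payload` fields shown in the client example above, and the `make_dispatcher` helper is hypothetical rather than part of the project.

```python
import json


def make_dispatcher(handlers, default=None):
    """Build a function that routes one raw JSON message to a handler by event_type."""

    def dispatch(raw_message):
        data = json.loads(raw_message)
        # Unknown event types fall through to `default` (or are ignored if it is None).
        handler = handlers.get(data.get("event_type"), default)
        if handler is None:
            return None
        return handler(data.get("payload"))

    return dispatch


# Example wiring: collect incidents, ignore heartbeats.
incidents = []
dispatch = make_dispatcher({
    "incident": incidents.append,
    "heartbeat": lambda payload: None,
})
```

Each text frame received from the WebSocket would be passed to `dispatch` as-is.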

Testing

# Run all tests
pytest tests/ -v

# With coverage
pytest tests/ --cov=app --cov-report=term-missing

Adding a New Provider

  1. Create app/providers/gemini_provider.py extending BaseProvider
  2. Implement complete() and health_check()
  3. Register it in app/providers/registry.py:
from app.providers.gemini_provider import GeminiProvider
registry.register(GeminiProvider(), position=2)

The failover engine will automatically include it in the chain.
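
As a rough picture of what these steps involve, the sketch below defines a stand-in `BaseProvider`, a stub `GeminiProvider`, and a tiny registry. Everything here is an assumption for illustration: the real interface in app/providers/base.py may use different signatures and return types, the stub makes no network call, and `ProviderRegistry` only mimics the `register(provider, position=...)` call shown above.

```python
import asyncio
from abc import ABC, abstractmethod


class BaseProvider(ABC):
    """Hypothetical shape of the provider interface."""

    name = "base"

    @abstractmethod
    async def complete(self, messages, max_tokens=512, temperature=0.7):
        """Return the completion text for a list of chat messages."""

    @abstractmethod
    async def health_check(self):
        """Return True when the provider is reachable."""


class GeminiProvider(BaseProvider):
    name = "gemini"

    async def complete(self, messages, max_tokens=512, temperature=0.7):
        # Placeholder: a real provider would call its HTTP API here.
        return f"[gemini stub] {messages[-1]['content']}"

    async def health_check(self):
        # Placeholder: a real check would issue a cheap request with a short timeout.
        return True


class ProviderRegistry:
    """Ordered failover chain; a lower position is tried earlier."""

    def __init__(self):
        self._chain = []

    def register(self, provider, position=None):
        # position is an index into the chain; None appends to the end.
        if position is None:
            self._chain.append(provider)
        else:
            self._chain.insert(position, provider)

    def chain(self):
        return [p.name for p in self._chain]
```

With OpenAI at position 0 and Ollama at position 1, registering at `position=2` places the new provider last in the chain.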


Environment Variables

| Variable | Default | Description |
|---|---|---|
| OPENAI_API_KEY | (required) | OpenAI API key |
| OPENAI_MODEL | gpt-4o-mini | Model to use |
| OLLAMA_BASE_URL | http://localhost:11434 | Ollama server URL |
| OLLAMA_MODEL | llama3.2 | Local model name |
| REDIS_URL | redis://localhost:6379/0 | Redis connection string |
| CIRCUIT_BREAKER_FAILURE_THRESHOLD | 3 | Failures before circuit opens |
| CIRCUIT_BREAKER_RECOVERY_TIMEOUT | 30 | Seconds before half-open probe |
| HEALTH_CHECK_INTERVAL_SECONDS | 15 | Background monitoring frequency |
| RATE_LIMIT_REQUESTS | 100 | Max requests per window |
| RATE_LIMIT_WINDOW_SECONDS | 60 | Rate limit sliding window |
| APP_ENV | production | One of development / staging / production |
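
For readers who want to see how these variables and defaults fit together, here is a standard-library sketch that loads a subset of them. The project itself uses Pydantic settings in app/core/config.py; the `Settings` dataclass and `load_settings` function below are hypothetical stand-ins, and only the variable names and default values come from the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    openai_model: str
    ollama_base_url: str
    redis_url: str
    circuit_breaker_failure_threshold: int
    circuit_breaker_recovery_timeout: int
    health_check_interval_seconds: int
    app_env: str


def load_settings(env=None):
    """Read settings from a mapping (defaults to os.environ), applying the documented defaults."""
    env = os.environ if env is None else env
    if not env.get("OPENAI_API_KEY"):
        # The only variable without a default.
        raise RuntimeError("OPENAI_API_KEY is required")
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        openai_model=env.get("OPENAI_MODEL", "gpt-4o-mini"),
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        circuit_breaker_failure_threshold=int(env.get("CIRCUIT_BREAKER_FAILURE_THRESHOLD", "3")),
        circuit_breaker_recovery_timeout=int(env.get("CIRCUIT_BREAKER_RECOVERY_TIMEOUT", "30")),
        health_check_interval_seconds=int(env.get("HEALTH_CHECK_INTERVAL_SECONDS", "15")),
        app_env=env.get("APP_ENV", "production"),
    )
```

Accepting a plain mapping makes it easy to exercise the defaults in tests without touching the real environment.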

Built With

  • FastAPI: async web framework
  • Pydantic v2: data validation and settings
  • httpx: async HTTP client for provider calls
  • redis-py (async): event storage and pub/sub
  • structlog: structured JSON logging
  • Docker + Compose: containerised deployment
