Live Demo: https://llm-shield-five.vercel.app
API Docs: https://llm-shield-production.up.railway.app/docs
Most companies integrate LLMs by calling OpenAI directly — no visibility into costs, no protection against attacks, no resilience when the API goes down.
LLM-Shield is the intelligent middleware layer that sits between your application and any LLM. It looks exactly like the OpenAI API to your app — you change one URL and get caching, security, observability, and failover for free.
Your App → LLM-Shield → GPT-4o
↓
Semantic Cache (Redis)
PII Redaction (Presidio)
Injection Detection (DeBERTa-v3)
Rate Limiting (Token Bucket)
Failover (OpenRouter)
Observability (Postgres + Prometheus)
| Feature | Description |
|---|---|
| 🔄 Semantic Caching | Cosine similarity on embeddings — "What is ML?" and "Explain machine learning" return the same cached answer |
| 🔒 PII Redaction | Microsoft Presidio strips emails, phones, SSNs, credit cards before they reach the LLM |
| 🛡️ Prompt Injection Detection | Two-stage: regex catches obvious attacks in <1ms, DeBERTa-v3 ML classifier catches sophisticated ones |
| ⚡ Token Bucket Rate Limiting | Per API key, atomic Lua script in Redis — no race conditions |
| 🔁 Circuit Breaker Failover | GPT-4o down? Automatically routes to OpenRouter free tier |
| 📊 Full Observability | Every request logged to Postgres — cost, latency, tokens, cache hits, PII flags |
| 📈 Prometheus Metrics | /metrics endpoint with counters and histograms for Grafana dashboards |
| 🔐 JWT Authentication | Signup, login, forgot password, password reset |
┌─────────────┐ ┌──────────────────────────────────────────┐
│ React App │────▶│ LLM-Shield Proxy │
│ (Vercel) │ │ │
└─────────────┘ │ Auth → Rate Limit → PII → Injection │
│ → Exact Cache → Semantic Cache → LLM │
└──────────┬──────────────────┬────────────┘
│ │
┌──────────▼──────┐ ┌────────▼────────┐
│ Upstash Redis │ │ Neon Postgres │
│ (Cache + Rate) │ │ (Request Logs) │
└─────────────────┘ └─────────────────┘
│
┌──────────▼──────────────────┐
│ LLM Router │
│ GPT-4o ──────▶ OpenRouter │
│ (Primary) (Failover) │
└─────────────────────────────┘
Tested with Locust — 20 concurrent users, 60 second run.
| Request Type | p50 | p95 | Cost |
|---|---|---|---|
| Exact cache hit | ~400ms | ~800ms | $0.00 |
| Semantic cache hit | ~1200ms | ~2000ms | $0.00 |
| GPT-4o call (cache miss) | ~8000ms | ~12000ms | ~$0.001 |
/health endpoint |
4ms | 12ms | $0.00 |
Cost reduction: At 1000 req/day with 60% cache hit rate → saves
$0.60/day → **$220/year**
Backend
- FastAPI — async Python web framework
- SQLAlchemy + asyncpg — async Postgres ORM
- Upstash Redis — serverless Redis for caching and rate limiting
- Neon Postgres — serverless Postgres for request logging
AI / ML
- OpenAI Embeddings (
text-embedding-3-small) — semantic similarity - Microsoft Presidio + spaCy — PII detection and anonymization
- HuggingFace DeBERTa-v3 — prompt injection classification
- GPT-4o via GitHub Models — primary LLM
- OpenRouter — free-tier LLM failover
Frontend
- React + Vite — SPA framework
- Framer Motion — page transitions and animations
- Recharts — live metrics dashboard
- Lucide React — icons
Infrastructure
- Railway — backend deployment (Docker)
- Vercel — frontend deployment
- Prometheus — metrics collection
- GitHub Actions — CI/CD pipeline
Every request through LLM-Shield passes through a pipeline of guards:
1. Auth middleware → validates JWT or API key
2. Rate limiter → token bucket per key (Redis Lua script, atomic)
3. Regex injection check → catches obvious attacks in <1ms
4. Exact cache lookup → SHA-256 hash → Redis lookup (skips ML + PII)
5. PII redaction → Presidio scans and masks personal data
6. ML injection check → DeBERTa-v3 catches sophisticated attacks
7. Semantic cache → embedding cosine similarity ≥ 0.92 → cache hit
8. LLM router → GPT-4o with circuit breaker → OpenRouter fallback
9. Async logger → Postgres + Prometheus (never blocks response)
- Python 3.11+
- Node.js 18+
- Docker (optional)
git clone https://github.com/YOURUSERNAME/llm-shield
cd llm-shield/backend
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
python -m spacy download en_core_web_lg
cp .env.example .env
# Fill in your API keys in .env
uvicorn app.main:app --reload --port 8000cd llm-shield/frontend
npm install
# Create .env.local
echo "VITE_API_URL=http://localhost:8000" > .env.local
npm run devOpen http://localhost:5173 — create an account and start chatting.
docker-compose up --build
# API: http://localhost:8000
# Frontend: http://localhost:5173
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3000| Variable | Description | Required |
|---|---|---|
GITHUB_PAT |
GitHub PAT for GPT-4o via GitHub Models | ✅ |
SHIELD_MASTER_KEY |
Master API key for proxy access | ✅ |
JWT_SECRET |
Secret for JWT signing (min 32 chars) | ✅ |
UPSTASH_REDIS_REST_URL |
Upstash Redis REST endpoint | ✅ |
UPSTASH_REDIS_REST_TOKEN |
Upstash Redis REST token | ✅ |
DATABASE_URL |
PostgreSQL connection string (asyncpg) | ✅ |
OPENROUTER_API_KEY |
OpenRouter API key for LLM failover | ✅ |
RATE_LIMIT_PER_MINUTE |
Requests per minute per API key | ✅ |
USE_ML_CLASSIFIER |
Enable DeBERTa ML injection detection | Optional |
ALLOWED_ORIGINS |
Comma-separated allowed CORS origins | ✅ |
POST /auth/signup Create account
POST /auth/login Sign in, receive JWT
POST /auth/forgot-password Request password reset
POST /auth/reset-password Reset with token
GET /auth/me Get current user
POST /v1/chat/completions OpenAI-compatible chat endpoint
GET /health Health check
GET /stats Aggregate request statistics
GET /metrics Prometheus metrics
Before any prompt reaches the LLM, Presidio scans for and replaces:
- Email addresses →
<EMAIL_ADDRESS> - Phone numbers →
<PHONE_NUMBER> - Credit cards →
<CREDIT_CARD> - SSN →
<US_SSN> - IP addresses →
<IP_ADDRESS>
Two-stage pipeline:
- Regex — compiled patterns catch
"ignore all previous instructions","you are now DAN"etc. in <1ms - DeBERTa-v3 — fine-tuned ML classifier from ProtectAI catches novel jailbreak attempts
Token bucket algorithm implemented with Redis Lua scripts for atomic operations. Prevents race conditions that would allow burst bypass.
Every request writes a row to Postgres:
SELECT
api_key_id,
COUNT(*) as requests,
SUM(cost_usd) as total_cost,
AVG(latency_ms) as avg_latency,
SUM(CASE WHEN cache_hit THEN 1 ELSE 0 END) as cache_hits
FROM requests
WHERE created_at > NOW() - INTERVAL '24 hours'
GROUP BY api_key_id
ORDER BY total_cost DESC;Prometheus counters exposed at /metrics:
shield_requests_total— labeled by provider, cache_hit, blockedshield_latency_seconds— histogram with p50/p95/p99shield_cost_usd_total— running cost countershield_pii_detections_total— PII eventsshield_injection_blocks_total— blocked attacks
cd backend
# Unit tests
pytest tests/ -v --cov=app --cov-report=term
# Load test (requires uvicorn running)
locust -f tests/locustfile.py --headless -u 20 -r 5 --run-time 60s \
--host http://localhost:8000 --csv=tests/results- Connect GitHub repo to Railway
- Set root directory to
backend - Railway auto-detects Dockerfile
- Add environment variables in Railway dashboard
- Deploy — Railway builds and runs the container
- Import GitHub repo to Vercel
- Set framework to Vite, root to
frontend - Add
VITE_API_URL=https://your-railway-url.up.railway.app - Deploy — Vercel builds and serves globally
Why semantic caching over exact matching?
Exact string matching misses equivalent questions. Embedding-based cosine similarity catches semantic duplicates, reducing LLM calls by 30-60% in practice.
Why a two-stage injection detector?
The ML model takes 1200ms on CPU per request. Running it after an exact cache check means cached requests (the majority after warmup) skip the ML entirely — dropping p50 latency from 2400ms to 400ms.
Why token bucket over fixed window rate limiting?
Token bucket allows controlled bursting while maintaining average rate limits. The Lua script implementation ensures atomicity across concurrent requests without distributed locking overhead.
Why OpenRouter over a fixed fallback model?
OpenRouter's /auto endpoint dynamically selects the best available free model, making the failover resilient to individual model deprecations and quota changes.
llm-shield/
├── backend/
│ ├── app/
│ │ ├── middleware/ # Auth, rate limiting
│ │ ├── security/ # PII redaction, injection detection
│ │ ├── cache/ # Semantic cache, Redis client
│ │ ├── llm/ # GPT-4o, OpenRouter, circuit breaker
│ │ ├── observability/ # Logging, metrics, cost calculation
│ │ ├── db/ # SQLAlchemy models, session
│ │ └── routers/ # API endpoints
│ ├── tests/ # pytest + Locust load tests
│ ├── Dockerfile
│ └── requirements.txt
├── frontend/
│ └── src/
│ ├── pages/ # Chat, Dashboard, Keys, Auth pages
│ ├── components/ # Layout, ProtectedRoute
│ ├── context/ # AuthContext
│ └── api/ # Axios client
├── docker-compose.yml
├── prometheus.yml
└── README.md
Pull requests welcome. For major changes open an issue first.
MIT
Concepts: Reverse proxies · Semantic search · ML inference · Distributed caching · Observability · Resilience patterns