High-performance visual analysis API with NVIDIA Triton Inference Server.
Object detection, face recognition, visual search, OCR, and embeddings - all through a unified REST API with TensorRT acceleration.
git clone https://github.com/davidamacey/OpenProcessor.git && cd OpenProcessor && ./scripts/setup.sh
That's it! The setup script automatically:
- Detects your GPU and selects the optimal profile
- Downloads required models (~500MB)
- Exports models to TensorRT (45-60 min first time)
- Starts all services
# Clone and setup with defaults (no prompts)
git clone https://github.com/davidamacey/OpenProcessor.git && cd OpenProcessor && ./scripts/setup.sh --yes
# Or specify a profile explicitly
./scripts/setup.sh --profile=standard --gpu=0 --yes
curl http://localhost:4603/health
# {"status":"healthy","version":"0.1.0",...}
# Quick test with an image
curl -X POST http://localhost:4603/detect -F "image=@your-image.jpg"
./scripts/openprocessor.sh status # Check service health
./scripts/openprocessor.sh logs -f # View live logs
./scripts/openprocessor.sh restart # Restart all services
./scripts/openprocessor.sh help # See all commands
See INSTALLATION.md for manual installation, troubleshooting, and advanced options.
| Profile | VRAM | Example GPUs | Throughput |
|---|---|---|---|
| minimal | 6-8GB | RTX 3060, RTX 4060 | ~5 RPS |
| standard | 12-24GB | RTX 3080, RTX 4090 | ~15 RPS |
| full | 48GB+ | A6000, A100 | ~50 RPS |
Switch profiles: ./scripts/openprocessor.sh profile <name>
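The setup script selects a profile automatically, but if you want to choose one yourself, a minimal sketch along these lines maps the VRAM reported by nvidia-smi onto the table above. The thresholds are simply read off the table and are not part of the setup script:

```python
import subprocess

def suggest_profile() -> str:
    """Suggest a profile name from the total VRAM reported by nvidia-smi."""
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=memory.total', '--format=csv,noheader,nounits'],
        text=True,
    )
    vram_mb = max(int(line) for line in out.splitlines() if line.strip())
    if vram_mb >= 48_000:
        return 'full'
    if vram_mb >= 12_000:
        return 'standard'
    return 'minimal'

print(suggest_profile())  # e.g. 'standard' on an RTX 3080/4090-class card
```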
Latest:
- One-command setup with automatic GPU detection
- Multi-GPU profile support (minimal/standard/full)
- Automated TensorRT export with progress feedback
- Comprehensive test suite (32 tests, 100% passing)
- Management CLI for service control
All endpoints available on port 4603.
| Endpoint | Method | Description |
|---|---|---|
| /detect | POST | YOLO object detection (single image) |
| /detect/batch | POST | Batch detection (up to 64 images) |
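For the batch endpoint, a minimal `requests` sketch might look like the following. The repeated `images` multipart field name is an assumption, so check the API docs for the exact contract:

```python
import requests

# Hypothetical sketch for /detect/batch: several images in one request.
# The 'images' field name is an assumption, not confirmed by the API docs.
paths = ['cat.jpg', 'dog.jpg', 'street.jpg']
files = [('images', (p, open(p, 'rb'), 'image/jpeg')) for p in paths]
try:
    resp = requests.post('http://localhost:4603/detect/batch', files=files)
    resp.raise_for_status()
    print(resp.json())
finally:
    for _, (_, fh, _) in files:
        fh.close()
```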
| Endpoint | Method | Description |
|---|---|---|
| /faces/detect | POST | Face detection with landmarks (SCRFD) |
| /faces/recognize | POST | Detection + ArcFace 512-dim embeddings |
| /faces/verify | POST | 1:1 face comparison (two images) |
| /faces/search | POST | Find similar faces in index |
| /faces/identify | POST | 1:N face identification |
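A 1:1 verification call takes two images. The sketch below assumes the endpoint accepts them as two multipart fields; the `image1`/`image2` field names are illustrative, not confirmed:

```python
import requests

# Hypothetical sketch for /faces/verify: 1:1 comparison of two photos.
# The 'image1'/'image2' field names are assumptions.
with open('person_a.jpg', 'rb') as a, open('person_b.jpg', 'rb') as b:
    resp = requests.post('http://localhost:4603/faces/verify',
                         files={'image1': a, 'image2': b})
print(resp.json())  # expect a similarity score and/or match decision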
| Endpoint | Method | Description |
|---|---|---|
| /embed/image | POST | MobileCLIP image embedding (512-dim) |
| /embed/text | POST | MobileCLIP text embedding (512-dim) |
| /embed/batch | POST | Batch image embeddings |
| /embed/boxes | POST | Per-box crop embeddings |
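Because image and text embeddings share the same 512-dim MobileCLIP space, they can be compared directly. A sketch; the `{'text': ...}` body for /embed/text is an assumption, so verify the field name against the API docs:

```python
import numpy as np
import requests

# Image embedding (same call as the quick-start example further down).
with open('image.jpg', 'rb') as f:
    img_emb = requests.post('http://localhost:4603/embed/image',
                            files={'image': f}).json()['embedding']

# Text embedding -- the {'text': ...} payload shape is an assumption.
txt_emb = requests.post('http://localhost:4603/embed/text',
                        json={'text': 'a red sports car'}).json()['embedding']

# Cosine similarity; embeddings are reported as L2-normalized,
# so the dot product alone is normally enough.
a, b = np.asarray(img_emb), np.asarray(txt_emb)
print(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
```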
| Endpoint | Method | Description |
|---|---|---|
| /search/image | POST | Image-to-image similarity search |
| /search/text | POST | Text-to-image search |
| /search/face | POST | Face similarity search |
| /search/ocr | POST | Search images by text content |
| /search/object | POST | Object-level search (vehicles, people) |
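Image-to-image search uploads a query image and returns the nearest indexed images. A sketch; passing `top_k` as form data is an assumption and it may instead belong in a JSON body:

```python
import requests

# Hypothetical sketch for /search/image: find visually similar indexed images.
# Passing top_k as form data is an assumption.
with open('query.jpg', 'rb') as f:
    resp = requests.post('http://localhost:4603/search/image',
                         files={'image': f}, data={'top_k': 10})
for hit in resp.json().get('results', []):
    print(hit['image_id'], hit['score'])
```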
| Endpoint | Method | Description |
|---|---|---|
| /ingest | POST | Ingest image (auto-indexes faces, OCR, objects) |
| /ingest/batch | POST | Batch ingest (up to 64 images) |
| /ingest/directory | POST | Bulk ingest from server directory |
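Batch ingest follows the same pattern as single ingest but with multiple files per request. A sketch; the repeated `images` field name is an assumption:

```python
import requests

# Hypothetical sketch for /ingest/batch (up to 64 images per request).
# The repeated 'images' field name is an assumption.
paths = ['photo_001.jpg', 'photo_002.jpg']
files = [('images', (p, open(p, 'rb'), 'image/jpeg')) for p in paths]
try:
    resp = requests.post('http://localhost:4603/ingest/batch', files=files)
    print(resp.json())
finally:
    for _, (_, fh, _) in files:
        fh.close()
```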
| Endpoint | Method | Description |
|---|---|---|
| /ocr/predict | POST | Extract text from image (PP-OCRv5) |
| /ocr/batch | POST | Batch OCR processing |
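Building on the /ocr/predict response shown in the Python examples below, a small sketch that joins the recognized strings into one searchable string:

```python
import requests

# Run OCR on a scan and flatten the recognized strings.
with open('document.jpg', 'rb') as f:
    result = requests.post('http://localhost:4603/ocr/predict',
                           files={'image': f}).json()

full_text = ' '.join(result.get('texts', []))
print(result.get('num_texts', 0), 'text regions:', full_text)
```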
| Endpoint | Method | Description |
|---|---|---|
| /analyze | POST | All models on single image (YOLO + faces + CLIP + OCR) |
| /analyze/batch | POST | Batch combined analysis |
| Endpoint | Method | Description |
|---|---|---|
| /clusters/train/{index} | POST | Train FAISS clustering for an index |
| /clusters/stats/{index} | GET | Get cluster statistics |
| /clusters/{index}/{id} | GET | Get cluster members |
| /clusters/albums | GET | List auto-generated albums |
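A sketch of the clustering workflow; the index name `faces` is a placeholder, so substitute one of the indexes your deployment actually creates:

```python
import requests

BASE = 'http://localhost:4603'
index = 'faces'  # placeholder -- substitute an index that exists in your deployment

# Train FAISS clustering for the index, then inspect the results.
print(requests.post(f'{BASE}/clusters/train/{index}').json())
print(requests.get(f'{BASE}/clusters/stats/{index}').json())
print(requests.get(f'{BASE}/clusters/albums').json())
```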
| Endpoint | Method | Description |
|---|---|---|
| /query/image/{id} | GET | Get stored image data/metadata |
| /query/stats | GET | Index statistics for all indexes |
| /query/duplicates | GET | List duplicate groups |
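Both stats and duplicates are plain GETs, so checking index growth or duplicate groups is a one-liner each:

```python
import requests

BASE = 'http://localhost:4603'

# Index statistics across all indexes, then any duplicate groups found so far.
print(requests.get(f'{BASE}/query/stats').json())
print(requests.get(f'{BASE}/query/duplicates').json())
```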
| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Service health check |
| /health/models | GET | Triton model status |
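These endpoints work well as readiness gates in scripts, for example waiting for the API (and Triton models) to come up after ./scripts/setup.sh:

```python
import time
import requests

# Poll /health until the API reports healthy, then dump Triton model status.
for _ in range(60):
    try:
        if requests.get('http://localhost:4603/health', timeout=2).json().get('status') == 'healthy':
            break
    except requests.RequestException:
        pass
    time.sleep(5)

print(requests.get('http://localhost:4603/health/models').json())
```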
import requests
# Object Detection
with open('image.jpg', 'rb') as f:
resp = requests.post('http://localhost:4603/detect', files={'image': f})
result = resp.json()
# {"detections": [{"x1": 0.1, "y1": 0.2, "x2": 0.3, "y2": 0.4, "confidence": 0.95, "class_id": 0, "class_name": "person"}], ...}
# Face Recognition
with open('photo.jpg', 'rb') as f:
resp = requests.post('http://localhost:4603/faces/recognize', files={'image': f})
print(resp.json())
# {"num_faces": 2, "faces": [...], "embeddings": [[...512 floats...], ...]}
# Image Embedding
with open('image.jpg', 'rb') as f:
resp = requests.post('http://localhost:4603/embed/image', files={'image': f})
embedding = resp.json()['embedding'] # 512-dim vector
# Text-to-Image Search
resp = requests.post('http://localhost:4603/search/text',
json={'query': 'a red sports car', 'top_k': 10})
results = resp.json()['results']
# Image Ingestion (auto-indexes everything)
with open('photo.jpg', 'rb') as f:
resp = requests.post('http://localhost:4603/ingest',
files={'image': f},
data={'image_id': 'photo_001'})
print(resp.json())
# {"status": "indexed", "image_id": "photo_001", "indexed": {"global": true, "faces": 2, "vehicles": 1}}
# OCR
with open('document.jpg', 'rb') as f:
resp = requests.post('http://localhost:4603/ocr/predict', files={'image': f})
print(resp.json())
# {"num_texts": 5, "texts": ["Invoice", "Total: $100"], ...}
# Combined Analysis (everything in one call)
with open('scene.jpg', 'rb') as f:
resp = requests.post('http://localhost:4603/analyze', files={'image': f})
result = resp.json()
# {"detections": [...], "faces": [...], "global_embedding": [...], "ocr": {...}}# Detection
curl -X POST http://localhost:4603/detect -F "image=@photo.jpg"
# Face Recognition
curl -X POST http://localhost:4603/faces/recognize -F "image=@face.jpg"
# Text Search
curl -X POST http://localhost:4603/search/text \
-H "Content-Type: application/json" \
-d '{"query": "sunset beach", "top_k": 10}'
# Ingestion
curl -X POST http://localhost:4603/ingest \
-F "image=@photo.jpg" \
-F "image_id=my_photo_001"{
"detections": [
{
"x1": 0.094, "y1": 0.278, "x2": 0.870, "y2": 0.989,
"confidence": 0.918,
"class_id": 0,
"class_name": "person"
}
],
"image": {"width": 1920, "height": 1080},
"inference_time_ms": 12.5
}
Note: Coordinates are normalized (0.0-1.0). Multiply by image width/height for pixels.
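For example, converting the detection above back to pixel coordinates:

```python
# Convert the normalized box above to pixels using the reported image size.
det = {'x1': 0.094, 'y1': 0.278, 'x2': 0.870, 'y2': 0.989}
width, height = 1920, 1080

x1, x2 = det['x1'] * width, det['x2'] * width    # 180.5, 1670.4
y1, y2 = det['y1'] * height, det['y2'] * height  # 300.2, 1068.1
print(int(x1), int(y1), int(x2), int(y2))        # 180 300 1670 1068
```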
{
"num_faces": 2,
"faces": [
{
"box": {"x1": 0.30, "y1": 0.10, "x2": 0.50, "y2": 0.40},
"confidence": 0.98,
"landmarks": [[0.35, 0.20], [0.45, 0.20], [0.40, 0.28], [0.36, 0.35], [0.44, 0.35]]
}
],
"embeddings": [[...512 floats...]],
"inference_time_ms": 25.3
}
{
"status": "success",
"results": [
{"image_id": "img_001", "score": 0.95, "image_path": "/path/to/image.jpg"}
],
"total_results": 10,
"search_time_ms": 15.2
}
{
"status": "success",
"image_id": "photo_001",
"num_detections": 5,
"num_faces": 2,
"embedding_norm": 1.0,
"indexed": {
"global": true,
"vehicles": 1,
"people": 2,
"faces": 2,
"ocr": true
},
"ocr": {
"num_texts": 3,
"full_text": "Invoice Total: $100",
"indexed": true
},
"total_time_ms": 850.4
}
Client (Port 4603)
|
v
+----------+
| yolo-api | FastAPI service (all endpoints)
+----------+
|
v
+--------------+ +------------+
| triton-server| | opensearch |
| (GPU) | | (k-NN) |
+--------------+ +------------+
Services:
- yolo-api (port 4603): FastAPI service handling all requests
- triton-server (ports 4600-4602): NVIDIA Triton Inference Server with TensorRT models
- opensearch (port 4607): Vector database for similarity search
- prometheus/grafana (ports 4604/4605): Monitoring stack
| Model | Purpose | Backend |
|---|---|---|
| YOLO11 | Object detection | TensorRT End2End |
| SCRFD-10G | Face detection + landmarks | TensorRT |
| ArcFace | Face embeddings (512-dim) | TensorRT |
| MobileCLIP | Image/text embeddings (512-dim) | TensorRT |
| PP-OCRv5 | Text detection + recognition | TensorRT |
All models use FP16 precision with dynamic batching for optimal throughput.
Measured Latency (single request):
| Operation | Time | Throughput |
|---|---|---|
| Object Detection | 140-170ms | ~6-7 RPS |
| Face Detection | 100-150ms | ~7-10 RPS |
| Face Recognition | 105-130ms | ~8-9 RPS |
| Image Embedding (CLIP) | 6-8ms | ~120 RPS |
| Text Embedding (CLIP) | 5-17ms | ~60-200 RPS |
| OCR Prediction | 170-350ms | ~3-6 RPS |
| Full Analyze | 280-430ms | ~2-3 RPS |
| Single Image Ingest | 750-950ms | ~1-1.3 RPS |
| Batch Ingest (50 images) | 7.3s total | ~6.8 images/sec |
Batch processing provides ~2-3x throughput improvement over sequential single-image processing.
Minimum:
- NVIDIA GPU with 8GB+ VRAM (Ampere or newer)
- 16GB RAM, 16 CPU cores
- Docker with NVIDIA Container Toolkit
Recommended:
- NVIDIA A100/A6000/RTX 4090 (16GB+)
- 64GB RAM, 48+ CPU cores
- NVMe SSD for image storage
# docker-compose.yml
command: --workers=64 # Production
command: --workers=2 # Development
# docker-compose.yml
device_ids: ['0', '2'] # Use GPUs 0 and 2
Run the comprehensive test suite to verify all functionality:
# Full system test (32 tests covering all endpoints)
source .venv/bin/activate
python tests/test_full_system.py 2>&1 | tee test_results/test_results.txt
# Visual validation (draws bounding boxes on test images)
python tests/validate_visual_results.py 2>&1 | tee test_results/visual_validation.txt
# View annotated test images
ls test_results/*.jpg
Test Coverage:
- ✅ All ML model endpoints (detection, faces, CLIP, OCR)
- ✅ Single and batch processing
- ✅ Directory ingest pipeline (50+ images)
- ✅ OpenSearch indexing and search
- ✅ Visual validation with bounding boxes
cd benchmarks
./build.sh
./triton_bench --mode quick # 30-second test
./triton_bench --mode full # Full benchmark
See benchmarks/README.md for a detailed benchmarking guide.
- CLAUDE.md: AI assistant instructions and detailed architecture
- docs/: Technical documentation
- docs/OCR_SETUP_GUIDE.md: OCR model setup
- docs/FACE_RECOGNITION_IMPLEMENTATION.md: Face recognition details
- docs/opensearch_schema_design.md: Vector search schema
- export/README.md: Model export documentation
- benchmarks/README.md: Benchmark tool guide
This project uses:
- NVIDIA Triton Inference Server
- Ultralytics YOLO
- levipereira/ultralytics fork for End2End TensorRT export
- Apple MobileCLIP
- InsightFace ArcFace
- InsightFace SCRFD
- PaddleOCR
See ATTRIBUTION.md for complete licensing information.
Built for maximum throughput: process 100K+ images in minutes and run visual search in milliseconds.