A high-performance FastAPI service for converting text (especially log messages) into vector embeddings using the Sentence Transformers library.
This service provides a REST API endpoint that transforms text into dense vector representations using the all-MiniLM-L6-v2 model. It's optimized for log processing and analysis workflows where semantic similarity and clustering of text data is needed.
This vectorizer is designed to support log anomaly detection and similarity analysis workflows by:
🔍 Anomaly Detection Pipeline:
- Vectorize logs → Convert log messages into 384-dimensional embeddings
- Store in Elasticsearch → Index vectors using Elasticsearch's dense vector fields
- Detect anomalies → Use vector similarity to identify unusual log patterns
- Find similar logs → Query for logs with similar semantic meaning
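The four pipeline steps above can be sketched in Python. This is a hedged sketch, not code from this repo: it assumes the vectorizer at `localhost:8000` and Elasticsearch at `localhost:9200` with the `logs` index mapping shown later; the helper names and kNN parameters are illustrative.

```python
import json
import urllib.request

VECTORIZER = "http://localhost:8000"  # assumed service address
ES = "http://localhost:9200"          # assumed Elasticsearch address

def embed(text: str) -> list[float]:
    """Step 1: convert a log message into a 384-dim embedding."""
    req = urllib.request.Request(
        f"{VECTORIZER}/vectorize",
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["vector"]

def index_log(message: str) -> None:
    """Step 2: vectorize a log line and index it into the `logs` index."""
    doc = {"message": message, "vector": embed(message)}
    req = urllib.request.Request(
        f"{ES}/logs/_doc",
        data=json.dumps(doc).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

def knn_query(vector: list[float], k: int = 5) -> dict:
    """Steps 3-4: build an Elasticsearch kNN search body over the
    dense_vector field, used to find semantically similar logs."""
    return {
        "knn": {
            "field": "vector",
            "query_vector": vector,
            "k": k,
            "num_candidates": 10 * k,
        },
        "_source": ["message", "timestamp"],
    }

# Usage (with both services running):
#   index_log("Database connection timeout - retrying")
#   body = knn_query(embed("Database connection timeout"), k=5)
#   # ...then POST body to {ES}/logs/_search
```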
🎯 Key Benefits:
- Semantic Understanding: Goes beyond keyword matching to understand log meaning
- Pattern Recognition: Identifies anomalous behavior even with different wording
- Similarity Search: Find related issues across different log formats
- Scalable Processing: High-throughput vectorization for large log volumes
🔗 Integration with Elasticsearch:

```
PUT /logs
{
  "mappings": {
    "properties": {
      "message": {"type": "text"},
      "vector": {"type": "dense_vector", "dims": 384},
      "timestamp": {"type": "date"}
    }
  }
}
```

- ⚡ High Performance: ~82 RPS sustained throughput with ~122ms response times
- 🐳 Docker Ready: Containerized with Docker Compose
- 🔄 Auto-restart: Production-ready with automatic container restart and health checks
- 📊 Consistent Latency: 95% of requests complete within 132ms
- 🛡️ Reliable: Zero failed requests in extensive load testing
- 🏥 Health Monitoring: Built-in `/healthz` endpoint for load balancers
- Clone and start the service:

```bash
git clone https://github.com/scott-hiemstra/vectorizer.git
cd vectorizer
docker compose up -d
```

- Test the API:

```bash
curl -X POST "http://localhost:8000/vectorize" \
  -H "Content-Type: application/json" \
  -d '{"text": "Database connection timeout - retrying"}'
```

For GPU-accelerated inference (2-3x faster, see Performance):
- Prerequisites:
  - NVIDIA GPU with CUDA support
  - NVIDIA Container Toolkit installed
  - Verify with `nvidia-smi` and `docker run --rm --gpus all nvidia/cuda:12.6.3-base-ubuntu24.04 nvidia-smi`
- Start the GPU service:

```bash
docker compose -f docker-compose.gpu.yml up -d
```

- Verify GPU is active:

```bash
curl -s http://localhost:8000/healthz | python3 -m json.tool
# Check container logs for: "Using device: cuda"
docker compose -f docker-compose.gpu.yml logs vectorizer-gpu | grep device
```

Note: The default `Dockerfile.gpu` uses PyTorch with CUDA 12.4 (cu124), which supports RTX 20xx and newer GPUs. For older GPUs (GTX 10xx / Pascal), see GPU Compatibility to switch to cu118.
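The "Using device: cuda" log line reflects a device-selection step at startup. A minimal sketch of that logic (the real `main.py` may differ; the CUDA-then-CPU fallback order is an assumption):

```python
def pick_device(cuda_available: bool) -> str:
    """Choose the inference device: CUDA when the container sees a GPU,
    otherwise CPU. Mirrors the "Using device: ..." log line."""
    return "cuda" if cuda_available else "cpu"

try:
    import torch  # present in the service image
    device = pick_device(torch.cuda.is_available())
except ImportError:
    # torch not installed in this environment; illustrate the CPU path
    device = pick_device(False)

print(f"Using device: {device}")
```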
- Install dependencies:

```bash
pip install -r requirements.txt
```

- Run the service:

```bash
uvicorn main:app --host 0.0.0.0 --port 8000
```

Converts input text into a vector embedding. Text longer than 10,000 characters is silently truncated (configurable via `MAX_TEXT_LENGTH`).
Request Body:

```json
{
  "text": "Your text to vectorize here"
}
```

Response:

```json
{
  "vector": [0.1234, -0.5678, 0.9012, ...]
}
```

Encode multiple texts in a single request. Significantly faster than individual calls, especially on GPU. Maximum 64 texts per request (configurable via `MAX_BATCH_SIZE`).
Request Body:

```json
{
  "texts": [
    "Database connection timeout after 30 seconds",
    "User login successful",
    "Disk usage exceeded 90% threshold"
  ]
}
```

Response:

```json
{
  "vectors": [
    [0.1234, -0.5678, ...],
    [0.9012, 0.3456, ...],
    [0.7890, -0.1234, ...]
  ]
}
```

Health check endpoint for monitoring and load balancers.
Response (Healthy):

```json
{
  "status": "healthy",
  "model": "all-MiniLM-L6-v2"
}
```

Response (Unhealthy):

```json
{
  "status": "unhealthy",
  "reason": "Model not loaded"
}
```

Returns HTTP 503 when unhealthy.
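The 200/503 contract above can be expressed as a small pure function. This is an illustrative sketch of the endpoint's behavior, not the actual `main.py` handler:

```python
def health_response(model_loaded: bool,
                    model_name: str = "all-MiniLM-L6-v2") -> tuple[int, dict]:
    """Return (status_code, body) matching the /healthz contract:
    200 with the model name when loaded, 503 with a reason otherwise."""
    if model_loaded:
        return 200, {"status": "healthy", "model": model_name}
    return 503, {"status": "unhealthy", "reason": "Model not loaded"}
```

Load balancers and orchestrators should key off the status code, not the body, since the body shape differs between the two cases.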
Prometheus metrics endpoint for monitoring and observability.
Response: Prometheus-formatted metrics including:
- `vectorizer_requests_total` - Total request count by method, endpoint, and status
- `vectorizer_request_duration_seconds` - Request latency histogram
- `vectorizer_encode_duration_seconds` - Time spent in model encoding
- `vectorizer_active_requests` - Current number of active requests
- `vectorizer_model_loaded` - Model status (1=loaded, 0=not loaded)
- `vectorizer_text_length_chars` - Input text length distribution
- `vectorizer_texts_truncated_total` - Texts truncated to max length
Example Usage:

```bash
# Error log
curl -X POST "http://localhost:8000/vectorize" \
  -H "Content-Type: application/json" \
  -d '{"text": "2024-10-10 14:32:15 ERROR Database connection timeout"}'

# Info log
curl -X POST "http://localhost:8000/vectorize" \
  -H "Content-Type: application/json" \
  -d '{"text": "2024-10-10 14:33:01 INFO User login successful"}'

# Health check
curl http://localhost:8000/healthz

# Prometheus metrics
curl http://localhost:8000/metrics
```

| Config | RPS (load) | Avg Latency | p95 Latency |
|---|---|---|---|
| CPU (1 worker) | 67 | 49ms | 90ms |
| CPU (4 workers) | 72 | 36ms | 75ms |
| GPU (GTX 1050 Ti) | 76-120 | 29ms | 59ms |
GPU provides ~1.7x lower latency and ~1.7x more throughput at high concurrency. Full benchmark results, methodology, and load testing instructions are in PERFORMANCE.md.
| Variable | Default | Description |
|---|---|---|
| `PORT` | `8000` | Service port |
| `HOST` | `0.0.0.0` | Service host |
| `MODEL_NAME` | `all-MiniLM-L6-v2` | Sentence Transformers model to load |
| `MAX_TEXT_LENGTH` | `10000` | Truncate input text beyond this many characters |
| `MAX_BATCH_SIZE` | `64` | Maximum number of texts per batch request |
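Since the batch endpoint rejects requests over `MAX_BATCH_SIZE`, a client with a large backlog has to split its texts across requests. A minimal client-side sketch (the helper name is ours; 64 matches the default above):

```python
from typing import Iterator

def chunk_texts(texts: list[str], max_batch: int = 64) -> Iterator[list[str]]:
    """Yield consecutive slices of at most max_batch texts, preserving
    order, so each slice fits under the service's MAX_BATCH_SIZE limit."""
    for start in range(0, len(texts), max_batch):
        yield texts[start:start + max_batch]
```

Each yielded slice becomes the `texts` field of one batch request; concatenating the returned `vectors` lists in order reassembles the full result.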
The service is configured to:
- Run on port 8000 (mapped from host)
- Auto-restart on failure
- Use minimal resource footprint
To modify resource limits, uncomment the `deploy` section in `docker-compose.yml`:

```yaml
deploy:
  resources:
    limits:
      cpus: '2'
      memory: 8G
```

- Model: `all-MiniLM-L6-v2`
- Embedding Dimension: 384
- Max Sequence Length: 256 tokens
- Performance: Optimized for speed while maintaining quality
Perfect for:
- Elasticsearch Integration: Store vectors in dense_vector fields for fast similarity search
- Pattern Detection: Identify unusual log patterns that deviate from normal behavior
- Incident Response: Quickly find logs similar to known issues
- Baseline Establishment: Create vector baselines for normal system behavior
- Semantic Clustering: Group logs by meaning, not just keywords
- Cross-System Correlation: Find related issues across different applications
- Error Classification: Automatically categorize errors by semantic similarity
- Troubleshooting: Search for logs with similar context or meaning
- Feature Extraction: Use vectors as input for downstream ML models
- Time-Series Analysis: Track semantic drift in log patterns over time
- Root Cause Analysis: Identify common patterns in incident logs
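Most of the use cases above reduce to comparing embeddings. Given two vectors returned by the service, cosine similarity is the standard measure; a stdlib-only sketch:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embeddings: 1.0 means same direction
    (semantically close), 0.0 orthogonal (unrelated), -1.0 opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

For anomaly detection, a common pattern is to flag logs whose similarity to every baseline vector falls below a threshold; the threshold itself is workload-specific and must be tuned.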
```bash
# Install dependencies
pip install -r requirements.txt

# Run with auto-reload
uvicorn main:app --reload --host 0.0.0.0 --port 8000
```

```bash
# Basic functionality test
python -c "
import requests
response = requests.post('http://localhost:8000/vectorize',
                         json={'text': 'test message'})
print(f'Status: {response.status_code}')
print(f'Vector length: {len(response.json()[\"vector\"])}')
"
```

For higher throughput:
- GPU Acceleration (recommended; see GPU Deployment):
  - 2-3x lower latency, 2x+ throughput at high concurrency
  - Even modest GPUs (GTX 1050 Ti) provide meaningful gains
- Load Balancer: Use nginx or similar for distributing requests across multiple instances
- Multiple Workers (CPU only; not recommended for most cases):

```bash
uvicorn main:app --workers 4 --host 0.0.0.0 --port 8000
```

⚠️ Each worker loads its own copy of the model (~90MB each). At low-to-moderate concurrency, multi-worker adds overhead without throughput gains. Prefer GPU or horizontal scaling with separate containers behind a load balancer.
The GPU Docker image uses PyTorch with a specific CUDA toolkit version. The default (cu124) supports RTX 20xx and newer. For older GPUs, switch to cu118.
| GPU Family | Architecture | Compute Capability | CUDA | PyTorch Index URL |
|---|---|---|---|---|
| RTX 40xx (4070, 4090, etc.) | Ada Lovelace | sm_89 | cu124 (default) | https://download.pytorch.org/whl/cu124 |
| RTX 30xx (3060, 3090, etc.) | Ampere | sm_86 | cu124 (default) | https://download.pytorch.org/whl/cu124 |
| RTX 20xx (2070, 2080, etc.) | Turing | sm_75 | cu124 (default) | https://download.pytorch.org/whl/cu124 |
| A100, H100 (data center) | Ampere/Hopper | sm_80 / sm_90 | cu124 (default) | https://download.pytorch.org/whl/cu124 |
| GTX 10xx (1050 Ti, 1080, etc.) | Pascal | sm_61 | cu118 | https://download.pytorch.org/whl/cu118 |
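The cu118-vs-cu124 choice in the table can be derived from `nvidia-smi`'s `compute_cap` output. A sketch; the 7.0 cutoff is inferred from the table rows (Pascal sm_61 needs cu118, Turing sm_75 and newer take cu124), not from an official support matrix:

```python
def wheel_for_compute_cap(compute_cap: str) -> str:
    """Map a compute capability string (e.g. "6.1", as printed by
    `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`) to the
    PyTorch wheel index suggested by the table above."""
    major = int(compute_cap.strip().split(".")[0])
    return "cu118" if major < 7 else "cu124"
```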
For older GPUs (Pascal / GTX 10xx), edit `Dockerfile.gpu`:

```dockerfile
# Change base image:
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

# Change PyTorch index:
RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu118
```

How to check your GPU's compute capability:

```bash
nvidia-smi --query-gpu=name,compute_cap --format=csv
```

Built-in observability features:
- ✅ Health check endpoint (`/healthz`) - Ready for load balancers
- ✅ Prometheus metrics (`/metrics`) - Comprehensive performance monitoring
- ✅ Request logging - Structured logging with uvicorn
Key Metrics to Monitor:
- `vectorizer_requests_total` - Request volume and error rates
- `vectorizer_request_duration_seconds` - API latency percentiles
- `vectorizer_encode_duration_seconds` - Model performance
- `vectorizer_active_requests` - Concurrent load
- `vectorizer_text_length_chars` - Input size distribution
Sample Prometheus Queries:

```promql
# 95th percentile latency
histogram_quantile(0.95, rate(vectorizer_request_duration_seconds_bucket[5m]))

# Error rate
rate(vectorizer_requests_total{status!="200"}[5m]) / rate(vectorizer_requests_total[5m])

# Requests per second
rate(vectorizer_requests_total[5m])
```
Grafana Dashboard: Monitor throughput, latency, error rates, and model performance in real-time.
- Error tracking
Model loading fails:
- Ensure sufficient memory (>2GB recommended)
- Check internet connectivity for initial model download
Performance degradation:
- Monitor CPU usage
- Consider reducing concurrency
- Check for memory leaks in long-running deployments
Container startup issues:
- Verify port 8000 is available
- Check Docker daemon is running
- Review container logs: `docker compose logs vectorizer`
This project is licensed under the BSD 3-Clause License - see the LICENSE file for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request