A high-performance FastAPI service for converting text (especially log messages) into vector embeddings using the Sentence Transformers library.
This service provides a REST API endpoint that transforms text into dense vector representations using the all-MiniLM-L6-v2 model. It's optimized for log processing and analysis workflows where semantic similarity and clustering of text data is needed.
This vectorizer is designed to support log anomaly detection and similarity analysis workflows by:
🔍 Anomaly Detection Pipeline:
- Vectorize logs → Convert log messages into 384-dimensional embeddings
- Store in Elasticsearch → Index vectors using Elasticsearch's dense vector fields
- Detect anomalies → Use vector similarity to identify unusual log patterns
- Find similar logs → Query for logs with similar semantic meaning
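The four pipeline steps above can be sketched in Python. This is a hedged sketch, not code from this repo: it assumes the vectorizer at `localhost:8000` and Elasticsearch at `localhost:9200` with the `logs` index mapping shown later; the helper names and kNN parameters are illustrative.

```python
import json
import urllib.request

VECTORIZER = "http://localhost:8000"  # assumed service address
ES = "http://localhost:9200"          # assumed Elasticsearch address

def embed(text: str) -> list[float]:
    """Step 1: convert a log message into a 384-dim embedding."""
    req = urllib.request.Request(
        f"{VECTORIZER}/vectorize",
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["vector"]

def index_log(message: str) -> None:
    """Step 2: vectorize a log line and index it into the `logs` index."""
    doc = {"message": message, "vector": embed(message)}
    req = urllib.request.Request(
        f"{ES}/logs/_doc",
        data=json.dumps(doc).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

def knn_query(vector: list[float], k: int = 5) -> dict:
    """Steps 3-4: build an Elasticsearch kNN search body over the
    dense_vector field, used to find semantically similar logs."""
    return {
        "knn": {
            "field": "vector",
            "query_vector": vector,
            "k": k,
            "num_candidates": 10 * k,
        },
        "_source": ["message", "timestamp"],
    }

# Usage (with both services running):
#   index_log("Database connection timeout - retrying")
#   body = knn_query(embed("Database connection timeout"), k=5)
#   # ...then POST body to {ES}/logs/_search
```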
🎯 Key Benefits:
- Semantic Understanding: Goes beyond keyword matching to understand log meaning
- Pattern Recognition: Identifies anomalous behavior even with different wording
- Similarity Search: Find related issues across different log formats
- Scalable Processing: High-throughput vectorization for large log volumes
🔗 Integration with Elasticsearch:

```
PUT /logs
{
  "mappings": {
    "properties": {
      "message": {"type": "text"},
      "vector": {"type": "dense_vector", "dims": 384},
      "timestamp": {"type": "date"}
    }
  }
}
```

- ⚡ High Performance: ~82 RPS sustained throughput with ~122ms response times
- 🐳 Docker Ready: Containerized with Docker Compose
- 🔄 Auto-restart: Production-ready with automatic container restart and health checks
- 📊 Consistent Latency: 95% of requests complete within 132ms
- 🛡️ Reliable: Zero failed requests in extensive load testing
- 🏥 Health Monitoring: Built-in `/healthz` endpoint for load balancers
- Clone and start the service:

```bash
git clone https://github.com/scott-hiemstra/vectorizer.git
cd vectorizer
docker compose up -d
```

- Test the API:

```bash
curl -X POST "http://localhost:8000/vectorize" \
  -H "Content-Type: application/json" \
  -d '{"text": "Database connection timeout - retrying"}'
```

For GPU-accelerated inference (2-3x faster, see Performance):
- Prerequisites:
  - NVIDIA GPU with CUDA support
  - NVIDIA Container Toolkit installed
  - Verify with `nvidia-smi` and `docker run --rm --gpus all nvidia/cuda:12.6.3-base-ubuntu24.04 nvidia-smi`
- Start the GPU service:

```bash
docker compose -f docker-compose.gpu.yml up -d
```

- Verify GPU is active:

```bash
curl -s http://localhost:8000/healthz | python3 -m json.tool
# Check container logs for: "Using device: cuda"
docker compose -f docker-compose.gpu.yml logs vectorizer-gpu | grep device
```

Note: The default `Dockerfile.gpu` uses PyTorch with CUDA 12.4 (cu124), which supports RTX 20xx and newer GPUs. For older GPUs (GTX 10xx / Pascal), see GPU Compatibility to switch to cu118.
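The "Using device: cuda" log line reflects a device-selection step at startup. A minimal sketch of that logic (the real `main.py` may differ; the CUDA-then-CPU fallback order is an assumption):

```python
def pick_device(cuda_available: bool) -> str:
    """Choose the inference device: CUDA when the container sees a GPU,
    otherwise CPU. Mirrors the "Using device: ..." log line."""
    return "cuda" if cuda_available else "cpu"

try:
    import torch  # present in the service image
    device = pick_device(torch.cuda.is_available())
except ImportError:
    # torch not installed in this environment; illustrate the CPU path
    device = pick_device(False)

print(f"Using device: {device}")
```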
- Install dependencies:

```bash
pip install -r requirements.txt
```

- Run the service:

```bash
uvicorn main:app --host 0.0.0.0 --port 8000
```

Converts input text into a vector embedding. Text longer than 10,000 characters is silently truncated (configurable via `MAX_TEXT_LENGTH`).
Request Body:

```json
{
  "text": "Your text to vectorize here"
}
```

Response:

```json
{
  "vector": [0.1234, -0.5678, 0.9012, ...]
}
```

Encode multiple texts in a single request. Significantly faster than individual calls, especially on GPU. Maximum 64 texts per request (configurable via `MAX_BATCH_SIZE`).
Request Body:

```json
{
  "texts": [
    "Database connection timeout after 30 seconds",
    "User login successful",
    "Disk usage exceeded 90% threshold"
  ]
}
```

Response:

```json
{
  "vectors": [
    [0.1234, -0.5678, ...],
    [0.9012, 0.3456, ...],
    [0.7890, -0.1234, ...]
  ]
}
```

Health check endpoint for monitoring and load balancers.
Response (Healthy):

```json
{
  "status": "healthy",
  "model": "all-MiniLM-L6-v2"
}
```

Response (Unhealthy):

```json
{
  "status": "unhealthy",
  "reason": "Model not loaded"
}
```

Returns HTTP 503 when unhealthy.
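The 200/503 contract above can be expressed as a small pure function. This is an illustrative sketch of the endpoint's behavior, not the actual `main.py` handler:

```python
def health_response(model_loaded: bool,
                    model_name: str = "all-MiniLM-L6-v2") -> tuple[int, dict]:
    """Return (status_code, body) matching the /healthz contract:
    200 with the model name when loaded, 503 with a reason otherwise."""
    if model_loaded:
        return 200, {"status": "healthy", "model": model_name}
    return 503, {"status": "unhealthy", "reason": "Model not loaded"}
```

Load balancers and orchestrators should key off the status code, not the body, since the body shape differs between the two cases.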
Prometheus metrics endpoint for monitoring and observability.
Response: Prometheus-formatted metrics including:
- `vectorizer_requests_total` - Total request count by method, endpoint, and status
- `vectorizer_request_duration_seconds` - Request latency histogram
- `vectorizer_encode_duration_seconds` - Time spent in model encoding
- `vectorizer_active_requests` - Current number of active requests
- `vectorizer_model_loaded` - Model status (1=loaded, 0=not loaded)
- `vectorizer_text_length_chars` - Input text length distribution
- `vectorizer_texts_truncated_total` - Texts truncated to max length
Example Usage:

```bash
# Error log
curl -X POST "http://localhost:8000/vectorize" \
  -H "Content-Type: application/json" \
  -d '{"text": "2024-10-10 14:32:15 ERROR Database connection timeout"}'

# Info log
curl -X POST "http://localhost:8000/vectorize" \
  -H "Content-Type: application/json" \
  -d '{"text": "2024-10-10 14:33:01 INFO User login successful"}'

# Health check
curl http://localhost:8000/healthz

# Prometheus metrics
curl http://localhost:8000/metrics
```

| Config | RPS (load) | Avg Latency | p95 Latency |
|---|---|---|---|
| CPU (1 worker) | 67 | 49ms | 90ms |
| CPU (4 workers) | 72 | 36ms | 75ms |
| GPU (GTX 1050 Ti) | 76-120 | 29ms | 59ms |
GPU provides ~1.7x lower latency and ~1.7x more throughput at high concurrency. Full benchmark results, methodology, and load testing instructions are in PERFORMANCE.md.
| Variable | Default | Description |
|---|---|---|
| `PORT` | `8000` | Service port |
| `HOST` | `0.0.0.0` | Service host |
| `MODEL_NAME` | `all-MiniLM-L6-v2` | Sentence Transformers model to load |
| `MAX_TEXT_LENGTH` | `10000` | Truncate input text beyond this many characters |
| `MAX_BATCH_SIZE` | `64` | Maximum number of texts per batch request |
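Since the batch endpoint rejects requests over `MAX_BATCH_SIZE`, a client with a large backlog has to split its texts across requests. A minimal client-side sketch (the helper name is ours; 64 matches the default above):

```python
from typing import Iterator

def chunk_texts(texts: list[str], max_batch: int = 64) -> Iterator[list[str]]:
    """Yield consecutive slices of at most max_batch texts, preserving
    order, so each slice fits under the service's MAX_BATCH_SIZE limit."""
    for start in range(0, len(texts), max_batch):
        yield texts[start:start + max_batch]
```

Each yielded slice becomes the `texts` field of one batch request; concatenating the returned `vectors` lists in order reassembles the full result.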
The service is configured to:
- Run on port 8000 (mapped from host)
- Auto-restart on failure
- Use minimal resource footprint
To modify resource limits, uncomment the `deploy` section in `docker-compose.yml`:

```yaml
deploy:
  resources:
    limits:
      cpus: '2'
      memory: 8G
```

- Model: `all-MiniLM-L6-v2`
- Embedding Dimension: 384
- Max Sequence Length: 256 tokens
- Performance: Optimized for speed while maintaining quality
Perfect for:
- Elasticsearch Integration: Store vectors in dense_vector fields for fast similarity search
- Pattern Detection: Identify unusual log patterns that deviate from normal behavior
- Incident Response: Quickly find logs similar to known issues
- Baseline Establishment: Create vector baselines for normal system behavior
- Semantic Clustering: Group logs by meaning, not just keywords
- Cross-System Correlation: Find related issues across different applications
- Error Classification: Automatically categorize errors by semantic similarity
- Troubleshooting: Search for logs with similar context or meaning
- Feature Extraction: Use vectors as input for downstream ML models
- Time-Series Analysis: Track semantic drift in log patterns over time
- Root Cause Analysis: Identify common patterns in incident logs
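Most of the use cases above reduce to comparing embeddings. Given two vectors returned by the service, cosine similarity is the standard measure; a stdlib-only sketch:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embeddings: 1.0 means same direction
    (semantically close), 0.0 orthogonal (unrelated), -1.0 opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

For anomaly detection, a common pattern is to flag logs whose similarity to every baseline vector falls below a threshold; the threshold itself is workload-specific and must be tuned.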
```bash
# Install dependencies
pip install -r requirements.txt

# Run with auto-reload
uvicorn main:app --reload --host 0.0.0.0 --port 8000
```

```bash
# Basic functionality test
python -c "
import requests
response = requests.post('http://localhost:8000/vectorize',
                         json={'text': 'test message'})
print(f'Status: {response.status_code}')
print(f'Vector length: {len(response.json()[\"vector\"])}')
"
```

For higher throughput:
- GPU Acceleration (recommended; see GPU Deployment):
  - 2-3x lower latency, 2x+ throughput at high concurrency
  - Even modest GPUs (GTX 1050 Ti) provide meaningful gains
- Load Balancer: Use nginx or similar for distributing requests across multiple instances
- Multiple Workers (CPU only; not recommended for most cases):

```bash
uvicorn main:app --workers 4 --host 0.0.0.0 --port 8000
```

⚠️ Each worker loads its own copy of the model (~90MB each). At low-to-moderate concurrency, multi-worker adds overhead without throughput gains. Prefer GPU or horizontal scaling with separate containers behind a load balancer.
The GPU Docker image uses PyTorch with a specific CUDA toolkit version. The default (cu124) supports RTX 20xx and newer. For older GPUs, switch to cu118.
| GPU Family | Architecture | Compute Capability | CUDA | PyTorch Index URL |
|---|---|---|---|---|
| RTX 40xx (4070, 4090, etc.) | Ada Lovelace | sm_89 | cu124 (default) | https://download.pytorch.org/whl/cu124 |
| RTX 30xx (3060, 3090, etc.) | Ampere | sm_86 | cu124 (default) | https://download.pytorch.org/whl/cu124 |
| RTX 20xx (2070, 2080, etc.) | Turing | sm_75 | cu124 (default) | https://download.pytorch.org/whl/cu124 |
| A100, H100 (data center) | Ampere/Hopper | sm_80 / sm_90 | cu124 (default) | https://download.pytorch.org/whl/cu124 |
| GTX 10xx (1050 Ti, 1080, etc.) | Pascal | sm_61 | cu118 | https://download.pytorch.org/whl/cu118 |
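The cu118-vs-cu124 choice in the table can be derived from `nvidia-smi`'s `compute_cap` output. A sketch; the 7.0 cutoff is inferred from the table rows (Pascal sm_61 needs cu118, Turing sm_75 and newer take cu124), not from an official support matrix:

```python
def wheel_for_compute_cap(compute_cap: str) -> str:
    """Map a compute capability string (e.g. "6.1", as printed by
    `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`) to the
    PyTorch wheel index suggested by the table above."""
    major = int(compute_cap.strip().split(".")[0])
    return "cu118" if major < 7 else "cu124"
```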
For older GPUs (Pascal / GTX 10xx), edit `Dockerfile.gpu`:

```dockerfile
# Change base image:
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

# Change PyTorch index:
RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu118
```

How to check your GPU's compute capability:

```bash
nvidia-smi --query-gpu=name,compute_cap --format=csv
```

Built-in observability features:
- ✅ Health check endpoint (`/healthz`) - Ready for load balancers
- ✅ Prometheus metrics (`/metrics`) - Comprehensive performance monitoring
- ✅ Request logging - Structured logging with uvicorn
Key Metrics to Monitor:
- `vectorizer_requests_total` - Request volume and error rates
- `vectorizer_request_duration_seconds` - API latency percentiles
- `vectorizer_encode_duration_seconds` - Model performance
- `vectorizer_active_requests` - Concurrent load
- `vectorizer_text_length_chars` - Input size distribution
Sample Prometheus Queries:

```promql
# 95th percentile latency
histogram_quantile(0.95, rate(vectorizer_request_duration_seconds_bucket[5m]))

# Error rate
rate(vectorizer_requests_total{status!="200"}[5m]) / rate(vectorizer_requests_total[5m])

# Requests per second
rate(vectorizer_requests_total[5m])
```
Grafana Dashboard: Monitor throughput, latency, error rates, and model performance in real-time.
- Error tracking
Model loading fails:
- Ensure sufficient memory (>2GB recommended)
- Check internet connectivity for initial model download
Performance degradation:
- Monitor CPU usage
- Consider reducing concurrency
- Check for memory leaks in long-running deployments
Container startup issues:
- Verify port 8000 is available
- Check Docker daemon is running
- Review container logs: `docker compose logs vectorizer`
This project is licensed under the BSD 3-Clause License - see the LICENSE file for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request