A high-performance, self-hosted, model-agnostic embedding service designed for LLM applications, RAG pipelines, and code intelligence tools. It serves as a drop-in replacement for OpenAI's embedding API while offering advanced features like native batching, smart chunking, and hardware acceleration.
- High Performance: Native batch processing and an async architecture for maximum throughput.
- Multi-Model Support: Run any Hugging Face model (e.g., MiniLM, BGE, E5) with dynamic loading.
- OpenAI Compatible: Works seamlessly with LangChain, LlamaIndex, and AutoGPT using standard OpenAI clients.
- Smart Chunking: Built-in text splitting (token/character) with overlap handling to fit model context windows.
- Caching: Integrated Redis and in-memory caching to eliminate redundant computation.
- Production Ready: API key/JWT authentication, rate limiting, and Prometheus metrics.
- Hardware Acceleration: Auto-detects NVIDIA CUDA or Apple MPS (Metal) for GPU inference.
- Dual Protocol: Exposes both HTTP (FastAPI) and gRPC interfaces.
```mermaid
graph LR
    Client[Client App] -->|HTTP/gRPC| Auth[Auth Layer]
    Auth --> Cache[Cache Layer]
    Cache -->|Miss| Batcher[Batch Manager]
    Batcher --> Model[Model Manager]
    Model -->|Inference| GPU[GPU/CPU]
    GPU -->|Vectors| Cache
    Cache -->|Vectors| Client
```
- Python 3.11 (Recommended)
- Docker & Docker Compose (Optional)
- Clone the repository:

  ```bash
  git clone https://github.com/abdullah85398/embedding-server.git
  cd embedding-server
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure the environment by copying the example file:

  ```bash
  cp .env.example .env
  ```

  Edit `.env` to set your API key or enable Redis if needed.
Run the server with a single command:

```bash
docker-compose up --build -d
```

The server will be available at http://localhost:8000.
Pull and run the latest pre-built image from GitHub Container Registry:
```bash
docker run -p 8000:8000 ghcr.io/abdullah85398/embedding-server:latest
```

The server will be available at http://localhost:8000.
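Configuration can be overridden via environment variables (see the table below); with Docker this uses the standard `-e` flag. For example, to set a custom API key (`API_KEY` is taken from the configuration table, the key value here is a placeholder):

```bash
docker run -p 8000:8000 -e API_KEY=your-secret-key \
  ghcr.io/abdullah85398/embedding-server:latest
```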
Start the server locally:
```bash
python main.py
```

- HTTP API: `http://localhost:8000`
- gRPC API: `localhost:50051`
- Swagger UI: `http://localhost:8000/docs`
| Variable | Default | Description |
|---|---|---|
| `AUTH_MODE` | `KEY` | Auth mode: `NONE`, `KEY`, or `JWT`. |
| `API_KEY` | `changeme` | Master API key for requests. |
| `JWT_SECRET` | `secret` | Secret for JWT signing (if mode is `JWT`). |
| `ENABLE_CACHE` | `True` | Enable result caching. |
| `REDIS_URL` | - | Redis connection string (uses in-memory cache if empty). |
| `MAX_INFLIGHT_REQUESTS` | `100` | Concurrency limit (semaphore). |
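For reference, an illustrative `.env` built from the defaults in the table above (standard `KEY=VALUE` dotenv format assumed; leaving `REDIS_URL` empty falls back to the in-memory cache):

```bash
AUTH_MODE=KEY
API_KEY=changeme
JWT_SECRET=secret
ENABLE_CACHE=True
REDIS_URL=
MAX_INFLIGHT_REQUESTS=100
```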
Define available models in `models.yaml`. The key is the alias used in API calls.

```yaml
models:
  mini:
    name: all-MiniLM-L6-v2
    preload: true
  bge:
    name: BAAI/bge-base-en-v1.5
    preload: true
```

Generate a vector for a text string:
```bash
curl -X POST http://localhost:8000/embed \
  -H "X-API-Key: changeme" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mini",
    "input": "Hello world"
  }'
```

Process multiple texts in parallel (optimized for GPU):
```bash
curl -X POST http://localhost:8000/embed \
  -H "X-API-Key: changeme" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mini",
    "input": ["Document 1", "Document 2", "Document 3"]
  }'
```
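The same batch request from Python with the `requests` library, as a minimal sketch: the endpoint, header, and payload mirror the curl call above, and the response is printed as-is since this README does not pin down its exact schema.

```python
import requests

# POST a batch of texts to the /embed endpoint (see the curl example above).
resp = requests.post(
    "http://localhost:8000/embed",
    headers={"X-API-Key": "changeme"},
    json={"model": "mini", "input": ["Document 1", "Document 2", "Document 3"]},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # schema not specified here, so just inspect the payload
```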
Split long text into chunks (e.g., 256 tokens) and embed each:

```bash
curl -X POST http://localhost:8000/embed/chunk \
  -H "X-API-Key: changeme" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mini",
    "input": "Very long text content...",
    "method": "token",
    "size": 256,
    "overlap": 20
  }'
```
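To make the `size`/`overlap` parameters concrete, here is an illustrative sketch of overlapping token windows. This is a hypothetical helper, not the server's actual splitter, and whitespace "tokens" stand in for real tokenizer output:

```python
def chunk_tokens(tokens, size=256, overlap=20):
    """Slide a window of `size` tokens, advancing size - overlap each step,
    so consecutive chunks share `overlap` tokens. Illustration only."""
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

# Whitespace tokenization for demonstration purposes only:
words = ("Very long text content " * 10).split()
for chunk in chunk_tokens(words, size=8, overlap=2):
    print(" ".join(chunk))
```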
Works with standard OpenAI libraries:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="changeme",
)

response = client.embeddings.create(
    model="mini",
    input="Hello from OpenAI client!",
)
print(response.data[0].embedding)
```
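Because the server speaks the OpenAI embeddings protocol, higher-level frameworks can point at it as well. An illustrative sketch with the `langchain-openai` package (parameter names come from that library, not this project; `check_embedding_ctx_length=False` skips LangChain's client-side tiktoken pass, which assumes OpenAI-hosted models):

```python
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="mini",
    base_url="http://localhost:8000/v1",
    api_key="changeme",
    # Disable client-side token counting meant for OpenAI-hosted models.
    check_embedding_ctx_length=False,
)

vector = embeddings.embed_query("Hello from LangChain!")
print(len(vector))
```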
For a comprehensive guide on configuration, advanced features, and troubleshooting, please refer to the User Guide.

The repository includes several ready-to-run examples to help you get started:
- `example_client.py`: A complete demonstration of the HTTP API, including basic embedding, batching, smart chunking, and admin operations.
- `example_grpc_client.py`: Shows how to interact with the high-performance gRPC interface (unary, streaming, and chunking).
- `example_openai.py`: Demonstrates how to use the standard `openai` Python library to communicate with this server.
The project uses pytest for testing. This includes unit tests and documentation integrity checks.
```bash
# Run all tests
pytest

# Verify documentation examples match implementation
pytest tests/test_docs_integrity.py
```

We use ruff for code quality:
```bash
pip install ruff
ruff check .
```

- Fork the repository.
- Create a feature branch (`git checkout -b feature/amazing-feature`).
- Commit your changes (`git commit -m 'Add amazing feature'`).
- Run tests to ensure no regressions.
- Push to the branch (`git push origin feature/amazing-feature`).
- Open a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.