
Embedding Server

A high-performance, self-hosted, model-agnostic embedding service designed for LLM applications, RAG pipelines, and code intelligence tools. It serves as a drop-in replacement for OpenAI's embedding API while offering advanced features like native batching, smart chunking, and hardware acceleration.

Key Features

  • πŸš€ High Performance: Native batch processing and async architecture for maximum throughput.
  • 🧠 Multi-Model Support: Run any Hugging Face model (e.g., MiniLM, BGE, E5) with dynamic loading.
  • 🀝 OpenAI Compatible: Seamlessly works with LangChain, LlamaIndex, and AutoGPT using standard OpenAI clients.
  • πŸ”ͺ Smart Chunking: Built-in text splitting (Token/Character) with overlap handling to fit model context windows.
  • ⚑ Caching: Integrated Redis & In-Memory caching to eliminate redundant computations.
  • πŸ›‘οΈ Production Ready: API Key/JWT authentication, rate limiting, and Prometheus metrics.
  • βš™οΈ Hardware Acceleration: Auto-detects NVIDIA CUDA or Apple MPS (Metal) for GPU inference.
  • πŸ”Œ Dual Protocol: Exposes both HTTP (FastAPI) and gRPC interfaces.

Architecture

graph LR
    Client[Client App] -->|HTTP/gRPC| Auth[Auth Layer]
    Auth --> Cache[Cache Layer]
    Cache -->|Miss| Batcher[Batch Manager]
    Batcher --> Model[Model Manager]
    Model -->|Inference| GPU[GPU/CPU]
    GPU -->|Vectors| Cache
    Cache -->|Vectors| Client
Requests pass through authentication and the cache layer; on a cache miss they are grouped by the batch manager, run through the model manager on GPU or CPU, and the resulting vectors are cached and returned to the client.

Installation

Prerequisites

  • Python 3.11 (Recommended)
  • Docker & Docker Compose (Optional)

Local Setup

  1. Clone the repository:

    git clone https://github.com/abdullah85398/embedding-server.git
    cd embedding-server
  2. Install dependencies:

    pip install -r requirements.txt
  3. Configure environment: Copy the example environment file:

    cp .env.example .env

    Edit .env to set your API Key or enable Redis if needed.

Docker Setup

Using Docker Compose

Run the server with a single command:

docker-compose up --build -d

The server will be available at http://localhost:8000.
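If you need to customize the stack, a minimal compose file along these lines works. This is an illustrative sketch based on the defaults described in this README (HTTP on 8000, gRPC on 50051, optional Redis), not a copy of the repository's own docker-compose.yml:

services:
  embedding-server:
    build: .
    ports:
      - "8000:8000"    # HTTP API
      - "50051:50051"  # gRPC API
    env_file:
      - .env
  redis:               # optional; point REDIS_URL at redis://redis:6379/0 in .env
    image: redis:7-alpine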

Using Docker (Pre-built Image)

Pull and run the latest pre-built image from GitHub Container Registry:

docker run -p 8000:8000 ghcr.io/abdullah85398/embedding-server:latest

The server will be available at http://localhost:8000.
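To also expose the gRPC port and reuse your local configuration, extend the run command (assuming a populated .env in the current directory):

docker run -p 8000:8000 -p 50051:50051 --env-file .env \
  ghcr.io/abdullah85398/embedding-server:latest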

Usage Guide

Running the Server

Start the server locally:

python main.py

  • HTTP API: http://localhost:8000
  • gRPC API: localhost:50051
  • Swagger UI: http://localhost:8000/docs

Configuration

Environment Variables (.env)

Variable              | Default  | Description
--------------------- | -------- | ------------------------------------------------------------
AUTH_MODE             | KEY      | Auth mode: NONE, KEY, or JWT.
API_KEY               | changeme | Master API key for requests.
JWT_SECRET            | secret   | Secret for JWT signing (if mode is JWT).
ENABLE_CACHE          | True     | Enable result caching.
REDIS_URL             | -        | Redis connection string (in-memory cache is used if empty).
MAX_INFLIGHT_REQUESTS | 100      | Concurrency limit (semaphore).
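For reference, a minimal .env built from the variables above might look like this (values are illustrative; omit REDIS_URL to use the in-memory cache):

AUTH_MODE=KEY
API_KEY=changeme
ENABLE_CACHE=True
REDIS_URL=redis://localhost:6379/0
MAX_INFLIGHT_REQUESTS=100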

Model Configuration (models.yaml)

Define available models in models.yaml. The key is the alias used in API calls.

models:
  mini:
    name: all-MiniLM-L6-v2
    preload: true
  bge:
    name: BAAI/bge-base-en-v1.5
    preload: true
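Any Hugging Face embedding model can be registered the same way. For example, adding an E5 alias that is not preloaded at startup (assuming preload: false defers loading until the first request, per the dynamic-loading feature above):

models:
  e5:
    name: intfloat/e5-base-v2
    preload: false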

API Examples

1. Standard Embedding

Generate a vector for a text string.

curl -X POST http://localhost:8000/embed \
  -H "X-API-Key: changeme" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mini",
    "input": "Hello world"
  }'

2. Native Batching

Process multiple texts in a single request (batched for efficient GPU utilization).

curl -X POST http://localhost:8000/embed \
  -H "X-API-Key: changeme" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mini",
    "input": ["Document 1", "Document 2", "Document 3"]
  }'
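The same endpoints work from any HTTP client. A minimal Python sketch using the requests library (the response schema is not documented in this README, so the raw JSON is printed as-is):

import requests

BASE = "http://localhost:8000"
HEADERS = {"X-API-Key": "changeme", "Content-Type": "application/json"}

# Single text
resp = requests.post(f"{BASE}/embed", headers=HEADERS,
                     json={"model": "mini", "input": "Hello world"})
resp.raise_for_status()
print(resp.json())

# Native batch: pass a list of texts instead of a single string
resp = requests.post(f"{BASE}/embed", headers=HEADERS,
                     json={"model": "mini",
                           "input": ["Document 1", "Document 2", "Document 3"]})
print(resp.json())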

3. Smart Chunking

Split long text into chunks (e.g., 256 tokens) and embed each.

curl -X POST http://localhost:8000/embed/chunk \
  -H "X-API-Key: changeme" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mini",
    "input": "Very long text content...",
    "method": "token",
    "size": 256,
    "overlap": 20
  }'
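Conceptually, token chunking with overlap is a sliding window over the tokenized text. A minimal sketch of the idea (illustrative only, not the server's implementation; whitespace splitting stands in for a real tokenizer):

def chunk_tokens(text: str, size: int = 256, overlap: int = 20) -> list[str]:
    # Slide a window of `size` tokens, stepping by `size - overlap`
    # so consecutive chunks share `overlap` tokens of context.
    tokens = text.split()  # stand-in for a model tokenizer
    step = size - overlap
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), step)]

chunks = chunk_tokens("Very long text content ...", size=256, overlap=20)
# The server embeds each chunk individually and returns one vector per chunk.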

4. OpenAI Compatibility

Works with standard OpenAI libraries.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="changeme"
)

response = client.embeddings.create(
    model="mini",
    input="Hello from OpenAI client!"
)
print(response.data[0].embedding)
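Frameworks that speak the OpenAI protocol work the same way. For example, with LangChain (assuming the langchain-openai package; check_embedding_ctx_length=False makes the client send raw strings rather than pre-tokenized input):

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="mini",
    base_url="http://localhost:8000/v1",
    api_key="changeme",
    check_embedding_ctx_length=False,  # send raw text, not tiktoken IDs
)

print(embeddings.embed_query("Hello from LangChain!")[:5])
doc_vectors = embeddings.embed_documents(["Document 1", "Document 2"])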

Documentation

For a comprehensive guide on configuration, advanced features, and troubleshooting, please refer to the User Guide.

Example Scripts

The repository includes several ready-to-run examples to help you get started:

  • example_client.py: A complete demonstration of the HTTP API, including basic embedding, batching, smart chunking, and admin operations.
  • example_grpc_client.py: Shows how to interact with the high-performance gRPC interface (Unary, Streaming, and Chunking).
  • example_openai.py: Demonstrates how to use the standard openai Python library to communicate with this server.

Development

Running Tests

The project uses pytest for testing. This includes unit tests and documentation integrity checks.

# Run all tests
pytest

# Verify documentation examples match implementation
pytest tests/test_docs_integrity.py

Linting

We use ruff for code quality.

pip install ruff
ruff check .

Contributing

  1. Fork the repository.
  2. Create a feature branch (git checkout -b feature/amazing-feature).
  3. Commit your changes (git commit -m 'Add amazing feature').
  4. Run tests to ensure no regressions.
  5. Push to the branch (git push origin feature/amazing-feature).
  6. Open a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
