A high-performance, self-hosted, model-agnostic embedding service designed for LLM applications, RAG pipelines, and code intelligence tools. It serves as a drop-in replacement for OpenAI's embedding API while offering advanced features like native batching, smart chunking, and hardware acceleration.
- High Performance: Native batch processing and an async architecture for maximum throughput.
- Multi-Model Support: Run any Hugging Face model (e.g., MiniLM, BGE, E5) with dynamic loading.
- OpenAI Compatible: Works seamlessly with LangChain, LlamaIndex, and AutoGPT using standard OpenAI clients.
- Smart Chunking: Built-in text splitting (token/character) with overlap handling to fit model context windows.
- Caching: Integrated Redis and in-memory caching to eliminate redundant computation.
- Production Ready: API key/JWT authentication, rate limiting, and Prometheus metrics.
- Hardware Acceleration: Auto-detects NVIDIA CUDA or Apple MPS (Metal) for GPU inference.
- Dual Protocol: Exposes both HTTP (FastAPI) and gRPC interfaces.
```mermaid
graph LR
    Client[Client App] -->|HTTP/gRPC| Auth[Auth Layer]
    Auth --> Cache[Cache Layer]
    Cache -->|Miss| Batcher[Batch Manager]
    Batcher --> Model[Model Manager]
    Model -->|Inference| GPU[GPU/CPU]
    GPU -->|Vectors| Cache
    Cache -->|Vectors| Client
```
- Python 3.11 (Recommended)
- Docker & Docker Compose (Optional)
- Clone the repository:

  ```bash
  git clone https://github.com/abdullah85398/embedding-server.git
  cd embedding-server
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure the environment by copying the example file:

  ```bash
  cp .env.example .env
  ```

  Edit `.env` to set your API key or enable Redis if needed.
Run the server with a single command:

```bash
docker-compose up --build -d
```

The server will be available at http://localhost:8000.
Pull and run the latest pre-built image from GitHub Container Registry:
```bash
docker run -p 8000:8000 ghcr.io/abdullah85398/embedding-server:latest
```

The server will be available at http://localhost:8000.
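Configuration can be overridden via environment variables (see the table below); with Docker this uses the standard `-e` flag. For example, to set a custom API key (`API_KEY` is taken from the configuration table, the key value here is a placeholder):

```bash
docker run -p 8000:8000 -e API_KEY=your-secret-key \
  ghcr.io/abdullah85398/embedding-server:latest
```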
Start the server locally:
```bash
python main.py
```

- HTTP API: `http://localhost:8000`
- gRPC API: `localhost:50051`
- Swagger UI: `http://localhost:8000/docs`
| Variable | Default | Description |
|---|---|---|
| `AUTH_MODE` | `KEY` | Auth mode: `NONE`, `KEY`, or `JWT`. |
| `API_KEY` | `changeme` | Master API key for requests. |
| `JWT_SECRET` | `secret` | Secret for JWT signing (if mode is `JWT`). |
| `ENABLE_CACHE` | `True` | Enable result caching. |
| `REDIS_URL` | - | Redis connection string (uses in-memory cache if empty). |
| `MAX_INFLIGHT_REQUESTS` | `100` | Concurrency limit (semaphore). |
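For reference, an illustrative `.env` built from the defaults in the table above (standard `KEY=VALUE` dotenv format assumed; leaving `REDIS_URL` empty falls back to the in-memory cache):

```bash
AUTH_MODE=KEY
API_KEY=changeme
JWT_SECRET=secret
ENABLE_CACHE=True
REDIS_URL=
MAX_INFLIGHT_REQUESTS=100
```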
Define available models in `models.yaml`. The key is the alias used in API calls.

```yaml
models:
  mini:
    name: all-MiniLM-L6-v2
    preload: true
  bge:
    name: BAAI/bge-base-en-v1.5
    preload: true
```

Generate a vector for a text string:
```bash
curl -X POST http://localhost:8000/embed \
  -H "X-API-Key: changeme" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mini",
    "input": "Hello world"
  }'
```

Process multiple texts in parallel (optimized for GPU):
```bash
curl -X POST http://localhost:8000/embed \
  -H "X-API-Key: changeme" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mini",
    "input": ["Document 1", "Document 2", "Document 3"]
  }'
```
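The same batch request from Python with the `requests` library, as a minimal sketch: the endpoint, header, and payload mirror the curl call above, and the response is printed as-is since this README does not pin down its exact schema.

```python
import requests

# POST a batch of texts to the /embed endpoint (see the curl example above).
resp = requests.post(
    "http://localhost:8000/embed",
    headers={"X-API-Key": "changeme"},
    json={"model": "mini", "input": ["Document 1", "Document 2", "Document 3"]},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # schema not specified here, so just inspect the payload
```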
Split long text into chunks (e.g., 256 tokens) and embed each:

```bash
curl -X POST http://localhost:8000/embed/chunk \
  -H "X-API-Key: changeme" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mini",
    "input": "Very long text content...",
    "method": "token",
    "size": 256,
    "overlap": 20
  }'
```
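To make the `size`/`overlap` parameters concrete, here is an illustrative sketch of overlapping token windows. This is a hypothetical helper, not the server's actual splitter, and whitespace "tokens" stand in for real tokenizer output:

```python
def chunk_tokens(tokens, size=256, overlap=20):
    """Slide a window of `size` tokens, advancing size - overlap each step,
    so consecutive chunks share `overlap` tokens. Illustration only."""
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

# Whitespace tokenization for demonstration purposes only:
words = ("Very long text content " * 10).split()
for chunk in chunk_tokens(words, size=8, overlap=2):
    print(" ".join(chunk))
```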
Works with standard OpenAI libraries:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="changeme",
)

response = client.embeddings.create(
    model="mini",
    input="Hello from OpenAI client!",
)
print(response.data[0].embedding)
```
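Because the server speaks the OpenAI embeddings protocol, higher-level frameworks can point at it as well. An illustrative sketch with the `langchain-openai` package (parameter names come from that library, not this project; `check_embedding_ctx_length=False` skips LangChain's client-side tiktoken pass, which assumes OpenAI-hosted models):

```python
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="mini",
    base_url="http://localhost:8000/v1",
    api_key="changeme",
    # Disable client-side token counting meant for OpenAI-hosted models.
    check_embedding_ctx_length=False,
)

vector = embeddings.embed_query("Hello from LangChain!")
print(len(vector))
```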
For a comprehensive guide on configuration, advanced features, and troubleshooting, please refer to the User Guide.

The repository includes several ready-to-run examples to help you get started:
- `example_client.py`: A complete demonstration of the HTTP API, including basic embedding, batching, smart chunking, and admin operations.
- `example_grpc_client.py`: Shows how to interact with the high-performance gRPC interface (unary, streaming, and chunking).
- `example_openai.py`: Demonstrates how to use the standard `openai` Python library to communicate with this server.
The project uses pytest for testing. This includes unit tests and documentation integrity checks.
```bash
# Run all tests
pytest

# Verify documentation examples match implementation
pytest tests/test_docs_integrity.py
```

We use ruff for code quality:
```bash
pip install ruff
ruff check .
```

- Fork the repository.
- Create a feature branch (`git checkout -b feature/amazing-feature`).
- Commit your changes (`git commit -m 'Add amazing feature'`).
- Run tests to ensure no regressions.
- Push to the branch (`git push origin feature/amazing-feature`).
- Open a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.