Omnispan

Omnispan is a tiny Token Factory perf lab.

Current state:

engine/: Rust direct-path engine
worker/: Python model worker with pluggable backends
proto/: shared gRPC contract
bench/: benchmark scripts and artifacts
docs/: design and planning notes

Modes

The current implementation supports direct, queued, and micro-batch modes:

client -> Rust engine
engine -> Python worker over gRPC
worker -> MLX or vLLM model runtime

Queued mode adds explicit in-engine queue ownership but still executes one request at a time. Micro-batch mode adds a short batching window and groups pending requests before dispatching them to the worker batch path.

Important:

Treat direct mode as debug-only.
Under concurrent direct load, the current Python MLX worker has crashed in native code.
Use queued mode for any meaningful load test or benchmark until worker-side parallel safety is proven.

Prerequisites

Rust toolchain
python
Python environment with backend-specific worker dependencies installed
grpcurl for manual testing

Install local MLX worker dependencies:

python -m pip install -r worker/requirements.txt

Install vLLM worker dependencies on your Linux/CUDA box instead:

python -m pip install -r worker/requirements-vllm.txt

Worker Backends

The worker now supports:

WORKER_BACKEND=mlx
- default backend
- intended for Apple Silicon local development
- default model: mlx-community/Qwen2.5-7B-Instruct-4bit
WORKER_BACKEND=vllm
- intended for Linux + NVIDIA GPU environments such as Runpod
- default model: Qwen/Qwen2.5-7B-Instruct
- useful env vars:
  - MODEL_ID
  - VLLM_TENSOR_PARALLEL_SIZE
  - VLLM_GPU_MEMORY_UTILIZATION
  - VLLM_MAX_MODEL_LEN
  - VLLM_TRUST_REMOTE_CODE
  - VLLM_ENFORCE_EAGER
  - VLLM_ENABLE_PREFIX_CACHING
  - WORKER_DEBUG_BATCH_LOGGING
  - VLLM_DTYPE
  - VLLM_QUANTIZATION

Run

Start the local MLX worker:

python worker/worker.py

Start a vLLM worker on Runpod/Linux:

WORKER_BACKEND=vllm \
MODEL_ID=Qwen/Qwen2.5-7B-Instruct \
VLLM_GPU_MEMORY_UTILIZATION=0.9 \
python worker/worker.py

Example for an AWQ model on a single NVIDIA GPU:

WORKER_BACKEND=vllm \
MODEL_ID=Qwen/Qwen3-32B-AWQ \
VLLM_QUANTIZATION=AWQ \
VLLM_GPU_MEMORY_UTILIZATION=0.85 \
VLLM_MAX_MODEL_LEN=4096 \
VLLM_ENFORCE_EAGER=1 \
VLLM_ENABLE_PREFIX_CACHING=1 \
python worker/worker.py

Start the engine in a second terminal:

cd engine
ENGINE_MODE=direct WORKER_ENDPOINT=http://127.0.0.1:50071 cargo run --bin omnispan-engine

Use direct mode only for single-request debugging.

Run queued mode instead:

cd engine
ENGINE_MODE=queued WORKER_ENDPOINT=http://127.0.0.1:50071 cargo run --bin omnispan-engine

Use queued mode for benchmarks and concurrent tests.

Run micro-batch mode:

cd engine
ENGINE_MODE=micro_batch WORKER_ENDPOINT=http://127.0.0.1:50071 BATCH_WINDOW_MS=20 MAX_BATCH_SIZE=4 cargo run --bin omnispan-engine

Submit a request with grpcurl from the repo root:

grpcurl -plaintext -import-path ./proto -proto omnispan.proto \
  -d '{"tenant_id":"shared-basic","prompt":"Explain transformer attention in 3 sentences.","max_tokens":150}' \
  127.0.0.1:50061 omnispan.Engine/SubmitGenerate

Fetch engine stats:

grpcurl -plaintext -import-path ./proto -proto omnispan.proto \
  -d '{}' \
  127.0.0.1:50061 omnispan.Engine/GetEngineStats

Regenerate Python gRPC Stubs

If proto/omnispan.proto changes:

python -m grpc_tools.protoc \
  -I proto \
  --python_out=worker/generated \
  --grpc_python_out=worker/generated \
  proto/omnispan.proto

Notes

The worker must run in a Python environment that has the dependencies for the selected backend installed.
The engine auto-generates a request ID if the client omits one.
Concurrent direct mode has triggered Python worker segmentation faults in the current MLX runtime path.
BATCH_WINDOW_MS controls how long the engine waits to gather additional requests in micro_batch mode.
MAX_BATCH_SIZE controls how many pending requests are grouped into one worker batch.
The worker gRPC contract is unchanged across backends. The backend switch is entirely inside worker/.
Worker startup now fails loudly if WORKER_HOST:WORKER_PORT is already occupied, which helps catch stale worker processes instead of silently hitting the wrong instance.
Benchmark artifacts from the earlier FastAPI prototype are in bench/.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Omnispan

Modes

Prerequisites

Worker Backends

Run

Regenerate Python gRPC Stubs

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
bench		bench
docs		docs
engine		engine
proto		proto
worker		worker
.gitignore		.gitignore
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Omnispan

Modes

Prerequisites

Worker Backends

Run

Regenerate Python gRPC Stubs

Notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages