Omnispan is a tiny Token Factory perf lab.
Current state:
- engine/: Rust direct-path engine
- worker/: Python model worker with pluggable backends
- proto/: shared gRPC contract
- bench/: benchmark scripts and artifacts
- docs/: design and planning notes
The current implementation supports direct, queued, and micro-batch modes. All modes use the same request path:
- client -> Rust engine
- engine -> Python worker over gRPC
- worker -> MLX or vLLM model runtime
Queued mode adds explicit in-engine queue ownership but still executes one request at a time. Micro-batch mode adds a short batching window and groups pending requests before dispatching them to the worker batch path.
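The batching window described above can be sketched in a few lines of Python. This is a minimal illustration, not the engine's Rust code: the function name collect_batch and the use of a thread-safe queue are assumptions, and the defaults mirror the BATCH_WINDOW_MS=20 and MAX_BATCH_SIZE=4 values used in the micro-batch example in this README.

```python
import queue
import time


def collect_batch(pending, window_s=0.020, max_batch_size=4):
    """Hypothetical helper: wait for one request, then gather more until
    the batching window closes or the batch is full."""
    batch = [pending.get()]  # block until the first request arrives
    deadline = time.monotonic() + window_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window closed
        try:
            batch.append(pending.get(timeout=remaining))
        except queue.Empty:
            break  # no further arrivals before the window closed
    return batch


if __name__ == "__main__":
    pending = queue.Queue()
    for i in range(6):
        pending.put(f"req-{i}")
    print(collect_batch(pending))  # full batch, returned without waiting
    print(collect_batch(pending))  # partial batch, returned when the window closes
```

The window starts when the first request arrives, so an idle engine adds no latency until there is something to batch. A full batch is dispatched immediately; a partial batch waits at most one window.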
Important:
- Treat direct mode as debug-only.
- Under concurrent direct load, the current Python MLX worker has crashed in native code.
- Use queued mode for any meaningful load test or benchmark until worker-side parallel safety is proven.
Prerequisites:
- Rust toolchain
- python: Python environment with backend-specific worker dependencies installed
- grpcurl for manual testing
Install local MLX worker dependencies:
python -m pip install -r worker/requirements.txt
Install vLLM worker dependencies on your Linux/CUDA box instead:
python -m pip install -r worker/requirements-vllm.txt
The worker now supports:
- WORKER_BACKEND=mlx
  - default backend
  - intended for Apple Silicon local development
  - default model: mlx-community/Qwen2.5-7B-Instruct-4bit
- WORKER_BACKEND=vllm
  - intended for Linux + NVIDIA GPU environments such as Runpod
  - default model: Qwen/Qwen2.5-7B-Instruct
  - useful env vars:
    - MODEL_ID
    - VLLM_TENSOR_PARALLEL_SIZE
    - VLLM_GPU_MEMORY_UTILIZATION
    - VLLM_MAX_MODEL_LEN
    - VLLM_TRUST_REMOTE_CODE
    - VLLM_ENFORCE_EAGER
    - VLLM_ENABLE_PREFIX_CACHING
    - WORKER_DEBUG_BATCH_LOGGING
    - VLLM_DTYPE
    - VLLM_QUANTIZATION
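For orientation, the backend and model selection above might resolve roughly as follows. This is a sketch under assumptions, not the code in worker/: the WorkerConfig type and load_worker_config function are hypothetical, and whether the real worker validates the backend name or honors MODEL_ID for the MLX backend is not confirmed here. Only the variable names and default models come from the list above.

```python
import os
from dataclasses import dataclass

# Default model per backend, as listed in this README.
DEFAULT_MODELS = {
    "mlx": "mlx-community/Qwen2.5-7B-Instruct-4bit",
    "vllm": "Qwen/Qwen2.5-7B-Instruct",
}


@dataclass(frozen=True)
class WorkerConfig:  # hypothetical type, for illustration only
    backend: str
    model_id: str


def load_worker_config(env=None):
    """Resolve backend and model from environment-style variables."""
    env = os.environ if env is None else env
    # Assumption: mlx is used when WORKER_BACKEND is unset.
    backend = env.get("WORKER_BACKEND", "mlx").strip().lower()
    if backend not in DEFAULT_MODELS:
        raise ValueError(f"unsupported WORKER_BACKEND: {backend!r}")
    # Assumption: MODEL_ID overrides the default for either backend.
    model_id = env.get("MODEL_ID") or DEFAULT_MODELS[backend]
    return WorkerConfig(backend=backend, model_id=model_id)


if __name__ == "__main__":
    print(load_worker_config({"WORKER_BACKEND": "vllm"}))
```

Passing a plain dict instead of os.environ keeps the resolution logic easy to test without touching the process environment.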
Start the local MLX worker:
python worker/worker.py
Start a vLLM worker on Runpod/Linux:
WORKER_BACKEND=vllm \
MODEL_ID=Qwen/Qwen2.5-7B-Instruct \
VLLM_GPU_MEMORY_UTILIZATION=0.9 \
python worker/worker.py
Example for an AWQ model on a single NVIDIA GPU:
WORKER_BACKEND=vllm \
MODEL_ID=Qwen/Qwen3-32B-AWQ \
VLLM_QUANTIZATION=AWQ \
VLLM_GPU_MEMORY_UTILIZATION=0.85 \
VLLM_MAX_MODEL_LEN=4096 \
VLLM_ENFORCE_EAGER=1 \
VLLM_ENABLE_PREFIX_CACHING=1 \
python worker/worker.py
Start the engine in a second terminal:
cd engine
ENGINE_MODE=direct WORKER_ENDPOINT=http://127.0.0.1:50071 cargo run --bin omnispan-engine
Use direct mode only for single-request debugging.
Run queued mode instead:
cd engine
ENGINE_MODE=queued WORKER_ENDPOINT=http://127.0.0.1:50071 cargo run --bin omnispan-engine
Use queued mode for benchmarks and concurrent tests.
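Queued mode is safe under concurrent load because one owner drains the queue and executes a single request at a time. The Python sketch below illustrates that pattern only; the SerialQueue class is hypothetical and does not mirror the Rust engine's actual types.

```python
import queue
import threading


class SerialQueue:
    """Hypothetical single-consumer queue: many callers submit,
    one thread executes requests one at a time."""

    def __init__(self, handler):
        self._handler = handler
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, request):
        """Enqueue a request and block until its result is ready."""
        done = threading.Event()
        slot = {}
        self._jobs.put((request, slot, done))
        done.wait()
        if "error" in slot:
            raise slot["error"]
        return slot["result"]

    def _run(self):
        # Only this thread ever calls the handler, so calls never overlap.
        while True:
            request, slot, done = self._jobs.get()
            try:
                slot["result"] = self._handler(request)
            except Exception as exc:  # hand handler failures back to the caller
                slot["error"] = exc
            finally:
                done.set()


if __name__ == "__main__":
    serial = SerialQueue(lambda prompt: prompt.upper())
    print(serial.submit("hello"))
```

Callers can submit from any number of threads, but the handler (standing in for the worker call) never runs concurrently with itself, which is the property that avoids the concurrent-load crashes seen in direct mode.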
Run micro-batch mode:
cd engine
ENGINE_MODE=micro_batch WORKER_ENDPOINT=http://127.0.0.1:50071 BATCH_WINDOW_MS=20 MAX_BATCH_SIZE=4 cargo run --bin omnispan-engine
Submit a request with grpcurl from the repo root:
grpcurl -plaintext -import-path ./proto -proto omnispan.proto \
-d '{"tenant_id":"shared-basic","prompt":"Explain transformer attention in 3 sentences.","max_tokens":150}' \
127.0.0.1:50061 omnispan.Engine/SubmitGenerate
Fetch engine stats:
grpcurl -plaintext -import-path ./proto -proto omnispan.proto \
-d '{}' \
127.0.0.1:50061 omnispan.Engine/GetEngineStats
If proto/omnispan.proto changes, regenerate the Python worker stubs:
python -m grpc_tools.protoc \
-I proto \
--python_out=worker/generated \
--grpc_python_out=worker/generated \
proto/omnispan.proto
Notes:
- The worker must run in a Python environment that has the dependencies for the selected backend installed.
- The engine auto-generates a request ID if the client omits one.
- Concurrent direct mode has triggered Python worker segmentation faults in the current MLX runtime path.
- BATCH_WINDOW_MS controls how long the engine waits to gather additional requests in micro_batch mode.
- MAX_BATCH_SIZE controls how many pending requests are grouped into one worker batch.
- The worker gRPC contract is unchanged across backends. The backend switch is entirely inside worker/.
- Worker startup now fails loudly if WORKER_HOST:WORKER_PORT is already occupied, which helps catch stale worker processes instead of silently hitting the wrong instance.
- Benchmark artifacts from the earlier FastAPI prototype are in bench/.
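A preflight check like the one described for WORKER_HOST:WORKER_PORT can be approximated with a plain socket bind. This is a sketch of the general technique, not the worker's actual startup code; ensure_port_free is a hypothetical name.

```python
import socket


def ensure_port_free(host, port):
    """Hypothetical preflight: raise RuntimeError if host:port cannot be bound,
    which usually means another process is already listening there."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            raise RuntimeError(f"{host}:{port} is already in use: {exc}") from exc


if __name__ == "__main__":
    # Port 0 asks the OS for any free port, so this call should succeed.
    ensure_port_free("127.0.0.1", 0)
    print("port check passed")
```

The probe socket is closed before the real server binds, so there is a small race window. Treat it as a best-effort early failure for stale worker processes rather than a guarantee.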