This document shows the actual message flow between components for each
major scenario. Message types reference the structs in the
modelrelay-protocol crate (ServerToWorkerMessage / WorkerToServerMessage).
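For orientation, here is a minimal sketch of how the worker-side enum might look with serde's internally tagged representation, with fields inferred from the JSON payloads in the diagrams below; the crate's actual definitions and derives may differ:

```rust
use serde::{Deserialize, Serialize};

// A sketch of the wire format inferred from the JSON payloads in the
// diagrams below, not the crate's exact definition.
#[derive(Debug, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum WorkerToServerMessage {
    Register {
        worker_name: String,
        models: Vec<String>,
        max_concurrent: u32,
        protocol_version: String,
        current_load: u32,
    },
    Pong {
        current_load: u32,
        timestamp_unix_ms: u64,
    },
    // models_update, response_chunk, response_complete, error ...
}

fn main() {
    let msg = WorkerToServerMessage::Register {
        worker_name: "gpu-box-1".into(),
        models: vec!["llama3-8b".into()],
        max_concurrent: 4,
        protocol_version: "1".into(),
        current_load: 0,
    };
    // Prints {"type":"register","worker_name":"gpu-box-1",...}
    println!("{}", serde_json::to_string(&msg).unwrap());
}
```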
Worker Proxy Server
│ │
│──── WebSocket UPGRADE ────────►│ GET /v1/worker/connect
│◄─── 101 Switching Protocols ──│
│ │
│ WorkerToServerMessage::Register
│ { │
│ "type": "register", │
│ "worker_name": "gpu-box-1", │
│ "models": ["llama3-8b"], │
│ "max_concurrent": 4, │
│ "protocol_version": "1", │
│ "current_load": 0 │
│ } │
│──────────────────────────────►│ Proxy validates worker_secret
│ │ (passed as query param or header
│ │ during WebSocket upgrade)
│ │
│ ServerToWorkerMessage::RegisterAck
│ { │
│ "type": "register_ack", │
│ "worker_id": "w-a1b2c3", │
│ "models": ["llama3-8b"], │
│ "protocol_version": "1" │
│ } │
│◄──────────────────────────────│
│ │
│ ┌──── heartbeat loop (HEARTBEAT_INTERVAL) ────┐
│ │ │ │
│ ServerToWorkerMessage::Ping │ │
│ { "type": "ping", │ │
│ "timestamp_unix_ms": ... } │ │
│◄──────────────────────────────│ │
│ │ │
│ WorkerToServerMessage::Pong │ │
│ { "type": "pong", │ │
│ "current_load": 1, │ │
│ "timestamp_unix_ms": ... } │ │
│──────────────────────────────►│ Proxy updates load │
│ └─────────────────────────────────────────────┘
If the worker misses heartbeats, the proxy closes the WebSocket with
reason "worker heartbeat timed out" and requeues any in-flight requests
(up to MAX_REQUEUE_COUNT = 3 retries).
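A sketch of the proxy-side liveness check, assuming the proxy records the arrival time of each Pong; the interval value and miss threshold below are illustrative, since only the HEARTBEAT_INTERVAL name appears above:

```rust
use std::time::{Duration, Instant};

// Illustrative values; the document names HEARTBEAT_INTERVAL but not
// its value or the number of tolerated misses.
const HEARTBEAT_INTERVAL: Duration = Duration::from_secs(10);
const MISSED_PINGS_BEFORE_TIMEOUT: u32 = 3;

/// True once the worker has gone silent for several intervals, after
/// which the proxy closes the WebSocket and requeues in-flight requests.
fn heartbeat_expired(last_pong: Instant) -> bool {
    last_pong.elapsed() > HEARTBEAT_INTERVAL * MISSED_PINGS_BEFORE_TIMEOUT
}
```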
Client Proxy Server Worker Backend
│ │ │ │
│ POST /v1/chat/completions │ │
│ {"model":"llama3-8b", │ │ │
│ "stream": false, │ │ │
│ "messages":[...]} │ │ │
│───────────────────────►│ │ │
│ │ │ │
│ │ Queue lookup: │ │
│ │ provider="local", │ │
│ │ model="llama3-8b" │ │
│ │ → worker "gpu-box-1" │ │
│ │ has capacity │ │
│ │ │ │
│ │ ServerToWorkerMessage::Request │
│ │ { "type": "request", │ │
│ │ "request_id": "r-001",│ │
│ │ "model": "llama3-8b", │ │
│ │ "endpoint_path": │ │
│ │ "/v1/chat/completions", │
│ │ "is_streaming": false, │ │
│ │ "body": "{...}", │ │
│ │ "headers": {...} } │ │
│ │────────────────────────►│ │
│ │ │ │
│ │ │ POST /v1/chat/completions
│ │ │───────────────────►│
│ │ │ │
│ │ │◄───────────────────│
│ │ │ 200 OK + JSON body│
│ │ │ │
│ │ WorkerToServerMessage::ResponseComplete │
│ │ { "type": "response_complete", │
│ │ "request_id": "r-001",│ │
│ │ "status_code": 200, │ │
│ │ "headers": {"content-type": │
│ │ "application/json"},│ │
│ │ "body": "{...}", │ │
│ │ "token_counts": { │ │
│ │ "prompt_tokens": 42,│ │
│ │ "completion_tokens": 128, │
│ │ "total_tokens": 170 │ │
│ │ } │ │
│ │ } │ │
│ │◄────────────────────────│ │
│ │ │ │
│◄───────────────────────│ 200 OK │ │
│ {"choices":[...]} │ (body forwarded) │ │
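A sketch of the "queue lookup" step above. Least-loaded tie-breaking and the WorkerState fields are assumptions; the flow only requires that the chosen worker serve the model and have spare capacity:

```rust
struct WorkerState {
    worker_id: String,
    models: Vec<String>,
    max_concurrent: u32,
    current_load: u32,
}

/// Find a worker that serves the model and has capacity. Picking the
/// least-loaded candidate is an assumption, not documented behavior.
fn pick_worker<'a>(workers: &'a [WorkerState], model: &str) -> Option<&'a WorkerState> {
    workers
        .iter()
        .filter(|w| w.models.iter().any(|m| m.as_str() == model))
        .filter(|w| w.current_load < w.max_concurrent)
        .min_by_key(|w| w.current_load)
}
```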
Client Proxy Server Worker Backend
│ │ │ │
│ POST /v1/chat/completions │ │
│ {"stream": true, ...} │ │ │
│───────────────────────►│ │ │
│ │ │ │
│ │ Request dispatched │ │
│ │ (is_streaming: true) │ │
│ │────────────────────────►│ │
│ │ │ POST (stream=true)│
│ │ │───────────────────►│
│ │ │ │
│ │ │ SSE: data: {...} │
│ │ │◄───────────────────│
│ │ WorkerToServerMessage::ResponseChunk │
│ │ { "type": "response_chunk", │
│ │ "request_id": "r-002",│ │
│ │ "chunk": "data: {\"choices\":[...]}\n\n" │
│ │ } │ │
│ │◄────────────────────────│ │
│ SSE: data: {...} │ │ │
│◄───────────────────────│ │ │
│ │ │ │
│ ...more chunks... │ ...more ResponseChunk..│ ...more SSE... │
│ │ │ │
│ │ │ SSE: data: [DONE] │
│ │ │◄───────────────────│
│ │ ResponseChunk (final) │ │
│ │◄────────────────────────│ │
│ SSE: data: [DONE] │ │ │
│◄───────────────────────│ │ │
│ │ │ │
│ │ WorkerToServerMessage::ResponseComplete │
│ │ { "type": "response_complete", │
│ │ "request_id": "r-002",│ │
│ │ "status_code": 200, │ │
│ │ "token_counts": {...} │ │
│ │ } │ │
│ │◄────────────────────────│ │
Chunks are forwarded byte-for-byte without re-parsing: the proxy writes
each ResponseChunk.chunk directly to the HTTP response body, so the SSE
framing reaches the client exactly as the backend produced it.
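A sketch of that forwarding step, assuming the HTTP response body is backed by a channel of byte buffers (as with a hyper/axum streaming body); the channel type is an assumption:

```rust
use tokio::sync::mpsc::Sender;

/// Write one ResponseChunk payload straight into the response body.
/// No SSE parsing, no re-framing: `chunk` already carries its own
/// "data: ...\n\n" framing.
async fn forward_chunk(body_tx: &Sender<bytes::Bytes>, chunk: String) {
    let _ = body_tx.send(bytes::Bytes::from(chunk)).await;
}
```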
Client Proxy Server Worker Backend
│ │ │ │
│ POST /v1/chat/completions (streaming) │ │
│───────────────────────►│ │ │
│ │ → dispatched to worker │ │
│ │────────────────────────►│ │
│ │ │───────────────────►│
│ (receiving chunks...) │ │ │
│◄───────────────────────│ │ │
│ │ │ │
│ CLIENT DISCONNECTS │ │ │
│──── TCP RST / close ──►│ │ │
│ │ │ │
│ │ Proxy detects drop │ │
│ │ │ │
│ │ ServerToWorkerMessage::Cancel │
│ │ { "type": "cancel", │ │
│ │ "request_id": "r-002",│ │
│ │ "reason": │ │
│ │ "client_disconnect" │ │
│ │ } │ │
│ │────────────────────────►│ │
│ │ │ │
│ │ │ Worker aborts │
│ │ │ backend request │
│ │ │───── abort ────────►│
│ │ │ │
Cancel reasons (from the CancelReason enum):

- client_disconnect: HTTP client dropped the connection
- timeout: request exceeded REQUEST_TIMEOUT_SECS
- graceful_shutdown: server is shutting down
- worker_disconnect: worker WebSocket closed unexpectedly
- requeue_exhausted: max requeue attempts (MAX_REQUEUE_COUNT = 3) exceeded
- server_shutdown: server process is terminating
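A sketch of how CancelReason might be declared so the variants serialize to the strings above; the actual derives in the crate may differ:

```rust
use serde::{Deserialize, Serialize};

// Variants inferred from the reason list above; snake_case renaming
// yields the wire strings such as "client_disconnect".
#[derive(Debug, Clone, Copy, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum CancelReason {
    ClientDisconnect,
    Timeout,
    GracefulShutdown,
    WorkerDisconnect,
    RequeueExhausted,
    ServerShutdown,
}
```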
Client Proxy Server Worker Backend
│ │ │ │
│ POST /v1/chat/completions │ │
│───────────────────────►│ │ │
│ │ → dispatched to worker │ │
│ │────────────────────────►│ │
│ │ │ │
│ │ WORKER CRASHES │ │
│ │ (WebSocket closes) │ │
│ │◄─── close frame / EOF ──│ │
│ │ × │
│ │ │ │
│ │ requeue_count < MAX_REQUEUE_COUNT (3)? │
│ │ YES → put request back in queue │
│ │ │ │
│ │ ...time passes, another worker available... │
│ │ │ │
│ │ Worker-2 │ │
│ │ ServerToWorkerMessage::Request │
│ │────────────────────────►│ Worker-2 │
│ │ │───────────────────►│
│ │ │◄───────────────────│
│ │ ResponseComplete │ │
│ │◄────────────────────────│ │
│◄───────────────────────│ 200 OK │ │
│ │ │ │
If requeue_count >= MAX_REQUEUE_COUNT (3):
│ │ │
│◄───────────────────────│ 503 Service Unavailable │
│ {"error": "requeue │ Cancel with reason: │
│ attempts exhausted"} │ "requeue_exhausted" │
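A sketch of that requeue decision; MAX_REQUEUE_COUNT comes from the document, while the surrounding types are illustrative:

```rust
const MAX_REQUEUE_COUNT: u32 = 3;

enum RequeueOutcome {
    /// Put the request back in the queue for another worker.
    Requeued,
    /// Give up: respond 503 and record reason "requeue_exhausted".
    Exhausted,
}

/// Decision made when a worker's WebSocket drops while it still holds
/// this request.
fn on_worker_lost(requeue_count: &mut u32) -> RequeueOutcome {
    if *requeue_count < MAX_REQUEUE_COUNT {
        *requeue_count += 1;
        RequeueOutcome::Requeued
    } else {
        RequeueOutcome::Exhausted
    }
}
```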
Client Proxy Server
│ │
│ POST /v1/chat/completions
│───────────────────────►│
│ │
│ │ Queue length >= max_queue_len
│ │ (configured via MAX_QUEUE_LEN,
│ │ default: 100)
│ │
│◄───────────────────────│ 429 Too Many Requests
│ {"error": │
│ "queue full"} │
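A sketch of the admission check, assuming a simple in-memory queue; rejecting before enqueueing lets the proxy answer 429 immediately:

```rust
use std::collections::VecDeque;

// Default from the document; overridable via MAX_QUEUE_LEN.
const MAX_QUEUE_LEN: usize = 100;

/// Reject when the queue is full so the caller can respond
/// 429 Too Many Requests without ever holding the request.
fn try_enqueue<T>(queue: &mut VecDeque<T>, request: T) -> Result<(), T> {
    if queue.len() >= MAX_QUEUE_LEN {
        Err(request)
    } else {
        queue.push_back(request);
        Ok(())
    }
}
```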
Client Proxy Server
│ │
│ POST /v1/chat/completions
│ {"model": "llama3-8b"}│
│───────────────────────►│
│ │
│ │ No provider registered
│ │ for model "llama3-8b",
│ │ or no workers connected
│ │
│ │ If a provider exists but
│ │ no workers: request is queued
│ │ (will timeout after
│ │ QUEUE_TIMEOUT_SECS = 30)
│ │
│ │ If no provider matches at all:
│◄───────────────────────│ 404 Not Found
│ {"error": │
│ "no provider for │
│ model llama3-8b"} │
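A sketch of that first routing decision; the provider registry is modeled here as a plain set of model names, which is an assumption:

```rust
use std::collections::HashSet;

enum RouteOutcome {
    /// A provider serves this model: enqueue (queue timeout applies).
    Enqueue,
    /// Nothing serves this model: answer 404 immediately.
    NoProvider,
}

/// Distinguish "no provider at all" (404) from "provider exists but no
/// worker is free" (queue and wait).
fn route(model: &str, registered_models: &HashSet<String>) -> RouteOutcome {
    if registered_models.contains(model) {
        RouteOutcome::Enqueue
    } else {
        RouteOutcome::NoProvider
    }
}
```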
Proxy Server Worker
│ │
│ (admin triggers drain │
│ or server shutting down)│
│ │
│ ServerToWorkerMessage::GracefulShutdown
│ { "type": "graceful_shutdown",
│ "reason": "maintenance",
│ "drain_timeout_secs": 30
│ }
│──────────────────────────►│
│ │
│ Worker marked is_draining│
│ No new requests sent │
│ │
│ Worker finishes in-flight│
│ requests normally... │
│ │
│ ResponseComplete(s) │
│◄──────────────────────────│
│ │
│ disconnect_drained_worker_if_idle():
│ all in-flight done? │
│ YES → close WebSocket │
│──── close frame ─────────►│
│ ×
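A sketch of the idle check behind disconnect_drained_worker_if_idle(); the field names are assumptions:

```rust
struct WorkerConn {
    is_draining: bool,
    in_flight: usize,
}

/// A draining worker's socket is closed only once its last in-flight
/// request has completed; until then it keeps working but receives no
/// new requests.
fn should_disconnect(worker: &WorkerConn) -> bool {
    worker.is_draining && worker.in_flight == 0
}
```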
Worker Proxy Server
│ │
│ (new model loaded or │
│ model removed locally) │
│ │
│ WorkerToServerMessage::ModelsUpdate
│ { "type": "models_update",
│ "models": ["llama3-8b",
│ "codellama-13b"],
│ "current_load": 1
│ }
│──────────────────────────►│
│ │ Proxy updates worker's
│ │ model list and routing
│ │
│ (or server requests it) │
│ │
│ ServerToWorkerMessage::ModelsRefresh
│ { "type": "models_refresh",
│ "reason": "periodic"
│ }
│◄──────────────────────────│
│ │
│ ModelsUpdate (response) │
│──────────────────────────►│
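A sketch of the worker side of this exchange, assuming a tokio-tungstenite socket; the helper function itself is illustrative, not crate API:

```rust
use futures_util::{Sink, SinkExt};
use tokio_tungstenite::tungstenite::{Error, Message};

/// Push a models_update over an established WebSocket; the payload
/// mirrors the JSON shown above.
async fn send_models_update<S>(ws: &mut S, models: &[&str], current_load: u32) -> Result<(), Error>
where
    S: Sink<Message, Error = Error> + Unpin,
{
    let msg = serde_json::json!({
        "type": "models_update",
        "models": models,
        "current_load": current_load,
    });
    ws.send(Message::text(msg.to_string())).await
}
```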
Client Proxy Server
│ │
│ POST /v1/chat/completions
│ {"model": "llama3-8b"}│
│───────────────────────►│
│ │
│ │ Provider exists but all
│ │ workers are busy (at
│ │ max_concurrent).
│ │ Request enters queue.
│ │
│ │ ┌── QUEUE_TIMEOUT_SECS (30) ──┐
│ │ │ waiting for a worker to │
│ │ │ become available... │
│ │ │ │
│ │ │ no worker picks up │
│ │ └──────────── timeout fires ───┘
│ │
│ │ Cancel with reason: "timeout"
│ │
│◄───────────────────────│ 504 Gateway Timeout
│ {"error": │
│ "queue timeout: │
│ no worker available │
│ within deadline"} │
The request never reaches a worker: the proxy removes it from the queue
and returns 504 to the client. No Cancel message is sent over the
WebSocket because no worker was ever assigned.
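A sketch of the queue wait, assuming each queued request holds a oneshot receiver that resolves once a worker picks it up; the channel shape is an assumption:

```rust
use tokio::sync::oneshot;
use tokio::time::{timeout, Duration};

const QUEUE_TIMEOUT_SECS: u64 = 30;

/// Wait for a worker assignment, bounded by the queue deadline.
async fn wait_for_worker(rx: oneshot::Receiver<String>) -> Result<String, &'static str> {
    match timeout(Duration::from_secs(QUEUE_TIMEOUT_SECS), rx).await {
        Ok(Ok(worker_id)) => Ok(worker_id),
        // Deadline elapsed or queue entry dropped: remove from the
        // queue and answer 504; no Cancel goes out.
        _ => Err("queue timeout: no worker available within deadline"),
    }
}
```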
Client Proxy Server Worker Backend
│ │ │ │
│ POST /v1/chat/completions │ │
│───────────────────────►│ │ │
│ │ → dispatched to worker │ │
│ │────────────────────────►│ │
│ │ │───────────────────►│
│ │ │ │
│ │ ┌── REQUEST_TIMEOUT_SECS (300) ──┐ │
│ │ │ waiting for ResponseComplete │ │
│ │ │ or streaming chunks... │ │
│ │ │ │ │
│ │ │ backend is still processing │ │
│ │ └──────────── timeout fires ──────┘ │
│ │ │ │
│ │ ServerToWorkerMessage::Cancel │
│ │ { "type": "cancel", │ │
│ │ "request_id": "r-003",│ │
│ │ "reason": "timeout" │ │
│ │ } │ │
│ │────────────────────────►│ │
│ │ │ │
│ │ │ Worker aborts │
│ │ │ backend request │
│ │ │───── abort ────────►│
│ │ │ │
│◄───────────────────────│ 504 Gateway Timeout │ │
│ {"error": │ │ │
│ "request timeout"} │ │ │
Unlike the queue timeout, the request was dispatched to a worker, so the
proxy sends a Cancel message with reason "timeout" over the WebSocket.
The worker receives the cancellation and aborts the in-flight backend
request, and the proxy returns 504 to the client.
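A sketch of that deadline on the proxy side; the completion channel and the outbound per-worker channel are illustrative types, while the Cancel payload mirrors the JSON above:

```rust
use tokio::sync::{mpsc, oneshot};
use tokio::time::{timeout, Duration};

const REQUEST_TIMEOUT_SECS: u64 = 300;

/// Wait for ResponseComplete, bounded by the request deadline.
async fn await_completion(
    request_id: &str,
    completion: oneshot::Receiver<u16>, // resolves with the status code
    to_worker: mpsc::Sender<String>,    // serialized messages to the worker
) -> u16 {
    match timeout(Duration::from_secs(REQUEST_TIMEOUT_SECS), completion).await {
        Ok(Ok(status)) => status,
        _ => {
            // A worker holds this request, so unlike the queue timeout
            // a Cancel with reason "timeout" goes out over the WebSocket.
            let cancel = serde_json::json!({
                "type": "cancel",
                "request_id": request_id,
                "reason": "timeout",
            });
            let _ = to_worker.send(cancel.to_string()).await;
            504 // Gateway Timeout to the client
        }
    }
}
```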
ServerToWorkerMessage (proxy → worker):

| Type | Struct | Purpose |
|---|---|---|
| register_ack | RegisterAck | Confirm registration, assign worker ID |
| request | RequestMessage | Dispatch an inference request |
| cancel | CancelMessage | Cancel an in-flight request |
| ping | PingMessage | Heartbeat probe |
| graceful_shutdown | GracefulShutdownMessage | Begin drain sequence |
| models_refresh | ModelsRefreshMessage | Ask worker to re-report models |
WorkerToServerMessage (worker → proxy):

| Type | Struct | Purpose |
|---|---|---|
| register | RegisterMessage | Announce name, models, capacity |
| models_update | ModelsUpdateMessage | Update model list / load |
| response_chunk | ResponseChunkMessage | Forward a streaming chunk |
| response_complete | ResponseCompleteMessage | Signal request completion |
| pong | PongMessage | Heartbeat reply with load |
| error | ErrorMessage | Report a request-level error |