Bug Description
WorkerService::create_worker (model_gateway/src/worker/service.rs:240) reserves a WorkerId against the submitted URL string before the AddWorker workflow runs, so it can return 202 Accepted synchronously with Location: /workers/{worker_id}. The 202 contract holds only when the workflow eventually registers the worker under the same URL string the reservation was keyed on. There are several real scenarios where it doesn't, and each one creates a permanently-orphaned entry in WorkerRegistry::url_to_id plus a worker_id in the 202 response that points at nothing.
Concretely:
- The reservation key and the registration key diverge.
CreateWorkerStep rewrites the URL via normalize_url (workflow/steps/local/create_worker.rs:300-313). Any input that does not already start with one of http://, https://, grpc://, grpcs:// gets a scheme prepended. The reservation entry under the original URL becomes orphaned; a fresh WorkerId is minted for the canonical URL during register_inner; the client polls the original worker_id and gets 404.
- The AddWorker job submission fails after a successful reservation (service.rs:269-272). The reservation is never reaped.
- The workflow itself fails downstream (detection times out, gRPC build fails, etc.). The reservation is never reaped.
The orphan is only a url_to_id → WorkerId row with no matching workers[WorkerId]. Each one is small (one String key + one 16-byte WorkerId), but the number of orphans is unbounded over the process lifetime. The client-visible 404 is the more painful symptom.
Adjacent context: PR #1523 added a resolve_url_to_id canonicalization helper that makes downstream URL lookups tolerate the orphan (used by K8s service-discovery removal), but it does not address creation or cleanup of the orphan — and it has no effect on the 404-on-poll wart.
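For readers without the source open, the rewrite described above can be sketched roughly as follows. This is an illustrative stand-in, not the code in create_worker.rs: the `ConnectionMode` enum shape and the choice of `http://` for HTTP-mode workers are assumptions; only the four recognized schemes and the grpc:// result for a gRPC worker come from this report.

```rust
/// Connection mode detected by the workflow (hypothetical shape).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ConnectionMode {
    Http,
    Grpc,
}

/// Schemes that the report says are left untouched.
const KNOWN_SCHEMES: [&str; 4] = ["http://", "https://", "grpc://", "grpcs://"];

/// Illustrative stand-in for normalize_url: schemed input is returned
/// unchanged; bare input gets a scheme chosen from the detected mode.
pub fn normalize_url(url: &str, mode: ConnectionMode) -> String {
    if KNOWN_SCHEMES.iter().any(|s| url.starts_with(*s)) {
        return url.to_string();
    }
    // Assumption: HTTP-mode workers get "http://"; the gRPC case matches
    // the timelines in this report.
    let scheme = match mode {
        ConnectionMode::Http => "http://",
        ConnectionMode::Grpc => "grpc://",
    };
    format!("{scheme}{url}")
}
```

The point of the sketch is that the output string differs from the input exactly when the input is schemeless, which is the only situation where the reservation key and the registration key can diverge through this path.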
Steps to Reproduce
1. Start SMG with default config.
2. curl -X POST http://localhost:<router-port>/workers \
-H 'Content-Type: application/json' \
-d '{"url":"10.0.0.5:8000"}' # bare host:port, NOT http://...
3. Note the worker_id and Location header in the 202 Accepted response.
4. Wait for the AddWorker workflow to complete (a few seconds for detection).
5. curl http://localhost:<router-port>/workers/<worker_id_from_step_3>
-> 404 Not Found
6. curl http://localhost:<router-port>/workers
-> the actual worker is registered, but under a DIFFERENT worker_id
that the client was never told about.
Same flow with {"url":"http://10.0.0.5:8000"} works correctly — the orphan is only created when normalize_url rewrites the input.
Expected Behavior
Either:
- the API rejects schemeless input at the boundary with 400 Bad Request, or
- the API accepts schemeless input but the 202 returns the worker_id the worker is actually registered under, and there is no orphan in url_to_id.
For the workflow- / submit-failure cases, the orphan should be reaped when the workflow terminates in failure or when the API returns an error to the client.
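The first option could look something like the check below, run before reserve_id_for_url. It is a minimal sketch under stated assumptions: `validate_worker_url` and `UrlError` are hypothetical names, and mapping the error to 400 Bad Request is left to the handler.

```rust
/// Hypothetical validation error; the handler would map it to 400.
#[derive(Debug, PartialEq, Eq)]
pub enum UrlError {
    MissingScheme,
    EmptyAuthority,
}

/// Accepts only URLs that already carry one of the schemes that
/// normalize_url leaves untouched, so the reservation key and the
/// registration key cannot diverge through scheme rewriting.
pub fn validate_worker_url(url: &str) -> Result<&str, UrlError> {
    const SCHEMES: [&str; 4] = ["http://", "https://", "grpc://", "grpcs://"];
    let rest = SCHEMES
        .iter()
        .find_map(|s| url.strip_prefix(*s))
        .ok_or(UrlError::MissingScheme)?;
    if rest.is_empty() {
        return Err(UrlError::EmptyAuthority);
    }
    Ok(url)
}
```

With this in place, `{"url":"10.0.0.5:8000"}` would fail fast with `MissingScheme` instead of producing a 202 whose worker_id never resolves.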
Actual Behavior
url_to_id["10.0.0.5:8000"] = reservedId (orphan; never reaped)
url_to_id["grpc://10.0.0.5:8000"] = liveId (the actual worker)
workers[liveId] = <Worker>
workers[reservedId] is absent
- HTTP client polling /workers/{reservedId} from the 202 response gets 404 Not Found forever.
- No log line surfaces the WorkerId divergence; the client has no way to discover liveId except by listing /workers and matching on URL, and the URL in the listing is grpc://10.0.0.5:8000, not the 10.0.0.5:8000 they submitted.
Component
model-gateway (core routing)
Routing Policy (if applicable)
N/A — this is an API-layer / registry-layer issue independent of routing policy.
Connection Mode
N/A — affects HTTP API path regardless of detected worker protocol.
Configuration
Not configuration-dependent. The bug is present in any deployment that exposes POST /workers (i.e. the default).
Logs / Error Output
No error is emitted by SMG for the URL-divergence case — the AddWorker workflow logs success, and the orphan reservation is silently leaked. The visible symptom is just the client-side 404 on the polled worker_id.
For the workflow-failure case, the workflow's failure logs do appear, but nothing in them indicates that a reservation was left behind, and no cleanup of the orphan is triggered.
Environment
Reproducible against lightseekorg/smg@main at the time of filing (post-merge of #1523). Code-level issue, build-environment agnostic.
OS: any (Linux / macOS / etc.)
cargo: any modern stable
rustc: any modern stable
Deployment Environment
Bare metal / VM, Kubernetes, Docker, local development — all affected equally; this is an API-contract bug.
Streaming Context
N/A — the bug is in the management API (POST /workers), not the inference path.
Additional Context
Detailed case-by-case walkthrough
What follows is the full behavior matrix the team worked through while investigating #1523, describing how main behaves today for each scenario. Each timeline starts at t=0 when the API handler is invoked (or when service_discovery acts, for K8s cases).
Case A — K8s pod is discovered, registered, deleted
The primary scenario PR #1523 fixed. Not affected by the orphan bug — service_discovery submits Job::AddWorker directly to the job queue and never calls WorkerService::create_worker, so no reservation is ever written.
t=0.000s service_discovery::worker_url("10.0.0.5", 8000) -> "10.0.0.5:8000" (bare, post-#1523)
t=0.001s Job::AddWorker { config { url: "10.0.0.5:8000" } } submitted directly to queue
t=0.05s Workflow: DetectConnectionMode probes the pod
t=2.30s Detection -> ConnectionMode::Grpc
t=2.31s CreateWorkerStep: normalize_url("10.0.0.5:8000", Grpc) = "grpc://10.0.0.5:8000"
t=3.50s RegisterWorkersStep -> register_or_replace
register_inner: url_to_id.entry("grpc://10.0.0.5:8000") VACANT -> mints liveId
url_to_id["grpc://10.0.0.5:8000"] = liveId
workers[liveId] = worker
t=...s Pod deleted -> Job::RemoveWorker { url: "10.0.0.5:8000" }
resolve_url_to_id("10.0.0.5:8000"):
exact "10.0.0.5:8000" -> miss
fallback "http://10.0.0.5:8000" -> miss
fallback "grpc://10.0.0.5:8000" -> liveId (is_live True) -> returned
Worker removed cleanly.
No reservation, no orphan, no 404. End state is consistent.
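The lookup order in the RemoveWorker step above can be approximated as shown below. This is a toy model rather than the helper from #1523: `ToyRegistry`, the `u64` WorkerId, and the `live` set standing in for the is_live check are all assumptions made for illustration.

```rust
use std::collections::{HashMap, HashSet};

/// Stand-in for the real WorkerId type.
pub type WorkerId = u64;

/// Toy registry: URL index plus the set of ids that have a live worker.
#[derive(Default)]
pub struct ToyRegistry {
    pub url_to_id: HashMap<String, WorkerId>,
    pub live: HashSet<WorkerId>,
}

impl ToyRegistry {
    /// Tries the exact URL first, then (for schemeless input) the
    /// http:// and grpc:// forms, returning the first id that is live.
    /// Entries whose id has no live worker are skipped.
    pub fn resolve_url_to_id(&self, url: &str) -> Option<WorkerId> {
        let mut candidates = vec![url.to_string()];
        if !url.contains("://") {
            candidates.push(format!("http://{url}"));
            candidates.push(format!("grpc://{url}"));
        }
        candidates
            .iter()
            .filter_map(|c| self.url_to_id.get(c).copied())
            .find(|id| self.live.contains(id))
    }
}
```

Skipping non-live entries is what lets lookups tolerate an orphaned reservation; it does nothing to remove the orphan itself.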
Case B — User POSTs {"url":"http://10.0.0.5:8000"} (already-schemed)
The intended happy path of POST /workers. Not affected by the orphan bug.
t=0.000s HTTP handler -> WorkerService::create_worker
t=0.001s reserve_id_for_url("http://10.0.0.5:8000")
-> url_to_id["http://10.0.0.5:8000"] = reservedId
t=0.002s Job::AddWorker { url: "http://10.0.0.5:8000" } submitted to queue
t=0.003s 202 Accepted, Location: /workers/{reservedId}, body worker_id = reservedId
t=0.05s Workflow: detect HTTP, normalize_url no-op (scheme already present)
t=3.50s RegisterWorkersStep -> register_or_replace(worker, url="http://10.0.0.5:8000")
register_inner: url_to_id.entry("http://10.0.0.5:8000") OCCUPIED with reservedId
-> reuses reservedId; workers[reservedId] = worker
t=...s Client polls /workers/{reservedId} -> 200 with worker info ✓
Same key on both sides of the reservation -> no orphan. worker_id in the 202 is the one the worker actually has.
Case C — User POSTs {"url":"10.0.0.5:8000"} (bare host:port) — THE BUG
t=0.000s HTTP handler -> WorkerService::create_worker
t=0.001s reserve_id_for_url("10.0.0.5:8000")
-> url_to_id["10.0.0.5:8000"] = reservedId
-> workers is unchanged (no worker at this id yet)
t=0.002s service.rs:258 conflict check: get(&reservedId).is_some() -> false -> no 409
t=0.003s Job::AddWorker { url: "10.0.0.5:8000" } submitted to queue
t=0.004s 202 Accepted, Location: /workers/{reservedId}, body worker_id = reservedId
── HTTP response done. Client now starts polling. ──
t=0.05s AddWorker workflow starts in background
t=0.05s DetectConnectionMode probes 10.0.0.5:8000
t=2.30s Detection -> ConnectionMode::Grpc
t=2.31s CreateWorkerStep:
normalize_url("10.0.0.5:8000", Grpc) = "grpc://10.0.0.5:8000"
worker = BasicWorkerBuilder::new("grpc://10.0.0.5:8000").build()
worker.url() = "grpc://10.0.0.5:8000" <-- NEW string, ≠ reservation key
t=3.50s RegisterWorkersStep -> register_or_replace(worker)
register_inner:
url_to_id.entry("grpc://10.0.0.5:8000") -> VACANT
(the existing reservation is keyed on "10.0.0.5:8000", NOT "grpc://...")
-> mints a NEW WorkerId (liveId, ≠ reservedId)
-> workers[liveId] = worker
-> url_to_id["grpc://10.0.0.5:8000"] = liveId
Final registry state at t=3.50s:
url_to_id["10.0.0.5:8000"] = reservedId ← orphan, points at nothing
url_to_id["grpc://10.0.0.5:8000"] = liveId ← live worker
workers[liveId] = <Worker>
workers[reservedId] = (absent)
What the client sees:
202 response body: worker_id = reservedId
Polling GET /workers/reservedId -> 404 Not Found (forever)
Polling GET /workers/liveId -> the worker exists, but the client doesn't know liveId
The orphan at url_to_id["10.0.0.5:8000"] is never reaped — the only url_to_id.remove(...) sites in the registry (registry.rs:920 inside remove(), and registry.rs:643 inside register_or_replace's error-recovery branch) both key on worker.url(), which is "grpc://10.0.0.5:8000", never "10.0.0.5:8000".
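The divergence can be reproduced in a few lines with a toy model of the two maps. Everything here is a simplification for illustration: `ToyRegistry`, the counter-based `u64` ids, and storing the worker as a plain URL string are assumptions, and `register` only mimics the vacant/occupied behavior attributed to register_inner in the timeline above.

```rust
use std::collections::HashMap;

/// Stand-in for the real WorkerId type.
pub type WorkerId = u64;

#[derive(Default)]
pub struct ToyRegistry {
    next_id: WorkerId,
    pub url_to_id: HashMap<String, WorkerId>,
    /// id -> worker URL (the real registry stores a worker object).
    pub workers: HashMap<WorkerId, String>,
}

impl ToyRegistry {
    /// Returns the id already keyed on `url`, or mints a new one.
    fn id_for(&mut self, url: &str) -> WorkerId {
        if let Some(&id) = self.url_to_id.get(url) {
            return id;
        }
        self.next_id += 1;
        let id = self.next_id;
        self.url_to_id.insert(url.to_string(), id);
        id
    }

    /// API side: reserve an id against the URL exactly as submitted.
    pub fn reserve_id_for_url(&mut self, url: &str) -> WorkerId {
        self.id_for(url)
    }

    /// Workflow side: register under the worker's (normalized) URL.
    pub fn register(&mut self, worker_url: &str) -> WorkerId {
        let id = self.id_for(worker_url);
        self.workers.insert(id, worker_url.to_string());
        id
    }

    /// URLs whose id has no worker behind it, sorted for stable output.
    pub fn orphans(&self) -> Vec<&str> {
        let mut urls: Vec<&str> = self
            .url_to_id
            .iter()
            .filter(|(_, id)| !self.workers.contains_key(*id))
            .map(|(url, _)| url.as_str())
            .collect();
        urls.sort();
        urls
    }
}
```

Reserving under "10.0.0.5:8000" and then registering under "grpc://10.0.0.5:8000" yields two different ids and leaves the bare URL in `orphans()`; reserving and registering under the same schemed URL (Case B) reuses the id and leaves no orphan.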
Case D — Mixed source: a human POSTs the same pod K8s is already managing
D1: POST schemed URL while K8s also discovers the same pod
t=0.000s service_discovery (K8s) registers the pod first:
url_to_id["grpc://10.0.0.5:8000"] = liveId
workers[liveId] = worker_k8s
t=1.000s User POSTs {"url":"http://10.0.0.5:8000"}
t=1.001s reserve_id_for_url("http://10.0.0.5:8000")
-> url_to_id["http://10.0.0.5:8000"] = newReservedId
(different key from K8s's "grpc://" entry, so this is a fresh insert)
t=1.002s service.rs:258 conflict check: get(&newReservedId).is_some() -> false -> no 409
t=1.003s Job::AddWorker { url: "http://10.0.0.5:8000" } submitted
t=1.004s 202 Accepted with worker_id = newReservedId
t=1.05s Workflow: detect HTTP, normalize_url no-op
t=4.50s RegisterWorkersStep -> register_or_replace(worker, url="http://10.0.0.5:8000")
register_inner: url_to_id.entry("http://10.0.0.5:8000") OCCUPIED with newReservedId
-> reuses newReservedId; workers[newReservedId] = worker_http
Result: two distinct workers exist for the same host:port under different schemes. The 202's worker_id is honest (polling it returns worker_http). No orphan. (Whether SMG should permit two workers on the same host:port is a separate design question; the registry has historically allowed it.)
D2: POST bare URL while K8s also discovers the same pod
Produces the same orphan + 404 as Case C. The K8s-side registration is unaffected and remains correct. The mixed-source aspect doesn't change the orphan dynamics; the orphan exists purely on the HTTP API side.
Case E — Workflow or job-submit fails after a successful reservation
E1: queue submit fails
t=0.000s HTTP handler -> WorkerService::create_worker
t=0.001s reserve_id_for_url("http://10.0.0.5:8000") -> reservedId ← orphan-to-be
t=0.002s Job::AddWorker submitted via job_queue.submit().await
t=0.003s submit() returns Err(...)
t=0.003s create_worker returns Err(QueueSubmitFailed). HTTP 500 to client.
End state:
url_to_id["http://10.0.0.5:8000"] = reservedId ← orphan, never reaped
workers[reservedId] = (absent)
The client got an error and probably won't poll, so the 404 wart is less visible — but the registry leak is real and unbounded.
E2: queue submit succeeds, workflow fails downstream
t=0.000s HTTP handler -> WorkerService::create_worker
t=0.001s reserve_id_for_url("http://10.0.0.5:8000") -> reservedId
t=0.002s Job::AddWorker submitted successfully
t=0.003s 202 Accepted with worker_id = reservedId
t=0.05s Workflow runs:
- DetectConnectionMode times out, OR
- DetectBackend fails, OR
- CreateWorker fails (e.g. invalid model labels), OR
- RegisterWorkers fails (rare, but possible under contention)
t=...s Workflow terminates in failure. register_or_replace never runs.
End state:
url_to_id["http://10.0.0.5:8000"] = reservedId ← orphan
workers[reservedId] = (absent)
Client polls /workers/{reservedId} -> 404 Not Found (correctly says "not found",
but gives no signal that the workflow itself failed).
E1 and E2 exist independently of input URL format — even a perfectly-schemed input creates an orphan when the downstream registration fails.
Summary table
| Scenario | Reservation? | Registration? | Orphan? | Client visible? |
| --- | --- | --- | --- | --- |
| A. K8s discover -> register -> delete | No | Yes | No | N/A |
| B. POST schemed URL, success | Yes | Yes (same key) | No (reused) | Correct 202 |
| C. POST bare URL, success | Yes | Yes (different key) | Yes | 404 on poll |
| D1. POST schemed + K8s same pod | Yes | Yes (same key) | No | Correct 202 (two workers exist) |
| D2. POST bare + K8s same pod | Yes | Yes (different key) | Yes | 404 on poll (K8s side is fine) |
| E1. Submit failure after reserve | Yes | No | Yes | 500, no poll |
| E2. Workflow failure after reserve | Yes | No | Yes | 404 on poll |
Possible fix directions
Two orthogonal pieces would close all the cases above:
1. Reject schemeless / unparsable URLs at the API boundary before reserve_id_for_url runs. Closes Case C and the URL-divergence half of D2. Cheap; preserves the existing reservation model.
2. Reservation lifecycle: add a release_reservation(url) -> bool primitive on WorkerRegistry that drops a url_to_id entry only when its WorkerId is not in workers (an orphan; never touch live workers). Wire it into:
   - WorkerService::create_worker's error-return paths (closes E1)
   - The AddWorker workflow's terminal-failure handler, called for config.url when the workflow ends in failure (closes E2)
   - Optionally, register_inner when it inserts under a key different from any prior reservation for the same caller (defense-in-depth against future divergence sources)
(2) is the strictly more correct fix; (1) is a cheap partial fix that closes the most user-visible case (Case C) without changing the registry's lifecycle model.
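A minimal sketch of the proposed release_reservation primitive follows, to make the "orphans only" rule concrete. The `ToyRegistry` struct, the `u64` WorkerId, and the plain `HashMap` fields are assumptions; the real WorkerRegistry presumably uses different (likely concurrent) containers, so locking and atomicity are out of scope here.

```rust
use std::collections::HashMap;

/// Stand-in for the real WorkerId type.
pub type WorkerId = u64;

#[derive(Default)]
pub struct ToyRegistry {
    pub url_to_id: HashMap<String, WorkerId>,
    /// id -> worker URL (the real registry stores a worker object).
    pub workers: HashMap<WorkerId, String>,
}

impl ToyRegistry {
    /// Drops the url -> id entry only if no worker exists under that id.
    /// Returns true when an orphaned reservation was removed, false when
    /// the URL is unknown or maps to a live worker.
    pub fn release_reservation(&mut self, url: &str) -> bool {
        let is_orphan = matches!(
            self.url_to_id.get(url),
            Some(id) if !self.workers.contains_key(id)
        );
        if is_orphan {
            self.url_to_id.remove(url);
        }
        is_orphan
    }
}
```

Returning a bool lets the callers listed above log whether a reservation was actually reaped. Because the check refuses to touch entries with a live worker, calling it for config.url after a failed workflow should be safe even if another request has since registered a worker under the same URL.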
Pre-submission Checklist