[Bug]: WorkerService::create_worker orphans url_to_id entries; clients get 202 with worker_id that 404s on poll #1533

@CatherineSue

Bug Description

WorkerService::create_worker (model_gateway/src/worker/service.rs:240) reserves a WorkerId against the submitted URL string before the AddWorker workflow runs, so it can return 202 Accepted synchronously with Location: /workers/{worker_id}. The 202 contract holds only when the workflow eventually registers the worker under the same URL string the reservation was keyed on. There are several real scenarios where it doesn't, and each one creates a permanently-orphaned entry in WorkerRegistry::url_to_id plus a worker_id in the 202 response that points at nothing.

Concretely:

  1. The reservation key and the registration key diverge. CreateWorkerStep rewrites the URL via normalize_url (workflow/steps/local/create_worker.rs:300-313). Any input that does not already start with one of http://, https://, grpc://, grpcs:// gets a scheme prepended. The reservation entry under the original URL becomes orphaned; a fresh WorkerId is minted for the canonical URL during register_inner; the client polls the original worker_id and gets 404.
  2. The AddWorker job submission fails after a successful reservation (service.rs:269-272). The reservation is never reaped.
  3. The workflow itself fails downstream (detection times out, gRPC build fails, etc.). The reservation is never reaped.

The orphan is just a url_to_id → WorkerId row with no matching workers[WorkerId] entry. Each one is small (one String key plus one 16-byte WorkerId), but the total grows without bound over the process lifetime. The client-visible 404 is the more painful symptom.
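To make the definition of an orphan concrete, here is a minimal sketch of an audit over the two maps. It is not code from the SMG registry: the `WorkerId` alias, the plain `HashMap`/`HashSet` types, and the `find_orphans` name are all assumptions chosen for illustration.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for the real WorkerId (the issue only says it is 16 bytes).
type WorkerId = u128;

/// Returns every URL key whose WorkerId has no live worker behind it,
/// i.e. a url_to_id row with no matching workers[WorkerId].
fn find_orphans(
    url_to_id: &HashMap<String, WorkerId>,
    live_ids: &HashSet<WorkerId>,
) -> Vec<String> {
    let mut orphans: Vec<String> = url_to_id
        .iter()
        .filter(|(_, id)| !live_ids.contains(*id))
        .map(|(url, _)| url.clone())
        .collect();
    orphans.sort(); // deterministic order for logging or tests
    orphans
}
```

Note that a reservation whose AddWorker workflow is still running looks identical to an orphan under this check, so an audit like this would only be meaningful for entries older than the workflow timeout.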

Adjacent context: PR #1523 added a resolve_url_to_id canonicalization helper that makes downstream URL lookups tolerate the orphan (used by K8s service-discovery removal), but it does not address creation or cleanup of the orphan — and it has no effect on the 404-on-poll wart.

Steps to Reproduce

1. Start SMG with default config.
2. curl -X POST http://localhost:<router-port>/workers \
        -H 'Content-Type: application/json' \
        -d '{"url":"10.0.0.5:8000"}'   # bare host:port, NOT http://...
3. Note the worker_id and Location header in the 202 Accepted response.
4. Wait for the AddWorker workflow to complete (a few seconds for detection).
5. curl http://localhost:<router-port>/workers/<worker_id_from_step_3>
   -> 404 Not Found
6. curl http://localhost:<router-port>/workers
   -> the actual worker is registered, but under a DIFFERENT worker_id
      that the client was never told about.

Same flow with {"url":"http://10.0.0.5:8000"} works correctly — the orphan is only created when normalize_url rewrites the input.

Expected Behavior

Either:

  • the API rejects schemeless input at the boundary with 400 Bad Request, or
  • the API accepts schemeless input but the 202 returns the worker_id the worker is actually registered under, and there is no orphan in url_to_id.

For the workflow- / submit-failure cases, the orphan should be reaped when the workflow terminates in failure or when the API returns an error to the client.

Actual Behavior

  • url_to_id["10.0.0.5:8000"] = reservedId (orphan; never reaped)
  • url_to_id["grpc://10.0.0.5:8000"] = liveId (the actual worker)
  • workers[liveId] = <Worker>
  • workers[reservedId] is absent
  • HTTP client polling /workers/{reservedId} from the 202 response gets 404 Not Found forever.
  • No log line surfaces the WorkerId divergence; the client has no way to discover liveId except by listing /workers and matching on URL — and the URL in the listing is grpc://10.0.0.5:8000, not the 10.0.0.5:8000 they submitted.

Component

model-gateway (core routing)

Routing Policy (if applicable)

N/A — this is an API-layer / registry-layer issue independent of routing policy.

Connection Mode

N/A — affects HTTP API path regardless of detected worker protocol.

Configuration

Not configuration-dependent. The bug is present in any deployment that exposes POST /workers (i.e. the default).

Logs / Error Output

No error is emitted by SMG for the URL-divergence case — the AddWorker workflow logs success, and the orphan reservation is silently leaked. The visible symptom is just the client-side 404 on the polled worker_id.

For the workflow-failure case, the workflow's failure logs do appear, but nothing in them mentions the leaked reservation, and no cleanup is triggered.

Environment

Reproducible against lightseekorg/smg@main at the time of filing (post-merge of #1523). Code-level issue, build-environment agnostic.

OS: any (Linux / macOS / etc.)
cargo: any modern stable
rustc: any modern stable

Deployment Environment

Bare metal / VM, Kubernetes, Docker, local development — all affected equally; this is an API-contract bug.

Streaming Context

N/A — the bug is in the management API (POST /workers), not the inference path.

Additional Context

Detailed case-by-case walkthrough

What follows is the full behavior matrix the team worked through while investigating #1523, describing how main behaves today for each scenario. Each timeline starts at t=0 when the API handler is invoked (or when service_discovery acts, for K8s cases).

Case A — K8s pod is discovered, registered, deleted

The primary scenario PR #1523 fixed. Not affected by the orphan bug: service_discovery submits Job::AddWorker directly to the job queue and never calls WorkerService::create_worker, so no reservation is ever written.

t=0.000s  service_discovery::worker_url("10.0.0.5", 8000) -> "10.0.0.5:8000"  (bare, post-#1523)
t=0.001s  Job::AddWorker { config { url: "10.0.0.5:8000" } } submitted directly to queue
t=0.05s   Workflow: DetectConnectionMode probes the pod
t=2.30s   Detection -> ConnectionMode::Grpc
t=2.31s   CreateWorkerStep: normalize_url("10.0.0.5:8000", Grpc) = "grpc://10.0.0.5:8000"
t=3.50s   RegisterWorkersStep -> register_or_replace
          register_inner: url_to_id.entry("grpc://10.0.0.5:8000") VACANT -> mints liveId
          url_to_id["grpc://10.0.0.5:8000"] = liveId
          workers[liveId] = worker
t=...s    Pod deleted -> Job::RemoveWorker { url: "10.0.0.5:8000" }
          resolve_url_to_id("10.0.0.5:8000"):
            exact "10.0.0.5:8000" -> miss
            fallback "http://10.0.0.5:8000" -> miss
            fallback "grpc://10.0.0.5:8000" -> liveId (is_live True) -> returned
          Worker removed cleanly.

No reservation, no orphan, no 404. End state is consistent.
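The fallback lookup in the removal step above can be sketched roughly as follows. This is an approximation reconstructed from the timeline only: the candidate order (exact, then http://, then grpc://) and the liveness check come from the trace, while the signature, the map types, and the `WorkerId` alias are assumptions. The real helper added in #1523 may try further schemes.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for the real WorkerId.
type WorkerId = u128;

/// Tolerant lookup: try the exact key, then scheme-prefixed candidates.
/// Only ids with a live worker are returned, so orphaned reservations are skipped.
fn resolve_url_to_id(
    url: &str,
    url_to_id: &HashMap<String, WorkerId>,
    live_ids: &HashSet<WorkerId>,
) -> Option<WorkerId> {
    let candidates = [
        url.to_string(),
        format!("http://{}", url),
        format!("grpc://{}", url),
    ];
    candidates
        .iter()
        .filter_map(|key| url_to_id.get(key).copied())
        .find(|id| live_ids.contains(id))
}
```

With the Case C end state (an orphan under the bare key and a live worker under the grpc:// key), this lookup skips the orphan and returns the live id, which is why removal keeps working even though the orphan itself is never cleaned up.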

Case B — User POSTs {"url":"http://10.0.0.5:8000"} (already-schemed)

The intended happy path of POST /workers. Not affected by the orphan bug.

t=0.000s  HTTP handler -> WorkerService::create_worker
t=0.001s  reserve_id_for_url("http://10.0.0.5:8000")
          -> url_to_id["http://10.0.0.5:8000"] = reservedId
t=0.002s  Job::AddWorker { url: "http://10.0.0.5:8000" } submitted to queue
t=0.003s  202 Accepted, Location: /workers/{reservedId}, body worker_id = reservedId
t=0.05s   Workflow: detect HTTP, normalize_url no-op (scheme already present)
t=3.50s   RegisterWorkersStep -> register_or_replace(worker, url="http://10.0.0.5:8000")
          register_inner: url_to_id.entry("http://10.0.0.5:8000") OCCUPIED with reservedId
          -> reuses reservedId; workers[reservedId] = worker
t=...s    Client polls /workers/{reservedId} -> 200 with worker info ✓

Same key on both sides of the reservation -> no orphan. worker_id in the 202 is the one the worker actually has.

Case C — User POSTs {"url":"10.0.0.5:8000"} (bare host:port) — THE BUG

t=0.000s  HTTP handler -> WorkerService::create_worker
t=0.001s  reserve_id_for_url("10.0.0.5:8000")
          -> url_to_id["10.0.0.5:8000"] = reservedId
          -> workers is unchanged (no worker at this id yet)
t=0.002s  service.rs:258 conflict check: get(&reservedId).is_some() -> false -> no 409
t=0.003s  Job::AddWorker { url: "10.0.0.5:8000" } submitted to queue
t=0.004s  202 Accepted, Location: /workers/{reservedId}, body worker_id = reservedId

          ── HTTP response done. Client now starts polling. ──

t=0.05s   AddWorker workflow starts in background
t=0.05s   DetectConnectionMode probes 10.0.0.5:8000
t=2.30s   Detection -> ConnectionMode::Grpc
t=2.31s   CreateWorkerStep:
          normalize_url("10.0.0.5:8000", Grpc) = "grpc://10.0.0.5:8000"
          worker = BasicWorkerBuilder::new("grpc://10.0.0.5:8000").build()
          worker.url() = "grpc://10.0.0.5:8000"   <-- NEW string, ≠ reservation key
t=3.50s   RegisterWorkersStep -> register_or_replace(worker)
          register_inner:
            url_to_id.entry("grpc://10.0.0.5:8000")  -> VACANT
            (the existing reservation is keyed on "10.0.0.5:8000", NOT "grpc://...")
            -> mints a NEW WorkerId (liveId, ≠ reservedId)
            -> workers[liveId] = worker
            -> url_to_id["grpc://10.0.0.5:8000"] = liveId

Final registry state at t=3.50s:
  url_to_id["10.0.0.5:8000"]        = reservedId   ← orphan, points at nothing
  url_to_id["grpc://10.0.0.5:8000"] = liveId       ← live worker
  workers[liveId]                   = <Worker>
  workers[reservedId]               = (absent)

What the client sees:
  202 response body: worker_id = reservedId
  Polling GET /workers/reservedId  -> 404 Not Found (forever)
  Polling GET /workers/liveId      -> the worker exists, but the client doesn't know liveId

The orphan at url_to_id["10.0.0.5:8000"] is never reaped — the only url_to_id.remove(...) sites in the registry (registry.rs:920 inside remove(), and registry.rs:643 inside register_or_replace's error-recovery branch) both key on worker.url(), which is "grpc://10.0.0.5:8000", never "10.0.0.5:8000".
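The divergence in Case C can be reproduced with a toy model of the registry. The sketch below is not the real `WorkerRegistry`: `ToyRegistry`, its fields, the `u64` id, and the boolean `grpc` flag are illustrative assumptions, and prepending http:// in the non-gRPC branch is also assumed (the trace above only shows the gRPC branch).

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the real WorkerId.
type WorkerId = u64;

#[derive(Default)]
struct ToyRegistry {
    next_id: WorkerId,
    url_to_id: HashMap<String, WorkerId>,
    workers: HashMap<WorkerId, String>, // id -> worker URL
}

impl ToyRegistry {
    /// Returns the id already stored under `url`, or mints and stores a new one.
    fn reserve_id_for_url(&mut self, url: &str) -> WorkerId {
        if let Some(&id) = self.url_to_id.get(url) {
            return id;
        }
        self.next_id += 1;
        self.url_to_id.insert(url.to_string(), self.next_id);
        self.next_id
    }

    /// Registers a worker under its own URL: reuses the id if the key is
    /// occupied, mints a fresh one if it is vacant.
    fn register(&mut self, worker_url: &str) -> WorkerId {
        let id = self.reserve_id_for_url(worker_url);
        self.workers.insert(id, worker_url.to_string());
        id
    }
}

/// Prepends a scheme only when the input does not already carry one.
fn normalize_url(url: &str, grpc: bool) -> String {
    const SCHEMES: [&str; 4] = ["http://", "https://", "grpc://", "grpcs://"];
    if SCHEMES.iter().any(|s| url.starts_with(*s)) {
        url.to_string()
    } else if grpc {
        format!("grpc://{}", url)
    } else {
        format!("http://{}", url)
    }
}
```

Reserving under "10.0.0.5:8000" and then registering under `normalize_url("10.0.0.5:8000", true)` leaves two `url_to_id` rows with different ids and only one entry in `workers`. Repeating the same sequence with "http://10.0.0.5:8000" reuses the reserved id, matching Case B.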

Case D — Mixed source: a human POSTs the same pod K8s is already managing

D1: POST schemed URL while K8s also discovers the same pod
t=0.000s  service_discovery (K8s) registers the pod first:
          url_to_id["grpc://10.0.0.5:8000"] = liveId
          workers[liveId]                   = worker_k8s

t=1.000s  User POSTs {"url":"http://10.0.0.5:8000"}
t=1.001s  reserve_id_for_url("http://10.0.0.5:8000")
          -> url_to_id["http://10.0.0.5:8000"] = newReservedId
             (different key from K8s's "grpc://" entry, so this is a fresh insert)
t=1.002s  service.rs:258 conflict check: get(&newReservedId).is_some() -> false -> no 409
t=1.003s  Job::AddWorker { url: "http://10.0.0.5:8000" } submitted
t=1.004s  202 Accepted with worker_id = newReservedId
t=1.05s   Workflow: detect HTTP, normalize_url no-op
t=4.50s   RegisterWorkersStep -> register_or_replace(worker, url="http://10.0.0.5:8000")
          register_inner: url_to_id.entry("http://10.0.0.5:8000") OCCUPIED with newReservedId
          -> reuses newReservedId; workers[newReservedId] = worker_http

Result: two distinct workers exist for the same host:port under different schemes. The 202's worker_id is honest (polling it returns worker_http). No orphan. (Whether SMG should permit two workers on the same host:port is a separate design question; the registry has historically allowed it.)

D2: POST bare URL while K8s also discovers the same pod

Produces the same orphan + 404 as Case C. The K8s-side registration is unaffected and remains correct. The mixed-source aspect doesn't change the orphan dynamics; the orphan exists purely on the HTTP API side.

Case E — Workflow or job-submit fails after a successful reservation

E1: queue submit fails
t=0.000s  HTTP handler -> WorkerService::create_worker
t=0.001s  reserve_id_for_url("http://10.0.0.5:8000") -> reservedId    ← orphan-to-be
t=0.002s  Job::AddWorker submitted via job_queue.submit().await
t=0.003s  submit() returns Err(...)
t=0.003s  create_worker returns Err(QueueSubmitFailed). HTTP 500 to client.

End state:
  url_to_id["http://10.0.0.5:8000"] = reservedId   ← orphan, never reaped
  workers[reservedId]               = (absent)

The client got an error and probably won't poll, so the 404 wart is less visible — but the registry leak is real and unbounded.

E2: queue submit succeeds, workflow fails downstream
t=0.000s  HTTP handler -> WorkerService::create_worker
t=0.001s  reserve_id_for_url("http://10.0.0.5:8000") -> reservedId
t=0.002s  Job::AddWorker submitted successfully
t=0.003s  202 Accepted with worker_id = reservedId
t=0.05s   Workflow runs:
          - DetectConnectionMode times out, OR
          - DetectBackend fails, OR
          - CreateWorker fails (e.g. invalid model labels), OR
          - RegisterWorkers fails (rare, but possible under contention)
t=...s    Workflow terminates in failure. register_or_replace never runs.

End state:
  url_to_id["http://10.0.0.5:8000"] = reservedId   ← orphan
  workers[reservedId]               = (absent)
  Client polls /workers/{reservedId} -> 404 Not Found (correctly says "not found",
  but gives no signal that the workflow itself failed).

E1 and E2 exist independently of input URL format — even a perfectly-schemed input creates an orphan when the downstream registration fails.
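For the synchronous E1 path, one possible shape is a drop guard that undoes the reservation unless the job was queued. This is only a sketch under stated assumptions: `ReservationGuard`, the `UrlToId` alias, the `Arc<Mutex<HashMap>>` storage, and the simplified `create_worker` signature (with `submit_ok` standing in for the result of the queue submit) are all hypothetical and not taken from the SMG code.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-ins for the real types.
type WorkerId = u64;
type UrlToId = Arc<Mutex<HashMap<String, WorkerId>>>;

/// Removes the reservation on drop unless `commit` was called.
struct ReservationGuard {
    url_to_id: UrlToId,
    url: String,
    created: bool, // true only if this guard inserted the entry
    committed: bool,
}

impl ReservationGuard {
    fn reserve(url_to_id: &UrlToId, url: &str, fresh_id: WorkerId) -> (WorkerId, Self) {
        let mut map = url_to_id.lock().unwrap();
        let existing = map.get(url).copied();
        let id = existing.unwrap_or(fresh_id);
        if existing.is_none() {
            map.insert(url.to_string(), id);
        }
        let guard = ReservationGuard {
            url_to_id: Arc::clone(url_to_id),
            url: url.to_string(),
            created: existing.is_none(),
            committed: false,
        };
        (id, guard)
    }

    /// Keeps the reservation; the guard's drop becomes a no-op.
    fn commit(mut self) {
        self.committed = true;
    }
}

impl Drop for ReservationGuard {
    fn drop(&mut self) {
        // Never remove an entry this guard did not create.
        if self.created && !self.committed {
            if let Ok(mut map) = self.url_to_id.lock() {
                map.remove(&self.url);
            }
        }
    }
}

/// `submit_ok` stands in for the outcome of the job queue submit.
fn create_worker(
    url_to_id: &UrlToId,
    url: &str,
    fresh_id: WorkerId,
    submit_ok: bool,
) -> Result<WorkerId, String> {
    let (id, guard) = ReservationGuard::reserve(url_to_id, url, fresh_id);
    if !submit_ok {
        return Err("queue submit failed".to_string()); // guard drops, entry removed
    }
    guard.commit();
    Ok(id)
}
```

A guard like this covers only E1. In E2 the job has already been queued and the 202 returned, so the reservation would have to be released from the workflow's failure path instead.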

Summary table

| Scenario | Reservation? | Registration? | Orphan? | Client visible? |
|---|---|---|---|---|
| A. K8s discover -> register -> delete | No | Yes | No | N/A |
| B. POST schemed URL, success | Yes | Yes (same key) | No (reused) | Correct 202 |
| C. POST bare URL, success | Yes | Yes (different key) | Yes | 404 on poll |
| D1. POST schemed + K8s same pod | Yes | Yes (same key) | No | Correct 202 (two workers exist) |
| D2. POST bare + K8s same pod | Yes | Yes (different key) | Yes | 404 on poll (K8s side is fine) |
| E1. Submit failure after reserve | Yes | No | Yes | 500, no poll |
| E2. Workflow failure after reserve | Yes | No | Yes | 404 on poll |

Possible fix directions

Two orthogonal pieces would close all the cases above:

  1. Reject schemeless / unparsable URLs at the API boundary before reserve_id_for_url runs. Closes Case C and the URL-divergence half of D2. Cheap; preserves the existing reservation model.

  2. Reservation lifecycle: add a release_reservation(url) -> bool primitive on WorkerRegistry that drops a url_to_id entry only when its WorkerId is not in workers (an orphan; never touch live workers). Wire it into:

    • WorkerService::create_worker's error-return paths (closes E1)
    • The AddWorker workflow's terminal-failure handler, called for config.url when the workflow ends in failure (closes E2)
    • Optionally, register_inner when it inserts under a key different from any prior reservation for the same caller (defense-in-depth against future divergence sources)
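A minimal sketch of the release_reservation primitive described in item 2, assuming a simplified registry backed by plain `HashMap`s. The `Registry` struct, its field types, and the `u64` id are illustrative assumptions. The real registry is shared across tasks, so there the check and the removal would need to happen atomically with respect to concurrent registration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the real WorkerId.
type WorkerId = u64;

struct Registry {
    url_to_id: HashMap<String, WorkerId>,
    workers: HashMap<WorkerId, String>, // id -> worker URL
}

impl Registry {
    /// Drops the url_to_id entry for `url` only if no live worker owns its id.
    /// Returns true when an orphaned reservation was removed.
    fn release_reservation(&mut self, url: &str) -> bool {
        let is_orphan = match self.url_to_id.get(url) {
            Some(id) => !self.workers.contains_key(id),
            None => false,
        };
        if is_orphan {
            self.url_to_id.remove(url);
        }
        is_orphan
    }
}
```

Against the Case C end state, calling this with "10.0.0.5:8000" removes the orphan and returns true, while calling it with "grpc://10.0.0.5:8000" leaves the live worker's entry untouched and returns false.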

(2) is the strictly more correct fix; (1) is a cheap partial fix that closes the most user-visible case (Case C) without changing the registry's lifecycle model.
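The boundary check in (1) could look roughly like the sketch below. The accepted scheme list is the one normalize_url is described as recognizing. The `validate_worker_url` name, the `UrlRejection` enum, and the empty-host check are assumptions added for illustration.

```rust
/// Schemes that normalize_url leaves untouched, per the description above.
const ACCEPTED_SCHEMES: [&str; 4] = ["http://", "https://", "grpc://", "grpcs://"];

/// Hypothetical rejection reasons; a handler would map both to 400 Bad Request.
#[derive(Debug, PartialEq)]
enum UrlRejection {
    MissingScheme,
    EmptyHost,
}

/// Accepts only URLs that already carry a recognized scheme, so the
/// reservation key can no longer be rewritten later by the workflow.
fn validate_worker_url(url: &str) -> Result<&str, UrlRejection> {
    let rest = ACCEPTED_SCHEMES
        .iter()
        .find_map(|scheme| url.strip_prefix(*scheme))
        .ok_or(UrlRejection::MissingScheme)?;
    if rest.is_empty() || rest.starts_with('/') {
        return Err(UrlRejection::EmptyHost);
    }
    Ok(url)
}
```

Running this before reserve_id_for_url means a bare "10.0.0.5:8000" is rejected up front and never gets a reservation, while "http://10.0.0.5:8000" passes through unchanged.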

Pre-submission Checklist

  • I have searched existing issues and discussions
  • I can reproduce this issue consistently
  • I am using the latest version of SMG
