[Bug]: WorkerService::create_worker orphans url_to_id entries; clients get 202 with worker_id that 404s on poll #1533

## Bug Description

`WorkerService::create_worker` (`model_gateway/src/worker/service.rs:240`) reserves a `WorkerId` against the submitted URL string **before** the AddWorker workflow runs, so it can return `202 Accepted` synchronously with `Location: /workers/{worker_id}`. The 202 contract holds **only** when the workflow eventually registers the worker under the *same URL string* the reservation was keyed on. There are several real scenarios where it doesn't, and each one creates a permanently-orphaned entry in `WorkerRegistry::url_to_id` plus a `worker_id` in the 202 response that points at nothing.

Concretely:

1. **The reservation key and the registration key diverge.** `CreateWorkerStep` rewrites the URL via `normalize_url` (`workflow/steps/local/create_worker.rs:300-313`). Any input that does not already start with one of `http://`, `https://`, `grpc://`, `grpcs://` gets a scheme prepended. The reservation entry under the original URL becomes orphaned; a fresh `WorkerId` is minted for the canonical URL during `register_inner`; the client polls the original `worker_id` and gets 404.
2. **The AddWorker job submission fails after a successful reservation** (`service.rs:269-272`). The reservation is never reaped.
3. **The workflow itself fails downstream** (detection times out, gRPC build fails, etc.). The reservation is never reaped.

The orphan is only a `url_to_id → WorkerId` row with no matching `workers[WorkerId]`. It is small (one `String` key + one 16-byte `WorkerId`) but unbounded over process lifetime. The client-visible 404 is the more painful symptom.

Adjacent context: PR #1523 added a `resolve_url_to_id` canonicalization helper that makes downstream URL lookups *tolerate* the orphan (used by K8s service-discovery removal), but it does not address creation or cleanup of the orphan — and it has no effect on the 404-on-poll wart.

## Steps to Reproduce

```
1. Start SMG with default config.
2. curl -X POST http://localhost:<router-port>/workers \
        -H 'Content-Type: application/json' \
        -d '{"url":"10.0.0.5:8000"}'   # bare host:port, NOT http://...
3. Note the worker_id and Location header in the 202 Accepted response.
4. Wait for the AddWorker workflow to complete (a few seconds for detection).
5. curl http://localhost:<router-port>/workers/<worker_id_from_step_3>
   -> 404 Not Found
6. curl http://localhost:<router-port>/workers
   -> the actual worker is registered, but under a DIFFERENT worker_id
      that the client was never told about.
```

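The same flow with `{"url":"http://10.0.0.5:8000"}` works correctly. On the success path, the orphan is created only when `normalize_url` rewrites the input; the failure paths (Case E under Additional Context) leak a reservation regardless of URL format.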
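
To make the rewrite concrete, here is a minimal sketch of the normalization behavior described in this report. It is not the code from `create_worker.rs`: the shape of `ConnectionMode` and the mapping from detected mode to prepended scheme (`http://` or `grpc://`) are assumptions inferred from the timelines in this issue.

```rust
/// Hypothetical stand-in for the detected connection mode.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ConnectionMode {
    Http,
    Grpc,
}

/// Schemes that, per this report, are left untouched.
const KNOWN_SCHEMES: [&str; 4] = ["http://", "https://", "grpc://", "grpcs://"];

/// Sketch of the rewrite: already-schemed input passes through unchanged,
/// anything else gets a scheme chosen from the detected mode (assumed mapping).
pub fn normalize_url(url: &str, mode: ConnectionMode) -> String {
    if KNOWN_SCHEMES.iter().any(|scheme| url.starts_with(*scheme)) {
        return url.to_string();
    }
    let scheme = match mode {
        ConnectionMode::Http => "http://",
        ConnectionMode::Grpc => "grpc://",
    };
    format!("{}{}", scheme, url)
}
```

Under these assumptions, `normalize_url("10.0.0.5:8000", ConnectionMode::Grpc)` yields a different string from the one the reservation was keyed on, while `normalize_url("http://10.0.0.5:8000", ConnectionMode::Grpc)` returns its input unchanged.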

## Expected Behavior

Either:
- the API rejects schemeless input at the boundary with `400 Bad Request`, **or**
- the API accepts schemeless input but the 202 returns the `worker_id` the worker is actually registered under, and there is no orphan in `url_to_id`.
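
The first option could look roughly like the sketch below. The function name `validate_worker_url` and the error type are hypothetical, and the accepted-scheme list is simply the one quoted in this report; a real handler would map the error to `400 Bad Request` before any reservation is made.

```rust
/// Schemes accepted at the API boundary (the list quoted in this report).
const ACCEPTED_SCHEMES: [&str; 4] = ["http://", "https://", "grpc://", "grpcs://"];

/// Hypothetical validation error; a handler would translate it to HTTP 400.
#[derive(Debug, PartialEq, Eq)]
pub enum UrlValidationError {
    MissingScheme,
    EmptyAuthority,
}

/// Returns the URL unchanged when it already carries an accepted scheme,
/// so the reservation key and the registration key cannot diverge.
pub fn validate_worker_url(url: &str) -> Result<&str, UrlValidationError> {
    let rest = ACCEPTED_SCHEMES
        .iter()
        .find_map(|scheme| url.strip_prefix(*scheme))
        .ok_or(UrlValidationError::MissingScheme)?;
    if rest.is_empty() {
        // A bare scheme such as "grpc://" has nothing to connect to.
        return Err(UrlValidationError::EmptyAuthority);
    }
    Ok(url)
}
```

This only checks the prefix; stricter parsing (host, port) is deliberately left out of the sketch.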

For the workflow- / submit-failure cases, the orphan should be reaped when the workflow terminates in failure or when the API returns an error to the client.

## Actual Behavior

For the bare-URL repro above, assuming detection resolves the worker to gRPC:

- `url_to_id["10.0.0.5:8000"] = reservedId`     (orphan; never reaped)
- `url_to_id["grpc://10.0.0.5:8000"] = liveId`  (the actual worker)
- `workers[liveId] = <Worker>`
- `workers[reservedId]` is absent
- HTTP client polling `/workers/{reservedId}` from the 202 response gets `404 Not Found` forever.
- No log line surfaces the WorkerId divergence; the client has no way to discover `liveId` except by listing `/workers` and matching on URL — and the URL in the listing is `grpc://10.0.0.5:8000`, not the `10.0.0.5:8000` they submitted.

## Component

model-gateway (core routing)

## Routing Policy (if applicable)

N/A — this is an API-layer / registry-layer issue independent of routing policy.

## Connection Mode

N/A — affects HTTP API path regardless of detected worker protocol.

## Configuration

Not configuration-dependent. The bug is present in any deployment that exposes `POST /workers` (i.e. the default).

## Logs / Error Output

No error is emitted by SMG for the URL-divergence case — the AddWorker workflow logs success, and the orphan reservation is silently leaked. The visible symptom is just the client-side 404 on the polled `worker_id`.

For the workflow-failure case, the workflow's failure logs do appear, but nothing in them mentions the leaked reservation, and no cleanup is triggered.

## Environment

Reproducible against `lightseekorg/smg@main` at the time of filing (post-merge of #1523). Code-level issue, build-environment agnostic.

```
OS: any (Linux / macOS / etc.)
cargo: any modern stable
rustc: any modern stable
```

## Deployment Environment

Bare metal / VM, Kubernetes, Docker, local development — all affected equally; this is an API-contract bug.

## Streaming Context

N/A — the bug is in the management API (`POST /workers`), not the inference path.

## Additional Context

### Detailed case-by-case walkthrough

What follows is the full behavior matrix the team worked through while investigating #1523, describing how `main` behaves today for each scenario. Each timeline starts at `t=0` when the API handler is invoked (or when service_discovery acts, for K8s cases).

#### Case A — K8s pod is discovered, registered, deleted

The primary scenario PR #1523 fixed. **Not affected by the orphan bug** — `service_discovery` submits `Job::AddWorker` directly to the job queue and never calls `WorkerService::create_worker`, so no reservation is ever written.

```
t=0.000s  service_discovery::worker_url("10.0.0.5", 8000) -> "10.0.0.5:8000"  (bare, post-#1523)
t=0.001s  Job::AddWorker { config { url: "10.0.0.5:8000" } } submitted directly to queue
t=0.05s   Workflow: DetectConnectionMode probes the pod
t=2.30s   Detection -> ConnectionMode::Grpc
t=2.31s   CreateWorkerStep: normalize_url("10.0.0.5:8000", Grpc) = "grpc://10.0.0.5:8000"
t=3.50s   RegisterWorkersStep -> register_or_replace
          register_inner: url_to_id.entry("grpc://10.0.0.5:8000") VACANT -> mints liveId
          url_to_id["grpc://10.0.0.5:8000"] = liveId
          workers[liveId] = worker
t=...s    Pod deleted -> Job::RemoveWorker { url: "10.0.0.5:8000" }
          resolve_url_to_id("10.0.0.5:8000"):
            exact "10.0.0.5:8000" -> miss
            fallback "http://10.0.0.5:8000" -> miss
            fallback "grpc://10.0.0.5:8000" -> liveId (is_live True) -> returned
          Worker removed cleanly.
```

No reservation, no orphan, no 404. End state is consistent.
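
The fallback lookup in the removal step can be modeled as below. This is a toy sketch, not the helper added in #1523: the `Registry` struct, the `u64` id type, and the rule that every candidate (including the exact match) must point at a live worker are assumptions based on the trace above.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for the real `WorkerId` type.
pub type WorkerId = u64;

/// Toy registry: `url_to_id` may contain reservations, `live` holds the ids
/// of workers that are actually registered.
#[derive(Default)]
pub struct Registry {
    pub url_to_id: HashMap<String, WorkerId>,
    pub live: HashSet<WorkerId>,
}

impl Registry {
    /// Try the exact key first, then scheme-prefixed candidates, and return
    /// the first id that belongs to a live worker.
    pub fn resolve_url_to_id(&self, url: &str) -> Option<WorkerId> {
        let candidates = [
            url.to_string(),
            format!("http://{}", url),
            format!("grpc://{}", url),
        ];
        candidates
            .iter()
            .filter_map(|key| self.url_to_id.get(key))
            .copied()
            .find(|id| self.live.contains(id))
    }
}
```

Because non-live ids are skipped, an orphaned entry under the bare key does not shadow the live `grpc://` entry in this model, which matches the "tolerates the orphan" behavior described earlier.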

#### Case B — User POSTs `{"url":"http://10.0.0.5:8000"}` (already-schemed)

The intended happy path of `POST /workers`. **Not affected by the orphan bug.**

```
t=0.000s  HTTP handler -> WorkerService::create_worker
t=0.001s  reserve_id_for_url("http://10.0.0.5:8000")
          -> url_to_id["http://10.0.0.5:8000"] = reservedId
t=0.002s  Job::AddWorker { url: "http://10.0.0.5:8000" } submitted to queue
t=0.003s  202 Accepted, Location: /workers/{reservedId}, body worker_id = reservedId
t=0.05s   Workflow: detect HTTP, normalize_url no-op (scheme already present)
t=3.50s   RegisterWorkersStep -> register_or_replace(worker, url="http://10.0.0.5:8000")
          register_inner: url_to_id.entry("http://10.0.0.5:8000") OCCUPIED with reservedId
          -> reuses reservedId; workers[reservedId] = worker
t=...s    Client polls /workers/{reservedId} -> 200 with worker info ✓
```

Same key on both sides of the reservation -> no orphan. `worker_id` in the 202 is the one the worker actually has.

#### Case C — User POSTs `{"url":"10.0.0.5:8000"}` (bare host:port) — THE BUG

```
t=0.000s  HTTP handler -> WorkerService::create_worker
t=0.001s  reserve_id_for_url("10.0.0.5:8000")
          -> url_to_id["10.0.0.5:8000"] = reservedId
          -> workers is unchanged (no worker at this id yet)
t=0.002s  service.rs:258 conflict check: get(&reservedId).is_some() -> false -> no 409
t=0.003s  Job::AddWorker { url: "10.0.0.5:8000" } submitted to queue
t=0.004s  202 Accepted, Location: /workers/{reservedId}, body worker_id = reservedId

          ── HTTP response done. Client now starts polling. ──

t=0.05s   AddWorker workflow starts in background
t=0.05s   DetectConnectionMode probes 10.0.0.5:8000
t=2.30s   Detection -> ConnectionMode::Grpc
t=2.31s   CreateWorkerStep:
          normalize_url("10.0.0.5:8000", Grpc) = "grpc://10.0.0.5:8000"
          worker = BasicWorkerBuilder::new("grpc://10.0.0.5:8000").build()
          worker.url() = "grpc://10.0.0.5:8000"   <-- NEW string, ≠ reservation key
t=3.50s   RegisterWorkersStep -> register_or_replace(worker)
          register_inner:
            url_to_id.entry("grpc://10.0.0.5:8000")  -> VACANT
            (the existing reservation is keyed on "10.0.0.5:8000", NOT "grpc://...")
            -> mints a NEW WorkerId (liveId, ≠ reservedId)
            -> workers[liveId] = worker
            -> url_to_id["grpc://10.0.0.5:8000"] = liveId

Final registry state at t=3.50s:
  url_to_id["10.0.0.5:8000"]        = reservedId   ← orphan, points at nothing
  url_to_id["grpc://10.0.0.5:8000"] = liveId       ← live worker
  workers[liveId]                   = <Worker>
  workers[reservedId]               = (absent)

What the client sees:
  202 response body: worker_id = reservedId
  Polling GET /workers/reservedId  -> 404 Not Found (forever)
  Polling GET /workers/liveId      -> the worker exists, but the client doesn't know liveId
```

The orphan at `url_to_id["10.0.0.5:8000"]` is never reaped — the only `url_to_id.remove(...)` sites in the registry (`registry.rs:920` inside `remove()`, and `registry.rs:643` inside `register_or_replace`'s error-recovery branch) both key on `worker.url()`, which is `"grpc://10.0.0.5:8000"`, never `"10.0.0.5:8000"`.
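
The Case C divergence can be reproduced with a small in-memory model. Everything here is a simplified stand-in: `ToyRegistry`, the `u64` id type, and `register` (which plays the role of `register_inner` reusing a reservation only on an exact key match) are hypothetical and only mirror the behavior described in the timeline.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the real `WorkerId` type.
pub type WorkerId = u64;

#[derive(Default)]
pub struct ToyRegistry {
    next_id: WorkerId,
    pub url_to_id: HashMap<String, WorkerId>,
    /// Maps a worker id to the URL the worker was built with.
    pub workers: HashMap<WorkerId, String>,
}

impl ToyRegistry {
    fn mint(&mut self) -> WorkerId {
        self.next_id += 1;
        self.next_id
    }

    /// Reserve (or look up) an id for the URL exactly as submitted.
    pub fn reserve_id_for_url(&mut self, url: &str) -> WorkerId {
        if let Some(id) = self.url_to_id.get(url) {
            return *id;
        }
        let id = self.mint();
        self.url_to_id.insert(url.to_string(), id);
        id
    }

    /// Register a worker under its own URL. An existing reservation is
    /// reused only when the key matches exactly; otherwise a new id is minted.
    pub fn register(&mut self, worker_url: &str) -> WorkerId {
        let id = self.reserve_id_for_url(worker_url);
        self.workers.insert(id, worker_url.to_string());
        id
    }

    /// Keys in `url_to_id` whose id has no worker behind it.
    pub fn orphans(&self) -> Vec<&str> {
        self.url_to_id
            .iter()
            .filter(|(_, id)| !self.workers.contains_key(*id))
            .map(|(url, _)| url.as_str())
            .collect()
    }
}
```

Reserving under `"10.0.0.5:8000"` and then registering under `"grpc://10.0.0.5:8000"` leaves two different ids and one orphaned key in this model; reserving and registering under the same schemed URL (Case B) leaves none.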

#### Case D — Mixed source: a human POSTs the same pod K8s is already managing

##### D1: POST schemed URL while K8s also discovers the same pod

```
t=0.000s  service_discovery (K8s) registers the pod first:
          url_to_id["grpc://10.0.0.5:8000"] = liveId
          workers[liveId]                   = worker_k8s

t=1.000s  User POSTs {"url":"http://10.0.0.5:8000"}
t=1.001s  reserve_id_for_url("http://10.0.0.5:8000")
          -> url_to_id["http://10.0.0.5:8000"] = newReservedId
             (different key from K8s's "grpc://" entry, so this is a fresh insert)
t=1.002s  service.rs:258 conflict check: get(&newReservedId).is_some() -> false -> no 409
t=1.003s  Job::AddWorker { url: "http://10.0.0.5:8000" } submitted
t=1.004s  202 Accepted with worker_id = newReservedId
t=1.05s   Workflow: detect HTTP, normalize_url no-op
t=4.50s   RegisterWorkersStep -> register_or_replace(worker, url="http://10.0.0.5:8000")
          register_inner: url_to_id.entry("http://10.0.0.5:8000") OCCUPIED with newReservedId
          -> reuses newReservedId; workers[newReservedId] = worker_http
```

Result: two distinct workers exist for the same host:port under different schemes. The 202's `worker_id` is honest (polling it returns `worker_http`). No orphan. (Whether SMG *should* permit two workers on the same host:port is a separate design question; the registry has historically allowed it.)

##### D2: POST bare URL while K8s also discovers the same pod

Produces the same orphan + 404 as Case C. The K8s-side registration is unaffected and remains correct. The mixed-source aspect doesn't change the orphan dynamics; the orphan exists purely on the HTTP API side.

#### Case E — Workflow or job-submit fails after a successful reservation

##### E1: queue submit fails

```
t=0.000s  HTTP handler -> WorkerService::create_worker
t=0.001s  reserve_id_for_url("http://10.0.0.5:8000") -> reservedId    ← orphan-to-be
t=0.002s  Job::AddWorker submitted via job_queue.submit().await
t=0.003s  submit() returns Err(...)
t=0.003s  create_worker returns Err(QueueSubmitFailed). HTTP 500 to client.

End state:
  url_to_id["http://10.0.0.5:8000"] = reservedId   ← orphan, never reaped
  workers[reservedId]               = (absent)
```

The client got an error and probably won't poll, so the 404 wart is less visible — but the registry leak is real and unbounded.

##### E2: queue submit succeeds, workflow fails downstream

```
t=0.000s  HTTP handler -> WorkerService::create_worker
t=0.001s  reserve_id_for_url("http://10.0.0.5:8000") -> reservedId
t=0.002s  Job::AddWorker submitted successfully
t=0.003s  202 Accepted with worker_id = reservedId
t=0.05s   Workflow runs:
          - DetectConnectionMode times out, OR
          - DetectBackend fails, OR
          - CreateWorker fails (e.g. invalid model labels), OR
          - RegisterWorkers fails (rare, but possible under contention)
t=...s    Workflow terminates in failure. register_or_replace never runs.

End state:
  url_to_id["http://10.0.0.5:8000"] = reservedId   ← orphan
  workers[reservedId]               = (absent)
  Client polls /workers/{reservedId} -> 404 Not Found (correctly says "not found",
  but gives no signal that the workflow itself failed).
```

E1 and E2 exist independently of input URL format — even a perfectly-schemed input creates an orphan when the downstream registration fails.

### Summary table

| Scenario | Reservation? | Registration? | Orphan? | Client-visible result |
|---|---|---|---|---|
| A. K8s discover -> register -> delete | No | Yes | No | — |
| B. POST schemed URL, success | Yes | Yes (same key) | No (reused) | Correct 202 |
| C. POST bare URL, success | Yes | Yes (different key) | **Yes** | **404 on poll** |
| D1. POST schemed + K8s same pod | Yes | Yes (same key) | No | Correct 202 (two workers exist) |
| D2. POST bare + K8s same pod | Yes | Yes (different key) | **Yes** | **404 on poll** (K8s side is fine) |
| E1. Submit failure after reserve | Yes | No | **Yes** | 500, no poll |
| E2. Workflow failure after reserve | Yes | No | **Yes** | **404 on poll** |

### Possible fix directions

Two orthogonal pieces would close all the cases above:

1. **Reject schemeless / unparsable URLs at the API boundary** before `reserve_id_for_url` runs. Closes Case C and the URL-divergence half of D2. Cheap; preserves the existing reservation model.

2. **Reservation lifecycle**: add a `release_reservation(url) -> bool` primitive on `WorkerRegistry` that drops a `url_to_id` entry **only when** its `WorkerId` is not in `workers` (an orphan; never touch live workers). Wire it into:
   - `WorkerService::create_worker`'s error-return paths (closes E1)
   - The AddWorker workflow's terminal-failure handler, called for `config.url` when the workflow ends in failure (closes E2)
   - Optionally, `register_inner` when it inserts under a key different from any prior reservation for the same caller (defense-in-depth against future divergence sources)

(2) is the more complete fix, since it also covers the failure paths (E1, E2), which occur regardless of URL format; (1) is a cheap partial fix that closes the most user-visible case (Case C) without changing the registry's lifecycle model.
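
A rough sketch of the proposed primitive and its use on the submit-failure path (E1) is shown below. The `Registry` struct, the `u64` id type, and the closure-based `create_worker` are hypothetical simplifications; the real registry is presumably shared across threads, in which case the check and the removal would need to happen atomically.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the real `WorkerId` type.
pub type WorkerId = u64;

/// Toy registry with the two maps named in this report.
#[derive(Default)]
pub struct Registry {
    next_id: WorkerId,
    pub url_to_id: HashMap<String, WorkerId>,
    pub workers: HashMap<WorkerId, String>,
}

impl Registry {
    /// Reserve (or look up) an id for the URL exactly as submitted.
    pub fn reserve_id_for_url(&mut self, url: &str) -> WorkerId {
        if let Some(id) = self.url_to_id.get(url) {
            return *id;
        }
        self.next_id += 1;
        self.url_to_id.insert(url.to_string(), self.next_id);
        self.next_id
    }

    /// Drop the `url_to_id` entry for `url` only when no worker is stored
    /// under its id. Returns true if an orphaned reservation was removed.
    pub fn release_reservation(&mut self, url: &str) -> bool {
        let is_orphan = match self.url_to_id.get(url) {
            Some(id) => !self.workers.contains_key(id),
            None => false,
        };
        if is_orphan {
            self.url_to_id.remove(url);
        }
        is_orphan
    }
}

/// Toy `create_worker`; `submit` stands in for the job-queue submission.
pub fn create_worker<F>(reg: &mut Registry, url: &str, submit: F) -> Result<WorkerId, String>
where
    F: FnOnce(&str) -> Result<(), String>,
{
    let id = reg.reserve_id_for_url(url);
    if let Err(err) = submit(url) {
        // Error path (E1): undo the reservation before surfacing the failure.
        reg.release_reservation(url);
        return Err(err);
    }
    Ok(id)
}
```

One caveat of this model: a reservation whose workflow is still in flight looks the same as an orphan, so the call sites matter. The release should run only on an error return or after the workflow has terminated in failure.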

## Pre-submission Checklist

- [x] I have searched existing issues and discussions
- [x] I can reproduce this issue consistently
- [x] I am using the latest version of SMG
