fix(api): reject schemeless worker URLs in POST /workers#1532

Open
CatherineSue wants to merge 1 commit into main from fix/api-reject-schemeless-worker-url

Member

@CatherineSue CatherineSue commented May 24, 2026

Description

Problem

WorkerService::create_worker (model_gateway/src/worker/service.rs:240) calls reserve_id_for_url(config.url) (registry.rs:889) before the AddWorker workflow runs, so that the HTTP handler can return 202 Accepted synchronously with Location: /workers/{worker_id}. The contract requires the reservation key (config.url at submission time) and the registration key (worker.url() after workflow normalization) to be the same string — only then does register_inner's url_to_id.entry(...) find the pre-reserved entry and reuse the reserved WorkerId.

CreateWorkerStep rewrites the URL via normalize_url (model_gateway/src/workflow/steps/local/create_worker.rs:300-313):

fn normalize_url(url: &str, connection_mode: ConnectionMode) -> String {
    if url.starts_with("http://") || url.starts_with("https://")
        || url.starts_with("grpc://") || url.starts_with("grpcs://") {
        url.to_string()
    } else {
        match connection_mode {
            ConnectionMode::Http => format!("http://{url}"),
            ConnectionMode::Grpc => format!("grpc://{url}"),
        }
    }
}

When the submitted URL has no recognized scheme (e.g. bare host:port), normalize_url prepends one, so the reservation key and the registration key diverge and the worker_id returned in the 202 response is permanently orphaned. The client polls /workers/{worker_id} and gets 404 Not Found, while the live worker is registered under a different WorkerId that the client was never told about.
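The divergence described above can be reproduced in a few lines. This is a minimal sketch, not the gateway's registry: `ToyRegistry`, its `u64` ids, `register`, and `submit_and_register` are hypothetical stand-ins, and only `normalize_url` follows the excerpt quoted in this description.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum ConnectionMode {
    Http,
    Grpc,
}

/// Same rule as the `normalize_url` excerpt quoted above.
pub fn normalize_url(url: &str, connection_mode: ConnectionMode) -> String {
    if url.starts_with("http://")
        || url.starts_with("https://")
        || url.starts_with("grpc://")
        || url.starts_with("grpcs://")
    {
        url.to_string()
    } else {
        match connection_mode {
            ConnectionMode::Http => format!("http://{url}"),
            ConnectionMode::Grpc => format!("grpc://{url}"),
        }
    }
}

/// Hypothetical stand-in for the registry: only the two maps that matter here.
#[derive(Default)]
pub struct ToyRegistry {
    next_id: u64,
    pub url_to_id: HashMap<String, u64>,
    pub workers: HashMap<u64, String>, // id -> registered URL
}

impl ToyRegistry {
    /// Reserve (or reuse) an id for the URL exactly as given.
    pub fn reserve_id_for_url(&mut self, url: &str) -> u64 {
        if let Some(&id) = self.url_to_id.get(url) {
            return id;
        }
        self.next_id += 1;
        self.url_to_id.insert(url.to_string(), self.next_id);
        self.next_id
    }

    /// Register a worker under its normalized URL. A reservation is reused
    /// only when its key is the same string.
    pub fn register(&mut self, normalized_url: &str) -> u64 {
        let id = self.reserve_id_for_url(normalized_url);
        self.workers.insert(id, normalized_url.to_string());
        id
    }
}

/// Returns (reserved id, live id) for one submit-then-register round trip.
pub fn submit_and_register(
    reg: &mut ToyRegistry,
    url: &str,
    mode: ConnectionMode,
) -> (u64, u64) {
    let reserved = reg.reserve_id_for_url(url);
    let live = reg.register(&normalize_url(url, mode));
    (reserved, live)
}
```

With a bare `10.0.0.5:8000` the two ids differ and nothing is stored under the reserved one; with an already-schemed URL they are equal.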

Adjacent #1523 (fix(service-discovery): drop http:// prefix so dual-probe detects gRPC) added a resolve_url_to_id canonicalization helper that makes registry lookups tolerate the orphan, but does not address the API contract violation. This PR closes the orphan-creation path on the HTTP API side.

Solution

Validate config.url at the API boundary, before reserve_id_for_url runs. Reject any URL that:

  • is empty,
  • does not start with one of http://, https://, grpc://, grpcs:// (case-insensitive scheme allow-list), or
  • does not parse via ::url::Url::parse or has no host.

The exact rule already lives in config::validation::ConfigValidator::validate_urls (used for static router_config URLs). This PR extracts it into a free pub(crate) fn validate_worker_url(url: &str) -> Result<(), String> so both the static-config validator and the new service-layer wrapper share one implementation. The service-layer wrapper (validate_worker_url_request) translates the plain error string into WorkerServiceError::BadRequest, yielding 400 Bad Request with a message like Worker URL '10.0.0.5:8000' is invalid: URL must start with http://, https://, grpc://, or grpcs://.
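The shape of the shared helper and its service-layer wrapper might look roughly like the sketch below. It is an approximation under stated assumptions: the real helper parses with `::url::Url::parse`, whereas this version uses a hand-rolled host check so that it needs only the standard library. The empty-URL and missing-host messages are invented for illustration, and `WorkerServiceError` is reduced to the one variant named here.

```rust
/// Scheme prefixes accepted for worker URLs, per the allow-list above.
const ALLOWED_SCHEMES: [&str; 4] = ["http://", "https://", "grpc://", "grpcs://"];

/// Simplified stand-in for the shared helper. The host check is naive (no
/// IPv6 literals, no full URL parsing) and is not the real implementation.
pub fn validate_worker_url(url: &str) -> Result<(), String> {
    if url.is_empty() {
        // Hypothetical wording; the source only shows that it mentions "empty".
        return Err("URL cannot be empty".to_string());
    }
    // ASCII lowercasing preserves byte offsets, so slicing `url` by the
    // matched scheme length below stays on a character boundary.
    let lower = url.to_ascii_lowercase();
    let scheme = ALLOWED_SCHEMES
        .iter()
        .find(|s| lower.starts_with(**s))
        .ok_or_else(|| {
            "URL must start with http://, https://, grpc://, or grpcs://".to_string()
        })?;
    // Authority = everything after the scheme up to the first '/', '?' or '#'.
    let rest = &url[scheme.len()..];
    let authority = rest
        .split(|c: char| matches!(c, '/' | '?' | '#'))
        .next()
        .unwrap_or("");
    // Drop optional userinfo, then an optional ":port".
    let host_port = authority.rsplit('@').next().unwrap_or("");
    let host = host_port.split(':').next().unwrap_or("");
    if host.is_empty() {
        // Hypothetical wording.
        return Err("URL must have a valid host".to_string());
    }
    Ok(())
}

/// Reduced to the single variant this description mentions.
#[derive(Debug, PartialEq)]
pub enum WorkerServiceError {
    BadRequest { message: String },
}

/// Wraps the plain error string into a 400-class service error.
pub fn validate_worker_url_request(url: &str) -> Result<(), WorkerServiceError> {
    validate_worker_url(url).map_err(|reason| WorkerServiceError::BadRequest {
        message: format!("Worker URL '{url}' is invalid: {reason}"),
    })
}
```

Keeping the helper's error type a plain `String` lets each caller choose its own error wrapper, which is the design the description attributes to the extraction.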

Changes

  • model_gateway/src/config/validation.rs: extract pub(crate) fn validate_worker_url(url: &str) -> Result<(), String>. ConfigValidator::validate_urls now delegates to it and wraps the error string in ConfigError::InvalidValue.
  • model_gateway/src/worker/service.rs: add validate_worker_url_request (wraps the shared helper into WorkerServiceError::BadRequest) and call it as the first statement of create_worker so no state is mutated for invalid input. Add seven unit tests covering the accept (four schemes + case-insensitive) and reject (bare host:port, empty, unknown scheme, missing host, unparsable URL) paths.

Test Plan

Behavior matrix after this PR, walking every scenario the team discussed for #1523. Each timeline begins at t=0 when the API handler is invoked.

Case A — K8s pod is discovered, registered, deleted (the primary scenario #1523 fixed)

This PR does not change this path. service_discovery submits Job::AddWorker directly to the job queue and never calls WorkerService::create_worker; no validation runs.

t=0.000s  service_discovery::worker_url("10.0.0.5", 8000) -> "10.0.0.5:8000"
t=0.001s  Job::AddWorker { config { url: "10.0.0.5:8000" } } submitted
t=0.05s   Workflow starts: DetectConnectionMode probes pod
t=2.30s   detection completes -> ConnectionMode::Grpc
t=2.31s   CreateWorkerStep: normalize_url("10.0.0.5:8000", Grpc) = "grpc://10.0.0.5:8000"
t=3.50s   RegisterWorkersStep -> url_to_id["grpc://10.0.0.5:8000"] = liveId; workers[liveId] = w
t=...s    Pod deleted -> Job::RemoveWorker { url: "10.0.0.5:8000" }
          resolve_url_to_id("10.0.0.5:8000"):
            exact "10.0.0.5:8000" -> miss
            fallback "http://10.0.0.5:8000" -> miss
            fallback "grpc://10.0.0.5:8000" -> liveId (is_live True) -> returned
          Worker removed cleanly.

✅ Unchanged. Result: correct end-to-end registration and removal.
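The lookup order in the removal step of this timeline (exact key, then `http://`, then `grpc://`, accepting the first id with a live worker) can be sketched as follows. This is only an illustration of the trace, not the helper added in #1523: `ToyRegistry`, `live_ids`, and the choice to apply the liveness check to every candidate are assumptions.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical registry view: URL keys mapped to ids, plus the set of ids
/// that currently have a worker behind them.
#[derive(Default)]
pub struct ToyRegistry {
    pub url_to_id: HashMap<String, u64>,
    pub live_ids: HashSet<u64>,
}

impl ToyRegistry {
    /// Try the URL as given, then with `http://` and `grpc://` prepended,
    /// and return the first id that maps to a live worker.
    pub fn resolve_url_to_id(&self, url: &str) -> Option<u64> {
        let candidates = [
            url.to_string(),
            format!("http://{url}"),
            format!("grpc://{url}"),
        ];
        candidates
            .iter()
            .filter_map(|key| self.url_to_id.get(key).copied())
            .find(|id| self.live_ids.contains(id))
    }
}
```

Under these assumptions an orphaned bare-URL entry is skipped because its id is not live, and the lookup falls through to the schemed key.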

Case B — User POSTs {"url":"http://10.0.0.5:8000"} (already-schemed)

t=0.000s  HTTP handler -> WorkerService::create_worker
t=0.001s  validate_worker_url_request("http://10.0.0.5:8000") -> Ok
t=0.001s  reserve_id_for_url("http://10.0.0.5:8000") -> url_to_id["http://10.0.0.5:8000"] = reservedId
t=0.002s  Job::AddWorker { url: "http://10.0.0.5:8000" } submitted
t=0.003s  202 Accepted, Location: /workers/{reservedId}
t=0.05s   Workflow: detect HTTP, normalize_url no-op (scheme preserved)
t=3.50s   RegisterWorkersStep -> register_or_replace(worker)
          register_inner: url_to_id.entry("http://10.0.0.5:8000") is OCCUPIED with reservedId
          -> reuses reservedId, workers[reservedId] = worker
t=...s    Client polls /workers/{reservedId} -> 200 with worker info ✓

✅ Unchanged. Validation passes, no orphan.

Case C — User POSTs {"url":"10.0.0.5:8000"} (bare) — THIS IS THE CASE THIS PR FIXES

Before this PR:

t=0.000s  HTTP handler -> WorkerService::create_worker
t=0.001s  reserve_id_for_url("10.0.0.5:8000") -> url_to_id["10.0.0.5:8000"] = reservedId
t=0.002s  Job::AddWorker submitted
t=0.003s  202 Accepted, Location: /workers/{reservedId}, body worker_id = reservedId
t=0.05s   Workflow: detect Grpc, normalize_url("10.0.0.5:8000", Grpc) = "grpc://10.0.0.5:8000"
t=3.50s   RegisterWorkersStep -> register_or_replace(worker, url="grpc://10.0.0.5:8000")
          register_inner: url_to_id.entry("grpc://10.0.0.5:8000") is VACANT
          -> mints NEW liveId, workers[liveId] = worker, url_to_id["grpc://10.0.0.5:8000"] = liveId
t=3.51s   Final registry state:
            url_to_id["10.0.0.5:8000"]      = reservedId  <- ORPHAN, no worker behind it
            url_to_id["grpc://10.0.0.5:8000"] = liveId    <- live worker
            workers[liveId]                 = w
            workers[reservedId]             = (absent)
t=...s    Client polls /workers/{reservedId} -> 404 Not Found
          Client has no way to discover that the actual worker_id is liveId.

After this PR:

t=0.000s  HTTP handler -> WorkerService::create_worker
t=0.001s  validate_worker_url_request("10.0.0.5:8000")
          -> Err(BadRequest { message: "Worker URL '10.0.0.5:8000' is invalid:
             URL must start with http://, https://, grpc://, or grpcs://" })
t=0.002s  400 Bad Request returned to client
          No reservation written. No job submitted. Registry unchanged.

✅ Fixed. Client gets an immediate, actionable 400 instead of 202 -> 404. No orphan can be created via this path.

Case D — Mixed source: a human POSTs the same pod K8s is already managing

Two sub-cases.

D1: POST schemed URL while K8s also discovers the same pod.

t=0.000s  service_discovery registers pod first:
          url_to_id["grpc://10.0.0.5:8000"] = liveId; workers[liveId] = w
t=1.000s  User POSTs {"url":"http://10.0.0.5:8000"}
t=1.001s  validate_worker_url_request -> Ok
t=1.001s  reserve_id_for_url("http://10.0.0.5:8000")
          -> url_to_id["http://10.0.0.5:8000"] = newReservedId (different from K8s's liveId because key differs)
t=1.002s  conflict check: get(newReservedId).is_some() -> false (no worker at this id) -> no 409
t=1.003s  Job::AddWorker { url: "http://10.0.0.5:8000" } submitted
t=1.004s  202 Accepted with worker_id = newReservedId
t=1.05s   Workflow: detect HTTP, normalize_url no-op
t=4.50s   RegisterWorkersStep:
          register_inner: url_to_id.entry("http://10.0.0.5:8000") is OCCUPIED with newReservedId
          -> reuses it; workers[newReservedId] = w'
          (Two distinct workers now exist for the same pod, one under each scheme.)

Two valid workers now exist for the same host:port under different schemes. The data model already permitted this before the PR, and the PR does not change it. The 202's worker_id is honest; polling it returns the worker the user created. No orphan, no 404.

D2: POST bare URL while K8s also discovers the same pod.

Before this PR: produces the same orphan + 404 wart as Case C, plus the pod is also tracked separately by service_discovery under the canonical scheme.

After this PR: 400 at the API boundary as in Case C. The K8s-side registration is unaffected.

✅ The only behavior change is in the bare-URL sub-case, where we now reject upfront instead of silently producing an orphan.

Case E — Workflow or job-submit fails after a successful reservation

t=0.000s  HTTP handler -> WorkerService::create_worker
t=0.001s  validate_worker_url_request("http://10.0.0.5:8000") -> Ok
t=0.001s  reserve_id_for_url(...) -> url_to_id["http://10.0.0.5:8000"] = reservedId
t=0.002s  Submit Job::AddWorker

Path E1: queue submit fails
t=0.003s  submit().await -> Err(...)
t=0.003s  create_worker returns Err(QueueSubmitFailed). HTTP 500.
          The reservation written at t=0.001 IS NOT REMOVED.

Path E2: queue submit succeeds, workflow fails downstream
t=0.003s  202 returned with worker_id = reservedId
t=0.05s   Workflow runs: DetectConnectionMode times out, or build fails
t=...s    Workflow errors out. register_or_replace never runs.
          The reservation written at t=0.001 IS NOT REMOVED.
          Client polls /workers/{reservedId} -> 404.

⚠️ Not addressed by this PR. These leak paths exist on main today regardless of whether the submitted URL was schemed or bare. Validation at the API boundary cannot fix them — the URL is well-formed; it's the downstream failure that orphans the reservation.

Properly fixing this requires a release_reservation(url) -> bool registry primitive that drops a url_to_id entry only when its WorkerId is not in workers, wired into:

  • WorkerService::create_worker's error-return paths (so E1 is reaped synchronously)
  • The AddWorker workflow's terminal-failure hook (so E2 is reaped asynchronously)

Tracked separately; out of scope for this PR.
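For reference, the proposed primitive could be as small as the sketch below. It is a guess at the shape only, since the follow-up is not part of this PR: `ToyRegistry` and its `u64` ids are hypothetical, and real code would also need to handle concurrent access.

```rust
use std::collections::HashMap;

/// Hypothetical registry with just the two maps discussed above.
#[derive(Default)]
pub struct ToyRegistry {
    pub url_to_id: HashMap<String, u64>,
    pub workers: HashMap<u64, String>, // id -> registered URL
}

impl ToyRegistry {
    /// Drop the `url_to_id` entry for `url` only when no worker is registered
    /// under its id. Returns true when an entry was removed.
    pub fn release_reservation(&mut self, url: &str) -> bool {
        let reserved_only = match self.url_to_id.get(url) {
            Some(id) => !self.workers.contains_key(id),
            None => false,
        };
        if reserved_only {
            self.url_to_id.remove(url);
        }
        reserved_only
    }
}
```

Because the entry is removed only when its id has no worker, calling this from an error path cannot evict a registration that completed in the meantime.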

Test summary

  • cargo test -p smg --lib worker::service::tests:: — 7/7 new tests passing (covering each accept and reject path of validate_worker_url_request, including verification that rejections surface as WorkerServiceError::BadRequest with StatusCode::BAD_REQUEST and the expected error message text).
  • cargo test -p smg --lib config::validation::tests:: — 23/23 pre-existing config-side tests passing after the helper extraction; test_validate_invalid_urls still exercises the same accept/reject rules via the new shared path.
  • cargo +nightly fmt --check clean.
  • cargo clippy -p smg --all-targets --all-features -- -D warnings clean.
Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • Bug Fixes

    • Worker URL validation now enforces stricter requirements during creation, returning HTTP 400 errors for invalid URLs.
    • Validation rejects empty URLs, unsupported schemes, unparsable URLs, and missing hosts.
    • Accepted schemes are http, https, grpc, and grpcs (matched case-insensitively).
  • Tests

    • Added unit tests covering URL validation scenarios and error cases.

`WorkerService::create_worker` calls `reserve_id_for_url(config.url)`
before the AddWorker workflow runs, then returns a 202 with that
WorkerId in the Location header. The reservation key is the submitted
URL string; the workflow's `CreateWorkerStep` rewrites the URL via
`normalize_url`, which prepends `http://` or `grpc://` for any input
that doesn't already start with one of `http://`, `https://`, `grpc://`,
or `grpcs://`. When normalization changes the string, the reservation
is keyed on the bare URL while the live worker is registered under the
canonical URL — two `url_to_id` entries, the bare one orphaned forever,
and the WorkerId returned in the 202 points at nothing. The client polls
its Location and gets 404.

Validate the URL at the API boundary so the orphan can't be created
through this path. Extract the existing scheme + host validation in
`config::validation::ConfigValidator::validate_urls` into a free
`pub(crate) fn validate_worker_url(url: &str) -> Result<(), String>`
and have both the static-config validator and the new service-layer
wrapper call it. The service-layer wrapper returns
`WorkerServiceError::BadRequest` so the API returns 400 with a clear
message instead of 202 -> 404.

This does not address workflow-failure orphans (the reservation also
leaks when the AddWorker workflow itself errors after a successful
submission). That class needs a `release_reservation` lifecycle on
the registry and is tracked as a separate follow-up.

Signed-off-by: Chang Su <8605658+CatherineSue@users.noreply.github.com>
@CatherineSue CatherineSue requested a review from slin1237 as a code owner May 24, 2026 18:34

coderabbitai Bot commented May 24, 2026

Walkthrough

This PR extracts worker URL validation into a reusable helper function at the config layer, has config validation delegate to it and return ConfigError::InvalidValue errors, and adds service-layer request validation that converts validation failures into WorkerServiceError::BadRequest (HTTP 400) responses. Unit tests cover the accept and reject paths.

Changes

Worker URL validation extraction and integration

| Layer / File(s) | Summary |
| --- | --- |
| Core worker URL validation function (model_gateway/src/config/validation.rs) | New validate_worker_url function rejects empty URLs, enforces allowed schemes (http, https, grpc, grpcs; case-insensitive), rejects unparsable URLs and parsed URLs without a valid host; returns descriptive string errors. |
| Config validation integration (model_gateway/src/config/validation.rs) | validate_urls loop now delegates to validate_worker_url and wraps string errors into ConfigError::InvalidValue with field: "worker_url" and the original URL. |
| Service request validation and tests (model_gateway/src/worker/service.rs) | Service layer imports the validation function, adds validate_worker_url_request wrapper that converts validation failures to WorkerServiceError::BadRequest, integrates validation into create_worker, and includes tests covering accepted schemes, case-insensitivity, and rejection of bare host/port, empty string, unknown scheme, missing host, and unparsable URLs. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • lightseekorg/smg#1485: Both PRs modify model_gateway/src/config/validation.rs to change how worker_url schemes are accepted and validated (including grpcs:// and gRPC handling), so this PR's new validate_worker_url logic builds on the same validation concerns.

Suggested labels

tests, model-gateway

Suggested reviewers

  • slin1237
  • key4ng

Poem

🐰 A URL takes shape, schemes get validated clear,
From config to service, the checks travel near,
Empty strings rejected, grpc/https aligned,
With tests holding firm what the logic designed!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the main change: rejecting schemeless worker URLs in the POST /workers endpoint, which directly addresses the bug fixed in this PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 92.31% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@github-actions github-actions Bot added the model-gateway Model gateway crate changes label May 24, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors worker URL validation by extracting the logic into a reusable validate_worker_url function within the configuration module. This function is now utilized in the worker service to validate URLs during worker creation, ensuring that invalid inputs result in a 400 Bad Request error. The changes also include a comprehensive suite of unit tests covering various URL validation scenarios, such as scheme checks and host parsing. I have no feedback to provide as there were no review comments.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@model_gateway/src/worker/service.rs`:
- Around line 445-497: Add a service-level regression test that calls
WorkerService::create_worker with the bare host string "10.0.0.5:8000" and
asserts it returns a BadRequest (same failure class as
validate_worker_url_request); wire WorkerService with test doubles/mocks for the
ID reservation and job-queue submission used by create_worker and assert those
mocks were NOT invoked (i.e., no ID reservation and no queue submission
occurred) to ensure invalid URLs fail before side effects.

📥 Commits

Reviewing files that changed from the base of the PR and between 0f30a41 and 5dd82de.

📒 Files selected for processing (2)
  • model_gateway/src/config/validation.rs
  • model_gateway/src/worker/service.rs

Comment on lines +445 to +497
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn validate_worker_url_request_accepts_all_four_schemes() {
        assert!(validate_worker_url_request("http://10.0.0.5:8000").is_ok());
        assert!(validate_worker_url_request("https://10.0.0.5:8000").is_ok());
        assert!(validate_worker_url_request("grpc://10.0.0.5:8000").is_ok());
        assert!(validate_worker_url_request("grpcs://10.0.0.5:8000").is_ok());
    }

    #[test]
    fn validate_worker_url_request_accepts_case_insensitive_schemes() {
        assert!(validate_worker_url_request("HTTP://10.0.0.5:8000").is_ok());
        assert!(validate_worker_url_request("GrPc://10.0.0.5:8000").is_ok());
    }

    #[test]
    fn validate_worker_url_request_rejects_bare_host_port_as_400() {
        let err = validate_worker_url_request("10.0.0.5:8000").unwrap_err();
        assert!(matches!(err, WorkerServiceError::BadRequest { .. }));
        assert_eq!(err.status_code(), StatusCode::BAD_REQUEST);
        assert!(err
            .to_string()
            .contains("http://, https://, grpc://, or grpcs://"));
    }

    #[test]
    fn validate_worker_url_request_rejects_empty_as_400() {
        let err = validate_worker_url_request("").unwrap_err();
        assert!(matches!(err, WorkerServiceError::BadRequest { .. }));
        assert!(err.to_string().contains("empty"));
    }

    #[test]
    fn validate_worker_url_request_rejects_unknown_scheme() {
        let err = validate_worker_url_request("ftp://10.0.0.5:8000").unwrap_err();
        assert!(matches!(err, WorkerServiceError::BadRequest { .. }));
    }

    #[test]
    fn validate_worker_url_request_rejects_missing_host() {
        let err = validate_worker_url_request("http://").unwrap_err();
        assert!(matches!(err, WorkerServiceError::BadRequest { .. }));
    }

    #[test]
    fn validate_worker_url_request_rejects_unparsable_url() {
        let err = validate_worker_url_request("http://[invalid").unwrap_err();
        assert!(matches!(err, WorkerServiceError::BadRequest { .. }));
    }
}

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add a service-level regression test for create_worker side effects on invalid URLs.

These tests validate the wrapper, but they don’t assert the core regression behavior in WorkerService::create_worker: invalid URLs should fail before reservation and queue submission. Please add a test that calls create_worker("10.0.0.5:8000") and verifies no ID reservation/job submission occurs.
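One way such a regression test could be structured is sketched below. It assumes `create_worker` can be given seams for its two side effects. The `IdReserver` and `JobQueue` traits, the `CallCounter` double, and the free-function form of `create_worker` are all hypothetical and do not reflect how `WorkerService` is actually wired. The validator here is a scheme-only stand-in.

```rust
use std::cell::Cell;

#[derive(Debug, PartialEq)]
pub enum WorkerServiceError {
    BadRequest { message: String },
}

/// Hypothetical seams for the two side effects of `create_worker`.
pub trait IdReserver {
    fn reserve_id_for_url(&self, url: &str) -> u64;
}
pub trait JobQueue {
    fn submit_add_worker(&self, url: &str);
}

/// Test double that only counts how often it was called.
#[derive(Default)]
pub struct CallCounter {
    pub calls: Cell<u32>,
}
impl IdReserver for CallCounter {
    fn reserve_id_for_url(&self, _url: &str) -> u64 {
        self.calls.set(self.calls.get() + 1);
        1
    }
}
impl JobQueue for CallCounter {
    fn submit_add_worker(&self, _url: &str) {
        self.calls.set(self.calls.get() + 1);
    }
}

/// Scheme-only stand-in for the real validator.
fn validate_worker_url_request(url: &str) -> Result<(), WorkerServiceError> {
    let lower = url.to_ascii_lowercase();
    let has_scheme = ["http://", "https://", "grpc://", "grpcs://"]
        .iter()
        .any(|s| lower.starts_with(*s));
    if has_scheme {
        Ok(())
    } else {
        Err(WorkerServiceError::BadRequest {
            message: format!("Worker URL '{url}' is invalid"),
        })
    }
}

/// Validation is the first statement, so invalid input returns before the
/// reserver or the queue is touched.
pub fn create_worker(
    url: &str,
    reserver: &dyn IdReserver,
    queue: &dyn JobQueue,
) -> Result<u64, WorkerServiceError> {
    validate_worker_url_request(url)?;
    let id = reserver.reserve_id_for_url(url);
    queue.submit_add_worker(url);
    Ok(id)
}
```

A test would call `create_worker("10.0.0.5:8000", &reserver, &queue)`, assert the `BadRequest` variant, and then assert that both counters are still zero.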

