Merged
32 changes: 24 additions & 8 deletions README.md
@@ -20,6 +20,8 @@ replayable scheduling traces, and canary/shadow release decisions.
outcomes.
- Backend mirror normalization for vLLM/SGLang-style serving observations
before the release gate runs.
- Streaming token-event normalization with route and scheduler provenance for
mirrored serving traces.
- Exact output checks, model-aware numeric tolerances for backend drift,
per-segment release summaries, error-rate deltas, p95 latency regression
policy, TTFT and decode-token p95 checks, KV memory-pressure reporting,
@@ -54,6 +56,10 @@ cargo run --release -- gate \
cargo run --release -- mirror-gate \
--input fixtures/backend_mirror_vllm_sglang.json \
--output artifacts/backend-mirror-report.json

cargo run --release -- mirror-gate \
--input fixtures/backend_mirror_streaming_vllm_sglang.json \
--output artifacts/backend-mirror-streaming-report.json
```

The safe fixture produces `promote`. The candidate with an output mismatch and
@@ -64,6 +70,11 @@ The backend-mirror fixture converts vLLM/SGLang-style request observations into
the same release gate and produces `promote` with a vLLM to SGLang segment,
model-version transition metadata, queue depth, KV memory pressure, TTFT, and
decode-token p95 telemetry.
The streaming mirror fixture uses per-token stream events instead of compact
token arrays and requires complete candidate route and scheduler provenance. It
produces `promote` with `candidate_routing_provenance_rate: 1.0`,
`candidate_streaming_trace_rate: 1.0`, two candidate routes, and
`continuous-batching` scheduler evidence.

The checked workload fixture completes four requests in 11 scheduler ticks,
accounts for 224 prompt tokens, 18 decode tokens, and 18 reserved KV pages,
@@ -108,14 +119,18 @@ gate input. `runtime-lab mirror-gate` performs the conversion and immediately
evaluates the release policy.

The adapter accepts per-request latency, health, model, backend, accelerator,
output token IDs, explicit output fingerprints, and optional numeric output
vectors. Successful observations must carry output material so correctness
checks remain auditable. Token IDs and numeric vectors are converted into
stable FNV-1a fingerprints when an engine-specific fingerprint is not supplied.
Observations may also carry model version, queue depth, KV page usage, TTFT,
decode-token latencies, and token-trace fingerprints. Those fields let the gate
surface rollout context and hold a candidate when latency or memory-pressure
telemetry crosses policy even if output correctness is intact.
output token IDs, explicit output fingerprints, optional numeric output
vectors, and optional streaming token events. Successful observations must
carry output material so correctness checks remain auditable. Token IDs,
streaming token events, and numeric vectors are converted into stable FNV-1a
fingerprints when an engine-specific fingerprint is not supplied. Observations
may also carry model version, route ID, replica ID, scheduler policy, queue
depth, KV page usage, TTFT, decode-token latencies, and token-trace
fingerprints. When per-token stream events are provided, the adapter derives
TTFT and decode-token gaps from their elapsed timestamps. Those fields let the
gate surface rollout context and hold a candidate when latency, memory
pressure, routing provenance, or streaming trace coverage falls outside policy,
even if output correctness is intact.

## Release Policy

@@ -134,6 +149,7 @@ hint, and the next investigation action.
| Error-rate increase above policy | `rollback` |
| p95 latency regression above policy | `hold` |
| TTFT, decode-token p95, or memory-pressure regression above policy | `hold` |
| Missing required candidate route/scheduler or streaming-token evidence | `hold` |
| Missing or insufficient matched traffic | `hold` |
| Complete evidence within policy | `promote` |
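
The rows shown above can be read as a small decision function. The sketch below is an illustration only: `Decision`, `PolicySignals`, and `decide` are hypothetical names rather than the crate's API, it covers only the rows visible in this hunk, and it assumes that a `rollback` condition takes precedence over any `hold` condition.

```rust
/// Release decisions named in the policy table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Promote,
    Hold,
    Rollback,
}

/// Hypothetical summary of which policy checks failed.
/// Field names are illustrative and not taken from the crate.
#[derive(Debug, Clone, Copy, Default)]
pub struct PolicySignals {
    pub error_rate_above_policy: bool,
    pub p95_latency_above_policy: bool,
    pub token_path_or_memory_above_policy: bool,
    pub missing_required_provenance_or_stream: bool,
    pub insufficient_matched_traffic: bool,
}

/// Map failed checks to a decision.
/// Assumption: rollback conditions are evaluated before hold conditions.
pub fn decide(signals: PolicySignals) -> Decision {
    if signals.error_rate_above_policy {
        return Decision::Rollback;
    }
    let hold = signals.p95_latency_above_policy
        || signals.token_path_or_memory_above_policy
        || signals.missing_required_provenance_or_stream
        || signals.insufficient_matched_traffic;
    if hold {
        Decision::Hold
    } else {
        Decision::Promote
    }
}
```

Under these assumptions, a candidate that only lacks required route/scheduler evidence yields `Decision::Hold`, while one that also raises the error rate above policy yields `Decision::Rollback`.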

10 changes: 9 additions & 1 deletion artifacts/backend-mirror-report.json
@@ -1,5 +1,5 @@
{
"schema_version": 3,
"schema_version": 4,
"decision": "promote",
"matched_requests": 4,
"baseline_requests": 4,
@@ -25,6 +25,10 @@
"decode_token_p95_regression_pct": -6.666667,
"max_candidate_queue_depth": 6,
"max_candidate_memory_pressure_pct": 60.0,
"candidate_routing_provenance_rate": 0.0,
"candidate_streaming_trace_rate": 0.0,
"candidate_route_count": 0,
"candidate_scheduler_policies": [],
"token_trace_pairs": 4,
"token_trace_mismatch_rate": 0.0,
"segments": [
@@ -50,6 +54,10 @@
"decode_token_p95_regression_pct": -6.666667,
"max_candidate_queue_depth": 6,
"max_candidate_memory_pressure_pct": 60.0,
"candidate_routing_provenance_rate": 0.0,
"candidate_streaming_trace_rate": 0.0,
"candidate_route_count": 0,
"candidate_scheduler_policies": [],
"token_trace_pairs": 4,
"token_trace_mismatch_rate": 0.0
}
73 changes: 73 additions & 0 deletions artifacts/backend-mirror-streaming-report.json
@@ -0,0 +1,73 @@
{
"schema_version": 4,
"decision": "promote",
"matched_requests": 4,
"baseline_requests": 4,
"candidate_requests": 4,
"coverage_rate": 1.0,
"output_mismatch_rate": 0.0,
"numeric_pairs": 0,
"tolerated_numeric_outputs": 0,
"numeric_drift_rate": 0.0,
"max_numeric_abs_error": null,
"max_numeric_rel_error": null,
"baseline_error_rate": 0.0,
"candidate_error_rate": 0.0,
"error_rate_increase": 0.0,
"baseline_p95_latency_ms": 28.0,
"candidate_p95_latency_ms": 27.2,
"p95_latency_regression_pct": -2.857143,
"baseline_p95_ttft_ms": 9.0,
"candidate_p95_ttft_ms": 8.5,
"ttft_regression_pct": -5.555556,
"baseline_decode_token_p95_ms": 7.0,
"candidate_decode_token_p95_ms": 6.5,
"decode_token_p95_regression_pct": -7.142857,
"max_candidate_queue_depth": 6,
"max_candidate_memory_pressure_pct": 60.0,
"candidate_routing_provenance_rate": 1.0,
"candidate_streaming_trace_rate": 1.0,
"candidate_route_count": 2,
"candidate_scheduler_policies": [
"continuous-batching"
],
"token_trace_pairs": 4,
"token_trace_mismatch_rate": 0.0,
"segments": [
{
"model": "decoder-7b",
"baseline_backend": "vllm",
"candidate_backend": "sglang",
"accelerator": "h100",
"baseline_model_version": "decoder-7b@baseline-2026-06-28",
"candidate_model_version": "decoder-7b@candidate-2026-06-28",
"matched_requests": 4,
"output_mismatch_rate": 0.0,
"baseline_error_rate": 0.0,
"candidate_error_rate": 0.0,
"baseline_p95_latency_ms": 28.0,
"candidate_p95_latency_ms": 27.2,
"p95_latency_regression_pct": -2.857143,
"baseline_p95_ttft_ms": 9.0,
"candidate_p95_ttft_ms": 8.5,
"ttft_regression_pct": -5.555556,
"baseline_decode_token_p95_ms": 7.0,
"candidate_decode_token_p95_ms": 6.5,
"decode_token_p95_regression_pct": -7.142857,
"max_candidate_queue_depth": 6,
"max_candidate_memory_pressure_pct": 60.0,
"candidate_routing_provenance_rate": 1.0,
"candidate_streaming_trace_rate": 1.0,
"candidate_route_count": 2,
"candidate_scheduler_policies": [
"continuous-batching"
],
"token_trace_pairs": 4,
"token_trace_mismatch_rate": 0.0
}
],
"triage": [],
"reasons": [
"candidate stayed within correctness, reliability, latency, and telemetry policy"
]
}
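
The `*_regression_pct` fields in this report are numerically consistent with a plain relative change against the baseline: for example, 28.0 ms to 27.2 ms gives about -2.857143. The sketch below shows that arithmetic. The formula is inferred from the reported numbers rather than from the crate source, and `regression_pct` is a hypothetical name. Returning `None` for an unusable baseline loosely mirrors the `null` values that other reports carry when telemetry is absent.

```rust
/// Relative change of `candidate` against `baseline`, in percent.
/// Negative values mean the candidate is faster than the baseline.
/// Assumption: this matches how the report's `*_regression_pct` fields
/// are derived; the source only shows the resulting numbers.
pub fn regression_pct(baseline: f64, candidate: f64) -> Option<f64> {
    // A missing, non-finite, or non-positive baseline has no meaningful ratio.
    if !baseline.is_finite() || !candidate.is_finite() || baseline <= 0.0 {
        return None;
    }
    Some((candidate - baseline) / baseline * 100.0)
}
```

With the values in this report, `regression_pct(9.0, 8.5)` is approximately -5.555556 and `regression_pct(7.0, 6.5)` is approximately -7.142857.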
10 changes: 9 additions & 1 deletion artifacts/release-gate-numeric-tolerance.json
@@ -1,5 +1,5 @@
{
"schema_version": 3,
"schema_version": 4,
"decision": "promote",
"matched_requests": 4,
"baseline_requests": 4,
@@ -25,6 +25,10 @@
"decode_token_p95_regression_pct": null,
"max_candidate_queue_depth": null,
"max_candidate_memory_pressure_pct": null,
"candidate_routing_provenance_rate": 0.0,
"candidate_streaming_trace_rate": 0.0,
"candidate_route_count": 0,
"candidate_scheduler_policies": [],
"token_trace_pairs": 0,
"token_trace_mismatch_rate": 0.0,
"segments": [
@@ -50,6 +54,10 @@
"decode_token_p95_regression_pct": null,
"max_candidate_queue_depth": null,
"max_candidate_memory_pressure_pct": null,
"candidate_routing_provenance_rate": 0.0,
"candidate_streaming_trace_rate": 0.0,
"candidate_route_count": 0,
"candidate_scheduler_policies": [],
"token_trace_pairs": 0,
"token_trace_mismatch_rate": 0.0
}
10 changes: 9 additions & 1 deletion artifacts/release-gate-promote.json
@@ -1,5 +1,5 @@
{
"schema_version": 3,
"schema_version": 4,
"decision": "promote",
"matched_requests": 4,
"baseline_requests": 4,
@@ -25,6 +25,10 @@
"decode_token_p95_regression_pct": null,
"max_candidate_queue_depth": null,
"max_candidate_memory_pressure_pct": null,
"candidate_routing_provenance_rate": 0.0,
"candidate_streaming_trace_rate": 0.0,
"candidate_route_count": 0,
"candidate_scheduler_policies": [],
"token_trace_pairs": 0,
"token_trace_mismatch_rate": 0.0,
"segments": [
@@ -50,6 +54,10 @@
"decode_token_p95_regression_pct": null,
"max_candidate_queue_depth": null,
"max_candidate_memory_pressure_pct": null,
"candidate_routing_provenance_rate": 0.0,
"candidate_streaming_trace_rate": 0.0,
"candidate_route_count": 0,
"candidate_scheduler_policies": [],
"token_trace_pairs": 0,
"token_trace_mismatch_rate": 0.0
}
10 changes: 9 additions & 1 deletion artifacts/release-gate-rollback.json
@@ -1,5 +1,5 @@
{
"schema_version": 3,
"schema_version": 4,
"decision": "rollback",
"matched_requests": 4,
"baseline_requests": 4,
@@ -25,6 +25,10 @@
"decode_token_p95_regression_pct": null,
"max_candidate_queue_depth": null,
"max_candidate_memory_pressure_pct": null,
"candidate_routing_provenance_rate": 0.0,
"candidate_streaming_trace_rate": 0.0,
"candidate_route_count": 0,
"candidate_scheduler_policies": [],
"token_trace_pairs": 0,
"token_trace_mismatch_rate": 0.0,
"segments": [
@@ -50,6 +54,10 @@
"decode_token_p95_regression_pct": null,
"max_candidate_queue_depth": null,
"max_candidate_memory_pressure_pct": null,
"candidate_routing_provenance_rate": 0.0,
"candidate_streaming_trace_rate": 0.0,
"candidate_route_count": 0,
"candidate_scheduler_policies": [],
"token_trace_pairs": 0,
"token_trace_mismatch_rate": 0.0
}
26 changes: 15 additions & 11 deletions docs/ARCHITECTURE.md
@@ -48,13 +48,15 @@ It computes:
- successful-request p95 latency;
- candidate p95 regression;
- TTFT p95 and decode-token p95 regression;
- candidate queue depth and KV memory pressure; and
- candidate queue depth and KV memory pressure;
- candidate route/scheduler provenance and streaming trace coverage; and
- segment summaries by model, model version, baseline backend, candidate
backend, and accelerator.

Correctness, numeric drift, or reliability regressions produce `rollback`.
Latency, token-path, memory-pressure regressions, or incomplete evidence produce
`hold`. A complete candidate within policy produces `promote`.
Latency, token-path, or memory-pressure regressions, missing routing provenance,
missing streaming trace coverage, or incomplete evidence produce `hold`. A
complete candidate within policy produces `promote`.
Hold and rollback reports include structured triage items so CI or rollout
tooling can route the failed signal to a likely owner without parsing prose.

@@ -68,11 +70,13 @@ The adapter sits before the release gate. It normalizes backend-specific
mirrored observations into `GateInput` without changing the gate policy. This
keeps ingestion concerns separate from rollout decisions.

The adapter currently accepts compact vLLM/SGLang-style request summaries:
request ID, latency, health, model, backend, accelerator, output token IDs,
optional explicit fingerprints, optional numeric vectors, model version, queue
depth, KV page usage, TTFT, decode-token latencies, and optional token-trace
fingerprints. If an engine does not provide an output fingerprint, the adapter
computes a stable FNV-1a fingerprint from token IDs or numeric values.
Successful observations without output material are rejected so a candidate
cannot be promoted from latency-only evidence.
The adapter currently accepts compact vLLM/SGLang-style request summaries and
streaming request traces: request ID, latency, health, model, backend,
accelerator, output token IDs, streaming token events, optional explicit
fingerprints, optional numeric vectors, model version, route ID, replica ID,
scheduler policy, queue depth, KV page usage, TTFT, decode-token latencies, and
optional token-trace fingerprints. If an engine does not provide an output
fingerprint, the adapter computes a stable FNV-1a fingerprint from token IDs,
streaming token events, or numeric values. Successful observations without
output material are rejected so a candidate cannot be promoted from
latency-only evidence.
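
As a rough illustration of the fingerprinting step, the sketch below hashes output token IDs with 64-bit FNV-1a. The docs name FNV-1a but not the hash width or the byte encoding of token IDs, so the 64-bit variant, the little-endian `u32` encoding, and the function names `fnv1a_64` and `fingerprint_token_ids` are all assumptions made for the example.

```rust
// Standard 64-bit FNV-1a parameters.
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// 64-bit FNV-1a over a byte slice: XOR each byte in, then multiply by the prime.
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    bytes.iter().fold(FNV_OFFSET_BASIS, |hash, &byte| {
        (hash ^ u64::from(byte)).wrapping_mul(FNV_PRIME)
    })
}

/// Fingerprint a token-ID sequence.
/// Assumption: each ID is fed to the hash as four little-endian bytes.
pub fn fingerprint_token_ids(token_ids: &[u32]) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    for id in token_ids {
        for byte in id.to_le_bytes() {
            hash = (hash ^ u64::from(byte)).wrapping_mul(FNV_PRIME);
        }
    }
    hash
}
```

Because the hash is order-sensitive and fixes the byte encoding, two backends that emit the same token IDs in the same order produce the same fingerprint regardless of platform, which is the property a stable fingerprint needs for baseline/candidate comparison.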
33 changes: 22 additions & 11 deletions docs/RELEASE_VALIDATION.md
@@ -12,6 +12,8 @@ Use `promote` when:
- the candidate error-rate increase stays within policy;
- candidate p95 latency, TTFT p95, decode-token p95, and KV memory pressure
stay within the configured regression budgets; and
- required candidate route/scheduler provenance and streaming token traces are
complete when those checks are enabled.

## Hold

@@ -21,6 +23,9 @@ output pairs, or a p95 latency regression without a correctness failure.
The same response is used for excessive TTFT regression, decode-token p95
regression, or candidate KV memory pressure because those are operational
signals that need investigation before rollout.
If route/scheduler provenance or streaming token traces are required but
missing, the gate also returns `hold`; the candidate may still be correct, but
the rollout evidence is not complete enough to trust the serving path.

## Rollback

@@ -54,17 +59,20 @@ The report includes:
Mirrored observations can include rollout context and token-path telemetry:

- model version;
- route ID, replica ID, and scheduler policy;
- queue depth;
- KV pages used and available;
- time to first token;
- per-token decode latencies; and
- token-trace fingerprints.
- streaming token events and token-trace fingerprints.

The gate reports aggregate and per-segment TTFT p95, decode-token p95, maximum
candidate queue depth, maximum candidate memory pressure, and token-trace
mismatch rate. Correct outputs with excessive latency or memory pressure produce
`hold`, not `rollback`, because the evidence points to performance or capacity
risk rather than a correctness failure.
candidate queue depth, maximum candidate memory pressure, candidate
route/scheduler provenance coverage, candidate streaming-trace coverage,
candidate route count, scheduler policies, and token-trace mismatch rate.
Correct outputs with excessive latency, memory pressure, missing provenance, or
missing streaming traces produce `hold`, not `rollback`, because the evidence
points to operational risk rather than a correctness failure.
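
One way the coverage fields could be computed is sketched below. The struct and function names are hypothetical, and the definition of "complete" provenance (route ID, replica ID, and scheduler policy all present) is an assumption; the docs list those fields but do not say which of them the coverage rate requires.

```rust
use std::collections::BTreeSet;

/// Hypothetical candidate observation; field names are illustrative.
#[derive(Debug, Clone, Default)]
pub struct CandidateObservation {
    pub route_id: Option<String>,
    pub replica_id: Option<String>,
    pub scheduler_policy: Option<String>,
    pub stream_event_count: usize,
}

/// Hypothetical summary mirroring the report's provenance fields.
#[derive(Debug, PartialEq)]
pub struct ProvenanceSummary {
    pub routing_provenance_rate: f64,
    pub streaming_trace_rate: f64,
    pub route_count: usize,
    pub scheduler_policies: Vec<String>,
}

pub fn summarize(observations: &[CandidateObservation]) -> ProvenanceSummary {
    let total = observations.len();
    let mut routes = BTreeSet::new();
    let mut policies = BTreeSet::new();
    let mut with_provenance = 0usize;
    let mut with_stream = 0usize;
    for obs in observations {
        if let Some(route) = &obs.route_id {
            routes.insert(route.clone());
        }
        if let Some(policy) = &obs.scheduler_policy {
            policies.insert(policy.clone());
        }
        // Assumption: provenance counts as complete only with all three fields.
        if obs.route_id.is_some() && obs.replica_id.is_some() && obs.scheduler_policy.is_some() {
            with_provenance += 1;
        }
        if obs.stream_event_count > 0 {
            with_stream += 1;
        }
    }
    // An empty input reports 0.0 rather than dividing by zero.
    let rate = |n: usize| if total == 0 { 0.0 } else { n as f64 / total as f64 };
    ProvenanceSummary {
        routing_provenance_rate: rate(with_provenance),
        streaming_trace_rate: rate(with_stream),
        route_count: routes.len(),
        // BTreeSet iteration yields the distinct policies in sorted order.
        scheduler_policies: policies.into_iter().collect(),
    }
}
```

A gate that requires these checks could then hold whenever either rate is below its configured minimum, which keeps the evidence check separate from the correctness and latency checks.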

## Triage Output

@@ -87,11 +95,13 @@ baseline/candidate comparisons such as vLLM versus SGLang, or a current
runtime versus a candidate runtime behind shadow traffic.

Each observation records request ID, latency, health, model, backend,
accelerator, output material, and optional operational telemetry. Engines may provide their own
`output_fingerprint`; otherwise the adapter hashes output token IDs or numeric
output vectors with a stable FNV-1a fingerprint. Successful observations
without output material are rejected because the release gate cannot audit
correctness from latency alone.
accelerator, output material, and optional operational telemetry. Engines may
provide their own `output_fingerprint`; otherwise the adapter hashes output
token IDs, streaming token events, or numeric output vectors with a stable
FNV-1a fingerprint. Streaming token events also let the adapter derive TTFT and
decode-token gaps from elapsed timestamps. Successful observations without
output material are rejected because the release gate cannot audit correctness
from latency alone.
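
Deriving TTFT and decode-token gaps from stream events can be sketched as follows. `TokenEvent`, `StreamTiming`, and `derive_timing` are hypothetical names, and the sketch assumes each event carries an `elapsed_ms` timestamp measured from the start of the request; the actual event schema is not shown in this change.

```rust
/// Hypothetical per-token stream event.
/// Assumption: `elapsed_ms` is measured from the start of the request.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct TokenEvent {
    pub token_id: u32,
    pub elapsed_ms: f64,
}

/// Timing derived from one request's stream.
#[derive(Debug, Clone, PartialEq)]
pub struct StreamTiming {
    pub ttft_ms: f64,
    pub decode_gaps_ms: Vec<f64>,
}

/// TTFT is the first event's timestamp; each decode gap is the difference
/// between consecutive timestamps. Returns `None` for an empty stream or
/// for timestamps that go backwards.
pub fn derive_timing(events: &[TokenEvent]) -> Option<StreamTiming> {
    let first = events.first()?;
    if events.windows(2).any(|pair| pair[1].elapsed_ms < pair[0].elapsed_ms) {
        return None;
    }
    let decode_gaps_ms = events
        .windows(2)
        .map(|pair| pair[1].elapsed_ms - pair[0].elapsed_ms)
        .collect();
    Some(StreamTiming {
        ttft_ms: first.elapsed_ms,
        decode_gaps_ms,
    })
}
```

For events at 8.5 ms, 14.0 ms, and 20.5 ms, this yields a TTFT of 8.5 ms and decode gaps of 5.5 ms and 6.5 ms. Rejecting out-of-order timestamps is a design choice in this sketch, so that a malformed trace cannot produce negative gaps that would understate decode latency.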

## Production Extension Points

Expand All @@ -101,7 +111,8 @@ A real rollout system should add:
- prompt-class and region segmentation;
- SLO burn-rate and saturation signals;
- canary population controls and audited rollback execution; and
- provenance linking every decision to build, model, and configuration IDs.
- provenance linking every decision to build, model, route, scheduler, and
configuration IDs.

The checked fixtures are synthetic and exist to make the policy executable in
CI. They are not claims about production traffic or fleet scale.