WaffleBits · Jun 28, 2026
diff --git a/README.md b/README.md
@@ -20,6 +20,8 @@ replayable scheduling traces, and canary/shadow release decisions.
   outcomes.
 - Backend mirror normalization for vLLM/SGLang-style serving observations
   before the release gate runs.
+- Streaming token-event normalization with route and scheduler provenance for
+  mirrored serving traces.
 - Exact output checks, model-aware numeric tolerances for backend drift,
   per-segment release summaries, error-rate deltas, p95 latency regression
   policy, TTFT and decode-token p95 checks, KV memory-pressure reporting,
@@ -54,6 +56,10 @@ cargo run --release -- gate \
 cargo run --release -- mirror-gate \
   --input fixtures/backend_mirror_vllm_sglang.json \
   --output artifacts/backend-mirror-report.json
+
+cargo run --release -- mirror-gate \
+  --input fixtures/backend_mirror_streaming_vllm_sglang.json \
+  --output artifacts/backend-mirror-streaming-report.json
 ```
 
 The safe fixture produces `promote`. The candidate with an output mismatch and
@@ -64,6 +70,11 @@ The backend-mirror fixture converts vLLM/SGLang-style request observations into
 the same release gate and produces `promote` with a vLLM to SGLang segment,
 model-version transition metadata, queue depth, KV memory pressure, TTFT, and
 decode-token p95 telemetry.
+The streaming mirror fixture uses per-token stream events instead of compact
+token arrays and requires complete candidate route and scheduler provenance. It
+produces `promote` with `candidate_routing_provenance_rate: 1.0`,
+`candidate_streaming_trace_rate: 1.0`, two candidate routes, and
+`continuous-batching` scheduler evidence.
 
 The checked workload fixture completes four requests in 11 scheduler ticks,
 accounts for 224 prompt tokens, 18 decode tokens, and 18 reserved KV pages,
@@ -108,14 +119,18 @@ gate input. `runtime-lab mirror-gate` performs the conversion and immediately
 evaluates the release policy.
 
 The adapter accepts per-request latency, health, model, backend, accelerator,
-output token IDs, explicit output fingerprints, and optional numeric output
-vectors. Successful observations must carry output material so correctness
-checks remain auditable. Token IDs and numeric vectors are converted into
-stable FNV-1a fingerprints when an engine-specific fingerprint is not supplied.
-Observations may also carry model version, queue depth, KV page usage, TTFT,
-decode-token latencies, and token-trace fingerprints. Those fields let the gate
-surface rollout context and hold a candidate when latency or memory-pressure
-telemetry crosses policy even if output correctness is intact.
+output token IDs, explicit output fingerprints, optional numeric output
+vectors, and optional streaming token events. Successful observations must
+carry output material so correctness checks remain auditable. Token IDs,
+streaming token events, and numeric vectors are converted into stable FNV-1a
+fingerprints when an engine-specific fingerprint is not supplied. Observations
+may also carry model version, route ID, replica ID, scheduler policy, queue
+depth, KV page usage, TTFT, decode-token latencies, and token-trace
+fingerprints. When per-token stream events are provided, the adapter derives
+TTFT and decode-token gaps from their elapsed timestamps. Those fields let the
+gate surface rollout context and hold a candidate when latency or memory
+pressure exceeds policy, or when required routing provenance or streaming trace
+coverage is missing, even if output correctness is intact.
 
 ## Release Policy
 
@@ -134,6 +149,7 @@ hint, and the next investigation action.
 | Error-rate increase above policy | `rollback` |
 | p95 latency regression above policy | `hold` |
 | TTFT, decode-token p95, or memory-pressure regression above policy | `hold` |
+| Missing required candidate route/scheduler or streaming-token evidence | `hold` |
 | Missing or insufficient matched traffic | `hold` |
 | Complete evidence within policy | `promote` |
 

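The rows of the release-policy table visible in this hunk can be read as a small decision function. The following is a minimal sketch, not the project's implementation: the `Findings` struct, its field names, and the rule that `rollback` takes precedence over `hold` are assumptions made for illustration, and only the rows shown in the hunk are modeled.

```rust
/// Release decisions named in the policy table.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Decision {
    Promote,
    Hold,
    Rollback,
}

/// Hypothetical flattened view of what the gate found; field names are
/// illustrative and do not come from the project's report schema.
#[derive(Debug, Default, Clone, Copy)]
pub struct Findings {
    pub error_rate_above_policy: bool,
    pub p95_latency_above_policy: bool,
    pub ttft_decode_or_memory_above_policy: bool,
    pub required_route_or_stream_evidence_missing: bool,
    pub insufficient_matched_traffic: bool,
}

/// Maps findings to a decision. Rollback is checked first (an assumption),
/// then every hold condition; anything else promotes.
pub fn decide(f: Findings) -> Decision {
    if f.error_rate_above_policy {
        return Decision::Rollback;
    }
    let hold = f.p95_latency_above_policy
        || f.ttft_decode_or_memory_above_policy
        || f.required_route_or_stream_evidence_missing
        || f.insufficient_matched_traffic;
    if hold {
        Decision::Hold
    } else {
        Decision::Promote
    }
}
```

With all fields at their default of `false`, `decide` returns `Decision::Promote`; setting only `required_route_or_stream_evidence_missing` yields `Decision::Hold`, which matches the row added by this change.
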
diff --git a/artifacts/backend-mirror-report.json b/artifacts/backend-mirror-report.json
@@ -1,5 +1,5 @@
 {
-  "schema_version": 3,
+  "schema_version": 4,
   "decision": "promote",
   "matched_requests": 4,
   "baseline_requests": 4,
@@ -25,6 +25,10 @@
   "decode_token_p95_regression_pct": -6.666667,
   "max_candidate_queue_depth": 6,
   "max_candidate_memory_pressure_pct": 60.0,
+  "candidate_routing_provenance_rate": 0.0,
+  "candidate_streaming_trace_rate": 0.0,
+  "candidate_route_count": 0,
+  "candidate_scheduler_policies": [],
   "token_trace_pairs": 4,
   "token_trace_mismatch_rate": 0.0,
   "segments": [
@@ -50,6 +54,10 @@
       "decode_token_p95_regression_pct": -6.666667,
       "max_candidate_queue_depth": 6,
       "max_candidate_memory_pressure_pct": 60.0,
+      "candidate_routing_provenance_rate": 0.0,
+      "candidate_streaming_trace_rate": 0.0,
+      "candidate_route_count": 0,
+      "candidate_scheduler_policies": [],
       "token_trace_pairs": 4,
       "token_trace_mismatch_rate": 0.0
     }
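
This report still says `promote` with both new rates at `0.0`, which fits the documentation's statement that the provenance and streaming checks apply only when enabled. A rough sketch of how such coverage fields could be computed and checked follows; the `CandidateObs` struct, the `min_rate` option, and treating an empty input as `0.0` are assumptions, not the project's code.

```rust
use std::collections::BTreeSet;

/// Hypothetical candidate observation; only the fields needed here.
pub struct CandidateObs {
    pub route_id: Option<String>,
    pub scheduler_policy: Option<String>,
}

/// Fraction of observations carrying both a route ID and a scheduler policy.
/// An empty slice is reported as 0.0 (an assumption).
pub fn routing_provenance_rate(obs: &[CandidateObs]) -> f64 {
    if obs.is_empty() {
        return 0.0;
    }
    let covered = obs
        .iter()
        .filter(|o| o.route_id.is_some() && o.scheduler_policy.is_some())
        .count();
    covered as f64 / obs.len() as f64
}

/// Number of distinct route IDs seen across the observations.
pub fn route_count(obs: &[CandidateObs]) -> usize {
    obs.iter()
        .filter_map(|o| o.route_id.as_deref())
        .collect::<BTreeSet<_>>()
        .len()
}

/// `None` means the check is disabled, so any rate passes.
pub fn provenance_ok(rate: f64, min_rate: Option<f64>) -> bool {
    min_rate.map_or(true, |min| rate >= min)
}
```

Under these assumptions, a fixture with no route or scheduler fields reports a rate of `0.0` and still passes when `min_rate` is `None`, while the same rate fails once a minimum of `1.0` is required.
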

diff --git a/artifacts/backend-mirror-streaming-report.json b/artifacts/backend-mirror-streaming-report.json
@@ -0,0 +1,73 @@
+{
+  "schema_version": 4,
+  "decision": "promote",
+  "matched_requests": 4,
+  "baseline_requests": 4,
+  "candidate_requests": 4,
+  "coverage_rate": 1.0,
+  "output_mismatch_rate": 0.0,
+  "numeric_pairs": 0,
+  "tolerated_numeric_outputs": 0,
+  "numeric_drift_rate": 0.0,
+  "max_numeric_abs_error": null,
+  "max_numeric_rel_error": null,
+  "baseline_error_rate": 0.0,
+  "candidate_error_rate": 0.0,
+  "error_rate_increase": 0.0,
+  "baseline_p95_latency_ms": 28.0,
+  "candidate_p95_latency_ms": 27.2,
+  "p95_latency_regression_pct": -2.857143,
+  "baseline_p95_ttft_ms": 9.0,
+  "candidate_p95_ttft_ms": 8.5,
+  "ttft_regression_pct": -5.555556,
+  "baseline_decode_token_p95_ms": 7.0,
+  "candidate_decode_token_p95_ms": 6.5,
+  "decode_token_p95_regression_pct": -7.142857,
+  "max_candidate_queue_depth": 6,
+  "max_candidate_memory_pressure_pct": 60.0,
+  "candidate_routing_provenance_rate": 1.0,
+  "candidate_streaming_trace_rate": 1.0,
+  "candidate_route_count": 2,
+  "candidate_scheduler_policies": [
+    "continuous-batching"
+  ],
+  "token_trace_pairs": 4,
+  "token_trace_mismatch_rate": 0.0,
+  "segments": [
+    {
+      "model": "decoder-7b",
+      "baseline_backend": "vllm",
+      "candidate_backend": "sglang",
+      "accelerator": "h100",
+      "baseline_model_version": "decoder-7b@baseline-2026-06-28",
+      "candidate_model_version": "decoder-7b@candidate-2026-06-28",
+      "matched_requests": 4,
+      "output_mismatch_rate": 0.0,
+      "baseline_error_rate": 0.0,
+      "candidate_error_rate": 0.0,
+      "baseline_p95_latency_ms": 28.0,
+      "candidate_p95_latency_ms": 27.2,
+      "p95_latency_regression_pct": -2.857143,
+      "baseline_p95_ttft_ms": 9.0,
+      "candidate_p95_ttft_ms": 8.5,
+      "ttft_regression_pct": -5.555556,
+      "baseline_decode_token_p95_ms": 7.0,
+      "candidate_decode_token_p95_ms": 6.5,
+      "decode_token_p95_regression_pct": -7.142857,
+      "max_candidate_queue_depth": 6,
+      "max_candidate_memory_pressure_pct": 60.0,
+      "candidate_routing_provenance_rate": 1.0,
+      "candidate_streaming_trace_rate": 1.0,
+      "candidate_route_count": 2,
+      "candidate_scheduler_policies": [
+        "continuous-batching"
+      ],
+      "token_trace_pairs": 4,
+      "token_trace_mismatch_rate": 0.0
+    }
+  ],
+  "triage": [],
+  "reasons": [
+    "candidate stayed within correctness, reliability, latency, and telemetry policy"
+  ]
+}
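
The regression percentages in this report are consistent with a plain relative change, `(candidate - baseline) / baseline * 100`, printed to six decimal places. The sketch below reproduces the three figures under that assumption; the function name and the `None` result for a non-positive baseline are illustrative choices, not taken from the source.

```rust
/// Relative change of `candidate` against `baseline`, in percent.
/// Negative values mean the candidate is faster. Returns `None` when the
/// baseline is not positive, since the ratio would be undefined (assumption).
pub fn regression_pct(baseline: f64, candidate: f64) -> Option<f64> {
    if baseline <= 0.0 {
        return None;
    }
    Some((candidate - baseline) / baseline * 100.0)
}
```

For the values above: p95 latency 28.0 to 27.2 gives about -2.857143, TTFT 9.0 to 8.5 gives about -5.555556, and decode-token p95 7.0 to 6.5 gives about -7.142857.
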
diff --git a/artifacts/release-gate-numeric-tolerance.json b/artifacts/release-gate-numeric-tolerance.json
@@ -1,5 +1,5 @@
 {
-  "schema_version": 3,
+  "schema_version": 4,
   "decision": "promote",
   "matched_requests": 4,
   "baseline_requests": 4,
@@ -25,6 +25,10 @@
   "decode_token_p95_regression_pct": null,
   "max_candidate_queue_depth": null,
   "max_candidate_memory_pressure_pct": null,
+  "candidate_routing_provenance_rate": 0.0,
+  "candidate_streaming_trace_rate": 0.0,
+  "candidate_route_count": 0,
+  "candidate_scheduler_policies": [],
   "token_trace_pairs": 0,
   "token_trace_mismatch_rate": 0.0,
   "segments": [
@@ -50,6 +54,10 @@
       "decode_token_p95_regression_pct": null,
       "max_candidate_queue_depth": null,
       "max_candidate_memory_pressure_pct": null,
+      "candidate_routing_provenance_rate": 0.0,
+      "candidate_streaming_trace_rate": 0.0,
+      "candidate_route_count": 0,
+      "candidate_scheduler_policies": [],
       "token_trace_pairs": 0,
       "token_trace_mismatch_rate": 0.0
     }

diff --git a/artifacts/release-gate-promote.json b/artifacts/release-gate-promote.json
@@ -1,5 +1,5 @@
 {
-  "schema_version": 3,
+  "schema_version": 4,
   "decision": "promote",
   "matched_requests": 4,
   "baseline_requests": 4,
@@ -25,6 +25,10 @@
   "decode_token_p95_regression_pct": null,
   "max_candidate_queue_depth": null,
   "max_candidate_memory_pressure_pct": null,
+  "candidate_routing_provenance_rate": 0.0,
+  "candidate_streaming_trace_rate": 0.0,
+  "candidate_route_count": 0,
+  "candidate_scheduler_policies": [],
   "token_trace_pairs": 0,
   "token_trace_mismatch_rate": 0.0,
   "segments": [
@@ -50,6 +54,10 @@
       "decode_token_p95_regression_pct": null,
       "max_candidate_queue_depth": null,
       "max_candidate_memory_pressure_pct": null,
+      "candidate_routing_provenance_rate": 0.0,
+      "candidate_streaming_trace_rate": 0.0,
+      "candidate_route_count": 0,
+      "candidate_scheduler_policies": [],
       "token_trace_pairs": 0,
       "token_trace_mismatch_rate": 0.0
     }

diff --git a/artifacts/release-gate-rollback.json b/artifacts/release-gate-rollback.json
@@ -1,5 +1,5 @@
 {
-  "schema_version": 3,
+  "schema_version": 4,
   "decision": "rollback",
   "matched_requests": 4,
   "baseline_requests": 4,
@@ -25,6 +25,10 @@
   "decode_token_p95_regression_pct": null,
   "max_candidate_queue_depth": null,
   "max_candidate_memory_pressure_pct": null,
+  "candidate_routing_provenance_rate": 0.0,
+  "candidate_streaming_trace_rate": 0.0,
+  "candidate_route_count": 0,
+  "candidate_scheduler_policies": [],
   "token_trace_pairs": 0,
   "token_trace_mismatch_rate": 0.0,
   "segments": [
@@ -50,6 +54,10 @@
       "decode_token_p95_regression_pct": null,
       "max_candidate_queue_depth": null,
       "max_candidate_memory_pressure_pct": null,
+      "candidate_routing_provenance_rate": 0.0,
+      "candidate_streaming_trace_rate": 0.0,
+      "candidate_route_count": 0,
+      "candidate_scheduler_policies": [],
       "token_trace_pairs": 0,
       "token_trace_mismatch_rate": 0.0
     }

diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
@@ -48,13 +48,15 @@ It computes:
 - successful-request p95 latency; and
 - candidate p95 regression;
 - TTFT p95 and decode-token p95 regression;
-- candidate queue depth and KV memory pressure; and
+- candidate queue depth and KV memory pressure;
+- candidate route/scheduler provenance and streaming trace coverage; and
 - segment summaries by model, model version, baseline backend, candidate
   backend, and accelerator.
 
 Correctness, numeric drift, or reliability regressions produce `rollback`.
-Latency, token-path, memory-pressure regressions, or incomplete evidence produce
-`hold`. A complete candidate within policy produces `promote`.
+Latency, token-path, or memory-pressure regressions, missing routing
+provenance, missing streaming trace coverage, or incomplete evidence produce
+`hold`. A complete candidate within policy produces `promote`.
 Hold and rollback reports include structured triage items so CI or rollout
 tooling can route the failed signal to a likely owner without parsing prose.
 
@@ -68,11 +70,13 @@ The adapter sits before the release gate. It normalizes backend-specific
 mirrored observations into `GateInput` without changing the gate policy. This
 keeps ingestion concerns separate from rollout decisions.
 
-The adapter currently accepts compact vLLM/SGLang-style request summaries:
-request ID, latency, health, model, backend, accelerator, output token IDs,
-optional explicit fingerprints, optional numeric vectors, model version, queue
-depth, KV page usage, TTFT, decode-token latencies, and optional token-trace
-fingerprints. If an engine does not provide an output fingerprint, the adapter
-computes a stable FNV-1a fingerprint from token IDs or numeric values.
-Successful observations without output material are rejected so a candidate
-cannot be promoted from latency-only evidence.
+The adapter currently accepts compact vLLM/SGLang-style request summaries and
+streaming request traces: request ID, latency, health, model, backend,
+accelerator, output token IDs, streaming token events, optional explicit
+fingerprints, optional numeric vectors, model version, route ID, replica ID,
+scheduler policy, queue depth, KV page usage, TTFT, decode-token latencies, and
+optional token-trace fingerprints. If an engine does not provide an output
+fingerprint, the adapter computes a stable FNV-1a fingerprint from token IDs,
+streaming token events, or numeric values. Successful observations without
+output material are rejected so a candidate cannot be promoted from
+latency-only evidence.
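
As an illustration of the fallback fingerprint described above, here is a minimal FNV-1a sketch over token IDs. The source names FNV-1a but not the width or the byte encoding, so the 64-bit variant, the little-endian `u32` encoding, and the hex output format are all assumptions; the constants are the standard 64-bit FNV offset basis and prime.

```rust
/// Standard 64-bit FNV offset basis and prime.
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// FNV-1a: XOR each byte into the hash, then multiply by the prime.
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    bytes.iter().fold(FNV_OFFSET_BASIS, |hash, &byte| {
        (hash ^ u64::from(byte)).wrapping_mul(FNV_PRIME)
    })
}

/// Hypothetical helper: fingerprints a token-ID sequence by hashing each ID
/// as four little-endian bytes and rendering the result as 16 hex digits.
pub fn fingerprint_token_ids(token_ids: &[u32]) -> String {
    let mut bytes = Vec::with_capacity(token_ids.len() * 4);
    for id in token_ids {
        bytes.extend_from_slice(&id.to_le_bytes());
    }
    format!("{:016x}", fnv1a_64(&bytes))
}
```

Because the hash depends only on the byte sequence, the same token IDs yield the same fingerprint across runs and platforms, which is the property a baseline-versus-candidate comparison needs.
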
diff --git a/docs/RELEASE_VALIDATION.md b/docs/RELEASE_VALIDATION.md
@@ -12,6 +12,8 @@ Use `promote` when:
 - the candidate error-rate increase stays within policy; and
 - candidate p95 latency, TTFT p95, decode-token p95, and KV memory pressure
   stay within the configured regression budgets.
+- required candidate route/scheduler provenance and streaming token traces are
+  complete when those checks are enabled.
 
 ## Hold
 
@@ -21,6 +23,9 @@ output pairs, or a p95 latency regression without a correctness failure.
 The same response is used for excessive TTFT regression, decode-token p95
 regression, or candidate KV memory pressure because those are operational
 signals that need investigation before rollout.
+If route/scheduler provenance or streaming token traces are required but
+missing, the gate also returns `hold`; the candidate may still be correct, but
+the rollout evidence is not complete enough to trust the serving path.
 
 ## Rollback
 
@@ -54,17 +59,20 @@ The report includes:
 Mirrored observations can include rollout context and token-path telemetry:
 
 - model version;
+- route ID, replica ID, and scheduler policy;
 - queue depth;
 - KV pages used and available;
 - time to first token;
 - per-token decode latencies; and
-- token-trace fingerprints.
+- streaming token events and token-trace fingerprints.
 
 The gate reports aggregate and per-segment TTFT p95, decode-token p95, maximum
-candidate queue depth, maximum candidate memory pressure, and token-trace
-mismatch rate. Correct outputs with excessive latency or memory pressure produce
-`hold`, not `rollback`, because the evidence points to performance or capacity
-risk rather than a correctness failure.
+candidate queue depth, maximum candidate memory pressure, candidate
+route/scheduler provenance coverage, candidate streaming-trace coverage,
+candidate route count, scheduler policies, and token-trace mismatch rate.
+Correct outputs with excessive latency, memory pressure, missing provenance, or
+missing streaming traces produce `hold`, not `rollback`, because the evidence
+points to operational risk rather than a correctness failure.
 
 ## Triage Output
 
@@ -87,11 +95,13 @@ baseline/candidate comparisons such as vLLM versus SGLang, or a current
 runtime versus a candidate runtime behind shadow traffic.
 
 Each observation records request ID, latency, health, model, backend,
-accelerator, output material, and optional operational telemetry. Engines may provide their own
-`output_fingerprint`; otherwise the adapter hashes output token IDs or numeric
-output vectors with a stable FNV-1a fingerprint. Successful observations
-without output material are rejected because the release gate cannot audit
-correctness from latency alone.
+accelerator, output material, and optional operational telemetry. Engines may
+provide their own `output_fingerprint`; otherwise the adapter hashes output
+token IDs, streaming token events, or numeric output vectors with a stable
+FNV-1a fingerprint. Streaming token events also let the adapter derive TTFT and
+decode-token gaps from elapsed timestamps. Successful observations without
+output material are rejected because the release gate cannot audit correctness
+from latency alone.
 
 ## Production Extension Points
 
@@ -101,7 +111,8 @@ A real rollout system should add:
 - prompt-class and region segmentation;
 - SLO burn-rate and saturation signals;
 - canary population controls and audited rollback execution; and
-- provenance linking every decision to build, model, and configuration IDs.
+- provenance linking every decision to build, model, route, scheduler, and
+  configuration IDs.
 
 The checked fixtures are synthetic and exist to make the policy executable in
 CI. They are not claims about production traffic or fleet scale.
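
The Backend mirror text in this file says streaming token events let the adapter derive TTFT and decode-token gaps from elapsed timestamps. One way that derivation could look is sketched below. The `TokenEvent` struct and its field names are hypothetical, and the sketch assumes `elapsed_ms` is measured from request start and that events arrive in emission order.

```rust
/// One streamed token with the time elapsed since the request started.
/// Field names are illustrative, not the fixture schema.
#[derive(Debug, Clone, Copy)]
pub struct TokenEvent {
    pub token_id: u32,
    pub elapsed_ms: f64,
}

/// Returns (TTFT, decode-token gaps) for an ordered event stream, or `None`
/// when the stream is empty. TTFT is the first event's elapsed time; each gap
/// is the difference between consecutive elapsed timestamps.
pub fn derive_timings(events: &[TokenEvent]) -> Option<(f64, Vec<f64>)> {
    let first = events.first()?;
    let gaps = events
        .windows(2)
        .map(|pair| pair[1].elapsed_ms - pair[0].elapsed_ms)
        .collect();
    Some((first.elapsed_ms, gaps))
}
```

For events at 8.0, 14.0, and 20.5 ms, this yields a TTFT of 8.0 ms and gaps of 6.0 and 6.5 ms. A single-event stream has a TTFT and no gaps.
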