Merged
20 changes: 20 additions & 0 deletions README.md
@@ -16,6 +16,8 @@ replayable scheduling traces, and canary/shadow release decisions.
- Deterministic workload replay with a machine-readable trace fingerprint.
- Baseline/candidate release validation with `promote`, `hold`, and `rollback`
outcomes.
- Backend mirror normalization for vLLM/SGLang-style serving observations
before the release gate runs.
- Exact output checks, model-aware numeric tolerances for backend drift,
per-segment release summaries, error-rate deltas, p95 latency regression
policy, tests, and CI.
@@ -40,12 +42,18 @@ cargo run --release -- gate \
cargo run --release -- gate \
--input fixtures/release_gate_numeric_tolerance.json \
--output artifacts/release-gate-numeric-tolerance.json

cargo run --release -- mirror-gate \
--input fixtures/backend_mirror_vllm_sglang.json \
--output artifacts/backend-mirror-report.json
```

The safe fixture produces `promote`. The candidate with an output mismatch and
an added error produces `rollback`.
The numeric-tolerance fixture produces `promote` while reporting four tolerated
numeric comparisons across a baseline-runtime to candidate-runtime segment.
The backend-mirror fixture converts vLLM/SGLang-style request observations into
the same release gate and produces `promote` with a vLLM to SGLang segment.

The checked workload fixture completes four requests in 11 scheduler ticks,
peaks at 12 of 20 KV pages, returns all pages on completion, and emits trace
@@ -69,6 +77,18 @@ Every tick records:
The replay report includes a stable trace fingerprint, peak KV pages, total
ticks, and completion count.

## Backend Mirror Adapter

`runtime-lab mirror` converts backend-specific mirrored observations into a
gate input. `runtime-lab mirror-gate` performs the conversion and immediately
evaluates the release policy.

The adapter accepts a model, backend, and optional accelerator for each
observation set, plus per-request latency, health, output token IDs, an
optional explicit output fingerprint, and an optional numeric output vector.
Successful observations must carry output material so correctness checks
remain auditable. When no engine-specific fingerprint is supplied, token IDs
or numeric vectors are hashed into a stable FNV-1a fingerprint.
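
As a rough illustration of the token-ID path, the sketch below hashes a length prefix followed by each token ID's little-endian bytes, using the `tokens-fnv64:` prefix that appears in `src/adapter.rs` in this change. The names `fnv1a_feed` and `token_fingerprint` are hypothetical, and the length is widened to `u64` here as an assumption so that the prefix has a fixed width.

```rust
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// FNV-1a step: XOR each byte into the state, then multiply by the prime.
fn fnv1a_feed(hash: &mut u64, bytes: &[u8]) {
    for byte in bytes {
        *hash ^= u64::from(*byte);
        *hash = hash.wrapping_mul(FNV_PRIME);
    }
}

/// Hypothetical helper: fingerprints a token sequence as a tagged hex string.
fn token_fingerprint(token_ids: &[i64]) -> String {
    let mut hash = FNV_OFFSET_BASIS;
    // Length prefix first (assumption: fixed 8-byte width), then the tokens.
    fnv1a_feed(&mut hash, &(token_ids.len() as u64).to_le_bytes());
    for id in token_ids {
        fnv1a_feed(&mut hash, &id.to_le_bytes());
    }
    format!("tokens-fnv64:{hash:016x}")
}
```

Identical token sequences from two backends produce the same string, so the gate can compare outputs without an engine-specific fingerprint.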

## Release Policy

The gate joins mirrored baseline and candidate observations by request ID.
38 changes: 38 additions & 0 deletions artifacts/backend-mirror-report.json
@@ -0,0 +1,38 @@
{
  "schema_version": 2,
  "decision": "promote",
  "matched_requests": 4,
  "baseline_requests": 4,
  "candidate_requests": 4,
  "coverage_rate": 1.0,
  "output_mismatch_rate": 0.0,
  "numeric_pairs": 0,
  "tolerated_numeric_outputs": 0,
  "numeric_drift_rate": 0.0,
  "max_numeric_abs_error": null,
  "max_numeric_rel_error": null,
  "baseline_error_rate": 0.0,
  "candidate_error_rate": 0.0,
  "error_rate_increase": 0.0,
  "baseline_p95_latency_ms": 28.0,
  "candidate_p95_latency_ms": 27.2,
  "p95_latency_regression_pct": -2.857143,
  "segments": [
    {
      "model": "decoder-7b",
      "baseline_backend": "vllm",
      "candidate_backend": "sglang",
      "accelerator": "h100",
      "matched_requests": 4,
      "output_mismatch_rate": 0.0,
      "baseline_error_rate": 0.0,
      "candidate_error_rate": 0.0,
      "baseline_p95_latency_ms": 28.0,
      "candidate_p95_latency_ms": 27.2,
      "p95_latency_regression_pct": -2.857143
    }
  ],
  "reasons": [
    "candidate stayed within correctness, reliability, and latency policy"
  ]
}
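
The latency figures in this report can be checked by hand against `fixtures/backend_mirror_vllm_sglang.json`. The sketch below is an assumption-laden reconstruction, not the gate's code: the percentile method is not stated in this change, so `nearest_rank` is a hypothetical nearest-rank helper, and `regression_pct` uses the usual relative-change formula. With four samples per side, nearest-rank p95 is simply the slowest request.

```rust
/// Nearest-rank percentile: the value at 1-based rank ceil(p * n) in sorted
/// order. Assumption: the gate's real percentile method may differ.
/// Panics on an empty slice.
fn nearest_rank(samples: &[f64], p: f64) -> f64 {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let rank = (p * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Relative change of the candidate against the baseline, in percent.
fn regression_pct(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline * 100.0
}

/// Returns (baseline p95, candidate p95, regression %) for the fixture latencies.
fn fixture_p95() -> (f64, f64, f64) {
    let baseline = [18.0, 21.0, 24.0, 28.0];
    let candidate = [17.5, 20.6, 23.5, 27.2];
    let b = nearest_rank(&baseline, 0.95);
    let c = nearest_rank(&candidate, 0.95);
    (b, c, regression_pct(b, c))
}
```

Under these assumptions the result is 28.0 ms, 27.2 ms, and (27.2 - 28.0) / 28.0 × 100 ≈ -2.857143 %, which matches the reported values and sits well inside the fixture's 10 % regression threshold.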
14 changes: 14 additions & 0 deletions docs/ARCHITECTURE.md
@@ -50,3 +50,17 @@ within policy produces `promote`.
This is a local validation component, not a deployment controller. Production
integration would obtain observations from mirrored traffic, canary
populations, telemetry, and an audited rollout system.

## Backend Mirror Adapter

The adapter sits before the release gate. It normalizes backend-specific
mirrored observations into `GateInput` without changing the gate policy. This
keeps ingestion concerns separate from rollout decisions.

The adapter currently accepts compact vLLM/SGLang-style request summaries:
request ID, latency, health, model, backend, accelerator, output token IDs,
optional explicit fingerprints, and optional numeric vectors. If an engine does
not provide a fingerprint, the adapter computes a stable FNV-1a fingerprint
from token IDs or numeric values. Successful observations without output
material are rejected so a candidate cannot be promoted from latency-only
evidence.
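
The precedence described above can be sketched with simplified stand-in types. `Obs`, `Material`, and `select_material` are hypothetical names for illustration; the order follows the `output_fingerprint` function in `src/adapter.rs` in this change: explicit fingerprint, then token IDs, then numeric values, then an error marker for failed requests, and rejection otherwise.

```rust
/// Simplified stand-in for one backend observation (hypothetical type).
struct Obs<'a> {
    explicit: Option<&'a str>,
    token_ids: &'a [i64],
    values: &'a [f64],
    ok: bool,
}

/// Which piece of output material the fingerprint would be derived from.
#[derive(Debug, PartialEq)]
enum Material {
    Explicit,
    Tokens,
    Values,
    ErrorMarker,
}

fn select_material(obs: &Obs<'_>) -> Result<Material, String> {
    // A blank explicit fingerprint is treated as missing.
    if obs.explicit.is_some_and(|fp| !fp.trim().is_empty()) {
        return Ok(Material::Explicit);
    }
    if !obs.token_ids.is_empty() {
        return Ok(Material::Tokens);
    }
    if !obs.values.is_empty() {
        return Ok(Material::Values);
    }
    // Failed requests need no output; successful ones do.
    if !obs.ok {
        return Ok(Material::ErrorMarker);
    }
    Err("successful observation has no output material".to_string())
}
```

A successful observation with only a latency therefore yields an error instead of a fingerprint, which is what keeps latency-only evidence out of the gate.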
14 changes: 14 additions & 0 deletions docs/RELEASE_VALIDATION.md
@@ -45,6 +45,20 @@ The report includes:
- segment summaries by model, baseline backend, candidate backend, and
accelerator.

## Backend Mirror Adapter

The `mirror` command normalizes request observations from backend-specific
serving traces into the release gate input format. It is intended for mirrored
baseline/candidate comparisons such as vLLM versus SGLang, or a current
runtime versus a candidate runtime behind shadow traffic.

Each observation records request ID, latency, health, model, backend,
accelerator, and output material. Engines may provide their own
`output_fingerprint`; otherwise the adapter hashes output token IDs or numeric
output vectors with a stable FNV-1a fingerprint. Successful observations
without output material are rejected because the release gate cannot audit
correctness from latency alone.
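
Health does not have to be stated explicitly. In the adapter source in this change, a missing `ok` is inferred from the `error` field. The function below is a small sketch of that rule; `infer_ok` is a hypothetical name, and it uses `Option::map_or` where the adapter uses `is_none_or`.

```rust
/// An explicit `ok` wins. Otherwise the request counts as healthy when the
/// error message is absent or blank.
fn infer_ok(ok: Option<bool>, error: Option<&str>) -> bool {
    ok.unwrap_or_else(|| error.map_or(true, |message| message.trim().is_empty()))
}
```

One consequence worth noting: `infer_ok(Some(true), Some("timeout"))` is `true`, so an engine that sets `ok` explicitly is trusted even if it also reports an error string.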

## Production Extension Points

A real rollout system should add:
72 changes: 72 additions & 0 deletions fixtures/backend_mirror_vllm_sglang.json
@@ -0,0 +1,72 @@
{
  "thresholds": {
    "min_matched_requests": 4,
    "max_output_mismatch_rate": 0.0,
    "max_error_rate_increase": 0.01,
    "max_p95_latency_regression_pct": 10.0,
    "max_numeric_drift_rate": 0.0,
    "numeric_tolerances": []
  },
  "baseline": {
    "backend": "vllm",
    "model": "decoder-7b",
    "accelerator": "h100",
    "observations": [
      {
        "request_id": "prompt-a",
        "latency_ms": 18.0,
        "ok": true,
        "output_token_ids": [101, 1402, 13]
      },
      {
        "request_id": "prompt-b",
        "latency_ms": 21.0,
        "ok": true,
        "output_token_ids": [205, 778, 990]
      },
      {
        "request_id": "prompt-c",
        "latency_ms": 24.0,
        "ok": true,
        "output_token_ids": [42, 42, 7]
      },
      {
        "request_id": "prompt-d",
        "latency_ms": 28.0,
        "ok": true,
        "output_token_ids": [301, 302, 303, 2]
      }
    ]
  },
  "candidate": {
    "backend": "sglang",
    "model": "decoder-7b",
    "accelerator": "h100",
    "observations": [
      {
        "request_id": "prompt-a",
        "latency_ms": 17.5,
        "ok": true,
        "output_token_ids": [101, 1402, 13]
      },
      {
        "request_id": "prompt-b",
        "latency_ms": 20.6,
        "ok": true,
        "output_token_ids": [205, 778, 990]
      },
      {
        "request_id": "prompt-c",
        "latency_ms": 23.5,
        "ok": true,
        "output_token_ids": [42, 42, 7]
      },
      {
        "request_id": "prompt-d",
        "latency_ms": 27.2,
        "ok": true,
        "output_token_ids": [301, 302, 303, 2]
      }
    ]
  }
}
178 changes: 178 additions & 0 deletions src/adapter.rs
@@ -0,0 +1,178 @@
use std::error::Error;
use std::fmt;

use serde::{Deserialize, Serialize};

use crate::release::{GateInput, GateThresholds, Observation};

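/// Top-level adapter input: gate thresholds plus one observation set per side.
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub struct BackendMirrorInput {
    /// Falls back to `GateThresholds::default()` when omitted.
    #[serde(default)]
    pub thresholds: GateThresholds,
    pub baseline: BackendObservationSet,
    pub candidate: BackendObservationSet,
}

/// Observations from one backend; model, backend, and accelerator apply to
/// every observation in the set.
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub struct BackendObservationSet {
    pub backend: String,
    pub model: String,
    #[serde(default)]
    pub accelerator: Option<String>,
    pub observations: Vec<BackendObservation>,
}

/// One mirrored request as reported by a backend.
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub struct BackendObservation {
    pub request_id: String,
    pub latency_ms: f64,
    /// When absent, health is inferred from `error`.
    #[serde(default)]
    pub ok: Option<bool>,
    /// Engine-supplied fingerprint; takes precedence over computed ones.
    #[serde(default)]
    pub output_fingerprint: Option<String>,
    #[serde(default)]
    pub output_token_ids: Vec<i64>,
    #[serde(default)]
    pub output_values: Option<Vec<f64>>,
    #[serde(default)]
    pub error: Option<String>,
}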

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AdapterError {
    message: String,
}

impl AdapterError {
    fn new(message: impl Into<String>) -> Self {
        Self {
            message: message.into(),
        }
    }
}

impl fmt::Display for AdapterError {
    fn fmt(&self, formatter: &mut fmt::Formatter<'_>) -> fmt::Result {
        formatter.write_str(&self.message)
    }
}

impl Error for AdapterError {}

/// Converts a backend mirror input into a `GateInput`, passing the thresholds
/// through unchanged. Returns the first validation error encountered.
pub fn mirror_to_gate_input(input: &BackendMirrorInput) -> Result<GateInput, AdapterError> {
    Ok(GateInput {
        thresholds: input.thresholds.clone(),
        baseline: normalize_set(&input.baseline)?,
        candidate: normalize_set(&input.candidate)?,
    })
}

/// Validates the set-level fields, then normalizes each observation in order.
fn normalize_set(set: &BackendObservationSet) -> Result<Vec<Observation>, AdapterError> {
    if set.backend.trim().is_empty() {
        return Err(AdapterError::new("backend must not be empty"));
    }
    if set.model.trim().is_empty() {
        return Err(AdapterError::new("model must not be empty"));
    }

    set.observations
        .iter()
        .map(|observation| normalize_observation(set, observation))
        .collect()
}

/// Validates one backend observation and converts it into a gate `Observation`,
/// stamping it with the set-level model, backend, and accelerator.
fn normalize_observation(
    set: &BackendObservationSet,
    observation: &BackendObservation,
) -> Result<Observation, AdapterError> {
    if observation.request_id.trim().is_empty() {
        return Err(AdapterError::new("request_id must not be empty"));
    }
    // Reject NaN, infinite, and negative latencies.
    if !observation.latency_ms.is_finite() || observation.latency_ms < 0.0 {
        return Err(AdapterError::new(format!(
            "request {} has invalid latency_ms",
            observation.request_id
        )));
    }

    // An explicit `ok` wins; otherwise the request counts as healthy when
    // `error` is absent or blank.
    let ok = observation.ok.unwrap_or_else(|| {
        observation
            .error
            .as_ref()
            .is_none_or(|error| error.trim().is_empty())
    });
    let output_fingerprint = output_fingerprint(observation, ok)?;

    Ok(Observation {
        request_id: observation.request_id.clone(),
        output_fingerprint,
        latency_ms: observation.latency_ms,
        ok,
        model: Some(set.model.clone()),
        backend: Some(set.backend.clone()),
        accelerator: set.accelerator.clone(),
        output_values: observation.output_values.clone(),
    })
}

/// Picks the output fingerprint in order of precedence: a non-blank explicit
/// fingerprint, hashed token IDs, hashed numeric values, then the literal
/// `"error"` for failed requests. A successful request with none of these is
/// rejected.
fn output_fingerprint(observation: &BackendObservation, ok: bool) -> Result<String, AdapterError> {
    if let Some(fingerprint) = observation.output_fingerprint.as_ref()
        && !fingerprint.trim().is_empty()
    {
        return Ok(fingerprint.clone());
    }

    if !observation.output_token_ids.is_empty() {
        return Ok(format!(
            "tokens-fnv64:{:016x}",
            hash_i64_values(&observation.output_token_ids)
        ));
    }

    if let Some(values) = observation.output_values.as_ref()
        && !values.is_empty()
    {
        return Ok(format!("values-fnv64:{:016x}", hash_f64_values(values)));
    }

    if !ok {
        return Ok("error".into());
    }

    Err(AdapterError::new(format!(
        "request {} is successful but has no output fingerprint, token ids, or numeric values",
        observation.request_id
    )))
}

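/// FNV-1a over the element count followed by each value's little-endian bytes.
fn hash_i64_values(values: &[i64]) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    feed_usize(&mut hash, values.len());
    for value in values {
        feed_bytes(&mut hash, &value.to_le_bytes());
    }
    hash
}

/// Same scheme as `hash_i64_values`, applied to each value's raw IEEE 754 bit
/// pattern, so `0.0` and `-0.0` are treated as different inputs.
fn hash_f64_values(values: &[f64]) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    feed_usize(&mut hash, values.len());
    for value in values {
        feed_bytes(&mut hash, &value.to_bits().to_le_bytes());
    }
    hash
}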

const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

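fn feed_usize(hash: &mut u64, value: usize) {
    // Widen to u64 so the length prefix is 8 bytes on every target; `usize` is
    // 4 bytes on 32-bit platforms, which would change the fingerprint there.
    feed_bytes(hash, &(value as u64).to_le_bytes());
}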

fn feed_bytes(hash: &mut u64, bytes: &[u8]) {
    for byte in bytes {
        *hash ^= u64::from(*byte);
        *hash = hash.wrapping_mul(FNV_PRIME);
    }
}
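
One way to sanity-check the two constants and the XOR-then-multiply order is to run a plain byte-string FNV-1a against the published 64-bit test vectors. The function `fnv1a64` below is a hypothetical standalone helper, not part of the adapter; it repeats the constants so that it compiles on its own.

```rust
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// 64-bit FNV-1a over a byte slice: XOR the byte in, then multiply.
fn fnv1a64(bytes: &[u8]) -> u64 {
    bytes.iter().fold(FNV_OFFSET_BASIS, |hash, byte| {
        (hash ^ u64::from(*byte)).wrapping_mul(FNV_PRIME)
    })
}
```

The empty input hashes to the offset basis, and `fnv1a64(b"a")` gives `0xaf63dc4c8601ec8c`, the standard FNV-1a 64-bit value for `"a"`. The adapter's fingerprints will not equal a plain FNV-1a of the raw token bytes, because they hash a length prefix first.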
5 changes: 5 additions & 0 deletions src/lib.rs
@@ -1,6 +1,11 @@
pub mod adapter;
pub mod release;
pub mod scheduler;

pub use adapter::{
    AdapterError, BackendMirrorInput, BackendObservation, BackendObservationSet,
    mirror_to_gate_input,
};
pub use release::{
    GateDecision, GateInput, GateReport, GateThresholds, NumericTolerance, Observation,
    SegmentReport, evaluate_release,