Purpose: Runtime visibility, metrics, and operational control for Fitz broker.
Protocol: HTTP REST API (coexists with WebSocket data plane on same port)
Port: Same as data plane (default: 8080)
Route Structure:
/- Single Page Application (SPA) static files/api/v1/*- Admin REST API endpoints/metrics- Prometheus metrics endpoint/healthz,/readyz,/startupz- Kubernetes health probes/ws- WebSocket upgrade for data plane Data Plane: WebSocket upgrade on/ws, TCP on same port (protocol detection)
Authentication:/- No authentication (SPA public access)/healthz,/readyz,/startupz- No authentication (for load balancer health checks)/metrics- Requires authentication (JWT or API key)/api/v1/*- Requires authentication (JWT or API key with admin permissions)
- Read-heavy: Most operations are queries for visibility
- Safe by default: Dangerous operations (force rollback, cancel) require explicit confirmation
- Realm-scoped: Queries that expose realm filters operate on the application-defined realm label used in Fitz routes and resources, never on
route_family - Prometheus-compatible: Metrics endpoint follows Prometheus format
- Minimal dependencies: No external monitoring system required for basic visibility
- SPA-first: Web interface served at root, all API routes namespaced
/ → SPA (index.html)
/assets/* → SPA static assets (JS, CSS, images)
/ws → WebSocket upgrade (data plane)
/healthz → Kubernetes liveness probe
/readyz → Kubernetes readiness probe
/startupz → Kubernetes startup probe
/metrics → Prometheus metrics (auth required)
/api/v1/stats → Global broker statistics (auth required)
/api/v1/kv/stats → KV domain statistics (auth required)
/api/v1/stream/stats → Stream domain statistics (auth required)
/api/v1/notice/stats → Notice domain statistics (auth required)
/api/v1/queue/stats → Queue domain statistics (auth required)
/api/v1/rpc/stats → RPC domain statistics (auth required)
/api/v1/lease/stats → Lease domain statistics (auth required)
Authentication Rules:
- SPA (
/,/assets/*) - Public access - Health probes (
/healthz,/readyz,/startupz) - Public access (for K8s/load balancers) - Metrics (
/metrics) - Requires JWT Bearer token - Admin API (
/api/v1/*) - Requires JWT Bearer token with admin scope
GET /healthz
Authentication: None (public endpoint for kubelet) Purpose: Indicates if the application is alive and should be restarted if unhealthy. Response:
200 OK- Application is alive503 Service Unavailable- Application is stuck/deadlocked, should be restarted
{
"status": "ok"
}Criteria:
- Runtime is responsive
- No critical failures (panics, deadlocks)
- Does NOT check downstream dependencies
GET /readyz
Authentication: None (public endpoint for kubelet) Purpose: Indicates if the application is ready to accept traffic. Response:
200 OK- Ready to accept traffic503 Service Unavailable- Not ready, remove from load balancer
{
"status": "ready",
"checks": {
"storage": "ok",
"domains_initialized": "ok"
}
}Criteria:
- Storage engine initialized
- All domain actors started
- TCP/WebSocket listeners bound
- Ready to process requests
GET /startupz
Authentication: None (public endpoint for kubelet) Purpose: Indicates if the application has completed startup. Prevents premature liveness checks during slow startup. Response:
200 OK- Startup complete503 Service Unavailable- Still starting up
{
"status": "started",
"startup_time_seconds": 2.5
}Criteria:
- All initialization complete
- Domain actors ready
- Listeners bound
GET /health
Authentication: None (public endpoint for load balancers) Purpose: General health check for non-Kubernetes environments. Response: 200 OK if healthy, 503 Service Unavailable if degraded
{
"status": "healthy",
"uptime_seconds": 86400,
"version": "0.1.0"
}GET /metrics
Authentication: Required (JWT or API key) Response: Prometheus text format
# HELP fitz_connections_total Total number of active connections
# TYPE fitz_connections_total gauge
fitz_connections_total 142
# HELP fitz_messages_received_total Total messages received
# TYPE fitz_messages_received_total counter
fitz_messages_received_total 1847392
# HELP fitz_messages_sent_total Total messages sent
# TYPE fitz_messages_sent_total counter
fitz_messages_sent_total 1847390
# HELP fitz_message_latency_seconds Message processing latency
# TYPE fitz_message_latency_seconds histogram
fitz_message_latency_seconds_bucket{le="0.001"} 1500000
fitz_message_latency_seconds_bucket{le="0.01"} 1800000
fitz_message_latency_seconds_bucket{le="0.1"} 1847000
...
GET /api/v1/stats
Response:
{
"broker": {
"uptime_seconds": 86400,
"connections": 142,
"sessions": 142,
"realms": ["prod", "staging", "dev"],
"messages_per_second": 450
},
"domains": {
"kv": {
"transactions_active": 23,
"keys_total": 12847,
"operations_per_second": 120
},
"stream": {
"streams_active": 45,
"events_total": 384921,
"operations_per_second": 85
},
"notice": {
"subscriptions_active": 312,
"publishes_per_second": 95
},
"queue": {
"messages_pending": 1847,
"inflight_active": 67,
"operations_per_second": 78
},
"rpc": {
"workers_registered": 34,
"requests_pending": 12,
"operations_per_second": 42
},
"lease": {
"leases_active": 18,
"operations_per_second": 5
},
"schedule": {
"schedules_active": 56,
"executions_per_minute": 23
}
}
}All KV admin responses separate durable committed data from live transaction coordination. Committed values persist according to storage commit semantics, but open transactions shown here are current-process in-memory state only. They disappear on disconnect cleanup or broker restart and do not imply durable transaction recovery.
GET /api/v1/kv/realms/{realm}/areas/{area}/resources/{resource}
transactions_active counts only live session-scoped transactions for the
current broker process. It resets on disconnect cleanup or broker restart.
Response:
{
"realm": "prod",
"area": "app",
"resource": "users",
"transactions_active": 1
}GET /api/v1/kv/realms/{realm}/areas/{area}/resources/{resource}/transactions
tx_id is a session-scoped runtime handle for the currently running broker
process. It is not a durable recovery token, and the listed transactions do not
survive disconnect or restart.
Response:
{
"transactions": [
{
"tx_id": 1234567890,
"realm": "prod",
"area": "app",
"resource": "users",
"mode": "ReadWrite",
"started_at": "2026-01-31T10:30:00Z",
"operations_count": 5,
"idle_seconds": 12
}
]
}Stream admin responses combine durable committed stream metadata with live
current-process append-session counts. Committed streams remain visible after
restart because offset, watermark, and size_bytes come from durable
metadata. sessions_active counts only live append sessions on the current
broker process and resets on disconnect cleanup or broker restart. Consumer
cursors remain client-managed; there are no durable broker-side cursor groups.
GET /api/v1/stream/realms/{realm}/areas/{area}/resources
Response:
{
"realm": "prod",
"area": "events",
"resources": [
{ "resource": "orders" },
{ "resource": "payments" }
]
}GET /api/v1/stream/realms/{realm}/areas/{area}/resources/{resource}
Response:
{
"realm": "prod",
"area": "events",
"resource": "orders",
"offset": 384921,
"watermark": 384921,
"size_bytes": 52847392,
"sessions_active": 3
}offsetis the last committed resource offset.watermarkis the highest committed visible offset for that resource.size_bytesis derived from durable committed-byte metadata.sessions_activeis a live append-session count only; it is not a durable writer inventory.- Stream subscriptions remain session-scoped best-effort delivery and are not represented as durable admin state.
All Notice admin responses reflect live in-memory broker state only. Notice subscriptions are session-scoped, disappear on disconnect, and are not restored after broker restart.
GET /admin/notice/subscriptions?realm={realm}&route_pattern={pattern}
created_at is the time the current in-memory subscription was created. notifications_received is a live delivery counter for the current in-memory subscription and resets when the client reconnects or the broker restarts.
Response:
{
"subscriptions": [
{
"subscription_id": 42,
"session_id": "sess_abc123",
"realm": "prod",
"pattern": "notice://prod/events/**",
"created_at": "2026-01-31T10:30:00Z",
"notifications_received": 1847
}
]
}GET /admin/notice/routes?realm={realm}
publishes_total and publishes_per_minute describe live broker-observed activity for the current process lifetime. They are not durable replay or history counters.
Response:
{
"routes": [
{
"route": "notice://prod/events/orders",
"subscribers": 23,
"publishes_total": 8456,
"publishes_per_minute": 45
}
]
}GET /admin/notice/stats?realm={realm}
These values are point-in-time in-memory statistics for the running broker instance.
Response:
{
"subscriptions_active": 312,
"routes_registered": 67,
"publishes_total": 1847392,
"publishes_per_second": 95,
"fanout_ratio": 4.2
}POST /admin/notice/subscriptions/{subscription_id}/cancel
Force-removes an active in-memory notice subscription from the current broker instance. This endpoint does not cancel durable state and has no effect after the owning session disconnects.
Headers: X-Confirm: true
Response: 200 OK or 404 Not Found
Queue admin responses reflect only the current broker's warm in-memory actor state unless otherwise noted. Committed queue data remains durable in storage, but warm resource counts and live lease rows can disappear after disconnect cleanup, idle actor eviction, or broker restart until traffic rehydrates that queue.
GET /api/v1/queue/realms/{realm}/areas/{area}/resources
GET /api/v1/queue/realms/{realm}/areas/{area}/resources/{resource}?family={family}
family is optional on read routes. When omitted, queue detail aggregates warm state across route families that share the same {realm}/{area}/{resource} on the current broker. When provided, the response is filtered to that exact queue identity.
messages_ready, messages_delayed, messages_inflight, messages_dead_lettered, and messages_total are point-in-time counts for the current broker only. They are not a durable catalog of every committed queue in storage.
Response:
{
"realm": "prod",
"area": "jobs",
"resource": "emails",
"messages_ready": 1847,
"messages_delayed": 12,
"messages_inflight": 67,
"messages_dead_lettered": 4,
"messages_total": 1930,
"oldest_message_age_seconds": 0
}GET /api/v1/queue/realms/{realm}/areas/{area}/resources/{resource}/inflight?family={family}
family is optional. When provided, only inflight entries for that exact queue identity are returned.
inflight_token and session_id describe live in-memory inflight ownership only. They are dropped on disconnect cleanup, invalidated on expiry, and never survive broker restart.
Response:
{
"inflight": [
{
"message_id": 123456,
"family": 1,
"realm": "prod",
"area": "jobs",
"resource": "emails",
"inflight_token": "987654321",
"session_id": "12345",
"expires_at": "2026-01-31T10:35:00Z",
"attempts": 2
}
]
}GET /api/v1/queue/realms/{realm}/areas/{area}/resources/{resource}/dead-letters?family={family}
family is optional on reads. Dead-letter rows remain durably stored, but this endpoint only exposes DLQ rows for queue actors that are currently warm on this broker.
Response:
{
"messages": [
{
"message_id": 123456,
"family": 1,
"realm": "prod",
"area": "jobs",
"resource": "emails",
"dead_lettered_at": "2026-01-31T10:35:00Z",
"attempts": 3,
"reason": "max_attempts_exceeded"
}
]
}POST /api/v1/queue/realms/{realm}/areas/{area}/resources/{resource}/dead-letters/{message_id}/replay?family={family}
family is required for destructive queue actions because queue identity includes route family. On success the retained DLQ row is moved back to ready state, its attempts counter resets, and the endpoint returns 204 No Content.
DELETE /api/v1/queue/realms/{realm}/areas/{area}/resources/{resource}/dead-letters/{message_id}?family={family}
family is required. On success the retained DLQ row is permanently removed from storage and the endpoint returns 204 No Content.
All RPC admin endpoints expose live in-memory state for the current broker instance only. Worker registrations and pending requests disappear on disconnect or broker restart and are not durable recovery queues. The broker updates this read model as a coalesced operational snapshot, so very recent subscribe, unsubscribe, timeout, and cleanup events can lag briefly in admin responses. Treat these endpoints as near-live diagnostics, not strongly consistent reads of the hot path.
GET /admin/rpc/workers?realm={realm}
Response:
{
"workers": [
{
"session_id": "sess_abc123",
"realm": "prod",
"route": "rpc://prod/compute/tasks/heavy-task",
"registered_at": "2026-01-31T10:00:00Z",
"requests_handled": 1847,
"average_latency_ms": 145
}
]
}GET /admin/rpc/pending?realm={realm}
Pending requests shown here are only those still tracked in memory by the running broker. A restart clears this list immediately.
Response:
{
"requests": [
{
"correlation_id": "0123456789abcdef",
"route": "rpc://prod/compute/tasks/heavy-task",
"submitted_at": "2026-01-31T10:34:50Z",
"age_seconds": 10,
"worker_session_id": "sess_abc123"
}
]
}GET /admin/rpc/stats?realm={realm}
workers_registered and requests_pending are point-in-time in-memory counts for the running broker process. They reset on restart and should not be interpreted as durable backlog or recovery state.
Like the worker and pending endpoints, these counters are served from the current broker's coalesced admin snapshot and can lag the latest in-flight mutations briefly.
Response:
{
"workers_registered": 34,
"requests_pending": 12,
"requests_completed_total": 184739,
"requests_timed_out_total": 42,
"operations_per_second": 42,
"average_latency_ms": 125
}POST /admin/rpc/requests/{correlation_id}/cancel
Headers: X-Confirm: true
Response: 200 OK or 404 Not Found
All Lease admin responses reflect live in-memory state for the current broker process only. Lease ownership disappears on disconnect cleanup or broker restart, and fencing_token values are process-local rather than durable or cross-node identifiers.
GET /admin/lease/leases?realm={realm}
acquired_at and expires_at describe the current in-memory lease window only. fencing_token is valid only within the running broker process and resets after restart.
Response:
{
"leases": [
{
"realm": "prod",
"area": "locks",
"resource": "job-executor",
"owner_session_id": "sess_abc123",
"acquired_at": "2026-01-31T10:30:00Z",
"expires_at": "2026-01-31T10:35:00Z",
"renewals": 5,
"fencing_token": 42
}
]
}GET /admin/lease/stats?realm={realm}
These values are point-in-time in-memory counts for the running broker process and should not be interpreted as durable recovery state.
Response:
{
"leases_active": 18,
"leases_acquired_total": 8456,
"leases_expired_total": 342,
"operations_per_second": 5
}POST /admin/lease/leases/{lease_id}/release
Force-releases an active in-memory lease on the current broker instance only. This endpoint does not recover or revoke durable state, and it has no effect after the owning session has already disconnected or the broker has restarted.
Headers: X-Confirm: true
Response: 200 OK or 404 Not Found
Schedule definitions are durable and are preloaded into per-family Schedule actors during broker boot. Admin schedule views therefore reflect persisted definitions before any schedule-domain traffic reaches that family. Schedule notifications and subscriptions remain live session-scoped delivery only, and last_run / executions_total are still non-authoritative placeholders in this round.
GET /api/v1/schedule/realms/{realm}/areas/{area}/resources/{resource}
Response:
{
"realm": "prod",
"area": "jobs",
"resource": "cleanup",
"enabled": true,
"cron": "0 2 * * *",
"next_run": "2026-02-01T02:00:00Z",
"executions_total": 0
}If multiple operations exist under the same resource, cron is omitted and next_run is the earliest next durable fire among that resource's persisted schedules.
GET /admin/sessions?realm={realm}
realm filters by the session's application-visible realm label only. It is not a route_family selector, fallback, or alias.
Response:
{
"sessions": [
{
"session_id": "sess_abc123",
"realm": "prod",
"connected_at": "2026-01-31T10:00:00Z",
"idle_seconds": 45,
"messages_received": 1847,
"messages_sent": 1845,
"transport": "websocket",
"remote_addr": "192.168.1.100:54321"
}
]
}POST /admin/sessions/{session_id}/close
Headers: X-Confirm: true
Response: 200 OK or 404 Not Found
Health Probes (No Auth):
GET /healthz- Liveness probeGET /readyz- Readiness probeGET /startupz- Startup probeGET /health- Legacy health check Metrics (Auth Required):GET /metrics- Prometheus metrics Global Stats (Admin Auth):GET /api/v1/stats- Global broker and domain statistics Domain Stats (Admin Auth):GET /api/v1/kv/stats- KV domain statisticsGET /api/v1/stream/stats- Stream domain statisticsGET /api/v1/notice/stats- Notice domain statisticsGET /api/v1/queue/stats- Queue domain statisticsGET /api/v1/rpc/stats- RPC domain statisticsGET /api/v1/lease/stats- Lease domain statisticsGET /api/v1/schedule/stats- Schedule domain statistics List Endpoints (Admin Auth) - current surface plus remaining follow-up:GET /api/v1/kv/realms/{realm}/areas/{area}/resources/{resource}- Get live KV resource detailGET /api/v1/kv/realms/{realm}/areas/{area}/resources/{resource}/transactions- List live session-scoped KV transactions for a resource- KV transaction snapshots are current-process only. They disappear on disconnect cleanup and are not restored after broker restart.
GET /api/v1/stream/realms/{realm}/areas/{area}/resources- List stream resources in an areaGET /api/v1/stream/realms/{realm}/areas/{area}/resources/{resource}- Get stream resource detail- Stream resource detail combines durable committed metadata with the current broker's live append-session count. It does not represent durable consumer cursors or broker-restored sessions.
GET /api/v1/admin/notice/subscriptions?realm={realm}&route_pattern={pattern}- List subscriptionsGET /api/v1/admin/notice/routes?realm={realm}- List routes with subscriber countsGET /api/v1/queue/realms/{realm}/areas/{area}/resources/{resource}- Get warm Queue resource detailGET /api/v1/queue/realms/{realm}/areas/{area}/resources/{resource}/inflight- List live queue inflight entries for a resourceGET /api/v1/queue/realms/{realm}/areas/{area}/resources/{resource}/dead-letters- List retained DLQ rows for a resource- Queue resource and lease snapshots are current-process only. They can disappear after disconnect cleanup, idle actor eviction, or broker restart until traffic rehydrates the queue.
- Queue detail and list routes accept an optional
familyquery parameter. Replay and purge require it.
POST /api/v1/queue/realms/{realm}/areas/{area}/resources/{resource}/dead-letters/{message_id}/replay?family={family}- Replay a retained DLQ rowDELETE /api/v1/queue/realms/{realm}/areas/{area}/resources/{resource}/dead-letters/{message_id}?family={family}- Purge a retained DLQ rowGET /api/v1/admin/rpc/workers?realm={realm}- List registered RPC workersGET /api/v1/admin/rpc/pending?realm={realm}- List pending RPC requestsGET /api/v1/admin/lease/leases?realm={realm}- List active in-memory leases- Lease snapshots are live only. They disappear on disconnect cleanup and are not restored after broker restart.
GET /api/v1/schedule/realms/{realm}/areas/{area}/resources/{resource}- Get durable Schedule resource detail- Schedule definitions are durable and boot-loaded. Notification delivery remains live only, and
last_run/executions_totalare placeholders rather than durable execution history.
- Schedule definitions are durable and boot-loaded. Notification delivery remains live only, and
GET /api/v1/admin/sessions?realm={realm}- List active sessions
Admin Commands (Admin Auth + X-Confirm Header):
POST /api/v1/admin/kv/transactions/{tx_id}/rollback- Force rollback transactionPOST /api/v1/admin/notice/subscriptions/{subscription_id}/cancel- Cancel subscriptionPOST /api/v1/admin/rpc/requests/{correlation_id}/cancel- Cancel RPC requestPOST /api/v1/admin/lease/leases/{lease_id}/release- Force release leasePOST /api/v1/admin/schedule/schedules/{schedule_id}/trigger- Trigger schedule manuallyPOST /api/v1/admin/sessions/{session_id}/close- Close session Pagination Support:- Add
?limit=and?offset=query parameters to list endpoints Domain Integration: - Each domain needs to implement methods to provide list data
- KV: Live transaction snapshots are exposed per resource and reflect only active in-memory session-scoped state
- Stream: Rebuild resource detail from durable metadata and expose live append-session counts separately
- Notice: Track subscriptions and routes, expose via admin query
- Queue: Track queue depths and leases, expose via admin query
- RPC: Track workers and pending requests, expose via admin query
- Lease: Track active leases, expose via admin query
- Sessions: Track active sessions, expose via admin query
Each domain should maintain lightweight counters:
pub struct DomainMetrics {
// Counters (monotonic)
operations_total: AtomicU64,
errors_total: AtomicU64,
// Gauges (current value)
active_count: AtomicU64,
// Histograms (latency tracking)
latency_histogram: HistogramVec,
}Create a dedicated AdminActor that:
- Receives HTTP requests from admin REST handler
- Queries domain actors for stats (via internal message passing)
- Aggregates responses
- Returns JSON
pub enum AdminQuery {
KvStats { realm: Option<String> },
KvTransactions { realm: Option<String> },
NoticeSubscriptions { realm: Option<String>, pattern: Option<String> },
// ... etc
}
pub enum AdminCommand {
RollbackTransaction { tx_id: u64 },
CancelSubscription { subscription_id: u64 },
ExpireLease { lease_id: u64 },
// ... etc
}- Authentication:
/healthz,/readyz,/startupz,/health- No auth (for kubelet/load balancers)/metrics- Requires JWT or API key (prevents information disclosure)/admin/*- Requires JWT or API key withadmin:readpermission
- Authorization:
- Admin queries (GET) require
admin:readpermission - Admin commands (POST) require
admin:writepermission
- Admin queries (GET) require
- Confirmation Header: Dangerous operations require
X-Confirm: trueheader - Audit Logging: All admin commands should be logged with timestamp, user, and action
- Rate Limiting: Prevent abuse of admin queries (especially metrics scraping)
use hyper::{Body, Response, StatusCode};
use serde_json::json;
pub async fn handle_liveness() -> Result<Response<Body>, Infallible> {
// Check if runtime is responsive
// Return 503 only if deadlocked/panicked
let response = json!({ "status": "ok" });
Ok(Response::builder()
.status(StatusCode::OK)
.header("Content-Type", "application/json")
.body(Body::from(serde_json::to_string(&response).unwrap()))
.unwrap())
}
pub async fn handle_readiness(runtime: Arc<Runtime>) -> Result<Response<Body>, Infallible> {
// Check if ready to accept traffic
if !runtime.storage_initialized() || !runtime.domains_ready() {
let response = json!({
"status": "not_ready",
"checks": {
"storage": if runtime.storage_initialized() { "ok" } else { "not_ready" },
"domains_initialized": if runtime.domains_ready() { "ok" } else { "not_ready" }
}
});
return Ok(Response::builder()
.status(StatusCode::SERVICE_UNAVAILABLE)
.header("Content-Type", "application/json")
.body(Body::from(serde_json::to_string(&response).unwrap()))
.unwrap());
}
let response = json!({
"status": "ready",
"checks": {
"storage": "ok",
"domains_initialized": "ok"
}
});
Ok(Response::builder()
.status(StatusCode::OK)
.header("Content-Type", "application/json")
.body(Body::from(serde_json::to_string(&response).unwrap()))
.unwrap())
}
pub async fn handle_startup(runtime: Arc<Runtime>) -> Result<Response<Body>, Infallible> {
// Check if startup complete
if !runtime.startup_complete() {
let response = json!({ "status": "starting" });
return Ok(Response::builder()
.status(StatusCode::SERVICE_UNAVAILABLE)
.header("Content-Type", "application/json")
.body(Body::from(serde_json::to_string(&response).unwrap()))
.unwrap());
}
let response = json!({
"status": "started",
"startup_time_seconds": runtime.startup_duration().as_secs_f64()
});
Ok(Response::builder()
.status(StatusCode::OK)
.header("Content-Type", "application/json")
.body(Body::from(serde_json::to_string(&response).unwrap()))
.unwrap())
}- Caching: Cache stats with 1-second TTL to avoid overwhelming domain actors
- Pagination: List endpoints should support
?limit=and?offset= - Filtering: All list endpoints should support realm filtering
- Async: Admin queries should not block domain actors
use hyper::{Body, Request, Response, StatusCode};
use serde_json::json;
async fn handle_admin(
req: Request<Body>,
runtime: Arc<Runtime>,
) -> Result<Response<Body>, Infallible> {
let path = req.uri().path();
// Parse query params
let query = req.uri().query();
let realm = parse_realm_filter(query);
match path {
"/admin/stats" => {
let stats = get_global_stats(runtime).await;
json_response(stats)
}
"/admin/kv/stats" => {
let stats = get_kv_stats(runtime, realm).await;
json_response(stats)
}
"/admin/kv/transactions" => {
let txs = get_kv_transactions(runtime, realm).await;
json_response(txs)
}
"/admin/notice/subscriptions" => {
let subs = get_notice_subscriptions(runtime, realm).await;
json_response(subs)
}
_ => Ok(not_found()),
}
}
fn json_response<T: Serialize>(data: T) -> Result<Response<Body>, Infallible> {
let json = serde_json::to_string(&data).unwrap();
Ok(Response::builder()
.status(StatusCode::OK)
.header("Content-Type", "application/json")
.body(Body::from(json))
.unwrap())
}
fn unauthorized() -> Response<Body> {
Response::builder()
.status(StatusCode::UNAUTHORIZED)
.body(Body::from("Unauthorized"))
.unwrap()
}
fn not_found() -> Response<Body> {
Response::builder()
.status(StatusCode::NOT_FOUND)
.body(Body::from("Not Found"))
.unwrap()
}### Domain Stats Trait
```rust
pub trait DomainStats {
fn get_stats(&self, realm: Option<&str>) -> DomainStatsSnapshot;
fn get_active_items(&self, realm: Option<&str>) -> Vec<ActiveItem>;
}
impl DomainStats for KvActor {
fn get_stats(&self, realm: Option<&str>) -> DomainStatsSnapshot {
// Return current transaction count, key count, etc.
}
}
- Admin API on same port as data plane (e.g., 8080) for cloud platform compatibility
- Path-based routing:
/healthz- Kubernetes liveness probe (no auth)/readyz- Kubernetes readiness probe (no auth)/startupz- Kubernetes startup probe (no auth)/health- Legacy health check (no auth)/ws- WebSocket upgrade for data plane/metrics- Prometheus metrics (requires auth)/admin/*- Admin API (requires auth + admin permissions)
- TCP connections use protocol detection on same port (raw TCP vs HTTP)
- Enable Prometheus scraping at
/metrics - Dashboard: Use Grafana to visualize metrics
- Alerting: Set up alerts on key metrics (high latency, error rates)
Works with single-port constraints:
- ✅ Azure Container Apps - Single port with HTTP/WebSocket
- ✅ Google Cloud Run - Single port (HTTP/WebSocket)
- ✅ AWS App Runner - Single port
- ✅ Kubernetes - Single service port with proper probes
- ✅ Docker Compose - Simple port mapping
apiVersion: apps/v1
kind: Deployment
metadata:
name: fitz
spec:
replicas: 3
selector:
matchLabels:
app: fitz
template:
metadata:
labels:
app: fitz
spec:
containers:
- name: fitz
image: fitz:latest
ports:
- containerPort: 8080
name: http
# Startup probe - gives app time to initialize
startupProbe:
httpGet:
path: /startupz
port: 8080
initialDelaySeconds: 0
periodSeconds: 2
timeoutSeconds: 1
failureThreshold: 30 # 60 seconds max startup time
# Liveness probe - restart if unhealthy
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 2
failureThreshold: 3
# Readiness probe - remove from service if not ready
readinessProbe:
httpGet:
path: /readyz
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 2
failureThreshold: 2
env:
- name: RUST_LOG
value: "info"
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 2000m
memory: 2Gi
apiVersion: v1
kind: Service
metadata:
name: fitz
spec:
type: ClusterIP
selector:
app: fitz
ports:
- port: 8080
targetPort: 8080
name: httpuse hyper::{Body, Request, Response, StatusCode, Method};
use hyper::service::{make_service_fn, service_fn};
use std::convert::Infallible;
pub async fn handle_request(
req: Request<Body>,
runtime: Arc<Runtime>,
) -> Result<Response<Body>, Infallible> {
let path = req.uri().path();
let method = req.method();
match (method, path) {
// Kubernetes probes (no auth)
(&Method::GET, "/healthz") => handle_liveness().await,
(&Method::GET, "/readyz") => handle_readiness(runtime).await,
(&Method::GET, "/startupz") => handle_startup(runtime).await,
(&Method::GET, "/health") => handle_health().await,
// Metrics (requires auth)
(&Method::GET, "/metrics") => {
if !check_auth(&req).await {
return Ok(unauthorized());
}
handle_metrics(runtime).await
}
// Admin API (requires auth + admin permission)
(&Method::GET, path) if path.starts_with("/admin/") => {
if !check_admin_auth(&req).await {
return Ok(unauthorized());
}
handle_admin(req, runtime).await
}
// WebSocket upgrade
(&Method::GET, "/ws") => handle_websocket_upgrade(req).await,
_ => Ok(not_found()),
}
}
pub async fn serve(addr: SocketAddr, runtime: Arc<Runtime>) {
let runtime = Arc::clone(&runtime);
let make_svc = make_service_fn(move |_conn| {
let runtime = Arc::clone(&runtime);
async move {
Ok::<_, Infallible>(service_fn(move |req| {
handle_request(req, Arc::clone(&runtime))
}))
}
});
let server = hyper::Server::bind(&addr).serve(make_svc);
if let Err(e) = server.await {
eprintln!("server error: {}", e);
}
}fn admin_router() -> Router { Router::new() .route("/stats", get(stats)) .route("/kv/stats", get(kv_stats)) .route("/kv/transactions", get(kv_transactions)) .route("/notice/subscriptions", get(notice_subscriptions)) // ... etc .layer(AuthLayer::new()) // Require JWT auth }
### Protocol Detection
On port 8080:
1. **HTTP GET/POST** → Check path:
- `/healthz`, `/readyz`, `/startupz`, `/health` → Health probes (no auth)
- `/metrics` → Metrics (check JWT/API key)
- `/admin/*` → Admin API (check JWT/API key + admin permission)
- `/ws` with Upgrade header → WebSocket handler
2. **Raw TCP** (binary data, no HTTP headers) → TCP frame handler
## Minimal Initial Implementation
Start with these essential endpoints:
### Phase 1: Health & Observability
1. `GET /healthz` - Kubernetes liveness probe
2. `GET /readyz` - Kubernetes readiness probe
3. `GET /startupz` - Kubernetes startup probe
4. `GET /health` - Legacy health check
5. `GET /metrics` - Prometheus metrics (with auth)
6. `GET /admin/stats` - Human-readable overview
### Phase 2: Domain Visibility
7. `GET /admin/kv/stats` - KV visibility
8. `GET /admin/notice/subscriptions` - Notice visibility
9. `GET /api/v1/queue/realms/{realm}/areas/{area}/resources/{resource}` - Queue warm-state visibility
10. `GET /admin/sessions` - Active connections
### Phase 3: Admin Commands
11. Domain-specific commands (rollback, cancel, release) with `X-Confirm: true`