From 304924b46cd274a6227a6d76a3706184d41c3055 Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Fri, 24 Apr 2026 13:31:56 +0200 Subject: [PATCH 01/39] jep: add JEP-0011 Metrics, Tracing, and Log Observability Proposes an optional, cross-component observability model for Jumpstarter covering lease context metadata, structured operational events, exporter/driver metrics, and standardized logging with direct Prometheus/Loki/Perses integration. Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 789 ++++++++++++++++++ python/docs/source/internal/jeps/README.md | 2 + 2 files changed, 791 insertions(+) create mode 100644 python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md new file mode 100644 index 000000000..304b922fc --- /dev/null +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -0,0 +1,789 @@ +# JEP-0011: Metrics, Tracing, and Log Observability + +| Field | Value | +| ----------------- | --------------------------------------------------------------------- | +| **JEP** | 0011 | +| **Title** | Metrics, Tracing, and Log Observability | +| **Author(s)** | @mangelajo (Miguel Angel Ajo Pelayo) | +| **Status** | Draft | +| **Type** | Standards Track | +| **Created** | 2026-04-23 | +| **Updated** | 2026-04-23 | +| **Discussion** | *TODO: Matrix thread or GitHub issue/PR when opened* | +| **Requires** | — | +| **Supersedes** | — | +| **Superseded-By** | — | + +--- + +## Abstract + +This JEP defines an optional, cross-component observability model for +Jumpstarter covering lease context metadata, structured operational events, +exporter/driver metrics, and standardized logging. 
It targets direct integration +with Prometheus (scrape), Loki (log aggregation), and Perses (dashboards) — +without mandating OpenTelemetry — and introduces an optional in-cluster +Jumpstarter Telemetry service that aggregates data from exporters and clients so +that edge processes never need Loki or cluster-scrape credentials. +Implementation is expected to land in phases; this JEP describes the end state +and compatibility rules. + +## Motivation + +Today, operators and CI maintainers need to answer questions that raw Kubernetes +objects and ad hoc text logs do not always answer in one place: +- *Which pipeline or image was being tested on this lease?* +- *How often do flashes fail on this exporter?* +- *What lease or user correlates a controller line with a failure on the client?* + +The `Lease` API already models scheduling and assignment; it does +not yet provide a first-class, documented place for run metadata or a standard +for lease-scoped operational events (beyond generic `conditions`). + +Exporters expose work to drivers, but there is no shared model for driver- or +exporter-level metrics that a monitoring stack can scrape or receive. + +### User Stories + +- **As a** lab operator, **I want to** see flash success/failure rates per + exporter in a Prometheus dashboard, **so that** I can spot failing hardware + before CI teams notice. +- **As a** CI pipeline author, **I want to** attach my build ID and image + digest to a lease, **so that** post-mortem queries in Loki can filter all + logs for one pipeline run across controller, exporter, and client. +- **As a** platform engineer, **I want** exporter processes to send telemetry + without holding Loki or Prometheus credentials, **so that** I do not have to + distribute and rotate secrets on every lab machine. 
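The second user story (filtering all logs for one pipeline run) reduces to a plain HTTP range query against Loki's standard `/loki/api/v1/query_range` endpoint. A minimal sketch, assuming hypothetical label names (`lease_id`, `pipeline_id`) that this JEP's final key list may change:

```python
import urllib.parse


def lease_logs_selector(labels: dict[str, str]) -> str:
    """Build a LogQL stream selector from lease-context labels."""
    pairs = ", ".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return "{" + pairs + "}"


def query_range_url(loki_base: str, selector: str, start_ns: int, end_ns: int) -> str:
    """URL for Loki's range-query endpoint (GET /loki/api/v1/query_range)."""
    params = urllib.parse.urlencode(
        {"query": selector, "start": start_ns, "end": end_ns, "limit": 1000}
    )
    return f"{loki_base}/loki/api/v1/query_range?{params}"


# One pipeline run, joined across controller, exporter, and client logs:
selector = lease_logs_selector({"lease_id": "lease-abc123", "pipeline_id": "build-42"})
print(query_range_url("http://loki.monitoring:3100", selector, 0, 1_000_000_000))
```

The same selector works in any Loki-speaking UI, which is what makes a single lease-level identity across components valuable.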

## Proposal

### Concepts

- **Lease context** — Identifiers and labels supplied by a client or CI and
  associated for the life of a lease, propagated where safe so metrics, logs,
  and traces can be filtered and joined.
- **Lease events** (or *operations*) — Annotated, structured log entries
  recording significant actions (for example *flash started*, *flash failed*,
  *image reference*) with typed fields, queryable in **Loki** alongside
  regular logs and distinct from higher-frequency debug output (see **DD-2**).
- **Exporter metrics** — Counters, histograms, and gauges (naming and labels
  TBD) exposed from the exporter and optionally enriched by individual drivers
  (for example storage operations per type).
- **Jumpstarter Telemetry** (optional) — a dedicated
  component with a well-known ingest path and the same trust
  model (mTLS, ServiceAccount) as Controller/Router;
  it isolates Loki/series work from the reconciler hot path (see
  **DD-7**). Multi-replica HA and PromQL `sum` aggregation are
  covered in **DD-8**; best-effort idempotency for informative metrics in
  **DD-9**.

### What users see

- When creating a lease, clients (or their tooling) can attach metadata via
  CRD fields and/or `spec.context` using documented
  keys and size limits. Example keys might include a build / pipeline
  identifier, image digest, or VCS revision.
- The controller and/or data plane write structured, annotated log events
  (see **DD-2**) for significant operations such as flash attempts and outcomes.
- Exporters send increments to the Jumpstarter Telemetry
  service over the existing exporter↔control-plane trust boundary;
  the in-cluster side then POSTs to Loki and exposes `/metrics`
  for scrape (see **DD-3**, **DD-7**) using in-cluster credentials, avoiding
  per-exporter Loki and metrics secrets. The same path can carry
  operator-chosen structured log lines and events (not unbounded default
  client chatter — see *Control-plane aggregation* below).
- The `jmp` CLI's output remains human-readable, but the CLI also submits its
  logs through the Jumpstarter Telemetry endpoint in a machine-parseable
  format for Loki ingest.

### API / Protocol Changes

*High level — to be refined during review.*

- **CRD (Lease)**: Additive changes only for the `spec.context` field.
  Backwards compatibility is preserved by leaving the field empty by default.
- **gRPC (if applicable)**: Additional controller methods to discover the
  availability of one or more metrics endpoints. Optional propagation of
  `traceparent` and lease identifiers in metadata; must remain backward
  compatible for existing clients (unknown metadata ignored by older servers).

### Hardware Considerations

- No hardware considerations.

## Design Decisions

### DD-1: How lease-scoped *context* metadata is stored

**Scope:** This decision is about where to store generic metadata on a
`Lease` that describes *why* a run exists or *where* it came from — for example
an external build id, pipeline id, VCS revision, or other
operator-defined keys (team, environment), within cardinality and
size limits documented elsewhere in this JEP. The same stored context
is the intended source to propagate (where safe) into metric series
labels and into log-line fields, both for emissions that occur during
the lease and for logs produced during client access to the platform
(for example `jmp`) or during exporter and control-plane handling, so
that Prometheus and Loki can correlate on one lease-level
identity without re-typing it on every line.

**Alternatives considered:**

1. **Annotation and label only** on the `Lease` object — Kube-native, no spec
   change; limited size for annotations; labels for select queries only.
2. **Typed subfields under `spec`** (for example `observability` or `context`)
   — easier validation, clearer API, migration path in CRD.
3.
**Only client-side** (environment / local config) — no cluster visibility;
   hard for operators to audit; no stable object-level link to per-lease
   metrics and server logs in the cluster.

**Decision:** the JEP leans toward **(2) for first-class, validated context**
with **(1) allowed for integration with generic tooling**, pending contributor
consensus.

**Rationale:** Typed fields make validation and documentation clear; labels
are still useful for selection and for tools that only understand metadata.

### DD-2: Where operational events (flash, image) live

**Alternatives considered:**

1. **Kubernetes `Event` objects** — built-in, TTL-limited, good for
   "what happened" in `kubectl get events` but not long-term history by default.
2. **`Lease.status.conditions` only** — compact but poor for a sequence of
   operations with payloads (image id, size).
3. **Dedicated CRD** (for example per-event or a single stream object) — more
   design and RBAC, better long-term retention and querying if backed properly.
4. **Annotated log events** — a lightweight alternative that can be traced
   and filtered alongside regular logs.

**Decision:** **(4)**. Alternatives (1)–(3) all add write pressure on the
cluster's etcd, while annotated log events provide the same level of
functionality and can be browsed together with regular logs.

**Rationale:** Annotated log events naturally flow through the Loki
 pipeline this JEP already establishes (**DD-5**, **DD-7**), so operational
 records (flash started, flash failed, image reference) are queryable,
 filterable, and correlated with surrounding exporter and controller logs
 using the same label set (`lease_id`, `exporter`, `result`, …) without a
 second query domain.
Kubernetes `Event` objects **(1)** have a short + default TTL (~1 h) and still write to etcd on every occurrence; + `status.conditions` **(2)** is a poor fit for a sequence of operations with + variable payloads (image digest, byte count, duration); a dedicated CRD + **(3)** adds schema versioning, RBAC surface, and per-event etcd writes + that scale with flash volume — all pressure the cluster does not need + for data whose primary consumers are dashboards and post-mortem + queries, not reconciliation loops. Structured log events carry arbitrary + fields without CRD migration, support configurable retention in Loki, + and keep the etcd write budget reserved for scheduling and assignment + where it matters most. + +### DD-3: Metrics: Prometheus scrape of `/metrics` as the reference path + +**Alternatives considered:** + +1. **HTTP `GET /metrics` in Prometheus text format** (pull) — the default + for in-cluster Prometheus in scrape mode; works + with the Prometheus Operator (`ServiceMonitor`), `kube-prometheus`, and + self-hosted jobs. The optional Jumpstarter Telemetry service exposes + this for aggregated counters it holds after receiving +1 / +N + from exporters. +2. **Prometheus remote write** (or a Mimir / Cortex receiver) + from a Jumpstarter component — useful in advanced topologies; not + part of the reference implementation in this JEP; operators can add a + federation or `remote_write` from Prometheus to long-term + storage without the application pushing to Prometheus. +3. **Both** — **(1)** is required for the documented path; **(2)** is + optional infrastructure behind Prometheus, not a second + required app protocol. + +**Decision:** **(1)** for how cluster Prometheus ingests Jumpstarter + aggregated metrics (scrape the Telemetry, Controller, + and Router services). + +**Rationale:** Scrape is standard, debuggable, and scalable; it matches + `ServiceMonitor`; it avoids app-side remote-write credentials and + complexity in Jumpstarter. 
See **DD-6** (no OTel), **DD-7** (Telemetry
 Deployment), **DD-8** (HA replicas).

### DD-4: Log format for services vs CLI

**Alternatives considered:**

1. **JSON always** for every process — best for machines; hard for humans
   debugging a laptop.
2. **Human text default for `jmp`**, **JSON for long-running services**, plus
   an optional CLI push through the telemetry endpoint in JSON format (in
   addition to the human-friendly output).
3. **Single format** with a pretty-printer in front of developers — more moving
   parts.

**Decision:** **(2)**. Long-running services (`jumpstarter-controller`,
 `jumpstarter-router`, `jumpstarter-telemetry`, Exporter) emit
 structured JSON to stdout. The Controller and Router do not
 push logs directly to Loki; instead, a cluster-level log shipper
 (Promtail, Grafana Alloy, Vector, or equivalent DaemonSet) scrapes their
 pod logs and delivers them to Loki. Only `jumpstarter-telemetry` writes
 to Loki directly (push API) because the exporter/client data it
 aggregates does not originate as any pod's stdout.

**Rationale:** Matches the requirement that *clients* stay human-readable
 while all services get parseable, joinable log lines. Writing JSON
 to stdout and relying on the cluster log shipper for Loki delivery
 decouples the Controller reconciler and Router session handling from
 Loki availability — a Loki outage does not affect lease operations.
 The Telemetry service retains a direct Loki push because it is an
 isolated workload (**DD-7**) whose core job is Loki ingest.

### DD-5: Where Loki and Prometheus (or remote-write) credentials live

**Alternatives considered:**

1. **Each exporter and edge host** holds credentials (or a sidecar) to push
   directly to Loki and to Prometheus (or a metrics gateway) — maximum
   flexibility; maximum secret distribution and rotation burden on lab and
   remote sites.
2.
**Jumpstarter Controller and/or Router** receive metrics and structured + events from exporters and (optionally) from client traffic they already + handle, and forward to the Loki push API and to + Prometheus-compatible sinks (scrape registration) + with in-cluster auth — one + credential surface; enriched with lease, exporter, and client context + in one place; must be non-blocking, bounded, and optional so the + control path does not depend on Loki or Prometheus availability. +3. **Hybrid** — generic in-cluster collectors for raw pod logs and scrape; + (2) for lease-scoped events and aggregated exporter metrics the + platform understands. +4. **Dedicated Jumpstarter Telemetry Deployment** (see **DD-7**) + instead of folding everything into the Controller — only + Telemetry holds Loki-push credentials; isolated failure domain + and scaling for high-volume increments. Router and Controller + write structured JSON to stdout (see **DD-4**) and expose `/metrics` + for Prometheus scrape; a cluster log shipper delivers their pod logs + to Loki without Jumpstarter-specific Loki credentials. + +**Decision:** (4) + +**Rationale:** The goal is to avoid propagating Loki- and + cluster-ingest authentication + to every exporter process while still attaching Jumpstarter-specific + context. Among Jumpstarter components, only `jumpstarter-telemetry` + holds Loki-push credentials — the Controller and Router have no Loki + client dependency (see **DD-4**); their pod logs reach Loki via the + cluster's existing log shipping infrastructure. Generic in-cluster + collectors solve *credentials* but not *semantic* correlation unless + integrated; the hub (2) reuses the existing trust model + (exporter→controller) and can inject labels and tenant headers (for + example `X-Scope-OrgID`) in one place. A separate Deployment (**4** / + **DD-7**) is preferable to overloading the main reconciler when + load or residency of counters matters. 
+ +### DD-6: OpenTelemetry (OTLP / Collector) as a *mandated* layer + +**Alternatives considered:** + +1. **Adopt OpenTelemetry** — instrument Controller, Router, Exporter, and + clients with the OTel SDK, export OTLP to a cluster-local + OpenTelemetry Collector, and let the Collector fan out to Loki, Prometheus + (remote write), and Tempo. +2. **Integrate directly** with each backend: Loki HTTP `POST /loki/api/v1/push` or + gRPC; Prometheus text on `/metrics`; structured JSON + (or logfmt) logs to stdout for shippers; optional W3C `traceparent` in + gRPC metadata for correlation *without* shipping full distributed + traces in the first iteration. If traces are ever needed, use Tempo + ingest where practical, *or* a thin sender — still + without a project-wide requirement on the OTel SDK in every binary. +3. **Hybrid (OTel in one language, direct in another)** — lowest common + implementation cost but inconsistent contributor experience and two + operational models. + +**Decision:** **(2).** This JEP does not make OpenTelemetry (SDK or + Collector) part of the required reference architecture. Vendors and + operators who already run an OpenTelemetry Collector may scrape the + same `/metrics`, receive logs shipped by existing agents, or + receive the Loki body the hub would have sent — compatibility + is welcome; dependency is not mandatory. + +**Rationale:** + +- **Complexity** — the Collector is another versioned, configured service; dual + OTel stacks (Go, Python) add version drift and test matrix. +- **Fit** — most Jumpstarter metrics and lease events map cleanly to + Prometheus and Loki wire protocols operators already use. +- **Narrow scope** — full three-pillar OTel (unified logs via OTLP) is + *optional product territory*; this JEP optimizes for low ceremony and + direct integration. + + +### DD-7: Optional Jumpstarter Telemetry service (dedicated Deployment vs. Controller/Router only) + +**Alternatives considered:** + +1. 
**In-process** in the Controller (and Router) reconciler — few
   moving parts; risk of CPU / GC pressure and stronger coupling
   between leases and high-volume increments or Loki writes.
2. A **dedicated** in-cluster Service and Deployment (working name
   `jumpstarter-telemetry`, TBD) that receives gRPC/HTTP increments from
   exporters and clients, applies them to counters in memory,
   POSTs to Loki, exposes `/metrics`, and uses the same K8s
   ServiceAccount / mTLS as other control-plane binaries.
3. **Split** into separate sidecars (Loki-only, metrics-only) — more images to
   build and version.

**Decision:** Prefer **(2)** for the optional aggregated-metrics + Loki
 path at scale; allow **(1)** in small or dev clusters; **(3)** only
 if review shows a need. When Loki is not available, the pod logs could
 still serve as a centralized log/event source; this could be helpful for
 testing.

**Rationale:** A dedicated workload can scale and restart independently;
 moving ingest to a separate service means Loki spikes and ingest load
 cannot starve lease reconciliation in the controller.

### DD-8: Multiple Telemetry replicas (HA) and addable counters

**Context:** The Telemetry process holds in-memory counters. Exporters send
+1 (e.g. flash success), +N (bytes read/written), or +1 per
reporting interval (e.g. one “inactive” minute for a lease with
labels `exporter`, `lease_id`, `client`).

**Alternatives considered:**

1. **Single replica** for Telemetry — no cross-pod `sum` issue; SPOF for
   ingest and scrape of that `Service`.
2. **Multiple replicas** behind a load balancer; each RPC updates one
   pod, which only advances its partial counters for the label
   sets it has seen. Prometheus scrapes all pods (or separate
   `PodMonitor` targets).
In PromQL,
   `sum by (exporter, lease_id, client, …) (…)` after dropping
   `pod` / `instance` matches the global total, as long as each real
   event is applied at most once in the system (counters are
   additive; increments are partitioned by traffic).
3. **Strong consistency** (Raft, Redis as source of truth for
   counters) — higher operating cost than this JEP’s v1 scope.

**Decision:** **(2)**

**Rationale:** Sums of cumulative counters across replicas are
 meaningful when each event is not double-applied; Loki
 appends are naturally per-replica as well. A possible failure mode is
 duplicate increments (retries, at-least-once RPCs). But this
 is informative data, and occasional (and unlikely) duplication does not
 justify a more complex design — see **DD-9**.

### DD-9: Idempotency vs. best-effort (acceptable over-count for informative metrics)

**Alternatives considered:**

1. **Idempotent** increments (deduplication keys or idempotent RPCs) —
   appropriate for billing- or SLO-sensitive series; more design
   and storage in the ingest path.
2. **Best effort** (at-least-once) `+1` / `+N` without global deduplication
   — simpler; rare extra counts on retries or replays (see
   **DD-8**).
3. “Exactly once in the exporter; Telemetry is a dumb adder” —
   still **(2)** at the edge if the network retries.

**Decision:** **(2)**

**Rationale:** Simpler RPCs and no global dedup store in v1;
 operators treat these numbers as order-of-magnitude signals,
 not invoices, unless policy changes.

### DD-10: Perses over Grafana for dashboarding

**Alternatives considered:**

1. **Grafana** — mature, widely deployed, massive plugin and datasource
   ecosystem; governed by Grafana Labs (commercial); AGPL v3 license;
   custom JSON dashboard format; external to Kubernetes architecture.
2.
**Perses** — CNCF project (vendor-neutral governance); Apache 2.0 + license; standardized dashboard spec (CUE/JSON) with built-in static + validation and SDKs for GitOps; Kubernetes-native (CRD support for + dashboards-as-code); data-source focus on Prometheus, Loki, and + Tempo — exactly the backends this JEP targets. + +**Decision:** **(2)** + +**Rationale:** + +- **License alignment** — Jumpstarter is Apache 2.0; recommending an + AGPL-licensed dashboard layer introduces license friction for downstream + distributors and embedders. +- **CNCF governance** — vendor-neutral stewardship matches the project's + open-source posture; no single-vendor control over the dashboard layer. +- **Kubernetes-native CRDs** — dashboards can be managed as K8s resources, + fitting the same declarative, reconciler-driven model Jumpstarter already + uses for Leases, Exporters, and the optional Telemetry Deployment. +- **GitOps and validation** — CUE-based specs with static validation and SDKs + enable dashboard-as-code in CI pipelines, consistent with the JEP's emphasis + on automation and CI integration. +- **Backend focus** — Perses targets Prometheus, Loki, and Tempo — exactly the + three backends this JEP standardizes on — without carrying the cost of a + broad plugin ecosystem the project does not need. + +Operators who prefer Grafana can still point it at the same `/metrics` and Loki +endpoints; this DD only governs the recommended dashboard experience. + +## Design Details + +### Correlation and fields + +*Subject to review — names and cardinality rules should be fixed before +"Implemented".* + +| Field / label | Where | Notes | +| ------------- | ----- | ----- | +| `lease_id` (or UIDs) | Logs, traces, some metrics | K8s object name or UID. | +| `exporter` | Metrics, logs | `Exporter` name. | +| `client` (identifier) | Logs, optional | Opaque; avoid PII by default. | +| W3C `trace_id` / `span_id` | If tracing enabled | Propagate across client ↔ exporter when viable. 
|

Additional `lease.spec.context` correlation fields can be added at runtime.

### Control-plane aggregation (Controller / Router / optional Telemetry)

When this mode is enabled in a deployment:

- Exporters and clients (`jmp`) send increments (`+1` /
  `+N`) and structured log/event records to the optional
  `jumpstarter-telemetry` service (name TBD, see **DD-7**). This dedicated
  `Service` uses the same mTLS / ServiceAccount / NetworkPolicy model as
  Controller and Router; it holds in-memory counters, POSTs to
  the Loki API, and exposes `/metrics` for Prometheus scrape
  (**DD-3**). HA (multiple replicas) uses `sum` in PromQL (**DD-8**);
  best-effort duplicate tolerance (**DD-9**). Exporter and edge processes never
  need Loki or cluster-scrape credentials directly (**DD-5**).
- Controller and Router emit structured JSON logs to stdout
  (see **DD-4**). They do not push logs directly to Loki; a cluster-level
  log shipper (Promtail, Grafana Alloy, Vector, or equivalent) scrapes
  their pod logs and delivers them to Loki. This decouples the reconciler
  and session-handling hot paths from Loki availability.
- **Backpressure** applies to the Telemetry service: its Loki-push and
  counter queues must be bounded; on overflow, drop (with a counter) or
  sample. Because the Controller and Router no longer push to Loki, their
  lease/session operations are inherently isolated from Loki or metrics
  path slowdowns.
- **Multi-tenancy:** if Loki is multi-tenant, the Telemetry writer (and the
  cluster log shipper for Controller/Router pod logs) applies org or
  namespace scoping consistently; label sets are reviewed to avoid
  cross-tenant leakage.
- This does not require that *all* metrics *originate* in a single
  process: the exporter and drivers still emit the facts;
  Telemetry aggregates and ships to Loki; Controller and
  Router expose `/metrics` for Prometheus scrape and rely on the
  log shipper for their stdout logs.
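The bounded-queue behavior described under *Backpressure* can be sketched as a drop-counting buffer. This is a minimal illustration with hypothetical names and sizes; the real implementation would export `dropped_total` on `/metrics`:

```python
import queue


class BoundedTelemetryQueue:
    """Bounded buffer for Loki-push records: on overflow, drop and count."""

    def __init__(self, maxsize: int = 10_000) -> None:
        self._q: queue.Queue = queue.Queue(maxsize=maxsize)
        self.dropped_total = 0  # itself exposed as a counter for scrape

    def offer(self, record: dict) -> bool:
        """Enqueue without blocking; the ingest RPC path never waits on Loki."""
        try:
            self._q.put_nowait(record)
            return True
        except queue.Full:
            self.dropped_total += 1  # drop with a counter, per the design above
            return False

    def drain(self, max_batch: int = 500) -> list[dict]:
        """Collect up to one batch for a Loki POST; empty list if idle."""
        batch: list[dict] = []
        while len(batch) < max_batch:
            try:
                batch.append(self._q.get_nowait())
            except queue.Empty:
                break
        return batch
```

A sampling policy could be layered on `offer` instead of unconditional drops; either way the Telemetry ingest path stays non-blocking.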
+ +### High-level data flow + +#### Client (`jmp`) + +```mermaid +flowchart LR + jmp([jmp CLI]) -->|session gRPC| exp[Exporter] + jmp -->|structured logs| tel[jumpstarter-telemetry] +``` + +The CLI connects to the Exporter for device sessions and sends structured +logs to the Telemetry service for Loki ingest (see **DD-4**). + +#### Exporter + +```mermaid +flowchart LR + ctrl[jumpstarter-controller] -->|lease lifecycle| exp[Exporter] + exp --> drv[Drivers] + exp -->|increments, events, logs| tel[jumpstarter-telemetry] +``` + +The Controller assigns leases; the Exporter delegates to Drivers +and forwards metrics increments, operational events, and logs to +Telemetry (see **DD-2**, **DD-5**, **DD-7**). + +#### Telemetry to backends + +```mermaid +flowchart LR + tel[jumpstarter-telemetry] -->|push API| loki[(Loki)] + tel -->|/metrics| prom[(Prometheus)] +``` + +Telemetry aggregates exporter and client data and writes to Loki and +exposes `/metrics` for Prometheus scrape (**DD-3**, **DD-7**). + +#### Controller to backends + +```mermaid +flowchart LR + ctrl[jumpstarter-controller] -->|JSON stdout| shipper[Log shipper] + shipper -->|pod logs| loki[(Loki)] + ctrl -->|/metrics| prom[(Prometheus)] +``` + +The Controller writes structured JSON to stdout (see **DD-4**). A +cluster log shipper scrapes pod logs and delivers them to Loki. The +Controller exposes `/metrics` for reconciliation and lease-level counters. + +#### Router to backends + +```mermaid +flowchart LR + router[jumpstarter-router] -->|JSON stdout| shipper[Log shipper] + shipper -->|pod logs| loki[(Loki)] + router -->|/metrics| prom[(Prometheus)] +``` + +The Router writes structured JSON to stdout (see **DD-4**). A +cluster log shipper scrapes pod logs and delivers them to Loki. The +Router exposes `/metrics` for routing and session-level counters. + +The diagrams above summarize the hub model described in *Control-plane +aggregation*. 
For credential isolation see **DD-5**; for the Telemetry +Deployment see **DD-7**; for HA summing see **DD-8**; for best-effort +semantics see **DD-9**. Optional direct exporter→Loki and `/metrics` +scrape on Exporter Pods remain valid for deployments that prefer them. +No OpenTelemetry Collector is *required* (see **DD-6**); operators may +run one *alongside* and scrape the same targets if they choose. + +### Common open-source backends (direct integration; no mandatory OTel) + +This JEP’s target wire protocols and components are Prometheus and +Loki (and, if trace export is ever added, Tempo or Jaeger with +native ingest or HTTP — not OTLP as a *Jumpstarter* requirement; see +**DD-6**). OpenTelemetry is a parallel ecosystem: teams can run a +Collector next to Jumpstarter and still scrape `/metrics` and ship +logs with Promtail-class agents; the reference design does not depend +on the OTel SDK in application code. + +- Prometheus for metrics (and Alertmanager for routing alerts): scrape + the `/metrics` endpoint, remote-write to long-term store if needed, and drive + dashboards in Perses or self-hosted UIs (see **DD-10**). `kube-state-metrics` and + the Prometheus Operator are common in Kubernetes; vendors often package + the same projects, but this JEP refers to the open-source components by name. +- Loki (Grafana Labs, AGPL) for log storage and querying; it pairs with + Perses (see **DD-10**) for search and with Promtail, Grafana + Agent, or Grafana Alloy to ship logs, or with application push to Loki’s HTTP API as + already discussed in the control-plane path. +- Traces (optional, future work) — if adopted, Grafana Tempo and Jaeger + are typical stores; use W3C Trace Context in RPC metadata for + correlation even when full trace export is off. OTLP may be + *only* a convenience for operators; it is not a JEP-0011 core + dependency. 
+- A typical Kubernetes integration path: `ServiceMonitor` + Prometheus + (or a compatible remote-write consumer), a Loki endpoint for logs + — any EKS, GKE, AKS, self-managed + Kubernetes, or bare-metal install that runs these same projects can be the + target; the implementation + plan should name tested combinations (Prometheus and Loki version + pairs where relevant) in `Implementation History`, not a single product bundle. + +## Test Plan + +### Unit Tests + +- Log field builders and redaction: ensure defaults strip secrets; optional + fields behind flags. +- Metric registration helpers: label validation and naming conventions. + +### Integration Tests + +- Operator + exporter: scrape or receive metrics; assert presence of a minimal + documented set of series after a known operation. +- If the control-plane forward path is implemented: with a test Loki and + a Prometheus-compatible sink (or mock), assert that records arrive with expected + `lease` / `exporter` labels and that exporter pods do not require + Loki or cluster-scrape credentials in their spec. +- If Telemetry runs with >1 replica: one test verifies that + `sum` by business labels (dropping `pod`/`instance`) matches expected totals after partitioned increments (see **DD-8**). +- Lease with metadata: objects validate; events or status updates match expected + structure. + +### Hardware-in-the-Loop + +- Flashing and power paths: at least one driver records an event and/or + metrics counter on success and failure on real hardware in a lab. +- *Hardware type TBD in implementation.* + +### Manual + +- `jmp` default output remains readable; JSON mode under opt-in shows expected + fields in a real CI job. + +## Acceptance Criteria + +*To be sharpened as design firms up.* + +- [ ] Documented lease metadata and/or annotation keys (with size and + validation rules) are merged with this JEP for reference. 
+- [ ] At least one event or status mechanism for *flash* (or equivalent) success + and failure is defined and has an integration test. +- [ ] Exporter (or sidecar) exposes a documented metrics surface; drivers + can contribute without reimplementing the HTTP server ad hoc in each + driver. +- [ ] Controller and one data-plane service emit structured logs with a + documented minimum field set; `jmp` documents human vs machine modes. +- [ ] If hub forwarding is implemented, document how operators enable it, + how backpressure and overflow work, and how Loki-push credentials are + mounted (see **DD-4**, **DD-5**, **DD-7**, **DD-8**, **DD-9**). +- [ ] Backward compatibility: existing clients and manifests without the new + fields continue to work; deployments that do not use hub forwarding + behave as today. + +## Graduation Criteria + +### Experimental (first release behind flag or doc-only) + +- JEP in Discussion; partial implementation; known gaps listed in + *Unresolved Questions*. + +### Stable + +- Acceptance criteria met; SLOs for log volume and metric cardinality + documented; upgrade notes for the operator and CLI. + +## Backward Compatibility + +- New CRD fields and labels must be optional; existing lease flows unchanged. +- gRPC: new metadata must be additive; servers tolerate missing trace and + context fields from older clients; clients ignore unknown fields where + applicable. +- No removal of current default CLI behavior; JSON logging only when selected. + +## Consequences + +### Positive + +- **Operators** can route logs and metrics to existing Prometheus, Loki, + and Perses-based stacks (self-hosted or platform-managed under + the hood) without a mandatory OpenTelemetry Collector in front of + Jumpstarter (see **DD-6**, **DD-10**). +- **CI** can correlate a failed run to equipment and build metadata. +- **Driver authors** get a single pattern for operation counters and event + emission. 
+- **Security-conscious** users can run with minimal log fields and no trace. +- **Operators** can keep Loki, Prometheus, and related API tokens in-cluster + only; exporters keep a single Jumpstarter trust relationship (**DD-5**). +- The optional Telemetry service isolates Loki/series work from the reconciler + (**DD-7**, **DD-8**); Controller and Router carry no Loki client dependency, + so a Loki outage cannot affect lease operations (**DD-4**). + +### Negative + +- More code paths, dependencies (for example a Prometheus client + library, Loki HTTP client, and structured log helpers), and + operability + documentation burden. +- Cardinality mistakes can harm Prometheus or backing stores — requires + guardrails and review. +- A poorly tuned forward path can add resource and SPOF-like + pressure on the Controller; a dedicated Telemetry Deployment or subprocess and strict queue bounds are likely needed at scale. +- Informative counters may be overstated on retries (**DD-9**); + re-labeling a series as SLO-critical without tightening + idempotency is an operational error. +- Operators must run a functioning cluster log shipper (Promtail, Grafana + Alloy, Vector, or equivalent) to see Controller and Router logs in Loki. + This is near-universal in production Kubernetes but worth documenting for + minimal or dev clusters. + +### Risks + +- Over-attachment of metadata to *metrics* as labels could overload TSDB; the + design may restrict labels to a fixed allowlist and push variable data to + event payloads and logs. +- Prometheus / Loki / Perses-stack version drift in the field + — document tested pairs; W3C Trace Context in gRPC remains + best-effort across Python and Go (no OTel SDK requirement to + propagate `traceparent` where needed). +- The Controller (or a bug in the forwarder) mis-labeling or dropping data + during incidents — mitigated by tests, sampling transparency, and + optional parallel scrape/stdout paths for the paranoid. 
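The allowlist guardrail mentioned in the risks above can be enforced mechanically at metric-registration time. A minimal sketch, assuming a hypothetical `validate_metric_labels` helper and label sets that are not part of any existing Jumpstarter API:

```python
# Hypothetical guardrail: metric labels must come from a fixed allowlist of
# bounded enums; well-known unbounded identifiers are rejected outright.
ALLOWED_LABELS = {"exporter", "operation", "result"}
UNBOUNDED_LABELS = {"lease_id", "client", "trace_id", "image_digest", "build_id"}

def validate_metric_labels(label_names):
    """Return the sorted, validated label names, or raise on cardinality hazards."""
    names = set(label_names)
    hazardous = names & UNBOUNDED_LABELS
    if hazardous:
        raise ValueError(f"unbounded labels rejected: {sorted(hazardous)}")
    unknown = names - ALLOWED_LABELS
    if unknown:
        raise ValueError(f"labels outside the allowlist: {sorted(unknown)}")
    return sorted(names)

print(validate_metric_labels(["exporter", "result"]))  # → ['exporter', 'result']
```

A unit test can then assert that `lease_id` or `client` never reaches a Prometheus label set, which is the label-validation check listed under *Unit Tests*.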
+ +## Rejected Alternatives + +- **"All metrics and facts are *generated* only in the controller"** — would + miss per-exporter and per-driver truth; rejected. *Forwarding* + exporter-originated series and events *through* the control-plane (with + stable labels) is not the same and remains in scope (see DD-5). +- *Requiring Loki- and Prometheus-ingest credentials on every exporter + and edge* as the only supported model — rejected in favor of + optional hub + forwarding and of cluster-native collectors that also avoid per-host + secrets, even though those collectors are not Jumpstarter-specific. +- **"Mandatory OpenTelemetry SDK and Collector"** for all metrics, + logs, and traces — rejected for the reference architecture; + rationale in **DD-6** (optional parallel deployment by operators is + still fine). +- **"Unstructured logs everywhere; parse with regex"** — rejected as + unscalable for joins with traces and multi-service incidents. +- **"Mandatory full tracing for every command"** — high overhead; rejected; prefer + sampling and opt-in for heavy paths. + +## Prior Art + +- [Prometheus](https://prometheus.io/) and [Alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/) + — time-series metrics and alerting; [Prometheus naming and labels](https://prometheus.io/docs/practices/naming/) + on cardinality and naming; remote write for non-scrape topologies. +- [Loki](https://grafana.com/oss/loki/) — log aggregation, label model, and push + and query APIs; often combined with [Perses](https://perses.dev/) (see + **DD-10**) and Grafana Agent / Alloy or + [Promtail](https://grafana.com/docs/loki/latest/send-data/promtail/) for log + shipping. +- [Grafana Tempo](https://grafana.com/oss/tempo/) or [Jaeger](https://www.jaegertracing.io/) — common trace backends + (native or HTTP ingest; OTLP where the operator uses it — not a + Jumpstarter code dependency; see **DD-6**). 
+- [Perses](https://perses.dev/) — CNCF dashboard project; Apache 2.0; + Kubernetes-native CRDs; CUE/JSON spec with GitOps SDKs; focused on + Prometheus, Loki, and Tempo data sources (see **DD-10**). +- [OpenTelemetry](https://opentelemetry.io/) and the + [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) — + relevant as ecosystem and operator-side *optional* plumbing; + this JEP intentionally does not adopt them in-process by default (**DD-6**). +- Other HiL / test systems often separate "run metadata" (like Jenkins build + id) from device state; similar separation maps well to this JEP’s lease + context + events. + +## Unresolved Questions + +- Exact `Lease` spec shape: single `context` object vs. multiple optional + sub-objects (CI vs. VCS). +- Event retention: Loki retention policy (per-tenant, per-stream retention + classes) for annotated log events (**DD-2**); whether Jumpstarter should + document recommended retention defaults or leave this to operators. +- **Identity**: how "current user" is defined (K8s user, OIDC subject, client + CR name) and what is safe to log in shared environments. +- If distributed traces are ever added, whether to use Jaeger + / Tempo native clients, HTTP-only, or revisit OTel in a + *future* JEP (this document excludes mandatory OTel per **DD-6**). +- Deduplication: multiple flashes of the *same* image in one lease — one event + type with counts vs. one row per attempt. +- **Hub implementation:** in-process forwarder in Controller/Router vs. the + optional Jumpstarter Telemetry Deployment (**DD-7**) — final naming, + resource limits, and protocol for increments (gRPC vs. HTTP) under load; + exporter scrape vs. increments-only mix. +- Whether gauges of “current sessions” (if any) need a single writer to + avoid invalid `sum` on replicas — v1 may stay counters-only per + **DD-8** for simplicity. 
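The replica question in the last bullet can be illustrated with a toy model (numbers purely illustrative): partitioned counters remain summable across replicas, while a gauge of shared state observed by every replica double-counts under the same `sum` query.

```python
# Two Telemetry replicas behind a load balancer. Each real event increments
# exactly one replica's partial counter, so summing partials (after dropping
# pod/instance) reconstructs the global total: the DD-8 invariant.
replicas = [
    {"flash_total": 3, "current_sessions": 2},  # both replicas *observe* the
    {"flash_total": 5, "current_sessions": 2},  # same 2 cluster-wide sessions
]

flash_total = sum(r["flash_total"] for r in replicas)
print(flash_total)  # → 8: correct, increments were partitioned by traffic

session_sum = sum(r["current_sessions"] for r in replicas)
print(session_sum)  # → 4: wrong, only 2 sessions exist; a gauge needs one writer
```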
+ +## Future Possibilities + +- SLOs and error budgets on lease acquisition time, flash success rate, and + mean time to recovery of exporters. +- Per-tenant or per-namespace dashboards as samples in the docs. +- *Not* part of this JEP: billing usage metering (could reuse metrics later). + +## Implementation History + +- Still nothing implemented. Everything under discussion. + +## References + +- [JEP-0000 — JEP Process](JEP-0000-jep-process.md) +- [Kubernetes Events](https://kubernetes.io/docs/reference/kubernetes-api/cluster-resources/event-v1/) +- [W3C Trace Context](https://www.w3.org/TR/trace-context/) (`traceparent`) +- Upstream project docs for the Prometheus, Loki, and + Perses versions (and optional Tempo / Jaeger if used) in a + given deployment; pin versions in release notes + and integration tests. + +--- + +*This JEP is licensed under the +[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0), +consistent with the Jumpstarter project.* diff --git a/python/docs/source/internal/jeps/README.md b/python/docs/source/internal/jeps/README.md index bba33ef35..fca115978 100644 --- a/python/docs/source/internal/jeps/README.md +++ b/python/docs/source/internal/jeps/README.md @@ -35,6 +35,7 @@ For the full process definition, see [JEP-0000](JEP-0000-jep-process.md). | JEP | Title | Status | Author(s) | | ---- | ---------------------------------------------------- | ----------- | -------------------- | | 0010 | [Renode Integration](JEP-0010-renode-integration.md) | Implemented | @vtz (Vinicius Zein) | +| 0011 | [Metrics, Tracing, and Log Observability](JEP-0011-observability-telemetry-logs.md) | Draft | @mangelajo (Miguel Angel Ajo Pelayo) | ### Informational JEPs @@ -67,4 +68,5 @@ For the full process definition, see [JEP-0000](JEP-0000-jep-process.md). 
JEP-0000-jep-process.md JEP-0010-renode-integration.md +JEP-0011-observability-telemetry-logs.md ``` From 9c0e4fd21b3953f21894f3a64192f2935724c4a3 Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Mon, 27 Apr 2026 10:15:27 +0200 Subject: [PATCH 02/39] jep-0011: add cardinality guidelines and opt-in client label strategy Add a detailed correlation table distinguishing Prometheus metric labels, Loki stream labels, and log line fields. Define concrete cardinality bounds and introduce a two-tier approach for the client label: off by default (use LogQL for per-client queries), opt-in for small deployments (< 200 stable clients). Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 88 ++++++++++++++++--- 1 file changed, 75 insertions(+), 13 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index 304b922fc..277020961 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -117,8 +117,8 @@ exporter-level metrics that a monitoring stack can scrape or receive. **Scope:** This decision is about where to store generic metadata on a `Lease` that describes *why* a run exists or *where* it came from — for example an external build id, pipeline id, VCS revision, or other -operator-defined keys (team, environment), within cardinality and -size limits documented elsewhere in this JEP. The same stored context +operator-defined keys (team, environment), within the cardinality and +size limits defined in *Cardinality guidelines*. 
The same stored context is the intended source to propagate (where safe) into metric series labels and into log line fields for emissions that occur during the lease and for logs produced during client access to the platform @@ -429,14 +429,76 @@ endpoints; this DD only governs the recommended dashboard experience. *Subject to review — names and cardinality rules should be fixed before "Implemented".* -| Field / label | Where | Notes | -| ------------- | ----- | ----- | -| `lease_id` (or UIDs) | Logs, traces, some metrics | K8s object name or UID. | -| `exporter` | Metrics, logs | `Exporter` name. | -| `client` (identifier) | Logs, optional | Opaque; avoid PII by default. | -| W3C `trace_id` / `span_id` | If tracing enabled | Propagate across client ↔ exporter when viable. | - -Additional lease.spec.context correlation fields can be added in runtime. +| Field / label | Prometheus metric label | Loki stream label | Log line field | Notes | +| ------------- | :--------------------: | :---------------: | :------------: | ----- | +| `exporter` | yes | yes | yes | Bounded by cluster size. | +| `operation` (flash, power, …) | yes | no | yes | Small fixed enum. | +| `result` (success, failure, …) | yes | no | yes | Small fixed enum. | +| `component` (controller, router, telemetry, exporter) | no | yes | yes | Fixed set of service names. | +| `namespace` | no | yes | yes | K8s namespace; bounded. | +| `lease_id` (or UID) | **no** | **no** | yes | Unbounded — use only in log line JSON. | +| `client` (identifier) | opt-in | **no** | yes | See *Cardinality guidelines*; avoid PII. | +| `image_digest`, `build_id`, VCS ref | **no** | **no** | yes | From `spec.context`; unbounded. | +| W3C `trace_id` / `span_id` | **no** | **no** | yes | Propagate in gRPC metadata and log lines. | + +Additional `lease.spec.context` correlation fields can be added at runtime; +they appear as structured log line fields, never as Prometheus or Loki labels. 
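To make the split concrete, here is a sketch of one structured exporter log line under this scheme; field names follow the table above, and all values are illustrative:

```python
import json

# Bounded fields double as Prometheus/Loki labels; unbounded identifiers
# (lease_id, client, spec.context keys) live only in the log line body.
line = {
    "ts": "2026-04-27T10:15:00Z",
    "level": "error",
    "component": "exporter",   # Loki stream label
    "exporter": "lab-rig-01",  # metric label and Loki stream label
    "operation": "flash",      # metric label (bounded enum)
    "result": "failure",       # metric label (bounded enum)
    "lease_id": "9b1d6c2e",    # log line only (unbounded)
    "client": "ci-runner-17",  # log line only
    "build_id": "4211",        # propagated from lease spec.context
    "msg": "flash verification failed",
}
print(json.dumps(line, sort_keys=True))
```

A LogQL filter such as `{component="exporter"} | json | lease_id = "9b1d6c2e"` then queries the unbounded field without it ever becoming a stream label.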
+ +### Cardinality guidelines + +Unbounded identifiers (`lease_id`, `image_digest`, `trace_id`, and +any operator-defined `spec.context` keys) must not be used as Prometheus metric +labels or Loki stream labels. They belong inside structured log line JSON, +where Loki filter expressions (`| json | lease_id = "…"`) can query them +without inflating the label index. + +Rules of thumb for this JEP: + +- **Prometheus**: each metric label dimension should have < 100 distinct values + per scrape target. The default label set for Jumpstarter metrics is + `{exporter, operation, result}` — all bounded enums. +- **Loki**: stream labels should be a small fixed set (`{component, exporter, + namespace}`) to keep active stream count per tenant manageable (Grafana's + guidance: < 100 k active streams). High-cardinality fields go inside the log + line body. +- **Lease context fields** from `spec.context` are propagated into log line + JSON only. A future implementation may allow operators to promote specific + context keys to Loki stream labels at their own cardinality risk, but the + default is log-line-only. + +#### `client` label: opt-in two-tier strategy + +Per-client metrics (e.g. flash failures by client) are valuable for operators +but `client` is a semi-bounded dimension whose cardinality depends on +deployment size. The JEP adopts a two-tier approach: + +1. **Default (off)**: metrics use `{exporter, operation, result}` only. + `client` appears in every structured log line, so per-client analysis is + available via LogQL: + `sum by (client) (count_over_time({component="exporter"} | json | operation="flash" [5m]))`. +2. **Opt-in**: an operator flag (e.g. `metrics.includeClientLabel: true`) adds + `client` to the Prometheus label set. This is safe for deployments with a + bounded, stable set of registered clients (rule of thumb: < 200). Operators + with short-lived or ephemeral clients (CI runners, dynamic pods) should + leave it off to avoid series churn. 
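The series counts in the impact table below are a straight product of the bounded label dimensions, multiplied by client count once the opt-in label is enabled; a quick check of the arithmetic:

```python
# Series per metric = |exporter| x |operation| x |result|, times |clients|
# when the opt-in client label is enabled (figures match the table below).
exporters, operations, results = 20, 4, 2
base = exporters * operations * results
print(base)  # → 160 series per metric without the client label

for clients in (150, 1000):
    print(clients, base * clients)  # → 150 24000, then 1000 160000

metrics = 10
print(base * 150 * metrics)  # → 240000 total series at 150 clients
```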
+ +Approximate series impact with 20 exporters, 4 operations, 2 results: + +| Clients | Series per metric (without) | Series per metric (with) | +| ------- | :-------------------------: | :----------------------: | +| — | 160 | — | +| 150 | 160 | 24,000 | +| 1,000 | 160 | 160,000 | + +At 150 clients and ~10 metrics the total (~240 k series) is well within a +single Prometheus instance. At 1,000 clients the total (~1.6 M series) is +feasible but approaches the range where long-term retention benefits from +Thanos or Mimir, and series churn from ephemeral clients can degrade +compaction. + +Future work may explore Prometheus exemplars (attaching `client` to individual +samples without creating full series) as a low-cardinality alternative for +"which client caused this spike?" queries. ### Control-plane aggregation (Controller / Router / optional Telemetry) @@ -684,9 +746,9 @@ on the OTel SDK in application code. ### Risks -- Over-attachment of metadata to *metrics* as labels could overload TSDB; the - design may restrict labels to a fixed allowlist and push variable data to - event payloads and logs. +- Over-attachment of metadata to *metrics* as labels could overload TSDB; + *Cardinality guidelines* defines a label allowlist and pushes variable data + to log line fields. 
- Prometheus / Loki / Perses-stack version drift in the field — document tested pairs; W3C Trace Context in gRPC remains best-effort across Python and Go (no OTel SDK requirement to From 7acbd2765b06ddad429b5ee9248239bc9720c893 Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Mon, 27 Apr 2026 10:18:37 +0200 Subject: [PATCH 03/39] jep-0011: fix CI typos and Sphinx mermaid warnings - Replace `mis-labeling` with `mislabeling` to pass typos check - Add AKS (Azure Kubernetes Service) to typos.toml allowlist - Use MyST `{mermaid}` directive syntax instead of plain `mermaid` fenced blocks so sphinxcontrib-mermaid renders without warnings Made-with: Cursor --- .../jeps/JEP-0011-observability-telemetry-logs.md | 12 ++++++------ typos.toml | 3 +++ 2 files changed, 9 insertions(+), 6 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index 277020961..90d0ddf3e 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -537,7 +537,7 @@ When this mode is enabled in a deployment: #### Client (`jmp`) -```mermaid +```{mermaid} flowchart LR jmp([jmp CLI]) -->|session gRPC| exp[Exporter] jmp -->|structured logs| tel[jumpstarter-telemetry] @@ -548,7 +548,7 @@ logs to the Telemetry service for Loki ingest (see **DD-4**). #### Exporter -```mermaid +```{mermaid} flowchart LR ctrl[jumpstarter-controller] -->|lease lifecycle| exp[Exporter] exp --> drv[Drivers] @@ -561,7 +561,7 @@ Telemetry (see **DD-2**, **DD-5**, **DD-7**). #### Telemetry to backends -```mermaid +```{mermaid} flowchart LR tel[jumpstarter-telemetry] -->|push API| loki[(Loki)] tel -->|/metrics| prom[(Prometheus)] @@ -572,7 +572,7 @@ exposes `/metrics` for Prometheus scrape (**DD-3**, **DD-7**). 
#### Controller to backends -```mermaid +```{mermaid} flowchart LR ctrl[jumpstarter-controller] -->|JSON stdout| shipper[Log shipper] shipper -->|pod logs| loki[(Loki)] @@ -585,7 +585,7 @@ Controller exposes `/metrics` for reconciliation and lease-level counters. #### Router to backends -```mermaid +```{mermaid} flowchart LR router[jumpstarter-router] -->|JSON stdout| shipper[Log shipper] shipper -->|pod logs| loki[(Loki)] @@ -753,7 +753,7 @@ on the OTel SDK in application code. — document tested pairs; W3C Trace Context in gRPC remains best-effort across Python and Go (no OTel SDK requirement to propagate `traceparent` where needed). -- The Controller (or a bug in the forwarder) mis-labeling or dropping data +- The Controller (or a bug in the forwarder) mislabeling or dropping data during incidents — mitigated by tests, sampling transparency, and optional parallel scrape/stdout paths for the paranoid. diff --git a/typos.toml b/typos.toml index 3cc13976e..5e5d30957 100644 --- a/typos.toml +++ b/typos.toml @@ -19,6 +19,9 @@ mosquitto = "mosquitto" # ser is short for "serialize" in variable names like ser_json_timedelta ser = "ser" +# AKS is Azure Kubernetes Service +AKS = "AKS" + [type.gomod] # Exclude go.mod and go.sum from spell checking extend-glob = ["go.mod", "go.sum"] From d9b2e5ef0e958976db1504fcc6334f2e400d72a5 Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Mon, 27 Apr 2026 11:51:27 +0200 Subject: [PATCH 04/39] jep-0011: adopt exemplars for high-cardinality context Replace the opt-in client label two-tier strategy with Prometheus exemplars as the primary mechanism for attaching high-cardinality identifiers (client, lease_id, trace_id, spec.context keys) to metric samples without inflating series cardinality. 
- Add Prom exemplar column to the correlation table - Rewrite cardinality guidelines around exemplars - Update DD-3 rationale (OpenMetrics carries exemplars natively) - Fix DD-8 PromQL to use only bounded labels - Update risks with exemplar prerequisites (Prometheus >= 2.26) - Add exemplar references to Prior Art - Fix misc typos and consistency issues Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 211 ++++++++---------- 1 file changed, 95 insertions(+), 116 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index 90d0ddf3e..75df1c626 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -5,11 +5,11 @@ | **JEP** | 0011 | | **Title** | Metrics, Tracing, and Log Observability | | **Author(s)** | @mangelajo (Miguel Angel Ajo Pelayo) | -| **Status** | Draft | +| **Status** | Discussion | | **Type** | Standards Track | | **Created** | 2026-04-23 | -| **Updated** | 2026-04-23 | -| **Discussion** | *TODO: Matrix thread or GitHub issue/PR when opened* | +| **Updated** | 2026-04-27 | +| **Discussion** | https://github.com/jumpstarter-dev/jumpstarter/pull/631 | | **Requires** | — | | **Supersedes** | — | | **Superseded-By** | — | @@ -163,8 +163,8 @@ are still useful for selection and for tools that only understand metadata. pipeline this JEP already establishes (**DD-5**, **DD-7**), so operational records (flash started, flash failed, image reference) are queryable, filterable, and correlated with surrounding exporter and controller logs - using the same label set (`lease_id`, `exporter`, `result`, …) without a - second query domain. Kubernetes `Event` objects **(1)** have a short + using the same correlation fields (`lease_id`, `exporter`, `result`, …) + without a second query domain. 
Kubernetes `Event` objects **(1)** have a short default TTL (~1 h) and still write to etcd on every occurrence; `status.conditions` **(2)** is a poor fit for a sequence of operations with variable payloads (image digest, byte count, duration); a dedicated CRD @@ -201,15 +201,17 @@ are still useful for selection and for tools that only understand metadata. **Rationale:** Scrape is standard, debuggable, and scalable; it matches `ServiceMonitor`; it avoids app-side remote-write credentials and - complexity in Jumpstarter. See **DD-6** (no OTel), **DD-7** (Telemetry + complexity in Jumpstarter. The OpenMetrics exposition format used by + the scrape path natively carries exemplars, enabling high-cardinality + context (`client`, `lease_id`, `trace_id`) on individual samples without + additional infrastructure. See **DD-6** (no OTel), **DD-7** (Telemetry Deployment), **DD-8** (HA replicas). ### DD-4: Log format for services vs CLI **Alternatives considered:** -1. **JSON always** for every process — best for machines; hard for humans - debugging a laptop. +1. **JSON always** for every process — best for machines; hard for humans. 2. **Human text default for `jmp`**, **JSON for long-running services** and an optional cli push via the metrics endpoint in JSON format (in addition to the human friendly output) @@ -342,7 +344,8 @@ are still useful for selection and for tools that only understand metadata. **Context:** The Telemetry process holds in-memory counters. Exporters send +1 (e.g. flash success), +N (bytes read/written), or +1 per reporting interval (e.g. one “inactive” minute for a lease with -labels `exporter`, `lease_id`, `client`). +labels `exporter`, `operation`, `result`). High-cardinality context +(`lease_id`, `client`, `trace_id`) is attached via exemplars, not labels. **Alternatives considered:** @@ -352,7 +355,7 @@ labels `exporter`, `lease_id`, `client`). pod, which only advances its partial counters for the label sets it has seen. 
Prometheus scrapes all pods (or separate `PodMonitor` targets). In PromQL, - `sum by (exporter, lease_id, client, …) (…)` after dropping + `sum by (exporter, operation, result) (…)` after dropping `pod` / `instance` matches the global total, as long as each real event is applied at most once in the system (counters are additive; increments are partitioned by traffic). @@ -429,76 +432,75 @@ endpoints; this DD only governs the recommended dashboard experience. *Subject to review — names and cardinality rules should be fixed before "Implemented".* -| Field / label | Prometheus metric label | Loki stream label | Log line field | Notes | -| ------------- | :--------------------: | :---------------: | :------------: | ----- | -| `exporter` | yes | yes | yes | Bounded by cluster size. | -| `operation` (flash, power, …) | yes | no | yes | Small fixed enum. | -| `result` (success, failure, …) | yes | no | yes | Small fixed enum. | -| `component` (controller, router, telemetry, exporter) | no | yes | yes | Fixed set of service names. | -| `namespace` | no | yes | yes | K8s namespace; bounded. | -| `lease_id` (or UID) | **no** | **no** | yes | Unbounded — use only in log line JSON. | -| `client` (identifier) | opt-in | **no** | yes | See *Cardinality guidelines*; avoid PII. | -| `image_digest`, `build_id`, VCS ref | **no** | **no** | yes | From `spec.context`; unbounded. | -| W3C `trace_id` / `span_id` | **no** | **no** | yes | Propagate in gRPC metadata and log lines. | +| Field / label | Prom label | Prom exemplar | Loki stream | Log line | Notes | +| -------------------------------- | :--------: | :-----------: | :---------: | :------: | --------------------------------------------------- | +| `exporter` | yes | — | yes | yes | Bounded by cluster size. | +| `operation` | yes | — | no | yes | Small fixed enum (flash, power, …). | +| `result` | yes | — | no | yes | Small fixed enum (success, failure, …). 
| +| `component` | no | — | yes | yes | Fixed set (controller, router, telemetry, exporter).| +| `namespace` | no | — | yes | yes | K8s namespace; bounded. | +| `lease_id` | **no** | yes | **no** | yes | Unbounded; exemplar for drill-down. | +| `client` | **no** | yes | **no** | yes | Exemplar; avoid PII. | +| `image_digest`, `build_id`, etc. | **no** | yes | **no** | yes | From `spec.context`; always included. | +| `trace_id` / `span_id` | **no** | yes | **no** | yes | W3C; links metrics to traces via exemplars. | Additional `lease.spec.context` correlation fields can be added at runtime; -they appear as structured log line fields, never as Prometheus or Loki labels. +they appear as structured log line fields and as Prometheus exemplar keys +(see *Exemplars for high-cardinality context* below). ### Cardinality guidelines -Unbounded identifiers (`lease_id`, `image_digest`, `trace_id`, and +Unbounded identifiers (`lease_id`, `client`, `image_digest`, `trace_id`, and any operator-defined `spec.context` keys) must not be used as Prometheus metric -labels or Loki stream labels. They belong inside structured log line JSON, -where Loki filter expressions (`| json | lease_id = "…"`) can query them -without inflating the label index. +labels or Loki stream labels. They belong inside structured log line JSON +and Prometheus exemplars (see below), where Loki filter expressions +(`| json | lease_id = "…"`) and dashboard exemplar overlays can surface them +without inflating the label index or TSDB series count. Rules of thumb for this JEP: -- **Prometheus**: each metric label dimension should have < 100 distinct values - per scrape target. The default label set for Jumpstarter metrics is - `{exporter, operation, result}` — all bounded enums. +- **Prometheus labels**: each metric label dimension should have < 100 distinct + values per scrape target. The default label set for Jumpstarter metrics is + `{exporter, operation, result}` — all bounded enums. 
High-cardinality + context is carried via exemplars, not labels. - **Loki**: stream labels should be a small fixed set (`{component, exporter, namespace}`) to keep active stream count per tenant manageable (Grafana's guidance: < 100 k active streams). High-cardinality fields go inside the log line body. - **Lease context fields** from `spec.context` are propagated into log line - JSON only. A future implementation may allow operators to promote specific - context keys to Loki stream labels at their own cardinality risk, but the - default is log-line-only. - -#### `client` label: opt-in two-tier strategy - -Per-client metrics (e.g. flash failures by client) are valuable for operators -but `client` is a semi-bounded dimension whose cardinality depends on -deployment size. The JEP adopts a two-tier approach: - -1. **Default (off)**: metrics use `{exporter, operation, result}` only. - `client` appears in every structured log line, so per-client analysis is - available via LogQL: - `sum by (client) (count_over_time({component="exporter"} | json | operation="flash" [5m]))`. -2. **Opt-in**: an operator flag (e.g. `metrics.includeClientLabel: true`) adds - `client` to the Prometheus label set. This is safe for deployments with a - bounded, stable set of registered clients (rule of thumb: < 200). Operators - with short-lived or ephemeral clients (CI runners, dynamic pods) should - leave it off to avoid series churn. - -Approximate series impact with 20 exporters, 4 operations, 2 results: - -| Clients | Series per metric (without) | Series per metric (with) | -| ------- | :-------------------------: | :----------------------: | -| — | 160 | — | -| 150 | 160 | 24,000 | -| 1,000 | 160 | 160,000 | - -At 150 clients and ~10 metrics the total (~240 k series) is well within a -single Prometheus instance. 
At 1,000 clients the total (~1.6 M series) is -feasible but approaches the range where long-term retention benefits from -Thanos or Mimir, and series churn from ephemeral clients can degrade -compaction. - -Future work may explore Prometheus exemplars (attaching `client` to individual -samples without creating full series) as a low-cardinality alternative for -"which client caused this spike?" queries. + JSON and into Prometheus exemplars. They never become Prometheus labels or + Loki stream labels. + +#### Exemplars for high-cardinality context + +Prometheus exemplars attach arbitrary key-value pairs to individual counter +increments and histogram observations without creating new time series. This +is the primary mechanism this JEP uses to surface per-request context +(`client`, `lease_id`, `trace_id`) on metrics while keeping series cardinality +flat. + +Default exemplar keys emitted on every counter/histogram observation: + +| Key | Source | Purpose | +| --- | ------ | ------- | +| `client` | Lease or session identity | "Which client caused this spike?" | +| `lease_id` | Lease UID | Correlate a metric sample with lease logs. | +| `trace_id` | W3C `traceparent` | Click-through from metric to trace. | + +All `spec.context` keys (e.g. `build_id`, `image_digest`) are automatically +included as exemplar keys. Because exemplars are per-observation metadata — +not label dimensions — they have zero impact on series cardinality regardless +of how many distinct values appear. + +**Dashboard visualization**: when exemplars are enabled on a Prometheus data +source, metric panels render clickable dots on each sample that carries +exemplar data. Clicking a dot reveals the attached keys and can link to +Loki log queries (filtered by `lease_id`) or a Tempo trace view (filtered +by `trace_id`). + +Per-client analysis remains available via LogQL for operators who do not +use exemplars: +`sum by (client) (count_over_time({component="exporter"} | json | operation="flash" [5m]))`. 
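On the wire, an exemplar rides after the sample value in the OpenMetrics text exposition. An illustrative scrape line (metric name and all values are hypothetical; per the OpenMetrics spec, the exemplar's combined label names and values must not exceed 128 UTF-8 characters):

```text
# TYPE jumpstarter_flash_operations counter
jumpstarter_flash_operations_total{exporter="lab-rig-01",operation="flash",result="failure"} 3 # {lease_id="9b1d6c2e",client="ci-runner-17",trace_id="4bf92f3577b34da6a3ce929d0e0e4736"} 1 1714212900.0
```

A dashboard with exemplar support renders such samples as clickable markers, linking `trace_id` to a trace view and `lease_id` to the corresponding Loki query.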
### Control-plane aggregation (Controller / Router / optional Telemetry) @@ -551,7 +553,7 @@ logs to the Telemetry service for Loki ingest (see **DD-4**). ```{mermaid} flowchart LR ctrl[jumpstarter-controller] -->|lease lifecycle| exp[Exporter] - exp --> drv[Drivers] + drv[Drivers] --> exp exp -->|increments, events, logs| tel[jumpstarter-telemetry] ``` @@ -563,7 +565,9 @@ Telemetry (see **DD-2**, **DD-5**, **DD-7**). ```{mermaid} flowchart LR - tel[jumpstarter-telemetry] -->|push API| loki[(Loki)] + tel[jumpstarter-telemetry] -->|JSON stdout| shipper[Log shipper] + shipper -->|pod logs| loki[(Loki)] + tel -->|push API| loki tel -->|/metrics| prom[(Prometheus)] ``` @@ -661,29 +665,24 @@ on the OTel SDK in application code. - Flashing and power paths: at least one driver records an event and/or metrics counter on success and failure on real hardware in a lab. -- *Hardware type TBD in implementation.* +- Serial and stream paths expose tx/rx byte counts. ### Manual -- `jmp` default output remains readable; JSON mode under opt-in shows expected - fields in a real CI job. +- `jmp` default output remains readable; JSON structured logs are only sent + to jumpstarter-telemetry for general log ingest. ## Acceptance Criteria -*To be sharpened as design firms up.* - -- [ ] Documented lease metadata and/or annotation keys (with size and - validation rules) are merged with this JEP for reference. -- [ ] At least one event or status mechanism for *flash* (or equivalent) success - and failure is defined and has an integration test. - [ ] Exporter (or sidecar) exposes a documented metrics surface; drivers can contribute without reimplementing the HTTP server ad hoc in each driver. - [ ] Controller and one data-plane service emit structured logs with a - documented minimum field set; `jmp` documents human vs machine modes. 
-- [ ] If hub forwarding is implemented, document how operators enable it,
-  how backpressure and overflow work, and how Loki-push credentials are
-  mounted (see **DD-4**, **DD-5**, **DD-7**, **DD-8**, **DD-9**).
+  documented minimum field set.
+- [ ] Operator provides a configuration section to enable metrics, with the
+  secret references and connection details needed to push logs to Loki.
+- [ ] Operator attempts to auto-configure Prometheus scraping of the
+  documented metrics endpoints.
- [ ] Backward compatibility: existing clients and manifests without the new
  fields continue to work; deployments that do not use hub forwarding
  behave as today.
@@ -730,15 +729,7 @@ on the OTel SDK in application code.
- More code paths, dependencies (for example a Prometheus client
  library, Loki HTTP client, and structured log helpers), and
-  operability
-  documentation burden.
-- Cardinality mistakes can harm Prometheus or backing stores — requires
-  guardrails and review.
-- A poorly tuned forward path can add resource and SPOF-like
-  pressure on the Controller; a dedicated Telemetry Deployment or subprocess and strict queue bounds are likely needed at scale.
-- Informative counters may be overstated on retries (**DD-9**);
-  re-labeling a series as SLO-critical without tightening
-  idempotency is an operational error.
+  operability and documentation burden.
- Operators must run a functioning cluster log shipper (Promtail, Grafana
  Alloy, Vector, or equivalent) to see Controller and Router logs in Loki.
  This is near-universal in production Kubernetes but worth documenting for
  minimal or dev clusters.

### Risks

+- High-cardinality metadata accidentally promoted to metric *labels* could
+  overload TSDB.
*Cardinality guidelines* restricts labels to bounded enums + and routes variable context through exemplars and log line fields instead. +- Exemplars require the OpenMetrics exposition format and Prometheus >= 2.26 + with exemplar storage enabled (on by default since Prometheus 2.39). + Operators on older Prometheus versions still get full metrics and logs; + exemplar-based drill-down is unavailable until they upgrade. - Prometheus / Loki / Perses-stack version drift in the field — document tested pairs; W3C Trace Context in gRPC remains best-effort across Python and Go (no OTel SDK requirement to propagate `traceparent` where needed). -- The Controller (or a bug in the forwarder) mislabeling or dropping data - during incidents — mitigated by tests, sampling transparency, and - optional parallel scrape/stdout paths for the paranoid. ## Rejected Alternatives @@ -781,7 +773,11 @@ on the OTel SDK in application code. - [Prometheus](https://prometheus.io/) and [Alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/) — time-series metrics and alerting; [Prometheus naming and labels](https://prometheus.io/docs/practices/naming/) - on cardinality and naming; remote write for non-scrape topologies. + on cardinality and naming; remote write for non-scrape topologies; + [Exemplars](https://prometheus.io/docs/instrumenting/exposition_formats/#exemplars) + for attaching high-cardinality context to individual samples. +- [Grafana exemplar support](https://grafana.com/docs/grafana/latest/fundamentals/exemplars/) + — visualizing exemplars in metric panels and linking to traces or logs. - [Loki](https://grafana.com/oss/loki/) — log aggregation, label model, and push and query APIs; often combined with [Perses](https://perses.dev/) (see **DD-10**) and Grafana Agent / Alloy or @@ -803,25 +799,9 @@ on the OTel SDK in application code. ## Unresolved Questions -- Exact `Lease` spec shape: single `context` object vs. multiple optional - sub-objects (CI vs. VCS). 
- Event retention: Loki retention policy (per-tenant, per-stream retention classes) for annotated log events (**DD-2**); whether Jumpstarter should document recommended retention defaults or leave this to operators. -- **Identity**: how "current user" is defined (K8s user, OIDC subject, client - CR name) and what is safe to log in shared environments. -- If distributed traces are ever added, whether to use Jaeger - / Tempo native clients, HTTP-only, or revisit OTel in a - *future* JEP (this document excludes mandatory OTel per **DD-6**). -- Deduplication: multiple flashes of the *same* image in one lease — one event - type with counts vs. one row per attempt. -- **Hub implementation:** in-process forwarder in Controller/Router vs. the - optional Jumpstarter Telemetry Deployment (**DD-7**) — final naming, - resource limits, and protocol for increments (gRPC vs. HTTP) under load; - exporter scrape vs. increments-only mix. -- Whether gauges of “current sessions” (if any) need a single writer to - avoid invalid `sum` on replicas — v1 may stay counters-only per - **DD-8** for simplicity. ## Future Possibilities @@ -832,7 +812,7 @@ on the OTel SDK in application code. ## Implementation History -- Still nothing implemented. Everything under discussion. +- ## References @@ -847,5 +827,4 @@ on the OTel SDK in application code. 
--- *This JEP is licensed under the -[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0), -consistent with the Jumpstarter project.* +[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)* From 26d35c221a7e208e8c1e53d56825aa07ecc6f5be Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Mon, 27 Apr 2026 12:03:12 +0200 Subject: [PATCH 05/39] jep-0011: add driver/error_type/direction labels and example queries MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Expand the Prometheus label set with three new bounded dimensions: - driver: slice metrics by driver type (usbsdmux, dutlink, …) - error_type: classify failures (timeout, device_error, …) - direction: tx/rx for byte-counter and stream metrics Add a Proposed metrics table with illustrative metric names, types, and label sets including duration histograms. Add an Example queries section with practical PromQL and LogQL patterns operators can use for alerting, debugging, and post-mortem analysis. Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 129 ++++++++++++++++-- 1 file changed, 121 insertions(+), 8 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index 75df1c626..b63f9a0cd 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -66,9 +66,10 @@ exporter-level metrics that a monitoring stack can scrape or receive. recording significant actions (for example *flash started*, *flash failed*, *image reference*) with typed fields, queryable in **Loki** alongside regular logs and distinct from higher-frequency debug output (see **DD-2**). 
-- **Exporter metrics** — Counters, histograms, and gauges (naming and labels - TBD) exposed from the exporter and optionally enriched by individual drivers - (for example storage operations per type). +- **Exporter metrics** — Counters (operations, bytes), histograms (operation + duration), and gauges (active sessions) exposed from the exporter and + enriched by individual drivers via the `driver` label (for example + `jumpstarter_operation_duration_seconds{driver="usbsdmux"}`). - **Jumpstarter Telemetry** (optional) — a dedicated component with a well-known ingest path and the same trust model (mTLS, ServiceAccount) as Controller/Router; @@ -344,8 +345,9 @@ are still useful for selection and for tools that only understand metadata. **Context:** The Telemetry process holds in-memory counters. Exporters send +1 (e.g. flash success), +N (bytes read/written), or +1 per reporting interval (e.g. one “inactive” minute for a lease with -labels `exporter`, `operation`, `result`). High-cardinality context -(`lease_id`, `client`, `trace_id`) is attached via exemplars, not labels. +labels `exporter`, `operation`, `result`, `driver`). High-cardinality +context (`lease_id`, `client`, `trace_id`) is attached via exemplars, not +labels. **Alternatives considered:** @@ -437,6 +439,9 @@ endpoints; this DD only governs the recommended dashboard experience. | `exporter` | yes | — | yes | yes | Bounded by cluster size. | | `operation` | yes | — | no | yes | Small fixed enum (flash, power, …). | | `result` | yes | — | no | yes | Small fixed enum (success, failure, …). | +| `driver` | yes | — | no | yes | Driver type (usbsdmux, dutlink, …); bounded (~30). | +| `error_type` | yes | — | no | yes | Failure class (timeout, device_error, …); on errors. | +| `direction` | yes | — | no | yes | tx / rx; for byte-counter and stream metrics only. 
| | `component` | no | — | yes | yes | Fixed set (controller, router, telemetry, exporter).| | `namespace` | no | — | yes | yes | K8s namespace; bounded. | | `lease_id` | **no** | yes | **no** | yes | Unbounded; exemplar for drill-down. | @@ -460,9 +465,11 @@ without inflating the label index or TSDB series count. Rules of thumb for this JEP: - **Prometheus labels**: each metric label dimension should have < 100 distinct - values per scrape target. The default label set for Jumpstarter metrics is - `{exporter, operation, result}` — all bounded enums. High-cardinality - context is carried via exemplars, not labels. + values per scrape target. The label set for Jumpstarter metrics is + `{exporter, operation, result, driver}` — all bounded enums. + `error_type` is added on failure-path metrics and `direction` on + byte-counter metrics. High-cardinality context is carried via exemplars, + not labels. - **Loki**: stream labels should be a small fixed set (`{component, exporter, namespace}`) to keep active stream count per tenant manageable (Grafana's guidance: < 100 k active streams). High-cardinality fields go inside the log @@ -502,6 +509,112 @@ Per-client analysis remains available via LogQL for operators who do not use exemplars: `sum by (client) (count_over_time({component="exporter"} | json | operation="flash" [5m]))`. +### Proposed metrics + +*Names are illustrative; final naming should follow +[Prometheus naming conventions](https://prometheus.io/docs/practices/naming/) +and be fixed before "Implemented".* + +| Metric name | Type | Labels | Description | +| -------------------------------------------- | --------- | -------------------------------------------- | ----------------------------------------- | +| `jumpstarter_operations_total` | counter | `exporter`, `operation`, `result`, `driver` | Total operations performed. | +| `jumpstarter_operation_duration_seconds` | histogram | `exporter`, `operation`, `result`, `driver` | Duration of each operation. 
| +| `jumpstarter_operation_errors_total` | counter | `exporter`, `operation`, `driver`, `error_type` | Errors by class (timeout, device, …). | +| `jumpstarter_stream_bytes_total` | counter | `exporter`, `driver`, `direction` | Bytes transferred (tx/rx) on streams. | +| `jumpstarter_active_sessions` | gauge | `exporter` | Currently active lease sessions. | +| `jumpstarter_lease_acquisitions_total` | counter | `result` | Lease acquire attempts (controller). | + +All counters and histograms carry exemplar keys (`client`, `lease_id`, +`trace_id`, and `spec.context` fields) on every observation. + +### Example queries + +#### PromQL (Prometheus) + +**Flash failure rate per exporter:** + +``` +sum by (exporter) (rate(jumpstarter_operations_total{operation="flash", result="failure"}[5m])) +/ +sum by (exporter) (rate(jumpstarter_operations_total{operation="flash"}[5m])) +``` + +**p95 flash duration per driver type:** + +``` +histogram_quantile(0.95, + sum by (driver, le) (rate(jumpstarter_operation_duration_seconds_bucket{operation="flash"}[5m])) +) +``` + +**Top 5 busiest exporters (all operations, 1 h window):** + +``` +topk(5, sum by (exporter) (rate(jumpstarter_operations_total[1h]))) +``` + +**Alert: exporter flash failure rate > 20% over 15 min:** + +``` +( + sum by (exporter) (rate(jumpstarter_operations_total{operation="flash", result="failure"}[15m])) + / + sum by (exporter) (rate(jumpstarter_operations_total{operation="flash"}[15m])) +) > 0.2 +``` + +**Error breakdown by class for a specific driver:** + +``` +sum by (error_type) (rate(jumpstarter_operation_errors_total{driver="usbsdmux"}[1h])) +``` + +**Bytes per second by exporter and direction:** + +``` +sum by (exporter, direction) (rate(jumpstarter_stream_bytes_total[5m])) +``` + +**HA Telemetry: aggregate across replicas (drop pod/instance):** + +``` +sum by (exporter, operation, result) (rate(jumpstarter_operations_total[5m])) +``` + +#### LogQL (Loki) + +**All flash events for a specific lease:** + +``` 
+{component="exporter"} | json | operation="flash" | lease_id="<LEASE_ID>"
+```
+
+**Flash failures per client over 5 min (log-based, no exemplars needed):**
+
+```
+sum by (client) (
+  count_over_time({component="exporter"} | json | operation="flash" | result="failure" [5m])
+)
+```
+
+**Controller logs for a specific lease (post-mortem):**
+
+```
+{component="controller"} | json | lease_id="<LEASE_ID>"
+```
+
+**Error events across all exporters in a namespace:**
+
+```
+{component="exporter", namespace="production"} | json | result="failure"
+```
+
+**Telemetry service health (its own operational logs):**
+
+```
+{component="telemetry"} | json | level="error"
+```
+
 ### Control-plane aggregation (Controller / Router / optional Telemetry)
 
 When this mode is enabled in a deployment:

From bb75049e4e7dc1864ebde7053bd235193ec82fff Mon Sep 17 00:00:00 2001
From: Miguel Angel Ajo Pelayo
Date: Mon, 27 Apr 2026 12:17:13 +0200
Subject: [PATCH 06/39] jep-0011: rename driver to driver_type, add predefined
 category set

Rename the `driver` label to `driver_type` and clarify that each driver
selects a category from a predefined set in jumpstarter core (storage,
power, network, serial, console, video) rather than using arbitrary
driver implementation names.

Also fix HA PromQL examples to include driver_type in sum by, and
clarify integration test wording around correlation fields.

Made-with: Cursor
---
 .../JEP-0011-observability-telemetry-logs.md  | 29 ++++++++++---------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md
index b63f9a0cd..56b4b4038 100644
--- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md
+++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md
@@ -68,8 +68,9 @@ exporter-level metrics that a monitoring stack can scrape or receive.
regular logs and distinct from higher-frequency debug output (see **DD-2**). - **Exporter metrics** — Counters (operations, bytes), histograms (operation duration), and gauges (active sessions) exposed from the exporter and - enriched by individual drivers via the `driver` label (for example - `jumpstarter_operation_duration_seconds{driver="usbsdmux"}`). + enriched by individual drivers via the `driver_type` label. Each driver + selects a category from a predefined set in jumpstarter core (e.g. + `storage`, `power`, `network`, `serial`, `console`, `video`). - **Jumpstarter Telemetry** (optional) — a dedicated component with a well-known ingest path and the same trust model (mTLS, ServiceAccount) as Controller/Router; @@ -345,7 +346,7 @@ are still useful for selection and for tools that only understand metadata. **Context:** The Telemetry process holds in-memory counters. Exporters send +1 (e.g. flash success), +N (bytes read/written), or +1 per reporting interval (e.g. one “inactive” minute for a lease with -labels `exporter`, `operation`, `result`, `driver`). High-cardinality +labels `exporter`, `operation`, `result`, `driver_type`). High-cardinality context (`lease_id`, `client`, `trace_id`) is attached via exemplars, not labels. @@ -357,7 +358,7 @@ labels. pod, which only advances its partial counters for the label sets it has seen. Prometheus scrapes all pods (or separate `PodMonitor` targets). In PromQL, - `sum by (exporter, operation, result) (…)` after dropping + `sum by (exporter, operation, result, driver_type) (…)` after dropping `pod` / `instance` matches the global total, as long as each real event is applied at most once in the system (counters are additive; increments are partitioned by traffic). @@ -439,7 +440,7 @@ endpoints; this DD only governs the recommended dashboard experience. | `exporter` | yes | — | yes | yes | Bounded by cluster size. | | `operation` | yes | — | no | yes | Small fixed enum (flash, power, …). 
| | `result` | yes | — | no | yes | Small fixed enum (success, failure, …). | -| `driver` | yes | — | no | yes | Driver type (usbsdmux, dutlink, …); bounded (~30). | +| `driver_type` | yes | — | no | yes | Category from a predefined set in core (storage, power, …). | | `error_type` | yes | — | no | yes | Failure class (timeout, device_error, …); on errors. | | `direction` | yes | — | no | yes | tx / rx; for byte-counter and stream metrics only. | | `component` | no | — | yes | yes | Fixed set (controller, router, telemetry, exporter).| @@ -466,7 +467,7 @@ Rules of thumb for this JEP: - **Prometheus labels**: each metric label dimension should have < 100 distinct values per scrape target. The label set for Jumpstarter metrics is - `{exporter, operation, result, driver}` — all bounded enums. + `{exporter, operation, result, driver_type}` — all bounded enums. `error_type` is added on failure-path metrics and `direction` on byte-counter metrics. High-cardinality context is carried via exemplars, not labels. @@ -517,10 +518,10 @@ and be fixed before "Implemented".* | Metric name | Type | Labels | Description | | -------------------------------------------- | --------- | -------------------------------------------- | ----------------------------------------- | -| `jumpstarter_operations_total` | counter | `exporter`, `operation`, `result`, `driver` | Total operations performed. | -| `jumpstarter_operation_duration_seconds` | histogram | `exporter`, `operation`, `result`, `driver` | Duration of each operation. | -| `jumpstarter_operation_errors_total` | counter | `exporter`, `operation`, `driver`, `error_type` | Errors by class (timeout, device, …). | -| `jumpstarter_stream_bytes_total` | counter | `exporter`, `driver`, `direction` | Bytes transferred (tx/rx) on streams. | +| `jumpstarter_operations_total` | counter | `exporter`, `operation`, `result`, `driver_type` | Total operations performed. 
| +| `jumpstarter_operation_duration_seconds` | histogram | `exporter`, `operation`, `result`, `driver_type` | Duration of each operation. | +| `jumpstarter_operation_errors_total` | counter | `exporter`, `operation`, `driver_type`, `error_type` | Errors by class (timeout, device, …). | +| `jumpstarter_stream_bytes_total` | counter | `exporter`, `driver_type`, `direction` | Bytes transferred (tx/rx) on streams. | | `jumpstarter_active_sessions` | gauge | `exporter` | Currently active lease sessions. | | `jumpstarter_lease_acquisitions_total` | counter | `result` | Lease acquire attempts (controller). | @@ -543,7 +544,7 @@ sum by (exporter) (rate(jumpstarter_operations_total{operation="flash"}[5m])) ``` histogram_quantile(0.95, - sum by (driver, le) (rate(jumpstarter_operation_duration_seconds_bucket{operation="flash"}[5m])) + sum by (driver_type, le) (rate(jumpstarter_operation_duration_seconds_bucket{operation="flash"}[5m])) ) ``` @@ -566,7 +567,7 @@ topk(5, sum by (exporter) (rate(jumpstarter_operations_total[1h]))) **Error breakdown by class for a specific driver:** ``` -sum by (error_type) (rate(jumpstarter_operation_errors_total{driver="usbsdmux"}[1h])) +sum by (error_type) (rate(jumpstarter_operation_errors_total{driver_type="storage"}[1h])) ``` **Bytes per second by exporter and direction:** @@ -578,7 +579,7 @@ sum by (exporter, direction) (rate(jumpstarter_stream_bytes_total[5m])) **HA Telemetry: aggregate across replicas (drop pod/instance):** ``` -sum by (exporter, operation, result) (rate(jumpstarter_operations_total[5m])) +sum by (exporter, operation, result, driver_type) (rate(jumpstarter_operations_total[5m])) ``` #### LogQL (Loki) @@ -767,7 +768,7 @@ on the OTel SDK in application code. documented set of series after a known operation. 
- If the control-plane forward path is implemented: with a test Loki and a Prometheus-compatible sink (or mock), assert that records arrive with expected - `lease` / `exporter` labels and that exporter pods do not require + correlation fields (`lease_id`, `exporter`, …) and that exporter pods do not require Loki or cluster-scrape credentials in their spec. - If Telemetry runs with >1 replica: one test verifies that `sum` by business labels (dropping `pod`/`instance`) matches expected totals after partitioned increments (see **DD-8**). From 77877aa74e7dc01c1a8a9d6242c216899440f543 Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Mon, 27 Apr 2026 12:55:36 +0200 Subject: [PATCH 07/39] jep-0011: add fence languages to PromQL/LogQL code blocks Use `promql` for PromQL examples (native Pygments lexer) and `text` for LogQL examples (no Pygments logql lexer) to clear Sphinx warnings. Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 24 +++++++++---------- 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index 56b4b4038..d2b9f8c1a 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -534,7 +534,7 @@ All counters and histograms carry exemplar keys (`client`, `lease_id`, **Flash failure rate per exporter:** -``` +```promql sum by (exporter) (rate(jumpstarter_operations_total{operation="flash", result="failure"}[5m])) / sum by (exporter) (rate(jumpstarter_operations_total{operation="flash"}[5m])) @@ -542,7 +542,7 @@ sum by (exporter) (rate(jumpstarter_operations_total{operation="flash"}[5m])) **p95 flash duration per driver type:** -``` +```promql histogram_quantile(0.95, sum by (driver_type, le) (rate(jumpstarter_operation_duration_seconds_bucket{operation="flash"}[5m])) 
) @@ -550,13 +550,13 @@ histogram_quantile(0.95, **Top 5 busiest exporters (all operations, 1 h window):** -``` +```promql topk(5, sum by (exporter) (rate(jumpstarter_operations_total[1h]))) ``` **Alert: exporter flash failure rate > 20% over 15 min:** -``` +```promql ( sum by (exporter) (rate(jumpstarter_operations_total{operation="flash", result="failure"}[15m])) / @@ -566,19 +566,19 @@ topk(5, sum by (exporter) (rate(jumpstarter_operations_total[1h]))) **Error breakdown by class for a specific driver:** -``` +```promql sum by (error_type) (rate(jumpstarter_operation_errors_total{driver_type="storage"}[1h])) ``` **Bytes per second by exporter and direction:** -``` +```promql sum by (exporter, direction) (rate(jumpstarter_stream_bytes_total[5m])) ``` **HA Telemetry: aggregate across replicas (drop pod/instance):** -``` +```promql sum by (exporter, operation, result, driver_type) (rate(jumpstarter_operations_total[5m])) ``` @@ -586,13 +586,13 @@ sum by (exporter, operation, result, driver_type) (rate(jumpstarter_operations_t **All flash events for a specific lease:** -``` +```text {component="exporter"} | json | operation="flash" | lease_id="" ``` **Flash failures per client over 5 min (log-based, no exemplars needed):** -``` +```text sum by (client) ( count_over_time({component="exporter"} | json | operation="flash" | result="failure" [5m]) ) @@ -600,19 +600,19 @@ sum by (client) ( **Controller logs for a specific lease (post-mortem):** -``` +```text {component="controller"} | json | lease_id="" ``` **Error events across all exporters in a namespace:** -``` +```text {component="exporter", namespace="production"} | json | result="failure" ``` **Telemetry service health (its own operational logs):** -``` +```text {component="telemetry"} | json | level="error" ``` From f5b20c4868a3d8fee6e45c2c65bb9f4d34d0b1bc Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Tue, 28 Apr 2026 10:48:43 +0200 Subject: [PATCH 08/39] jep-0011: add implementation phases table 
after Abstract Addresses raballew's request to clearly define what lands in which phase. Five phases: structured logging, metrics endpoints, telemetry service, in-cluster log scraping, dashboards + alerting. Made-with: Cursor --- .../jeps/JEP-0011-observability-telemetry-logs.md | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index d2b9f8c1a..ecb3803f8 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -28,6 +28,21 @@ that edge processes never need Loki or cluster-scrape credentials. Implementation is expected to land in phases; this JEP describes the end state and compatibility rules. +### Phases + +| Phase | Scope | Key deliverables | +| ----- | ----- | ---------------- | +| 1 | Structured logging + lease context | `spec.context` CRD field; JSON structured logs for all long-running services; correlation fields (`lease_id`, `exporter`, `operation`, `result`) in every log line. | +| 2 | Metrics endpoints | `/metrics` scrape endpoints on Controller and Router; exporter counter/histogram/gauge metrics with `driver_type`; Prometheus exemplars for high-cardinality context. | +| 3 | Telemetry service | Optional `jumpstarter-telemetry` Deployment managed by the operator; exporter and client data aggregation; Loki push for edge-originated logs and events. | +| 4 | In-cluster log scraping | Operator configures log shipper integration (Promtail, Grafana Alloy, Vector) for Controller/Router pod logs; `ServiceMonitor` CRDs for Prometheus autodiscovery. | +| 5 | Dashboards + alerting | Perses CRD dashboards; starter alert rules; documentation and operator integration. | + +Each phase is independently useful and builds on the previous ones. 
+Phase 1 can ship without any later phase; operators who only need +structured logs benefit immediately. Phase 2 adds scrape-ready metrics +without requiring the Telemetry service. + ## Motivation Today, operators and CI maintainers need to answer questions that raw Kubernetes From 6d670b5a2e2c23a69906f054130a9e3237a5141f Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Tue, 28 Apr 2026 10:50:31 +0200 Subject: [PATCH 09/39] jep-0011: add AI agent persona user story Addresses raballew's request to include at least one non-human persona in the user stories section. Made-with: Cursor --- .../internal/jeps/JEP-0011-observability-telemetry-logs.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index ecb3803f8..2b5025789 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -69,6 +69,10 @@ exporter-level metrics that a monitoring stack can scrape or receive. - **As a** platform engineer, **I want** exporter processes to send telemetry without holding Loki or Prometheus credentials, **so that** I do not have to distribute and rotate secrets on every lab machine. +- **As an** AI agent orchestrating CI, **I want** machine-readable structured + logs and metric exemplars with lease context, **so that** I can + programmatically identify failing exporters and correlate test results + without parsing free-form text. 
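
To make the machine-readable path concrete, here is a minimal sketch of the
kind of JSONL line Phase 1 produces and how a consumer filters it without
free-form text parsing. Field names follow the correlation set used throughout
this JEP (`component`, `exporter`, `lease_id`, `operation`, `result`); the
values and exact schema are illustrative, not final.

```python
import json

# Illustrative Phase-1 log line: one JSON object per line (JSONL), carrying
# the proposed correlation fields. All values here are made up.
line = json.dumps({
    "ts": "2026-04-28T10:15:30.123Z",
    "level": "info",
    "msg": "flash completed",
    "component": "exporter",
    "exporter": "lab-01",
    "lease_id": "abc123",
    "operation": "flash",
    "result": "success",
})

# A CI tool or agent filters by structured fields instead of regexes:
record = json.loads(line)
is_flash_failure = (
    record.get("operation") == "flash" and record.get("result") == "failure"
)
print(record["lease_id"], is_flash_failure)  # abc123 False
```

In Loki terms, `component` and `exporter` would become stream labels, while
`lease_id`, `operation`, and `result` stay inside the line as `| json`-parsed
fields (see *Cardinality guidelines*).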
## Proposal From 79e105eb3934346ed63b37e9a95a91b781feeb4e Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Tue, 28 Apr 2026 10:51:45 +0200 Subject: [PATCH 10/39] jep-0011: clarify tracing scope is correlation only Addresses bkhizgiy's comment: explicitly state that this JEP covers correlation (lease_id, trace_id as log/exemplar fields) and that full distributed tracing is deferred to a future JEP. Made-with: Cursor --- .../internal/jeps/JEP-0011-observability-telemetry-logs.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index 2b5025789..0620d5475 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -127,6 +127,12 @@ exporter-level metrics that a monitoring stack can scrape or receive. identifiers in metadata; must remain backward compatible for existing clients (unknown metadata ignored by older servers). +**Tracing scope:** This JEP covers *correlation only* — `lease_id`, `trace_id`, +and `span_id` are propagated as log fields and Prometheus exemplar keys so that +metrics, logs, and (future) traces can be joined. Full distributed tracing +(span creation, sampling policies, trace storage and visualization) is deferred +to a future JEP. + ### Hardware Considerations - No hardware considerations. From feb702ed32a12ac2cd1a6fb32986458874fbbc70 Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Tue, 28 Apr 2026 10:57:51 +0200 Subject: [PATCH 11/39] jep-0011: make DD-1 decision statement definitive Replace "leans toward" with a firm decision for spec.context as a typed map. Addresses raballew's feedback that DDs should state clear choices. 
Made-with: Cursor --- .../internal/jeps/JEP-0011-observability-telemetry-logs.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index 0620d5475..329f7a86d 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -163,8 +163,9 @@ identity without re-typing it on every line. hard for operators to audit; no stable object-level link to per-lease metrics and server logs in the cluster. -**Decision:** the JEP leans toward **(2) for first-class, validated context** -with **(1) allowed for integration with generic tooling**, pending contributor consensus. +**Decision:** **(2)** — a typed `spec.context` map under the Lease CRD for +first-class, validated context. **(1)** (labels/annotations) remains allowed +for integration with generic tooling that only understands Kubernetes metadata. **Rationale:** Typed fields make validation and documentation clear; labels are still useful for selection and for tools that only understand metadata. From 68f5ddcfa8ea2b32c1e7acfb57cc54590322d68d Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Tue, 28 Apr 2026 11:34:51 +0200 Subject: [PATCH 12/39] jep-0011: expand DD-3 exemplar trade-offs and details Add wire format example, 128-char size limit with OpenMetrics 1.0/2.0 spec links, sampling behavior, library support, infrastructure requirements (Prometheus >= 2.26, Grafana >= 7.4), and note that Perses does not yet support exemplar rendering. 
Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 46 ++++++++++++++++++- 1 file changed, 45 insertions(+), 1 deletion(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index 329f7a86d..ec211e1c9 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -165,7 +165,8 @@ identity without re-typing it on every line. **Decision:** **(2)** — a typed `spec.context` map under the Lease CRD for first-class, validated context. **(1)** (labels/annotations) remains allowed -for integration with generic tooling that only understands Kubernetes metadata. +for integration with generic tooling that only understands Kubernetes metadata +or benefits from lease label filtering. **Rationale:** Typed fields make validation and documentation clear; labels are still useful for selection and for tools that only understand metadata. @@ -235,6 +236,49 @@ are still useful for selection and for tools that only understand metadata. additional infrastructure. See **DD-6** (no OTel), **DD-7** (Telemetry Deployment), **DD-8** (HA replicas). +**Exemplar trade-offs and details:** + +- **Wire format.** On the OpenMetrics `/metrics` endpoint an exemplar is + appended after the sample value: + + ```text + jumpstarter_operations_total{exporter="lab-01",operation="flash",result="success"} 42 # {client="ci-bot",lease_id="abc123",trace_id="def456"} 1.0 1625000000.000 + ``` + + The `# {key=value,...} value timestamp` suffix is the exemplar. Grafana + (≥ 7.4) renders these as clickable dots on metric panels; clicking a dot + reveals the attached keys and can link to a Loki log query (filtered by + `lease_id`) or a trace view (filtered by `trace_id`). 
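
  For histograms, exemplars attach to individual `_bucket` samples. A sketch
  of what a scrape of `jumpstarter_operation_duration_seconds` might then
  contain — label set as in the counter example above, all values purely
  illustrative:

  ```text
  # TYPE jumpstarter_operation_duration_seconds histogram
  jumpstarter_operation_duration_seconds_bucket{exporter="lab-01",operation="flash",result="success",le="30.0"} 17 # {lease_id="abc123",trace_id="def456"} 24.7 1625000000.000
  jumpstarter_operation_duration_seconds_bucket{exporter="lab-01",operation="flash",result="success",le="+Inf"} 19
  jumpstarter_operation_duration_seconds_count{exporter="lab-01",operation="flash",result="success"} 19
  jumpstarter_operation_duration_seconds_sum{exporter="lab-01",operation="flash",result="success"} 412.5
  ```

  The exemplar on the `le="30.0"` bucket records one observation (24.7 s)
  together with the lease and trace that produced it — the drill-down hook a
  compatible UI can turn into a Loki or trace query.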
+ +- **Size limit.** The [OpenMetrics 1.0 spec](https://prometheus.io/docs/specs/om/open_metrics_spec) + imposes a **128 UTF-8 character** limit on the combined length of + exemplar label names and values per exemplar. + [OpenMetrics 2.0](https://github.com/prometheus/docs/blob/main/docs/specs/om/open_metrics_spec_2_0.md) + (experimental, 2026) relaxes this to a soft cap measured in bytes. + The exemplar key budget is discussed further in *Exemplars for + high-cardinality context*. + +- **Sampling.** Client libraries rate-limit exemplar updates internally; + the last-seen exemplar per series is served on each scrape, not one + per data point. For the Jumpstarter use case this is sufficient: + the most recent `lease_id` / `trace_id` on a counter is the value + operators need when investigating a spike. + +- **Library support.** Go client support is mature + (`prometheus/client_golang` ≥ 1.16). The Python ecosystem is less + complete but not required for this JEP since metrics are exposed from + Go services. + +- **Infrastructure requirements.** Prometheus ≥ 2.26 with + `--enable-feature=exemplar-storage` and + `--storage.tsdb.max-exemplars` (e.g. 100 000). Grafana ≥ 7.4 for + exemplar visualization. Perses does not yet support exemplar + rendering; until it does, operators who want exemplar click-through + can use Grafana alongside Perses or wait for upstream support. + + These limitations are acceptable for the correlation use case this JEP + targets. + ### DD-4: Log format for services vs CLI **Alternatives considered:** From 17f88931be23e431c4ee4fae4285c9b8630fd385 Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Tue, 28 Apr 2026 15:04:45 +0200 Subject: [PATCH 13/39] jep-0011: DD-4 JSONL format spec, fix jmp contradiction, rename client to CRD name - Add JSONL format specification with base fields table including Loki stream label designation and namespace injection note. - Fix jmp log push to be definitive (always when Telemetry available). 
- Fix "metrics endpoint" to "Telemetry ingest endpoint". - Use `client` (CRD name) consistently instead of `client_id`, matching Prometheus naming conventions (names without _id suffix). - Add cli to component enum in correlation table. - Add spec.context keys and client field to log fields table. - Add stream cardinality worked example explaining why client must not be a Loki stream label. Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 67 ++++++++++++++++--- 1 file changed, 57 insertions(+), 10 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index ec211e1c9..bbf109f61 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -113,8 +113,9 @@ exporter-level metrics that a monitoring stack can scrape or receive. per-exporter Loki and metrics secrets. The same path can carry operator-chosen structured log lines and events (not unbounded default client chatter — see *Control-plane aggregation* below). -- The `jmp` CLI logs remain readable, but also submits logs through the jumpstarter telemetry - endpoint, in machine parseable format for loki ingest. +- The `jmp` CLI output remains human-readable, but when a Telemetry + endpoint is available, `jmp` also pushes structured JSON logs to the + Jumpstarter Telemetry service for Loki ingest. ### API / Protocol Changes @@ -284,9 +285,9 @@ are still useful for selection and for tools that only understand metadata. **Alternatives considered:** 1. **JSON always** for every process — best for machines; hard for humans. -2. **Human text default for `jmp`**, **JSON for long-running services** and an - optional cli push via the metrics endpoint in JSON format (in addition to the - human friendly output) +2. 
**Human text default for `jmp`**, **JSON for long-running services** and a + CLI push via the Telemetry ingest endpoint in JSON format (in addition to the + human-friendly output) 3. **Single format** with a pretty-printer in front of developers — more moving parts. @@ -307,6 +308,52 @@ are still useful for selection and for tools that only understand metadata. The Telemetry service retains a direct Loki-push because it is an isolated workload (**DD-7**) whose core job is Loki ingest. +**Format:** JSONL (one JSON object per line), produced by setting + `--zap-encoder=json` on the existing `controller-runtime` / Zap logger + (no changes to log call sites — existing `logr` structured fields become + JSON keys automatically). The `ts`, `level`, and `msg` fields follow + Zap's default JSON encoder output; application code adds domain fields + via the standard `logr` `WithValues` / `Info` / `Error` API. + + Base fields present in every log line: + +| Field | Format | Loki label | Description | +| ------------- | ------------------------------------------------------------------- | :--------: | ----------------------------------------- | +| `ts` | ISO-8601 (`2026-04-28T10:15:30.123Z`) | no | Timestamp (Zap default). | +| `level` | Lower-case string (`debug`, `info`, `warn`, `error`) | no | Log severity (Zap default). | +| `msg` | Free-form string | no | Human-readable message (Zap default). | +| `component` | Fixed enum (`cli`, `controller`, `router`, `telemetry`, `exporter`) | **yes** | Emitting service. | +| `exporter` | CRD name (when applicable) | **yes** | Exporter CRD name; bounded by cluster size.| +| `lease_id` | UID string (when applicable) | no | Lease UID (high cardinality). | +| `operation` | String (when applicable) | no | Operation name (flash, power, …). | +| `result` | String (when applicable) | no | Outcome (success, failure, …). | +| `driver_type` | Category from predefined set (when applicable) | no | Driver category (storage, power, …). 
| +| `client` | CRD name (when applicable) | no | Client CRD name (high cardinality). | +| *`spec.context` keys* | User-defined strings (during active lease) | no | All `lease.spec.context` entries (e.g. `build_id`, `image_digest`, VCS ref) added as JSON fields. High cardinality, never stream labels. | + + `namespace` is **not** emitted by the application. Log shippers + (Promtail, Grafana Alloy, Vector) automatically inject `namespace` + (and `pod`, `container`) from Kubernetes pod metadata via service + discovery, so it is available as a Loki stream label without + application-level awareness. + + Fields marked as **Loki stream labels** are extracted by the log shipper + and used as indexed stream selectors. They must be low-cardinality to + keep the active stream count manageable (Grafana recommends < 100 k + active streams per tenant). With the labels above, a deployment with + 200 exporters across 5 namespaces produces roughly 1 000 streams — + well within budget. High-cardinality fields like `client` or + `lease_id` must stay in the JSON body: promoting `client` to a + stream label in a 1 000-client, 200-exporter cluster would create + up to 1 000 000 streams, overwhelming the Loki ingester. These fields + are instead queried with `| json | client="value"` filter + expressions after selecting the relevant streams. + + Multi-line content (e.g. stack traces) is embedded as an escaped string + within the JSON value (typically in a `stacktrace` or `error` field), + never as bare multi-line text, so each physical line is always one + complete JSON object. + ### DD-5: Where Loki and Prometheus (or remote-write) credentials live **Alternatives considered:** @@ -507,16 +554,16 @@ endpoints; this DD only governs the recommended dashboard experience. 
| Field / label | Prom label | Prom exemplar | Loki stream | Log line | Notes | | -------------------------------- | :--------: | :-----------: | :---------: | :------: | --------------------------------------------------- | -| `exporter` | yes | — | yes | yes | Bounded by cluster size. | +| `exporter` | yes | — | yes | yes | CRD name; bounded by cluster size. | | `operation` | yes | — | no | yes | Small fixed enum (flash, power, …). | | `result` | yes | — | no | yes | Small fixed enum (success, failure, …). | -| `driver_type` | yes | — | no | yes | Category from a predefined set in core (storage, power, …). | +| `driver_type` | yes | — | no | yes | Category from a predefined set in core (storage, power, …). | | `error_type` | yes | — | no | yes | Failure class (timeout, device_error, …); on errors. | | `direction` | yes | — | no | yes | tx / rx; for byte-counter and stream metrics only. | -| `component` | no | — | yes | yes | Fixed set (controller, router, telemetry, exporter).| +| `component` | no | — | yes | yes | Fixed set (cli, controller, router, telemetry, exporter).| | `namespace` | no | — | yes | yes | K8s namespace; bounded. | | `lease_id` | **no** | yes | **no** | yes | Unbounded; exemplar for drill-down. | -| `client` | **no** | yes | **no** | yes | Exemplar; avoid PII. | +| `client` | **no** | yes | **no** | yes | CRD name; exemplar for client identity. | | `image_digest`, `build_id`, etc. | **no** | yes | **no** | yes | From `spec.context`; always included. | | `trace_id` / `span_id` | **no** | yes | **no** | yes | W3C; links metrics to traces via exemplars. | @@ -561,7 +608,7 @@ Default exemplar keys emitted on every counter/histogram observation: | Key | Source | Purpose | | --- | ------ | ------- | -| `client` | Lease or session identity | "Which client caused this spike?" | +| `client` | Client CRD name | "Which client caused this spike?" | | `lease_id` | Lease UID | Correlate a metric sample with lease logs. 
| | `trace_id` | W3C `traceparent` | Click-through from metric to trace. | From ab14058b99e3afa35af7d142736f264bdec7a401 Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Wed, 29 Apr 2026 09:20:37 +0200 Subject: [PATCH 14/39] jep-0011: DD-6 acknowledge complexity tradeoff, trust domain, OTLP future Explain why purpose-built Telemetry complexity is preferable to OTel Collector complexity. Highlight the trust domain advantage (mTLS identity validation prevents impersonation). Note OTLP ingest as a plausible future extension. Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 21 +++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index bbf109f61..f0064a416 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -433,6 +433,27 @@ are still useful for selection and for tools that only understand metadata. *optional product territory*; this JEP optimizes for low ceremony and direct integration. +The proposed Jumpstarter Telemetry service (**DD-7**) is itself a +non-trivial component (metric aggregation, Loki forwarding, multi-replica +HA). The distinction is that it is *purpose-built* for Jumpstarter's +narrow scope: a single Go binary with a single config surface, no +separate version matrix, and no generic pipeline DSL to learn. An OTel +Collector serves many use cases but requires operator familiarity with +its configuration model, receivers, processors, and exporters — overhead +that is not justified when the data paths are known in advance. +Additionally, the Telemetry service operates inside Jumpstarter's +existing authentication and trust domain (mTLS, registered client and +exporter identities). 
It can validate that an incoming increment +actually originates from the claimed exporter or client — preventing +impersonation or label injection — without requiring a separate +auth layer. A generic OTel Collector has no awareness of Jumpstarter +identities and would need external policy to achieve the same guarantee. + +**Future extension:** the Telemetry service's ingest endpoint could +accept OTLP in a future iteration, enabling operators who run OTel +Collectors on exporter hosts (e.g. for host-level stats) to route data +through the same trust boundary without a second credential set. This +is additive and does not require adopting OTel as a project dependency. ### DD-7: Optional Jumpstarter Telemetry service (dedicated Deployment vs. Controller/Router only) From ddc2fddf959e93118d436de042889449fa1abfe4 Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Wed, 29 Apr 2026 09:28:07 +0200 Subject: [PATCH 15/39] jep-0011: DD-7 add failure modes, identity enforcement, memory budget Address raballew's feedback on missing failure mode design, telemetry auth, and unbounded memory growth. Add identity enforcement via mTLS, failure modes table (unavailability, restart, Loki/Prometheus issues), health endpoints, and memory budget estimate with TTL eviction. Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 29 +++++++++++++++++++ 1 file changed, 29 insertions(+) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index f0064a416..315b0fefc 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -479,6 +479,35 @@ is additive and does not require adopting OTel as a project dependency. Loki spikes and ingest load cannot starve lease reconciliation in the controller by moving it to a separate service. 
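The isolation argument above implies that the exporter-side ingest call must be strictly best-effort: losing an increment is acceptable, blocking a device operation is not. A minimal sketch of such a fire-and-forget increment with bounded retry (class and parameter names are hypothetical; the real transport would be a gRPC stub inside Jumpstarter's existing mTLS session):

```python
import time

class TelemetryClient:
    """Hypothetical fire-and-forget ingest client: telemetry failures
    must never block or fail the device operation that produced them."""

    def __init__(self, send, max_attempts=2, base_backoff=0.1):
        self._send = send              # transport callable, e.g. a gRPC stub wrapper
        self._max_attempts = max_attempts
        self._base_backoff = base_backoff

    def increment(self, metric, labels, value=1):
        # Bounded retry with exponential backoff; drop on exhaustion.
        for attempt in range(self._max_attempts):
            try:
                self._send(metric, labels, value)
                return True
            except ConnectionError:
                time.sleep(self._base_backoff * (2 ** attempt))
        return False  # increment lost; the operation itself is unaffected
```

The return value is informational only; callers in the operation hot path ignore it, which is what keeps a Telemetry outage invisible to lease work.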
+**Identity enforcement:** The Telemetry service validates the source + identity of every ingest RPC from the mTLS certificate or + ServiceAccount token. The `exporter` and `client` labels on incoming + increments are enforced server-side to match the authenticated + identity — a compromised or misconfigured exporter cannot submit + metrics under another exporter's name or inject arbitrary labels. + +**Failure modes:** + +| Scenario | Behavior | +| ------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Telemetry service unavailable | Exporters and clients treat telemetry RPCs as fire-and-forget with bounded retry (e.g. 1–2 attempts, exponential backoff). Metrics increments are lost; device operations are unaffected. | +| Telemetry pod restart | In-memory counters reset to zero. This is standard Prometheus counter semantics — `rate()` and `increase()` handle resets transparently. | +| Loki unreachable | The Telemetry service buffers log entries in a bounded queue (see *Backpressure* in the control-plane section). On overflow, entries are dropped and `jumpstarter_telemetry_dropped_total` incremented. | +| Prometheus scrape fails | No data loss — counters remain in memory; the next successful scrape picks up the current values. | + + The Telemetry service exposes `/healthz` (liveness) and `/readyz` + (readiness, gated on Loki and Prometheus reachability) endpoints for + Kubernetes probes. + +**Memory budget:** Each in-memory Prometheus series costs roughly + 200–300 bytes (labels + counter/histogram state). The bounded label + set `{exporter, operation, result, driver_type}` caps total series: + with 200 exporters × 6 operations × 2 results × 6 driver types = + 14 400 series, costing ~3–4 MB. Adding `error_type` and `direction` + on their respective metrics adds a small multiple. 
Series that receive + no updates for a configurable TTL (default: 10 min) are eligible for + eviction to prevent stale-exporter accumulation after scale-down. + ### DD-8: Multiple Telemetry replicas (HA) and addable counters **Context:** The Telemetry process holds in-memory counters. Exporters send From 06b287347c3bed4ab22745a4f0d189a4c525371f Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Wed, 29 Apr 2026 09:36:42 +0200 Subject: [PATCH 16/39] jep-0011: exemplar size budget, CRD validation, trace_id conditional MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Document the 128-char OpenMetrics exemplar limit with a priority strategy (client + lease_id always first, trace_id only when present). Add CRD-level validation for spec.context (key ≤ 32, value ≤ 64, max 8 entries). trace_id is not synthesized — only included when an external caller propagates traceparent. Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 46 ++++++++++++++----- 1 file changed, 34 insertions(+), 12 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index 315b0fefc..26dd7d0b3 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -233,8 +233,8 @@ are still useful for selection and for tools that only understand metadata. `ServiceMonitor`; it avoids app-side remote-write credentials and complexity in Jumpstarter. The OpenMetrics exposition format used by the scrape path natively carries exemplars, enabling high-cardinality - context (`client`, `lease_id`, `trace_id`) on individual samples without - additional infrastructure. See **DD-6** (no OTel), **DD-7** (Telemetry + context (`client`, `lease_id`, and `trace_id` when present) on individual + samples without additional infrastructure. 
See **DD-6** (no OTel), **DD-7** (Telemetry Deployment), **DD-8** (HA replicas). **Exemplar trade-offs and details:** @@ -243,7 +243,7 @@ are still useful for selection and for tools that only understand metadata. appended after the sample value: ```text - jumpstarter_operations_total{exporter="lab-01",operation="flash",result="success"} 42 # {client="ci-bot",lease_id="abc123",trace_id="def456"} 1.0 1625000000.000 + jumpstarter_operations_total{exporter="lab-01",operation="flash",result="success"} 42 # {client="ci-bot",lease_id="abc123",build_id="nightly-42"} 1.0 1625000000.000 ``` The `# {key=value,...} value timestamp` suffix is the exemplar. Grafana @@ -499,13 +499,13 @@ is additive and does not require adopting OTel as a project dependency. (readiness, gated on Loki and Prometheus reachability) endpoints for Kubernetes probes. -**Memory budget:** Each in-memory Prometheus series costs roughly +**Memory budget:** Each in-memory Prometheus series is expected to cost around 200–300 bytes (labels + counter/histogram state). The bounded label set `{exporter, operation, result, driver_type}` caps total series: with 200 exporters × 6 operations × 2 results × 6 driver types = 14 400 series, costing ~3–4 MB. Adding `error_type` and `direction` on their respective metrics adds a small multiple. Series that receive - no updates for a configurable TTL (default: 10 min) are eligible for + no updates for a configurable TTL (i.e. a default: 10 min) are eligible for eviction to prevent stale-exporter accumulation after scale-down. ### DD-8: Multiple Telemetry replicas (HA) and addable counters @@ -651,22 +651,44 @@ Rules of thumb for this JEP: Prometheus exemplars attach arbitrary key-value pairs to individual counter increments and histogram observations without creating new time series. 
This is the primary mechanism this JEP uses to surface per-request context -(`client`, `lease_id`, `trace_id`) on metrics while keeping series cardinality +(`client`, `lease_id`, and `trace_id` when present) on metrics while keeping series cardinality flat. Default exemplar keys emitted on every counter/histogram observation: -| Key | Source | Purpose | -| --- | ------ | ------- | -| `client` | Client CRD name | "Which client caused this spike?" | -| `lease_id` | Lease UID | Correlate a metric sample with lease logs. | -| `trace_id` | W3C `traceparent` | Click-through from metric to trace. | +| Key | Source | Purpose | +| ---------- | --------------------- | ----------------------------------------------- | +| `client` | Client CRD name | "Which client caused this spike?" | +| `lease_id` | Lease UID | Correlate a metric sample with lease logs. | +| `trace_id` | W3C `traceparent` | Included **only when present** in gRPC metadata.| + +`trace_id` is not synthesized by Jumpstarter — it is included only when +an external caller (CI pipeline, user code) propagates a `traceparent`. +Full distributed tracing (spans, storage, visualization) is deferred to +a future JEP; when it lands, `trace_id` becomes a default key. Until +then, omitting it saves ~45 characters of exemplar budget. All `spec.context` keys (e.g. `build_id`, `image_digest`) are automatically included as exemplar keys. Because exemplars are per-observation metadata — not label dimensions — they have zero impact on series cardinality regardless of how many distinct values appear. +**Exemplar size budget:** The OpenMetrics 1.0 limit is 128 UTF-8 +characters for the combined key-value pairs in a single exemplar. +The two default keys (`client`, `lease_id`) consume roughly 30–50 +characters, leaving ~80–100 characters for `spec.context` entries +(or more when `trace_id` is absent). To stay within budget: + +1. Default keys (`client`, `lease_id`) are always included first. 
+ `trace_id` is added when present in the request context. +2. `spec.context` keys are added in alphabetical order until the 128-char + limit is reached; remaining keys are silently dropped from the + exemplar (they remain available in structured log lines). +3. The `Lease` CRD validates `spec.context` at admission time: key names + are limited to 32 characters, values to 64 characters, and the total + number of entries to 8. This prevents accidental budget exhaustion and + ensures exemplar truncation is rare in practice. + **Dashboard visualization**: when exemplars are enabled on a Prometheus data source, metric panels render clickable dots on each sample that carries exemplar data. Clicking a dot reveals the attached keys and can link to @@ -693,7 +715,7 @@ and be fixed before "Implemented".* | `jumpstarter_lease_acquisitions_total` | counter | `result` | Lease acquire attempts (controller). | All counters and histograms carry exemplar keys (`client`, `lease_id`, -`trace_id`, and `spec.context` fields) on every observation. +`trace_id` when present, and `spec.context` fields) on every observation. ### Example queries From f06052b53d0f20a24c8d5999766704bfe324aa7e Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Wed, 29 Apr 2026 09:39:02 +0200 Subject: [PATCH 17/39] jep-0011: clarify composite driver_type semantics Sub-drivers emit their own category; top-level composite driver methods emit driver_type="composite". Addresses raballew's question about Renode/QEMU multi-category drivers. 
Made-with: Cursor --- .../internal/jeps/JEP-0011-observability-telemetry-logs.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index 26dd7d0b3..59d738e4d 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -90,6 +90,13 @@ exporter-level metrics that a monitoring stack can scrape or receive. enriched by individual drivers via the `driver_type` label. Each driver selects a category from a predefined set in jumpstarter core (e.g. `storage`, `power`, `network`, `serial`, `console`, `video`). + Composite drivers (e.g. Renode, QEMU) that bundle multiple sub-drivers + do not emit a single top-level category for delegated work. Instead, + each sub-driver emits its own `driver_type` when it performs an + operation — a Renode storage sub-driver emits `driver_type="storage"`, + its power sub-driver emits `driver_type="power"`, and so on. Any + top-level methods on the composite driver itself (e.g. VM lifecycle) + emit `driver_type="composite"`. - **Jumpstarter Telemetry** (optional) — a dedicated component with a well-known ingest path and the same trust model (mTLS, ServiceAccount) as Controller/Router; From 6d4f23d3d2ff433b228327c7980115d73c9dc83e Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Wed, 29 Apr 2026 09:43:32 +0200 Subject: [PATCH 18/39] jep-0011: backpressure design with drop markers and byte-counter note Replace vague backpressure requirement with concrete design: bounded ring buffer, drop markers that accumulate into a single reserved slot and flush as a warn-level log entry with count + window, plus a Prometheus dropped_total counter for alerting. 
Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 20 ++++++++++++++----- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index 59d738e4d..db492410d 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -830,11 +830,21 @@ When this mode is enabled in a deployment: log shipper (Promtail, Grafana Alloy, Vector, or equivalent) scrapes their pod logs and delivers them to Loki. This decouples the reconciler and session-handling hot paths from Loki availability. -- **Backpressure** applies to the Telemetry service: its Loki-push and - counter queues must be bounded; on overflow, drop (with a counter) or - sample. Because the Controller and Router no longer push to Loki, their - lease/session operations are inherently isolated from Loki or metrics - path slowdowns. +- **Backpressure:** The Telemetry service uses a bounded ring buffer + per destination (Loki push, metric ingest) with a configurable depth + (default: 10 000 entries). On overflow, dropped entries are replaced + by a single **drop marker** — a synthetic log entry recording the + count of dropped entries and the time window. Subsequent drops while + the buffer is still full accumulate into the same marker rather than + adding new entries, so the queue always retains one slot for the + current drop summary. When the buffer drains and the marker is + flushed, the downstream log contains an explicit record such as + `{"level":"warn","msg":"entries dropped","count":142,"window_seconds":12}`. + A `jumpstarter_telemetry_dropped_total` counter (partitioned by + `destination={loki,metrics}`) is also incremented on `/metrics` for + alerting. 
Because the Controller and Router no longer push to Loki, + their lease/session operations are inherently isolated from Loki or + metrics path slowdowns. - **Multi-tenancy:** if Loki is multi-tenant, the Telemetry writer (and the cluster log shipper for Controller/Router pod logs) applies org or namespace scoping consistently; label sets are reviewed to avoid From 59c6c0cfda4652a2859fa8bb55836b0f2a072ef8 Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Wed, 29 Apr 2026 09:44:48 +0200 Subject: [PATCH 19/39] jep-0011: add client-side pre-aggregation for byte counters Exporters batch byte increments locally (every 5s or 64 KiB) to bound telemetry RPC volume on high-frequency serial/video streams. Made-with: Cursor --- .../jeps/JEP-0011-observability-telemetry-logs.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index db492410d..17e897aab 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -724,6 +724,14 @@ and be fixed before "Implemented".* All counters and histograms carry exemplar keys (`client`, `lease_id`, `trace_id` when present, and `spec.context` fields) on every observation. +**High-frequency byte counters:** `jumpstarter_stream_bytes_total` can +be incremented at very high rates on serial and video streams. Exporters +must pre-aggregate byte counts locally and flush a single `+N` increment +to the Telemetry service at a configurable interval (default: every 5 s +or every 64 KiB, whichever comes first) rather than sending a per-read +or per-write RPC. This bounds telemetry RPC volume independently of +stream throughput. 
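The pre-aggregation rule above can be sketched as follows. Names and structure are illustrative only: the real exporter would flush through its telemetry gRPC stub and would also flush on a background timer, which is omitted here to keep the sketch single-threaded.

```python
import time

class ByteCounterBatcher:
    """Hypothetical client-side pre-aggregator: accumulate byte counts
    locally and emit one +N increment when either threshold is hit."""

    def __init__(self, flush, interval_s=5.0, byte_threshold=64 * 1024,
                 clock=time.monotonic):
        self._flush = flush                  # callable receiving the aggregated delta
        self._interval_s = interval_s
        self._byte_threshold = byte_threshold
        self._clock = clock
        self._pending = 0
        self._last_flush = clock()

    def add(self, n: int) -> None:
        self._pending += n
        now = self._clock()
        if (self._pending >= self._byte_threshold
                or now - self._last_flush >= self._interval_s):
            self._flush(self._pending)       # single +N increment RPC
            self._pending = 0
            self._last_flush = now
```

With the default thresholds, a serial console streaming thousands of small reads per second still produces at most one telemetry RPC per 64 KiB or per 5 s window.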
+ ### Example queries #### PromQL (Prometheus) From e952e3775c7eeb5a705a5e520580bc752da5ee34 Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Wed, 29 Apr 2026 09:45:31 +0200 Subject: [PATCH 20/39] jep-0011: simplify multi-tenancy, remove X-Scope-OrgID Write-side and read-side tenant scoping are deployment concerns, out of scope for this JEP. Addresses raballew's comment on read-side access control. Made-with: Cursor --- .../jeps/JEP-0011-observability-telemetry-logs.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index 17e897aab..b6c3ad724 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -853,10 +853,11 @@ When this mode is enabled in a deployment: alerting. Because the Controller and Router no longer push to Loki, their lease/session operations are inherently isolated from Loki or metrics path slowdowns. -- **Multi-tenancy:** if Loki is multi-tenant, the Telemetry writer (and the - cluster log shipper for Controller/Router pod logs) applies org or - namespace scoping consistently; label sets are reviewed to avoid - cross-tenant leakage. +- **Multi-tenancy:** write-side tenant scoping (e.g. namespace-based + separation in Loki and Prometheus) is a deployment concern handled by + the log shipper and Prometheus configuration. Read-side access control + (who can query which metrics or logs) is likewise a deployment concern + and out of scope for this JEP. 
- This does not require that *all* metrics *originate* in a single process: the exporter and drivers still emit the facts; Telemetry aggregates and ships to Loki; Controller and From 6dac64d713085b7fc6ac81b289860695b8572264 Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Wed, 29 Apr 2026 09:57:28 +0200 Subject: [PATCH 21/39] jep-0011: add metric usage and alerting table Clarify which metrics are for dashboards vs alerts with starter thresholds. Operator should ship example PrometheusRule CRDs (opt-in, disabled by default). Addresses bkhizgiy's suggestion. Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index b6c3ad724..945448cfe 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -724,6 +724,24 @@ and be fixed before "Implemented".* All counters and histograms carry exemplar keys (`client`, `lease_id`, `trace_id` when present, and `spec.context` fields) on every observation. +### Metric usage and alerting + +| Metric | Primary use | Alert? | Starter threshold | +| -------------------------------------------- | ----------- | :----: | ---------------------------------------------- | +| `jumpstarter_operations_total` | Dashboard | yes | Failure rate > 20 % over 15 min per exporter. | +| `jumpstarter_operation_duration_seconds` | Dashboard | yes | p95 > 60 s per operation type. | +| `jumpstarter_operation_errors_total` | Dashboard | yes | Error rate rising; group by `error_type`. | +| `jumpstarter_stream_bytes_total` | Dashboard | no | — | +| `jumpstarter_active_sessions` | Dashboard | yes | 0 sessions for > 30 min (possible exporter issue). 
| +| `jumpstarter_lease_acquisitions_total` | Dashboard | yes | Failure rate > 10 % over 15 min. | +| `jumpstarter_telemetry_dropped_total` | Alerting | yes | Any increment (telemetry pipeline saturated). | + +Thresholds are suggestions; operators should tune them to their +environment. The operator should ship a set of example `PrometheusRule` +CRDs based on the table above that operators can enable and customize. +These rules are opt-in and disabled by default to avoid noise in +environments with different baselines. + **High-frequency byte counters:** `jumpstarter_stream_bytes_total` can be incremented at very high rates on serial and video streams. Exporters must pre-aggregate byte counts locally and flush a single `+N` increment From fe29ecb009578a655f3ebfe343ab1b9e3b38118c Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Wed, 29 Apr 2026 10:16:43 +0200 Subject: [PATCH 22/39] jep-0011: add independent testability and E2E CI sections to test plan Each component is testable in isolation (in-memory logger, local Prometheus registry, mock gRPC/Loki). E2E tests use the Go/Ginkgo suite for direct /metrics scrapes and a minimal Loki stack or mock for log pipeline verification. Feasibility evaluated in Phase 1. Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 43 +++++++++++++++++++ 1 file changed, 43 insertions(+) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index 945448cfe..a6ba25663 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -1014,6 +1014,49 @@ on the OTel SDK in application code. metrics counter on success and failure on real hardware in a lab. - Serial and stream paths expose tx/rx byte counts. 
+### Independent testability + +Each component must be testable in isolation without deploying the full +stack: + +- **Structured logging**: unit tests validate JSON output format, base + fields, and `spec.context` propagation using an in-memory logger — no + Loki required. +- **Exporter metrics**: unit tests verify counter/histogram registration, + label correctness, and exemplar attachment using a local Prometheus + registry — no Telemetry service required. +- **Telemetry service**: integration tests use mock gRPC clients and a + mock Loki endpoint to verify ingest, counter aggregation, backpressure + behavior, and drop markers — no real exporters required. +- **Operator configuration**: unit tests validate CRD admission + (e.g. `spec.context` size limits) and `ServiceMonitor` generation. + +### End-to-end (CI) + +The full telemetry pipeline should be exercised in GitHub Actions CI. +Evaluate feasibility of running a minimal Prometheus + Loki stack inside +the CI environment (e.g. single-binary mode containers); if resource +constraints make this impractical, at minimum: + +- **Loki mock or single-binary**: a lightweight Loki instance (or a mock + HTTP/gRPC endpoint that validates the Loki push API contract) receives logs + from the Telemetry service and asserts expected fields, stream labels, + and `spec.context` propagation across the full exporter → Telemetry → + Loki path. +- **Prometheus scrape**: the existing Go/Ginkgo E2E test suite performs + direct HTTP scrapes of the `/metrics` endpoints on Controller, Router, + and Telemetry services — no separate Prometheus instance required. The + test parses the OpenMetrics response and asserts that documented + series, labels, and exemplars appear after a known operation sequence. 
+- **Correlation round-trip**: an E2E test runs a lease lifecycle (create → + flash → power-cycle → release) and verifies that the same `lease_id` + and `exporter` values appear in both scraped metrics (label or + exemplar) and ingested log entries, confirming cross-signal + correlation. + +Feasibility of this stack should be evaluated early (Phase 1) so that +all subsequent phases have E2E coverage from the start. + ### Manual - `jmp` default output remains readable; JSON structured logs are only sent From f9938811e96a93cd2834eddd74413fd4ae753eaa Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Wed, 29 Apr 2026 10:17:35 +0200 Subject: [PATCH 23/39] jep-0011: add JSON schema acceptance criterion Require a machine-readable spec for the structured log format so consumers can validate entries and detect field name/type regressions. Made-with: Cursor --- .../internal/jeps/JEP-0011-observability-telemetry-logs.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index a6ba25663..df9c3e7e0 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -1073,6 +1073,9 @@ all subsequent phases have E2E coverage from the start. references to integrate with Loki for pushing logs. - [ ] Operator attempts to auto-configure Prometheus metric scraping on the right endpoints. +- [ ] A JSON schema (or equivalent machine-readable specification) is + published for the structured log format, enabling consumers to + validate log entries and detect regressions in field names or types. - [ ] Backward compatibility: existing clients and manifests without the new fields continue to work; deployments that do not use hub forwarding behave as today. 
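The in-memory structured-logging unit test described under *Independent testability* could look like the following sketch. The formatter class and the `lease_context` record attribute are illustrative assumptions; only the base fields (`severity`, `message`, `component`) and the propagation of `spec.context` entries into the JSON body come from the proposal.

```python
import io
import json
import logging


class JsonContextFormatter(logging.Formatter):
    """Illustrative formatter: emits one JSON object per record, with base
    fields plus any lease spec.context entries attached to the record."""

    def format(self, record):
        entry = {
            "severity": record.levelname.lower(),
            "message": record.getMessage(),
            "component": getattr(record, "component", "exporter"),
        }
        # spec.context keys land in the JSON body, never in stream labels.
        entry.update(getattr(record, "lease_context", {}))
        return json.dumps(entry)


buf = io.StringIO()                       # in-memory sink, no Loki needed
handler = logging.StreamHandler(buf)
handler.setFormatter(JsonContextFormatter())
logger = logging.getLogger("jumpstarter.test")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "flash complete",
    extra={"component": "exporter",
           "lease_context": {"build_id": "1234", "image_digest": "sha256:abc"}})

entry = json.loads(buf.getvalue())
print(entry["severity"], entry["build_id"])  # prints: info 1234
```

Such a test validates field names and `spec.context` propagation without deploying any part of the telemetry stack.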
From bc4772c650e79cae5fb32064517eb5f4ccc69792 Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Wed, 29 Apr 2026 10:19:53 +0200 Subject: [PATCH 24/39] jep-0011: add Perses vs Grafana comparison table in DD-10 Side-by-side comparison covering license, governance, CRDs, exemplar rendering, and maturity. Highlights that the choice is non-exclusive since Grafana can consume the same standard endpoints. Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 20 ++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index df9c3e7e0..c3a6a4a0a 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -599,8 +599,26 @@ labels. three backends this JEP standardizes on — without carrying the cost of a broad plugin ecosystem the project does not need. +**Perses vs Grafana — practical comparison:** + +| Aspect | Perses | Grafana | +| -------------------- | --------------------------------------- | ------------------------------------------ | +| License | Apache 2.0 | AGPL v3 | +| Governance | CNCF (vendor-neutral) | Grafana Labs (commercial) | +| Dashboard-as-code | CUE/JSON spec, static validation, SDKs | JSON export, no built-in validation | +| K8s-native CRDs | Yes | Via third-party operator (grafana-operator)| +| Exemplar rendering | Not yet (upstream roadmap) | Yes (>= 7.4) | +| Data-source scope | Prometheus, Loki, Tempo | Broad plugin ecosystem | +| Maturity / ecosystem | Early (CNCF sandbox/incubating) | Mature, widely deployed | + +The main Perses gap today is exemplar visualization. Operators who need +exemplar overlays on dashboards should use Grafana alongside Perses or +wait for upstream support. 
Grafana remains fully compatible — all +`/metrics` and Loki endpoints are standard — so the choice is +non-exclusive. + Operators who prefer Grafana can still point it at the same `/metrics` and Loki -endpoints; this DD only governs the recommended dashboard experience. +endpoints; this DD only governs the *recommended* dashboard experience. ## Design Details From 78595ceb0101d08b443b4bb11ebecbafe33ac835 Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Wed, 29 Apr 2026 10:20:50 +0200 Subject: [PATCH 25/39] jep-0011: update README status from Draft to Discussion Made-with: Cursor --- python/docs/source/internal/jeps/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/python/docs/source/internal/jeps/README.md b/python/docs/source/internal/jeps/README.md index fca115978..76bc26419 100644 --- a/python/docs/source/internal/jeps/README.md +++ b/python/docs/source/internal/jeps/README.md @@ -35,7 +35,7 @@ For the full process definition, see [JEP-0000](JEP-0000-jep-process.md). 
| JEP | Title | Status | Author(s) | | ---- | ---------------------------------------------------- | ----------- | -------------------- | | 0010 | [Renode Integration](JEP-0010-renode-integration.md) | Implemented | @vtz (Vinicius Zein) | -| 0011 | [Metrics, Tracing, and Log Observability](JEP-0011-observability-telemetry-logs.md) | Draft | @mangelajo (Miguel Angel Ajo Pelayo) | +| 0011 | [Metrics, Tracing, and Log Observability](JEP-0011-observability-telemetry-logs.md) | Discussion | @mangelajo (Miguel Angel Ajo Pelayo) | ### Informational JEPs From 4d4adb7d80eef175c0a5e050b32d3d0609b34552 Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Wed, 29 Apr 2026 10:30:37 +0200 Subject: [PATCH 26/39] jep-0011: add operator configuration section with TLS and exemplar allowlist Add spec.telemetry CR fields: Loki URL/credentials, TLS (caSecretRef, insecureSkipVerify), exemplarKeys allowlist filtering all keys including spec.context, driverTypeEnum allowlist, serviceMonitor, prometheusRules, and backpressure queue depth. Includes example CR snippet and Loki transport evaluation note. Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 93 ++++++++++++++++++- 1 file changed, 90 insertions(+), 3 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index c3a6a4a0a..15216fd82 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -639,12 +639,13 @@ endpoints; this DD only governs the *recommended* dashboard experience. | `namespace` | no | — | yes | yes | K8s namespace; bounded. | | `lease_id` | **no** | yes | **no** | yes | Unbounded; exemplar for drill-down. | | `client` | **no** | yes | **no** | yes | CRD name; exemplar for client identity. | -| `image_digest`, `build_id`, etc. 
| **no** | yes | **no** | yes | From `spec.context`; always included. | +| `image_digest`, `build_id`, etc. | **no** | yes | **no** | yes | From `spec.context`; included when listed in `exemplarKeys`. | | `trace_id` / `span_id` | **no** | yes | **no** | yes | W3C; links metrics to traces via exemplars. | Additional `lease.spec.context` correlation fields can be added at runtime; -they appear as structured log line fields and as Prometheus exemplar keys -(see *Exemplars for high-cardinality context* below). +they appear as structured log line fields and, when listed in the operator's +`exemplarKeys` allowlist, as Prometheus exemplar keys (see *Exemplars for +high-cardinality context* below and *Operator configuration*). ### Cardinality guidelines @@ -1005,6 +1006,92 @@ on the OTel SDK in application code. plan should name tested combinations (Prometheus and Loki version pairs where relevant) in `Implementation History`, not a single product bundle. +### Operator configuration + +The Jumpstarter operator CR controls telemetry behavior cluster-wide. +Observability settings live under `spec.telemetry` so that administrators +can tune metrics, logging, and exemplar behavior without editing code. + +**Key configurable fields:** + +| Field | Type | Default | Description | +| ----------------------------------------- | ---------- | ------------------------------------------------ | ---------------------------------------------------------------------------------------------- | +| `spec.telemetry.enabled` | `bool` | `false` | Deploy the optional Telemetry service. | +| `spec.telemetry.loki.url` | `string` | — | Loki push endpoint; required when Telemetry is enabled. | +| `spec.telemetry.loki.secretRef` | `string` | — | Secret with Loki credentials (see **DD-5**). | +| `spec.telemetry.loki.tls.caSecretRef` | `string` | — | Secret containing a CA bundle (`ca.crt` key) to trust for the Loki endpoint. 
| +| `spec.telemetry.loki.tls.insecureSkipVerify` | `bool` | `false` | Disable TLS certificate verification (development/testing only). | +| `spec.telemetry.metrics.exemplarKeys` | `[]string` | `["client", "lease_id"]` | Allowlist of keys to include in exemplars (including `spec.context` keys). Only listed keys are emitted; unlisted keys are omitted even if present. | +| `spec.telemetry.metrics.driverTypeEnum` | `[]string` | `["power", "storage", "network", "serial", …]` | Allowed `driver_type` label values. Drivers reporting an unlisted type are mapped to `other`. | +| `spec.telemetry.metrics.serviceMonitor` | `bool` | `true` | Create `ServiceMonitor` CRDs for Prometheus autodiscovery. | +| `spec.telemetry.metrics.prometheusRules` | `bool` | `false` | Deploy starter `PrometheusRule` CRDs (opt-in). | +| `spec.telemetry.backpressure.queueDepth` | `int` | `10000` | Ring buffer depth per destination (see backpressure design above). | + +**Example CR snippet:** + +```yaml +apiVersion: operator.jumpstarter.dev/v1alpha1 +kind: Jumpstarter +metadata: + name: jumpstarter +spec: + telemetry: + enabled: true + loki: + url: "https://loki-gateway.monitoring.svc:3100/loki/api/v1/push" + secretRef: "loki-credentials" + tls: + caSecretRef: "loki-ca-bundle" + metrics: + exemplarKeys: + - client + - lease_id + - build_id + driverTypeEnum: + - power + - storage + - network + - serial + - video + - composite + serviceMonitor: true + prometheusRules: true + backpressure: + queueDepth: 20000 +``` + +The `driverTypeEnum` list acts as an allowlist: drivers must select a +category from this set (or fall back to `other`). This keeps the +`driver_type` Prometheus label bounded and prevents cardinality +surprises from third-party drivers. Administrators can extend the list +for site-specific driver categories. + +The `exemplarKeys` list is an **allowlist** that controls which keys are +included in Prometheus exemplars. 
This filters *everything* — both +built-in keys (`client`, `lease_id`) and `spec.context` keys. Only keys +present in `exemplarKeys` are emitted; unlisted keys are omitted even if +available. This gives administrators full control over exemplar budget +usage: adding a `spec.context` key like `build_id` to the list opts it +in, while removing `lease_id` frees budget for other entries. + +**Loki transport:** During implementation, evaluate whether the Telemetry +service should connect to Loki via the HTTP push API +(`/loki/api/v1/push`) or the gRPC endpoint. gRPC may offer better +throughput and streaming semantics (aligned with Jumpstarter's existing +gRPC infrastructure), while the HTTP API is simpler to debug and more +broadly supported by Loki-compatible backends. The `spec.telemetry.loki.url` +field should accept either scheme (`http://` / `grpc://`) so the choice +remains a deployment decision. + +**Loki TLS:** Many deployments terminate Loki behind a TLS endpoint +with an internal or self-signed CA. The `spec.telemetry.loki.tls` +subsection follows the same pattern as the existing operator TLS +configuration: `caSecretRef` names a Kubernetes Secret whose `ca.crt` +key contains the PEM-encoded CA bundle to trust. When set, the +Telemetry service adds this CA to its TLS root pool when connecting to +Loki. `insecureSkipVerify` disables certificate verification entirely +and should only be used in development or testing environments. + ## Test Plan ### Unit Tests From d3e7543fb7bc5becb4067121b92ba7e29d9487fb Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Wed, 29 Apr 2026 10:36:32 +0200 Subject: [PATCH 27/39] jep-0011: add exporterLabels for hardware-type tracking in logs and exemplars New spec.telemetry.exporterLabels field copies Exporter CRD label values (e.g. board-type) into structured log JSON fields and the exemplar candidate pool. Empty by default; exemplarKeys allowlist still controls final exemplar inclusion. 
Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 28 ++++++++++++++----- 1 file changed, 21 insertions(+), 7 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index 15216fd82..19a596caa 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -1021,7 +1021,8 @@ can tune metrics, logging, and exemplar behavior without editing code. | `spec.telemetry.loki.secretRef` | `string` | — | Secret with Loki credentials (see **DD-5**). | | `spec.telemetry.loki.tls.caSecretRef` | `string` | — | Secret containing a CA bundle (`ca.crt` key) to trust for the Loki endpoint. | | `spec.telemetry.loki.tls.insecureSkipVerify` | `bool` | `false` | Disable TLS certificate verification (development/testing only). | -| `spec.telemetry.metrics.exemplarKeys` | `[]string` | `["client", "lease_id"]` | Allowlist of keys to include in exemplars (including `spec.context` keys). Only listed keys are emitted; unlisted keys are omitted even if present. | +| `spec.telemetry.exporterLabels` | `[]string` | `[]` | Exporter-level label keys (e.g. `board-type`) copied from Exporter CRD labels into log JSON fields and exemplar candidates. | +| `spec.telemetry.metrics.exemplarKeys` | `[]string` | `["client", "lease_id"]` | Allowlist of keys to include in exemplars (including `spec.context` and `exporterLabels` keys). Only listed keys are emitted; unlisted keys are omitted even if present. | | `spec.telemetry.metrics.driverTypeEnum` | `[]string` | `["power", "storage", "network", "serial", …]` | Allowed `driver_type` label values. Drivers reporting an unlisted type are mapped to `other`. | | `spec.telemetry.metrics.serviceMonitor` | `bool` | `true` | Create `ServiceMonitor` CRDs for Prometheus autodiscovery. 
| | `spec.telemetry.metrics.prometheusRules` | `bool` | `false` | Deploy starter `PrometheusRule` CRDs (opt-in). | @@ -1037,6 +1038,8 @@ metadata: spec: telemetry: enabled: true + exporterLabels: + - board-type loki: url: "https://loki-gateway.monitoring.svc:3100/loki/api/v1/push" secretRef: "loki-credentials" @@ -1047,6 +1050,7 @@ spec: - client - lease_id - build_id + - board-type driverTypeEnum: - power - storage @@ -1066,13 +1070,23 @@ category from this set (or fall back to `other`). This keeps the surprises from third-party drivers. Administrators can extend the list for site-specific driver categories. +The `exporterLabels` list names Exporter CRD label keys whose values +are copied into every log JSON field and made available as exemplar +candidates for operations involving that exporter. For example, setting +`exporterLabels: ["board-type"]` means an Exporter with the label +`board-type: rpi4` will include `"board-type": "rpi4"` in its +structured log lines and in the exemplar candidate pool. The list is +empty by default — no exporter labels are propagated unless the +administrator opts in. + The `exemplarKeys` list is an **allowlist** that controls which keys are -included in Prometheus exemplars. This filters *everything* — both -built-in keys (`client`, `lease_id`) and `spec.context` keys. Only keys -present in `exemplarKeys` are emitted; unlisted keys are omitted even if -available. This gives administrators full control over exemplar budget -usage: adding a `spec.context` key like `build_id` to the list opts it -in, while removing `lease_id` frees budget for other entries. +included in Prometheus exemplars. This filters *everything* — built-in +keys (`client`, `lease_id`), `spec.context` keys, and `exporterLabels` +keys alike. Only keys present in `exemplarKeys` are emitted; unlisted +keys are omitted even if available. 
This gives administrators full +control over exemplar budget usage: adding `board-type` to both +`exporterLabels` and `exemplarKeys` propagates hardware type into +exemplars, while removing `lease_id` frees budget for other entries. **Loki transport:** During implementation, evaluate whether the Telemetry service should connect to Loki via the HTTP push API From ee25596735ba7874424effe12d99b244a9aa6611 Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Wed, 29 Apr 2026 11:44:03 +0200 Subject: [PATCH 28/39] jep-0011: fix consistency issues across document - Remove stale X-Scope-OrgID reference in DD-5 - Align spec.context and exporterLabels exemplar inclusion with exemplarKeys allowlist throughout (cardinality guidelines, proposed metrics, exemplar section) - Add exporterLabels rows to correlation table and DD-4 log fields table - Update document date to 2026-04-29 Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 22 +++++++++++-------- 1 file changed, 13 insertions(+), 9 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index 19a596caa..41660c0fc 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -8,7 +8,7 @@ | **Status** | Discussion | | **Type** | Standards Track | | **Created** | 2026-04-23 | -| **Updated** | 2026-04-27 | +| **Updated** | 2026-04-29 | | **Discussion** | https://github.com/jumpstarter-dev/jumpstarter/pull/631 | | **Requires** | — | | **Supersedes** | — | @@ -337,6 +337,7 @@ are still useful for selection and for tools that only understand metadata. | `driver_type` | Category from predefined set (when applicable) | no | Driver category (storage, power, …). | | `client` | CRD name (when applicable) | no | Client CRD name (high cardinality). 
| | *`spec.context` keys* | User-defined strings (during active lease) | no | All `lease.spec.context` entries (e.g. `build_id`, `image_digest`, VCS ref) added as JSON fields. High cardinality, never stream labels. | +| *`exporterLabels` keys* | Values from Exporter CRD labels (when configured) | no | Operator-defined exporter labels (e.g. `board-type`); see `spec.telemetry.exporterLabels`. | `namespace` is **not** emitted by the application. Log shippers (Promtail, Grafana Alloy, Vector) automatically inject `namespace` @@ -399,8 +400,7 @@ are still useful for selection and for tools that only understand metadata. cluster's existing log shipping infrastructure. Generic in-cluster collectors solve *credentials* but not *semantic* correlation unless integrated; the hub (2) reuses the existing trust model - (exporter→controller) and can inject labels and tenant headers (for - example `X-Scope-OrgID`) in one place. A separate Deployment (**4** / + (exporter→controller) and can inject labels and tenant context in one place. A separate Deployment (**4** / **DD-7**) is preferable to overloading the main reconciler when load or residency of counters matters. @@ -641,6 +641,7 @@ endpoints; this DD only governs the *recommended* dashboard experience. | `client` | **no** | yes | **no** | yes | CRD name; exemplar for client identity. | | `image_digest`, `build_id`, etc. | **no** | yes | **no** | yes | From `spec.context`; included when listed in `exemplarKeys`. | | `trace_id` / `span_id` | **no** | yes | **no** | yes | W3C; links metrics to traces via exemplars. | +| *`exporterLabels` keys* | **no** | yes | **no** | yes | From Exporter CRD labels; included when listed in `exemplarKeys`. | Additional `lease.spec.context` correlation fields can be added at runtime; they appear as structured log line fields and, when listed in the operator's @@ -669,8 +670,8 @@ Rules of thumb for this JEP: guidance: < 100 k active streams). High-cardinality fields go inside the log line body. 
- **Lease context fields** from `spec.context` are propagated into log line - JSON and into Prometheus exemplars. They never become Prometheus labels or - Loki stream labels. + JSON and, when listed in `exemplarKeys`, into Prometheus exemplars. They + never become Prometheus labels or Loki stream labels. #### Exemplars for high-cardinality context @@ -694,8 +695,9 @@ Full distributed tracing (spans, storage, visualization) is deferred to a future JEP; when it lands, `trace_id` becomes a default key. Until then, omitting it saves ~45 characters of exemplar budget. -All `spec.context` keys (e.g. `build_id`, `image_digest`) are automatically -included as exemplar keys. Because exemplars are per-observation metadata — +`spec.context` keys (e.g. `build_id`, `image_digest`) are included as +exemplar keys when listed in the operator's `exemplarKeys` allowlist (see +*Operator configuration*). Because exemplars are per-observation metadata — not label dimensions — they have zero impact on series cardinality regardless of how many distinct values appear. @@ -740,8 +742,10 @@ and be fixed before "Implemented".* | `jumpstarter_active_sessions` | gauge | `exporter` | Currently active lease sessions. | | `jumpstarter_lease_acquisitions_total` | counter | `result` | Lease acquire attempts (controller). | -All counters and histograms carry exemplar keys (`client`, `lease_id`, -`trace_id` when present, and `spec.context` fields) on every observation. +All counters and histograms carry exemplar keys from the operator's +`exemplarKeys` allowlist (by default `client` and `lease_id`; `trace_id` +when present; `spec.context` and `exporterLabels` entries when listed) +on every observation. 
### Metric usage and alerting  From 8771ced17905bd71595ffec7c00c36bd52a9c072 Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Wed, 29 Apr 2026 12:38:07 +0200 Subject: [PATCH 29/39] jep-0011: add author email, fix status alignment, update implementation history Made-with: Cursor --- .../internal/jeps/JEP-0011-observability-telemetry-logs.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index 41660c0fc..b44e5a577 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -4,8 +4,8 @@ | ----------------- | --------------------------------------------------------------------- | | **JEP** | 0011 | | **Title** | Metrics, Tracing, and Log Observability | -| **Author(s)** | @mangelajo (Miguel Angel Ajo Pelayo) | -| **Status** | Discussion | +| **Author(s)** | @mangelajo (Miguel Angel Ajo Pelayo) | +| **Status** | Discussion | | **Type** | Standards Track | | **Created** | 2026-04-23 | | **Updated** | 2026-04-29 | @@ -1328,7 +1328,8 @@ all subsequent phases have E2E coverage from the start.
## Implementation History  -- +- JEP-0011 proposed: 2026-04-23 +- JEP-0011 updated based on feedback: 2026-04-29  ## References  From d2028ed96707becd026e33a8649e2a75d090ea26 Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Wed, 29 Apr 2026 19:27:13 +0200 Subject: [PATCH 30/39] jep-0011: add concrete gRPC protocol proposal for telemetry Expand the API / Protocol Changes section with: - GetServiceEndpoints RPC on ControllerService for telemetry discovery - TelemetryService in new telemetry.proto with MetricsStream (reverse scrape) and PushLogs RPCs - AuditStream removal (dead code, never implemented) - LogStreamResponse enrichment with optional additive fields Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 202 +++++++++++++++++- 1 file changed, 193 insertions(+), 9 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index b44e5a577..e7e05dfd2 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -126,20 +126,204 @@ exporter-level metrics that a monitoring stack can scrape or receive.  ### API / Protocol Changes  -*High level — to be refined during review.* +#### CRD (Lease)  -- **CRD (Lease)**: Additive changes only for the `spec.context` field. Backwards - compatibility by making this field empty by default. -- **gRPC (if applicable)**: Additional controller methods to discover the availability - of a metrics, or set of metrics endpoint(s). Optional propagation of `traceparent` and lease - identifiers in metadata; must remain backward compatible for existing clients - (unknown metadata ignored by older servers). +Additive changes only for the `spec.context` field. Backwards compatibility +by making this field empty by default.
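To make the interplay between `spec.context` entries and the exemplar budget concrete, here is an illustrative filtering sketch. The helper name is invented; the 128-character combined name+value limit is taken from the OpenMetrics exemplar specification, and treating allowlist order as priority order is an assumption of this sketch.

```python
def build_exemplar(candidates, allowlist, budget=128):
    """Keep only allowlisted keys, in allowlist order, skipping entries
    that would push the combined name+value length past the OpenMetrics
    exemplar budget (128 UTF-8 characters)."""
    exemplar = {}
    used = 0
    for key in allowlist:                 # allowlist order = priority order
        value = candidates.get(key)
        if value is None:
            continue
        cost = len(key) + len(value)
        if used + cost > budget:
            continue                      # skip entries that overflow
        exemplar[key] = value
        used += cost
    return exemplar


observation = {
    "client": "ci-bot",
    "lease_id": "lease-8f2c",
    "build_id": "9471",                   # from lease.spec.context, allowlisted
    "image_digest": "sha256:" + "a" * 64,  # not allowlisted -> dropped
}
print(build_exemplar(observation, ["client", "lease_id", "build_id"]))
# prints: {'client': 'ci-bot', 'lease_id': 'lease-8f2c', 'build_id': '9471'}
```

Dropping a large value such as an image digest from the allowlist frees budget for smaller, more query-friendly keys.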
-**Tracing scope:** This JEP covers *correlation only* — `lease_id`, `trace_id`, +#### gRPC: Telemetry endpoint discovery (`jumpstarter.proto`) + +A new RPC on the existing `ControllerService` lets both exporters and +clients discover the optional Telemetry endpoint: + +```protobuf +// Added to ControllerService +rpc GetServiceEndpoints(GetServiceEndpointsRequest) + returns (GetServiceEndpointsResponse); + +message GetServiceEndpointsRequest {} + +message GetServiceEndpointsResponse { + // Empty when telemetry is not enabled. + repeated TelemetryEndpoint telemetry_endpoints = 1; +} + +message TelemetryEndpoint { + string endpoint = 1; // gRPC address (host:port) + string certificate = 2; // Optional CA cert for the endpoint +} +``` + +Exporters call `GetServiceEndpoints` after `Register`; clients call it +after authentication. An empty `telemetry_endpoints` list means telemetry +is not deployed — callers skip all telemetry RPCs. Older controllers +that do not implement the method return `UNIMPLEMENTED`, which callers +treat identically to an empty list. + +#### gRPC: Telemetry service (`telemetry.proto` — new file) + +A new `protocol/proto/jumpstarter/v1/telemetry.proto` defines the +`TelemetryService` implemented by `jumpstarter-telemetry`. It has two +RPCs: one for metrics (reverse scrape) and one for log push. + +##### Metrics: reverse scrape via `MetricsStream` + +Exporters maintain a local `prometheus_client.CollectorRegistry` with +counters, histograms, and gauges. Rather than pushing increments, the +exporter opens a persistent bidirectional stream to the Telemetry +service; the Telemetry service periodically sends a scrape request +and the exporter responds with the output of +`prometheus_client.generate_latest()` in OpenMetrics text format. + +```protobuf +service TelemetryService { + // Persistent bidirectional stream: telemetry sends scrape requests, + // exporter responds with full metric snapshots. 
+  rpc MetricsStream(stream MetricsStreamRequest) +      returns (stream MetricsStreamResponse); + +  // Structured log / event push (used by both exporters and clients). +  rpc PushLogs(PushLogsRequest) returns (PushLogsResponse); +} + +// Exporter → Telemetry +message MetricsStreamRequest { +  oneof msg { +    MetricsRegister register = 1; // First message: identify this exporter +    MetricsScrapeResponse scrape_response = 2; // Subsequent: reply to a scrape +  } +} + +message MetricsRegister { +  string identity = 1; // Exporter CRD name (verified against mTLS and auth token by server) +} + +message MetricsScrapeResponse { +  bytes metrics_text = 1; // generate_latest() OpenMetrics output +  google.protobuf.Timestamp timestamp = 2; +} + +// Telemetry → Exporter +message MetricsStreamResponse { +  oneof msg { +    MetricsScrapeRequest scrape_request = 1; +  } +} + +message MetricsScrapeRequest {} // "send your /metrics now" +``` + +The stream lifecycle: + +1. Exporter opens the stream and sends `MetricsRegister`; the jumpstarter-telemetry +   service authenticates the exporter identity and derives labels from cluster information. 2. When Prometheus (or any scraper) hits the Telemetry service's +   `/metrics` endpoint, Telemetry fans out `MetricsScrapeRequest` +   to all connected exporters. 3. Each exporter calls `generate_latest(registry)` and replies with +   `MetricsScrapeResponse`. 4. Telemetry merges the responses and serves the combined result, +   adding and filtering labels or exemplars as needed. +   This on-demand approach avoids stale data and unnecessary +   background traffic; it can be changed to periodic pre-fetching +   later if scrape latency becomes problematic. + +**Client-side metrics are not collected.** All metric-relevant +operations are observable from the exporter side: `DriverCall` methods +run on the exporter and can be instrumented there. Client-side drivers +that orchestrate complex workflows (e.g.
serial-console-driven +flashing) report outcomes back to the exporter via regular +`DriverCall` methods, keeping the exporter as the single source of +truth for metrics. + +##### Logs: push via `PushLogs` + +Both exporters and clients push structured log entries to the +Telemetry service for Loki ingest: + +```protobuf +message PushLogsRequest { +  repeated LogEntry entries = 1; +} + +message PushLogsResponse { +  uint32 accepted = 1; // Entries accepted +  uint32 dropped = 2; // Entries dropped (backpressure) +} + +message LogEntry { +  google.protobuf.Timestamp timestamp = 1; +  string severity = 2; // debug, info, warn, error +  string message = 3; +  string component = 4; // Log stream label: cli, exporter +  string exporter = 5; // Log stream label: exporter CRD name +  string lease_id = 6; // High-cardinality, log body only +  string client = 7; // High-cardinality, log body only +  string operation = 8; // flash, power, etc. +  string result = 9; // success, failure +  string driver_type = 10; // storage, power, network, etc. +  map<string, string> extra_fields = 11; // Driver-specific structured data +} +``` + +The Telemetry service maps `component` and `exporter` to Loki stream +labels and everything else into the JSON body, following the +cardinality rules in *Cardinality guidelines*. The `exporter` and +`client` fields are verified server-side against the authenticated +identity to prevent impersonation. Fields left empty that can be +derived from the lease (looked up via `lease_id`) are filled in +server-side before the entry is written. + +#### gRPC: `AuditStream` removal (`jumpstarter.proto`) + +The existing `AuditStream` RPC on `ControllerService` and its +`AuditStreamRequest` message are removed. Analysis of the codebase +shows this is dead code: + +- The Go controller has no implementation — calls fall through to +  `UnimplementedControllerServiceServer` which returns +  `codes.Unimplemented`. +- No Python code (exporter or client) calls the RPC. +- No tests exercise it beyond generated stubs.
Its intended purpose (tracking exporter activity) is fully superseded
by `TelemetryService.PushLogs` with a richer, properly designed
message format.

#### gRPC: `LogStreamResponse` enrichment (`jumpstarter.proto`)

The existing `LogStream` RPC on `ExporterService` is kept — it serves
a fundamentally different purpose (real-time session logs from the
exporter to the connected client) from the Telemetry log push. However,
the `LogStreamResponse` message is enriched with optional additive
fields to support richer client-side display and optional dual-path
forwarding to telemetry:

```protobuf
message LogStreamResponse {
  string uuid = 1;
  string severity = 2;
  string message = 3;
  optional LogSource source = 4;
  // New additive fields:
  optional string driver_type = 5; // Category when source=DRIVER
  optional string operation = 6;   // When the log is part of a known operation
  optional google.protobuf.Timestamp timestamp = 7;
  map<string, string> structured_fields = 8;
}
```

These fields are optional and backward compatible — older clients
ignore unknown fields; older exporters simply do not set them.

#### Tracing scope

This JEP covers *correlation only* — `lease_id`, `trace_id`, and
`span_id` are propagated as log fields and Prometheus exemplar keys so
that metrics, logs, and (future) traces can be joined. Full distributed
tracing (span creation, sampling policies, trace storage and
visualization) is deferred to a future JEP. Optional propagation of
`traceparent` and lease identifiers in gRPC metadata remains backward
compatible (unknown metadata is ignored by older servers).
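As a rough sketch of the correlation-only propagation described above, a client could build per-call gRPC metadata like the following. The `x-jumpstarter-lease-id` key is hypothetical — this JEP does not fix a metadata key name — and the `traceparent` layout follows the W3C Trace Context format (version `00`, sampled flag set):

```python
import secrets

def build_traceparent(trace_id=None, span_id=None):
    """Format a W3C Trace Context `traceparent` value (version 00, sampled)."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 lowercase hex chars
    span_id = span_id or secrets.token_hex(8)     # 16 lowercase hex chars
    return f"00-{trace_id}-{span_id}-01"

def correlation_metadata(lease_id, trace_id=None):
    """gRPC metadata a client could attach to every call.

    The `x-jumpstarter-lease-id` key is hypothetical; servers that do not
    understand the metadata simply ignore it, preserving compatibility.
    """
    return [
        ("traceparent", build_traceparent(trace_id)),
        ("x-jumpstarter-lease-id", lease_id),
    ]

# Usage with any generated stub (metadata kwarg on a call):
#   stub.DriverCall(request, metadata=correlation_metadata("lease-abc123"))
```

Because the keys travel as ordinary gRPC metadata, older servers drop them on the floor, which is exactly the backward-compatibility behavior the section relies on.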
### Hardware Considerations From 8765da4bb70b2700790f7847fa145dc21fbf79b5 Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Wed, 29 Apr 2026 19:31:11 +0200 Subject: [PATCH 31/39] jep-0011: update DD-3 with reverse-scrape alternative and decision Add alternative (4) for reverse scrape via gRPC MetricsStream, update decision to (4), and note Python prometheus_client library usage on exporter side. Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 49 +++++++++++++------ 1 file changed, 34 insertions(+), 15 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index e7e05dfd2..f91e2443a 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -415,18 +415,32 @@ are still useful for selection and for tools that only understand metadata. 3. **Both** — **(1)** is required for the documented path; **(2)** is optional infrastructure behind Prometheus, not a second required app protocol. - -**Decision:** **(1)** for how cluster Prometheus ingests Jumpstarter - aggregated metrics (scrape the Telemetry, Controller, - and Router services). - -**Rationale:** Scrape is standard, debuggable, and scalable; it matches - `ServiceMonitor`; it avoids app-side remote-write credentials and - complexity in Jumpstarter. The OpenMetrics exposition format used by - the scrape path natively carries exemplars, enabling high-cardinality - context (`client`, `lease_id`, and `trace_id` when present) on individual - samples without additional infrastructure. See **DD-6** (no OTel), **DD-7** (Telemetry - Deployment), **DD-8** (HA replicas). +4. **Reverse scrape via gRPC** — exporters maintain a local + `prometheus_client.CollectorRegistry` and connect to the Telemetry + service via a persistent bidirectional gRPC stream (`MetricsStream`). 
+ When Prometheus scrapes the Telemetry service's `/metrics` endpoint, + Telemetry fans out scrape requests to all connected exporters, merges + the `generate_latest()` responses, and serves the combined result. + Controller and Router still expose `/metrics` directly for Prometheus + scrape (no change). This avoids push-increment complexity on the wire + and keeps full counter state on the exporter at all times. + +**Decision:** **(4)** — exporter-originated metrics are reverse-scraped + through the Telemetry service via `MetricsStream`. + +**Rationale:** Exporters are often behind NAT or firewalls and cannot + be directly scraped by Prometheus. The reverse-scrape model **(4)** + solves this: the exporter initiates an outbound gRPC stream + (NAT-friendly, same direction as the existing controller connection), + the Telemetry service requests metric snapshots on demand, and full + counter state remains on the exporter at all times — eliminating + lost-increment concerns (see **DD-9**). The exporter uses standard + `prometheus_client` primitives locally, so driver authors instrument + with familiar counters and histograms. The OpenMetrics exposition + format natively carries exemplars, enabling high-cardinality context + (`client`, `lease_id`, and `trace_id` when present) on individual + samples without additional infrastructure. See **DD-6** (no OTel), + **DD-7** (Telemetry Deployment), **DD-8** (HA replicas). **Exemplar trade-offs and details:** @@ -457,9 +471,14 @@ are still useful for selection and for tools that only understand metadata. operators need when investigating a spike. - **Library support.** Go client support is mature - (`prometheus/client_golang` ≥ 1.16). The Python ecosystem is less - complete but not required for this JEP since metrics are exposed from - Go services. + (`prometheus/client_golang` ≥ 1.16). 
The Python `prometheus_client` + library is used on the exporter side to maintain local registries + and produce `generate_latest()` output for the reverse-scrape path + (see *API / Protocol Changes*). Exemplar support in the Python + library is functional but less complete than Go; if limitations + arise, exemplar data can be sent as a sidecar field in + `MetricsScrapeResponse` for the Telemetry service to merge + server-side. - **Infrastructure requirements.** Prometheus ≥ 2.26 with `--enable-feature=exemplar-storage` and From 81c9eec8dd3fd50b84e5f5c37ebfbed87acf08d9 Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Thu, 30 Apr 2026 12:40:10 +0200 Subject: [PATCH 32/39] jep-0011: update DD-7 for reverse-scrape model, add operator config fields Rewrite DD-7 failure modes and memory budget for exporter-local counters with reverse-scrape. Add scrapeTimeout (default 7s), staleEvictionTTL (default 10m) to operator config. Add jumpstarter_scrape_timeouts_total metric. Make loki.url optional to support metrics-only deployments. Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 85 +++++++++++++------ 1 file changed, 60 insertions(+), 25 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index f91e2443a..7274a52a6 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -679,44 +679,67 @@ is additive and does not require adopting OTel as a project dependency. ServiceAccount / mTLS as other control-plane binaries. 3. **Split** into separate sidecars (Loki-only, metrics-only) — more images to build and version. - -**Decision:** Prefer **(2)** for the optional aggregated-metrics + Loki +4. 
**Dedicated Deployment with reverse-scrape for metrics and push for
   logs** — same dedicated `jumpstarter-telemetry` Deployment as **(2)**,
   but instead of receiving increment RPCs the service reverse-scrapes
   connected exporters via `MetricsStream` (see *API / Protocol
   Changes*). Exporters maintain local `prometheus_client` registries;
   the Telemetry service requests `generate_latest()` snapshots on
   demand when its `/metrics` endpoint is hit, merges the results, and
   serves them to Prometheus. Logs and events are still pushed by
   exporters and clients via `PushLogs`. Client-side metrics are not
   collected — all metrically interesting operations are observable
   from the exporter side.

**Decision:** Prefer **(4)** for the optional aggregated-metrics + Loki
 path at scale; allow **(1)** in small or dev clusters; **(3)** only if
 review shows a need. A centralized log/event source could still be
 offered when Loki is not available by using the pod logs; this could
 be helpful for testing.

**Rationale:** A dedicated workload can scale and restart independently;
- Loki spikes and ingest load cannot starve lease
- reconciliation in the controller by moving it to a separate service.
+ Loki spikes and ingest load cannot starve lease reconciliation in the
+ controller. The reverse-scrape model **(4)** is preferred over the
+ increment-push model **(2)** because full counter state stays on the
+ exporter — no metrics are lost when the Telemetry service restarts or
+ is temporarily unavailable, and idempotency concerns are eliminated
+ (see **DD-9**).

**Identity enforcement:** The Telemetry service validates the source
- identity of every ingest RPC from the mTLS certificate or
- ServiceAccount token. The `exporter` and `client` labels on incoming
- increments are enforced server-side to match the authenticated
- identity — a compromised or misconfigured exporter cannot submit
- metrics under another exporter's name or inject arbitrary labels.
+ identity of every `MetricsStream` connection and `PushLogs` RPC from + the mTLS certificate or ServiceAccount token. The `exporter` and + `client` labels on incoming data are enforced server-side to match the + authenticated identity — a compromised or misconfigured exporter + cannot submit metrics under another exporter's name or inject + arbitrary labels. **Failure modes:** | Scenario | Behavior | | ------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| Telemetry service unavailable | Exporters and clients treat telemetry RPCs as fire-and-forget with bounded retry (e.g. 1–2 attempts, exponential backoff). Metrics increments are lost; device operations are unaffected. | -| Telemetry pod restart | In-memory counters reset to zero. This is standard Prometheus counter semantics — `rate()` and `increase()` handle resets transparently. | +| Telemetry service unavailable | Exporters keep counting locally; no metrics are lost. When the exporter reconnects, the next scrape returns the full current counter state. Log push RPCs are fire-and-forget with bounded retry; log entries may be lost but device operations are unaffected. | +| Telemetry pod restart | Metric state is rebuilt on the next scrape from each connected exporter — no permanent data loss. Prometheus `rate()` and `increase()` handle the apparent counter reset transparently. | | Loki unreachable | The Telemetry service buffers log entries in a bounded queue (see *Backpressure* in the control-plane section). On overflow, entries are dropped and `jumpstarter_telemetry_dropped_total` incremented. | -| Prometheus scrape fails | No data loss — counters remain in memory; the next successful scrape picks up the current values. 
| +| Prometheus scrape fails | No data loss — the next successful scrape triggers a fresh fan-out to connected exporters and returns current values. | The Telemetry service exposes `/healthz` (liveness) and `/readyz` - (readiness, gated on Loki and Prometheus reachability) endpoints for - Kubernetes probes. - -**Memory budget:** Each in-memory Prometheus series is expected to cost around - 200–300 bytes (labels + counter/histogram state). The bounded label - set `{exporter, operation, result, driver_type}` caps total series: - with 200 exporters × 6 operations × 2 results × 6 driver types = - 14 400 series, costing ~3–4 MB. Adding `error_type` and `direction` - on their respective metrics adds a small multiple. Series that receive - no updates for a configurable TTL (i.e. a default: 10 min) are eligible for - eviction to prevent stale-exporter accumulation after scale-down. + (readiness, gated on Loki reachability and at least one connected + exporter) endpoints for Kubernetes probes. + +**Scrape fan-out:** When Prometheus hits `/metrics`, the Telemetry + service fans out `MetricsScrapeRequest` to **all connected exporters in + parallel** and waits up to `spec.telemetry.metrics.scrapeTimeout` + for responses. Exporters that do not respond in time + are skipped for that scrape — their last-known snapshot is served if + available. + +**Memory budget:** The Telemetry service holds parsed metric snapshots + from connected exporters between Prometheus scrapes. With 200 + exporters each producing ~50 series (bounded by `{operation, result, + driver_type}` label combinations), the total is ~10 000 series at + ~200–300 bytes each, costing ~2–3 MB. Snapshots from exporters that + disconnect are evicted after `spec.telemetry.metrics.staleEvictionTTL` + (default: 10 min) to prevent stale-exporter accumulation after + scale-down. 
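The scrape fan-out described above can be sketched with plain `asyncio`. This is a simplification under stated assumptions: each connected exporter's `MetricsStream` round-trip is modeled as an async callable returning OpenMetrics text, timed-out exporters are simply skipped (serving a cached last-known snapshot, as described above, is left out of the sketch), and merging is shown as concatenation rather than real OpenMetrics parsing and relabeling:

```python
import asyncio

async def scrape_fanout(exporters, timeout=7.0):
    """Fan out one scrape to all connected exporters in parallel.

    `exporters` maps exporter name -> async callable standing in for a
    MetricsScrapeRequest/MetricsScrapeResponse round-trip. Exporters that
    do not answer within `timeout` (mirroring scrapeTimeout, default 7 s)
    are omitted from this scrape's merged output.
    """
    async def one(name, fetch):
        try:
            return name, await asyncio.wait_for(fetch(), timeout)
        except (asyncio.TimeoutError, ConnectionError):
            # A real service would increment
            # jumpstarter_scrape_timeouts_total{exporter=name} here.
            return name, None

    results = await asyncio.gather(*(one(n, f) for n, f in exporters.items()))
    # Merge: a real implementation parses the OpenMetrics text and attaches
    # per-exporter labels; concatenation is enough to show the shape.
    return "\n".join(text for _, text in results if text is not None)
```

The key property illustrated is that one slow exporter delays the response by at most `timeout` and never blocks the metrics of the others.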
### DD-8: Multiple Telemetry replicas (HA) and addable counters @@ -944,6 +967,7 @@ and be fixed before "Implemented".* | `jumpstarter_stream_bytes_total` | counter | `exporter`, `driver_type`, `direction` | Bytes transferred (tx/rx) on streams. | | `jumpstarter_active_sessions` | gauge | `exporter` | Currently active lease sessions. | | `jumpstarter_lease_acquisitions_total` | counter | `result` | Lease acquire attempts (controller). | +| `jumpstarter_scrape_timeouts_total` | counter | `exporter` | Scrape fan-out timeouts per exporter (Telemetry-side). | All counters and histograms carry exemplar keys from the operator's `exemplarKeys` allowlist (by default `client` and `lease_id`; `trace_id` @@ -961,6 +985,7 @@ on every observation. | `jumpstarter_active_sessions` | Dashboard | yes | 0 sessions for > 30 min (possible exporter issue). | | `jumpstarter_lease_acquisitions_total` | Dashboard | yes | Failure rate > 10 % over 15 min. | | `jumpstarter_telemetry_dropped_total` | Alerting | yes | Any increment (telemetry pipeline saturated). | +| `jumpstarter_scrape_timeouts_total` | Alerting | yes | Repeated timeouts for same exporter (connectivity or load issue). | Thresholds are suggestions; operators should tune them to their environment. The operator should ship a set of example `PrometheusRule` @@ -1024,6 +1049,12 @@ sum by (error_type) (rate(jumpstarter_operation_errors_total{driver_type="storag sum by (exporter, direction) (rate(jumpstarter_stream_bytes_total[5m])) ``` +**Exporters with repeated scrape timeouts (last 30 min):** + +```promql +topk(10, sum by (exporter) (increase(jumpstarter_scrape_timeouts_total[30m]))) +``` + **HA Telemetry: aggregate across replicas (drop pod/instance):** ```promql @@ -1224,7 +1255,7 @@ can tune metrics, logging, and exemplar behavior without editing code. 
| Field | Type | Default | Description | | ----------------------------------------- | ---------- | ------------------------------------------------ | ---------------------------------------------------------------------------------------------- | | `spec.telemetry.enabled` | `bool` | `false` | Deploy the optional Telemetry service. | -| `spec.telemetry.loki.url` | `string` | — | Loki push endpoint; required when Telemetry is enabled. | +| `spec.telemetry.loki.url` | `string` | — | Loki push endpoint; optional — Telemetry can run metrics-only without Loki. | | `spec.telemetry.loki.secretRef` | `string` | — | Secret with Loki credentials (see **DD-5**). | | `spec.telemetry.loki.tls.caSecretRef` | `string` | — | Secret containing a CA bundle (`ca.crt` key) to trust for the Loki endpoint. | | `spec.telemetry.loki.tls.insecureSkipVerify` | `bool` | `false` | Disable TLS certificate verification (development/testing only). | @@ -1233,7 +1264,9 @@ can tune metrics, logging, and exemplar behavior without editing code. | `spec.telemetry.metrics.driverTypeEnum` | `[]string` | `["power", "storage", "network", "serial", …]` | Allowed `driver_type` label values. Drivers reporting an unlisted type are mapped to `other`. | | `spec.telemetry.metrics.serviceMonitor` | `bool` | `true` | Create `ServiceMonitor` CRDs for Prometheus autodiscovery. | | `spec.telemetry.metrics.prometheusRules` | `bool` | `false` | Deploy starter `PrometheusRule` CRDs (opt-in). | -| `spec.telemetry.backpressure.queueDepth` | `int` | `10000` | Ring buffer depth per destination (see backpressure design above). | +| `spec.telemetry.metrics.scrapeTimeout` | `duration` | `7s` | Max time to wait for parallel exporter responses during a `/metrics` fan-out. Must leave headroom within the Prometheus-side `scrape_timeout` | +| `spec.telemetry.metrics.staleEvictionTTL` | `duration` | `10m` | Evict metric snapshots from disconnected exporters after this duration. 
| +| `spec.telemetry.backpressure.queueDepth` | `int` | `10000` | Ring buffer depth for Loki log push queue. | **Example CR snippet:** @@ -1267,6 +1300,8 @@ spec: - composite serviceMonitor: true prometheusRules: true + scrapeTimeout: "7s" + staleEvictionTTL: "10m" backpressure: queueDepth: 20000 ``` From 61ccbea84270c063a034339eaddc973e60acfd3b Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Thu, 30 Apr 2026 12:56:28 +0200 Subject: [PATCH 33/39] jep-0011: update DD-8 for exporter-sticky connections, no caching MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Rewrite DD-8 with new alternative (4) for exporter-sticky MetricsStream connections. Only metrics received during the current scrape fan-out are served — no cached or stale data is ever returned. Snapshots are discarded after each /metrics response flush, removing staleEvictionTTL entirely. Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 77 +++++++++++-------- 1 file changed, 45 insertions(+), 32 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index 7274a52a6..6805365d2 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -728,27 +728,30 @@ is additive and does not require adopting OTel as a project dependency. **Scrape fan-out:** When Prometheus hits `/metrics`, the Telemetry service fans out `MetricsScrapeRequest` to **all connected exporters in parallel** and waits up to `spec.telemetry.metrics.scrapeTimeout` - for responses. Exporters that do not respond in time - are skipped for that scrape — their last-known snapshot is served if - available. - -**Memory budget:** The Telemetry service holds parsed metric snapshots - from connected exporters between Prometheus scrapes. 
With 200 - exporters each producing ~50 series (bounded by `{operation, result, - driver_type}` label combinations), the total is ~10 000 series at - ~200–300 bytes each, costing ~2–3 MB. Snapshots from exporters that - disconnect are evicted after `spec.telemetry.metrics.staleEvictionTTL` - (default: 10 min) to prevent stale-exporter accumulation after - scale-down. - -### DD-8: Multiple Telemetry replicas (HA) and addable counters - -**Context:** The Telemetry process holds in-memory counters. Exporters send -+1 (e.g. flash success), +N (bytes read/written), or +1 per -reporting interval (e.g. one “inactive” minute for a lease with -labels `exporter`, `operation`, `result`, `driver_type`). High-cardinality -context (`lease_id`, `client`, `trace_id`) is attached via exemplars, not -labels. + (default: 7 s) for responses. **Only metrics received during the + current fan-out are included in the response.** Exporters that do not + respond in time are omitted entirely — no cached or stale data is + ever served. This eliminates any risk of double-counting from stale + connections where the exporter may have already migrated to another + replica (see **DD-8**). + +**Memory budget:** During a scrape fan-out the Telemetry service + temporarily holds metric snapshots from responding exporters until the + merged response is written to Prometheus. With 200 exporters each + producing ~50 series (bounded by `{operation, result, driver_type}` + label combinations), the peak is ~10 000 series at ~200–300 bytes + each, costing ~2–3 MB. Snapshots are discarded as soon as the + `/metrics` response is flushed — no metric data is retained between + scrapes. + +### DD-8: Multiple Telemetry replicas (HA) and exporter-sticky connections + +**Context:** With the reverse-scrape model (see **DD-3** alternative 4 +and *API / Protocol Changes*), the Telemetry service does not hold +authoritative counter state — exporters maintain their own local +`prometheus_client` registries. 
The Telemetry service only caches the +latest metric snapshot per exporter. Each exporter opens a single +long-lived `MetricsStream` to one Telemetry replica. **Alternatives considered:** @@ -764,15 +767,27 @@ labels. additive; increments are partitioned by traffic). 3. **Strong consistency** (Raft, Redis as source of truth for counters) — higher operating cost than this JEP’s v1 scope. - -**Decision:** **(2)** - -**Rationale:** Sums of cumulative counters across replicas are - meaningful when each event is not double-applied; Loki - appends are naturally per-replica as well. A possible failure mode is - duplicate increments (retries, at-least-once RPCs). But this - is informative data, and eventual (and very low chance) of duplication does not - justify a more complex design — see **DD-9**. +4. **Multiple replicas with exporter-sticky connections** — each exporter + opens a single `MetricsStream` to one replica (sticky by stream). + Each replica only caches metric snapshots for its connected + exporters. Prometheus scrapes all replicas (via `PodMonitor`); + `sum by (exporter, operation, result, driver_type) (…)` after + dropping `pod` / `instance` yields the exact global total with no + double-counting, because each exporter’s metrics appear on exactly + one replica’s `/metrics` output. On replica failure the exporter + reconnects to a survivor and the next scrape returns its full + current counter state — no data is lost. + +**Decision:** **(4)** + +**Rationale:** Exporter-sticky connections naturally partition metric + snapshots across replicas with no overlap, so `sum` across replicas + is exact and double-counting is impossible. Full counter state lives + on the exporter, not on the Telemetry service, so replica restarts + or failovers cause no data loss. Loki log pushes (`PushLogs`) are + naturally per-replica as well and do not require deduplication. + Alternative (3) adds operational complexity with no benefit given + the reverse-scrape model. 
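The exporter-local registry that this model assumes can be sketched with standard `prometheus_client` primitives. The metric name, label values, and helper functions below are illustrative, not part of the JEP's fixed schema (note that `prometheus_client` appends `_total` to counters on exposition):

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# Local registry whose serialized state the exporter would return over
# MetricsStream; the bounded label set follows the JEP's guidelines.
registry = CollectorRegistry()
operations = Counter(
    "jumpstarter_operations",  # exposed as jumpstarter_operations_total
    "Driver operations by outcome",
    ["operation", "result", "driver_type"],
    registry=registry,
)

def record_flash(success):
    # Hypothetical driver hook: count one flash attempt by outcome.
    outcome = "success" if success else "failure"
    operations.labels("flash", outcome, "storage").inc()

def scrape_snapshot():
    # What the exporter would place in MetricsScrapeResponse.metrics_text.
    return generate_latest(registry)
```

Because the counters live in the exporter process, a Telemetry replica restart or failover changes nothing on the exporter side — the next `scrape_snapshot()` still returns the full cumulative state.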
### DD-9: Idempotency vs. best-effort (acceptable over-count for informative metrics) @@ -1265,7 +1280,6 @@ can tune metrics, logging, and exemplar behavior without editing code. | `spec.telemetry.metrics.serviceMonitor` | `bool` | `true` | Create `ServiceMonitor` CRDs for Prometheus autodiscovery. | | `spec.telemetry.metrics.prometheusRules` | `bool` | `false` | Deploy starter `PrometheusRule` CRDs (opt-in). | | `spec.telemetry.metrics.scrapeTimeout` | `duration` | `7s` | Max time to wait for parallel exporter responses during a `/metrics` fan-out. Must leave headroom within the Prometheus-side `scrape_timeout` | -| `spec.telemetry.metrics.staleEvictionTTL` | `duration` | `10m` | Evict metric snapshots from disconnected exporters after this duration. | | `spec.telemetry.backpressure.queueDepth` | `int` | `10000` | Ring buffer depth for Loki log push queue. | **Example CR snippet:** @@ -1301,7 +1315,6 @@ spec: serviceMonitor: true prometheusRules: true scrapeTimeout: "7s" - staleEvictionTTL: "10m" backpressure: queueDepth: 20000 ``` From 3e718fe6c04b3498b5d9f86b9a92634d75cb3afd Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Thu, 30 Apr 2026 12:58:02 +0200 Subject: [PATCH 34/39] jep-0011: update DD-9 for reverse-scrape, scope idempotency to PushLogs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Metrics idempotency is moot with reverse-scrape (full counter state per scrape). Reframe DD-9 around PushLogs deduplication only — best-effort at-least-once is acceptable for diagnostic logs. 
Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 37 +++++++++++-------- 1 file changed, 22 insertions(+), 15 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index 6805365d2..e32e7e951 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -789,24 +789,31 @@ long-lived `MetricsStream` to one Telemetry replica. Alternative (3) adds operational complexity with no benefit given the reverse-scrape model. -### DD-9: Idempotency vs. best-effort (acceptable over-count for informative metrics) +### DD-9: Idempotency vs. best-effort -**Alternatives considered:** - -1. **Idempotent** increments (deduplication keys or idempotent RPCs) — - appropriate for billing- or SLO-sensitive series; more design - and storage in the ingest path. -2. **Best effort** (at-least-once) `+1` / `+N` without global deduplication - — simpler; rare extra counts on retries or replays (see - **DD-8**). -3. “Exactly once in the exporter; Telemetry is a dumb adder” — - still **(2)** at the edge if the network retries. +**Context:** With the reverse-scrape model, metrics idempotency is a +non-issue — each scrape returns the full current counter state from the +exporter, so there are no increments to deduplicate or double-count. +The only remaining idempotency concern is for `PushLogs` RPCs, where +a retry could result in duplicate log entries in Loki. -**Decision:** (2) +**Alternatives considered:** -**Rationale:** Simpler RPCs and no global dedup store in v1; - operators treat these numbers as order-of-magnitude signals, - not invoices, unless policy changes. +1. **Idempotent** log pushes (deduplication keys per `LogEntry`) — + appropriate for billing- or SLO-sensitive log pipelines; requires + a dedup store or Loki-side dedup. +2. 
**Best effort** (at-least-once) for `PushLogs` without global + deduplication — simpler; rare duplicate log entries on retries. +3. **Metrics idempotency** (dedup keys on metric increments) — no + longer applicable; the reverse-scrape model returns full state, + making increment deduplication moot. + +**Decision:** (2) for `PushLogs`; metrics idempotency is not needed. + +**Rationale:** Duplicate log entries from occasional retries are + acceptable for informative/diagnostic logs. Loki queries are + tolerant of rare duplicates. No global dedup store is needed in v1; + operators treat these logs as diagnostic signals, not audit trails. ### DD-10: Perses over Grafana for dashboarding From 3bd48b7935c43f6972f3ec47ed57b936b3216bff Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Thu, 30 Apr 2026 13:06:47 +0200 Subject: [PATCH 35/39] jep-0011: rewrite control-plane aggregation for reverse-scrape model MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Replace increment-based language with reverse-scrape: exporters maintain local prometheus_client registries, Telemetry fans out MetricsScrapeRequest on each Prometheus hit. Scope backpressure to Loki log push path only — metrics are pull-based and transient. 
Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 54 ++++++++++--------- 1 file changed, 29 insertions(+), 25 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index e32e7e951..dd2cc83b7 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -1121,45 +1121,49 @@ sum by (client) ( When this mode is enabled in a deployment: -- Exporters and clients (`jmp`) send increments (`+1` / - `+N`) and structured log/event records to the optional - `jumpstarter-telemetry` service (name TBD, see **DD-7**). This dedicated - `Service` uses the same mTLS / ServiceAccount / NetworkPolicy model as - Controller and Router; it holds in-memory counters, POSTs to - the Loki API, and exposes `/metrics` for Prometheus scrape - (**DD-3**). HA (multiple replicas) uses `sum` in PromQL (**DD-8**); - best-effort duplicate tolerance (**DD-9**). Exporter and edge processes never +- Exporters maintain local `prometheus_client` registries and open a + `MetricsStream` to the optional `jumpstarter-telemetry` service + (**DD-7**). On each Prometheus scrape the Telemetry service fans out + `MetricsScrapeRequest` to all connected exporters in parallel, merges + the responses, and serves the combined output on `/metrics` + (**DD-3**). HA (multiple replicas with exporter-sticky connections) + uses `sum` in PromQL (**DD-8**). Exporter and edge processes never need Loki or cluster-scrape credentials directly (**DD-5**). +- Exporters and clients (`jmp`) push structured log entries to the + Telemetry service via `PushLogs`. The Telemetry service forwards + these to Loki. Best-effort duplicate tolerance applies (**DD-9**). - Controller and Router emit structured JSON logs to stdout (see **DD-4**). 
They do not push logs directly to Loki; a cluster-level log shipper (Promtail, Grafana Alloy, Vector, or equivalent) scrapes their pod logs and delivers them to Loki. This decouples the reconciler and session-handling hot paths from Loki availability. - **Backpressure:** The Telemetry service uses a bounded ring buffer - per destination (Loki push, metric ingest) with a configurable depth - (default: 10 000 entries). On overflow, dropped entries are replaced - by a single **drop marker** — a synthetic log entry recording the - count of dropped entries and the time window. Subsequent drops while - the buffer is still full accumulate into the same marker rather than - adding new entries, so the queue always retains one slot for the - current drop summary. When the buffer drains and the marker is - flushed, the downstream log contains an explicit record such as + for the Loki log push path with a configurable depth + (default: 10 000 entries, see `spec.telemetry.backpressure.queueDepth`). + On overflow, dropped entries are replaced by a single **drop marker** + — a synthetic log entry recording the count of dropped entries and the + time window. Subsequent drops while the buffer is still full + accumulate into the same marker rather than adding new entries, so the + queue always retains one slot for the current drop summary. When the + buffer drains and the marker is flushed, the downstream log contains + an explicit record such as `{"level":"warn","msg":"entries dropped","count":142,"window_seconds":12}`. A `jumpstarter_telemetry_dropped_total` counter (partitioned by - `destination={loki,metrics}`) is also incremented on `/metrics` for - alerting. Because the Controller and Router no longer push to Loki, - their lease/session operations are inherently isolated from Loki or - metrics path slowdowns. + `destination={loki}`) is also incremented on `/metrics` for alerting. 
+ Metrics do not need backpressure — the reverse-scrape model is + pull-based and transient (no buffering between scrapes). + Because the Controller and Router do not push to Loki, their + lease/session operations are inherently isolated from Loki slowdowns. - **Multi-tenancy:** write-side tenant scoping (e.g. namespace-based separation in Loki and Prometheus) is a deployment concern handled by the log shipper and Prometheus configuration. Read-side access control (who can query which metrics or logs) is likewise a deployment concern and out of scope for this JEP. -- This does not require that *all* metrics *originate* in a single - process: the exporter and drivers still emit the facts; - Telemetry aggregates and ships to Loki; Controller and - Router expose `/metrics` for Prometheus scrape and rely on the - log shipper for their stdout logs. +- Metric facts originate on the exporter (local `prometheus_client` + counters/histograms); the Telemetry service is a transparent + scrape-aggregation proxy. Controller and Router expose their own + `/metrics` for Prometheus scrape and rely on the log shipper for + their stdout logs. ### High-level data flow From 03e033587a62368ca014ef2c2c9e614f443e4fcd Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Thu, 30 Apr 2026 13:10:03 +0200 Subject: [PATCH 36/39] jep-0011: update data flow diagrams for reverse-scrape model Exporter diagram shows bidirectional MetricsStream and PushLogs. Telemetry diagram shows Prometheus-initiated scrape with fan-out to exporters. Summary references exporter-sticky connections. 
Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 34 +++++++++++-------- 1 file changed, 19 insertions(+), 15 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index dd2cc83b7..72dfdaaad 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -1184,25 +1184,30 @@ logs to the Telemetry service for Loki ingest (see **DD-4**). flowchart LR ctrl[jumpstarter-controller] -->|lease lifecycle| exp[Exporter] drv[Drivers] --> exp - exp -->|increments, events, logs| tel[jumpstarter-telemetry] + exp <-->|MetricsStream| tel[jumpstarter-telemetry] + exp -->|PushLogs| tel ``` -The Controller assigns leases; the Exporter delegates to Drivers -and forwards metrics increments, operational events, and logs to -Telemetry (see **DD-2**, **DD-5**, **DD-7**). +The Controller assigns leases; the Exporter delegates to Drivers and +maintains local `prometheus_client` counters. It opens a `MetricsStream` +to Telemetry for reverse-scrape and pushes structured logs via `PushLogs` +(see **DD-2**, **DD-3**, **DD-5**, **DD-7**). #### Telemetry to backends ```{mermaid} flowchart LR - tel[jumpstarter-telemetry] -->|JSON stdout| shipper[Log shipper] - shipper -->|pod logs| loki[(Loki)] - tel -->|push API| loki - tel -->|/metrics| prom[(Prometheus)] + prom[(Prometheus)] -->|scrape /metrics| tel[jumpstarter-telemetry] + tel <-->|MetricsStream fan-out| exp[Exporters] + tel -->|push API| loki[(Loki)] + tel -->|JSON stdout| shipper[Log shipper] + shipper -->|pod logs| loki ``` -Telemetry aggregates exporter and client data and writes to Loki and -exposes `/metrics` for Prometheus scrape (**DD-3**, **DD-7**). 
+On each Prometheus scrape, Telemetry fans out `MetricsScrapeRequest` to +all connected exporters in parallel, merges responses, and serves the +combined output. Logs received via `PushLogs` are forwarded to Loki +(**DD-3**, **DD-7**, **DD-8**). #### Controller to backends @@ -1230,11 +1235,10 @@ The Router writes structured JSON to stdout (see **DD-4**). A cluster log shipper scrapes pod logs and delivers them to Loki. The Router exposes `/metrics` for routing and session-level counters. -The diagrams above summarize the hub model described in *Control-plane -aggregation*. For credential isolation see **DD-5**; for the Telemetry -Deployment see **DD-7**; for HA summing see **DD-8**; for best-effort -semantics see **DD-9**. Optional direct exporter→Loki and `/metrics` -scrape on Exporter Pods remain valid for deployments that prefer them. +The diagrams above summarize the reverse-scrape hub model described in +*Control-plane aggregation*. For credential isolation see **DD-5**; for +the Telemetry Deployment see **DD-7**; for HA with exporter-sticky +connections see **DD-8**; for best-effort log semantics see **DD-9**. No OpenTelemetry Collector is *required* (see **DD-6**); operators may run one *alongside* and scrape the same targets if they choose. From 8f9057ab92d057e29fe5a5a36a17c780ebd052f2 Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Thu, 30 Apr 2026 13:15:05 +0200 Subject: [PATCH 37/39] jep-0011: update phases, concepts, and prose for reverse-scrape model Update Phase 2/3 descriptions, Concepts, What users see, high-freq byte counters, DD-5 validation text, and test sections to reflect exporter-local prometheus_client registries and reverse-scrape via MetricsStream. Remove stale increment-push language throughout. 
Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 53 ++++++++++--------- 1 file changed, 27 insertions(+), 26 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index 72dfdaaad..63f55fcd0 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -33,8 +33,8 @@ and compatibility rules. | Phase | Scope | Key deliverables | | ----- | ----- | ---------------- | | 1 | Structured logging + lease context | `spec.context` CRD field; JSON structured logs for all long-running services; correlation fields (`lease_id`, `exporter`, `operation`, `result`) in every log line. | -| 2 | Metrics endpoints | `/metrics` scrape endpoints on Controller and Router; exporter counter/histogram/gauge metrics with `driver_type`; Prometheus exemplars for high-cardinality context. | -| 3 | Telemetry service | Optional `jumpstarter-telemetry` Deployment managed by the operator; exporter and client data aggregation; Loki push for edge-originated logs and events. | +| 2 | Metrics endpoints | `/metrics` scrape endpoints on Controller and Router; exporter-local `prometheus_client` counters/histograms/gauges with `driver_type`; Prometheus exemplars for high-cardinality context. | +| 3 | Telemetry service | Optional `jumpstarter-telemetry` Deployment managed by the operator; reverse-scrape of exporter metrics via `MetricsStream`; Loki push for edge-originated logs and events. | | 4 | In-cluster log scraping | Operator configures log shipper integration (Promtail, Grafana Alloy, Vector) for Controller/Router pod logs; `ServiceMonitor` CRDs for Prometheus autodiscovery. | | 5 | Dashboards + alerting | Perses CRD dashboards; starter alert rules; documentation and operator integration. 
| @@ -97,13 +97,13 @@ exporter-level metrics that a monitoring stack can scrape or receive. its power sub-driver emits `driver_type="power"`, and so on. Any top-level methods on the composite driver itself (e.g. VM lifecycle) emit `driver_type="composite"`. -- **Jumpstarter Telemetry** (optional) — a dedicated - component with a well-known ingest path and the same trust - model (mTLS, ServiceAccount) as Controller/Router; - it isolates Loki/series work from the reconciler hot path (see - **DD-7**). Multi-replica HA and PromQL `sum` aggregation are - covered in **DD-8**; best-effort idempotency for informative metrics in - **DD-9**. +- **Jumpstarter Telemetry** (optional) — a dedicated component that + reverse-scrapes connected exporters for metrics via `MetricsStream` + and receives structured logs via `PushLogs`, using the same trust + model (mTLS, ServiceAccount) as Controller/Router. It isolates + Loki/series work from the reconciler hot path (see **DD-7**). + Multi-replica HA with exporter-sticky connections is covered in + **DD-8**; best-effort log deduplication in **DD-9**. ### What users see @@ -113,13 +113,15 @@ exporter-level metrics that a monitoring stack can scrape or receive. identifier, image digest, or VCS. - The controller and/or data plane write structured, annotated log events (see **DD-2**) for significant operations such as flash attempts and outcomes. -- Exporters send increments to the Jumpstarter Telemetry - service over the existing exporter↔control-plane trust boundary; - the in-cluster side then POSTs to Loki and exposes `/metrics` - for scrape (see **DD-3**, **DD-7**), with cluster credentials, avoiding - per-exporter Loki and metrics secrets. The same path can carry operator-chosen structured log lines - and events (not unbounded default client chatter — see *Control-plane - aggregation* below). 
+- Exporters maintain local `prometheus_client` counters and open a + `MetricsStream` to the Jumpstarter Telemetry service over the + existing exporter↔control-plane trust boundary. On each Prometheus + scrape, the Telemetry service fans out to connected exporters and + serves the merged `/metrics` output (see **DD-3**, **DD-7**), with + cluster credentials — avoiding per-exporter Loki and metrics secrets. + Exporters and clients also push structured log entries via `PushLogs` + (not unbounded default chatter — see *Control-plane aggregation* + below). - The `jmp` CLI output remains human-readable, but when a Telemetry endpoint is available, `jmp` also pushes structured JSON logs to the Jumpstarter Telemetry service for Loki ingest. @@ -587,7 +589,7 @@ are still useful for selection and for tools that only understand metadata. 4. **Dedicated Jumpstarter Telemetry Deployment** (see **DD-7**) instead of folding everything into the Controller — only Telemetry holds Loki-push credentials; isolated failure domain - and scaling for high-volume increments. Router and Controller + and scaling for reverse-scrape and log ingest. Router and Controller write structured JSON to stdout (see **DD-4**) and expose `/metrics` for Prometheus scrape; a cluster log shipper delivers their pod logs to Loki without Jumpstarter-specific Loki credentials. @@ -653,8 +655,8 @@ its configuration model, receivers, processors, and exporters — overhead that is not justified when the data paths are known in advance. Additionally, the Telemetry service operates inside Jumpstarter's existing authentication and trust domain (mTLS, registered client and -exporter identities). It can validate that an incoming increment -actually originates from the claimed exporter or client — preventing +exporter identities). 
It can validate that an incoming `MetricsStream` +or `PushLogs` call originates from the claimed exporter or client — preventing impersonation or label injection — without requiring a separate auth layer. A generic OTel Collector has no awareness of Jumpstarter identities and would need external policy to achieve the same guarantee. @@ -1016,12 +1018,11 @@ These rules are opt-in and disabled by default to avoid noise in environments with different baselines. **High-frequency byte counters:** `jumpstarter_stream_bytes_total` can -be incremented at very high rates on serial and video streams. Exporters -must pre-aggregate byte counts locally and flush a single `+N` increment -to the Telemetry service at a configurable interval (default: every 5 s -or every 64 KiB, whichever comes first) rather than sending a per-read -or per-write RPC. This bounds telemetry RPC volume independently of -stream throughput. +be incremented at very high rates on serial and video streams. Because +metrics live in the exporter's local `prometheus_client` registry, high +update rates do not generate any RPC traffic — the counter is updated +in-process and only serialized when the Telemetry service sends a +`MetricsScrapeRequest`. ### Example queries @@ -1393,7 +1394,7 @@ and should only be used in development or testing environments. correlation fields (`lease_id`, `exporter`, …) and that exporter pods do not require Loki or cluster-scrape credentials in their spec. - If Telemetry runs with >1 replica: one test verifies that - `sum` by business labels (dropping `pod`/`instance`) matches expected totals after partitioned increments (see **DD-8**). + `sum` by business labels (dropping `pod`/`instance`) matches expected totals with exporter-sticky connections (see **DD-8**). - Lease with metadata: objects validate; events or status updates match expected structure. 
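To make the exporter-local registry model concrete, here is a minimal Python sketch of the behavior described for high-frequency byte counters: updates mutate in-process state, and serialization to Prometheus text format happens only at scrape time. `LocalCounter` is a hypothetical stand-in used for illustration — a real exporter would use `prometheus_client` counters and registries directly:

```python
from collections import defaultdict


class LocalCounter:
    """Illustrative stand-in for an exporter-local prometheus_client
    Counter: all state lives inside the exporter process."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = label_names
        self.values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # In-process update: no RPC, no matter how often this is called.
        key = tuple(labels[n] for n in self.label_names)
        self.values[key] += amount

    def expose(self):
        # Serialization happens only when the Telemetry service sends a
        # scrape request over MetricsStream (reverse-scrape model).
        lines = [f"# TYPE {self.name} counter"]
        for key, value in sorted(self.values.items()):
            labels = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{labels}}} {value}")
        return "\n".join(lines) + "\n"


stream_bytes = LocalCounter(
    "jumpstarter_stream_bytes_total",
    ("exporter", "driver_type", "direction"),
)

# A serial driver incrementing at high frequency: zero telemetry RPCs.
for _ in range(1000):
    stream_bytes.inc(64, exporter="exporter-a", driver_type="serial",
                     direction="rx")

print(stream_bytes.expose())
```

The point of the sketch is the asymmetry: a thousand `inc()` calls touch only a local dict, and the single `expose()` call is the only work tied to scrape frequency.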
From 36fe66afc18cb4abcd9338728161a059c4ca058d Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Thu, 30 Apr 2026 13:23:51 +0200 Subject: [PATCH 38/39] jep-0011: add AuditStream removal and push-increments to backward compat and rejected alternatives Document AuditStream removal as no-op in Backward Compatibility. Add rejected alternatives for reusing AuditStream and for pushing metric increments instead of reverse-scrape. Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 23 +++++++++++++++++++ 1 file changed, 23 insertions(+) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index 63f55fcd0..0ac2a0feb 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -1488,6 +1488,15 @@ all subsequent phases have E2E coverage from the start. - gRPC: new metadata must be additive; servers tolerate missing trace and context fields from older clients; clients ignore unknown fields where applicable. +- **`AuditStream` removal:** The `AuditStream` RPC and `AuditStreamRequest` + message on `ControllerService` are removed. This RPC was never implemented + or called by any client — `Grep` across the codebase confirms zero usage + outside its protobuf definition. Removing it is a no-op for all existing + deployments. The new `PushLogs` RPC on `TelemetryService` supersedes the + intended use case. +- `LogStreamResponse` enrichment (new optional fields `driver_type`, + `operation`, `timestamp`, `structured_fields`) is purely additive and + backward-compatible — existing clients ignore unknown fields. - No removal of current default CLI behavior; JSON logging only when selected. ## Consequences @@ -1551,6 +1560,20 @@ all subsequent phases have E2E coverage from the start. unscalable for joins with traces and multi-service incidents. 
- **"Mandatory full tracing for every command"** — high overhead; rejected; prefer sampling and opt-in for heavy paths. +- **"Push metric increments from exporters to telemetry"** — exporters + would send `+1`/`+N` counter increments and histogram observations to + the Telemetry service, which would maintain in-memory counters and + expose them on `/metrics`. Rejected because: (a) counter state would + be lost on Telemetry restart, (b) retries introduce double-counting + requiring idempotency logic, and (c) high-frequency counters (e.g. + stream bytes) generate excessive RPC traffic. The reverse-scrape model + keeps full counter state on the exporter and generates zero RPC + traffic between scrapes (see **DD-3** alternative 4, **DD-7**). +- **"Reuse `AuditStream` for telemetry log push"** — `AuditStream` was an + unimplemented stub on `ControllerService` with no message schema for + structured telemetry data. Rather than retrofitting it, a purpose-built + `PushLogs` RPC on the new `TelemetryService` provides a cleaner contract + and separates telemetry from the controller's reconciliation API. 
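For the accepted `PushLogs` path, log delivery is best-effort behind the bounded queue with an accumulating drop marker described under *Backpressure*. The following is a simplified Python model of that queue logic — the real implementation is expected to be Go, `BoundedLogQueue` is an illustrative name, and the real marker additionally records the drop time window:

```python
class BoundedLogQueue:
    """Illustrative model of the Loki-push queue with a single
    accumulating drop marker (simplified: no time-window tracking)."""

    def __init__(self, depth=10_000):
        self.depth = depth
        self.entries = []
        self.drop_marker = None

    def push(self, entry):
        # One slot is always reserved for the current drop summary.
        if self.drop_marker is None and len(self.entries) < self.depth - 1:
            self.entries.append(entry)
            return
        # Overflow: accumulate into a single marker instead of
        # appending new entries.
        if self.drop_marker is None:
            self.drop_marker = {"level": "warn", "msg": "entries dropped",
                                "count": 0}
        self.drop_marker["count"] += 1

    def drain(self):
        # When the buffer drains, the marker is flushed downstream as an
        # explicit record of how many entries were lost.
        flushed = self.entries + ([self.drop_marker] if self.drop_marker else [])
        self.entries, self.drop_marker = [], None
        return flushed


q = BoundedLogQueue(depth=4)
for i in range(10):
    q.push({"msg": f"entry-{i}"})
flushed = q.drain()
# Three entries survive (depth minus the reserved slot); the remaining
# seven are summarized in one marker rather than seven markers.
```

Because overflow produces exactly one summary entry, a saturated queue cannot amplify its own load, and the downstream Loki stream still carries an explicit record of the loss.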
## Prior Art From 339325bd7bc8af7a77c6725012dca888f3370b93 Mon Sep 17 00:00:00 2001 From: Miguel Angel Ajo Pelayo Date: Thu, 30 Apr 2026 13:43:54 +0200 Subject: [PATCH 39/39] jep-0011: consistency fixes and improved OTel justification - Fix unclosed parenthesis in author field - Add missing jumpstarter_telemetry_dropped_total to proposed metrics - Align driver_type enum across Concepts and operator config - Clarify namespace is application-emitted, not shipper-injected - Fix phase numbering after new exporter drivers telemetry phase - Restructure DD-6 rationale: lead with identity enforcement as primary argument, acknowledge OTel Collector overlap, add "When OTel is the better choice" section per review feedback - Reframe future OTLP extension as push output (not ingest) - Tighten DD-7 decision statement and DD-5 rationale wording - State spec.context placement in PushLogs LogEntry extra_fields - Fix scrapeTimeout description and implementation history formatting Made-with: Cursor --- .../JEP-0011-observability-telemetry-logs.md | 127 ++++++++++-------- 1 file changed, 74 insertions(+), 53 deletions(-) diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md index 0ac2a0feb..b4e1a83f7 100644 --- a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -4,7 +4,7 @@ | ----------------- | --------------------------------------------------------------------- | | **JEP** | 0011 | | **Title** | Metrics, Tracing, and Log Observability | -| **Author(s)** | @mangelajo (Miguel Angel Ajo Pelayo | +| **Author(s)** | @mangelajo (Miguel Angel Ajo Pelayo ) | | **Status** | Discussion | | **Type** | Standards Track | | **Created** | 2026-04-23 | @@ -30,13 +30,14 @@ and compatibility rules. 
### Phases -| Phase | Scope | Key deliverables | -| ----- | ----- | ---------------- | -| 1 | Structured logging + lease context | `spec.context` CRD field; JSON structured logs for all long-running services; correlation fields (`lease_id`, `exporter`, `operation`, `result`) in every log line. | -| 2 | Metrics endpoints | `/metrics` scrape endpoints on Controller and Router; exporter-local `prometheus_client` counters/histograms/gauges with `driver_type`; Prometheus exemplars for high-cardinality context. | -| 3 | Telemetry service | Optional `jumpstarter-telemetry` Deployment managed by the operator; reverse-scrape of exporter metrics via `MetricsStream`; Loki push for edge-originated logs and events. | -| 4 | In-cluster log scraping | Operator configures log shipper integration (Promtail, Grafana Alloy, Vector) for Controller/Router pod logs; `ServiceMonitor` CRDs for Prometheus autodiscovery. | -| 5 | Dashboards + alerting | Perses CRD dashboards; starter alert rules; documentation and operator integration. | +| Phase | Scope | Key deliverables | +| ----- | ---------------------------------- | ---------------- | +| 1 | Structured logging + lease context | `spec.context` CRD field; JSON structured logs for all long-running services; correlation fields (`lease_id`, `exporter`, `operation`, `result`) in every log line. | +| 2 | Metrics endpoints | `/metrics` scrape endpoints on Controller and Router; exporter-local `prometheus_client` counters/histograms/gauges with `driver_type`; Prometheus exemplars for high-cardinality context. | +| 3 | Telemetry service | Optional `jumpstarter-telemetry` Deployment managed by the operator; reverse-scrape of exporter metrics via `MetricsStream`; Loki push for edge-originated logs and events. | +| 4 | Exporter drivers telemetry | Provides a clean architecture to let drivers generate their own telemetry data. 
| +| 5 | In-cluster log scraping | Operator configures log shipper integration (Promtail, Grafana Alloy, Vector) for Controller/Router pod logs; `ServiceMonitor` CRDs for Prometheus autodiscovery. | +| 6 | Dashboards + alerting | Perses CRD dashboards; starter alert rules; documentation and operator integration. | Each phase is independently useful and builds on the previous ones. Phase 1 can ship without any later phase; operators who only need @@ -89,7 +90,8 @@ exporter-level metrics that a monitoring stack can scrape or receive. duration), and gauges (active sessions) exposed from the exporter and enriched by individual drivers via the `driver_type` label. Each driver selects a category from a predefined set in jumpstarter core (e.g. - `storage`, `power`, `network`, `serial`, `console`, `video`). + `storage`, `power`, `network`, `serial`, `console`, `video`, + `composite`). Composite drivers (e.g. Renode, QEMU) that bundle multiple sub-drivers do not emit a single top-level category for delegated work. Instead, each sub-driver emits its own `driver_type` when it performs an @@ -272,8 +274,10 @@ The Telemetry service maps `component` and `exporter` to Loki stream labels and everything else into the JSON body, following the cardinality rules in *Cardinality guidelines*. The `exporter` and `client` fields are verified server-side with the authenticated -identity to prevent impersonation. Empty fields or details -that can be obtained from lease_id are incorporated into the log. +identity to prevent impersonation. `spec.context` entries associated +with the active lease (e.g. `build_id`, `image_digest`) are placed in +`extra_fields` by the caller. Empty fields or details +that can be obtained from `lease_id` are incorporated into the log. #### gRPC: `AuditStream` removal (`jumpstarter.proto`) @@ -544,11 +548,10 @@ are still useful for selection and for tools that only understand metadata. 
| *`spec.context` keys* | User-defined strings (during active lease) | no | All `lease.spec.context` entries (e.g. `build_id`, `image_digest`, VCS ref) added as JSON fields. High cardinality, never stream labels. | | *`exporterLabels` keys* | Values from Exporter CRD labels (when configured) | no | Operator-defined exporter labels (e.g. `board-type`); see `spec.telemetry.exporterLabels`. | - `namespace` is **not** emitted by the application. Log shippers - (Promtail, Grafana Alloy, Vector) automatically inject `namespace` - (and `pod`, `container`) from Kubernetes pod metadata via service - discovery, so it is available as a Loki stream label without - application-level awareness. + `namespace` is emitted by the application from its own runtime + context (the namespace in which the process is running). Log shippers + (Promtail, Grafana Alloy, Vector) may also inject `pod` and + `container` from Kubernetes pod metadata via service discovery. Fields marked as **Loki stream labels** are extracted by the log shipper and used as indexed stream selectors. They must be low-cardinality to @@ -604,10 +607,11 @@ are still useful for selection and for tools that only understand metadata. client dependency (see **DD-4**); their pod logs reach Loki via the cluster's existing log shipping infrastructure. Generic in-cluster collectors solve *credentials* but not *semantic* correlation unless - integrated; the hub (2) reuses the existing trust model - (exporter→controller) and can inject labels and tenant context in one place. A separate Deployment (**4** / - **DD-7**) is preferable to overloading the main reconciler when - load or residency of counters matters. + integrated; alternative (2)'s trust-model advantage — which (4) + inherits — reuses the existing exporter→controller relationship and + can inject labels and tenant context in one place. 
A separate + Deployment (**4** / **DD-7**) is preferable to overloading the main + reconciler when load or residency of counters matters. ### DD-6: OpenTelemetry (OTLP / Collector) as a *mandated* layer @@ -637,35 +641,49 @@ are still useful for selection and for tools that only understand metadata. **Rationale:** -- **Complexity** — the Collector is another versioned, configured service; dual - OTel stacks (Go, Python) add version drift and test matrix. -- **Fit** — most Jumpstarter metrics and lease events map cleanly to - Prometheus and Loki wire protocols operators already use. -- **Narrow scope** — full three-pillar OTel (unified logs via OTLP) is - *optional product territory*; this JEP optimizes for low ceremony and - direct integration. - -The proposed Jumpstarter Telemetry service (**DD-7**) is itself a -non-trivial component (metric aggregation, Loki forwarding, multi-replica -HA). The distinction is that it is *purpose-built* for Jumpstarter's -narrow scope: a single Go binary with a single config surface, no -separate version matrix, and no generic pipeline DSL to learn. An OTel -Collector serves many use cases but requires operator familiarity with -its configuration model, receivers, processors, and exporters — overhead -that is not justified when the data paths are known in advance. -Additionally, the Telemetry service operates inside Jumpstarter's -existing authentication and trust domain (mTLS, registered client and -exporter identities). It can validate that an incoming `MetricsStream` -or `PushLogs` call originates from the claimed exporter or client — preventing -impersonation or label injection — without requiring a separate -auth layer. A generic OTel Collector has no awareness of Jumpstarter -identities and would need external policy to achieve the same guarantee. - -**Future extension:** the Telemetry service's ingest endpoint could -accept OTLP in a future iteration, enabling operators who run OTel -Collectors on exporter hosts (e.g. 
for host-level stats) to route data -through the same trust boundary without a second credential set. This -is additive and does not require adopting OTel as a project dependency. +The proposed Jumpstarter Telemetry service (**DD-7**) admittedly +reimplements a subset of OTel Collector functionality — metric +aggregation, log forwarding, backpressure, and multi-replica HA. The +decision to build a purpose-built component rather than adopt the OTel +Collector rests on three arguments, ordered by importance: + +1. **Identity enforcement (primary)** — The Telemetry service operates + inside Jumpstarter's existing authentication and trust domain (mTLS, + registered client and exporter identities). It validates that every + incoming `MetricsStream` or `PushLogs` call originates from the + claimed exporter — preventing impersonation or label + injection — using identities the platform already manages. A generic + OTel Collector has no awareness of Jumpstarter identities; achieving + the same guarantee would require an external auth policy layer + (e.g. custom processors, mTLS-to-attribute mapping, and a sidecar or + admission webhook to enforce label provenance), adding complexity + that offsets the Collector's generality. + +2. **Operational simplicity** — The Telemetry service is a single Go + binary with a single config surface (the operator CR), no separate + version matrix, and no generic pipeline DSL. An OTel Collector + requires operator familiarity with its configuration model + (receivers, processors, exporters, and connectors), dual OTel SDK + stacks (Go + Python) add version drift and test matrix, and the + Collector itself is another versioned service to upgrade and + monitor. This overhead is not justified when the data paths are + known in advance. + +3. **Narrow scope** — Jumpstarter metrics and lease events map directly + to Prometheus and Loki wire protocols that operators already use. 
+ Full three-pillar OTel (unified logs and metrics via OTLP) is + *optional product territory*; this JEP optimizes for low ceremony + and direct integration with exactly those two backends. + + +**Future extension:** because the Telemetry service already aggregates +metrics snapshots and structured log entries in well-defined formats, +adding an OTLP push output (logs and metrics) alongside the existing +Loki and `/metrics` paths would be a trivial change. This would let +operators route Jumpstarter data into an OTel Collector or any +OTLP-compatible backend without altering the exporter or client side. +The change is additive and does not require adopting the OTel SDK as a +project dependency. ### DD-7: Optional Jumpstarter Telemetry service (dedicated Deployment vs. Controller/Router only) @@ -695,8 +713,9 @@ is additive and does not require adopting OTel as a project dependency. **Decision:** Prefer **(4)** for the optional aggregated-metrics + Loki path at scale; allow **(1)** in small or dev clusters; **(3)** only - if review shows a need. Could still offer a centralized log/event source when - Loki is not available by using the pod logs, this could be helpful for testing. + if review shows a need. In deployments without Loki, the Telemetry + service's own pod logs (structured JSON to stdout) still provide a + centralized, queryable event source via the cluster log shipper. **Rationale:** A dedicated workload can scale and restart independently; Loki spikes and ingest load cannot starve lease reconciliation in the @@ -991,6 +1010,7 @@ and be fixed before "Implemented".* | `jumpstarter_stream_bytes_total` | counter | `exporter`, `driver_type`, `direction` | Bytes transferred (tx/rx) on streams. | | `jumpstarter_active_sessions` | gauge | `exporter` | Currently active lease sessions. | | `jumpstarter_lease_acquisitions_total` | counter | `result` | Lease acquire attempts (controller). 
| +| `jumpstarter_telemetry_dropped_total` | counter | `destination` | Log entries dropped due to backpressure (e.g. `destination="loki"`). | | `jumpstarter_scrape_timeouts_total` | counter | `exporter` | Scrape fan-out timeouts per exporter (Telemetry-side). | All counters and histograms carry exemplar keys from the operator's @@ -1295,7 +1315,7 @@ can tune metrics, logging, and exemplar behavior without editing code. | `spec.telemetry.metrics.driverTypeEnum` | `[]string` | `["power", "storage", "network", "serial", …]` | Allowed `driver_type` label values. Drivers reporting an unlisted type are mapped to `other`. | | `spec.telemetry.metrics.serviceMonitor` | `bool` | `true` | Create `ServiceMonitor` CRDs for Prometheus autodiscovery. | | `spec.telemetry.metrics.prometheusRules` | `bool` | `false` | Deploy starter `PrometheusRule` CRDs (opt-in). | -| `spec.telemetry.metrics.scrapeTimeout` | `duration` | `7s` | Max time to wait for parallel exporter responses during a `/metrics` fan-out. Must leave headroom within the Prometheus-side `scrape_timeout` | +| `spec.telemetry.metrics.scrapeTimeout` | `duration` | `7s` | Max time to wait for parallel exporter responses during a `/metrics` fan-out. Should be set lower than the Prometheus-side `scrape_timeout` to leave headroom for HTTP transport. | | `spec.telemetry.backpressure.queueDepth` | `int` | `10000` | Ring buffer depth for Loki log push queue. | **Example CR snippet:** @@ -1326,6 +1346,7 @@ spec: - storage - network - serial + - console - video - composite serviceMonitor: true @@ -1618,7 +1639,7 @@ all subsequent phases have E2E coverage from the start. ## Implementation History -— JEP-0011 proposed: 2026-04-23 +- JEP-0011 proposed: 2026-04-23 - JEP-0011 updated based on feedback: 2026-04-29 ## References