diff --git a/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md new file mode 100644 index 000000000..b4e1a83f7 --- /dev/null +++ b/python/docs/source/internal/jeps/JEP-0011-observability-telemetry-logs.md @@ -0,0 +1,1658 @@ +# JEP-0011: Metrics, Tracing, and Log Observability + +| Field | Value | +| ----------------- | --------------------------------------------------------------------- | +| **JEP** | 0011 | +| **Title** | Metrics, Tracing, and Log Observability | +| **Author(s)** | @mangelajo (Miguel Angel Ajo Pelayo ) | +| **Status** | Discussion | +| **Type** | Standards Track | +| **Created** | 2026-04-23 | +| **Updated** | 2026-04-29 | +| **Discussion** | https://github.com/jumpstarter-dev/jumpstarter/pull/631 | +| **Requires** | — | +| **Supersedes** | — | +| **Superseded-By** | — | + +--- + +## Abstract + +This JEP defines an optional, cross-component observability model for +Jumpstarter covering lease context metadata, structured operational events, +exporter/driver metrics, and standardized logging. It targets direct integration +with Prometheus (scrape), Loki (log aggregation), and Perses (dashboards) — +without mandating OpenTelemetry — and introduces an optional in-cluster +Jumpstarter Telemetry service that aggregates data from exporters and clients so +that edge processes never need Loki or cluster-scrape credentials. +Implementation is expected to land in phases; this JEP describes the end state +and compatibility rules. + +### Phases + +| Phase | Scope | Key deliverables | +| ----- | ---------------------------------- | ---------------- | +| 1 | Structured logging + lease context | `spec.context` CRD field; JSON structured logs for all long-running services; correlation fields (`lease_id`, `exporter`, `operation`, `result`) in every log line. | +| 2 | Metrics endpoints | `/metrics` scrape endpoints on Controller and Router; exporter-local `prometheus_client` counters/histograms/gauges with `driver_type`; Prometheus exemplars for high-cardinality context. | +| 3 | Telemetry service | Optional `jumpstarter-telemetry` Deployment managed by the operator; reverse-scrape of exporter metrics via `MetricsStream`; Loki push for edge-originated logs and events. | +| 4 | Exporter drivers telemetry | Provides a clean architecture to let drivers generate their own telemetry data. | +| 5 | In-cluster log scraping | Operator configures log shipper integration (Promtail, Grafana Alloy, Vector) for Controller/Router pod logs; `ServiceMonitor` CRDs for Prometheus autodiscovery. | +| 6 | Dashboards + alerting | Perses CRD dashboards; starter alert rules; documentation and operator integration. | + +Each phase is independently useful and builds on the previous ones. +Phase 1 can ship without any later phase; operators who only need +structured logs benefit immediately. Phase 2 adds scrape-ready metrics +without requiring the Telemetry service. 
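+
+For a concrete picture of the Phase 1 deliverable, a single structured
+log line carrying the correlation fields might look like the following
+(illustrative values only; field names follow the DD-4 table below, and
+real output is one JSON object per physical line, unwrapped):
+
+```text
+{"ts":"2026-04-28T10:15:30.123Z","level":"info","msg":"flash completed",
+ "component":"exporter","exporter":"lab-01",
+ "lease_id":"b2a0c5d4-7a1e-4f3c-9b8d-2e5f6a7c8d90",
+ "operation":"flash","result":"success","driver_type":"storage",
+ "build_id":"nightly-42"}
+```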
+
+## Motivation
+
+Today, operators and CI maintainers need to answer questions that raw Kubernetes
+objects and ad hoc text logs do not always answer in one place:
+- *Which pipeline or image was being tested on this lease?*
+- *How often do flashes fail on this exporter?*
+- *Which lease or user ties a controller log line to a failure seen on the client?*
+
+The `Lease` API already models scheduling and assignment; it does
+not yet provide a first-class, documented place for run metadata or a standard
+for lease-scoped operational events (beyond generic `conditions`).
+
+Exporters delegate device operations to drivers, but there is no shared model
+for driver- or exporter-level metrics that a monitoring stack can scrape or
+receive.
+
+### User Stories
+
+- **As a** lab operator, **I want to** see flash success/failure rates per
+  exporter in a Prometheus dashboard, **so that** I can spot failing hardware
+  before CI teams notice.
+- **As a** CI pipeline author, **I want to** attach my build ID and image
+  digest to a lease, **so that** post-mortem queries in Loki can filter all
+  logs for one pipeline run across controller, exporter, and client.
+- **As a** platform engineer, **I want** exporter processes to send telemetry
+  without holding Loki or Prometheus credentials, **so that** I do not have to
+  distribute and rotate secrets on every lab machine.
+- **As an** AI agent orchestrating CI, **I want** machine-readable structured
+  logs and metric exemplars with lease context, **so that** I can
+  programmatically identify failing exporters and correlate test results
+  without parsing free-form text.
+
+## Proposal
+
+### Concepts
+
+- **Lease context** — Identifiers and labels supplied by a client or CI and
+  associated with the lease for its lifetime, propagated where safe so
+  metrics, logs, and traces can be filtered and joined (see the sketch
+  after this list).
+- **Lease events** (or *operations*) — Annotated, structured log entries
+  recording significant actions (for example *flash started*, *flash failed*,
+  *image reference*) with typed fields, queryable in **Loki** alongside
+  regular logs and distinct from higher-frequency debug output (see **DD-2**).
+- **Exporter metrics** — Counters (operations, bytes), histograms (operation
+  duration), and gauges (active sessions) exposed from the exporter and
+  enriched by individual drivers via the `driver_type` label. Each driver
+  selects a category from a predefined set in jumpstarter core (e.g.
+  `storage`, `power`, `network`, `serial`, `console`, `video`,
+  `composite`).
+  Composite drivers (e.g. Renode, QEMU) that bundle multiple sub-drivers
+  do not emit a single top-level category for delegated work. Instead,
+  each sub-driver emits its own `driver_type` when it performs an
+  operation — a Renode storage sub-driver emits `driver_type="storage"`,
+  its power sub-driver emits `driver_type="power"`, and so on. Any
+  top-level methods on the composite driver itself (e.g. VM lifecycle)
+  emit `driver_type="composite"`.
+- **Jumpstarter Telemetry** (optional) — a dedicated component that
+  reverse-scrapes connected exporters for metrics via `MetricsStream`
+  and receives structured logs via `PushLogs`, using the same trust
+  model (mTLS, ServiceAccount) as Controller/Router. It isolates
+  Loki/series work from the reconciler hot path (see **DD-7**).
+  Multi-replica HA with exporter-sticky connections is covered in
+  **DD-8**; best-effort log deduplication in **DD-9**.
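+
+As an illustrative sketch of the *lease context* concept (the API group,
+version, and exact field shape here are assumptions, not the final
+schema; DD-1 and the admission limits under *Exemplars for
+high-cardinality context* govern the real definition):
+
+```yaml
+apiVersion: jumpstarter.dev/v1alpha1  # assumed group/version, for illustration
+kind: Lease
+metadata:
+  name: ci-run-nightly-42
+spec:
+  context:                            # proposed field (DD-1): user-defined string map
+    build_id: "nightly-42"
+    image_digest: "sha256:4b3fd2c9"   # shortened illustrative digest
+    vcs_ref: "a1b2c3d"
+```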
+
+### What users see
+
+- When creating a lease, clients (or their tooling) can attach metadata via
+  Kubernetes labels/annotations and/or `spec.context`, using documented
+  keys and size limits. Example keys might include a build / pipeline
+  identifier, image digest, or VCS revision.
+- The controller and/or data plane write structured, annotated log events
+  (see **DD-2**) for significant operations such as flash attempts and outcomes.
+- Exporters maintain local `prometheus_client` counters and open a
+  `MetricsStream` to the Jumpstarter Telemetry service over the
+  existing exporter↔control-plane trust boundary. On each Prometheus
+  scrape, the Telemetry service fans out to connected exporters and
+  serves the merged `/metrics` output (see **DD-3**, **DD-7**) using its
+  own cluster credentials — avoiding per-exporter Loki and metrics secrets.
+  Exporters and clients also push structured log entries via `PushLogs`
+  (not unbounded default chatter — see *Control-plane aggregation*
+  below).
+- The `jmp` CLI output remains human-readable, but when a Telemetry
+  endpoint is available, `jmp` also pushes structured JSON logs to the
+  Jumpstarter Telemetry service for Loki ingest.
+
+### API / Protocol Changes
+
+#### CRD (Lease)
+
+Additive changes only: a new `spec.context` field. Backwards compatibility
+is preserved by leaving the field empty by default.
+
+#### gRPC: Telemetry endpoint discovery (`jumpstarter.proto`)
+
+A new RPC on the existing `ControllerService` lets both exporters and
+clients discover the optional Telemetry endpoint:
+
+```protobuf
+// Added to ControllerService
+rpc GetServiceEndpoints(GetServiceEndpointsRequest)
+    returns (GetServiceEndpointsResponse);
+
+message GetServiceEndpointsRequest {}
+
+message GetServiceEndpointsResponse {
+  // Empty when telemetry is not enabled.
+  repeated TelemetryEndpoint telemetry_endpoints = 1;
+}
+
+message TelemetryEndpoint {
+  string endpoint = 1;     // gRPC address (host:port)
+  string certificate = 2;  // Optional CA cert for the endpoint
+}
+```
+
+Exporters call `GetServiceEndpoints` after `Register`; clients call it
+after authentication. An empty `telemetry_endpoints` list means telemetry
+is not deployed — callers skip all telemetry RPCs. Older controllers
+that do not implement the method return `UNIMPLEMENTED`, which callers
+treat identically to an empty list.
+
+#### gRPC: Telemetry service (`telemetry.proto` — new file)
+
+A new `protocol/proto/jumpstarter/v1/telemetry.proto` defines the
+`TelemetryService` implemented by `jumpstarter-telemetry`. It has two
+RPCs: one for metrics (reverse scrape) and one for log push.
+
+##### Metrics: reverse scrape via `MetricsStream`
+
+Exporters maintain a local `prometheus_client.CollectorRegistry` with
+counters, histograms, and gauges. Rather than pushing increments, the
+exporter opens a persistent bidirectional stream to the Telemetry
+service; whenever the Telemetry service's own `/metrics` endpoint is
+scraped, it sends a scrape request down the stream
+and the exporter responds with the output of
+`prometheus_client.generate_latest()` in OpenMetrics text format.
+
+```protobuf
+service TelemetryService {
+  // Persistent bidirectional stream: telemetry sends scrape requests,
+  // exporter responds with full metric snapshots.
+  rpc MetricsStream(stream MetricsStreamRequest)
+      returns (stream MetricsStreamResponse);
+
+  // Structured log / event push (used by both exporters and clients).
+  rpc PushLogs(PushLogsRequest) returns (PushLogsResponse);
+}
+
+// Exporter → Telemetry
+message MetricsStreamRequest {
+  oneof msg {
+    MetricsRegister register = 1;               // First message: identify this exporter
+    MetricsScrapeResponse scrape_response = 2;  // Subsequent: reply to a scrape
+  }
+}
+
+message MetricsRegister {
+  string identity = 1;  // Exporter CRD name (verified against mTLS and auth token by server)
+}
+
+message MetricsScrapeResponse {
+  bytes metrics_text = 1;  // generate_latest() OpenMetrics output
+  google.protobuf.Timestamp timestamp = 2;
+}
+
+// Telemetry → Exporter
+message MetricsStreamResponse {
+  oneof msg {
+    MetricsScrapeRequest scrape_request = 1;
+  }
+}
+
+message MetricsScrapeRequest {}  // "send your /metrics now"
+```
+
+The stream lifecycle:
+
+1. Exporter opens the stream and sends `MetricsRegister`; the jumpstarter-telemetry
+   service authenticates the exporter identity and derives labels from
+   cluster information.
+2. When Prometheus (or any scraper) hits the Telemetry service's
+   `/metrics` endpoint, Telemetry fans out `MetricsScrapeRequest`
+   to all connected exporters.
+3. Each exporter calls `generate_latest(registry)` and replies with
+   `MetricsScrapeResponse`.
+4. Telemetry merges the responses, adding or filtering labels and
+   exemplars as needed, and serves the combined result.
+   This on-demand approach avoids stale data and unnecessary
+   background traffic; it can be changed to periodic pre-fetching
+   later if scrape latency becomes problematic.
+
+**Client-side metrics are not collected.** All metrically interesting
+operations are observable from the exporter side: `DriverCall` methods
+run on the exporter and can be instrumented there. Client-side drivers
+that orchestrate complex workflows (e.g. serial-console-driven
+flashing) report outcomes back to the exporter via regular
+`DriverCall` methods, keeping the exporter as the single source of
+truth for metrics.
+
+##### Logs: push via `PushLogs`
+
+Both exporters and clients push structured log entries to the
+Telemetry service for Loki ingest:
+
+```protobuf
+message PushLogsRequest {
+  repeated LogEntry entries = 1;
+}
+
+message PushLogsResponse {
+  uint32 accepted = 1;  // Entries accepted
+  uint32 dropped = 2;   // Entries dropped (backpressure)
+}
+
+message LogEntry {
+  google.protobuf.Timestamp timestamp = 1;
+  string severity = 2;                    // debug, info, warn, error
+  string message = 3;
+  string component = 4;                   // Log stream label: cli, exporter
+  string exporter = 5;                    // Log stream label: exporter CRD name
+  string lease_id = 6;                    // High-cardinality, log body only
+  string client = 7;                      // High-cardinality, log body only
+  string operation = 8;                   // flash, power, etc.
+  string result = 9;                      // success, failure
+  string driver_type = 10;                // storage, power, network, etc.
+  map<string, string> extra_fields = 11;  // Driver-specific structured data
+}
+```
+
+The Telemetry service maps `component` and `exporter` to Loki stream
+labels and everything else into the JSON body, following the
+cardinality rules in *Cardinality guidelines*. The `exporter` and
+`client` fields are verified server-side against the authenticated
+identity to prevent impersonation. `spec.context` entries associated
+with the active lease (e.g. `build_id`, `image_digest`) are placed in
+`extra_fields` by the caller. Fields left empty whose values can be
+derived from the `lease_id` are filled in by the Telemetry service
+when the log is written.
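+
+Closing out the `TelemetryService` API, a minimal exporter-side sketch
+of the `MetricsStream` lifecycle above follows. It is illustrative
+only: the generated-stub module path
+(`jumpstarter_protocol.jumpstarter.v1`) and helper names are
+assumptions, not the final implementation. Note that the OpenMetrics
+encoder (which carries exemplars) lives in
+`prometheus_client.openmetrics.exposition`, not the default text-format
+module:
+
+```python
+import queue
+
+import grpc
+from google.protobuf import timestamp_pb2
+from prometheus_client import CollectorRegistry, Counter
+# OpenMetrics encoder; required so exemplars survive exposition.
+from prometheus_client.openmetrics.exposition import generate_latest
+
+# Assumed codegen module path for telemetry.proto stubs.
+from jumpstarter_protocol.jumpstarter.v1 import telemetry_pb2, telemetry_pb2_grpc
+
+registry = CollectorRegistry()
+OPERATIONS = Counter(
+    "jumpstarter_operations",  # exposed as jumpstarter_operations_total
+    "Total operations performed",
+    ["exporter", "operation", "result", "driver_type"],
+    registry=registry,
+)
+
+
+def run_metrics_stream(channel: grpc.Channel, identity: str) -> None:
+    """Register with the Telemetry service and answer scrape requests."""
+    stub = telemetry_pb2_grpc.TelemetryServiceStub(channel)
+    outbox: queue.Queue = queue.Queue()
+    # First message identifies this exporter (verified server-side).
+    outbox.put(telemetry_pb2.MetricsStreamRequest(
+        register=telemetry_pb2.MetricsRegister(identity=identity)))
+
+    def requests():
+        while True:
+            yield outbox.get()
+
+    # Bidirectional stream: each scrape request is answered with a full
+    # snapshot of the local registry.
+    for response in stub.MetricsStream(requests()):
+        if response.HasField("scrape_request"):
+            ts = timestamp_pb2.Timestamp()
+            ts.GetCurrentTime()
+            outbox.put(telemetry_pb2.MetricsStreamRequest(
+                scrape_response=telemetry_pb2.MetricsScrapeResponse(
+                    metrics_text=generate_latest(registry), timestamp=ts)))
+```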
+
+#### gRPC: `AuditStream` removal (`jumpstarter.proto`)
+
+The existing `AuditStream` RPC on `ControllerService` and its
+`AuditStreamRequest` message are removed. Analysis of the codebase
+shows this is dead code:
+
+- The Go controller has no implementation — calls fall through to
+  `UnimplementedControllerServiceServer` which returns
+  `codes.Unimplemented`.
+- No Python code (exporter or client) calls the RPC.
+- No tests exercise it beyond generated stubs.
+
+Its intended purpose (tracking exporter activity) is fully superseded
+by `TelemetryService.PushLogs` with a richer, properly-designed
+message format.
+
+#### gRPC: `LogStreamResponse` enrichment (`jumpstarter.proto`)
+
+The existing `LogStream` RPC on `ExporterService` is kept — it serves
+a fundamentally different purpose (real-time session logs from
+exporter to connected client) from the Telemetry log push. However,
+the `LogStreamResponse` message is enriched with optional additive
+fields to support richer client-side display and optional dual-path
+forwarding to telemetry:
+
+```protobuf
+message LogStreamResponse {
+  string uuid = 1;
+  string severity = 2;
+  string message = 3;
+  optional LogSource source = 4;
+  // New additive fields:
+  optional string driver_type = 5;  // Category when source=DRIVER
+  optional string operation = 6;    // When the log is part of a known operation
+  optional google.protobuf.Timestamp timestamp = 7;
+  map<string, string> structured_fields = 8;
+}
+```
+
+These fields are optional and backward compatible — older clients
+ignore unknown fields; older exporters simply do not set them.
+
+#### Tracing scope
+
+This JEP covers *correlation only* — `lease_id`, `trace_id`,
+and `span_id` are propagated as log fields and Prometheus exemplar keys so that
+metrics, logs, and (future) traces can be joined. Full distributed tracing
+(span creation, sampling policies, trace storage and visualization) is deferred
+to a future JEP. Optional propagation of `traceparent` and lease
+identifiers in gRPC metadata remains backward compatible (unknown
+metadata ignored by older servers).
+
+### Hardware Considerations
+
+- No hardware considerations.
+
+## Design Decisions
+
+### DD-1: How lease-scoped *context* metadata is stored
+
+**Scope:** This decision is about where to store generic metadata on a
+`Lease` that describes *why* a run exists or *where* it came from — for example
+an external build id, pipeline id, VCS revision, or other
+operator-defined keys (team, environment), within the cardinality and
+size limits defined in *Cardinality guidelines*. The same stored context
+is the intended source to propagate (where safe) into Prometheus
+exemplar keys and into structured log line fields, covering emissions
+that occur during the lease, logs produced during client access to the
+platform (for example `jmp`), and exporter and control-plane handling,
+so Prometheus and Loki can correlate on one lease-level
+identity without re-typing it on every line.
+
+**Alternatives considered:**
+
+1. **Annotation and label only** on the `Lease` object — Kube-native, no spec
+   change; annotations are size-limited; labels are usable only for
+   selector queries.
+2. **Typed subfields under `spec`** (for example `observability` or `context`)
+   — easier validation, clearer API, migration path in CRD.
+3. **Only client-side** (environment / local config) — no cluster visibility;
+   hard for operators to audit; no stable object-level link to per-lease
+   metrics and server logs in the cluster.
+
+**Decision:** **(2)** — a typed `spec.context` map under the Lease CRD for
+first-class, validated context. **(1)** (labels/annotations) remains allowed
+for integration with generic tooling that only understands Kubernetes metadata
+or benefits from lease label filtering.
+
+**Rationale:** Typed fields make validation and documentation clear; labels
+are still useful for selection and for tools that only understand metadata.
+
+### DD-2: Where operational events (flash, image) live
+
+**Alternatives considered:**
+
+1. **Kubernetes `Event` objects** — built-in, TTL-limited, good for
+   "what happened" in `kubectl get events` but not long-term history by default.
+2. **`Lease.status.conditions` only** — compact but poor for a sequence of
+   operations with payloads (image id, size).
+3. **Dedicated CRD** (for example per-event or a single stream object) — more
+   design and RBAC, better long-term retention and querying if backed properly.
+4. **Annotated log events** — a lightweight alternative that can be traced
+   and filtered alongside regular logs.
+
+**Decision:** **(4)** — the other alternatives add pressure to cluster etcd
+via CRD writes, while annotated log events provide the same level of
+functionality and can be browsed together with regular logs.
+
+**Rationale:** Annotated log events naturally flow through the Loki
+pipeline this JEP already establishes (**DD-5**, **DD-7**), so operational
+records (flash started, flash failed, image reference) are queryable,
+filterable, and correlated with surrounding exporter and controller logs
+using the same correlation fields (`lease_id`, `exporter`, `result`, …)
+without a second query domain. Kubernetes `Event` objects **(1)** have a short
+default TTL (~1 h) and still write to etcd on every occurrence;
+`status.conditions` **(2)** is a poor fit for a sequence of operations with
+variable payloads (image digest, byte count, duration); a dedicated CRD
+**(3)** adds schema versioning, RBAC surface, and per-event etcd writes
+that scale with flash volume — all of it pressure the cluster does not need
+for data whose primary consumers are dashboards and post-mortem
+queries, not reconciliation loops. Structured log events carry arbitrary
+fields without CRD migration, support configurable retention in Loki,
+and keep the etcd write budget reserved for scheduling and assignment
+where it matters most.
+
+### DD-3: Metrics: Prometheus scrape of `/metrics` as the reference path
+
+**Alternatives considered:**
+
+1. **HTTP `GET /metrics` in Prometheus text format** (pull) — the default
+   for in-cluster Prometheus in scrape mode; works
+   with the Prometheus Operator (`ServiceMonitor`), `kube-prometheus`, and
+   self-hosted jobs. The optional Jumpstarter Telemetry service exposes
+   this for aggregated counters it holds after receiving +1 / +N
+   from exporters.
+2. **Prometheus remote write** (or a Mimir / Cortex receiver)
+   from a Jumpstarter component — useful in advanced topologies; not
+   part of the reference implementation in this JEP; operators can add a
+   federation or `remote_write` from Prometheus to long-term
+   storage without the application pushing to Prometheus.
+3. **Both** — **(1)** is required for the documented path; **(2)** is
+   optional infrastructure behind Prometheus, not a second
+   required app protocol.
+4. **Reverse scrape via gRPC** — exporters maintain a local
+   `prometheus_client.CollectorRegistry` and connect to the Telemetry
+   service via a persistent bidirectional gRPC stream (`MetricsStream`).
+ When Prometheus scrapes the Telemetry service's `/metrics` endpoint, + Telemetry fans out scrape requests to all connected exporters, merges + the `generate_latest()` responses, and serves the combined result. + Controller and Router still expose `/metrics` directly for Prometheus + scrape (no change). This avoids push-increment complexity on the wire + and keeps full counter state on the exporter at all times. + +**Decision:** **(4)** — exporter-originated metrics are reverse-scraped + through the Telemetry service via `MetricsStream`. + +**Rationale:** Exporters are often behind NAT or firewalls and cannot + be directly scraped by Prometheus. The reverse-scrape model **(4)** + solves this: the exporter initiates an outbound gRPC stream + (NAT-friendly, same direction as the existing controller connection), + the Telemetry service requests metric snapshots on demand, and full + counter state remains on the exporter at all times — eliminating + lost-increment concerns (see **DD-9**). The exporter uses standard + `prometheus_client` primitives locally, so driver authors instrument + with familiar counters and histograms. The OpenMetrics exposition + format natively carries exemplars, enabling high-cardinality context + (`client`, `lease_id`, and `trace_id` when present) on individual + samples without additional infrastructure. See **DD-6** (no OTel), + **DD-7** (Telemetry Deployment), **DD-8** (HA replicas). + +**Exemplar trade-offs and details:** + +- **Wire format.** On the OpenMetrics `/metrics` endpoint an exemplar is + appended after the sample value: + + ```text + jumpstarter_operations_total{exporter="lab-01",operation="flash",result="success"} 42 # {client="ci-bot",lease_id="abc123",build_id="nightly-42"} 1.0 1625000000.000 + ``` + + The `# {key=value,...} value timestamp` suffix is the exemplar. Grafana + (≥ 7.4) renders these as clickable dots on metric panels; clicking a dot + reveals the attached keys and can link to a Loki log query (filtered by + `lease_id`) or a trace view (filtered by `trace_id`). + +- **Size limit.** The [OpenMetrics 1.0 spec](https://prometheus.io/docs/specs/om/open_metrics_spec) + imposes a **128 UTF-8 character** limit on the combined length of + exemplar label names and values per exemplar. + [OpenMetrics 2.0](https://github.com/prometheus/docs/blob/main/docs/specs/om/open_metrics_spec_2_0.md) + (experimental, 2026) relaxes this to a soft cap measured in bytes. + The exemplar key budget is discussed further in *Exemplars for + high-cardinality context*. + +- **Sampling.** Client libraries rate-limit exemplar updates internally; + the last-seen exemplar per series is served on each scrape, not one + per data point. For the Jumpstarter use case this is sufficient: + the most recent `lease_id` / `trace_id` on a counter is the value + operators need when investigating a spike. + +- **Library support.** Go client support is mature + (`prometheus/client_golang` ≥ 1.16). The Python `prometheus_client` + library is used on the exporter side to maintain local registries + and produce `generate_latest()` output for the reverse-scrape path + (see *API / Protocol Changes*). Exemplar support in the Python + library is functional but less complete than Go; if limitations + arise, exemplar data can be sent as a sidecar field in + `MetricsScrapeResponse` for the Telemetry service to merge + server-side. + +- **Infrastructure requirements.** Prometheus ≥ 2.26 with + `--enable-feature=exemplar-storage` and + `--storage.tsdb.max-exemplars` (e.g. 100 000). 
Grafana ≥ 7.4 for + exemplar visualization. Perses does not yet support exemplar + rendering; until it does, operators who want exemplar click-through + can use Grafana alongside Perses or wait for upstream support. + + These limitations are acceptable for the correlation use case this JEP + targets. + +### DD-4: Log format for services vs CLI + +**Alternatives considered:** + +1. **JSON always** for every process — best for machines; hard for humans. +2. **Human text default for `jmp`**, **JSON for long-running services** and a + CLI push via the Telemetry ingest endpoint in JSON format (in addition to the + human-friendly output) +3. **Single format** with a pretty-printer in front of developers — more moving + parts. + +**Decision:** **(2)**. Long-running services (`jumpstarter-controller`, + `jumpstarter-router`, `jumpstarter-telemetry`, Exporter) emit + structured JSON to stdout. The Controller and Router do not + push logs directly to Loki; instead, a cluster-level log shipper + (Promtail, Grafana Alloy, Vector, or equivalent DaemonSet) scrapes their + pod logs and delivers them to Loki. Only `jumpstarter-telemetry` writes + to Loki directly (push API) because the exporter/client data it + aggregates does not originate as any pod's stdout. + +**Rationale:** Matches the requirement that *clients* stay human-readable, and at + the same time all services get parseable, joinable log lines. Writing JSON + to stdout and relying on the cluster log shipper for Loki delivery + decouples the Controller reconciler and Router session handling from + Loki availability — a Loki outage does not affect lease operations. + The Telemetry service retains a direct Loki-push because it is an + isolated workload (**DD-7**) whose core job is Loki ingest. + +**Format:** JSONL (one JSON object per line), produced by setting + `--zap-encoder=json` on the existing `controller-runtime` / Zap logger + (no changes to log call sites — existing `logr` structured fields become + JSON keys automatically). The `ts`, `level`, and `msg` fields follow + Zap's default JSON encoder output; application code adds domain fields + via the standard `logr` `WithValues` / `Info` / `Error` API. + + Base fields present in every log line: + +| Field | Format | Loki label | Description | +| ------------- | ------------------------------------------------------------------- | :--------: | ----------------------------------------- | +| `ts` | ISO-8601 (`2026-04-28T10:15:30.123Z`) | no | Timestamp (Zap default). | +| `level` | Lower-case string (`debug`, `info`, `warn`, `error`) | no | Log severity (Zap default). | +| `msg` | Free-form string | no | Human-readable message (Zap default). | +| `component` | Fixed enum (`cli`, `controller`, `router`, `telemetry`, `exporter`) | **yes** | Emitting service. | +| `exporter` | CRD name (when applicable) | **yes** | Exporter CRD name; bounded by cluster size.| +| `lease_id` | UID string (when applicable) | no | Lease UID (high cardinality). | +| `operation` | String (when applicable) | no | Operation name (flash, power, …). | +| `result` | String (when applicable) | no | Outcome (success, failure, …). | +| `driver_type` | Category from predefined set (when applicable) | no | Driver category (storage, power, …). | +| `client` | CRD name (when applicable) | no | Client CRD name (high cardinality). | +| *`spec.context` keys* | User-defined strings (during active lease) | no | All `lease.spec.context` entries (e.g. `build_id`, `image_digest`, VCS ref) added as JSON fields. 
High cardinality, never stream labels. | +| *`exporterLabels` keys* | Values from Exporter CRD labels (when configured) | no | Operator-defined exporter labels (e.g. `board-type`); see `spec.telemetry.exporterLabels`. | + + `namespace` is emitted by the application from its own runtime + context (the namespace in which the process is running). Log shippers + (Promtail, Grafana Alloy, Vector) may also inject `pod` and + `container` from Kubernetes pod metadata via service discovery. + + Fields marked as **Loki stream labels** are extracted by the log shipper + and used as indexed stream selectors. They must be low-cardinality to + keep the active stream count manageable (Grafana recommends < 100 k + active streams per tenant). With the labels above, a deployment with + 200 exporters across 5 namespaces produces roughly 1 000 streams — + well within budget. High-cardinality fields like `client` or + `lease_id` must stay in the JSON body: promoting `client` to a + stream label in a 1 000-client, 200-exporter cluster would create + up to 1 000 000 streams, overwhelming the Loki ingester. These fields + are instead queried with `| json | client="value"` filter + expressions after selecting the relevant streams. + + Multi-line content (e.g. stack traces) is embedded as an escaped string + within the JSON value (typically in a `stacktrace` or `error` field), + never as bare multi-line text, so each physical line is always one + complete JSON object. + +### DD-5: Where Loki and Prometheus (or remote-write) credentials live + +**Alternatives considered:** + +1. **Each exporter and edge host** holds credentials (or a sidecar) to push + directly to Loki and to Prometheus (or a metrics gateway) — maximum + flexibility; maximum secret distribution and rotation burden on lab and + remote sites. +2. **Jumpstarter Controller and/or Router** receive metrics and structured + events from exporters and (optionally) from client traffic they already + handle, and forward to the Loki push API and to + Prometheus-compatible sinks (scrape registration) + with in-cluster auth — one + credential surface; enriched with lease, exporter, and client context + in one place; must be non-blocking, bounded, and optional so the + control path does not depend on Loki or Prometheus availability. +3. **Hybrid** — generic in-cluster collectors for raw pod logs and scrape; + (2) for lease-scoped events and aggregated exporter metrics the + platform understands. +4. **Dedicated Jumpstarter Telemetry Deployment** (see **DD-7**) + instead of folding everything into the Controller — only + Telemetry holds Loki-push credentials; isolated failure domain + and scaling for reverse-scrape and log ingest. Router and Controller + write structured JSON to stdout (see **DD-4**) and expose `/metrics` + for Prometheus scrape; a cluster log shipper delivers their pod logs + to Loki without Jumpstarter-specific Loki credentials. + +**Decision:** (4) + +**Rationale:** The goal is to avoid propagating Loki- and + cluster-ingest authentication + to every exporter process while still attaching Jumpstarter-specific + context. Among Jumpstarter components, only `jumpstarter-telemetry` + holds Loki-push credentials — the Controller and Router have no Loki + client dependency (see **DD-4**); their pod logs reach Loki via the + cluster's existing log shipping infrastructure. 
Generic in-cluster + collectors solve *credentials* but not *semantic* correlation unless + integrated; alternative (2)'s trust-model advantage — which (4) + inherits — reuses the existing exporter→controller relationship and + can inject labels and tenant context in one place. A separate + Deployment (**4** / **DD-7**) is preferable to overloading the main + reconciler when load or residency of counters matters. + +### DD-6: OpenTelemetry (OTLP / Collector) as a *mandated* layer + +**Alternatives considered:** + +1. **Adopt OpenTelemetry** — instrument Controller, Router, Exporter, and + clients with the OTel SDK, export OTLP to a cluster-local + OpenTelemetry Collector, and let the Collector fan out to Loki, Prometheus + (remote write), and Tempo. +2. **Integrate directly** with each backend: Loki HTTP `POST /loki/api/v1/push` or + gRPC; Prometheus text on `/metrics`; structured JSON + (or logfmt) logs to stdout for shippers; optional W3C `traceparent` in + gRPC metadata for correlation *without* shipping full distributed + traces in the first iteration. If traces are ever needed, use Tempo + ingest where practical, *or* a thin sender — still + without a project-wide requirement on the OTel SDK in every binary. +3. **Hybrid (OTel in one language, direct in another)** — lowest common + implementation cost but inconsistent contributor experience and two + operational models. + +**Decision:** **(2).** This JEP does not make OpenTelemetry (SDK or + Collector) part of the required reference architecture. Vendors and + operators who already run an OpenTelemetry Collector may scrape the + same `/metrics`, receive logs shipped by existing agents, or + receive the Loki body the hub would have sent — compatibility + is welcome; dependency is not mandatory. + +**Rationale:** + +The proposed Jumpstarter Telemetry service (**DD-7**) admittedly +reimplements a subset of OTel Collector functionality — metric +aggregation, log forwarding, backpressure, and multi-replica HA. The +decision to build a purpose-built component rather than adopt the OTel +Collector rests on three arguments, ordered by importance: + +1. **Identity enforcement (primary)** — The Telemetry service operates + inside Jumpstarter's existing authentication and trust domain (mTLS, + registered client and exporter identities). It validates that every + incoming `MetricsStream` or `PushLogs` call originates from the + claimed exporter — preventing impersonation or label + injection — using identities the platform already manages. A generic + OTel Collector has no awareness of Jumpstarter identities; achieving + the same guarantee would require an external auth policy layer + (e.g. custom processors, mTLS-to-attribute mapping, and a sidecar or + admission webhook to enforce label provenance), adding complexity + that offsets the Collector's generality. + +2. **Operational simplicity** — The Telemetry service is a single Go + binary with a single config surface (the operator CR), no separate + version matrix, and no generic pipeline DSL. An OTel Collector + requires operator familiarity with its configuration model + (receivers, processors, exporters, and connectors), dual OTel SDK + stacks (Go + Python) add version drift and test matrix, and the + Collector itself is another versioned service to upgrade and + monitor. This overhead is not justified when the data paths are + known in advance. + +3. **Narrow scope** — Jumpstarter metrics and lease events map directly + to Prometheus and Loki wire protocols that operators already use. 
+ Full three-pillar OTel (unified logs and metrics via OTLP) is + *optional product territory*; this JEP optimizes for low ceremony + and direct integration with exactly those two backends. + + +**Future extension:** because the Telemetry service already aggregates +metrics snapshots and structured log entries in well-defined formats, +adding an OTLP push output (logs and metrics) alongside the existing +Loki and `/metrics` paths would be a trivial change. This would let +operators route Jumpstarter data into an OTel Collector or any +OTLP-compatible backend without altering the exporter or client side. +The change is additive and does not require adopting the OTel SDK as a +project dependency. + +### DD-7: Optional Jumpstarter Telemetry service (dedicated Deployment vs. Controller/Router only) + +**Alternatives considered:** + +1. **In-process** in the Controller (and Router) reconciler — few + moving parts; risk of CPU / GC pressure and stronger coupling + between leases and high-volume increments or Loki writes. +2. A **dedicated** in-cluster Service and Deployment (working name + `jumpstarter-telemetry`, TBD) that: receives gRPC/HTTP increments from + exporters and clients, applies them to counters in memory, + POSTs to Loki, exposes `/metrics`, and uses the same K8s + ServiceAccount / mTLS as other control-plane binaries. +3. **Split** into separate sidecars (Loki-only, metrics-only) — more images to + build and version. +4. **Dedicated Deployment with reverse-scrape for metrics and push for + logs** — same dedicated `jumpstarter-telemetry` Deployment as **(2)**, + but instead of receiving increment RPCs the service reverse-scrapes + connected exporters via `MetricsStream` (see *API / Protocol + Changes*). Exporters maintain local `prometheus_client` registries; + the Telemetry service requests `generate_latest()` snapshots on + demand when its `/metrics` endpoint is hit, merges the results, and + serves them to Prometheus. Logs and events are still pushed by + exporters and clients via `PushLogs`. Client-side metrics are not + collected — all metrically-interesting operations are observable + from the exporter side. + +**Decision:** Prefer **(4)** for the optional aggregated-metrics + Loki + path at scale; allow **(1)** in small or dev clusters; **(3)** only + if review shows a need. In deployments without Loki, the Telemetry + service's own pod logs (structured JSON to stdout) still provide a + centralized, queryable event source via the cluster log shipper. + +**Rationale:** A dedicated workload can scale and restart independently; + Loki spikes and ingest load cannot starve lease reconciliation in the + controller. The reverse-scrape model **(4)** is preferred over the + increment-push model **(2)** because full counter state stays on the + exporter — no metrics are lost when the Telemetry service restarts or + is temporarily unavailable, and idempotency concerns are eliminated + (see **DD-9**). + +**Identity enforcement:** The Telemetry service validates the source + identity of every `MetricsStream` connection and `PushLogs` RPC from + the mTLS certificate or ServiceAccount token. The `exporter` and + `client` labels on incoming data are enforced server-side to match the + authenticated identity — a compromised or misconfigured exporter + cannot submit metrics under another exporter's name or inject + arbitrary labels. 
+ +**Failure modes:** + +| Scenario | Behavior | +| ------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Telemetry service unavailable | Exporters keep counting locally; no metrics are lost. When the exporter reconnects, the next scrape returns the full current counter state. Log push RPCs are fire-and-forget with bounded retry; log entries may be lost but device operations are unaffected. | +| Telemetry pod restart | Metric state is rebuilt on the next scrape from each connected exporter — no permanent data loss. Prometheus `rate()` and `increase()` handle the apparent counter reset transparently. | +| Loki unreachable | The Telemetry service buffers log entries in a bounded queue (see *Backpressure* in the control-plane section). On overflow, entries are dropped and `jumpstarter_telemetry_dropped_total` incremented. | +| Prometheus scrape fails | No data loss — the next successful scrape triggers a fresh fan-out to connected exporters and returns current values. | + + The Telemetry service exposes `/healthz` (liveness) and `/readyz` + (readiness, gated on Loki reachability and at least one connected + exporter) endpoints for Kubernetes probes. + +**Scrape fan-out:** When Prometheus hits `/metrics`, the Telemetry + service fans out `MetricsScrapeRequest` to **all connected exporters in + parallel** and waits up to `spec.telemetry.metrics.scrapeTimeout` + (default: 7 s) for responses. **Only metrics received during the + current fan-out are included in the response.** Exporters that do not + respond in time are omitted entirely — no cached or stale data is + ever served. This eliminates any risk of double-counting from stale + connections where the exporter may have already migrated to another + replica (see **DD-8**). + +**Memory budget:** During a scrape fan-out the Telemetry service + temporarily holds metric snapshots from responding exporters until the + merged response is written to Prometheus. With 200 exporters each + producing ~50 series (bounded by `{operation, result, driver_type}` + label combinations), the peak is ~10 000 series at ~200–300 bytes + each, costing ~2–3 MB. Snapshots are discarded as soon as the + `/metrics` response is flushed — no metric data is retained between + scrapes. + +### DD-8: Multiple Telemetry replicas (HA) and exporter-sticky connections + +**Context:** With the reverse-scrape model (see **DD-3** alternative 4 +and *API / Protocol Changes*), the Telemetry service does not hold +authoritative counter state — exporters maintain their own local +`prometheus_client` registries. The Telemetry service only caches the +latest metric snapshot per exporter. Each exporter opens a single +long-lived `MetricsStream` to one Telemetry replica. + +**Alternatives considered:** + +1. **Single replica** for Telemetry — no cross-pod `sum` issue; SPOF for + ingest and scrape of that `Service`. +2. **Multiple replicas** behind a load balancer; each RPC updates one + pod, which only advances its partial counters for the label + sets it has seen. Prometheus scrapes all pods (or separate + `PodMonitor` targets). In PromQL, + `sum by (exporter, operation, result, driver_type) (…)` after dropping + `pod` / `instance` matches the global total, as long as each real + event is applied at most once in the system (counters are + additive; increments are partitioned by traffic). +3. 
**Strong consistency** (Raft, Redis as source of truth for + counters) — higher operating cost than this JEP’s v1 scope. +4. **Multiple replicas with exporter-sticky connections** — each exporter + opens a single `MetricsStream` to one replica (sticky by stream). + Each replica only caches metric snapshots for its connected + exporters. Prometheus scrapes all replicas (via `PodMonitor`); + `sum by (exporter, operation, result, driver_type) (…)` after + dropping `pod` / `instance` yields the exact global total with no + double-counting, because each exporter’s metrics appear on exactly + one replica’s `/metrics` output. On replica failure the exporter + reconnects to a survivor and the next scrape returns its full + current counter state — no data is lost. + +**Decision:** **(4)** + +**Rationale:** Exporter-sticky connections naturally partition metric + snapshots across replicas with no overlap, so `sum` across replicas + is exact and double-counting is impossible. Full counter state lives + on the exporter, not on the Telemetry service, so replica restarts + or failovers cause no data loss. Loki log pushes (`PushLogs`) are + naturally per-replica as well and do not require deduplication. + Alternative (3) adds operational complexity with no benefit given + the reverse-scrape model. + +### DD-9: Idempotency vs. best-effort + +**Context:** With the reverse-scrape model, metrics idempotency is a +non-issue — each scrape returns the full current counter state from the +exporter, so there are no increments to deduplicate or double-count. +The only remaining idempotency concern is for `PushLogs` RPCs, where +a retry could result in duplicate log entries in Loki. + +**Alternatives considered:** + +1. **Idempotent** log pushes (deduplication keys per `LogEntry`) — + appropriate for billing- or SLO-sensitive log pipelines; requires + a dedup store or Loki-side dedup. +2. **Best effort** (at-least-once) for `PushLogs` without global + deduplication — simpler; rare duplicate log entries on retries. +3. **Metrics idempotency** (dedup keys on metric increments) — no + longer applicable; the reverse-scrape model returns full state, + making increment deduplication moot. + +**Decision:** (2) for `PushLogs`; metrics idempotency is not needed. + +**Rationale:** Duplicate log entries from occasional retries are + acceptable for informative/diagnostic logs. Loki queries are + tolerant of rare duplicates. No global dedup store is needed in v1; + operators treat these logs as diagnostic signals, not audit trails. + +### DD-10: Perses over Grafana for dashboarding + +**Alternatives considered:** + +1. **Grafana** — mature, widely deployed, massive plugin and datasource + ecosystem; governed by Grafana Labs (commercial); AGPL v3 license; + custom JSON dashboard format; external to Kubernetes architecture. +2. **Perses** — CNCF project (vendor-neutral governance); Apache 2.0 + license; standardized dashboard spec (CUE/JSON) with built-in static + validation and SDKs for GitOps; Kubernetes-native (CRD support for + dashboards-as-code); data-source focus on Prometheus, Loki, and + Tempo — exactly the backends this JEP targets. + +**Decision:** **(2)** + +**Rationale:** + +- **License alignment** — Jumpstarter is Apache 2.0; recommending an + AGPL-licensed dashboard layer introduces license friction for downstream + distributors and embedders. +- **CNCF governance** — vendor-neutral stewardship matches the project's + open-source posture; no single-vendor control over the dashboard layer. 
+- **Kubernetes-native CRDs** — dashboards can be managed as K8s resources, + fitting the same declarative, reconciler-driven model Jumpstarter already + uses for Leases, Exporters, and the optional Telemetry Deployment. +- **GitOps and validation** — CUE-based specs with static validation and SDKs + enable dashboard-as-code in CI pipelines, consistent with the JEP's emphasis + on automation and CI integration. +- **Backend focus** — Perses targets Prometheus, Loki, and Tempo — exactly the + three backends this JEP standardizes on — without carrying the cost of a + broad plugin ecosystem the project does not need. + +**Perses vs Grafana — practical comparison:** + +| Aspect | Perses | Grafana | +| -------------------- | --------------------------------------- | ------------------------------------------ | +| License | Apache 2.0 | AGPL v3 | +| Governance | CNCF (vendor-neutral) | Grafana Labs (commercial) | +| Dashboard-as-code | CUE/JSON spec, static validation, SDKs | JSON export, no built-in validation | +| K8s-native CRDs | Yes | Via third-party operator (grafana-operator)| +| Exemplar rendering | Not yet (upstream roadmap) | Yes (>= 7.4) | +| Data-source scope | Prometheus, Loki, Tempo | Broad plugin ecosystem | +| Maturity / ecosystem | Early (CNCF sandbox/incubating) | Mature, widely deployed | + +The main Perses gap today is exemplar visualization. Operators who need +exemplar overlays on dashboards should use Grafana alongside Perses or +wait for upstream support. Grafana remains fully compatible — all +`/metrics` and Loki endpoints are standard — so the choice is +non-exclusive. + +Operators who prefer Grafana can still point it at the same `/metrics` and Loki +endpoints; this DD only governs the *recommended* dashboard experience. + +## Design Details + +### Correlation and fields + +*Subject to review — names and cardinality rules should be fixed before +"Implemented".* + +| Field / label | Prom label | Prom exemplar | Loki stream | Log line | Notes | +| -------------------------------- | :--------: | :-----------: | :---------: | :------: | --------------------------------------------------- | +| `exporter` | yes | — | yes | yes | CRD name; bounded by cluster size. | +| `operation` | yes | — | no | yes | Small fixed enum (flash, power, …). | +| `result` | yes | — | no | yes | Small fixed enum (success, failure, …). | +| `driver_type` | yes | — | no | yes | Category from a predefined set in core (storage, power, …). | +| `error_type` | yes | — | no | yes | Failure class (timeout, device_error, …); on errors. | +| `direction` | yes | — | no | yes | tx / rx; for byte-counter and stream metrics only. | +| `component` | no | — | yes | yes | Fixed set (cli, controller, router, telemetry, exporter).| +| `namespace` | no | — | yes | yes | K8s namespace; bounded. | +| `lease_id` | **no** | yes | **no** | yes | Unbounded; exemplar for drill-down. | +| `client` | **no** | yes | **no** | yes | CRD name; exemplar for client identity. | +| `image_digest`, `build_id`, etc. | **no** | yes | **no** | yes | From `spec.context`; included when listed in `exemplarKeys`. | +| `trace_id` / `span_id` | **no** | yes | **no** | yes | W3C; links metrics to traces via exemplars. | +| *`exporterLabels` keys* | **no** | yes | **no** | yes | From Exporter CRD labels; included when listed in `exemplarKeys`. 
| + +Additional `lease.spec.context` correlation fields can be added at runtime; +they appear as structured log line fields and, when listed in the operator's +`exemplarKeys` allowlist, as Prometheus exemplar keys (see *Exemplars for +high-cardinality context* below and *Operator configuration*). + +### Cardinality guidelines + +Unbounded identifiers (`lease_id`, `client`, `image_digest`, `trace_id`, and +any operator-defined `spec.context` keys) must not be used as Prometheus metric +labels or Loki stream labels. They belong inside structured log line JSON +and Prometheus exemplars (see below), where Loki filter expressions +(`| json | lease_id = "…"`) and dashboard exemplar overlays can surface them +without inflating the label index or TSDB series count. + +Rules of thumb for this JEP: + +- **Prometheus labels**: each metric label dimension should have < 100 distinct + values per scrape target. The label set for Jumpstarter metrics is + `{exporter, operation, result, driver_type}` — all bounded enums. + `error_type` is added on failure-path metrics and `direction` on + byte-counter metrics. High-cardinality context is carried via exemplars, + not labels. +- **Loki**: stream labels should be a small fixed set (`{component, exporter, + namespace}`) to keep active stream count per tenant manageable (Grafana's + guidance: < 100 k active streams). High-cardinality fields go inside the log + line body. +- **Lease context fields** from `spec.context` are propagated into log line + JSON and, when listed in `exemplarKeys`, into Prometheus exemplars. They + never become Prometheus labels or Loki stream labels. + +#### Exemplars for high-cardinality context + +Prometheus exemplars attach arbitrary key-value pairs to individual counter +increments and histogram observations without creating new time series. This +is the primary mechanism this JEP uses to surface per-request context +(`client`, `lease_id`, and `trace_id` when present) on metrics while keeping series cardinality +flat. + +Default exemplar keys emitted on every counter/histogram observation: + +| Key | Source | Purpose | +| ---------- | --------------------- | ----------------------------------------------- | +| `client` | Client CRD name | "Which client caused this spike?" | +| `lease_id` | Lease UID | Correlate a metric sample with lease logs. | +| `trace_id` | W3C `traceparent` | Included **only when present** in gRPC metadata.| + +`trace_id` is not synthesized by Jumpstarter — it is included only when +an external caller (CI pipeline, user code) propagates a `traceparent`. +Full distributed tracing (spans, storage, visualization) is deferred to +a future JEP; when it lands, `trace_id` becomes a default key. Until +then, omitting it saves ~45 characters of exemplar budget. + +`spec.context` keys (e.g. `build_id`, `image_digest`) are included as +exemplar keys when listed in the operator's `exemplarKeys` allowlist (see +*Operator configuration*). Because exemplars are per-observation metadata — +not label dimensions — they have zero impact on series cardinality regardless +of how many distinct values appear. + +**Exemplar size budget:** The OpenMetrics 1.0 limit is 128 UTF-8 +characters for the combined key-value pairs in a single exemplar. +The two default keys (`client`, `lease_id`) consume roughly 30–50 +characters, leaving ~80–100 characters for `spec.context` entries +(or more when `trace_id` is absent). To stay within budget: + +1. Default keys (`client`, `lease_id`) are always included first. 
+ `trace_id` is added when present in the request context. +2. `spec.context` keys are added in alphabetical order until the 128-char + limit is reached; remaining keys are silently dropped from the + exemplar (they remain available in structured log lines). +3. The `Lease` CRD validates `spec.context` at admission time: key names + are limited to 32 characters, values to 64 characters, and the total + number of entries to 8. This prevents accidental budget exhaustion and + ensures exemplar truncation is rare in practice. + +**Dashboard visualization**: when exemplars are enabled on a Prometheus data +source, metric panels render clickable dots on each sample that carries +exemplar data. Clicking a dot reveals the attached keys and can link to +Loki log queries (filtered by `lease_id`) or a Tempo trace view (filtered +by `trace_id`). + +Per-client analysis remains available via LogQL for operators who do not +use exemplars: +`sum by (client) (count_over_time({component="exporter"} | json | operation="flash" [5m]))`. + +### Proposed metrics + +*Names are illustrative; final naming should follow +[Prometheus naming conventions](https://prometheus.io/docs/practices/naming/) +and be fixed before "Implemented".* + +| Metric name | Type | Labels | Description | +| -------------------------------------------- | --------- | -------------------------------------------- | ----------------------------------------- | +| `jumpstarter_operations_total` | counter | `exporter`, `operation`, `result`, `driver_type` | Total operations performed. | +| `jumpstarter_operation_duration_seconds` | histogram | `exporter`, `operation`, `result`, `driver_type` | Duration of each operation. | +| `jumpstarter_operation_errors_total` | counter | `exporter`, `operation`, `driver_type`, `error_type` | Errors by class (timeout, device, …). | +| `jumpstarter_stream_bytes_total` | counter | `exporter`, `driver_type`, `direction` | Bytes transferred (tx/rx) on streams. | +| `jumpstarter_active_sessions` | gauge | `exporter` | Currently active lease sessions. | +| `jumpstarter_lease_acquisitions_total` | counter | `result` | Lease acquire attempts (controller). | +| `jumpstarter_telemetry_dropped_total` | counter | `destination` | Log entries dropped due to backpressure (e.g. `destination="loki"`). | +| `jumpstarter_scrape_timeouts_total` | counter | `exporter` | Scrape fan-out timeouts per exporter (Telemetry-side). | + +All counters and histograms carry exemplar keys from the operator's +`exemplarKeys` allowlist (by default `client` and `lease_id`; `trace_id` +when present; `spec.context` and `exporterLabels` entries when listed) +on every observation. + +### Metric usage and alerting + +| Metric | Primary use | Alert? | Starter threshold | +| -------------------------------------------- | ----------- | :----: | ---------------------------------------------- | +| `jumpstarter_operations_total` | Dashboard | yes | Failure rate > 20 % over 15 min per exporter. | +| `jumpstarter_operation_duration_seconds` | Dashboard | yes | p95 > 60 s per operation type. | +| `jumpstarter_operation_errors_total` | Dashboard | yes | Error rate rising; group by `error_type`. | +| `jumpstarter_stream_bytes_total` | Dashboard | no | — | +| `jumpstarter_active_sessions` | Dashboard | yes | 0 sessions for > 30 min (possible exporter issue). | +| `jumpstarter_lease_acquisitions_total` | Dashboard | yes | Failure rate > 10 % over 15 min. | +| `jumpstarter_telemetry_dropped_total` | Alerting | yes | Any increment (telemetry pipeline saturated). 
|
+| `jumpstarter_scrape_timeouts_total`           | Alerting    |  yes   | Repeated timeouts for same exporter (connectivity or load issue). |
+
+Thresholds are suggestions; operators should tune them to their
+environment. The operator should ship a set of example `PrometheusRule`
+CRDs based on the table above that operators can enable and customize.
+These rules are opt-in and disabled by default to avoid noise in
+environments with different baselines.
+
+**High-frequency byte counters:** `jumpstarter_stream_bytes_total` can
+be incremented at very high rates on serial and video streams. Because
+metrics live in the exporter's local `prometheus_client` registry, high
+update rates do not generate any RPC traffic — the counter is updated
+in-process and only serialized when the Telemetry service sends a
+`MetricsScrapeRequest`.
+
+### Example queries
+
+#### PromQL (Prometheus)
+
+**Flash failure rate per exporter:**
+
+```promql
+sum by (exporter) (rate(jumpstarter_operations_total{operation="flash", result="failure"}[5m]))
+/
+sum by (exporter) (rate(jumpstarter_operations_total{operation="flash"}[5m]))
+```
+
+**p95 flash duration per driver type:**
+
+```promql
+histogram_quantile(0.95,
+  sum by (driver_type, le) (rate(jumpstarter_operation_duration_seconds_bucket{operation="flash"}[5m]))
+)
+```
+
+**Top 5 busiest exporters (all operations, 1 h window):**
+
+```promql
+topk(5, sum by (exporter) (rate(jumpstarter_operations_total[1h])))
+```
+
+**Alert: exporter flash failure rate > 20% over 15 min:**
+
+```promql
+(
+  sum by (exporter) (rate(jumpstarter_operations_total{operation="flash", result="failure"}[15m]))
+  /
+  sum by (exporter) (rate(jumpstarter_operations_total{operation="flash"}[15m]))
+) > 0.2
+```
+
+**Error breakdown by class for a specific driver:**
+
+```promql
+sum by (error_type) (rate(jumpstarter_operation_errors_total{driver_type="storage"}[1h]))
+```
+
+**Bytes per second by exporter and direction:**
+
+```promql
+sum by (exporter, direction) (rate(jumpstarter_stream_bytes_total[5m]))
+```
+
+**Exporters with repeated scrape timeouts (last 30 min):**
+
+```promql
+topk(10, sum by (exporter) (increase(jumpstarter_scrape_timeouts_total[30m])))
+```
+
+**HA Telemetry: aggregate across replicas (drop pod/instance):**
+
+```promql
+sum by (exporter, operation, result, driver_type) (rate(jumpstarter_operations_total[5m]))
+```
+
+#### LogQL (Loki)
+
+**All flash events for a specific lease:**
+
+```text
+{component="exporter"} | json | operation="flash" | lease_id="<lease-uid>"
+```
+
+**Flash failures per client over 5 min (log-based, no exemplars needed):**
+
+```text
+sum by (client) (
+  count_over_time({component="exporter"} | json | operation="flash" | result="failure" [5m])
+)
+```
+
+**Controller logs for a specific lease (post-mortem):**
+
+```text
+{component="controller"} | json | lease_id="<lease-uid>"
+```
+
+**Error events across all exporters in a namespace:**
+
+```text
+{component="exporter", namespace="production"} | json | result="failure"
+```
+
+**Telemetry service health (its own operational logs):**
+
+```text
+{component="telemetry"} | json | level="error"
+```
+
+### Control-plane aggregation (Controller / Router / optional Telemetry)
+
+When this mode is enabled in a deployment:
+
+- Exporters maintain local `prometheus_client` registries and open a
+  `MetricsStream` to the optional `jumpstarter-telemetry` service
+  (**DD-7**).
On each Prometheus scrape the Telemetry service fans out + `MetricsScrapeRequest` to all connected exporters in parallel, merges + the responses, and serves the combined output on `/metrics` + (**DD-3**). HA (multiple replicas with exporter-sticky connections) + uses `sum` in PromQL (**DD-8**). Exporter and edge processes never + need Loki or cluster-scrape credentials directly (**DD-5**). +- Exporters and clients (`jmp`) push structured log entries to the + Telemetry service via `PushLogs`. The Telemetry service forwards + these to Loki. Best-effort duplicate tolerance applies (**DD-9**). +- Controller and Router emit structured JSON logs to stdout + (see **DD-4**). They do not push logs directly to Loki; a cluster-level + log shipper (Promtail, Grafana Alloy, Vector, or equivalent) scrapes + their pod logs and delivers them to Loki. This decouples the reconciler + and session-handling hot paths from Loki availability. +- **Backpressure:** The Telemetry service uses a bounded ring buffer + for the Loki log push path with a configurable depth + (default: 10 000 entries, see `spec.telemetry.backpressure.queueDepth`). + On overflow, dropped entries are replaced by a single **drop marker** + — a synthetic log entry recording the count of dropped entries and the + time window. Subsequent drops while the buffer is still full + accumulate into the same marker rather than adding new entries, so the + queue always retains one slot for the current drop summary. When the + buffer drains and the marker is flushed, the downstream log contains + an explicit record such as + `{"level":"warn","msg":"entries dropped","count":142,"window_seconds":12}`. + A `jumpstarter_telemetry_dropped_total` counter (partitioned by + `destination={loki}`) is also incremented on `/metrics` for alerting. + Metrics do not need backpressure — the reverse-scrape model is + pull-based and transient (no buffering between scrapes). + Because the Controller and Router do not push to Loki, their + lease/session operations are inherently isolated from Loki slowdowns. +- **Multi-tenancy:** write-side tenant scoping (e.g. namespace-based + separation in Loki and Prometheus) is a deployment concern handled by + the log shipper and Prometheus configuration. Read-side access control + (who can query which metrics or logs) is likewise a deployment concern + and out of scope for this JEP. +- Metric facts originate on the exporter (local `prometheus_client` + counters/histograms); the Telemetry service is a transparent + scrape-aggregation proxy. Controller and Router expose their own + `/metrics` for Prometheus scrape and rely on the log shipper for + their stdout logs. + +### High-level data flow + +#### Client (`jmp`) + +```{mermaid} +flowchart LR + jmp([jmp CLI]) -->|session gRPC| exp[Exporter] + jmp -->|structured logs| tel[jumpstarter-telemetry] +``` + +The CLI connects to the Exporter for device sessions and sends structured +logs to the Telemetry service for Loki ingest (see **DD-4**). + +#### Exporter + +```{mermaid} +flowchart LR + ctrl[jumpstarter-controller] -->|lease lifecycle| exp[Exporter] + drv[Drivers] --> exp + exp <-->|MetricsStream| tel[jumpstarter-telemetry] + exp -->|PushLogs| tel +``` + +The Controller assigns leases; the Exporter delegates to Drivers and +maintains local `prometheus_client` counters. It opens a `MetricsStream` +to Telemetry for reverse-scrape and pushes structured logs via `PushLogs` +(see **DD-2**, **DD-3**, **DD-5**, **DD-7**). 
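+
+As a concrete illustration of the exporter side, the sketch below
+registers two of the proposed metrics in a local `prometheus_client`
+registry and attaches exemplars on every observation. Metric names
+mirror the *Proposed metrics* table; the `record_flash` helper and its
+arguments are illustrative, not a fixed API.
+
+```python
+from prometheus_client import CollectorRegistry, Counter, Histogram
+from prometheus_client.openmetrics.exposition import generate_latest
+
+REGISTRY = CollectorRegistry()
+
+OPERATIONS = Counter(
+    "jumpstarter_operations",  # exposed as jumpstarter_operations_total
+    "Total operations performed.",
+    ["exporter", "operation", "result", "driver_type"],
+    registry=REGISTRY,
+)
+DURATION = Histogram(
+    "jumpstarter_operation_duration_seconds",
+    "Duration of each operation.",
+    ["exporter", "operation", "result", "driver_type"],
+    registry=REGISTRY,
+)
+
+
+def record_flash(exporter, result, seconds, exemplar):
+    # `exemplar` carries only allowlisted keys, e.g. {"lease_id": ...,
+    # "client": ...}; prometheus_client rejects exemplars whose label
+    # set exceeds the 128-character OpenMetrics budget.
+    labels = {"exporter": exporter, "operation": "flash",
+              "result": result, "driver_type": "storage"}
+    OPERATIONS.labels(**labels).inc(1, exemplar=exemplar)
+    DURATION.labels(**labels).observe(seconds, exemplar=exemplar)
+
+
+def scrape_payload():
+    # Serialized only when a MetricsScrapeRequest arrives over the
+    # MetricsStream; exemplars appear only in the OpenMetrics format.
+    return generate_latest(REGISTRY)
+```
+
+Between scrapes the counters are updated purely in-process, which is
+what keeps high-frequency paths such as stream byte counts free of RPC
+traffic.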
+ +#### Telemetry to backends + +```{mermaid} +flowchart LR + prom[(Prometheus)] -->|scrape /metrics| tel[jumpstarter-telemetry] + tel <-->|MetricsStream fan-out| exp[Exporters] + tel -->|push API| loki[(Loki)] + tel -->|JSON stdout| shipper[Log shipper] + shipper -->|pod logs| loki +``` + +On each Prometheus scrape, Telemetry fans out `MetricsScrapeRequest` to +all connected exporters in parallel, merges responses, and serves the +combined output. Logs received via `PushLogs` are forwarded to Loki +(**DD-3**, **DD-7**, **DD-8**). + +#### Controller to backends + +```{mermaid} +flowchart LR + ctrl[jumpstarter-controller] -->|JSON stdout| shipper[Log shipper] + shipper -->|pod logs| loki[(Loki)] + ctrl -->|/metrics| prom[(Prometheus)] +``` + +The Controller writes structured JSON to stdout (see **DD-4**). A +cluster log shipper scrapes pod logs and delivers them to Loki. The +Controller exposes `/metrics` for reconciliation and lease-level counters. + +#### Router to backends + +```{mermaid} +flowchart LR + router[jumpstarter-router] -->|JSON stdout| shipper[Log shipper] + shipper -->|pod logs| loki[(Loki)] + router -->|/metrics| prom[(Prometheus)] +``` + +The Router writes structured JSON to stdout (see **DD-4**). A +cluster log shipper scrapes pod logs and delivers them to Loki. The +Router exposes `/metrics` for routing and session-level counters. + +The diagrams above summarize the reverse-scrape hub model described in +*Control-plane aggregation*. For credential isolation see **DD-5**; for +the Telemetry Deployment see **DD-7**; for HA with exporter-sticky +connections see **DD-8**; for best-effort log semantics see **DD-9**. +No OpenTelemetry Collector is *required* (see **DD-6**); operators may +run one *alongside* and scrape the same targets if they choose. + +### Common open-source backends (direct integration; no mandatory OTel) + +This JEP’s target wire protocols and components are Prometheus and +Loki (and, if trace export is ever added, Tempo or Jaeger with +native ingest or HTTP — not OTLP as a *Jumpstarter* requirement; see +**DD-6**). OpenTelemetry is a parallel ecosystem: teams can run a +Collector next to Jumpstarter and still scrape `/metrics` and ship +logs with Promtail-class agents; the reference design does not depend +on the OTel SDK in application code. + +- Prometheus for metrics (and Alertmanager for routing alerts): scrape + the `/metrics` endpoint, remote-write to long-term store if needed, and drive + dashboards in Perses or self-hosted UIs (see **DD-10**). `kube-state-metrics` and + the Prometheus Operator are common in Kubernetes; vendors often package + the same projects, but this JEP refers to the open-source components by name. +- Loki (Grafana Labs, AGPL) for log storage and querying; it pairs with + Perses (see **DD-10**) for search and with Promtail, Grafana + Agent, or Grafana Alloy to ship logs, or with application push to Loki’s HTTP API as + already discussed in the control-plane path. +- Traces (optional, future work) — if adopted, Grafana Tempo and Jaeger + are typical stores; use W3C Trace Context in RPC metadata for + correlation even when full trace export is off. OTLP may be + *only* a convenience for operators; it is not a JEP-0011 core + dependency. 
+- A typical Kubernetes integration path: `ServiceMonitor` + Prometheus + (or a compatible remote-write consumer), a Loki endpoint for logs + — any EKS, GKE, AKS, self-managed + Kubernetes, or bare-metal install that runs these same projects can be the + target; the implementation + plan should name tested combinations (Prometheus and Loki version + pairs where relevant) in `Implementation History`, not a single product bundle. + +### Operator configuration + +The Jumpstarter operator CR controls telemetry behavior cluster-wide. +Observability settings live under `spec.telemetry` so that administrators +can tune metrics, logging, and exemplar behavior without editing code. + +**Key configurable fields:** + +| Field | Type | Default | Description | +| ----------------------------------------- | ---------- | ------------------------------------------------ | ---------------------------------------------------------------------------------------------- | +| `spec.telemetry.enabled` | `bool` | `false` | Deploy the optional Telemetry service. | +| `spec.telemetry.loki.url` | `string` | — | Loki push endpoint; optional — Telemetry can run metrics-only without Loki. | +| `spec.telemetry.loki.secretRef` | `string` | — | Secret with Loki credentials (see **DD-5**). | +| `spec.telemetry.loki.tls.caSecretRef` | `string` | — | Secret containing a CA bundle (`ca.crt` key) to trust for the Loki endpoint. | +| `spec.telemetry.loki.tls.insecureSkipVerify` | `bool` | `false` | Disable TLS certificate verification (development/testing only). | +| `spec.telemetry.exporterLabels` | `[]string` | `[]` | Exporter-level label keys (e.g. `board-type`) copied from Exporter CRD labels into log JSON fields and exemplar candidates. | +| `spec.telemetry.metrics.exemplarKeys` | `[]string` | `["client", "lease_id"]` | Allowlist of keys to include in exemplars (including `spec.context` and `exporterLabels` keys). Only listed keys are emitted; unlisted keys are omitted even if present. | +| `spec.telemetry.metrics.driverTypeEnum` | `[]string` | `["power", "storage", "network", "serial", …]` | Allowed `driver_type` label values. Drivers reporting an unlisted type are mapped to `other`. | +| `spec.telemetry.metrics.serviceMonitor` | `bool` | `true` | Create `ServiceMonitor` CRDs for Prometheus autodiscovery. | +| `spec.telemetry.metrics.prometheusRules` | `bool` | `false` | Deploy starter `PrometheusRule` CRDs (opt-in). | +| `spec.telemetry.metrics.scrapeTimeout` | `duration` | `7s` | Max time to wait for parallel exporter responses during a `/metrics` fan-out. Should be set lower than the Prometheus-side `scrape_timeout` to leave headroom for HTTP transport. | +| `spec.telemetry.backpressure.queueDepth` | `int` | `10000` | Ring buffer depth for Loki log push queue. | + +**Example CR snippet:** + +```yaml +apiVersion: operator.jumpstarter.dev/v1alpha1 +kind: Jumpstarter +metadata: + name: jumpstarter +spec: + telemetry: + enabled: true + exporterLabels: + - board-type + loki: + url: "https://loki-gateway.monitoring.svc:3100/loki/api/v1/push" + secretRef: "loki-credentials" + tls: + caSecretRef: "loki-ca-bundle" + metrics: + exemplarKeys: + - client + - lease_id + - build_id + - board-type + driverTypeEnum: + - power + - storage + - network + - serial + - console + - video + - composite + serviceMonitor: true + prometheusRules: true + scrapeTimeout: "7s" + backpressure: + queueDepth: 20000 +``` + +The `driverTypeEnum` list acts as an allowlist: drivers must select a +category from this set (or fall back to `other`). 
This keeps the
+`driver_type` Prometheus label bounded and prevents cardinality
+surprises from third-party drivers. Administrators can extend the list
+for site-specific driver categories.
+
+The `exporterLabels` list names Exporter CRD label keys whose values
+are copied into every structured log line as JSON fields and made
+available as exemplar candidates for operations involving that
+exporter. For example, setting `exporterLabels: ["board-type"]` means
+an Exporter with the label `board-type: rpi4` will include
+`"board-type": "rpi4"` in its structured log lines and in the exemplar
+candidate pool. The list is empty by default — no exporter labels are
+propagated unless the administrator opts in.
+
+The `exemplarKeys` list is an **allowlist** that controls which keys are
+included in Prometheus exemplars. This filters *everything* — built-in
+keys (`client`, `lease_id`), `spec.context` keys, and `exporterLabels`
+keys alike. Only keys present in `exemplarKeys` are emitted; unlisted
+keys are omitted even if available. This gives administrators full
+control over exemplar budget usage: adding `board-type` to both
+`exporterLabels` and `exemplarKeys` propagates hardware type into
+exemplars, while removing `lease_id` frees budget for other entries.
+
+**Loki transport:** During implementation, evaluate whether the Telemetry
+service should connect to Loki via the HTTP push API
+(`/loki/api/v1/push`) or the gRPC endpoint. gRPC may offer better
+throughput and streaming semantics (aligned with Jumpstarter's existing
+gRPC infrastructure), while the HTTP API is simpler to debug and more
+broadly supported by Loki-compatible backends. The `spec.telemetry.loki.url`
+field should accept either scheme (`http://` / `grpc://`) so the choice
+remains a deployment decision.
+
+**Loki TLS:** Many deployments terminate Loki behind a TLS endpoint
+with an internal or self-signed CA. The `spec.telemetry.loki.tls`
+subsection follows the same pattern as the existing operator TLS
+configuration: `caSecretRef` names a Kubernetes Secret whose `ca.crt`
+key contains the PEM-encoded CA bundle to trust. When set, the
+Telemetry service adds this CA to its TLS root pool when connecting to
+Loki. `insecureSkipVerify` disables certificate verification entirely
+and should only be used in development or testing environments.
+
+## Test Plan
+
+### Unit Tests
+
+- Log field builders and redaction: ensure defaults strip secrets; optional
+  fields behind flags.
+- Metric registration helpers: label validation and naming conventions.
+
+### Integration Tests
+
+- Operator + exporter: scrape or receive metrics; assert presence of a minimal
+  documented set of series after a known operation.
+- If the control-plane forward path is implemented: with a test Loki and
+  a Prometheus-compatible sink (or mock), assert that records arrive with expected
+  correlation fields (`lease_id`, `exporter`, …) and that exporter pods do not require
+  Loki or cluster-scrape credentials in their spec.
+- If Telemetry runs with >1 replica: one test verifies that
+  `sum` by business labels (dropping `pod`/`instance`) matches expected totals with exporter-sticky connections (see **DD-8**).
+- Lease with metadata: objects validate; events or status updates match expected
+  structure.
+
+### Hardware-in-the-Loop
+
+- Flashing and power paths: at least one driver records an event and/or
+  metrics counter on success and failure on real hardware in a lab.
+- Serial and stream paths expose tx/rx byte counts (see the assertion
+  sketch below).
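+
+A minimal sketch of the kind of assertion such a lab test can make,
+assuming the test harness can reach the exporter's local
+`prometheus_client` registry (the helper name and label values are
+illustrative):
+
+```python
+def assert_stream_bytes(registry, exporter, direction, minimum):
+    # Read the current counter value straight from the exporter's local
+    # registry -- no Telemetry service or Prometheus deployment needed.
+    value = registry.get_sample_value(
+        "jumpstarter_stream_bytes_total",
+        {"exporter": exporter, "driver_type": "serial", "direction": direction},
+    )
+    assert value is not None, "series missing: no bytes recorded"
+    assert value >= minimum, f"expected >= {minimum} bytes {direction}, got {value}"
+```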
+
+### Independent testability
+
+Each component must be testable in isolation without deploying the full
+stack:
+
+- **Structured logging**: unit tests validate JSON output format, base
+  fields, and `spec.context` propagation using an in-memory logger — no
+  Loki required.
+- **Exporter metrics**: unit tests verify counter/histogram registration,
+  label correctness, and exemplar attachment using a local Prometheus
+  registry — no Telemetry service required.
+- **Telemetry service**: integration tests use mock gRPC clients and a
+  mock Loki endpoint to verify ingest, counter aggregation, backpressure
+  behavior, and drop markers — no real exporters required.
+- **Operator configuration**: unit tests validate CRD admission
+  (e.g. `spec.context` size limits) and `ServiceMonitor` generation.
+
+### End-to-end (CI)
+
+The full telemetry pipeline should be exercised in GitHub Actions CI.
+Evaluate feasibility of running a minimal Prometheus + Loki stack inside
+the CI environment (e.g. single-binary mode containers); if resource
+constraints make this impractical, at minimum:
+
+- **Loki mock or single-binary**: a lightweight Loki instance (or a mock
+  HTTP/gRPC endpoint that validates the Loki push API contract) receives logs
+  from the Telemetry service and asserts expected fields, stream labels,
+  and `spec.context` propagation across the full exporter → Telemetry →
+  Loki path.
+- **Prometheus scrape**: the existing Go/Ginkgo E2E test suite performs
+  direct HTTP scrapes of the `/metrics` endpoints on Controller, Router,
+  and Telemetry services — no separate Prometheus instance required. The
+  test parses the OpenMetrics response and asserts that documented
+  series, labels, and exemplars appear after a known operation sequence.
+- **Correlation round-trip**: an E2E test runs a lease lifecycle (create →
+  flash → power-cycle → release) and verifies that the same `lease_id`
+  and `exporter` values appear in both scraped metrics (label or
+  exemplar) and ingested log entries, confirming cross-signal
+  correlation.
+
+Feasibility of this stack should be evaluated early (Phase 1) so that
+all subsequent phases have E2E coverage from the start.
+
+### Manual
+
+- `jmp` default output remains readable; JSON structured logs are only sent
+  to jumpstarter-telemetry for general log ingest.
+
+## Acceptance Criteria
+
+- [ ] Exporter (or sidecar) exposes a documented metrics surface; drivers
+  can contribute without reimplementing the HTTP server ad hoc in each
+  driver.
+- [ ] Controller and one data-plane service emit structured logs with a
+  documented minimum field set.
+- [ ] Operator provides a configuration section to enable metrics, with the
+  endpoint details and secret references needed to integrate with Loki
+  for log push.
+- [ ] Operator attempts to auto-configure Prometheus scraping of the
+  documented `/metrics` endpoints.
+- [ ] A JSON schema (or equivalent machine-readable specification) is
+  published for the structured log format, enabling consumers to
+  validate log entries and detect regressions in field names or types.
+- [ ] Backward compatibility: existing clients and manifests without the new
+  fields continue to work; deployments that do not use hub forwarding
+  behave as today.
+
+## Graduation Criteria
+
+### Experimental (first release behind flag or doc-only)
+
+- JEP in Discussion; partial implementation; known gaps listed in
+  *Unresolved Questions*.
+
+### Stable
+
+- Acceptance criteria met; SLOs for log volume and metric cardinality
+  documented; upgrade notes for the operator and CLI.
+
+## Backward Compatibility
+
+- New CRD fields and labels must be optional; existing lease flows unchanged.
+- gRPC: new metadata must be additive; servers tolerate missing trace and
+  context fields from older clients; clients ignore unknown fields where
+  applicable.
+- **`AuditStream` removal:** The `AuditStream` RPC and `AuditStreamRequest`
+  message on `ControllerService` are removed. This RPC was never implemented
+  or called by any client — a grep across the codebase confirms zero usage
+  outside its protobuf definition. Removing it is a no-op for all existing
+  deployments. The new `PushLogs` RPC on `TelemetryService` supersedes the
+  intended use case.
+- `LogStreamResponse` enrichment (new optional fields `driver_type`,
+  `operation`, `timestamp`, `structured_fields`) is purely additive and
+  backward-compatible — existing clients ignore unknown fields.
+- No removal of current default CLI behavior; JSON logging only when selected.
+
+## Consequences
+
+### Positive
+
+- **Operators** can route logs and metrics to existing Prometheus, Loki,
+  and Perses-based stacks (self-hosted or platform-managed under
+  the hood) without a mandatory OpenTelemetry Collector in front of
+  Jumpstarter (see **DD-6**, **DD-10**).
+- **CI** can correlate a failed run to equipment and build metadata.
+- **Driver authors** get a single pattern for operation counters and event
+  emission.
+- **Security-conscious** users can run with minimal log fields and no tracing.
+- **Operators** can keep Loki, Prometheus, and related API tokens in-cluster
+  only; exporters keep a single Jumpstarter trust relationship (**DD-5**).
+- The optional Telemetry service isolates Loki/series work from the reconciler
+  (**DD-7**, **DD-8**); Controller and Router carry no Loki client dependency,
+  so a Loki outage cannot affect lease operations (**DD-4**).
+
+### Negative
+
+- More code paths, dependencies (for example a Prometheus client
+  library, Loki HTTP client, and structured log helpers), and
+  operability and documentation burden.
+- Operators must run a functioning cluster log shipper (Promtail, Grafana
+  Alloy, Vector, or equivalent) to see Controller and Router logs in Loki.
+  This is near-universal in production Kubernetes but worth documenting for
+  minimal or dev clusters.
+
+### Risks
+
+- High-cardinality metadata accidentally promoted to metric *labels* could
+  overload the TSDB. The *Cardinality guidelines* section restricts labels
+  to bounded enums and routes variable context through exemplars and log
+  line fields instead.
+- Exemplars require the OpenMetrics exposition format and Prometheus >= 2.26
+  with exemplar storage enabled (on by default since Prometheus 2.39).
+  Operators on older Prometheus versions still get full metrics and logs;
+  exemplar-based drill-down is unavailable until they upgrade.
+- Prometheus / Loki / Perses-stack version drift in the field
+  — document tested pairs; W3C Trace Context propagation in gRPC remains
+  best-effort across Python and Go (`traceparent` is carried in request
+  metadata where needed, without requiring an OTel SDK).
+
+## Rejected Alternatives
+
+- **"All metrics and facts are *generated* only in the controller"** — would
+  miss per-exporter and per-driver truth; rejected. *Forwarding*
+  exporter-originated series and events *through* the control-plane (with
+  stable labels) is not the same and remains in scope (see **DD-5**).
+- *Requiring Loki- and Prometheus-ingest credentials on every exporter + and edge* as the only supported model — rejected in favor of + optional hub + forwarding and of cluster-native collectors that also avoid per-host + secrets, even though those collectors are not Jumpstarter-specific. +- **"Mandatory OpenTelemetry SDK and Collector"** for all metrics, + logs, and traces — rejected for the reference architecture; + rationale in **DD-6** (optional parallel deployment by operators is + still fine). +- **"Unstructured logs everywhere; parse with regex"** — rejected as + unscalable for joins with traces and multi-service incidents. +- **"Mandatory full tracing for every command"** — high overhead; rejected; prefer + sampling and opt-in for heavy paths. +- **"Push metric increments from exporters to telemetry"** — exporters + would send `+1`/`+N` counter increments and histogram observations to + the Telemetry service, which would maintain in-memory counters and + expose them on `/metrics`. Rejected because: (a) counter state would + be lost on Telemetry restart, (b) retries introduce double-counting + requiring idempotency logic, and (c) high-frequency counters (e.g. + stream bytes) generate excessive RPC traffic. The reverse-scrape model + keeps full counter state on the exporter and generates zero RPC + traffic between scrapes (see **DD-3** alternative 4, **DD-7**). +- **"Reuse `AuditStream` for telemetry log push"** — `AuditStream` was an + unimplemented stub on `ControllerService` with no message schema for + structured telemetry data. Rather than retrofitting it, a purpose-built + `PushLogs` RPC on the new `TelemetryService` provides a cleaner contract + and separates telemetry from the controller's reconciliation API. + +## Prior Art + +- [Prometheus](https://prometheus.io/) and [Alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/) + — time-series metrics and alerting; [Prometheus naming and labels](https://prometheus.io/docs/practices/naming/) + on cardinality and naming; remote write for non-scrape topologies; + [Exemplars](https://prometheus.io/docs/instrumenting/exposition_formats/#exemplars) + for attaching high-cardinality context to individual samples. +- [Grafana exemplar support](https://grafana.com/docs/grafana/latest/fundamentals/exemplars/) + — visualizing exemplars in metric panels and linking to traces or logs. +- [Loki](https://grafana.com/oss/loki/) — log aggregation, label model, and push + and query APIs; often combined with [Perses](https://perses.dev/) (see + **DD-10**) and Grafana Agent / Alloy or + [Promtail](https://grafana.com/docs/loki/latest/send-data/promtail/) for log + shipping. +- [Grafana Tempo](https://grafana.com/oss/tempo/) or [Jaeger](https://www.jaegertracing.io/) — common trace backends + (native or HTTP ingest; OTLP where the operator uses it — not a + Jumpstarter code dependency; see **DD-6**). +- [Perses](https://perses.dev/) — CNCF dashboard project; Apache 2.0; + Kubernetes-native CRDs; CUE/JSON spec with GitOps SDKs; focused on + Prometheus, Loki, and Tempo data sources (see **DD-10**). +- [OpenTelemetry](https://opentelemetry.io/) and the + [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) — + relevant as ecosystem and operator-side *optional* plumbing; + this JEP intentionally does not adopt them in-process by default (**DD-6**). +- Other HiL / test systems often separate "run metadata" (like Jenkins build + id) from device state; similar separation maps well to this JEP’s lease + context + events. 
+ +## Unresolved Questions + +- Event retention: Loki retention policy (per-tenant, per-stream retention + classes) for annotated log events (**DD-2**); whether Jumpstarter should + document recommended retention defaults or leave this to operators. + +## Future Possibilities + +- SLOs and error budgets on lease acquisition time, flash success rate, and + mean time to recovery of exporters. +- Per-tenant or per-namespace dashboards as samples in the docs. +- *Not* part of this JEP: billing usage metering (could reuse metrics later). + +## Implementation History + +- JEP-0011 proposed: 2026-04-23 +- JEP-0011 updated based on feedback: 2026-04-29 + +## References + +- [JEP-0000 — JEP Process](JEP-0000-jep-process.md) +- [Kubernetes Events](https://kubernetes.io/docs/reference/kubernetes-api/cluster-resources/event-v1/) +- [W3C Trace Context](https://www.w3.org/TR/trace-context/) (`traceparent`) +- Upstream project docs for the Prometheus, Loki, and + Perses versions (and optional Tempo / Jaeger if used) in a + given deployment; pin versions in release notes + and integration tests. + +--- + +*This JEP is licensed under the +[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)* diff --git a/python/docs/source/internal/jeps/README.md b/python/docs/source/internal/jeps/README.md index bba33ef35..76bc26419 100644 --- a/python/docs/source/internal/jeps/README.md +++ b/python/docs/source/internal/jeps/README.md @@ -35,6 +35,7 @@ For the full process definition, see [JEP-0000](JEP-0000-jep-process.md). | JEP | Title | Status | Author(s) | | ---- | ---------------------------------------------------- | ----------- | -------------------- | | 0010 | [Renode Integration](JEP-0010-renode-integration.md) | Implemented | @vtz (Vinicius Zein) | +| 0011 | [Metrics, Tracing, and Log Observability](JEP-0011-observability-telemetry-logs.md) | Discussion | @mangelajo (Miguel Angel Ajo Pelayo) | ### Informational JEPs @@ -67,4 +68,5 @@ For the full process definition, see [JEP-0000](JEP-0000-jep-process.md). JEP-0000-jep-process.md JEP-0010-renode-integration.md +JEP-0011-observability-telemetry-logs.md ``` diff --git a/typos.toml b/typos.toml index 3cc13976e..5e5d30957 100644 --- a/typos.toml +++ b/typos.toml @@ -19,6 +19,9 @@ mosquitto = "mosquitto" # ser is short for "serialize" in variable names like ser_json_timedelta ser = "ser" +# AKS is Azure Kubernetes Service +AKS = "AKS" + [type.gomod] # Exclude go.mod and go.sum from spell checking extend-glob = ["go.mod", "go.sum"]