feat(quickstart): rich ClickHouse / host / otelcol observability via OTel collector #701

Open
tsouza wants to merge 9 commits into main from feat/quickstart-rich-observability

@tsouza tsouza commented May 22, 2026

Summary

Wires three additional OTel-collector metric receivers into the quickstart compose stack so it paints a full-stack observability picture out of the box, with three new Grafana dashboards under the existing Cerberus folder. Every metric path flows through cerberus on the read side -- nothing bypasses the gateway.

  • prometheus/self -- scrapes the collector's own :8888 endpoint, surfaces every otelcol_* receiver / processor / exporter counter, queue depth, batch send sizes, Go runtime memory.
  • hostmetrics -- every supported scraper turned on (cpu / memory / disk / network / filesystem / load / paging / processes) including the disabled-by-default *.utilization gauges and conntrack counters.
  • sqlquery/clickhouse -- queries system.metrics, system.events, system.asynchronous_metrics, and system.parts every 15s. Three name-pivoted families (clickhouse_metric / _event / _async_metric) cover ~400 CH server signals without enumerating each one up front.
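The name-pivoted shape described in the sqlquery/clickhouse bullet can be sketched in Go as follows. This is an illustration only, not the receiver's implementation: the Row and Point types and the pivot function are hypothetical, and the "name" attribute key is taken from the clickhouse_event{name=~...} selectors used elsewhere in this PR.

```go
package main

import "fmt"

// Row is one (name, value) pair, as a query against a table such as
// system.metrics might return it (hypothetical shape for this sketch).
type Row struct {
	Name  string
	Value float64
}

// Point is a single data point in a name-pivoted metric family.
type Point struct {
	Metric string            // family name, e.g. "clickhouse_metric"
	Attrs  map[string]string // carries the original signal name
	Value  float64
}

// pivot maps every row onto one metric family and moves the per-row
// signal name into a "name" attribute instead of the metric name.
func pivot(family string, rows []Row) []Point {
	points := make([]Point, 0, len(rows))
	for _, r := range rows {
		points = append(points, Point{
			Metric: family,
			Attrs:  map[string]string{"name": r.Name},
			Value:  r.Value,
		})
	}
	return points
}

func main() {
	rows := []Row{{"Query", 3}, {"Merge", 1}}
	for _, p := range pivot("clickhouse_metric", rows) {
		fmt.Printf("%s{name=%q} %v\n", p.Metric, p.Attrs["name"], p.Value)
	}
}
```

Because the signal name travels as an attribute, a new ClickHouse counter shows up as a new series of an existing family rather than as a new metric that the config would have to list.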

A transform/metric_names processor rewrites OTel-dotted names (system.cpu.time) to underscored PromQL-friendly ones (system_cpu_time) before write so dashboard queries don't need UTF-8 escaping. Three resource processors stamp service.name per source so PromQL filters can pivot.
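The rename itself is a plain dot-to-underscore substitution. The Go sketch below only illustrates the mapping; the collector performs it inside the transform processor, and promName is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// promName rewrites an OTel-dotted metric name into the underscored
// form that PromQL can reference without UTF-8 quoting.
func promName(otelName string) string {
	return strings.ReplaceAll(otelName, ".", "_")
}

func main() {
	for _, n := range []string{"system.cpu.time", "system.memory.utilization"} {
		fmt.Println(n, "->", promName(n))
	}
}
```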

New dashboards

  • clickhouse-observability.json -- in-flight queries, parts on disk, memory, connections, merges, query rate by type, MergeTree I/O, network, caches, thread pools, replication state, errors, host resource gauges.
  • otelcol-observability.json -- uptime, RSS, queue depth, send failures, processor refusals, Go heap, receiver/exporter throughput by signal, drops/failures, batch send-size quantiles, queue depth vs capacity.
  • host-observability.json -- CPU by state, per-core utilisation, memory by state + utilisation, disk IOPS / throughput / operation time, network throughput + packets/errors/drops, filesystem utilisation by mount with red threshold at 90%, load average, paging.

Verification

Validated against otel/opentelemetry-collector-contrib:0.152.1 (latest release) -- validate --config passes cleanly and a one-off run on a verify-network produced 750 sum rows + 1284 gauge rows across 64 distinct metric names within ~60s. Sample names:

clickhouse_async_metric  clickhouse_event       clickhouse_metric
clickhouse_parts_active  clickhouse_parts_bytes_on_disk  clickhouse_parts_rows
otelcol_exporter_queue_capacity  otelcol_exporter_queue_size
otelcol_exporter_sent_metric_points  otelcol_process_memory_rss
otelcol_process_runtime_heap_alloc_bytes  otelcol_process_uptime
otelcol_processor_accepted_metric_points  otelcol_receiver_accepted_metric_points
system_cpu_load_average_1m  system_cpu_time  system_cpu_utilization
system_disk_io  system_disk_operations  system_filesystem_utilization
system_memory_usage  system_memory_utilization  system_network_io
system_network_packets  system_paging_faults  system_processes_count

Coordination notes (out of scope for this PR)

The receiver YAML is wired; for the dashboards to populate against the live stack the docker-compose.yml owner (seed-removal agent) needs to:

  1. Bump the collector image pin from otel/opentelemetry-collector-contrib:0.116.1 to 0.152.1 (the new service.telemetry.metrics.readers/pull/prometheus syntax requires 0.123+, and the legacy address: shorthand is removed in 0.152).
  2. Mount host paths into the otel-collector service for true host visibility (without these, hostmetrics scrapes the container's namespace — still real data, just container-scoped):
    otel-collector:
      volumes:
        - /proc:/hostfs/proc:ro
        - /sys:/hostfs/sys:ro
        - /:/hostfs:ro
      environment:
        HOST_PROC: /hostfs/proc
        HOST_SYS: /hostfs/sys
      pid: host
    Then set root_path: /hostfs on the hostmetrics receiver in the receiver YAML.

Both are documented in the comment block at the top of test/e2e/otel-collector/compose-config.yaml.

Test plan

  • otelcol-contrib validate --config=... against 0.152.1 (zero errors)
  • One-off boot against a verify-network ClickHouse: all 4 metric pipelines (otlp, prometheus/self, hostmetrics, sqlquery/clickhouse) produced rows in otel_metrics_{sum,gauge,histogram} within 60s
  • Dashboards JSON-valid (Grafana provisioning will refuse malformed JSON)
  • Live-stack verification after the docker-compose owner lands the image pin bump + host mounts

🤖 Generated with Claude Code

@tsouza tsouza enabled auto-merge (squash) May 22, 2026 12:19
@tsouza tsouza force-pushed the feat/quickstart-rich-observability branch from 3ac2a40 to 232bae5 Compare May 22, 2026 12:44
@tsouza tsouza closed this May 22, 2026
auto-merge was automatically disabled May 22, 2026 14:05

Pull request was closed

@tsouza tsouza reopened this May 22, 2026
@tsouza tsouza force-pushed the feat/quickstart-rich-observability branch 2 times, most recently from 3ac2a40 to 34916aa Compare May 22, 2026 14:29
@tsouza tsouza enabled auto-merge (squash) May 22, 2026 14:43
@tsouza tsouza force-pushed the feat/quickstart-rich-observability branch from 4b7fd57 to 5b78c5c Compare May 22, 2026 16:33
tsouza added a commit that referenced this pull request May 22, 2026
…type error doesn't fire (#706)

PromQL `or` chains like `sum(increase(A[5m]) or increase(B[5m]) or
increase(C[5m]))` (PR #701's otelcol-observability dashboard) failed at
CH with `code: 386 — There is no supertype for types String,
Map(LowCardinality(String), String)`. `A or B or C` parses as `(A or
B) or C`; the inner `(A or B)` arm projected the canonical 4 columns
(`MetricName, Attributes, TimeUnix, Value` — String, Map, …), while
the matrix-shape `RangeWindow` for `C` exposed `Attributes, anchor_ts,
TimeUnix, Value` (Map first, no MetricName because `increase` drops
`__name__`). The UNION ALL then asked CH to unify String + Map at
column position 0.

Every VectorSetOp arm now projects the canonical 4-column shape
explicitly, synthesising `'' AS MetricName` for derived-shape arms
(RangeWindow / Aggregate / MetricsAggregate / MetricsHistogramOverTime
/ a Project on top of one of those) — mirroring
`wrapWithSampleProjection`'s derived-shape branch. Positional column
unification across the UNION arms now always sees matching types.
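
A minimal Go sketch of the canonical-arm idea described above, assuming each arm is already rendered as a SQL string. The Arm type and the canonical and unionAll helpers are hypothetical names for illustration; they are not the functions in the cerberus emitter.

```go
package main

import (
	"fmt"
	"strings"
)

// Arm is one side of a PromQL set operation, already rendered as SQL.
type Arm struct {
	SQL     string
	Derived bool // true when the arm exposes no MetricName column (e.g. increase())
}

// canonical wraps an arm so it exposes MetricName, Attributes, TimeUnix
// and Value in that order. Derived arms get an empty-string MetricName
// so column position 0 is a String in every arm.
func canonical(a Arm) string {
	name := "MetricName"
	if a.Derived {
		name = "'' AS MetricName"
	}
	return fmt.Sprintf("SELECT %s, Attributes, TimeUnix, Value FROM (%s)", name, a.SQL)
}

// unionAll joins the canonicalised arms; UNION ALL unifies columns by
// position, so every arm must share the same column order and types.
func unionAll(arms []Arm) string {
	parts := make([]string, len(arms))
	for i, a := range arms {
		parts[i] = canonical(a)
	}
	return strings.Join(parts, " UNION ALL ")
}

func main() {
	arms := []Arm{
		{SQL: "SELECT MetricName, Attributes, TimeUnix, Value FROM a"},
		{SQL: "SELECT Attributes, anchor_ts AS TimeUnix, Value FROM c", Derived: true},
	}
	fmt.Println(unionAll(arms))
}
```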

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@tsouza tsouza force-pushed the feat/quickstart-rich-observability branch from 5b78c5c to 33af8e9 Compare May 22, 2026 17:25
tsouza added a commit that referenced this pull request May 22, 2026
…h compose stack

The rich-observability compose stack (PR #701) fans in the OTel
collector's own self-telemetry plus hostmetrics + sqlqueryreceiver
output. Many of those metric names appear in `/api/v1/label/__name__/values`
as soon as the collector's first push lands, but the corresponding
per-series rows can race the 5m series window (or stay at 0 forever
on a quiet stack with no errors / no traffic).

Extend `EXPECTED_EMPTY` in `iterate-metrics-explorer.spec.ts` with
prefix entries covering each empty-by-design family:

- `clickhouse_event`        — sqlqueryreceiver, quiet stack has no events
- `otelcol_connector_servicegraph_` — requires trace volume + TTL turnover
- `otelcol_exporter_send_failed_`    — stays at 0 on a healthy stack
- `otelcol_exporter_sent_`           — first emission races ahead of 5m window
- `otelcol_process_`                 — collector self-process gauges, same race
- `otelcol_processor_`               — pipeline counters, same race
- `otelcol_receiver_`                — pipeline counters, same race
- `otelcol_scraper_`                 — scrape cadence leaves window empty
- `system_`                          — hostmetrics, same race

Also extend `EXPECTED_EMPTY_EXPR_SUBSTRINGS` in
`iterate-all-dashboards.spec.ts` with a `clickhouse_event` match so
the `clickhouse-observability` dashboard's "Query rate by type"
panel is treated as tolerated-empty on a quiet compose stack.

Each entry carries the one-line rationale required by PR #704's
allowlist pattern.
tsouza added a commit that referenced this pull request May 22, 2026
…rms (#707)

PR #706's vectorSetOpCanonicalArmFrag projects every VectorSetOp arm as
SELECT MetricName, Attributes, TimeUnix, Value but the inner SELECT for
an instant-mode RangeWindow / Aggregate / MetricsAggregate /
MetricsHistogramOverTime only exposes (group-keys..., Value). The bare
TimeUnix column reference then fails at CH 24.x with
"Unknown expression identifier 'TimeUnix'" / "Resolve identifier
'TimeUnix' from parent scope only supported for constants and CTE".

PR #701's new otelcol-observability dashboard surfaced the residue
on the "Send failures (5m)" + "Processor refusals (5m)" stat panels,
which Grafana renders via instant /api/v1/query (no step). Both fire
as
  sum(increase(otelcol_..._log_records[5m])
      or increase(otelcol_..._metric_points[5m])
      or increase(otelcol_..._spans[5m]))
and consistently return 502 from the cerberus engine (the browser shows 400).

Mirror the wrapWithSampleProjection instant branch: synthesize
TimeUnix as (now64(9) - toIntervalNanosecond(5_000_000_000)) for
derived-shape arms in instant mode. Matrix-mode arms (OuterRange > 0)
still reference TimeUnix by name because emitWindowedArrayPairsMatrix
already aliases anchor_ts AS TimeUnix on the outer SELECT — covered by
the existing binary_or_increase_range_canonicalises_arms fixture; the
new binary_or_increase_instant_canonicalises_arms fixture pins the
instant-mode path.
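
The mode split above can be sketched as a single branch. This is an assumption-laden illustration: timeUnixColumn is a hypothetical helper, the outerRange parameter stands in for the OuterRange check mentioned in the message, and the SQL expression is copied from the text rather than from the emitter source.

```go
package main

import "fmt"

// timeUnixColumn returns the TimeUnix projection for a derived-shape arm.
// outerRange > 0 stands for matrix mode, where the inner SELECT already
// exposes a TimeUnix alias; otherwise the arm is in instant mode and the
// column has to be synthesised.
func timeUnixColumn(outerRange int64) string {
	if outerRange > 0 {
		return "TimeUnix"
	}
	return "(now64(9) - toIntervalNanosecond(5_000_000_000)) AS TimeUnix"
}

func main() {
	fmt.Println("instant:", timeUnixColumn(0))
	fmt.Println("matrix: ", timeUnixColumn(300))
}
```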

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@tsouza tsouza force-pushed the feat/quickstart-rich-observability branch from 7962567 to f6065ff Compare May 22, 2026 18:41
@tsouza tsouza force-pushed the feat/quickstart-rich-observability branch from f6065ff to 712db3e Compare May 22, 2026 19:29
tsouza added a commit that referenced this pull request May 22, 2026
…pties on fresh compose (#708)

iterate-metrics-explorer + iterate-all-dashboards on PR #701's compose
stack flagged ~30 otelcol_* metrics with empty /api/v1/series + the
clickhouse-observability "Query rate by type" panel as empty. Both are
emission-cadence artefacts of a fresh stack, not regressions:

- otelcol_{exporter,processor,receiver,scraper,connector,process}_* —
  Collector self-telemetry counters that only tick on the underlying
  event (refused span, failed export, queue change). On a clean
  pipeline with no overload most stay at 0 in the 5m window even
  though the prometheus/self scraper has primed the catalog.

- clickhouse_event{name=~"Query|SelectQuery|...|FailedInsertQuery"} —
  CH's per-event counters published via its built-in /metrics. The
  warmup drives a few SELECTs through cerberus but the scrape cadence
  (15s) + CH-side ProfileEvents flush can leave the 5m rate window
  empty when the cluster is otherwise idle.

Add one broad `otelcol_` prefix entry to EXPECTED_EMPTY (covers all
six otelcol_* subsystems; per-metric entries would be ~30 lines with
identical rationale) and one substring entry to
EXPECTED_EMPTY_EXPR_SUBSTRINGS pinned to the clickhouse_event Query
regex. Keeps both lists under the 10-entry budget called out in their
docstrings.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@tsouza tsouza force-pushed the feat/quickstart-rich-observability branch from 712db3e to 28f6aa0 Compare May 22, 2026 19:53
tsouza added a commit that referenced this pull request May 22, 2026
…tter sums resolve (#710)

PromQL's `__name__` matcher lowering uses a Prom-naming heuristic
(`internal/schema/Metrics.TableFor`) to pick the metrics table: names
ending in `_total` / `_count` / `_sum` / `_bucket` route to the Sum
table, everything else to the Gauge table. The heuristic mirrors the
Prom-on-OTel remote-write convention, but the OTel-Collector emitters
PR #701 wires into the quickstart (`hostmetrics`, `sqlquery/clickhouse`,
`prometheus/self`) ship cumulative sums under bare names that violate
the convention — `system_cpu_time`, `clickhouse_event`,
`otelcol_process_uptime` — so the matcher routed those to the Gauge
table and returned zero rows even though the row data lived in Sum.
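
A rough Go sketch of the suffix heuristic as described, to show why the bare names misroute. The tableFor function is a hypothetical stand-in for `TableFor`; the table names are the ones that appear in the fixture diff later in this message.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	gaugeTable = "otel_metrics_gauge"
	sumTable   = "otel_metrics_sum"
)

// sumSuffixes are the Prom-style suffixes the heuristic treats as Sum metrics.
var sumSuffixes = []string{"_total", "_count", "_sum", "_bucket"}

// tableFor picks exactly one table from the metric name alone.
func tableFor(metric string) string {
	for _, s := range sumSuffixes {
		if strings.HasSuffix(metric, s) {
			return sumTable
		}
	}
	return gaugeTable
}

func main() {
	// The bare names are cumulative sums, yet all three land on the Gauge table.
	for _, m := range []string{"system_cpu_time", "clickhouse_event", "otelcol_process_uptime"} {
		fmt.Println(m, "->", tableFor(m))
	}
}
```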

The catalog endpoints (`/api/v1/series`, `/api/v1/label/...`) already
union all metric tables on the read side, so dashboards surfaced these
metrics in their metric pickers — but the matcher path silently
diverged. PR #701's `otelcol-observability` + `clickhouse-observability`
dashboards painted empty panels against fresh compose data; the
panel-kiosk + iterate-metrics-explorer sweeps caught the regression as
"Unable to fetch labels" + 10-11 console-error 400s per panel.

The fix introduces `schema.Metrics.TablesFor` returning the candidate
table set (Gauge + Sum for unsuffixed names, single-table for suffixed
ones) and an opt-in `chplan.Scan.UnionTables` field the chsql emitter
renders as a CH `merge(currentDatabase(), '<regex>')` table function
call. CH's `merge()` fans the scan across the matching tables in the
named database, projecting the columns common to every member; the
Sum-only columns (`AggregationTemporality`, `IsMonotonic`) drop out of
the merged view but no metric-row consumer references them. The
PREWHERE on `MetricName` translates per-arm at CH's planning stage so
granule pruning still fires.
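
As a sketch of the candidate-set idea, assuming the same suffix list as the heuristic: tablesFor and mergeFrom below are hypothetical helpers, not the `TablesFor` method or the chsql emitter, and they only build the FROM expression as a string.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	gaugeTable = "otel_metrics_gauge"
	sumTable   = "otel_metrics_sum"
)

// tablesFor returns every table that may hold the metric: the Sum table
// alone for suffixed names, Gauge plus Sum for unsuffixed ones.
func tablesFor(metric string) []string {
	for _, s := range []string{"_total", "_count", "_sum", "_bucket"} {
		if strings.HasSuffix(metric, s) {
			return []string{sumTable}
		}
	}
	return []string{gaugeTable, sumTable}
}

// mergeFrom renders a multi-table candidate set as a ClickHouse merge()
// table-function call with an anchored alternation regex.
func mergeFrom(tables []string) string {
	return fmt.Sprintf("merge(currentDatabase(), '^(%s)$')", strings.Join(tables, "|"))
}

func main() {
	fmt.Println("FROM", mergeFrom(tablesFor("system_cpu_time")))
}
```

Anchoring the regex with ^ and $ keeps the scan from picking up any other table whose name merely contains one of the two names.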

`lowerVectorSelector` now constructs the Scan via a `scanFromTables`
helper: single-element table list lowers to the legacy
`Scan{Table: ...}` shape (byte-stable for the suffix-routed fixtures);
multi-element lowers to `Scan{UnionTables: ...}`. The histogram-companion
+ bucket-selector overrides keep single-table semantics — they rewrite
the `__name__` matcher to a bare base name that only the histogram
table stores, so a fan-out across Gauge/Sum would just contribute zero
rows.

The mv_substitution rule's `c.BaseTable != scan.Table` guard naturally
skips Scans with empty Table (the UnionTables case) — rollups can't
re-route across heterogeneous physical layouts. The late-mat optimizer
likewise skips via `lateMatShapeFor(scan.Table)` returning `!ok`. Both
exclusions are correct: rollups and wide-column late mat both bake in
single-table assumptions the union scan doesn't satisfy.
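
The single-versus-union split and the rollup guard can be sketched together. The Scan struct here is a cut-down hypothetical with only the two fields named above, and rollupApplies is an illustrative inversion of the `c.BaseTable != scan.Table` guard, not the optimizer's code.

```go
package main

import "fmt"

// Scan is a reduced stand-in for the plan node: exactly one of Table or
// UnionTables is expected to be set.
type Scan struct {
	Table       string
	UnionTables []string
}

// scanFromTables keeps the legacy single-table shape when there is only
// one candidate and switches to the union shape otherwise.
func scanFromTables(tables []string) Scan {
	if len(tables) == 1 {
		return Scan{Table: tables[0]}
	}
	return Scan{UnionTables: tables}
}

// rollupApplies mirrors the guard: a rollup keyed on baseTable can only
// substitute a scan of that exact table, so a union scan (empty Table)
// never matches.
func rollupApplies(baseTable string, s Scan) bool {
	return baseTable == s.Table
}

func main() {
	single := scanFromTables([]string{"otel_metrics_sum"})
	union := scanFromTables([]string{"otel_metrics_gauge", "otel_metrics_sum"})
	fmt.Println(rollupApplies("otel_metrics_sum", single)) // true
	fmt.Println(rollupApplies("otel_metrics_sum", union))  // false
}
```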

158 existing TXTAR fixtures absorb the `FROM otel_metrics_gauge` →
`FROM merge(currentDatabase(), '^(otel_metrics_gauge|otel_metrics_sum)$')`
change. A new pin fixture (`scan_unions_gauge_sum_for_unsuffixed_metric.txtar`)
seeds an empty Gauge table alongside a populated Sum table so the chDB
roundtrip exercises the actual multi-table union — a regression that
dropped the Sum-table arm of merge() would return zero rows.

Verified against the live compose stack: every previously-failing query
listed on PR #701's compose-smoke iteration (run 26308908297) now
returns data — `otelcol_process_uptime`, `system_cpu_time`,
`clickhouse_event`, the MergeTree-I/O `rate(clickhouse_event{name=~...}[5m])`
panel, the three-arm `or` shapes, the histogram_quantile over
`otelcol_processor_batch_batch_send_size_bucket`.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@tsouza tsouza force-pushed the feat/quickstart-rich-observability branch from b4526fd to 5c08787 Compare May 22, 2026 21:15
@tsouza tsouza force-pushed the feat/quickstart-rich-observability branch from 5c08787 to 0019ecd Compare May 22, 2026 21:43
tsouza added a commit that referenced this pull request May 22, 2026
Structural cleanup PR. Every test failure must surface as a real bug to
fix at the source (cerberus code, seed, dashboard, panel) — no allow-
list, tolerance, expected-empty, should_skip, expect.soft, or per-field
blank-skip is acceptable anywhere.

Playwright specs
- iterate-all-dashboards.spec.ts: delete EXPECTED_EMPTY_EXPR_SUBSTRINGS
  + isExpectedEmpty + the conditional in probeTarget. Every empty
  panel result is now a hard fail.
- iterate-metrics-explorer.spec.ts: delete EXPECTED_EMPTY,
  HISTOGRAM_COMPANION_SUFFIXES, dottedAlias, and the companion-
  suffix + dotted-alias fallbacks that masked bare-name 0-result
  failures. Every catalog-published metric must resolve to >= 1 series.
- iterate-drilldown-apps.spec.ts: delete ALERT_ERROR_PATTERNS (banner-
  substring allow-list), APP_NOT_INSTALLED_BANNER_PATTERNS, and
  DRILLDOWN_UPSTREAM_GRAFANA_CONSOLE_NOISE. Hard-fail every role=alert
  banner and every console error. Remove the install-probe early-skip;
  every catalogue entry must be installed (catalogue now lists three
  apps cerberus actually provisions; pyroscope is gone).
- helpers/drilldown.ts: drop grafana-pyroscope-app from the catalogue.
- compose_grafana_smoke.spec.ts: flip expect.soft -> hard throw; prune
  the retired-allowlist documentation block.
- helpers/{assertions,dom,sweep}.ts: prune retired-allowlist comments;
  reword "tolerated" prose.

Compatibility harnesses
- compatibility/loki/cerberus-test-queries.yml: empty the should_skip
  block (was 17 entries). File kept as schema placeholder.
- compatibility/loki/cmd/loki-compliance-tester/main.go: delete the
  Overlay loader, skipKey lookup, SkipReason field on Result, -overlay
  flag, and every overlay-driven branch in compareAll. Drop the yaml
  import.
- compatibility/loki/scripts/run-loki-compatibility.sh: drop -overlay
  arg + DRIVER_OVERLAY env + the `skipped` jq bucket.
- compatibility/prometheus/expected-failures.json: DELETED (was empty
  anyway; the file itself was the allowlist mechanism).
- compatibility/tempo/expected-failures.json: DELETED. Remove the
  --expected-failures flag, the loader, the docker-compose mount, the
  run-script DIFF_FLAGS branch.
- compatibility/tempo/driver/differ.go: remove the StartTimeUnixNano
  blank-skip — an asymmetric blank on either side is now a real
  divergence (the backend that omitted the field is the bug). Epsilon
  comparison for parsed numeric values is kept (float-noise absorption,
  not a case-allowlist).
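
A hedged Go sketch of the differ policy in the last bullet: blanks on one side only are a divergence, while parsed numbers are compared within a tolerance. equalField is a hypothetical function, and the absolute-epsilon comparison is an assumption; the real differ may compare differently.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// equalField reports whether two string-encoded field values match.
func equalField(a, b string, eps float64) bool {
	if a == b {
		return true
	}
	// An asymmetric blank is a real divergence, never skipped.
	if a == "" || b == "" {
		return false
	}
	fa, errA := strconv.ParseFloat(a, 64)
	fb, errB := strconv.ParseFloat(b, 64)
	if errA != nil || errB != nil {
		return false // non-numeric values must match exactly
	}
	// Absorb float noise only; this is not a per-case allowlist.
	return math.Abs(fa-fb) <= eps
}

func main() {
	fmt.Println(equalField("1.0000001", "1.0", 1e-6))
	fmt.Println(equalField("", "1700000000000000000", 1e-6))
}
```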

CI / lefthook gates
- .github/workflows/ci.yml forbid-skip:
  * Replace the "guard new should_skip entries with a tracking ref"
    step with a hard reject of any non-empty should_skip block. The
    consumer code is gone; entries would be silently ignored.
  * New step: reject test-suite escape-hatch primitives anywhere in
    .ts/.tsx/.go (EXPECTED_EMPTY, EXPECTED_TOLERATED, isKnownTolerated,
    tolerated404, expect.soft, should_tolerate, skipReason/SkipReason,
    APP_NOT_INSTALLED_BANNER_PATTERNS, DRILLDOWN_UPSTREAM_GRAFANA_CONSOLE_NOISE).
- lefthook.yml: mirror the same forbid-escape-hatch + should_skip
  guards in the pre-push hook.
- scripts/check-skip-additions.sh: DELETED. The "guard untracked
  entries" policy is replaced by "zero entries".
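
The escape-hatch gate amounts to a token scan over source text. The Go sketch below shows the idea under stated assumptions: the token list is copied from the bullet above, while Hit and findEscapeHatches are hypothetical names and the actual CI step may well be a grep invocation instead.

```go
package main

import (
	"fmt"
	"strings"
)

// forbidden lists the escape-hatch primitives named in the CI step.
var forbidden = []string{
	"EXPECTED_EMPTY", "EXPECTED_TOLERATED", "isKnownTolerated",
	"tolerated404", "expect.soft", "should_tolerate",
	"skipReason", "SkipReason",
	"APP_NOT_INSTALLED_BANNER_PATTERNS",
	"DRILLDOWN_UPSTREAM_GRAFANA_CONSOLE_NOISE",
}

// Hit records one forbidden token and the 1-based line it was found on.
type Hit struct {
	Line  int
	Token string
}

// findEscapeHatches scans src line by line with plain substring matching.
func findEscapeHatches(src string) []Hit {
	var hits []Hit
	for i, line := range strings.Split(src, "\n") {
		for _, tok := range forbidden {
			if strings.Contains(line, tok) {
				hits = append(hits, Hit{Line: i + 1, Token: tok})
			}
		}
	}
	return hits
}

func main() {
	src := "const x = 1\nexpect.soft(y).toBe(1)\n"
	for _, h := range findEscapeHatches(src) {
		fmt.Printf("line %d: forbidden primitive %q\n", h.Line, h.Token)
	}
}
```

Substring matching is deliberately blunt: EXPECTED_EMPTY also catches EXPECTED_EMPTY_EXPR_SUBSTRINGS, which suits a zero-entries policy.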

Docs
- docs/compatibility.md: replace the "Expected-failures allowlist"
  section with a "No allow-lists" section documenting the new policy.
- compatibility/loki/README.md: align the overlay description with
  the schema-placeholder reality.
- compatibility/prometheus/{test-cerberus.yml,scripts/run-compatibility.sh}:
  reword to remove the expected-failures references.

PR #701 follow-up (feat/quickstart-rich-observability)
The PR #701 branch adds two more EXPECTED_EMPTY entries on top of main:
- iterate-metrics-explorer.spec.ts: `system_` prefix + `clickhouse_event`
  prefix entries (b3a9dad).
- iterate-all-dashboards.spec.ts: broaden the `clickhouse_event` match
  pattern (b4526fd).
Once this PR merges and #701 rebases, those entries become deletion
conflicts the rebaser must hand-resolve to "deleted". The
`forbid-escape-hatch` gate will reject the PR if any survives.
@tsouza tsouza force-pushed the feat/quickstart-rich-observability branch from a8d1769 to 183739f Compare May 22, 2026 22:18
tsouza added a commit that referenced this pull request May 22, 2026
Structural cleanup PR. Every test failure must surface as a real bug to
fix at the source (cerberus code, seed, dashboard, panel) — no allow-
list, tolerance, expected-empty, should_skip, expect.soft, or per-field
blank-skip is acceptable anywhere.

Playwright specs
- iterate-all-dashboards.spec.ts: delete EXPECTED_EMPTY_EXPR_SUBSTRINGS
  + isExpectedEmpty + the conditional in probeTarget. Every empty
  panel result is now a hard fail.
- iterate-metrics-explorer.spec.ts: delete EXPECTED_EMPTY,
  HISTOGRAM_COMPANION_SUFFIXES, dottedAlias, and the companion-
  suffix + dotted-alias fallbacks that masked bare-name 0-result
  failures. Every catalog-published metric must resolve to >= 1 series.
- iterate-drilldown-apps.spec.ts: delete ALERT_ERROR_PATTERNS (banner-
  substring allow-list), APP_NOT_INSTALLED_BANNER_PATTERNS, and
  DRILLDOWN_UPSTREAM_GRAFANA_CONSOLE_NOISE. Hard-fail every role=alert
  banner and every console error. Remove the install-probe early-skip;
  every catalogue entry must be installed (catalogue now lists three
  apps cerberus actually provisions; pyroscope is gone).
- helpers/drilldown.ts: drop grafana-pyroscope-app from the catalogue.
- compose_grafana_smoke.spec.ts: flip expect.soft -> hard throw; prune
  the retired-allowlist documentation block.
- helpers/{assertions,dom,sweep}.ts: prune retired-allowlist comments;
  reword "tolerated" prose.

Compatibility harnesses
- compatibility/loki/cerberus-test-queries.yml: empty the should_skip
  block (was 17 entries). File kept as schema placeholder.
- compatibility/loki/cmd/loki-compliance-tester/main.go: delete the
  Overlay loader, skipKey lookup, SkipReason field on Result, -overlay
  flag, and every overlay-driven branch in compareAll. Drop the yaml
  import.
- compatibility/loki/scripts/run-loki-compatibility.sh: drop -overlay
  arg + DRIVER_OVERLAY env + the `skipped` jq bucket.
- compatibility/prometheus/expected-failures.json: DELETED (was empty
  anyway; the file itself was the allowlist mechanism).
- compatibility/tempo/expected-failures.json: DELETED. Remove the
  --expected-failures flag, the loader, the docker-compose mount, the
  run-script DIFF_FLAGS branch.
- compatibility/tempo/driver/differ.go: remove the StartTimeUnixNano
  blank-skip — an asymmetric blank on either side is now a real
  divergence (the backend that omitted the field is the bug). Epsilon
  comparison for parsed numeric values is kept (float-noise absorption,
  not a case-allowlist).
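
The differ rule above (asymmetric blank is a divergence, numeric noise
is absorbed) can be sketched as follows. This is an illustration only,
not the differ.go code: the function name and the epsilon value are
assumptions.

```python
import math


def fields_diverge(left: str, right: str, epsilon: float = 1e-9) -> bool:
    """Return True when two rendered field values count as a divergence.

    `epsilon` is a hypothetical tolerance chosen for this sketch.
    """
    # A blank on exactly one side is a divergence: one backend omitted
    # the field, and that is the bug to fix.
    if (left == "") != (right == ""):
        return True
    try:
        a, b = float(left), float(right)
    except ValueError:
        # Non-numeric values (including two blanks) must match exactly.
        return left != right
    # Parsed numeric values: absorb float noise, nothing more.
    return not math.isclose(a, b, rel_tol=epsilon, abs_tol=epsilon)
```

The tolerance applies to every numeric pair uniformly, which is what
separates it from a per-case allow-list.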

CI / lefthook gates
- .github/workflows/ci.yml forbid-skip:
  * Replace the "guard new should_skip entries with a tracking ref"
    step with a hard reject of any non-empty should_skip block. The
    consumer code is gone; entries would be silently ignored.
  * New step: reject test-suite escape-hatch primitives anywhere in
    .ts/.tsx/.go (EXPECTED_EMPTY, EXPECTED_TOLERATED, isKnownTolerated,
    tolerated404, expect.soft, should_tolerate, skipReason/SkipReason,
    APP_NOT_INSTALLED_BANNER_PATTERNS, DRILLDOWN_UPSTREAM_GRAFANA_CONSOLE_NOISE).
- lefthook.yml: mirror the same forbid-escape-hatch + should_skip
  guards in the pre-push hook.
- scripts/check-skip-additions.sh: DELETED. The "guard untracked
  entries" policy is replaced by "zero entries".
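
A minimal sketch of what the escape-hatch gate checks, assuming a plain
substring scan over file contents. The token list comes from the step
description above; the function name and the in-memory file map are
hypothetical, and the real gate may be a shell or grep step instead.

```python
import re

# Tokens named in the gate description; the regex form is an assumption.
FORBIDDEN = [
    "EXPECTED_EMPTY", "EXPECTED_TOLERATED", "isKnownTolerated",
    "tolerated404", "expect.soft", "should_tolerate",
    "skipReason", "SkipReason",
    "APP_NOT_INSTALLED_BANNER_PATTERNS",
    "DRILLDOWN_UPSTREAM_GRAFANA_CONSOLE_NOISE",
]
PATTERN = re.compile("|".join(re.escape(tok) for tok in FORBIDDEN))
SCANNED_SUFFIXES = (".ts", ".tsx", ".go")


def find_escape_hatches(files: dict[str, str]) -> list[tuple[str, int, str]]:
    """Return (path, line number, token) for each forbidden token found."""
    hits = []
    for path, text in sorted(files.items()):
        # Only the file types the gate covers are scanned.
        if not path.endswith(SCANNED_SUFFIXES):
            continue
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in PATTERN.finditer(line):
                hits.append((path, lineno, match.group(0)))
    return hits
```

A non-empty result would fail the step; there is no tracking-ref
exemption to consult.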

Docs
- docs/compatibility.md: replace the "Expected-failures allowlist"
  section with a "No allow-lists" section documenting the new policy.
- compatibility/loki/README.md: align the overlay description with
  the schema-placeholder reality.
- compatibility/prometheus/{test-cerberus.yml,scripts/run-compatibility.sh}:
  reword to remove the expected-failures references.

PR #701 follow-up (feat/quickstart-rich-observability)
The PR #701 branch adds two more EXPECTED_EMPTY entries on top of main:
- iterate-metrics-explorer.spec.ts: `system_` prefix + `clickhouse_event`
  prefix entries (b3a9dad).
- iterate-all-dashboards.spec.ts: broaden the `clickhouse_event` match
  pattern (b4526fd).
Once this PR merges and #701 rebases, those entries become deletion
conflicts the rebaser must hand-resolve to "deleted". The
`forbid-escape-hatch` gate will reject the PR if any survives.
tsouza added 8 commits May 22, 2026 23:01
…ed in via OTel collector

Adds three new metric receivers alongside the existing OTLP self-export
in test/e2e/otel-collector/compose-config.yaml so the docker-compose
quickstart paints a full-stack observability picture out of the box,
without bypassing cerberus on the read path:

- prometheus/self -- scrapes the collector's own :8888/metrics endpoint
  (service.telemetry.metrics now exposes it via the 0.123+ readers/pull
  syntax). Surfaces every otelcol_* receiver / processor / exporter
  counter, queue depth, batch send sizes, Go runtime memory.
- hostmetrics -- every supported scraper enabled (cpu / memory / disk /
  network / filesystem / load / paging / processes) including the
  disabled-by-default *.utilization gauges and conntrack counters.
- sqlquery/clickhouse -- queries system.metrics, system.events,
  system.asynchronous_metrics, and system.parts every 15s. The
  three name-pivoted families (clickhouse_metric / _event /
  _async_metric) cover ~400 CH server signals without enumerating each
  one up front.
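
One way to read "name-pivoted" is sketched below: each row of a system
table becomes a data point under a single family name, with the
ClickHouse signal name carried as an attribute. This is an assumption
about the shape, not the collector config itself; the `name` attribute
key and the function are hypothetical.

```python
def pivot_rows(family: str, rows: list[tuple[str, float]]) -> list[dict]:
    """Turn (signal name, value) rows into data points of one family."""
    return [
        # One family name for all rows; the per-row signal name becomes
        # an attribute, so no signal has to be enumerated up front.
        {"metric": family, "attributes": {"name": name}, "value": float(value)}
        for name, value in rows
    ]


points = pivot_rows("clickhouse_metric", [("Query", 3), ("Merge", 1)])
```

A signal added by a later ClickHouse release would then show up as a
new attribute value rather than requiring a config change.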

A transform/metric_names processor rewrites OTel-dotted metric names
(system.cpu.time) to underscored PromQL-friendly ones (system_cpu_time)
before writing so dashboard queries don't need UTF-8 escaping. Three
new resource processors stamp service.name per source so PromQL filters
can pivot.
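
The effect of the rename on query text can be sketched as below. The
rewrite itself runs inside the collector; this Python version only
illustrates the string transformation and why it avoids the quoted
UTF-8 selector form. Both function names are hypothetical.

```python
import re

# Metric names that PromQL accepts as a bare identifier.
LEGACY_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def promql_name(otel_name: str) -> str:
    """Rewrite an OTel-dotted metric name into an underscored one."""
    return otel_name.replace(".", "_")


def selector(name: str) -> str:
    """Return the selector a dashboard query would need for `name`."""
    if LEGACY_NAME.fullmatch(name):
        return name
    # Names outside the legacy charset need the quoted form in braces.
    return '{"%s"}' % name


dotted = selector("system.cpu.time")
underscored = selector(promql_name("system.cpu.time"))
```

Here `dotted` is the quoted form and `underscored` is the bare name a
dashboard panel can use directly.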

Three matching dashboards land under test/e2e/grafana/compose/
dashboards/ alongside cerberus-self.json:

- clickhouse-observability.json -- in-flight queries, parts on disk,
  memory, connections, merges, query rate by type, MergeTree I/O,
  network, caches, thread pools, replication state, errors, host
  resource gauges.
- otelcol-observability.json -- uptime, RSS, queue depth, send
  failures, processor refusals, Go heap, receiver / exporter throughput
  by signal, drops / failures, batch send-size quantiles, queue depth
  vs capacity.
- host-observability.json -- CPU by state, per-core utilisation,
  memory by state + utilisation, disk IOPS / throughput / operation
  time, network throughput + packets / errors / drops, filesystem
  utilisation by mount with red threshold at 90%, load average,
  paging.

Validated against otel/opentelemetry-collector-contrib:0.152.1 (latest
release). A verify-network run produced real data: 750 sum rows + 1284
gauge rows across 64 distinct metric names within ~60s. For the full
stack to pick this up, the owner of the compose docker-compose.yml must
add the host /proc, /sys, and / mounts (for true host visibility) and
bump the collector image pin from 0.116.1 to 0.152.1. Both changes are
coordinated with the seed-removal agent rather than landed in this PR,
per the worktree file-disjointness contract.
…proc /sys for hostmetrics

The hostmetricsreceiver wired in by PR #701 was reading the collector
container's own /proc + /sys namespace, not the host's — every
system_cpu_*, system_memory_*, system_disk_*, system_filesystem_*
series reflected container-scoped state instead of the host machine
the quickstart promises to surface.

Mount /proc, /sys, and / from the host into /hostfs (ro,rslave) and
point the receiver at them via the upstream-documented contract:
HOST_PROC / HOST_SYS / HOST_ETC env vars (cpu / memory / paging /
network / processes scrapers) + root_path: /hostfs in the receiver
config (filesystem + disk scrapers that walk the full tree).
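
The env-var half of that contract amounts to a "use the override,
else the default" path lookup. A minimal sketch of the idea, with a
hypothetical helper name; it does not reproduce the scrapers' actual
code:

```python
import os
import posixpath


def host_path(env_key: str, default: str, *parts: str, env=None) -> str:
    """Resolve a path under the directory named by `env_key`.

    Falls back to `default` when the variable is unset or empty.
    `env` is injectable so the lookup can be exercised without
    touching the real process environment.
    """
    env = os.environ if env is None else env
    base = env.get(env_key) or default
    return posixpath.join(base, *parts)


# With the /hostfs mount in place, reads land on the host's procfs.
mounted = host_path("HOST_PROC", "/proc", "stat",
                    env={"HOST_PROC": "/hostfs/proc"})
# Without the override, the container's own /proc is read instead.
unmounted = host_path("HOST_PROC", "/proc", "stat", env={})
```

The second case is the failure mode described above: the lookup
succeeds, but against container-scoped state.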

Verified locally via `curl localhost:8080/api/v1/query?query=system_cpu_time`:
64 series (8 host CPUs x 8 states), with cumulative seconds matching host
uptime. That is clearly host data, not the ~0s the container would emit.

The image bump also lets us delete the
transform/servicegraph_drop_exemplars workaround: the v0.116.x
clickhouseexporter nil-deref on exemplar payloads (sum_metrics.go:129)
that the processor existed to dodge is fixed in 0.152.1, confirmed by
running the metrics/servicegraph pipeline for >2 metrics_flush_interval
ticks with no panic and the traces_service_graph_request_total series
flowing end-to-end into ClickHouse.

Same image bump on the k3d gateway + agent for parity (the
"bump both together" rule the original 0.120.0 comment called out).
Drops the same drop_exemplars processor on the k3d side; renames
the compose-side receivers + connector to their canonical
host_metrics / service_graph names to silence the deprecation
warnings that 0.152.x now emits on startup for the legacy
hostmetrics / servicegraph names.
@tsouza tsouza force-pushed the feat/quickstart-rich-observability branch from 183739f to fde1868 Compare May 22, 2026 23:09
…name to service_graph

0.152.1 fixed the upstream clickhouseexporter nil-deref on the
service-graph connector's exemplar payload, which the earlier
transform/servicegraph_drop_exemplars processor existed to work around.
The same commit (fde1868) dropped the workaround from the k3d manifest,
but the regression test was still pinning it as required. Rename the
connector key in the k3d manifest to the new canonical service_graph
form (the deprecated alias is still accepted, but the compose stack
already pins the canonical name, so mirror it here) and update the test
to match. Assertions still cover connector presence + traces-pipeline
tap + metrics/servicegraph pipeline wiring.