feat(quickstart): rich ClickHouse / host / otelcol observability via OTel collector#701
Open
tsouza wants to merge 9 commits into
Open
feat(quickstart): rich ClickHouse / host / otelcol observability via OTel collector#701tsouza wants to merge 9 commits into
tsouza wants to merge 9 commits into
Conversation
3ac2a40 to
232bae5
Compare
auto-merge was automatically disabled
May 22, 2026 14:05
Pull request was closed
3ac2a40 to
34916aa
Compare
tsouza
added a commit
that referenced
this pull request
May 22, 2026
tsouza
added a commit
that referenced
this pull request
May 22, 2026
4b7fd57 to
5b78c5c
Compare
3 tasks
tsouza
added a commit
that referenced
this pull request
May 22, 2026
…type error doesn't fire (#706) PromQL `or` chains like `sum(increase(A[5m]) or increase(B[5m]) or increase(C[5m]))` (PR #701's otelcol-observability dashboard) failed at CH with `code: 386 — There is no supertype for types String, Map(LowCardinality(String), String)`. `A or B or C` parses as `(A or B) or C`; the inner `(A or B)` arm projected the canonical 4 columns (`MetricName, Attributes, TimeUnix, Value` — String, Map, …), while the matrix-shape `RangeWindow` for `C` exposed `Attributes, anchor_ts, TimeUnix, Value` (Map first, no MetricName because `increase` drops `__name__`). The UNION ALL then asked CH to unify String + Map at column position 0. Every VectorSetOp arm now projects the canonical 4-column shape explicitly, synthesising `'' AS MetricName` for derived-shape arms (RangeWindow / Aggregate / MetricsAggregate / MetricsHistogramOverTime / a Project on top of one of those) — mirroring `wrapWithSampleProjection`'s derived-shape branch. Positional column unification across the UNION arms now always sees matching types. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
tsouza
added a commit
that referenced
this pull request
May 22, 2026
5b78c5c to
33af8e9
Compare
tsouza
added a commit
that referenced
this pull request
May 22, 2026
…h compose stack The rich-observability compose stack (PR #701) fans in the OTel collector's own self-telemetry plus hostmetrics + sqlqueryreceiver output. Many of those metric names appear in `/api/v1/label/__name__/values` as soon as the collector's first push lands, but the corresponding per-series rows can race the 5m series window (or stay at 0 forever on a quiet stack with no errors / no traffic). Extend `EXPECTED_EMPTY` in `iterate-metrics-explorer.spec.ts` with prefix entries covering each empty-by-design family: - `clickhouse_event` — sqlqueryreceiver, quiet stack has no events - `otelcol_connector_servicegraph_` — requires trace volume + TTL turnover - `otelcol_exporter_send_failed_` — stays at 0 on a healthy stack - `otelcol_exporter_sent_` — first emission races ahead of 5m window - `otelcol_process_` — collector self-process gauges, same race - `otelcol_processor_` — pipeline counters, same race - `otelcol_receiver_` — pipeline counters, same race - `otelcol_scraper_` — scrape cadence leaves window empty - `system_` — hostmetrics, same race Also extend `EXPECTED_EMPTY_EXPR_SUBSTRINGS` in `iterate-all-dashboards.spec.ts` with a `clickhouse_event` match so the `clickhouse-observability` dashboard's "Query rate by type" panel is treated as tolerated-empty on a quiet compose stack. Each entry carries the one-line rationale required by PR #704's allowlist pattern.
4 tasks
tsouza
added a commit
that referenced
this pull request
May 22, 2026
…rms (#707) PR #706's vectorSetOpCanonicalArmFrag projects every VectorSetOp arm as SELECT MetricName, Attributes, TimeUnix, Value but the inner SELECT for an instant-mode RangeWindow / Aggregate / MetricsAggregate / MetricsHistogramOverTime only exposes (group-keys..., Value). The bare TimeUnix column reference then fails at CH 24.x with "Unknown expression identifier 'TimeUnix'" / "Resolve identifier 'TimeUnix' from parent scope only supported for constants and CTE". PR #701's new otelcol-observability dashboard surfaced the residue on the "Send failures (5m)" + "Processor refusals (5m)" stat panels, which Grafana renders via instant /api/v1/query (no step). Both fire as sum(increase(otelcol_..._log_records[5m]) or increase(otelcol_..._metric_points[5m]) or increase(otelcol_..._spans[5m])) and consistently 502 the cerberus engine (browser shows 400). Mirror the wrapWithSampleProjection instant branch: synthesize TimeUnix as (now64(9) - toIntervalNanosecond(5_000_000_000)) for derived-shape arms in instant mode. Matrix-mode arms (OuterRange > 0) still reference TimeUnix by name because emitWindowedArrayPairsMatrix already aliases anchor_ts AS TimeUnix on the outer SELECT — covered by the existing binary_or_increase_range_canonicalises_arms fixture; the new binary_or_increase_instant_canonicalises_arms fixture pins the instant-mode path. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
tsouza
added a commit
that referenced
this pull request
May 22, 2026
7962567 to
f6065ff
Compare
3 tasks
tsouza
added a commit
that referenced
this pull request
May 22, 2026
…pties on fresh compose iterate-metrics-explorer + iterate-all-dashboards on PR #701's compose stack flagged ~30 otelcol_* metrics with empty /api/v1/series + the clickhouse-observability "Query rate by type" panel as empty. Both are emission-cadence artefacts of a fresh stack, not regressions: - otelcol_{exporter,processor,receiver,scraper,connector,process}_* — Collector self-telemetry counters that only tick on the underlying event (refused span, failed export, queue change). On a clean pipeline with no overload most stay at 0 in the 5m window even though the prometheus/self scraper has primed the catalog. - clickhouse_event{name=~"Query|SelectQuery|...|FailedInsertQuery"} — CH's per-event counters published via its built-in /metrics. The warmup drives a few SELECTs through cerberus but the scrape cadence (15s) + CH-side ProfileEvents flush can leave the 5m rate window empty when the cluster is otherwise idle. Add one broad `otelcol_` prefix entry to EXPECTED_EMPTY (covers all six otelcol_* subsystems; per-metric entries would be ~30 lines with identical rationale) and one substring entry to EXPECTED_EMPTY_EXPR_SUBSTRINGS pinned to the clickhouse_event Query regex. Keeps both lists under the 10-entry budget called out in their docstrings. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tsouza
added a commit
that referenced
this pull request
May 22, 2026
f6065ff to
712db3e
Compare
tsouza
added a commit
that referenced
this pull request
May 22, 2026
…pties on fresh compose (#708) iterate-metrics-explorer + iterate-all-dashboards on PR #701's compose stack flagged ~30 otelcol_* metrics with empty /api/v1/series + the clickhouse-observability "Query rate by type" panel as empty. Both are emission-cadence artefacts of a fresh stack, not regressions: - otelcol_{exporter,processor,receiver,scraper,connector,process}_* — Collector self-telemetry counters that only tick on the underlying event (refused span, failed export, queue change). On a clean pipeline with no overload most stay at 0 in the 5m window even though the prometheus/self scraper has primed the catalog. - clickhouse_event{name=~"Query|SelectQuery|...|FailedInsertQuery"} — CH's per-event counters published via its built-in /metrics. The warmup drives a few SELECTs through cerberus but the scrape cadence (15s) + CH-side ProfileEvents flush can leave the 5m rate window empty when the cluster is otherwise idle. Add one broad `otelcol_` prefix entry to EXPECTED_EMPTY (covers all six otelcol_* subsystems; per-metric entries would be ~30 lines with identical rationale) and one substring entry to EXPECTED_EMPTY_EXPR_SUBSTRINGS pinned to the clickhouse_event Query regex. Keeps both lists under the 10-entry budget called out in their docstrings. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
tsouza
added a commit
that referenced
this pull request
May 22, 2026
712db3e to
28f6aa0
Compare
Merged
4 tasks
tsouza
added a commit
that referenced
this pull request
May 22, 2026
…tter sums resolve (#710) PromQL's `__name__` matcher lowering uses a Prom-naming heuristic (`internal/schema/Metrics.TableFor`) to pick the metrics table: names ending in `_total` / `_count` / `_sum` / `_bucket` route to the Sum table, everything else to the Gauge table. The heuristic mirrors the Prom-on-OTel remote-write convention, but the OTel-Collector emitters PR #701 wires into the quickstart (`hostmetrics`, `sqlquery/clickhouse`, `prometheus/self`) ship cumulative sums under bare names that violate the convention — `system_cpu_time`, `clickhouse_event`, `otelcol_process_uptime` — so the matcher routed those to the Gauge table and returned zero rows even though the row data lived in Sum. The catalog endpoints (`/api/v1/series`, `/api/v1/label/...`) already union all metric tables on the read side, so dashboards surfaced these metrics in their metric pickers — but the matcher path silently diverged. PR #701's `otelcol-observability` + `clickhouse-observability` dashboards painted empty panels against fresh compose data; the panel-kiosk + iterate-metrics-explorer sweeps caught the regression as "Unable to fetch labels" + 10-11 console-error 400s per panel. The fix introduces `schema.Metrics.TablesFor` returning the candidate table set (Gauge + Sum for unsuffixed names, single-table for suffixed ones) and an opt-in `chplan.Scan.UnionTables` field the chsql emitter renders as a CH `merge(currentDatabase(), '<regex>')` table function call. CH's `merge()` fans the scan across the matching tables in the named database, projecting the columns common to every member; the Sum-only columns (`AggregationTemporality`, `IsMonotonic`) drop out of the merged view but no metric-row consumer references them. The PREWHERE on `MetricName` translates per-arm at CH's planning stage so granule pruning still fires. `lowerVectorSelector` now constructs the Scan via a `scanFromTables` helper: single-element table list lowers to the legacy `Scan{Table: ...}` shape (byte-stable for the suffix-routed fixtures); multi-element lowers to `Scan{UnionTables: ...}`. The histogram-companion + bucket-selector overrides keep single-table semantics — they rewrite the `__name__` matcher to a bare base name that only the histogram table stores, so a fan-out across Gauge/Sum would just contribute zero rows. The mv_substitution rule's `c.BaseTable != scan.Table` guard naturally skips Scans with empty Table (the UnionTables case) — rollups can't re-route across heterogeneous physical layouts. The late-mat optimizer likewise skips via `lateMatShapeFor(scan.Table)` returning `!ok`. Both exclusions are correct: rollups and wide-column late mat both bake in single-table assumptions the union scan doesn't satisfy. 158 existing TXTAR fixtures absorb the `FROM otel_metrics_gauge` → `FROM merge(currentDatabase(), '^(otel_metrics_gauge|otel_metrics_sum)$')` change. A new pin fixture (`scan_unions_gauge_sum_for_unsuffixed_metric.txtar`) seeds an empty Gauge table alongside a populated Sum table so the chDB roundtrip exercises the actual multi-table union — a regression that dropped the Sum-table arm of merge() would return zero rows. Verified against the live compose stack: every previously-failing query listed on PR #701's compose-smoke iteration (run 26308908297) now returns data — `otelcol_process_uptime`, `system_cpu_time`, `clickhouse_event`, the MergeTree-I/O `rate(clickhouse_event{name=~...}[5m])` panel, the three-arm `or` shapes, the histogram_quantile over `otelcol_processor_batch_batch_send_size_bucket`. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
tsouza
added a commit
that referenced
this pull request
May 22, 2026
b4526fd to
5c08787
Compare
5 tasks
tsouza
added a commit
that referenced
this pull request
May 22, 2026
5c08787 to
0019ecd
Compare
3 tasks
tsouza
added a commit
that referenced
this pull request
May 22, 2026
Structural cleanup PR. Every test failure must surface as a real bug to
fix at the source (cerberus code, seed, dashboard, panel) — no allow-
list, tolerance, expected-empty, should_skip, expect.soft, or per-field
blank-skip is acceptable anywhere.
Playwright specs
- iterate-all-dashboards.spec.ts: delete EXPECTED_EMPTY_EXPR_SUBSTRINGS
+ isExpectedEmpty + the conditional in probeTarget. Every empty
panel result is now a hard fail.
- iterate-metrics-explorer.spec.ts: delete EXPECTED_EMPTY,
HISTOGRAM_COMPANION_SUFFIXES, dottedAlias, and the companion-
suffix + dotted-alias fallbacks that masked bare-name 0-result
failures. Every catalog-published metric must resolve to >= 1 series.
- iterate-drilldown-apps.spec.ts: delete ALERT_ERROR_PATTERNS (banner-
substring allow-list), APP_NOT_INSTALLED_BANNER_PATTERNS, and
DRILLDOWN_UPSTREAM_GRAFANA_CONSOLE_NOISE. Hard-fail every role=alert
banner and every console error. Remove the install-probe early-skip;
every catalogue entry must be installed (catalogue now lists three
apps cerberus actually provisions; pyroscope is gone).
- helpers/drilldown.ts: drop grafana-pyroscope-app from the catalogue.
- compose_grafana_smoke.spec.ts: flip expect.soft -> hard throw; prune
the retired-allowlist documentation block.
- helpers/{assertions,dom,sweep}.ts: prune retired-allowlist comments;
reword "tolerated" prose.
Compatibility harnesses
- compatibility/loki/cerberus-test-queries.yml: empty the should_skip
block (was 17 entries). File kept as schema placeholder.
- compatibility/loki/cmd/loki-compliance-tester/main.go: delete the
Overlay loader, skipKey lookup, SkipReason field on Result, -overlay
flag, and every overlay-driven branch in compareAll. Drop the yaml
import.
- compatibility/loki/scripts/run-loki-compatibility.sh: drop -overlay
arg + DRIVER_OVERLAY env + the `skipped` jq bucket.
- compatibility/prometheus/expected-failures.json: DELETED (was empty
anyway; the file itself was the allowlist mechanism).
- compatibility/tempo/expected-failures.json: DELETED. Remove the
--expected-failures flag, the loader, the docker-compose mount, the
run-script DIFF_FLAGS branch.
- compatibility/tempo/driver/differ.go: remove the StartTimeUnixNano
blank-skip — an asymmetric blank on either side is now a real
divergence (the backend that omitted the field is the bug). Epsilon
comparison for parsed numeric values is kept (float-noise absorption,
not a case-allowlist).
CI / lefthook gates
- .github/workflows/ci.yml forbid-skip:
* Replace the "guard new should_skip entries with a tracking ref"
step with a hard reject of any non-empty should_skip block. The
consumer code is gone; entries would be silently ignored.
* New step: reject test-suite escape-hatch primitives anywhere in
.ts/.tsx/.go (EXPECTED_EMPTY, EXPECTED_TOLERATED, isKnownTolerated,
tolerated404, expect.soft, should_tolerate, skipReason/SkipReason,
APP_NOT_INSTALLED_BANNER_PATTERNS, DRILLDOWN_UPSTREAM_GRAFANA_CONSOLE_NOISE).
- lefthook.yml: mirror the same forbid-escape-hatch + should_skip
guards in the pre-push hook.
- scripts/check-skip-additions.sh: DELETED. The "guard untracked
entries" policy is replaced by "zero entries".
Docs
- docs/compatibility.md: replace the "Expected-failures allowlist"
section with a "No allow-lists" section documenting the new policy.
- compatibility/loki/README.md: align the overlay description with
the schema-placeholder reality.
- compatibility/prometheus/{test-cerberus.yml,scripts/run-compatibility.sh}:
reword to remove the expected-failures references.
PR #701 follow-up (feat/quickstart-rich-observability)
The PR #701 branch adds two more EXPECTED_EMPTY entries on top of main:
- iterate-metrics-explorer.spec.ts: `system_` prefix + `clickhouse_event`
prefix entries (b3a9dad).
- iterate-all-dashboards.spec.ts: broaden the `clickhouse_event` match
pattern (b4526fd).
Once this PR merges and #701 rebases, those entries become deletion
conflicts the rebaser must hand-resolve to "deleted". The
`forbid-escape-hatch` gate will reject the PR if any survives.
tsouza
added a commit
that referenced
this pull request
May 22, 2026
Structural cleanup PR. Every test failure must surface as a real bug to
fix at the source (cerberus code, seed, dashboard, panel) — no allow-
list, tolerance, expected-empty, should_skip, expect.soft, or per-field
blank-skip is acceptable anywhere.
Playwright specs
- iterate-all-dashboards.spec.ts: delete EXPECTED_EMPTY_EXPR_SUBSTRINGS
+ isExpectedEmpty + the conditional in probeTarget. Every empty
panel result is now a hard fail.
- iterate-metrics-explorer.spec.ts: delete EXPECTED_EMPTY,
HISTOGRAM_COMPANION_SUFFIXES, dottedAlias, and the companion-
suffix + dotted-alias fallbacks that masked bare-name 0-result
failures. Every catalog-published metric must resolve to >= 1 series.
- iterate-drilldown-apps.spec.ts: delete ALERT_ERROR_PATTERNS (banner-
substring allow-list), APP_NOT_INSTALLED_BANNER_PATTERNS, and
DRILLDOWN_UPSTREAM_GRAFANA_CONSOLE_NOISE. Hard-fail every role=alert
banner and every console error. Remove the install-probe early-skip;
every catalogue entry must be installed (catalogue now lists three
apps cerberus actually provisions; pyroscope is gone).
- helpers/drilldown.ts: drop grafana-pyroscope-app from the catalogue.
- compose_grafana_smoke.spec.ts: flip expect.soft -> hard throw; prune
the retired-allowlist documentation block.
- helpers/{assertions,dom,sweep}.ts: prune retired-allowlist comments;
reword "tolerated" prose.
Compatibility harnesses
- compatibility/loki/cerberus-test-queries.yml: empty the should_skip
block (was 17 entries). File kept as schema placeholder.
- compatibility/loki/cmd/loki-compliance-tester/main.go: delete the
Overlay loader, skipKey lookup, SkipReason field on Result, -overlay
flag, and every overlay-driven branch in compareAll. Drop the yaml
import.
- compatibility/loki/scripts/run-loki-compatibility.sh: drop -overlay
arg + DRIVER_OVERLAY env + the `skipped` jq bucket.
- compatibility/prometheus/expected-failures.json: DELETED (was empty
anyway; the file itself was the allowlist mechanism).
- compatibility/tempo/expected-failures.json: DELETED. Remove the
--expected-failures flag, the loader, the docker-compose mount, the
run-script DIFF_FLAGS branch.
- compatibility/tempo/driver/differ.go: remove the StartTimeUnixNano
blank-skip — an asymmetric blank on either side is now a real
divergence (the backend that omitted the field is the bug). Epsilon
comparison for parsed numeric values is kept (float-noise absorption,
not a case-allowlist).
CI / lefthook gates
- .github/workflows/ci.yml forbid-skip:
* Replace the "guard new should_skip entries with a tracking ref"
step with a hard reject of any non-empty should_skip block. The
consumer code is gone; entries would be silently ignored.
* New step: reject test-suite escape-hatch primitives anywhere in
.ts/.tsx/.go (EXPECTED_EMPTY, EXPECTED_TOLERATED, isKnownTolerated,
tolerated404, expect.soft, should_tolerate, skipReason/SkipReason,
APP_NOT_INSTALLED_BANNER_PATTERNS, DRILLDOWN_UPSTREAM_GRAFANA_CONSOLE_NOISE).
- lefthook.yml: mirror the same forbid-escape-hatch + should_skip
guards in the pre-push hook.
- scripts/check-skip-additions.sh: DELETED. The "guard untracked
entries" policy is replaced by "zero entries".
Docs
- docs/compatibility.md: replace the "Expected-failures allowlist"
section with a "No allow-lists" section documenting the new policy.
- compatibility/loki/README.md: align the overlay description with
the schema-placeholder reality.
- compatibility/prometheus/{test-cerberus.yml,scripts/run-compatibility.sh}:
reword to remove the expected-failures references.
PR #701 follow-up (feat/quickstart-rich-observability)
The PR #701 branch adds two more EXPECTED_EMPTY entries on top of main:
- iterate-metrics-explorer.spec.ts: `system_` prefix + `clickhouse_event`
prefix entries (b3a9dad).
- iterate-all-dashboards.spec.ts: broaden the `clickhouse_event` match
pattern (b4526fd).
Once this PR merges and #701 rebases, those entries become deletion
conflicts the rebaser must hand-resolve to "deleted". The
`forbid-escape-hatch` gate will reject the PR if any survives.
tsouza
added a commit
that referenced
this pull request
May 22, 2026
Structural cleanup PR. Every test failure must surface as a real bug to
fix at the source (cerberus code, seed, dashboard, panel) — no allow-
list, tolerance, expected-empty, should_skip, expect.soft, or per-field
blank-skip is acceptable anywhere.
Playwright specs
- iterate-all-dashboards.spec.ts: delete EXPECTED_EMPTY_EXPR_SUBSTRINGS
+ isExpectedEmpty + the conditional in probeTarget. Every empty
panel result is now a hard fail.
- iterate-metrics-explorer.spec.ts: delete EXPECTED_EMPTY,
HISTOGRAM_COMPANION_SUFFIXES, dottedAlias, and the companion-
suffix + dotted-alias fallbacks that masked bare-name 0-result
failures. Every catalog-published metric must resolve to >= 1 series.
- iterate-drilldown-apps.spec.ts: delete ALERT_ERROR_PATTERNS (banner-
substring allow-list), APP_NOT_INSTALLED_BANNER_PATTERNS, and
DRILLDOWN_UPSTREAM_GRAFANA_CONSOLE_NOISE. Hard-fail every role=alert
banner and every console error. Remove the install-probe early-skip;
every catalogue entry must be installed (catalogue now lists three
apps cerberus actually provisions; pyroscope is gone).
- helpers/drilldown.ts: drop grafana-pyroscope-app from the catalogue.
- compose_grafana_smoke.spec.ts: flip expect.soft -> hard throw; prune
the retired-allowlist documentation block.
- helpers/{assertions,dom,sweep}.ts: prune retired-allowlist comments;
reword "tolerated" prose.
Compatibility harnesses
- compatibility/loki/cerberus-test-queries.yml: empty the should_skip
block (was 17 entries). File kept as schema placeholder.
- compatibility/loki/cmd/loki-compliance-tester/main.go: delete the
Overlay loader, skipKey lookup, SkipReason field on Result, -overlay
flag, and every overlay-driven branch in compareAll. Drop the yaml
import.
- compatibility/loki/scripts/run-loki-compatibility.sh: drop -overlay
arg + DRIVER_OVERLAY env + the `skipped` jq bucket.
- compatibility/prometheus/expected-failures.json: DELETED (was empty
anyway; the file itself was the allowlist mechanism).
- compatibility/tempo/expected-failures.json: DELETED. Remove the
--expected-failures flag, the loader, the docker-compose mount, the
run-script DIFF_FLAGS branch.
- compatibility/tempo/driver/differ.go: remove the StartTimeUnixNano
blank-skip — an asymmetric blank on either side is now a real
divergence (the backend that omitted the field is the bug). Epsilon
comparison for parsed numeric values is kept (float-noise absorption,
not a case-allowlist).
CI / lefthook gates
- .github/workflows/ci.yml forbid-skip:
* Replace the "guard new should_skip entries with a tracking ref"
step with a hard reject of any non-empty should_skip block. The
consumer code is gone; entries would be silently ignored.
* New step: reject test-suite escape-hatch primitives anywhere in
.ts/.tsx/.go (EXPECTED_EMPTY, EXPECTED_TOLERATED, isKnownTolerated,
tolerated404, expect.soft, should_tolerate, skipReason/SkipReason,
APP_NOT_INSTALLED_BANNER_PATTERNS, DRILLDOWN_UPSTREAM_GRAFANA_CONSOLE_NOISE).
- lefthook.yml: mirror the same forbid-escape-hatch + should_skip
guards in the pre-push hook.
- scripts/check-skip-additions.sh: DELETED. The "guard untracked
entries" policy is replaced by "zero entries".
Docs
- docs/compatibility.md: replace the "Expected-failures allowlist"
section with a "No allow-lists" section documenting the new policy.
- compatibility/loki/README.md: align the overlay description with
the schema-placeholder reality.
- compatibility/prometheus/{test-cerberus.yml,scripts/run-compatibility.sh}:
reword to remove the expected-failures references.
PR #701 follow-up (feat/quickstart-rich-observability)
The PR #701 branch adds two more EXPECTED_EMPTY entries on top of main:
- iterate-metrics-explorer.spec.ts: `system_` prefix + `clickhouse_event`
prefix entries (b3a9dad).
- iterate-all-dashboards.spec.ts: broaden the `clickhouse_event` match
pattern (b4526fd).
Once this PR merges and #701 rebases, those entries become deletion
conflicts the rebaser must hand-resolve to "deleted". The
`forbid-escape-hatch` gate will reject the PR if any survives.
tsouza
added a commit
that referenced
this pull request
May 22, 2026
a8d1769 to
183739f
Compare
tsouza
added a commit
that referenced
this pull request
May 22, 2026
Structural cleanup PR. Every test failure must surface as a real bug to
fix at the source (cerberus code, seed, dashboard, panel) — no allow-
list, tolerance, expected-empty, should_skip, expect.soft, or per-field
blank-skip is acceptable anywhere.
Playwright specs
- iterate-all-dashboards.spec.ts: delete EXPECTED_EMPTY_EXPR_SUBSTRINGS
+ isExpectedEmpty + the conditional in probeTarget. Every empty
panel result is now a hard fail.
- iterate-metrics-explorer.spec.ts: delete EXPECTED_EMPTY,
HISTOGRAM_COMPANION_SUFFIXES, dottedAlias, and the companion-
suffix + dotted-alias fallbacks that masked bare-name 0-result
failures. Every catalog-published metric must resolve to >= 1 series.
- iterate-drilldown-apps.spec.ts: delete ALERT_ERROR_PATTERNS (banner-
substring allow-list), APP_NOT_INSTALLED_BANNER_PATTERNS, and
DRILLDOWN_UPSTREAM_GRAFANA_CONSOLE_NOISE. Hard-fail every role=alert
banner and every console error. Remove the install-probe early-skip;
every catalogue entry must be installed (catalogue now lists three
apps cerberus actually provisions; pyroscope is gone).
- helpers/drilldown.ts: drop grafana-pyroscope-app from the catalogue.
- compose_grafana_smoke.spec.ts: flip expect.soft -> hard throw; prune
the retired-allowlist documentation block.
- helpers/{assertions,dom,sweep}.ts: prune retired-allowlist comments;
reword "tolerated" prose.
Compatibility harnesses
- compatibility/loki/cerberus-test-queries.yml: empty the should_skip
block (was 17 entries). File kept as schema placeholder.
- compatibility/loki/cmd/loki-compliance-tester/main.go: delete the
Overlay loader, skipKey lookup, SkipReason field on Result, -overlay
flag, and every overlay-driven branch in compareAll. Drop the yaml
import.
- compatibility/loki/scripts/run-loki-compatibility.sh: drop -overlay
arg + DRIVER_OVERLAY env + the `skipped` jq bucket.
- compatibility/prometheus/expected-failures.json: DELETED (was empty
anyway; the file itself was the allowlist mechanism).
- compatibility/tempo/expected-failures.json: DELETED. Remove the
--expected-failures flag, the loader, the docker-compose mount, the
run-script DIFF_FLAGS branch.
- compatibility/tempo/driver/differ.go: remove the StartTimeUnixNano
blank-skip — an asymmetric blank on either side is now a real
divergence (the backend that omitted the field is the bug). Epsilon
comparison for parsed numeric values is kept (float-noise absorption,
not a case-allowlist).
CI / lefthook gates
- .github/workflows/ci.yml forbid-skip:
* Replace the "guard new should_skip entries with a tracking ref"
step with a hard reject of any non-empty should_skip block. The
consumer code is gone; entries would be silently ignored.
* New step: reject test-suite escape-hatch primitives anywhere in
.ts/.tsx/.go (EXPECTED_EMPTY, EXPECTED_TOLERATED, isKnownTolerated,
tolerated404, expect.soft, should_tolerate, skipReason/SkipReason,
APP_NOT_INSTALLED_BANNER_PATTERNS, DRILLDOWN_UPSTREAM_GRAFANA_CONSOLE_NOISE).
- lefthook.yml: mirror the same forbid-escape-hatch + should_skip
guards in the pre-push hook.
- scripts/check-skip-additions.sh: DELETED. The "guard untracked
entries" policy is replaced by "zero entries".
Docs
- docs/compatibility.md: replace the "Expected-failures allowlist"
section with a "No allow-lists" section documenting the new policy.
- compatibility/loki/README.md: align the overlay description with
the schema-placeholder reality.
- compatibility/prometheus/{test-cerberus.yml,scripts/run-compatibility.sh}:
reword to remove the expected-failures references.
PR #701 follow-up (feat/quickstart-rich-observability)
The PR #701 branch adds two more EXPECTED_EMPTY entries on top of main:
- iterate-metrics-explorer.spec.ts: `system_` prefix + `clickhouse_event`
prefix entries (b3a9dad).
- iterate-all-dashboards.spec.ts: broaden the `clickhouse_event` match
pattern (b4526fd).
Once this PR merges and #701 rebases, those entries become deletion
conflicts the rebaser must hand-resolve to "deleted". The
`forbid-escape-hatch` gate will reject the PR if any survives.
tsouza
added a commit
that referenced
this pull request
May 22, 2026
Structural cleanup PR. Every test failure must surface as a real bug to
fix at the source (cerberus code, seed, dashboard, panel) — no allow-
list, tolerance, expected-empty, should_skip, expect.soft, or per-field
blank-skip is acceptable anywhere.
Playwright specs
- iterate-all-dashboards.spec.ts: delete EXPECTED_EMPTY_EXPR_SUBSTRINGS
+ isExpectedEmpty + the conditional in probeTarget. Every empty
panel result is now a hard fail.
- iterate-metrics-explorer.spec.ts: delete EXPECTED_EMPTY,
HISTOGRAM_COMPANION_SUFFIXES, dottedAlias, and the companion-
suffix + dotted-alias fallbacks that masked bare-name 0-result
failures. Every catalog-published metric must resolve to >= 1 series.
- iterate-drilldown-apps.spec.ts: delete ALERT_ERROR_PATTERNS (banner-
substring allow-list), APP_NOT_INSTALLED_BANNER_PATTERNS, and
DRILLDOWN_UPSTREAM_GRAFANA_CONSOLE_NOISE. Hard-fail every role=alert
banner and every console error. Remove the install-probe early-skip;
every catalogue entry must be installed (catalogue now lists three
apps cerberus actually provisions; pyroscope is gone).
- helpers/drilldown.ts: drop grafana-pyroscope-app from the catalogue.
- compose_grafana_smoke.spec.ts: flip expect.soft -> hard throw; prune
the retired-allowlist documentation block.
- helpers/{assertions,dom,sweep}.ts: prune retired-allowlist comments;
reword "tolerated" prose.
Compatibility harnesses
- compatibility/loki/cerberus-test-queries.yml: empty the should_skip
block (was 17 entries). File kept as schema placeholder.
- compatibility/loki/cmd/loki-compliance-tester/main.go: delete the
Overlay loader, skipKey lookup, SkipReason field on Result, -overlay
flag, and every overlay-driven branch in compareAll. Drop the yaml
import.
- compatibility/loki/scripts/run-loki-compatibility.sh: drop -overlay
arg + DRIVER_OVERLAY env + the `skipped` jq bucket.
- compatibility/prometheus/expected-failures.json: DELETED (was empty
anyway; the file itself was the allowlist mechanism).
- compatibility/tempo/expected-failures.json: DELETED. Remove the
--expected-failures flag, the loader, the docker-compose mount, the
run-script DIFF_FLAGS branch.
- compatibility/tempo/driver/differ.go: remove the StartTimeUnixNano
blank-skip — an asymmetric blank on either side is now a real
divergence (the backend that omitted the field is the bug). Epsilon
comparison for parsed numeric values is kept (float-noise absorption,
not a case-allowlist).
CI / lefthook gates
- .github/workflows/ci.yml forbid-skip:
* Replace the "guard new should_skip entries with a tracking ref"
step with a hard reject of any non-empty should_skip block. The
consumer code is gone; entries would be silently ignored.
* New step: reject test-suite escape-hatch primitives anywhere in
.ts/.tsx/.go (EXPECTED_EMPTY, EXPECTED_TOLERATED, isKnownTolerated,
tolerated404, expect.soft, should_tolerate, skipReason/SkipReason,
APP_NOT_INSTALLED_BANNER_PATTERNS, DRILLDOWN_UPSTREAM_GRAFANA_CONSOLE_NOISE).
- lefthook.yml: mirror the same forbid-escape-hatch + should_skip
guards in the pre-push hook.
- scripts/check-skip-additions.sh: DELETED. The "guard untracked
entries" policy is replaced by "zero entries".
Docs
- docs/compatibility.md: replace the "Expected-failures allowlist"
section with a "No allow-lists" section documenting the new policy.
- compatibility/loki/README.md: align the overlay description with
the schema-placeholder reality.
- compatibility/prometheus/{test-cerberus.yml,scripts/run-compatibility.sh}:
reword to remove the expected-failures references.
PR #701 follow-up (feat/quickstart-rich-observability)
The PR #701 branch adds two more EXPECTED_EMPTY entries on top of main:
- iterate-metrics-explorer.spec.ts: `system_` prefix + `clickhouse_event`
prefix entries (b3a9dad).
- iterate-all-dashboards.spec.ts: broaden the `clickhouse_event` match
pattern (b4526fd).
Once this PR merges and #701 rebases, those entries become deletion
conflicts the rebaser must hand-resolve to "deleted". The
`forbid-escape-hatch` gate will reject the PR if any survives.
…ed in via OTel collector Adds three new metric receivers alongside the existing OTLP self-export in test/e2e/otel-collector/compose-config.yaml so the docker-compose quickstart paints a full-stack observability picture out of the box, without bypassing cerberus on the read path: - prometheus/self -- scrapes the collector's own :8888/metrics endpoint (service.telemetry.metrics now exposes it via the 0.123+ readers/pull syntax). Surfaces every otelcol_* receiver / processor / exporter counter, queue depth, batch send sizes, Go runtime memory. - hostmetrics -- every supported scraper enabled (cpu / memory / disk / network / filesystem / load / paging / processes) including the disabled-by-default *.utilization gauges and conntrack counters. - sqlquery/clickhouse -- queries system.metrics, system.events, system.asynchronous_metrics, and system.parts every 15s. The three name-pivoted families (clickhouse_metric / _event / _async_metric) cover ~400 CH server signals without enumerating each one up front. A transform/metric_names processor rewrites OTel-dotted metric names (system.cpu.time) to underscored PromQL-friendly ones (system_cpu_time) before writing so dashboard queries don't need UTF-8 escaping. Three new resource processors stamp service.name per source so PromQL filters can pivot. Three matching dashboards land under test/e2e/grafana/compose/ dashboards/ alongside cerberus-self.json: - clickhouse-observability.json -- in-flight queries, parts on disk, memory, connections, merges, query rate by type, MergeTree I/O, network, caches, thread pools, replication state, errors, host resource gauges. - otelcol-observability.json -- uptime, RSS, queue depth, send failures, processor refusals, Go heap, receiver / exporter throughput by signal, drops / failures, batch send-size quantiles, queue depth vs capacity. - host-observability.json -- CPU by state, per-core utilisation, memory by state + utilisation, disk IOPS / throughput / operation time, network throughput + packets / errors / drops, filesystem utilisation by mount with red threshold at 90%, load average, paging. Validated against otel/opentelemetry-collector-contrib:0.152.1 (latest release). Real data flowed in a verify-network run: 750 sum rows + 1284 gauge rows across 64 distinct metric names produced inside ~60s. Stack-level pickup needs the compose docker-compose.yml owner to add the host /proc, /sys, / mounts (for true host visibility) and bump the collector image pin from 0.116.1 to 0.152.1 -- coordinated with the seed-removal agent rather than landed in this PR per the worktree file-disjointness contract.
…proc /sys for hostmetrics The hostmetricsreceiver wired in by PR #701 was reading the collector container's own /proc + /sys namespace, not the host's — every system_cpu_*, system_memory_*, system_disk_*, system_filesystem_* series reflected container-scoped state instead of the host machine the quickstart promises to surface. Mount /proc, /sys, and / from the host into /hostfs (ro,rslave) and point the receiver at them via the upstream-documented contract: HOST_PROC / HOST_SYS / HOST_ETC env vars (cpu / memory / paging / network / processes scrapers) + root_path: /hostfs in the receiver config (filesystem + disk scrapers that walk the full tree). Verified locally via `curl localhost:8080/api/v1/query?query=system_cpu_time`: 64 series (8 host CPUs x 8 states), cumulative seconds matching host uptime — clearly host data, not the ~0s container would emit. The image bump also lets us delete the transform/servicegraph_drop_exemplars workaround: the v0.116.x clickhouseexporter nil-deref on exemplar payloads (sum_metrics.go:129) that the processor existed to dodge is fixed in 0.152.1, confirmed by running the metrics/servicegraph pipeline for >2 metrics_flush_interval ticks with no panic and the traces_service_graph_request_total series flowing end-to-end into ClickHouse. Same image bump on the k3d gateway + agent for parity (the "bump both together" rule the original 0.120.0 comment called out). Drops the same drop_exemplars processor on the k3d side; renames the compose-side receivers + connector to their canonical host_metrics / service_graph aliases to silence the 0.152.x deprecation warnings the legacy hostmetrics / servicegraph names now emit on startup.
183739f to
fde1868
Compare
…name to service_graph 0.152.1 fixed the upstream clickhouseexporter nil-deref on the service-graph connector's exemplar payload that the earlier transform/servicegraph_drop_exemplars processor existed to work around; the same commit (fde1868) dropped the workaround from the k3d manifest but the regression test was still pinning it as required. Rename the connector key in the k3d manifest to the new canonical service_graph form (the deprecation alias is still accepted but the compose stack already pins the canonical name; mirror it here) and update the test to match — assertions still cover connector presence + traces-pipeline tap + metrics/servicegraph pipeline wiring.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Wires three additional OTel-collector metric receivers into the quickstart compose stack so it paints a full-stack observability picture out of the box, with three new Grafana dashboards under the existing
Cerberusfolder. Every metric path flows through cerberus on the read side -- nothing bypasses the gateway.otelcol_*receiver / processor / exporter counter, queue depth, batch send sizes, Go runtime memory.*.utilizationgauges and conntrack counters.system.metrics,system.events,system.asynchronous_metrics, andsystem.partsevery 15s. Three name-pivoted families (clickhouse_metric/_event/_async_metric) cover ~400 CH server signals without enumerating each one up front.A
transform/metric_namesprocessor rewrites OTel-dotted names (system.cpu.time) to underscored PromQL-friendly ones (system_cpu_time) before write so dashboard queries don't need UTF-8 escaping. Three resource processors stampservice.nameper source so PromQL filters can pivot.New dashboards
clickhouse-observability.json-- in-flight queries, parts on disk, memory, connections, merges, query rate by type, MergeTree I/O, network, caches, thread pools, replication state, errors, host resource gauges.otelcol-observability.json-- uptime, RSS, queue depth, send failures, processor refusals, Go heap, receiver/exporter throughput by signal, drops/failures, batch send-size quantiles, queue depth vs capacity.host-observability.json-- CPU by state, per-core utilisation, memory by state + utilisation, disk IOPS / throughput / operation time, network throughput + packets/errors/drops, filesystem utilisation by mount with red threshold at 90%, load average, paging.Verification
Validated against
otel/opentelemetry-collector-contrib:0.152.1(latest release) --validate --configpasses cleanly and a one-off run on a verify-network produced 750 sum rows + 1284 gauge rows across 64 distinct metric names within ~60s. Sample names:Coordination notes (out of scope for this PR)
The receiver YAML is wired; for the dashboards to populate against the live stack the docker-compose.yml owner (seed-removal agent) needs to:
otel/opentelemetry-collector-contrib:0.116.1to0.152.1(the newservice.telemetry.metrics.readers/pull/prometheussyntax is 0.123+; the legacyaddress:shorthand has been removed in 0.152).hostmetrics.root_path: /hostfsin the receiver YAML.Both are documented in the comment block at the top of
test/e2e/otel-collector/compose-config.yaml.Test plan
otelcol-contrib validate --config=...against 0.152.1 (zero errors)otel_metrics_{sum,gauge,histogram}within 60s🤖 Generated with Claude Code