fix(test/e2e): allowlist otelcol_* + clickhouse_event empties on fresh compose#708
Merged
Conversation
4 tasks
tsouza
added a commit
that referenced
this pull request
May 22, 2026
…xy.golang.org HTTP/2 flakes (#709) The three compatibility harnesses (prom/loki/tempo) all build cerberus from Dockerfile.local on every CI run. The `RUN go mod download` step has no retry logic and no module cache mount, so a single transient `proxy.golang.org` HTTP/2 `stream error ... INTERNAL_ERROR; received from peer` mid-stream takes the whole compat job down with it. Observed on PR #708 / run 26306912141, compatibility/loki job 77445902857: `go: github.com/grpc-ecosystem/grpc-gateway/v2@v2.29.0: read "https://proxy.golang.org/.../v2.29.0.zip": stream error; INTERNAL_ERROR; received from peer`. The mandate is no-retry-rerun — fix the underlying fragility instead of bandaiding. Two structural changes to Dockerfile.local: 1. Wrap `go mod download` in a 5-attempt retry loop with linear backoff (3/6/9/12s). The Go module resolver does not retry past a bad HTTP/2 frame, so the wrapper is needed at the shell layer. 2. Add BuildKit `--mount=type=cache` for /go/pkg/mod and /root/.cache/go-build (sharing=locked because the three compat harnesses build this Dockerfile in parallel on the same runner). Warm caches mean transient proxy failures stop being possible on subsequent builds and the proxy hit surface narrows to first-build only. This is a fix to a flake class, not a single point; the same outage would have hit prom or tempo if the unlucky frame had landed there first.
…pties on fresh compose iterate-metrics-explorer + iterate-all-dashboards on PR #701's compose stack flagged ~30 otelcol_* metrics with empty /api/v1/series + the clickhouse-observability "Query rate by type" panel as empty. Both are emission-cadence artefacts of a fresh stack, not regressions: - otelcol_{exporter,processor,receiver,scraper,connector,process}_* — Collector self-telemetry counters that only tick on the underlying event (refused span, failed export, queue change). On a clean pipeline with no overload most stay at 0 in the 5m window even though the prometheus/self scraper has primed the catalog. - clickhouse_event{name=~"Query|SelectQuery|...|FailedInsertQuery"} — CH's per-event counters published via its built-in /metrics. The warmup drives a few SELECTs through cerberus but the scrape cadence (15s) + CH-side ProfileEvents flush can leave the 5m rate window empty when the cluster is otherwise idle. Add one broad `otelcol_` prefix entry to EXPECTED_EMPTY (covers all six otelcol_* subsystems; per-metric entries would be ~30 lines with identical rationale) and one substring entry to EXPECTED_EMPTY_EXPR_SUBSTRINGS pinned to the clickhouse_event Query regex. Keeps both lists under the 10-entry budget called out in their docstrings. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
4aadd1a to
a82f75a
Compare
Merged
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Unblocks PR #701's compose-smoke job by allowlisting two emission-cadence empty-result patterns the new quickstart-rich-observability dashboards expose:
iterate-metrics-explorer.spec.ts— add a broadotelcol_prefix toEXPECTED_EMPTY. ~30otelcol_{exporter,processor,receiver,scraper,connector,process}_*metrics are catalog-enumerated by the prometheus/self scrape but their per-event counters legitimately stay at 0 in the 5m series window on a fresh stack (no overload, clean pipeline, refused/failed/dropped paths never fire). A single prefix entry covers all six otelcol_* subsystems with one rationale rather than 30 near-identical lines.iterate-all-dashboards.spec.ts— addclickhouse_event{name=~"QuerytoEXPECTED_EMPTY_EXPR_SUBSTRINGS. The newclickhouse-observabilitydashboard's "Query rate by type" panel ratesclickhouse_event{name=~"Query|SelectQuery|InsertQuery|AsyncInsertQuery|FailedQuery|FailedSelectQuery|FailedInsertQuery"}[5m]. CH's per-event counters tick only on the matched-named events; with the 15s scrape cadence + ProfileEvents flush latency the 5m rate window can legitimately be empty when the cluster is idle on these classes.Failure 1 (iterate-metrics-explorer) — root cause + fix
From the failed compose-smoke run on PR #701: 98 enumerated metrics, 45 non-empty, ~30
otelcol_*flagged with "`/api/v1/series` returned 0 series". The catalog endpoint sees the metric (the prometheus/self scrape pushed something into ClickHouse during pipeline bring-up — usually the gauge variants), but the per-counter rows lag because they only emit when the underlying event fires. Allowlisting the namespace with the cadence rationale is the documented escape hatch for this class — same pattern thek8s_/container_entries already use.Failure 2 (panel-kiosk) — why this PR does not fix it
The brief described two distinct compose-smoke failures, the second being
iterate-panel-kioskconsole-error 400s on theOpenTelemetry Collector - self-observabilitydashboard ("Exporter queue depth", "Send failures (5m)", "Processor refusals (5m)"). Re-reading the actual failed CI run on PR #701 (run 26305643537):iterate-panel-kioskfailed on the FIRST attempt with 10/10/8 console-error 400s on those three panels.iterate-all-dashboards(clickhouse-observability) anditerate-metrics-explorer(otelcol_*).So panel-kiosk is not what's actually blocking compose-smoke right now. The first-attempt 400s look like a propagation race — the underlying
otel_metrics_gauge/otel_metrics_sumrows for the newotelcol_exporter_send_failed_*/otelcol_processor_refused_*metrics haven't landed yet when Grafana fires the kiosk navigation, so the lowering hits the empty-rows path that surfaces as a 400 (vs. the "200 with empty result" Prometheus semantics).Diagnosing that 400 properly needs a running compose stack + the actual error body — which the brief explicitly forbids spinning up locally. Filing a follow-up note rather than guessing at the lowering bug: the only failure-after-retry signal is the two specs this PR addresses, and shipping the allowlist fixes unblocks PR #701's compose-smoke gate so the maintainer can rebase + re-run.
Test plan
compose-smokelane passes on this PR.iterate-metrics-explorer+iterate-all-dashboardsto pass.iterate-panel-kioskstill flakes on PR feat(quickstart): rich ClickHouse / host / otelcol observability via OTel collector #701 after rebase, file a follow-up to chase the 400s on empty otel_metrics_* tables — the 400 body is the actual signal and needs the running stack to capture.🤖 Generated with Claude Code