feat(telemetry): reconcile dashboards with real instruments + top-5 tuning signals (#209 phase 1) #211

Merged
mbachaud merged 1 commit into master from fix/209-telemetry-phase1 on Jun 10, 2026
@mbachaud

Owner

Phase 1 of #209. All 8 phantom metrics in helix-pipeline-observatory.json are repointed to real instruments (every intended signal already existed; the mapping table is in the commit message), the dead job='helix' matchers are dropped, and the process-memory panel is fixed. Instruments for the top-5 tuning signals are wired at their computation sites: helix_dense_cosine (hot/cold arms), helix_shard_fanout + helix_shard_discrimination (the #159 metric), helix_know_decision_total, helix_session_tokens_saved_total, and helix_splice_ratio. A new helix-internals dashboard is added (the launcher already linked to its uid). The genai_telemetry sections of OBSERVABILITY.md are marked planned-phase-2 so the docs match the code. tests/test_telemetry_phase1.py includes the dashboard-vs-registry phantom-killer regression test. Test results: 11/11 new tests and 388 adjacent tests passed.

…uning signals (#209 phase 1)

Phantom metrics: helix-pipeline-observatory.json charted 8 metric names
that exist nowhere in helix_context/telemetry. Every panel is repointed
at its nearest real instrument (and the job="helix" matchers are dropped
- the stack's scrape jobs are otel-collector/prometheus, so they matched
nothing):

  helix_tier_estimation_percent     -> helix_tier_fired_total (share %)
  helix_tier_readable_time_bucket   -> helix_genome_signal_seconds_bucket
  helix_crdt_bucket_accumulation    -> helix_cwola_bucket_total
  helix_rq_duration_seconds_bucket  -> helix_context_latency_seconds_bucket
  helix_ring_edges_by_provenance    -> helix_harmonic_edges_total
  helix_chroni_join_state           -> helix_chromatin_state_total
  helix_cost_concentration_ratio    -> helix_hub_concentration_ratio
  helix_resolve_degree_distribution -> helix_hub_inbound_degree
  (also: process_resident_memory_bytes{job="helix"} -> helix_genome_size_bytes)
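
To illustrate the "(share %)" note on the first mapping, here is a minimal sketch of how a per-tier share could be derived from counter values. It assumes helix_tier_fired_total carries one series per tier; the tier names and the function name are hypothetical and are not taken from the codebase.

```python
def tier_share_percent(fired_counts):
    """Convert per-tier fire counts into percentage shares of the total.

    `fired_counts` maps a tier label (hypothetical names) to its count.
    """
    total = sum(fired_counts.values())
    if total == 0:
        # No tier has fired yet: report 0% rather than dividing by zero.
        return {tier: 0.0 for tier in fired_counts}
    return {tier: 100.0 * count / total for tier, count in fired_counts.items()}


# Hypothetical counts for three tiers.
shares = tier_share_percent({"hot": 30, "warm": 15, "cold": 5})
```

In a dashboard, the same ratio would normally be computed in the panel query over a rate window rather than over raw counts.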

New instruments (audit doc section 3c), all lazy no-op-when-disabled
getters in telemetry/otel.py following the existing pattern:

  helix_dense_cosine               hot dense-recall merge + cold-tier scan
  helix_shard_fanout               ShardRouter.query_genes routed-shard count
  helix_shard_discrimination       routed / known healthy shards (0..1)
  helix_know_decision_total        decide_know_or_miss {outcome, reason}
  helix_session_tokens_saved_total session working-set elision savings
  helix_splice_ratio               assembled-window compression ratio

New deploy/otel/grafana/dashboards/helix-internals.json (uid
helix-internals - the launcher and setup scripts already linked to it)
with one panel per new instrument. OBSERVABILITY.md: genai_telemetry.py
sections (module absent from master) replaced with planned-(#209
phase 2) notes; metric table now matches code.

tests/test_telemetry_phase1.py: no-op safety with OTel disabled,
call-site label checks, and a dashboard-vs-registry cross-reference
that fails on any future phantom metric.
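
A dashboard-vs-registry cross-reference of this kind could be sketched as follows. This is an illustration, not the contents of tests/test_telemetry_phase1.py: it assumes panels sit at the top level of the dashboard JSON with queries under `targets[].expr`, that the registry is a plain set of exported metric names, and that histogram series differ from their base name only by a `_bucket`, `_sum`, or `_count` suffix. The helper names are hypothetical.

```python
import re

# Matches metric names with the helix_ prefix inside a query expression.
METRIC_RE = re.compile(r"\bhelix_[a-z0-9_]+")
HISTOGRAM_SUFFIXES = ("_bucket", "_sum", "_count")


def charted_metrics(dashboard):
    """Collect every helix_* metric name referenced by a panel query."""
    names = set()
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            names.update(METRIC_RE.findall(target.get("expr", "")))
    return names


def phantom_metrics(dashboard, registry):
    """Return charted metric names that no registered instrument backs."""
    phantoms = set()
    for name in charted_metrics(dashboard):
        base = name
        for suffix in HISTOGRAM_SUFFIXES:
            if name.endswith(suffix):
                base = name[: -len(suffix)]
                break
        if name not in registry and base not in registry:
            phantoms.add(name)
    return phantoms


# Inline stand-ins for the dashboard JSON and the instrument registry.
registry = {"helix_tier_fired_total", "helix_context_latency_seconds"}
dashboard = {
    "panels": [
        {"targets": [{"expr": "sum(rate(helix_tier_fired_total[5m]))"}]},
        {"targets": [{"expr": "histogram_quantile(0.95, sum by (le) "
                              "(rate(helix_context_latency_seconds_bucket[5m])))"}]},
        {"targets": [{"expr": "helix_tier_estimation_percent"}]},
    ]
}
phantoms = phantom_metrics(dashboard, registry)
```

A regression test would then assert that `phantoms` is empty for each shipped dashboard, so a panel charting a metric that nothing emits fails the suite.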