Skip to content

Reformat metric labels to consistently use snake case#4893

Open
JamesMurkin wants to merge 1 commit intomasterfrom
reformat_metrics
Open

Reformat metric labels to consistently use snake case#4893
JamesMurkin wants to merge 1 commit intomasterfrom
reformat_metrics

Conversation

@JamesMurkin
Copy link
Copy Markdown
Contributor

Add snakecase equivalent of all labels to our metrics, with the longer term goal of moving wholesale to snake case labels

This will take 2 steps:

  • Add snakecase labels equivalent to all metrics (this PR)
  • Remove all non-snake case labels
    • This will not happen for a while, to allow for various people to migrate to the new labels

@JamesMurkin JamesMurkin marked this pull request as ready for review April 30, 2026 14:36
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 30, 2026

Greptile Summary

This PR adds snake_case aliases alongside every existing camelCase Prometheus metric label (e.g. priorityClasspriority_class, nodeTypenode_type, overAllocatedover_allocated) across the scheduler, executor, and ingest services, to support a two-phase migration away from camelCase labels. Each metric descriptor now carries both forms and every recording call passes the same value twice — once for the legacy label and once for the new one.

The mechanical changes are consistent throughout: label counts in descriptors match value counts in every MustNewConstMetric, MustNewConstHistogram, and WithLabelValues call. The test updates also include a pre-existing semantic fix — jobErrorsByQueue/jobErrorsByNode are now tested with error-category labels rather than the state labels that were previously used by coincidence of equal arity.

Confidence Score: 4/5

Safe to merge — all label/value cardinalities are consistent and tests pass the updated counts.

No bugs found. Label counts in every descriptor match the value positions in every recording call. The test changes are correct and include a pre-existing semantic fix. Score is 4 rather than 5 purely because of the breadth of mechanical changes across many metrics — a subtle off-by-one in any single call site would cause a runtime panic.

internal/common/metrics/scheduler_metrics.go and internal/scheduler/metrics/cycle_metrics.go have the most label changes and are worth a final spot-check.

Important Files Changed

Filename Overview
internal/common/metrics/scheduler_metrics.go Adds snake_case aliases (priority_class, price_band, resource, node_type, set_by_user) for all camelCase labels across ~30 metric descriptors and their corresponding metric-construction functions; label counts and value positions are consistent throughout.
internal/common/ingest/metrics/metrics.go Adds event_type and msg_type snake_case labels to the eventsProcessed CounterVec; With() maps are updated with matching keys; counts are consistent.
internal/executor/metrics/pod_metrics/cluster_context.go Introduces resourceTypeLabelSnake ("resource") and nodeTypeLabelSnake ("node_type") constants, adds them to all pod/node metric descriptors, and duplicates their values in every MustNewConstMetric call; counts are consistent.
internal/scheduler/metrics/constants.go Adds three new snake_case label constants: nodeTypeLabelSnake ("node_type"), priorStateLabelSnake ("prior_state"), overAllocatedLabelSnake ("over_allocated").
internal/scheduler/metrics/cycle_metrics.go Adds node_type/over_allocated snake_case aliases to nodeLabels and the node_preemptibility GaugeVec, and duplicates corresponding values in ReportSchedulerResult; label/value counts match throughout.
internal/scheduler/metrics/state_metrics.go Adds prior_state snake_case alias to all six jobState CounterVecs and duplicates the priorState value in every WithLabelValues call; counts are consistent.
internal/scheduler/metrics/cycle_metrics_test.go Test label value arrays updated to match the expanded label sets for nodeAllocatableResource and nodeAllocatedResource; all array sizes are correct.
internal/scheduler/metrics/state_metrics_test.go Array-key types widened to match new label cardinality; error-metric test labels (byQueueErrorLabels/byNodeErrorLabels) correctly separated from state-metric labels, fixing a pre-existing semantic mismatch.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Metric recorded in code] --> B{Label type}
    B -->|camelCase legacy label| C["e.g. priorityClass=X\nnodeType=Y\noverAllocated=Z"]
    B -->|snake_case new label| D["e.g. priority_class=X\nnode_type=Y\nover_allocated=Z"]
    C --> E[Single Prometheus time series\ncarries both label sets]
    D --> E
    E --> F[Phase 2: Remove camelCase labels\nfuture PR]
    E --> G[Dashboards/alerts migrate\nto snake_case]
Loading

Reviews (1): Last reviewed commit: "Reformat metric labels to consistently u..." | Re-trigger Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant