
fix: remove project label from projectstorage metrics to reduce cardinality#558

Merged
zachsmith1 merged 1 commit into main from
fix/reduce-projectstorage-metric-cardinality
Apr 2, 2026
Conversation

@zachsmith1
Contributor

Summary

  • Remove the project label from all three projectstorage_* metrics

Problem

The project label causes a cardinality explosion in VictoriaMetrics production storage (datum-cloud/infra#2113); the storage PVCs are at 93% capacity.

| Metric | Before (per pod) | After (per pod) |
| --- | --- | --- |
| projectstorage_first_ready_seconds | 414 × 82 × 12 = 407K series | 82 × 12 = 984 series |
| projectstorage_child_creations_total | 414 × 82 = 34K series | 82 series |
| projectstorage_reinitializing_errors_total | 414 × 82 × 7 = 237K series | 82 × 7 = 574 series |
| **Total** | ~678K / pod, ~6.1M across 9 pods | ~1.6K / pod, ~14K across 9 pods |

~414× reduction in cardinality (exactly the number of projects, since dropping the label collapses the 414 per-project series sets into one).
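The per-pod arithmetic in the table can be checked directly. A minimal sketch (the 414 / 82 / 12 / 7 label cardinalities and metric names are taken from this PR; a series count is just the product of its label-value cardinalities):

```python
# Series count per metric = product of label-value cardinalities
# (histogram bucket count behaves like one more label dimension).
PROJECTS, KINDS, BUCKETS, ERROR_REASONS, PODS = 414, 82, 12, 7, 9

before = {
    "projectstorage_first_ready_seconds": PROJECTS * KINDS * BUCKETS,
    "projectstorage_child_creations_total": PROJECTS * KINDS,
    "projectstorage_reinitializing_errors_total": PROJECTS * KINDS * ERROR_REASONS,
}
# Dropping the `project` label divides every count by PROJECTS.
after = {name: count // PROJECTS for name, count in before.items()}

total_before = sum(before.values())
total_after = sum(after.values())
print(total_before, total_before * PODS)  # 678960 per pod, 6110640 across 9 pods
print(total_after, total_after * PODS)    # 1640 per pod, 14760 across 9 pods
print(total_before // total_after)        # 414x reduction
```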

What changed

  • Removed project from label dimensions on all three metrics
  • Removed project field from instrumentedStorage struct
  • Updated recordFirstReady, incrReinit, and childCreations.WithLabelValues call sites

The distribution by resource_group and resource_kind is the useful signal for understanding storage init performance. Per-project granularity is not actionable and is the source of the cardinality problem.
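To illustrate why the call-site change collapses the series set, here is a hedged stdlib-only sketch (not the actual Go code; the real metrics are Prometheus client vectors, and the project/kind names below are hypothetical stand-ins): a labeled counter creates one series per distinct label tuple, so removing `project` from the `WithLabelValues` key shrinks the series set by the project count while preserving every increment.

```python
from collections import Counter
from itertools import product

# Hypothetical label values standing in for the real 414 projects / 82 kinds.
projects = [f"project-{i}" for i in range(414)]
kinds = [f"kind-{i}" for i in range(82)]

# Before: childCreations.WithLabelValues(project, kind) -> one series per pair.
series_before = Counter()
for project, kind in product(projects, kinds):
    series_before[(project, kind)] += 1  # simulated Inc()

# After: childCreations.WithLabelValues(kind) -> one series per kind.
series_after = Counter()
for project, kind in product(projects, kinds):
    series_after[(kind,)] += 1  # same events land in far fewer series

print(len(series_before))  # 33948 series (414 x 82)
print(len(series_after))   # 82 series
# No increments are lost, they are just aggregated across projects:
assert sum(series_before.values()) == sum(series_after.values())
```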

Test plan

  • go build ./internal/apiserver/storage/project/ passes
  • Deploy to staging, verify metrics still emit with reduced labels
  • Confirm VictoriaMetrics series count drops after old series expire
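For the last verification step, a query along these lines can track the series count before and after rollout (a sketch; the `projectstorage_` prefix is from this PR, but the exact query depends on your VictoriaMetrics setup and retention, and old series only disappear once they expire):

```promql
# Active series across all projectstorage_* metrics
count({__name__=~"projectstorage_.+"})
```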

The project label on projectstorage_first_ready_seconds,
projectstorage_child_creations_total, and
projectstorage_reinitializing_errors_total creates a cardinality
explosion: 414 projects × 82 resource kinds × histogram buckets ×
9 pods yields ~6.1M series, consuming 28% of all VictoriaMetrics
storage in production.

Drop the project label from all three metrics. The distribution
of storage init latency by resource_group and resource_kind is
the useful signal; per-project granularity is not needed and
causes the cardinality problem.

Reduces total series from ~678K to ~1.6K per pod (~6.1M to ~14K
across 9 pods).

Ref: datum-cloud/infra#2113
@joggrbot
Contributor

joggrbot bot commented Apr 2, 2026

📝 Documentation Analysis

All docs are up to date! 🎉


✅ Latest commit analyzed: d54ebef | Powered by Joggr

@zachsmith1 zachsmith1 requested a review from scotwells April 2, 2026 04:55
@zachsmith1 zachsmith1 merged commit 8eee3af into main Apr 2, 2026
7 of 9 checks passed
@zachsmith1 zachsmith1 deleted the fix/reduce-projectstorage-metric-cardinality branch April 2, 2026 15:56
