feat(metrics): TECH-6381 add org_slug label to keeperhub_workflow_executions_total #1146

Open
chong-techops wants to merge 4 commits into staging from TECH-6381-executions-total-org-slug-label

chong-techops commented May 6, 2026

Summary

  • Adds org_slug label to keeperhub_workflow_executions_total so dashboards/alerts can scope success rate and traffic panels to managed clients (techops-services, ajna) the same way the errors gauge already supports.
  • Replaces the two separate DB queries (status totals + per-org error breakdown) with a single combined query that groups by status AND organization.slug. Per-status totals and errorByOrgSlug are derived from the combined result, keeping the existing errors gauge unchanged.
  • Anonymous workflows continue to bucket under org_slug="_anonymous" so per-org sums match the unfiltered totals.
  • Updates METRICS_REFERENCE.md with the corrected aggregation pattern, sum(max by (..., org_slug) (...)). The previously documented max by (status) would silently underreport totals now that org_slug partitions the series.

Why

TechOps wants the workflow Success Rate panel split into User vs System (managed orgs only), so user-workflow failures don't drag the system rate down. The errors gauge already had org_slug; the executions gauge didn't, which made a system-only success rate impossible to compute.

Deploy ordering

This PR should ship before the corresponding techops_infrastructure PR so the dashboard panels render against the new label as soon as they reach prod.

Replace the two queries (status counts + per-org error breakdown) with
a single combined query that groups by status AND organization.slug.
Adds executionsByStatusAndOrgSlug to WorkflowStats so the prometheus
collector can label keeperhub_workflow_executions_total by org_slug,
following the same convention as the errors gauge (anonymous workflows
bucket under '_anonymous' so per-org sums match the global totals).

totalSuccess/totalError/etc and errorByOrgSlug are now derived from the
combined query, keeping the existing errors gauge wiring unchanged.
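
The derivation described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the row shape `StatusOrgRow`, the helper `deriveStats`, the `totalsByStatus` map, and the status strings "success" and "error" are all assumptions made for the example. Only the '_anonymous' bucket and the names executionsByStatusAndOrgSlug and errorByOrgSlug come from the description.

```typescript
// Hypothetical row shape: one row per (status, organization slug) pair,
// as a combined GROUP BY query might return it.
// orgSlug is null when the workflow has no organization.
type StatusOrgRow = { status: string; orgSlug: string | null; count: number };
type LabeledCount = { status: string; orgSlug: string; count: number };

const ANONYMOUS_SLUG = "_anonymous";

function deriveStats(rows: StatusOrgRow[]) {
  const executionsByStatusAndOrgSlug: LabeledCount[] = [];
  const totalsByStatus: Record<string, number> = {};
  const errorByOrgSlug: Record<string, number> = {};

  for (const row of rows) {
    // Bucket org-less workflows so per-org sums still match the global totals.
    const orgSlug = row.orgSlug === null ? ANONYMOUS_SLUG : row.orgSlug;
    executionsByStatusAndOrgSlug.push({ status: row.status, orgSlug, count: row.count });

    // Per-status totals are the sum over all orgs for that status.
    totalsByStatus[row.status] = (totalsByStatus[row.status] || 0) + row.count;

    // The per-org error breakdown is just the "error" slice of the same rows.
    // "error" is an assumed status value.
    if (row.status === "error") {
      errorByOrgSlug[orgSlug] = (errorByOrgSlug[orgSlug] || 0) + row.count;
    }
  }
  return { executionsByStatusAndOrgSlug, totalsByStatus, errorByOrgSlug };
}

// Made-up sample rows for illustration only.
const derived = deriveStats([
  { status: "success", orgSlug: "techops-services", count: 40 },
  { status: "error", orgSlug: "techops-services", count: 2 },
  { status: "success", orgSlug: "ajna", count: 10 },
  { status: "success", orgSlug: null, count: 5 },
  { status: "error", orgSlug: null, count: 3 },
]);

console.log(derived.totalsByStatus); // { success: 55, error: 5 }
console.log(derived.errorByOrgSlug); // { 'techops-services': 2, _anonymous: 3 }
```

Because every row lands in exactly one (status, org_slug) bucket, summing the labeled counts for a status reproduces the unfiltered total for that status.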

Add org_slug to keeperhub_workflow_executions_total so dashboards and
alerts can scope the success rate to managed clients (the errors gauge
already had this label; the executions gauge didn't, which made it
impossible to compute a system-only success rate).

Reset before populating so series for orgs that drop to zero in a given
status clear out instead of going stale -- same pattern as the errors
gauge.
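
The reset-before-populate pattern can be illustrated with a small stand-in gauge. This is a sketch under stated assumptions: `LabeledGauge` and `refreshGauge` are hypothetical names written for this example so that it runs without dependencies, and the actual collector code is not shown in this PR description. prom-client's `Gauge` offers `reset()` and `set(labels, value)` methods that play the same roles.

```typescript
type Labels = { status: string; org_slug: string };

// Hypothetical in-memory gauge, standing in for a real metrics client.
class LabeledGauge {
  private series: Record<string, number> = {};

  private static key(labels: Labels): string {
    return JSON.stringify([labels.status, labels.org_slug]);
  }
  set(labels: Labels, value: number): void {
    this.series[LabeledGauge.key(labels)] = value;
  }
  get(labels: Labels): number | undefined {
    return this.series[LabeledGauge.key(labels)];
  }
  reset(): void {
    this.series = {};
  }
  seriesCount(): number {
    return Object.keys(this.series).length;
  }
}

// Clear all series first, then write only what the latest query returned.
function refreshGauge(gauge: LabeledGauge, rows: { labels: Labels; value: number }[]): void {
  gauge.reset();
  for (const row of rows) {
    gauge.set(row.labels, row.value);
  }
}

const executionsGauge = new LabeledGauge();

// First collection: ajna has 3 executions in the "error" status.
refreshGauge(executionsGauge, [
  { labels: { status: "success", org_slug: "ajna" }, value: 10 },
  { labels: { status: "error", org_slug: "ajna" }, value: 3 },
]);

// Second collection: the query returns no "error" row for ajna any more.
refreshGauge(executionsGauge, [
  { labels: { status: "success", org_slug: "ajna" }, value: 12 },
]);

// The old error series is gone rather than stuck at 3.
console.log(executionsGauge.get({ status: "error", org_slug: "ajna" })); // undefined
console.log(executionsGauge.seriesCount()); // 1
```

Without the `reset()` call, the second refresh would leave the error series for ajna exposed at its last value of 3.
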
…rrors

Document the convention so dashboard authors know they can scope these
gauges by managed-client org_slug, and that '_anonymous' is reserved for
personal workflows.
…eled gauges

The existing 'use max by (label)' guidance was correct when status was
the only label and pods were the only repetition source. With org_slug
now a real partition dimension on workflow_executions_total and
workflow_execution_errors_total, max by (status) returns the busiest
single org instead of the total -- a silent regression for any panel
using that pattern.

Document the corrected pattern (sum-of-max) so dashboard authors get
the total across orgs while still deduping pods, and update the
delta() examples to match.
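
To make the difference concrete, the sketch below reproduces the two aggregations over made-up samples in plain TypeScript. It is not a PromQL evaluator: the `Sample` type, the helper names, the pod names, and the counts are assumptions for illustration. It assumes two pods that each export the same database-derived values, which is why a `max` is needed to dedupe them.

```typescript
type Sample = { status: string; org_slug: string; pod: string; value: number };

// Generic "max by (key)" over a list of samples.
function maxPerKey(samples: Sample[], keyOf: (s: Sample) => string): Record<string, number> {
  const out: Record<string, number> = {};
  for (const s of samples) {
    const k = keyOf(s);
    out[k] = k in out ? Math.max(out[k], s.value) : s.value;
  }
  return out;
}

// Old pattern: max by (status). Collapses pods AND orgs into one maximum.
function maxByStatus(samples: Sample[]): Record<string, number> {
  return maxPerKey(samples, (s) => s.status);
}

// Corrected pattern: max by (status, org_slug) dedupes pods,
// then the per-org values are summed per status.
function sumOfMaxByStatus(samples: Sample[]): Record<string, number> {
  const perOrg = maxPerKey(samples, (s) => JSON.stringify([s.status, s.org_slug]));
  const out: Record<string, number> = {};
  for (const key of Object.keys(perOrg)) {
    const status: string = JSON.parse(key)[0];
    out[status] = (out[status] || 0) + perOrg[key];
  }
  return out;
}

// Made-up data: three orgs, each reported identically by two pods.
const perOrgSuccess: Record<string, number> = {
  "techops-services": 40,
  ajna: 10,
  _anonymous: 5,
};
const samples: Sample[] = [];
for (const pod of ["pod-a", "pod-b"]) {
  for (const org_slug of Object.keys(perOrgSuccess)) {
    samples.push({ status: "success", org_slug, pod, value: perOrgSuccess[org_slug] });
  }
}

console.log(maxByStatus(samples).success);      // 40 (busiest single org)
console.log(sumOfMaxByStatus(samples).success); // 55 (total across orgs, pods deduped)
```

A plain sum over all series would give 110 here, double-counting the two pods, which is why the inner `max` stays in the corrected pattern.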

chong-techops force-pushed the TECH-6381-executions-total-org-slug-label branch from 3acb76a to fd045a7 on May 6, 2026 06:58
chong-techops requested a review from a team on May 6, 2026 07:00