feat(metrics): TECH-6381 add org_slug label to keeperhub_workflow_executions_total #1146
Open
chong-techops wants to merge 4 commits into staging
Replace the two queries (status counts + per-org error breakdown) with a single combined query that groups by status AND organization.slug. Adds executionsByStatusAndOrgSlug to WorkflowStats so the Prometheus collector can label keeperhub_workflow_executions_total by org_slug, following the same convention as the errors gauge (anonymous workflows bucket under '_anonymous' so per-org sums match the global totals). totalSuccess, totalError, etc. and errorByOrgSlug are now derived from the combined query, keeping the existing errors gauge wiring unchanged.
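As a rough illustration of the derivation described in this commit, the sketch below folds the rows of one combined query into the nested per-status, per-org map and then derives the per-status totals and the error breakdown from it. The row shape, the `deriveStats` helper, and the `"error"` status string are assumptions made for the example; only `executionsByStatusAndOrgSlug`, `errorByOrgSlug`, and the `_anonymous` bucket come from the commit message.

```typescript
// Hypothetical shape of one row from the combined GROUP BY (status, organization.slug) query.
type StatusOrgRow = { status: string; orgSlug: string | null; count: number };
type CountsByOrg = Record<string, number>;

// Bucket name the commit message reserves for workflows without an organization.
const ANONYMOUS_ORG_SLUG = "_anonymous";

interface DerivedStats {
  executionsByStatusAndOrgSlug: Record<string, CountsByOrg>;
  totalsByStatus: Record<string, number>;
  errorByOrgSlug: CountsByOrg;
}

function deriveStats(rows: StatusOrgRow[]): DerivedStats {
  const executionsByStatusAndOrgSlug: Record<string, CountsByOrg> = {};
  const totalsByStatus: Record<string, number> = {};

  for (const row of rows) {
    // Rows with no organization go to the reserved bucket so per-org sums equal the totals.
    const slug = row.orgSlug ?? ANONYMOUS_ORG_SLUG;
    const perOrg: CountsByOrg = executionsByStatusAndOrgSlug[row.status] ?? {};
    perOrg[slug] = (perOrg[slug] ?? 0) + row.count;
    executionsByStatusAndOrgSlug[row.status] = perOrg;
    totalsByStatus[row.status] = (totalsByStatus[row.status] ?? 0) + row.count;
  }

  // The per-org error breakdown is just the "error" slice of the combined result
  // (the status string "error" is an assumption).
  const errorByOrgSlug: CountsByOrg = { ...(executionsByStatusAndOrgSlug["error"] ?? {}) };
  return { executionsByStatusAndOrgSlug, totalsByStatus, errorByOrgSlug };
}
```

Deriving everything from one result set means the totals and the per-org breakdown cannot drift apart the way two separately executed queries could.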
Add org_slug to keeperhub_workflow_executions_total so dashboards and alerts can scope the success rate to managed clients (the errors gauge already had this label; the executions gauge didn't, which made it impossible to compute a system-only success rate). Reset before populating so series for orgs that drop to zero in a given status clear out instead of going stale -- same pattern as the errors gauge.
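The reset-before-populate step in this commit can be sketched as follows. `InMemoryGauge` is a hypothetical stand-in written for this example, not the collector's real gauge class, and `populateExecutionsGauge` is likewise an assumed name; the point is only that clearing first keeps a series from lingering after its count drops out of the query result.

```typescript
type ExecutionLabels = { status: string; org_slug: string };

// Hypothetical in-memory gauge used only to illustrate the pattern.
class InMemoryGauge {
  private series: Record<string, number> = {};

  reset(): void {
    this.series = {};
  }

  set(labels: ExecutionLabels, value: number): void {
    this.series[InMemoryGauge.key(labels)] = value;
  }

  get(labels: ExecutionLabels): number | undefined {
    return this.series[InMemoryGauge.key(labels)];
  }

  private static key(labels: ExecutionLabels): string {
    return JSON.stringify([labels.status, labels.org_slug]);
  }
}

function populateExecutionsGauge(
  gauge: InMemoryGauge,
  counts: Record<string, Record<string, number>>,
): void {
  // Clear every existing series first; anything absent from `counts` disappears
  // instead of keeping its previous value.
  gauge.reset();
  for (const status of Object.keys(counts)) {
    const perOrg = counts[status] ?? {};
    for (const orgSlug of Object.keys(perOrg)) {
      gauge.set({ status, org_slug: orgSlug }, perOrg[orgSlug] ?? 0);
    }
  }
}
```

Without the `reset()` call, an org whose error count fell to zero would keep exporting its last non-zero value, because the grouped query returns no row for it.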
…rrors Document the convention so dashboard authors know they can scope these gauges by managed-client org_slug, and that '_anonymous' is reserved for personal workflows.
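To make the scoping convention concrete, here is a small sketch of how a success rate could be computed for a subset of orgs once both statuses carry org_slug. The managed slugs are the two named in the PR summary; the `successRate` helper, the `"success"` and `"error"` status strings, and the choice of success / (success + error) as the rate are assumptions for this example, not the dashboard's actual definition.

```typescript
type CountsByStatusAndOrg = Record<string, Record<string, number>>;

// Managed-client slugs named in the PR summary.
const MANAGED_ORG_SLUGS = ["techops-services", "ajna"];

function isManagedOrg(orgSlug: string): boolean {
  return MANAGED_ORG_SLUGS.indexOf(orgSlug) !== -1;
}

// Sum one status across the orgs accepted by `include`.
function sumForStatus(
  counts: CountsByStatusAndOrg,
  status: string,
  include: (orgSlug: string) => boolean,
): number {
  const perOrg = counts[status] ?? {};
  let total = 0;
  for (const orgSlug of Object.keys(perOrg)) {
    if (include(orgSlug)) total += perOrg[orgSlug] ?? 0;
  }
  return total;
}

// Assumed definition: success / (success + error); null when there is no traffic.
function successRate(
  counts: CountsByStatusAndOrg,
  include: (orgSlug: string) => boolean,
): number | null {
  const success = sumForStatus(counts, "success", include);
  const error = sumForStatus(counts, "error", include);
  const total = success + error;
  return total === 0 ? null : success / total;
}
```

Calling `successRate(counts, isManagedOrg)` gives a system-only rate, while `successRate(counts, (slug) => slug === "_anonymous")` isolates personal workflows.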
…eled gauges The existing 'use max by (label)' guidance was correct when status was the only label and pods were the only repetition source. With org_slug now a real partition dimension on workflow_executions_total and workflow_execution_errors_total, max by (status) returns the busiest single org instead of the total -- a silent regression for any panel using that pattern. Document the corrected pattern (sum-of-max) so dashboard authors get the total across orgs while still deduping pods, and update the delta() examples to match.
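The regression described here is easy to reproduce outside Prometheus. The sketch below models scraped samples as plain objects, with each pod reporting the same value for a given (status, org_slug) pair, and compares the old max-by-status aggregation with the sum-of-max pattern. The `Sample` shape and both function names are assumptions made for illustration.

```typescript
type Sample = { pod: string; status: string; org_slug: string; value: number };

// Old pattern: max by (status). Correct only while status is the sole partition label.
function maxByStatus(samples: Sample[]): Record<string, number> {
  const out: Record<string, number> = {};
  for (const s of samples) {
    const prev = out[s.status];
    out[s.status] = prev === undefined ? s.value : Math.max(prev, s.value);
  }
  return out;
}

// Corrected pattern: max by (status, org_slug) to dedupe pods, then sum across orgs.
function sumOfMaxByStatus(samples: Sample[]): Record<string, number> {
  const deduped: Record<string, { status: string; value: number }> = {};
  for (const s of samples) {
    const key = JSON.stringify([s.status, s.org_slug]);
    const prev = deduped[key];
    if (prev === undefined || s.value > prev.value) {
      deduped[key] = { status: s.status, value: s.value };
    }
  }
  const totals: Record<string, number> = {};
  for (const key of Object.keys(deduped)) {
    const entry = deduped[key];
    if (entry === undefined) continue;
    totals[entry.status] = (totals[entry.status] ?? 0) + entry.value;
  }
  return totals;
}
```

With two pods each reporting 10 successes for one org and 4 for another, `maxByStatus` yields 10 (the busiest org) while `sumOfMaxByStatus` yields 14 (the real total), which is the silent underreporting the commit message warns about.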
3acb76a to fd045a7
Summary
- Add the org_slug label to keeperhub_workflow_executions_total so dashboards and alerts can scope success-rate and traffic panels to managed clients (techops-services, ajna), the same way the errors gauge already supports.
- Replace the two queries with a single combined query grouped by status AND organization.slug. Per-status totals and errorByOrgSlug are derived from the combined result, keeping the existing errors gauge unchanged.
- Bucket anonymous workflows under org_slug="_anonymous" so per-org sums match the unfiltered totals.
- Update METRICS_REFERENCE.md with the corrected aggregation pattern (sum(max by (..., org_slug) (...))); the previously documented max by (status) would silently underreport totals once the partition label exists.

Why
TechOps wants the workflow Success Rate panel split into User vs System (managed orgs only), so user-workflow failures don't drag the system rate down. The errors gauge already had org_slug; the executions gauge didn't, which made a system-only success rate impossible to compute.

Deploy ordering
This PR should ship before the corresponding techops_infrastructure PR so the dashboard panels render against the new label as soon as they reach prod.