Add per-rule duration histogram for failure classification #4884
dejanzele wants to merge 1 commit into armadaproject:master
Conversation
Greptile Summary

Confidence Score: 5/5. Safe to merge; only a minor test-quality P2 finding with no production impact. All findings are P2 (style). The production code path is correct, the empty-category guard is present, and the metric naming and buckets are appropriate.

Important Files Changed: `internal/executor/metrics/metrics_test.go`: the "empty category is a no-op" test case inadvertently registers an empty-category label in the global registry.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller
    participant Classifier
    participant ruleMatches
    participant metrics
    Caller->>Classifier: Classify(pod)
    loop for each category / rule
        Classifier->>Classifier: start = time.Now()
        Classifier->>ruleMatches: ruleMatches(r, containers, podReason)
        ruleMatches-->>Classifier: matched bool
        Classifier->>metrics: RecordRuleEvaluationDuration(cat, subcategory, elapsed)
        metrics->>metrics: guard: category == empty, return
        metrics->>metrics: histogram.Observe(elapsed.Seconds())
        alt matched
            Classifier-->>Caller: ClassifyResult{Category, Subcategory}
        end
    end
    Classifier-->>Caller: ClassifyResult{defaultCategory, defaultSubcategory}
```
Reviews (6). Last reviewed commit: "Add per-rule duration histogram for fail..."
Force-pushed from 6e3ad8f to 2910508.
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Force-pushed from 2910508 to 0f275bb.
What type of PR is this?
/kind feature
What this PR does / why we need it
Adds a per-rule duration histogram to the executor's failure classifier so operators have visibility into how long each classification rule takes to evaluate.
The classifier runs on the executor's hot path: every pod failure is run through every rule until one matches (or the default category is returned). Rules can include regex evaluation against the termination message. Without timing data, operators have no way to see when a rule has become unexpectedly slow as their config grows or when a single regex is dominating per-classification cost.
The metric records every rule evaluation regardless of match outcome, so a slow non-matching rule (one that adds latency to every classification but never triggers a category change) is still attributable to its (category, subcategory) labels.

Suggested alerting
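One possible alert shape, assuming a hypothetical metric name of `armada_executor_failure_classifier_rule_duration_seconds` (substitute whatever name the executor actually exports), firing when any rule's p99 evaluation latency stays above 100ms:

```yaml
# Hypothetical Prometheus alerting rule; metric name and threshold are
# illustrative and should be adjusted to the deployed metric.
groups:
  - name: failure-classifier
    rules:
      - alert: SlowFailureClassificationRule
        expr: |
          histogram_quantile(0.99,
            sum by (category, subcategory, le) (
              rate(armada_executor_failure_classifier_rule_duration_seconds_bucket[5m])
            )
          ) > 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Rule {{ $labels.category }}/{{ $labels.subcategory }} p99 evaluation latency above 100ms"
```

Because the histogram is labeled per (category, subcategory), the alert pinpoints which rule regressed rather than only flagging aggregate classification latency.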