Add per-rule duration histogram for failure classification by dejanzele · Pull Request #4884 · armadaproject/armada

dejanzele · 2026-04-28T13:07:38Z

What type of PR is this?

/kind feature

What this PR does / why we need it

Adds a per-rule duration histogram to the executor's failure classifier so operators have visibility into how long each classification rule takes to evaluate.

The classifier runs on the executor's hot path: every pod failure is run through every rule until one matches (or the default category is returned). Rules can include regex evaluation against the termination message. Without timing data, operators have no way to see when a rule has become unexpectedly slow as their config grows or when a single regex is dominating per-classification cost.

The metric records every rule evaluation regardless of match outcome, so a slow non-matching rule (one that adds latency to every classification but never triggers a category change) is still attributable to its (category, subcategory).

armada_executor_job_failure_rule_evaluation_duration_seconds{failure_category, failure_subcategory}

Suggested alerting

# Any rule with p99 > 50ms is a candidate for review on RE2 + bounded input
histogram_quantile(0.99,
  sum(rate(armada_executor_job_failure_rule_evaluation_duration_seconds_bucket[5m]))
    by (le, failure_category, failure_subcategory)
) > 0.05

armada_executor_job_failure_rule_evaluation_duration_seconds_bucket{failure_category="stress_test",failure_subcategory="stupid_regex",le="0.01"} 0
armada_executor_job_failure_rule_evaluation_duration_seconds_bucket{failure_category="stress_test",failure_subcategory="stupid_regex",le="0.05"} 1
armada_executor_job_failure_rule_evaluation_duration_seconds_bucket{failure_category="stress_test",failure_subcategory="stupid_regex",le="0.1"}  1
armada_executor_job_failure_rule_evaluation_duration_seconds_bucket{failure_category="stress_test",failure_subcategory="stupid_regex",le="+Inf"} 1
armada_executor_job_failure_rule_evaluation_duration_seconds_sum{failure_category="stress_test",failure_subcategory="stupid_regex"} 0.015965125
armada_executor_job_failure_rule_evaluation_duration_seconds_count{failure_category="stress_test",failure_subcategory="stupid_regex"} 1

dejanzele · 2026-04-28T13:10:02Z

@greptileai

greptile-apps · 2026-04-28T13:11:40Z

Greptile Summary

This PR adds a HistogramVec metric (armada_executor_job_failure_rule_evaluation_duration_seconds) to surface per-rule evaluation latency in the executor's failure classifier, enabling operators to identify slow regex or condition rules on the hot path. The implementation is clean: the empty-category guard is consistent with RecordJobFailure, bucket boundaries cover the expected µs–ms range, and both the unit and integration tests use a before/after delta strategy to isolate observations.

Confidence Score: 5/5

Safe to merge; only a minor test-quality P2 finding with no production impact.

All findings are P2 (style). The production code path is correct, the empty-category guard is present, and the metric naming and buckets are appropriate.

internal/executor/metrics/metrics_test.go — the "empty category is a no-op" test case inadvertently registers an empty-category label in the global registry.

Important Files Changed

Filename	Overview
internal/executor/metrics/metrics.go	Adds `jobFailureRuleEvaluationDurationSeconds` HistogramVec and `RecordRuleEvaluationDuration`; includes empty-category guard consistent with `RecordJobFailure`; bucket range (10µs–250ms) is appropriate for the hot-path use case.
internal/executor/categorizer/classifier.go	Adds per-rule timing around `ruleMatches` using `time.Now()`/`time.Since`; delegates to `metrics.RecordRuleEvaluationDuration`. Logic and placement are correct.
internal/executor/categorizer/classifier_test.go	New integration test verifies pre-match, matched, and post-match rules are/aren't recorded correctly; uses delta approach against the global registry which is safe given no `t.Parallel()` usage and unique label values.
internal/executor/metrics/metrics_test.go	Tests for `RecordRuleEvaluationDuration`; the "empty category is a no-op" case creates the guarded empty-category label in the registry as a side effect of `WithLabelValues`, slightly defeating the stated invariant.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant Classifier
    participant ruleMatches
    participant metrics

    Caller->>Classifier: Classify(pod)
    loop for each category / rule
        Classifier->>Classifier: start = time.Now()
        Classifier->>ruleMatches: ruleMatches(r, containers, podReason)
        ruleMatches-->>Classifier: matched bool
        Classifier->>metrics: RecordRuleEvaluationDuration(cat, subcategory, elapsed)
        metrics->>metrics: guard: category == empty → return
        metrics->>metrics: histogram.Observe(elapsed.Seconds())
        alt matched
            Classifier-->>Caller: ClassifyResult{Category, Subcategory}
        end
    end
    Classifier-->>Caller: ClassifyResult{defaultCategory, defaultSubcategory}

_{Reviews (6): Last reviewed commit: "Add per-rule duration histogram for fail..." | Re-trigger Greptile}

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>

dejanzele marked this pull request as ready for review April 28, 2026 13:08

greptile-apps Bot reviewed Apr 28, 2026

View reviewed changes

Comment thread internal/executor/metrics/metrics.go Outdated

dejanzele force-pushed the classification-rule-eval-histogram branch 5 times, most recently from 6e3ad8f to 2910508 Compare April 28, 2026 14:54

Add per-rule duration histogram for failure classification

0f275bb

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>

dejanzele force-pushed the classification-rule-eval-histogram branch from 2910508 to 0f275bb Compare April 29, 2026 13:46

mauriceyap approved these changes Apr 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add per-rule duration histogram for failure classification#4884

Add per-rule duration histogram for failure classification#4884
dejanzele wants to merge 1 commit intoarmadaproject:masterfrom
dejanzele:classification-rule-eval-histogram

dejanzele commented Apr 28, 2026 •

edited

Loading

Uh oh!

dejanzele commented Apr 28, 2026

Uh oh!

greptile-apps Bot commented Apr 28, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dejanzele commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it

Suggested alerting

Uh oh!

dejanzele commented Apr 28, 2026

Uh oh!

greptile-apps Bot commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dejanzele commented Apr 28, 2026 •

edited

Loading

greptile-apps Bot commented Apr 28, 2026 •

edited

Loading