Skip to content

Add per-rule duration histogram for failure classification#4884

Open
dejanzele wants to merge 1 commit intoarmadaproject:masterfrom
dejanzele:classification-rule-eval-histogram
Open

Add per-rule duration histogram for failure classification#4884
dejanzele wants to merge 1 commit intoarmadaproject:masterfrom
dejanzele:classification-rule-eval-histogram

Conversation

@dejanzele
Copy link
Copy Markdown
Member

@dejanzele dejanzele commented Apr 28, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it

Adds a per-rule duration histogram to the executor's failure classifier so operators have visibility into how long each classification rule takes to evaluate.

The classifier runs on the executor's hot path: every pod failure is run through every rule until one matches (or the default category is returned). Rules can include regex evaluation against the termination message. Without timing data, operators have no way to see when a rule has become unexpectedly slow as their config grows or when a single regex is dominating per-classification cost.

The metric records every rule evaluation regardless of match outcome, so a slow non-matching rule (one that adds latency to every classification but never triggers a category change) is still attributable to its (category, subcategory).

armada_executor_job_failure_rule_evaluation_duration_seconds{failure_category, failure_subcategory}
Suggested alerting
# Any rule with p99 > 50ms is a candidate for review on RE2 + bounded input
histogram_quantile(0.99,
  sum(rate(armada_executor_job_failure_rule_evaluation_duration_seconds_bucket[5m]))
    by (le, failure_category, failure_subcategory)
) > 0.05
armada_executor_job_failure_rule_evaluation_duration_seconds_bucket{failure_category="stress_test",failure_subcategory="stupid_regex",le="0.01"} 0
armada_executor_job_failure_rule_evaluation_duration_seconds_bucket{failure_category="stress_test",failure_subcategory="stupid_regex",le="0.05"} 1
armada_executor_job_failure_rule_evaluation_duration_seconds_bucket{failure_category="stress_test",failure_subcategory="stupid_regex",le="0.1"}  1
armada_executor_job_failure_rule_evaluation_duration_seconds_bucket{failure_category="stress_test",failure_subcategory="stupid_regex",le="+Inf"} 1
armada_executor_job_failure_rule_evaluation_duration_seconds_sum{failure_category="stress_test",failure_subcategory="stupid_regex"} 0.015965125
armada_executor_job_failure_rule_evaluation_duration_seconds_count{failure_category="stress_test",failure_subcategory="stupid_regex"} 1

@dejanzele dejanzele marked this pull request as ready for review April 28, 2026 13:08
@dejanzele
Copy link
Copy Markdown
Member Author

@greptileai

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 28, 2026

Greptile Summary

This PR adds a HistogramVec metric (armada_executor_job_failure_rule_evaluation_duration_seconds) to surface per-rule evaluation latency in the executor's failure classifier, enabling operators to identify slow regex or condition rules on the hot path. The implementation is clean: the empty-category guard is consistent with RecordJobFailure, bucket boundaries cover the expected µs–ms range, and both the unit and integration tests use a before/after delta strategy to isolate observations.

Confidence Score: 5/5

Safe to merge; only a minor test-quality P2 finding with no production impact.

All findings are P2 (style). The production code path is correct, the empty-category guard is present, and the metric naming and buckets are appropriate.

internal/executor/metrics/metrics_test.go — the "empty category is a no-op" test case inadvertently registers an empty-category label in the global registry.

Important Files Changed

Filename Overview
internal/executor/metrics/metrics.go Adds jobFailureRuleEvaluationDurationSeconds HistogramVec and RecordRuleEvaluationDuration; includes empty-category guard consistent with RecordJobFailure; bucket range (10µs–250ms) is appropriate for the hot-path use case.
internal/executor/categorizer/classifier.go Adds per-rule timing around ruleMatches using time.Now()/time.Since; delegates to metrics.RecordRuleEvaluationDuration. Logic and placement are correct.
internal/executor/categorizer/classifier_test.go New integration test verifies pre-match, matched, and post-match rules are/aren't recorded correctly; uses delta approach against the global registry which is safe given no t.Parallel() usage and unique label values.
internal/executor/metrics/metrics_test.go Tests for RecordRuleEvaluationDuration; the "empty category is a no-op" case creates the guarded empty-category label in the registry as a side effect of WithLabelValues, slightly defeating the stated invariant.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant Classifier
    participant ruleMatches
    participant metrics

    Caller->>Classifier: Classify(pod)
    loop for each category / rule
        Classifier->>Classifier: start = time.Now()
        Classifier->>ruleMatches: ruleMatches(r, containers, podReason)
        ruleMatches-->>Classifier: matched bool
        Classifier->>metrics: RecordRuleEvaluationDuration(cat, subcategory, elapsed)
        metrics->>metrics: guard: category == empty → return
        metrics->>metrics: histogram.Observe(elapsed.Seconds())
        alt matched
            Classifier-->>Caller: ClassifyResult{Category, Subcategory}
        end
    end
    Classifier-->>Caller: ClassifyResult{defaultCategory, defaultSubcategory}
Loading

Reviews (6): Last reviewed commit: "Add per-rule duration histogram for fail..." | Re-trigger Greptile

Comment thread internal/executor/metrics/metrics.go Outdated
@dejanzele dejanzele force-pushed the classification-rule-eval-histogram branch 5 times, most recently from 6e3ad8f to 2910508 Compare April 28, 2026 14:54
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the classification-rule-eval-histogram branch from 2910508 to 0f275bb Compare April 29, 2026 13:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants