Skip to content

Refactor executor classifier to first-match-wins with subcategory#4859

Closed
dejanzele wants to merge 1 commit intoarmadaproject:masterfrom
dejanzele:slim-classifier-refactor
Closed

Refactor executor classifier to first-match-wins with subcategory#4859
dejanzele wants to merge 1 commit intoarmadaproject:masterfrom
dejanzele:slim-classifier-refactor

Conversation

@dejanzele
Copy link
Copy Markdown
Member

What type of PR is this?

Refactor (1 of 2 splitting #4843 into smaller reviews)

What this PR does / why we need it

Rewrites the executor error classifier from returning all matching categories to a single (category, subcategory) pair with first-match-wins semantics. Gated behind a new executor flag (off by default) so existing deployments are untouched until they opt in.

Wire format is preserved. ExtractFailureInfo now takes ClassifyResult and packs non-empty category + subcategory into the existing FailureInfo.Categories list. No proto change, no migration in this PR. A follow-up PR will flatten FailureInfo into scalar Error.failure_category / failure_subcategory fields on the event proto and update the Lookout ingester.

Split from #4843 at the request of reviewers who asked for smaller PRs. This PR contains only the classifier logic change; the wire-format swap ships separately.

Classifier changes

  • NewClassifier now takes ErrorCategoriesConfig (which adds an optional defaultCategory) instead of []CategoryConfig
  • Classify returns ClassifyResult{Category, Subcategory} instead of []string
  • First-match-wins: rules within a category are evaluated in config order and the first matching rule (carrying its optional subcategory) wins
  • When no rule matches, the configured defaultCategory (or the built-in uncategorized) is returned
  • Nil-receiver / nil-pod safety preserved

Feature flag

enableJobErrorCategorization (default false) in executor ApplicationConfiguration. When off, the classifier is never constructed; Classify on a nil receiver returns an empty ClassifyResult and ExtractFailureInfo writes an empty Categories list - identical behavior to the pre-PR executor.

Special notes for your reviewer

  • Focus review on internal/executor/categorizer/* (classifier semantics) and internal/executor/util/pod_status.go (the ExtractFailureInfo adapter that preserves the FailureInfo.Categories wire shape)
  • Callers in internal/executor/reporter/event.go and internal/executor/service/pod_issue_handler.go type-check unchanged because classifier.Classify() and ExtractFailureInfo() types shifted in lockstep
  • No proto, migration, or ingester changes in this PR - they live in the follow-up

Part of

Split from #4843. Follow-up PR will flatten the wire format and update the Lookout ingester.

…ind feature flag

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 22, 2026

Greptile Summary

This PR refactors the executor's error classifier from returning all matching categories ([]string) to a single first-match-wins ClassifyResult{Category, Subcategory} pair, gated behind a new enableJobErrorCategorization flag (default false). The wire format is preserved via an adapter in ExtractFailureInfo that packs the two fields into the existing FailureInfo.Categories list, so no proto or ingester changes are needed here.

Confidence Score: 5/5

Safe to merge — the feature flag keeps existing deployments entirely unaffected and the classifier logic is well-tested.

All findings are P2 style/suggestion issues (a premature doc comment, a defensive test for an unreachable state, and a missing startup warning for empty categories config). No correctness, data-integrity, or security issues were found.

No files require special attention; the three P2 notes are optional improvements.

Important Files Changed

Filename Overview
internal/executor/categorizer/classifier.go Core semantic change: first-match-wins Classify returning ClassifyResult{Category, Subcategory}. Nil-receiver and nil-pod safety preserved; default category fallback is clean.
internal/executor/categorizer/classifier_test.go Tests fully updated for new semantics; new cases for first-match-wins, custom default category, and subcategory added. Coverage is thorough.
internal/executor/util/pod_status.go Adapter correctly packs ClassifyResult into FailureInfo.Categories as [category, subcategory], filtering empty strings to preserve prior no-category behavior.
internal/executor/util/pod_status_test.go Added test cases for the new packing behavior; one test case (subcategory without category) exercises a state the Classifier cannot actually produce.
internal/executor/application.go Feature flag guard added; classifier construction skipped when off. No warning logged when flag is enabled but categories list is empty.
internal/executor/categorizer/types.go New ErrorCategoriesConfig wrapper type and Subcategory field on CategoryRule are correct and well-documented.
internal/executor/configuration/types.go EnableJobErrorCategorization flag and updated ErrorCategories type are clean; YAML tags are consistent.
e2e/config/executor_config.yaml E2E config updated to use the new nested categories structure and enables the flag; rules are unchanged from before.
internal/executor/reporter/event_test.go Updated to use ErrorCategoriesConfig wrapper; no logic change, type-check only.
internal/executor/service/pod_issue_handler_test.go Updated to use ErrorCategoriesConfig wrapper; no logic change, type-check only.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Pod fails] --> B{EnableJobErrorCategorization?}
    B -->|false| C[classifier is nil / Classify returns empty ClassifyResult]
    B -->|true| D[NewClassifier with ErrorCategoriesConfig]
    D --> E[Classify pod]
    E --> F{Any rule matches?}
    F -->|yes - rule R in category C| G[ClassifyResult with Category and Subcategory]
    F -->|no| H[ClassifyResult with defaultCategory only]
    C --> I[ExtractFailureInfo - categories is nil]
    G --> J[ExtractFailureInfo - pack category and subcategory]
    H --> J
    I --> K[FailureInfo.Categories is nil - no-op for downstream]
    J --> L["FailureInfo.Categories = ['infrastructure', 'oom']"]
Loading

Comments Outside Diff (1)

  1. internal/executor/categorizer/doc.go, line 583-588 (link)

    P2 Doc comment describes follow-up wire format, not current behavior

    The updated package doc says "category and subcategory are set on the Error proto attached to events," but in this PR they're still packed into FailureInfo.Categories as [category, subcategory]. The Error proto scalar fields (failure_category/failure_subcategory) come in the follow-up PR. The doc should describe the current wire representation to avoid confusing readers of this PR.

Reviews (1): Last reviewed commit: "Refactor executor classifier to first-ma..." | Re-trigger Greptile

Comment on lines +173 to 180
"subcategory without category is still included": {
pod: customErrorPod,
result: categorizer.ClassifyResult{Subcategory: "orphan"},
expectedExitCode: 1,
expectedCategories: []string{"orphan"},
expectedTermMsg: "Custom error",
expectedContainerName: "custom-error",
},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Test exercises state the Classifier can never actually produce

ClassifyResult{Subcategory: "orphan", Category: ""} can't be returned by Classify: a non-nil classifier on a non-nil pod always sets Category (either a matched category name or defaultCategory). The test validates defensive handling in ExtractFailureInfo, but it's documenting a contract that has no production path today, and may mislead future authors into thinking an orphan-subcategory result is expected. If the intent is purely defensive, a comment explaining that this is a guard against direct-construction callers would help.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +206 to 212
var classifier *categorizer.Classifier
if config.Application.EnableJobErrorCategorization {
classifier, err = categorizer.NewClassifier(config.Application.ErrorCategories)
if err != nil {
ctx.Fatalf("Config error in error categories: %s", err)
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Empty errorCategories.categories with flag on silently labels every failure "uncategorized"

When enableJobErrorCategorization: true but categories is empty, NewClassifier succeeds and builds a classifier with no rules. Because Classify always returns the defaultCategory on no-match, every pod failure will produce FailureInfo.Categories = ["uncategorized"]. An operator who enables the flag while still authoring their category config gets noisy, meaningless data in Lookout. A startup warning log here (or a validation error when categories is empty while the flag is on) would prevent silent misconfiguration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant