Refactor executor classifier to first-match-wins with subcategory#4859
dejanzele wants to merge 1 commit into armadaproject:master
…ind feature flag Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Greptile Summary
This PR refactors the executor's error classifier from returning all matching categories (…
Confidence Score: 5/5
Safe to merge: the feature flag keeps existing deployments entirely unaffected and the classifier logic is well-tested.
All findings are P2 style/suggestion issues (a premature doc comment, a defensive test for an unreachable state, and a missing startup warning for empty categories config). No correctness, data-integrity, or security issues were found.
No files require special attention; the three P2 notes are optional improvements.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Pod fails] --> B{EnableJobErrorCategorization?}
B -->|false| C[classifier is nil / Classify returns empty ClassifyResult]
B -->|true| D[NewClassifier with ErrorCategoriesConfig]
D --> E[Classify pod]
E --> F{Any rule matches?}
F -->|yes - rule R in category C| G[ClassifyResult with Category and Subcategory]
F -->|no| H[ClassifyResult with defaultCategory only]
C --> I[ExtractFailureInfo - categories is nil]
G --> J[ExtractFailureInfo - pack category and subcategory]
H --> J
I --> K[FailureInfo.Categories is nil - no-op for downstream]
J --> L["FailureInfo.Categories = ['infrastructure', 'oom']"]
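The disabled branch of the flow above relies on Go allowing method calls on a nil pointer receiver. A minimal sketch of that gating follows. It assumes a simplified `Classifier` with no rules, a boolean in place of the pod argument, and a hypothetical `newClassifierIfEnabled` helper; none of these are the PR's actual signatures.

```go
package main

import "fmt"

// ClassifyResult mirrors the (category, subcategory) pair described in the PR.
type ClassifyResult struct {
	Category    string
	Subcategory string
}

// Classifier is a simplified stand-in; the real type also holds the rules.
type Classifier struct {
	defaultCategory string
}

// Classify is safe on a nil receiver: with the flag off, no classifier
// exists and the result is the empty ClassifyResult.
func (c *Classifier) Classify(podFailed bool) ClassifyResult {
	if c == nil || !podFailed {
		return ClassifyResult{}
	}
	// No rules in this sketch, so every failure falls through to the default.
	return ClassifyResult{Category: c.defaultCategory}
}

// newClassifierIfEnabled is a hypothetical helper standing in for the
// flag check around NewClassifier.
func newClassifierIfEnabled(enabled bool) *Classifier {
	if !enabled {
		return nil
	}
	return &Classifier{defaultCategory: "uncategorized"}
}

func main() {
	off := newClassifierIfEnabled(false)
	on := newClassifierIfEnabled(true)
	fmt.Printf("flag off: %+v\n", off.Classify(true))
	fmt.Printf("flag on:  %+v\n", on.Classify(true))
}
```

Keeping the nil check inside `Classify` means callers do not need their own flag check before classifying a pod.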
"subcategory without category is still included": {
	pod:                   customErrorPod,
	result:                categorizer.ClassifyResult{Subcategory: "orphan"},
	expectedExitCode:      1,
	expectedCategories:    []string{"orphan"},
	expectedTermMsg:       "Custom error",
	expectedContainerName: "custom-error",
},
Test exercises state the Classifier can never actually produce
ClassifyResult{Subcategory: "orphan", Category: ""} can't be returned by Classify: a non-nil classifier on a non-nil pod always sets Category (either a matched category name or defaultCategory). The test validates defensive handling in ExtractFailureInfo, but it's documenting a contract that has no production path today, and may mislead future authors into thinking an orphan-subcategory result is expected. If the intent is purely defensive, a comment explaining that this is a guard against direct-construction callers would help.
var classifier *categorizer.Classifier
if config.Application.EnableJobErrorCategorization {
	classifier, err = categorizer.NewClassifier(config.Application.ErrorCategories)
	if err != nil {
		ctx.Fatalf("Config error in error categories: %s", err)
	}
}
Empty errorCategories.categories with flag on silently labels every failure "uncategorized"
When enableJobErrorCategorization: true but categories is empty, NewClassifier succeeds and builds a classifier with no rules. Because Classify always returns the defaultCategory on no-match, every pod failure will produce FailureInfo.Categories = ["uncategorized"]. An operator who enables the flag while still authoring their category config gets noisy, meaningless data in Lookout. A startup warning log here (or a validation error when categories is empty while the flag is on) would prevent silent misconfiguration.
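One way the suggested guard could look, as a rough sketch rather than the PR's code: the struct fields and the `validateCategorization` helper below are assumptions made for illustration, and the real config types may be shaped differently.

```go
package main

import (
	"errors"
	"fmt"
)

// CategoryConfig and ErrorCategoriesConfig are simplified stand-ins;
// the field names are assumed, not taken from the PR.
type CategoryConfig struct {
	Name string
}

type ErrorCategoriesConfig struct {
	DefaultCategory string
	Categories      []CategoryConfig
}

var errNoCategories = errors.New("job error categorization is enabled but no categories are configured")

// validateCategorization is a hypothetical startup check. It rejects the
// combination of flag on and an empty category list, which would otherwise
// label every failure with the default category.
func validateCategorization(enabled bool, cfg ErrorCategoriesConfig) error {
	if enabled && len(cfg.Categories) == 0 {
		return errNoCategories
	}
	return nil
}

func main() {
	if err := validateCategorization(true, ErrorCategoriesConfig{}); err != nil {
		// A softer variant would log this as a warning instead of failing startup.
		fmt.Println("config problem:", err)
	}
}
```

Whether this should be a hard error or only a warning log is a judgment call; a warning keeps the flag usable while rules are still being authored.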
What type of PR is this?
Refactor (1 of 2 splitting #4843 into smaller reviews)
What this PR does / why we need it
Rewrites the executor error classifier from returning all matching categories to a single (category, subcategory) pair with first-match-wins semantics. Gated behind a new executor flag (off by default) so existing deployments are untouched until they opt in.
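To make first-match-wins concrete, here is a small sketch. It assumes rules are evaluated in config order and match on a substring of the termination message; the `rule` and `category` types and the matching condition are hypothetical simplifications, not the classifier's real rule set.

```go
package main

import (
	"fmt"
	"strings"
)

type ClassifyResult struct {
	Category    string
	Subcategory string
}

// rule is a hypothetical matcher: it fires when the termination
// message contains the given substring.
type rule struct {
	contains    string
	subcategory string
}

type category struct {
	name  string
	rules []rule
}

// classify walks categories and rules in order and returns on the first
// match. Later rules are never consulted, so config order decides ties.
func classify(categories []category, defaultCategory, terminationMessage string) ClassifyResult {
	for _, c := range categories {
		for _, r := range c.rules {
			if strings.Contains(terminationMessage, r.contains) {
				return ClassifyResult{Category: c.name, Subcategory: r.subcategory}
			}
		}
	}
	return ClassifyResult{Category: defaultCategory}
}

func main() {
	categories := []category{
		{name: "infrastructure", rules: []rule{{contains: "OOMKilled", subcategory: "oom"}}},
		{name: "user_error", rules: []rule{{contains: "exit code", subcategory: "nonzero_exit"}}},
	}
	// Both rules match this message; only the first one is reported.
	fmt.Printf("%+v\n", classify(categories, "uncategorized", "OOMKilled: exit code 137"))
	fmt.Printf("%+v\n", classify(categories, "uncategorized", "segmentation fault"))
}
```

Under the previous all-matches behavior the first message would have produced both categories; here the earlier entry in the config wins.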
Wire format is preserved.
ExtractFailureInfo now takes ClassifyResult and packs non-empty category + subcategory into the existing FailureInfo.Categories list. No proto change, no migration in this PR. A follow-up PR will flatten FailureInfo into scalar Error.failure_category / failure_subcategory fields on the event proto and update the Lookout ingester.
Split from #4843 at the request of reviewers who asked for smaller PRs. This PR contains only the classifier logic change; the wire-format swap ships separately.
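The packing of a ClassifyResult into the existing Categories list might look roughly like the sketch below. `packCategories` is a hypothetical helper name; the real logic lives inside ExtractFailureInfo and may differ in detail.

```go
package main

import "fmt"

type ClassifyResult struct {
	Category    string
	Subcategory string
}

// packCategories is a hypothetical helper: it appends the non-empty
// fields in category-then-subcategory order and returns nil when both
// are empty, so the existing list-shaped wire format is reused as-is.
func packCategories(r ClassifyResult) []string {
	var out []string
	if r.Category != "" {
		out = append(out, r.Category)
	}
	if r.Subcategory != "" {
		out = append(out, r.Subcategory)
	}
	return out
}

func main() {
	fmt.Println(packCategories(ClassifyResult{Category: "infrastructure", Subcategory: "oom"}))
	fmt.Println(packCategories(ClassifyResult{Category: "uncategorized"}))
	fmt.Println(packCategories(ClassifyResult{}) == nil)
}
```

Because empty fields are skipped independently, a result with only a subcategory would still be packed, which matches the defensive test case discussed in the review.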
Classifier changes
- NewClassifier now takes ErrorCategoriesConfig (which adds an optional defaultCategory) instead of []CategoryConfig
- Classify returns ClassifyResult{Category, Subcategory} instead of []string
- When no rule matches, defaultCategory (or the built-in uncategorized) is returned
Feature flag
enableJobErrorCategorization (default false) in executor ApplicationConfiguration. When off, the classifier is never constructed; Classify on a nil receiver returns an empty ClassifyResult and ExtractFailureInfo writes an empty Categories list, identical behavior to the pre-PR executor.
Special notes for your reviewer
- internal/executor/categorizer/* (classifier semantics) and internal/executor/util/pod_status.go (the ExtractFailureInfo adapter that preserves the FailureInfo.Categories wire shape)
- internal/executor/reporter/event.go and internal/executor/service/pod_issue_handler.go type-check unchanged because classifier.Classify() and ExtractFailureInfo() types shifted in lockstep
Part of
Split from #4843. Follow-up PR will flatten the wire format and update the Lookout ingester.