Add onPodError matcher to categorize pre-startup pod failures #4891

Open
dejanzele wants to merge 2 commits into armadaproject:master from dejanzele:categorizer-on-pod-error

Conversation

Member

@dejanzele dejanzele commented Apr 30, 2026

Summary

This PR extends the failure categorizer with two operator-facing additions:

  1. onPodError matcher - matches pod-level error text, covering pre-startup kubelet/runtime errors (image pull, missing volume, missing ConfigMap/Secret) and Armada-detected pod-level failures (stuck terminating, active deadline exceeded, externally deleted) that produce no useful container terminationMessage. Today these failures end up with empty failure_category / failure_subcategory in lookoutdb.
  2. hint field on rules - operator-supplied user-facing copy describing the failure mode. When set, it is appended to the failure message that lands in lookoutdb.job_run.error, so end users see actionable guidance alongside the raw runtime error.

The two commits are independently useful; together they let operators both classify and explain previously-opaque pod-level failures.

Approach

onPodError

  • New rule field on CategoryRule. Matches a regex against the issue's pod-level error message. ContainerName scoping is ignored (pod-level text has no container attribution).
  • onTerminationMessage is unchanged - still matches container Terminated.Message, still honors ContainerName. By design, its data source does not overlap with onPodError's.
  • Classify is split into ClassifyContainerError(pod) and ClassifyPodError(pod, podErrorMessage). The pod-error variant takes the message as an argument because kubelet rotates Waiting.Reason from ErrImagePull to ImagePullBackOff within seconds and replaces Waiting.Message with a generic backoff string. By the time Armada classifies the pod, the runtime error is no longer in pod.Status, so the caller must pass it in.
  • Config validation: a rule must specify exactly one of the four matchers; regex matchers compile-check at startup so invalid patterns fail fast.
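
The validation rule above can be illustrated with a minimal sketch. The Rule and RegexMatcher types and the ValidateRule function are hypothetical stand-ins, not the PR's actual code; in particular, the field types for onConditions and onExitCodes are assumptions made only so the example compiles.

```go
package main

import (
	"fmt"
	"regexp"
)

// RegexMatcher and Rule are hypothetical stand-ins for the PR's config types;
// the real CategoryRule in internal/executor/categorizer/types.go may differ.
type RegexMatcher struct {
	Pattern string
}

type Rule struct {
	OnConditions         []string      // assumed shape
	OnExitCodes          []int32       // assumed shape
	OnTerminationMessage *RegexMatcher // matches container Terminated.Message
	OnPodError           *RegexMatcher // matches pod-level error text
	Subcategory          string
	Hint                 string
}

// ValidateRule rejects rules that do not set exactly one matcher and
// compile-checks regex patterns so bad config fails at startup.
func ValidateRule(r Rule) error {
	set := 0
	if len(r.OnConditions) > 0 {
		set++
	}
	if len(r.OnExitCodes) > 0 {
		set++
	}
	if r.OnTerminationMessage != nil {
		set++
	}
	if r.OnPodError != nil {
		set++
	}
	if set != 1 {
		return fmt.Errorf("rule must specify exactly one matcher, got %d", set)
	}
	for _, m := range []*RegexMatcher{r.OnTerminationMessage, r.OnPodError} {
		if m == nil {
			continue
		}
		if _, err := regexp.Compile(m.Pattern); err != nil {
			return fmt.Errorf("invalid pattern %q: %w", m.Pattern, err)
		}
	}
	return nil
}

func main() {
	valid := Rule{OnPodError: &RegexMatcher{Pattern: "no match for platform in manifest"}}
	broken := Rule{OnPodError: &RegexMatcher{Pattern: "("}} // unbalanced group

	fmt.Println("valid rule error:", ValidateRule(valid))
	fmt.Println("broken rule error:", ValidateRule(broken))
}
```

Running this check once at config load, rather than on first match, is what makes invalid patterns surface before any pod is classified.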

hint

  • Optional Hint string on CategoryRule. Empty by default - no behavior change for existing configs.
  • The classifier returns the matched rule's hint alongside category/subcategory in ClassifyResult.
  • pod_issue_handler and reporter/event.go append the hint to the user-facing message with two newlines: "<original message>\n\n<hint>". Appended (not prepended) so the raw runtime error stays the lede.
  • Hints flow through the existing PodError -> Pulsar -> lookout-ingester pipeline into lookoutdb.job_run.error.
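
The hint formatting described above is small enough to sketch in full. The ClassifyResult type here is a trimmed-down, hypothetical version of the PR's type, and the AppendHint signature is an assumption; only the "<original message>\n\n<hint>" layout and the no-op behavior for an empty hint come from the description.

```go
package main

import "fmt"

// ClassifyResult is a hypothetical, trimmed-down version of the PR's result type.
type ClassifyResult struct {
	Category    string
	Subcategory string
	Hint        string
}

// AppendHint returns msg followed by a blank line and the hint.
// With no hint configured it returns msg unchanged, so existing
// configs see no behavior change.
func (r ClassifyResult) AppendHint(msg string) string {
	if r.Hint == "" {
		return msg
	}
	return msg + "\n\n" + r.Hint
}

func main() {
	res := ClassifyResult{
		Category:    "infrastructure",
		Subcategory: "platform_mismatch",
		Hint:        "Build the image for the cluster's CPU architecture.",
	}
	fmt.Println(res.AppendHint(`Failed to pull image "amd64/busybox:latest": no match for platform in manifest`))
}
```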

Validation

To reproduce on local dev:

1. Add a rule with both onPodError and hint (_local/executor/config.yaml under application:):

application:
  errorCategories:
    defaultCategory: "uncategorized"
    defaultSubcategory: "unknown"
    categories:
      - name: infrastructure
        rules:
          - onPodError:
              pattern: "no match for platform in manifest"
            subcategory: "platform_mismatch"
            hint: "Build the image for the cluster's CPU architecture (typically x64/arm64 mismatch)."

Categorization is opt-in: Armada ships no default rules.

2. Submit a wrong-arch job (example/platform-mismatch.yaml):

queue: test
jobSetId: platform-mismatch-repro
jobs:
  - namespace: default
    priority: 0
    podSpec:
      terminationGracePeriodSeconds: 0
      restartPolicy: Never
      containers:
        - name: wrong-arch
          image: amd64/busybox:latest
          command:
            - sh
            - -c
            - echo should-never-run
          resources:
            requests:
              memory: 64Mi
              cpu: "0.1"
            limits:
              memory: 64Mi
              cpu: "0.1"

Create the queue and submit the job:

armadactl create queue test
armadactl submit example/platform-mismatch.yaml

3. Wait for the kubelet event-based fail check to fire (typically 1-5 minutes).

4. Verify the categorization and hint landed:

docker exec postgres psql -U postgres -d lookout -c \
  "SELECT job_id, run_id, finished, failure_category, failure_subcategory,
          convert_from(decompress(error), 'UTF8') AS error_text
     FROM job_run
    ORDER BY finished DESC NULLS LAST
    LIMIT 1;"

Expected:

 failure_category | failure_subcategory
------------------+---------------------
 infrastructure   | platform_mismatch

 error_text
------------
 Pod of the job failed
 ...
 Failed to pull image "amd64/busybox:latest": no match for platform in manifest
 ...

 Build the image for the cluster's CPU architecture (typically x64/arm64 mismatch).

The hint appears in its own paragraph after the raw kubelet error.

Live-validated end-to-end on macOS arm64 (M3) against a k3d cluster.

Contributor

greptile-apps Bot commented Apr 30, 2026

Greptile Summary

This PR extends the executor's failure categorizer with two features: an onPodError regex matcher for classifying pre-startup pod-level failures (image pull errors, missing volumes, stuck terminating, active deadline exceeded) that have no useful container terminationMessage, and an optional hint field on rules that appends operator-supplied guidance to the failure message stored in lookoutdb.job_run.error.

  • onPodError matcher: Classify is split into ClassifyContainerError (used for direct PodFailed detection in job_state_reporter) and ClassifyPodError (used in handleNonRetryableJobIssue, matching podIssue.Message against the new matcher). Each rule must still specify exactly one matcher; ContainerName scoping is intentionally ignored for onPodError rules since pod-level errors carry no container attribution.
  • hint field: Propagated through ClassifyResult.AppendHint, called at both reporting sites (event.go for direct pod failures, pod_issue_handler.go for executor-detected issues) and appended after the raw runtime error text.
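
As a rough illustration of the pod-level path summarized above, the sketch below matches a caller-supplied error message against onPodError patterns. It is not the PR's implementation: the type and function names are hypothetical, first-match-wins ordering is an assumption, and the other matchers that ClassifyPodError also evaluates are omitted. The fallback values mirror defaultCategory and defaultSubcategory from the example config.

```go
package main

import (
	"fmt"
	"regexp"
)

// podErrorRule is a hypothetical, flattened view of one onPodError rule.
type podErrorRule struct {
	category      string
	subcategory   string
	hint          string
	containerName string // ignored: pod-level text has no container attribution
	pattern       *regexp.Regexp
}

// result is a hypothetical stand-in for the PR's ClassifyResult.
type result struct {
	Category    string
	Subcategory string
	Hint        string
}

// newRule compiles the pattern up front, mirroring the startup compile-check.
func newRule(category, subcategory, hint, pattern string) podErrorRule {
	return podErrorRule{
		category:    category,
		subcategory: subcategory,
		hint:        hint,
		pattern:     regexp.MustCompile(pattern),
	}
}

// classifyPodError returns the first rule whose pattern matches the message
// (first-match-wins is an assumption of this sketch), or the defaults.
func classifyPodError(rules []podErrorRule, podErrorMessage string) result {
	for _, r := range rules {
		if r.pattern.MatchString(podErrorMessage) {
			return result{Category: r.category, Subcategory: r.subcategory, Hint: r.hint}
		}
	}
	return result{Category: "uncategorized", Subcategory: "unknown"}
}

func main() {
	rules := []podErrorRule{
		newRule("infrastructure", "platform_mismatch",
			"Build the image for the cluster's CPU architecture.",
			"no match for platform in manifest"),
	}
	// The caller passes the captured message because pod.Status may already
	// show only the generic backoff text.
	msg := `Failed to pull image "amd64/busybox:latest": no match for platform in manifest`
	fmt.Printf("%+v\n", classifyPodError(rules, msg))
}
```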

Confidence Score: 5/5

Safe to merge; the change is additive with no behavior change when no rules are configured, and both new features are covered by end-to-end tests.

Both new features are opt-in with no behavior change for existing configs; regex and matcher-count validation happen at startup so misconfigured rules fail fast; all changed code paths have direct test coverage.

No files require special attention; all changed paths have direct test coverage.

Important Files Changed

| Filename | Overview |
| --- | --- |
| internal/executor/categorizer/classifier.go | Adds onPodError regex matcher and hint field to rule evaluation; splits Classify into ClassifyContainerError and ClassifyPodError. Logic is correct; minor usability gap where containerName on an onPodError rule is silently ignored. |
| internal/executor/categorizer/classifier_test.go | Comprehensive test coverage for new matchers, AppendHint, and the ClassifyContainerError/ClassifyPodError split; includes boundary cases and regression guards. |
| internal/executor/categorizer/types.go | Adds OnPodError and Hint fields to CategoryRule with clear documentation of ContainerName scoping semantics. |
| internal/executor/reporter/event.go | Applies AppendHint to the PodFailed reason in CreateEventForCurrentState using the existing ClassifyContainerError result. |
| internal/executor/service/pod_issue_handler.go | Switches handleNonRetryableJobIssue to ClassifyPodError, using podIssue.Message as the error text for onPodError matching and appending the hint before emitting the failed event. |
| internal/executor/service/pod_issue_handler_test.go | Adds TestPodIssueService_OnPodErrorClassifies covering platform-mismatch and active-deadline-exceeded cases including hint ordering assertions. |
| internal/executor/service/job_state_reporter.go | Renames Classify call to ClassifyContainerError for the direct PodFailed reporting path; no behavior change. |
| internal/executor/categorizer/doc.go | Updates package documentation to describe onPodError, hint, ContainerName scoping semantics, and the new API surface. |
| internal/executor/reporter/event_test.go | Adds hint-ordering test for CreateEventForCurrentState and updates existing test to use the renamed ClassifyContainerError. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Pod Detected] --> B{Pod Phase?}
    B -->|PodFailed| C[job_state_reporter\nreportCurrentStatus]
    B -->|Pending/Unknown\nStuck/Terminating\nDeadline Exceeded\nExternally Deleted| D[pod_issue_handler\ndetectPodIssues]
    C --> E[ClassifyContainerError\n- onConditions\n- onExitCodes\n- onTerminationMessage]
    E --> F[AppendHint\nextractPodFailedReason]
    F --> G[CreateEventForCurrentState]
    D --> H[registerIssue non-retryable]
    H --> I[handleNonRetryableJobIssue]
    I --> J[ClassifyPodError\n- onConditions\n- onExitCodes\n- onTerminationMessage\n- onPodError NEW]
    J --> K[AppendHint podIssue.Message]
    K --> L[CreateJobFailedEvent]
    L --> M[Pulsar to lookoutdb.job_run.error]
    G --> M

Reviews (14): Last reviewed commit: "Add hint field to errorCategories rules ..."

@dejanzele dejanzele force-pushed the categorizer-on-pod-error branch 6 times, most recently from e36ca03 to d1690ff Compare April 30, 2026 13:10
Member Author

@greptileai

mauriceyap previously approved these changes Apr 30, 2026
Member Author

@greptileai

@dejanzele dejanzele force-pushed the categorizer-on-pod-error branch 2 times, most recently from ebca648 to f7c8cee Compare May 6, 2026 11:32
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the categorizer-on-pod-error branch from f7c8cee to 9202ede Compare May 6, 2026 12:00
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the categorizer-on-pod-error branch from 9202ede to 7c651a1 Compare May 6, 2026 12:07
Member Author

@greptileai
