Add onPodError matcher to categorize pre-startup pod failures#4891
Add onPodError matcher to categorize pre-startup pod failures#4891dejanzele wants to merge 2 commits intoarmadaproject:masterfrom
Conversation
Greptile SummaryThis PR extends the executor's failure categorizer with two features: an
Confidence Score: 5/5Safe to merge; the change is additive with no behavior change when no rules are configured, and both new features are covered by end-to-end tests. Both new features are opt-in with no behavior change for existing configs; regex and matcher-count validation happen at startup so misconfigured rules fail fast; all changed code paths have direct test coverage. No files require special attention; all changed paths have direct test coverage. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Pod Detected] --> B{Pod Phase?}
B -->|PodFailed| C[job_state_reporter\nreportCurrentStatus]
B -->|Pending/Unknown\nStuck/Terminating\nDeadline Exceeded\nExternally Deleted| D[pod_issue_handler\ndetectPodIssues]
C --> E[ClassifyContainerError\n- onConditions\n- onExitCodes\n- onTerminationMessage]
E --> F[AppendHint\nextractPodFailedReason]
F --> G[CreateEventForCurrentState]
D --> H[registerIssue non-retryable]
H --> I[handleNonRetryableJobIssue]
I --> J[ClassifyPodError\n- onConditions\n- onExitCodes\n- onTerminationMessage\n- onPodError NEW]
J --> K[AppendHint podIssue.Message]
K --> L[CreateJobFailedEvent]
L --> M[Pulsar to lookoutdb.job_run.error]
G --> M
Reviews (14): Last reviewed commit: "Add hint field to errorCategories rules ..." | Re-trigger Greptile |
e36ca03 to
d1690ff
Compare
d1690ff to
66d8523
Compare
ebca648 to
f7c8cee
Compare
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
f7c8cee to
9202ede
Compare
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
9202ede to
7c651a1
Compare
Summary
This PR extends the failure categorizer with two operator-facing additions:
onPodErrormatcher - matches pod-level error text, covering pre-startup kubelet/runtime errors (image pull, missing volume, missing ConfigMap/Secret) and Armada-detected pod-level failures (stuck terminating, active deadline exceeded, externally deleted) that produce no useful containerterminationMessage. These end up with emptyfailure_category/failure_subcategoryin lookoutdb today.hintfield on rules - operator-supplied user-facing copy describing the failure mode. When set, it is appended to the failure message that lands inlookoutdb.job_run.error, so end users see actionable guidance alongside the raw runtime error.The two commits are independently useful; together they let operators both classify and explain previously-opaque pod-level failures.
Approach
onPodErrorCategoryRule. Matches a regex against the issue's pod-level error message.ContainerNamescoping is ignored (pod-level text has no container attribution).onTerminationMessageis unchanged - still matches containerTerminated.Message, still honorsContainerName. Non-overlapping data source fromOnPodErrorby design.Classifyis split intoClassifyContainerError(pod)andClassifyPodError(pod, podErrorMessage). The pod-error variant is needed because kubelet rotatesWaiting.ReasonfromErrImagePulltoImagePullBackOffwithin seconds, replacingWaiting.Messagewith a generic backoff string, so by the time Armada classifies the pod the runtime error is no longer inpod.Statusand must be passed in by the caller.hintHint stringonCategoryRule. Empty by default - no behavior change for existing configs.ClassifyResult.pod_issue_handlerandreporter/event.goappend the hint to the user-facing message with two newlines:"<original message>\n\n<hint>". Appended (not prepended) so the raw runtime error stays the lede.PodError-> Pulsar -> lookout-ingester pipeline intolookoutdb.job_run.error.Validation
To reproduce on local dev:
1. Add a rule with both
onPodErrorandhint(_local/executor/config.yamlunderapplication:):Categorization is opt-in: Armada ships no default rules.
2. Submit a wrong-arch job (
example/platform-mismatch.yaml):armadactl create queue test armadactl submit example/platform-mismatch.yaml3. Wait for the kubelet event-based fail check to fire (typically 1-5 minutes).
4. Verify the categorization and hint landed:
Expected:
The hint appears on its own paragraph after the raw kubelet error.
Live-validated end-to-end on macOS arm64 (M3) against a k3d cluster.