Skip to content

Replace FailureInfo with flat failure_category/failure_subcategory fields#4843

Closed
dejanzele wants to merge 1 commit intoarmadaproject:masterfrom
dejanzele:slim-failure-classification
Closed

Replace FailureInfo with flat failure_category/failure_subcategory fields#4843
dejanzele wants to merge 1 commit intoarmadaproject:masterfrom
dejanzele:slim-failure-classification

Conversation

@dejanzele
Copy link
Copy Markdown
Member

@dejanzele dejanzele commented Apr 16, 2026

FailureInfo bundled four fields (exit_code, termination_message, categories, container_name) that were already present on ContainerError. Replacing it with two flat strings on Error cuts wire size, removes the duplicate termination-message storage, and gives the ingester two text columns to write instead of a jsonb blob.

Classifier returns a single (category, subcategory) with first-match-wins semantics instead of []string of all matches. Adds a default category and per-rule subcategory. Gated behind a new enableJobErrorCategorization executor flag, off by default, so existing deployments are untouched until they opt in.

Proto: delete FailureInfo, reserve field 15, add failure_category (16) and failure_subcategory (17) on Error. Same on public api.JobFailedEvent which swaps categories []string for the two scalars.

Lookout: migration 032 adds the columns, ingester writes them in place of failure_info. The old column stays put here and is dropped in a follow-up once readers are on the new columns.

Follow-ups on top of this PR:

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 16, 2026

Greptile Summary

This PR replaces the FailureInfo proto message with two flat string fields (failure_category, failure_subcategory) on Error, and updates the classifier from returning all matching categories to first-match-wins semantics. The ingester writes the new text columns in place of the old failure_info jsonb blob, gated behind a new enableJobErrorCategorization executor flag that defaults to off.

Confidence Score: 5/5

Safe to merge; flag-gating ensures existing deployments are untouched and proto changes are backward-compatible.

All remaining findings are P2. No logic bugs or data-loss paths found. Previous P1 concerns about the v1 read path gap and nil-classifier safety are acknowledged in the PR and prior review. Proto field reservation is correct.

internal/lookout/schema/migrations/032_add_failure_category_to_job_run.sql — missing index; internal/lookoutingester/lookoutdb/insertion.go — verify parameter ordering after new column additions.

Important Files Changed

Filename Overview
internal/lookout/schema/migrations/032_add_failure_category_to_job_run.sql Adds failure_category and failure_subcategory as nullable text columns; no index added, which may hurt future queries in follow-up PRs #4853/#4854.
internal/lookoutingester/lookoutdb/insertion.go Temp table, batch UPDATE, and scalar UPDATE all swapped failure_info jsonb for failure_category/failure_subcategory text columns; parameter counts look correct but no index added for the new columns.
internal/lookoutingester/instructions/instructions.go Removed EnableJobRunFailureInfoMap flag and failureInfoToMap; ingester now writes failure_category/failure_subcategory as plain strings only when non-empty and terminal.
internal/executor/categorizer/classifier.go Classifier refactored from returning all matching categories to first-match-wins (ClassifyResult with Category+Subcategory); nil-receiver safety on Classify() preserved.
internal/executor/reporter/event.go FailureInfo removed; CreateJobFailedEvent now takes flat failure_category/failure_subcategory strings; CreateSimpleJobFailedEvent signature cleaned up.
pkg/armadaevents/events.proto FailureInfo message deleted, field 15 reserved; failure_category (16) and failure_subcategory (17) added as flat strings on Error; backward-compatible proto changes.
pkg/api/event.proto JobFailedEvent.categories (field 15) reserved; failure_category (16) and failure_subcategory (17) added as scalars; correct proto reservation.
internal/lookoutingester/model/model.go UpdateJobRunInstruction.FailureInfo map replaced with FailureCategory/*string and FailureSubcategory/*string; cleaner model matching the new DB schema.

Sequence Diagram

sequenceDiagram
    participant Pod as Kubernetes Pod
    participant Executor as Executor
    participant Classifier as Classifier
    participant Event as armadaevents.Error
    participant Pulsar as Pulsar
    participant Ingester as LookoutIngester
    participant DB as job_run (Postgres)

    Pod->>Executor: PodFailed
    Executor->>Classifier: Classify(pod)
    Note over Classifier: First-match-wins across categories and rules
    Classifier-->>Executor: ClassifyResult{Category, Subcategory}
    Executor->>Event: CreateJobFailedEvent(..., category, subcategory)
    Note over Event: Error.failure_category = category
    Event->>Pulsar: publish EventSequence
    Pulsar->>Ingester: consume
    Ingester->>DB: UPDATE job_run SET failure_category, failure_subcategory
Loading

Reviews (19): Last reviewed commit: "Replace FailureInfo with flat failure_ca..." | Re-trigger Greptile

Comment thread internal/executor/categorizer/classifier.go
Comment thread internal/executor/job/processors/preempt_runs.go Outdated
@dejanzele dejanzele force-pushed the slim-failure-classification branch 6 times, most recently from 7f89036 to a0328c1 Compare April 17, 2026 11:51
Comment thread internal/executor/configuration/types.go
@dejanzele dejanzele force-pushed the slim-failure-classification branch 8 times, most recently from c45f485 to a8d6b85 Compare April 17, 2026 15:44
@dejanzele dejanzele changed the title Replace FailureInfo with failure_category/failure_subcategory Replace FailureInfo with flat failure_category/failure_subcategory fields Apr 17, 2026
@dejanzele dejanzele force-pushed the slim-failure-classification branch 2 times, most recently from 9ab0a66 to 35a2bb6 Compare April 21, 2026 12:33
@dejanzele dejanzele force-pushed the slim-failure-classification branch from 35a2bb6 to 966cf7f Compare April 21, 2026 13:22
@dejanzele
Copy link
Copy Markdown
Member Author

@greptileai

@dejanzele dejanzele force-pushed the slim-failure-classification branch from 966cf7f to 53fec06 Compare April 21, 2026 13:35
@dejanzele dejanzele closed this Apr 22, 2026
@dejanzele dejanzele reopened this Apr 22, 2026
@dejanzele dejanzele force-pushed the slim-failure-classification branch from 53fec06 to 4b94bd0 Compare April 22, 2026 15:37
…elds

FailureInfo bundled four fields (exit_code, termination_message, categories,
container_name) that were already available on ContainerError. Replace it
with two flat strings on Error: failure_category (metric-safe, first-match-
wins) and failure_subcategory (optional drill-down). This shrinks the wire
size, avoids redundant storage of termination messages, and gives the
ingester two text columns to write instead of a jsonb blob.

Classifier rewrite: returns a single (category, subcategory) with first-
match-wins semantics. Adds a default category and per-rule subcategory.
Enabled via the new EnableJobErrorCategorization executor flag (off by
default).

Proto: delete FailureInfo message, reserve field 15, add failure_category
(16) and failure_subcategory (17) on Error. Mirror the change on the public
api.JobFailedEvent, which replaces categories []string with the same two
scalar fields.

Lookout DB: migration 032 adds the two text columns. Ingester writes them
in place of the jsonb failure_info blob. The old column is dropped in a
follow-up PR once readers are on the new columns.

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the slim-failure-classification branch from 4b94bd0 to 34ecb7a Compare April 22, 2026 15:43
@dejanzele
Copy link
Copy Markdown
Member Author

@greptileai

@dejanzele
Copy link
Copy Markdown
Member Author

Closing in favour of mode modular PR

@dejanzele dejanzele closed this Apr 23, 2026
dejanzele added a commit that referenced this pull request Apr 23, 2026
Reads the flat failure_category and failure_subcategory columns added in
#4843 instead of the failure_info jsonb blob. Updates the querybuilder
SQL, the internal Go model, swagger and its generated code, the
conversion layer, and the jobs sidebar.

Depends on #4843.

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
dejanzele added a commit that referenced this pull request Apr 29, 2026
Drops the failure_info jsonb column from job_run. Nothing writes or
reads it after #4843 and #4853, and the column was never populated in
production outside the opt-in flag path anyway.

Also drops the unused FailureInfo field from the queryapi sqlc model.

Only merge after #4843 and #4853 have been deployed long enough that we
are sure no consumer still depends on the column.

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
YHines004 pushed a commit that referenced this pull request May 4, 2026
Drops the failure_info jsonb column from job_run. Nothing writes or
reads it after #4843 and #4853, and the column was never populated in
production outside the opt-in flag path anyway.

Also drops the unused FailureInfo field from the queryapi sqlc model.

Only merge after #4843 and #4853 have been deployed long enough that we
are sure no consumer still depends on the column.

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Signed-off-by: Yasmine Hines <yhines004@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant