Replace FailureInfo with flat failure_category/failure_subcategory fields #4843
dejanzele wants to merge 1 commit into armadaproject:master
Greptile Summary
This PR replaces the
Confidence Score: 5/5
Safe to merge; flag-gating ensures existing deployments are untouched and proto changes are backward-compatible.
All remaining findings are P2. No logic bugs or data-loss paths found. Previous P1 concerns about the v1 read path gap and nil-classifier safety are acknowledged in the PR and prior review. Proto field reservation is correct.
internal/lookout/schema/migrations/032_add_failure_category_to_job_run.sql: missing index
internal/lookoutingester/lookoutdb/insertion.go: verify parameter ordering after new column additions
Sequence Diagram
sequenceDiagram
participant Pod as Kubernetes Pod
participant Executor as Executor
participant Classifier as Classifier
participant Event as armadaevents.Error
participant Pulsar as Pulsar
participant Ingester as LookoutIngester
participant DB as job_run (Postgres)
Pod->>Executor: PodFailed
Executor->>Classifier: Classify(pod)
Note over Classifier: First-match-wins across categories and rules
Classifier-->>Executor: ClassifyResult{Category, Subcategory}
Executor->>Event: CreateJobFailedEvent(..., category, subcategory)
Note over Event: Error.failure_category = category
Event->>Pulsar: publish EventSequence
Pulsar->>Ingester: consume
Ingester->>DB: UPDATE job_run SET failure_category, failure_subcategory
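The Classify step in the diagram can be illustrated with a minimal Go sketch of first-match-wins classification. This is not the PR's implementation: PodFailure, Rule, Category, the category names, and the choice to return an empty result from a nil classifier are all assumptions made for illustration. Only the ClassifyResult{Category, Subcategory} shape and the first-match-wins and default-category behaviour come from the text.

```go
package main

import (
	"fmt"
	"strings"
)

// PodFailure is a hypothetical, simplified view of a failed pod.
type PodFailure struct {
	ExitCode           int32
	TerminationMessage string
}

// Rule is a hypothetical matcher; Subcategory may be left empty.
type Rule struct {
	Subcategory string
	Matches     func(PodFailure) bool
}

// Category groups rules under one name.
type Category struct {
	Name  string
	Rules []Rule
}

// ClassifyResult mirrors the shape named in the sequence diagram.
type ClassifyResult struct {
	Category    string
	Subcategory string
}

// Classifier holds categories in priority order plus a fallback category.
type Classifier struct {
	Categories      []Category
	DefaultCategory string
}

// Classify returns the first matching (category, subcategory) pair.
// A nil receiver yields an empty result (an assumption), so callers
// need no nil check when categorization is disabled.
func (c *Classifier) Classify(p PodFailure) ClassifyResult {
	if c == nil {
		return ClassifyResult{}
	}
	for _, cat := range c.Categories {
		for _, r := range cat.Rules {
			if r.Matches(p) {
				return ClassifyResult{Category: cat.Name, Subcategory: r.Subcategory}
			}
		}
	}
	return ClassifyResult{Category: c.DefaultCategory}
}

// exampleClassifier builds illustrative categories; all names are made up.
func exampleClassifier() *Classifier {
	return &Classifier{
		DefaultCategory: "unknown",
		Categories: []Category{
			{Name: "oom", Rules: []Rule{
				{Subcategory: "exit-137", Matches: func(p PodFailure) bool { return p.ExitCode == 137 }},
			}},
			{Name: "user_error", Rules: []Rule{
				{Subcategory: "bad-input", Matches: func(p PodFailure) bool {
					return strings.Contains(p.TerminationMessage, "invalid input")
				}},
				// Catch-all rule without a subcategory.
				{Matches: func(p PodFailure) bool { return p.ExitCode != 0 }},
			}},
		},
	}
}

func main() {
	c := exampleClassifier()
	// Exit code 137 also satisfies the user_error catch-all, but "oom" is listed first.
	fmt.Printf("%+v\n", c.Classify(PodFailure{ExitCode: 137}))
	fmt.Printf("%+v\n", c.Classify(PodFailure{ExitCode: 1}))
}
```

Because exactly one category is returned per failure, the value has bounded cardinality as long as the configured category list is small, which is what makes it usable as a metric label.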
…elds
FailureInfo bundled four fields (exit_code, termination_message, categories, container_name) that were already available on ContainerError. Replace it with two flat strings on Error: failure_category (metric-safe, first-match-wins) and failure_subcategory (optional drill-down). This shrinks the wire size, avoids redundant storage of termination messages, and gives the ingester two text columns to write instead of a jsonb blob.
Classifier rewrite: returns a single (category, subcategory) with first-match-wins semantics. Adds a default category and per-rule subcategory. Enabled via the new EnableJobErrorCategorization executor flag (off by default).
Proto: delete FailureInfo message, reserve field 15, add failure_category (16) and failure_subcategory (17) on Error. Mirror the change on the public api.JobFailedEvent, which replaces categories []string with the same two scalar fields.
Lookout DB: migration 032 adds the two text columns. Ingester writes them in place of the jsonb failure_info blob. The old column is dropped in a follow-up PR once readers are on the new columns.
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
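As a rough illustration of the flag gating described in this commit message, the Go sketch below leaves both flat fields empty unless categorization is enabled. ExecutorConfig, ErrorEvent, classifyFunc, and newErrorEvent are hypothetical stand-ins, not the executor's real types; only the flag name and the two field names are taken from the text.

```go
package main

import "fmt"

// ExecutorConfig is a hypothetical config struct; only the flag name
// comes from the commit message.
type ExecutorConfig struct {
	EnableJobErrorCategorization bool
}

// ErrorEvent is a hypothetical stand-in for the Error event, carrying
// the two flat string fields instead of a nested FailureInfo message.
type ErrorEvent struct {
	FailureCategory    string
	FailureSubcategory string
}

// classifyFunc is a hypothetical classifier signature.
type classifyFunc func(exitCode int32) (category, subcategory string)

// newErrorEvent populates the flat fields only when the flag is on and a
// classifier is present; otherwise both fields stay empty.
func newErrorEvent(cfg ExecutorConfig, classify classifyFunc, exitCode int32) ErrorEvent {
	var ev ErrorEvent
	if !cfg.EnableJobErrorCategorization || classify == nil {
		return ev
	}
	ev.FailureCategory, ev.FailureSubcategory = classify(exitCode)
	return ev
}

func main() {
	// Illustrative classifier; the category names are made up.
	classify := func(exitCode int32) (string, string) {
		if exitCode == 137 {
			return "oom", "exit-137"
		}
		return "unknown", ""
	}
	fmt.Printf("%+v\n", newErrorEvent(ExecutorConfig{}, classify, 137))
	fmt.Printf("%+v\n", newErrorEvent(ExecutorConfig{EnableJobErrorCategorization: true}, classify, 137))
}
```

With the flag off, the event carries two empty strings, so deployments that have not opted in produce the same event content as before apart from the removed FailureInfo message.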
Closing in favour of a more modular PR.
Reads the flat failure_category and failure_subcategory columns added in #4843 instead of the failure_info jsonb blob. Updates the querybuilder SQL, the internal Go model, swagger and its generated code, the conversion layer, and the jobs sidebar. Depends on #4843. Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Drops the failure_info jsonb column from job_run. Nothing writes or reads it after #4843 and #4853, and the column was never populated in production outside the opt-in flag path anyway. Also drops the unused FailureInfo field from the queryapi sqlc model. Only merge after #4843 and #4853 have been deployed long enough that we are sure no consumer still depends on the column. Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com> Signed-off-by: Yasmine Hines <yhines004@gmail.com>
FailureInfo bundled four fields (exit_code, termination_message, categories, container_name) that were already present on ContainerError. Replacing it with two flat strings on Error cuts wire size, removes the duplicate termination-message storage, and gives the ingester two text columns to write instead of a jsonb blob.
Classifier returns a single (category, subcategory) with first-match-wins semantics instead of []string of all matches. Adds a default category and per-rule subcategory. Gated behind a new enableJobErrorCategorization executor flag, off by default, so existing deployments are untouched until they opt in.
Proto: delete FailureInfo, reserve field 15, add failure_category (16) and failure_subcategory (17) on Error. Same on public api.JobFailedEvent which swaps categories []string for the two scalars.
Lookout: migration 032 adds the columns, ingester writes them in place of failure_info. The old column stays put here and is dropped in a follow-up once readers are on the new columns.
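The Lookout write path above can be sketched in Go as a statement plus an argument builder whose order matches the placeholders, which is the parameter-ordering concern raised in the review. This is an assumption-laden sketch, not the ingester's code: RunFailure, updateFailureArgs, nullIfEmpty, the run_id key column in the WHERE clause, and the choice to store empty strings as NULL are all illustrative.

```go
package main

import "fmt"

// RunFailure is a hypothetical carrier for the values written to job_run.
type RunFailure struct {
	RunID       string
	Category    string
	Subcategory string
}

// updateFailureSQL writes the two text columns named in the PR.
// The run_id key column is an assumption for this sketch.
const updateFailureSQL = `UPDATE job_run
SET failure_category = $1, failure_subcategory = $2
WHERE run_id = $3`

// nullIfEmpty maps "" to nil so the driver would write SQL NULL
// (an illustrative choice, not confirmed by the PR).
func nullIfEmpty(s string) any {
	if s == "" {
		return nil
	}
	return s
}

// updateFailureArgs returns arguments in the same order as $1, $2, $3.
// Keeping the order in one place next to the SQL makes mismatches easier to spot.
func updateFailureArgs(f RunFailure) []any {
	return []any{nullIfEmpty(f.Category), nullIfEmpty(f.Subcategory), f.RunID}
}

func main() {
	args := updateFailureArgs(RunFailure{RunID: "run-1", Category: "oom"})
	fmt.Println(updateFailureSQL)
	fmt.Println(args...)
}
```

The statement and arguments could be passed to any Postgres driver's Exec call; a batch-oriented ingester would more likely apply the same column ordering to a bulk write.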
Follow-ups on top of this PR:
job_failure_category_total