Skip to content

Flatten Error.FailureInfo into scalars and migrate Lookout#4860

Closed
dejanzele wants to merge 2 commits intoarmadaproject:masterfrom
dejanzele:slim-failure-proto-flatten
Closed

Flatten Error.FailureInfo into scalars and migrate Lookout#4860
dejanzele wants to merge 2 commits intoarmadaproject:masterfrom
dejanzele:slim-failure-proto-flatten

Conversation

@dejanzele
Copy link
Copy Markdown
Member

What type of PR is this?

Refactor (2 of 2 splitting #4843 into smaller reviews). Depends on #4859 (classifier refactor).

What this PR does / why we need it

Flattens the FailureInfo bundle on armadaevents.Error into two scalar string fields (failure_category, failure_subcategory) and migrates the Lookout ingester to write two text columns in place of the jsonb failure_info blob. All executor call sites that previously packed classifier output into FailureInfo.Categories now write the scalars directly on Error.

Split from #4843 at the request of reviewers who asked for smaller PRs. The classifier logic change landed in #4859; this PR is the wire-format swap + storage migration. Together they reproduce #4843.

Proto changes

  • pkg/armadaevents/events.proto: delete FailureInfo message, reserve field 15 on Error, add failure_category (16) and failure_subcategory (17) scalars
  • pkg/api/event.proto: reserve field 15 on JobFailedEvent (was categories []string), add the same two scalars

Lookout ingester

Which issue(s) this PR fixes

Part of the failure categorization slimming. Follow-ups on top:

Special notes for your reviewer

  • Depends on Refactor executor classifier to first-match-wins with subcategory #4859. If you're reviewing this standalone, note that the classifier now returns ClassifyResult{Category, Subcategory}; this PR just changes where those strings end up on the wire.
  • Most of the diff is mechanical: every FailureInfo read-site swapped for the two scalar fields, every FailureInfo.Categories write-site swapped for Error.FailureCategory/Error.FailureSubcategory.
  • Migration 032 adds nullable text columns only (no index) - consistent with existing schema. Indexes can be added in follow-up PRs if query patterns warrant.
  • Greptile previously flagged parameter ordering in internal/lookoutingester/lookoutdb/insertion.go - please verify the temp-table / batch / scalar UPDATEs line up.

…ind feature flag

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
…nd migrate Lookout

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 22, 2026

Greptile Summary

This PR flattens the FailureInfo bundle on armadaevents.Error into two scalar string fields (failure_category, failure_subcategory) and migrates the Lookout ingester to write those fields to two new nullable text columns in job_run, replacing the JSONB failure_info blob. All executor call sites, proto definitions, ingester instructions, batch/scalar DB writes, and the query API model have been consistently updated; the old failure_info column is intentionally left in place for follow-up removal once readers migrate.

Confidence Score: 5/5

This PR is safe to merge; all scalar/batch DB parameter orderings are correct, proto field reservations are in place, and migration is non-blocking.

All findings are P2 or lower. Proto backward compatibility is maintained (field 15 reserved before adding 16/17). Temp-table columns, CopyFromSlice values, batch UPDATE, and scalar UPDATE all correctly aligned. Nil-classifier path handled safely via the nil-receiver guard in Classify. Migration adds only nullable columns. No data loss or correctness issues identified.

No files require special attention.

Important Files Changed

Filename Overview
internal/lookoutingester/lookoutdb/insertion.go Temp-table columns, CopyFromSlice values, batch UPDATE, and scalar UPDATE all correctly use failure_category/failure_subcategory; parameter numbering in scalar UPDATE ($11/$12) matches Go argument order.
internal/lookout/schema/migrations/032_add_failure_category_to_job_run.sql Adds two nullable text columns to job_run; non-blocking DDL, consistent with existing schema style, no index added (deferred as noted).
internal/lookoutingester/instructions/instructions.go Removes enableJobRunFailureInfoMap flag, failureInfoToMap helper, and maxTerminationMessageLen constant; writes FailureCategory/FailureSubcategory pointers only when non-empty (NULL-safe).
internal/executor/reporter/event.go CreateSimpleJobFailedEvent drops debugMessage param (was always empty at all callers); CreateJobFailedEvent sets FailureCategory/FailureSubcategory scalars directly on Error; nil classifier handled safely via Classify nil-receiver guard.
pkg/armadaevents/events.proto Correctly reserves field 15 (and 'failure_info' name) before adding fields 16/17; FailureInfo message deleted; wire-format backward compatibility maintained.
pkg/api/event.proto Correctly reserves field 15 and 'categories' name on JobFailedEvent before adding failure_category (16) and failure_subcategory (17).
internal/executor/service/pod_issue_handler.go handleNonRetryableJobIssue switches from CreateSimpleJobFailedEvent to CreateJobFailedEvent with real container statuses, enabling classification and retaining ContainerErrors for the retry engine.
internal/executor/categorizer/classifier.go Classifier now returns first-match ClassifyResult{Category, Subcategory} instead of all-matches []string; nil receiver handled safely; defaultCategory falls back to 'uncategorized' when no rules match.
internal/server/queryapi/database/query.sql.go Both GetJobRunsByJobIds and GetJobRunsByRunIds queries extended to also select failure_category and failure_subcategory; Scan destinations match column order.
internal/executor/application.go Classifier construction gated behind EnableJobErrorCategorization flag; reuses existing err variable declared on line 201; nil classifier propagated safely throughout.

Sequence Diagram

sequenceDiagram
    participant Pod as Kubernetes Pod
    participant Executor as Executor
    participant Classifier as Classifier
    participant EventReporter as EventReporter
    participant Pulsar as Pulsar
    participant Ingester as Lookout Ingester
    participant DB as PostgreSQL (job_run)

    Pod->>Executor: Pod failed
    Executor->>Classifier: Classify(pod)
    Classifier-->>Executor: ClassifyResult{Category, Subcategory}
    Executor->>EventReporter: CreateJobFailedEvent(..., category, subcategory)
    EventReporter->>Pulsar: Publish Error{FailureCategory, FailureSubcategory}

    Pulsar->>Ingester: Consume EventSequence
    Ingester->>Ingester: handleJobRunErrors() -> UpdateJobRunInstruction{FailureCategory*, FailureSubcategory*}
    Ingester->>DB: UPDATE job_run SET failure_category=..., failure_subcategory=...
Loading

Reviews (1): Last reviewed commit: "Flatten Error.FailureInfo into failure_c..." | Re-trigger Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant