Flatten Error.FailureInfo into scalars and migrate Lookout #4860
dejanzele wants to merge 2 commits into armadaproject:master
…ind feature flag Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
…nd migrate Lookout Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Greptile Summary
This PR flattens the …
Confidence Score: 5/5
This PR is safe to merge; all scalar/batch DB parameter orderings are correct, proto field reservations are in place, and migration is non-blocking. All findings are P2 or lower.
Proto backward compatibility is maintained (field 15 reserved before adding 16/17). Temp-table columns, CopyFromSlice values, batch UPDATE, and scalar UPDATE are all correctly aligned. The nil-classifier path is handled safely via the nil-receiver guard in Classify. The migration adds only nullable columns. No data loss or correctness issues identified. No files require special attention.
Sequence Diagram
sequenceDiagram
participant Pod as Kubernetes Pod
participant Executor as Executor
participant Classifier as Classifier
participant EventReporter as EventReporter
participant Pulsar as Pulsar
participant Ingester as Lookout Ingester
participant DB as PostgreSQL (job_run)
Pod->>Executor: Pod failed
Executor->>Classifier: Classify(pod)
Classifier-->>Executor: ClassifyResult{Category, Subcategory}
Executor->>EventReporter: CreateJobFailedEvent(..., category, subcategory)
EventReporter->>Pulsar: Publish Error{FailureCategory, FailureSubcategory}
Pulsar->>Ingester: Consume EventSequence
Ingester->>Ingester: handleJobRunErrors() -> UpdateJobRunInstruction{FailureCategory*, FailureSubcategory*}
Ingester->>DB: UPDATE job_run SET failure_category=..., failure_subcategory=...
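The executor half of the diagram can be sketched in Go as below. This is a simplified illustration, not the code in this PR: `Classifier`, its `rules` map, the `Error` struct, `newFailedError`, and the category strings are hypothetical stand-ins, and the real `Classify` takes a pod rather than an exit code. It shows the two ideas the summary relies on: a nil-receiver guard so a missing classifier yields empty scalars, and the classifier result being copied straight onto the error as two strings.

```go
package main

import "fmt"

// ClassifyResult mirrors the shape named in the diagram; field types are assumed.
type ClassifyResult struct {
	Category    string
	Subcategory string
}

// Classifier is a hypothetical stand-in for the executor's classifier.
type Classifier struct {
	// rules maps a container exit code to a result (illustrative only).
	rules map[int32]ClassifyResult
}

// Classify is safe to call on a nil *Classifier: the nil-receiver guard
// returns the zero result instead of dereferencing the receiver.
func (c *Classifier) Classify(exitCode int32) ClassifyResult {
	if c == nil {
		return ClassifyResult{}
	}
	return c.rules[exitCode] // a missing key yields the zero ClassifyResult
}

// Error is a hypothetical stand-in for armadaevents.Error, reduced to the
// two scalar fields this PR introduces plus a message.
type Error struct {
	Message            string
	FailureCategory    string
	FailureSubcategory string
}

// newFailedError copies the classifier output directly onto the scalars.
func newFailedError(c *Classifier, exitCode int32, msg string) Error {
	r := c.Classify(exitCode)
	return Error{
		Message:            msg,
		FailureCategory:    r.Category,
		FailureSubcategory: r.Subcategory,
	}
}

func main() {
	c := &Classifier{rules: map[int32]ClassifyResult{
		137: {Category: "oom", Subcategory: "container"}, // made-up labels
	}}
	fmt.Printf("%+v\n", newFailedError(c, 137, "killed"))

	var none *Classifier // no classifier configured
	fmt.Printf("%+v\n", newFailedError(none, 137, "killed"))
}
```

With no classifier configured, both scalars stay empty, which is the case the ingester would need to treat as "no category".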
Reviews (1): Last reviewed commit: "Flatten Error.FailureInfo into failure_c..."
What type of PR is this?
Refactor (2 of 2 splitting #4843 into smaller reviews). Depends on #4859 (classifier refactor).
What this PR does / why we need it
Flattens the `FailureInfo` bundle on `armadaevents.Error` into two scalar string fields (`failure_category`, `failure_subcategory`) and migrates the Lookout ingester to write two text columns in place of the jsonb `failure_info` blob. All executor call sites that previously packed classifier output into `FailureInfo.Categories` now write the scalars directly on `Error`.

Split from #4843 at the request of reviewers who asked for smaller PRs. The classifier logic change landed in #4859; this PR is the wire-format swap plus the storage migration. Together they reproduce #4843.
Proto changes
- `pkg/armadaevents/events.proto`: delete the `FailureInfo` message, reserve field 15 on `Error`, add `failure_category` (16) and `failure_subcategory` (17) scalars
- `pkg/api/event.proto`: reserve field 15 on `JobFailedEvent` (was `categories []string`), add the same two scalars

Lookout ingester
- Adds `failure_category` and `failure_subcategory` text columns to `job_run`
- Writes those columns in place of the `failure_info` blob
- The `failure_info` jsonb column is left in place and removed in a follow-up PR (Drop failure_info column from job_run table #4855) once readers have switched over (Display failure_category/failure_subcategory in Lookout UI #4853)

Which issue(s) this PR fixes
Part of the failure categorization slimming. Follow-ups on top:
- `job_failure_category_total`
- `failure_info` jsonb column

Special notes for your reviewer
- The classifier already returns `ClassifyResult{Category, Subcategory}`; this PR just changes where those strings end up on the wire.
- Every `FailureInfo` read-site is swapped for the two scalar fields, and every `FailureInfo.Categories` write-site is swapped for `Error.FailureCategory` / `Error.FailureSubcategory`.
- `internal/lookoutingester/lookoutdb/insertion.go`: please verify that the temp-table / batch / scalar UPDATEs line up.
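For reviewers checking that columns and values line up, here is a minimal Go sketch of one way to keep them aligned by construction. It is not the code in `insertion.go`: `nullable`, `column`, `buildUpdate`, the `RunId` field, and the `run_id` key column are assumptions made for illustration. Only the `job_run` table, the two column names, and the pointer-typed `FailureCategory` / `FailureSubcategory` fields on `UpdateJobRunInstruction` come from the PR. The idea is to derive the SET clause and the argument list from the same slice, so a placeholder cannot drift away from its value.

```go
package main

import (
	"fmt"
	"strings"
)

// UpdateJobRunInstruction is reduced to the fields relevant here.
// RunId is an assumed key; the pointer fields follow the diagram.
type UpdateJobRunInstruction struct {
	RunId              string
	FailureCategory    *string
	FailureSubcategory *string
}

// nullable maps an empty scalar to nil so it can be stored as NULL
// in a nullable text column (hypothetical helper).
func nullable(s string) *string {
	if s == "" {
		return nil
	}
	return &s
}

// column pairs a column name with the value written to it.
type column struct {
	name  string
	value any
}

// buildUpdate derives placeholders and arguments from one slice, so the
// n-th placeholder always refers to the n-th argument.
func buildUpdate(i UpdateJobRunInstruction) (string, []any) {
	cols := []column{
		{"failure_category", i.FailureCategory},
		{"failure_subcategory", i.FailureSubcategory},
	}
	sets := make([]string, 0, len(cols))
	args := make([]any, 0, len(cols)+1)
	for n, c := range cols {
		sets = append(sets, fmt.Sprintf("%s = $%d", c.name, n+1))
		args = append(args, c.value)
	}
	args = append(args, i.RunId) // key is always the last placeholder
	query := fmt.Sprintf("UPDATE job_run SET %s WHERE run_id = $%d",
		strings.Join(sets, ", "), len(args))
	return query, args
}

func main() {
	q, args := buildUpdate(UpdateJobRunInstruction{
		RunId:              "run-1",
		FailureCategory:    nullable("oom"), // made-up label
		FailureSubcategory: nullable(""),    // becomes a nil pointer
	})
	fmt.Println(q)
	fmt.Println(len(args))
}
```

The same slice could also drive the temp-table column list and the rows passed to a bulk copy, which would reduce the three paths to a single ordering to review.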