feat(observability): enrich ActiveJob error logs (ING-136) by groyoh · Pull Request #5521 · getlago/lago-api

groyoh · 2026-05-18T08:11:22Z

Context

After the load-test incident, reconciling which jobs were lost from the Sidekiq dead-jobs queue and the Sentry traces was harder than it should have been because the structured JSON log entries our ActiveJob::JsonLogSubscriber already emits did not include enough metadata to map a failed job back to its inputs. The error log entries did not carry the job's arguments, so the entries we could see told us that a job class had failed for a queue but not which row, customer, or invoice it had been called with. The organization_id was also missing, which meant we could not slice reconciliation queries per organization.

The logging surface itself already exists. The JSON subscriber attaches in config/application.rb and fires on every enqueue, perform, retry, and discard event. The gap is on the fields included in those entries, not on whether the events are captured at all.

Description

The error-status entries from ActiveJob::JsonLogSubscriber now carry the missing fields. enqueue errors, enqueue_all per-job errors, enqueue_at errors, perform errors, enqueue_retry, retry_stopped, and discard all emit queue, arguments, and organization_id (when extractable). The execution-count field is now emitted under the same key (retries) on #perform error, #retry_stopped, and #discard; #retry_stopped already used retries before this branch, so the other two events were aligned to that name. #enqueue_retry keeps its existing execution key (it is a per-attempt retry counter, distinct from the terminal retries total). #discard also gained the previously-missing queue field.

The organization_id walker is sourced through two new private helpers. organization_id_in(arg) returns the org_id for a single argument: for a Hash, it falls back to arg[:organization_id].presence || arg["organization_id"].presence; for any other arg, it calls arg.organization_id when the receiver responds to it. organization_id_from(job) uses Array#find over job.arguments to return the first non-nil match, with a top-level rescue StandardError so a buggy accessor on an argument cannot break logging. merge_organization_id(payload, job) calls the extractor and only adds the organization_id key when the value is non-nil.

Organization#organization_id is added as a self-id accessor next to the existing #organization self-returning helper, so the walker can rely on the uniform respond_to?(:organization_id) interface without special-casing the model.

Argument truncation, backtrace inclusion, and the related sensitive-payload scrubbing are intentionally left out of this PR. They are tracked separately so the truncation thresholds and the scrubbing policy can be decided together.

A Sidekiq client/server middleware was considered and dropped. It would have duplicated what the existing subscriber already covers through ActiveJob's Rails instrumentation, since the only producer of Sidekiq jobs in this codebase is ActiveJob.

The spec adds full-hash assertions for every new field on every affected event, plus organization_id extraction across all four sources (symbol-key hash, string-key hash, Organization instance, respond_to?(:organization_id)), the 0 falsy-value preservation, false and nil fall-through to the next argument, multi-argument walks, the job.arguments == nil defensive guard, and the rescue behaviour when an argument's organization_id accessor raises. 183 examples, 0 failures (subscriber + Organization model specs).

## Context Production incident INC-157 (Redis OOM) caused Sidekiq client pushes to be silently rejected. We had no queryable Loki/Grafana record of either enqueue failures or perform failures with enough metadata to reconcile lost jobs — only Sentry traces (sampled, slow to search) and Sidekiq dead jobs (also lossy). The existing JSON log subscriber already covered the log surface, but error entries lacked the fields needed to correlate failures with affected organizations and jobs. ## Description Error log entries emitted by ActiveJob::JsonLogSubscriber are now enriched with: - `arguments`: the formatted job arguments (respecting `log_arguments?`), truncated to 1000 chars with a `"… (truncated)"` suffix when the formatted string exceeds the cap. - `attempt_count`: `job.executions` on perform error, retry stopped, and discard events. The previous `retries` key on `retry_stopped` was renamed to `attempt_count` for consistency across error events. - `organization_id`: best-effort extraction from `job.arguments` — hash with `:organization_id`/`"organization_id"`, an `Organization` instance, or any arg responding to `organization_id`. The key is omitted when extraction returns nil and the whole lookup is wrapped to never raise. - `exception.backtrace`: first 10 frames of the exception backtrace. The key is omitted entirely when the backtrace is nil or empty. - `job_id` and `queue` added to enqueue error and discard log entries respectively. The exception payload construction was extracted into a private `exception_payload(ex)` helper used at every error site.

## Context The `organization_id_from` helper used `||` and `if value` checks to short-circuit the lookup. When a hash argument carried `organization_id: false` or `organization_id: 0`, the lookup silently fell through to the next argument or returned `nil`, producing wrong-value or missing-key log entries. The downstream `merge_organization_id` helper had the same problem and dropped non-nil falsy values from the payload. The spec file also introduced a top-level `describe "organization_id extraction"` block sitting outside the per-method describes, and several new contexts asserted on individual keys (`logs.first["organization_id"]`, `logs.first["exception"].not_to have_key("backtrace")`) instead of the full-hash `eq({...})` shape used everywhere else in the file. Job-setup boilerplate was repeated across the new contexts. ## Description The `organization_id_from` helper now distinguishes "key present but the value is nil" from "key absent" using `Hash#key?` and an explicit `unless value.nil?` guard on every return path, including the `respond_to?(:organization_id)` branch. An explicit `return nil if job.arguments.nil?` guard was added at the top. The existing `rescue StandardError` is kept as a final safety net. The `merge_organization_id` helper switches from `if org_id` to `unless org_id.nil?` so that a deliberate `false` or `0` is preserved in the log entry while `nil` continues to skip the key. The new organization_id contexts in the spec were consolidated into `describe "#perform"` (alongside the existing per-method contexts in `#retry_stopped` and `#discard`) and the standalone top-level block was removed. Every new assertion now compares the full log entry against a single expected hash, matching the convention established by the other tests. A `build_job_with_args` helper was added next to `build_job` to replace the repeated `TestLogJobWithArgs.new(...).tap { ... }` boilerplate. Coverage for the falsy-value contract was added: `organization_id: false` (preserved), `organization_id: 0` (preserved), and `organization_id: nil` (falls through to the next argument).

## Context The ING-136 enrichments to `ActiveJob::JsonLogSubscriber` add `arguments`, `attempt_count`, `organization_id`, and truncated `backtrace` fields to every error log path (enqueue, enqueue_all per-job, enqueue_at, perform, retry_stopped, discard). Several defensive guards and boundary conditions in the implementation lacked regression coverage. ## Description Additional specs were added to pin the previously untested behaviours: - Boundary tests for `ARGUMENTS_MAX_LENGTH = 1000` (exactly 1000 chars passes through, 1001 chars triggers truncation). - Boundary tests for `BACKTRACE_MAX_FRAMES = 10` (exactly 10 frames is not truncated, 11 frames is truncated to 10). - `organization_id` extraction from an `Organization` instance argument via `arg.id`. - Multi-argument walking when the first argument does not carry an `organization_id` and a later one does. - The `rescue StandardError` path when an argument's `organization_id` accessor raises (log is still emitted without the key). - The explicit `job.arguments.nil?` guard. - Cross-event enrichment coverage for `enqueue` error, `enqueue_at` error, the per-job `enqueue_error` branch inside `enqueue_all`, plus backtrace truncation on `retry_stopped` and `discard`.

## Context The previous commit renamed the `retries` key in the `retry_stopped` JSON log entry to `attempt_count` for vocabulary consistency with the other error events (`#perform` error, `#discard`). External Loki / Grafana dashboards and alerts may already read the `retries` key, so renaming it is a breaking change to a public log surface that the codebase cannot fully verify from the inside. ## Description The `retry_stopped` payload reverts to `retries: job.executions`, restoring backward compatibility for any downstream consumer. The `#perform` error and `#discard` events keep their new `attempt_count` field — those keys did not exist before this branch, so adding them is purely additive. The vocabulary inconsistency between `retries` (on retry_stopped) and `attempt_count` (on perform / discard) is accepted in exchange for not silently breaking existing dashboards. Spec assertions for `retry_stopped` updated accordingly.

## Context The first commit introduced an `attempt_count` field on `#perform` error and `#discard` logs while keeping the existing `retries` key on `#retry_stopped`. The vocabulary split (`retries` on one event, `attempt_count` on others) was inconsistent and unnecessary. ## Description The `#perform` error, `#retry_stopped`, and `#discard` payloads now all emit `retries: job.executions`, matching the key name that `#retry_stopped` already used before this branch. Backward compatibility for external dashboards reading `retries` is preserved, and consumers reading the new error events get a uniform vocabulary across all error-status entries. Spec assertions updated to match.

## Context The first commit added an `ARGUMENTS_MAX_LENGTH`-based truncation on the formatted arguments string and a `BACKTRACE_MAX_FRAMES`-based truncation on the exception backtrace. Both belong to a separate follow-up that should also decide on the truncation thresholds, the scrubbing policy for sensitive payload fields, and whether the backtrace should be included on error log entries at all. ## Description The two constants and the truncation branches in `args_info` and `exception_payload` are removed. `args_info` reverts to the original behaviour (formatted string with no length cap). `exception_payload` returns only the `class` and `message` keys; the `backtrace` is no longer added. The corresponding spec contexts (argument length boundaries, backtrace truncation on `#perform` / `#retry_stopped` / `#discard`, and the empty-/nil-backtrace omissions) are removed. 38 examples, 0 failures.

## Context The earlier commits enriched the error-status entries on `#enqueue`, `#perform`, `#retry_stopped`, and `#discard`, but `#enqueue_retry` was left untouched. The retry log is fired with an active error context (`event.payload[:error]`) on every retry attempt, so it matters as much as the other error events for reconciliation. It was also missing `queue`, `arguments`, and `organization_id`. The `organization_id_from` walker also had a special-case branch for `Organization` instances (`arg.is_a?(Organization) ? arg.id`) that disappears once `Organization` itself exposes `#organization_id` the same way it already exposes `#organization`. ## Description `Organization#organization_id` is added next to the existing `#organization` self-returning helper. The walker drops the `is_a?(Organization)` branch and uniformly relies on `respond_to?(:organization_id)`, which `Organization` now satisfies. The Hash lookup is collapsed to `arg[:organization_id].presence || arg["organization_id"].presence` and the walker itself is rewritten around `Array#find`. The spec context that pinned `false` as a valid organization_id is updated to reflect the `.presence` semantics (blank values fall through to the next argument). `#enqueue_retry` now emits `queue`, `arguments`, and `organization_id` on both its error and success branches, sourced through the same `args_info` / `exception_payload` / `merge_organization_id` helpers used by the other error events. The `execution` key is kept untouched on this event (it's a per-retry counter, distinct from the `retries` total on terminal events). Existing spec assertions updated and a new context added for the organization_id extraction on retry-error logs.

vincent-pochet

Nice

groyoh added 7 commits May 18, 2026 09:49

vincent-pochet approved these changes May 21, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(observability): enrich ActiveJob error logs (ING-136)#5521

feat(observability): enrich ActiveJob error logs (ING-136)#5521
groyoh wants to merge 7 commits into
mainfrom
feat/ing-136/log-sidekiq-errors

groyoh commented May 18, 2026 •

edited

Loading

Uh oh!

vincent-pochet left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

groyoh commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Context

Description

Uh oh!

vincent-pochet left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

groyoh commented May 18, 2026 •

edited

Loading