Skip to content

feat(observability): enrich ActiveJob error logs (ING-136)#5521

Draft
groyoh wants to merge 7 commits into
mainfrom
feat/ing-136/log-sidekiq-errors
Draft

feat(observability): enrich ActiveJob error logs (ING-136)#5521
groyoh wants to merge 7 commits into
mainfrom
feat/ing-136/log-sidekiq-errors

Conversation

@groyoh
Copy link
Copy Markdown
Contributor

@groyoh groyoh commented May 18, 2026

Context

After the load-test incident, reconciling which jobs were lost from the Sidekiq dead-jobs queue and the Sentry traces was harder than it should have been because the structured JSON log entries our ActiveJob::JsonLogSubscriber already emits did not include enough metadata to map a failed job back to its inputs. The error log entries did not carry the job's arguments, so the entries we could see told us that a job class had failed for a queue but not which row, customer, or invoice it had been called with. The organization_id was also missing, which meant we could not slice reconciliation queries per organization.

The logging surface itself already exists. The JSON subscriber attaches in config/application.rb and fires on every enqueue, perform, retry, and discard event. The gap is on the fields included in those entries, not on whether the events are captured at all.

Description

The error-status entries from ActiveJob::JsonLogSubscriber now carry the missing fields. enqueue errors, enqueue_all per-job errors, enqueue_at errors, perform errors, enqueue_retry, retry_stopped, and discard all emit queue, arguments, and organization_id (when extractable). The execution-count field is now emitted under the same key (retries) on #perform error, #retry_stopped, and #discard; #retry_stopped already used retries before this branch, so the other two events were aligned to that name. #enqueue_retry keeps its existing execution key (it is a per-attempt retry counter, distinct from the terminal retries total). #discard also gained the previously-missing queue field.

The organization_id walker is sourced through two new private helpers. organization_id_in(arg) returns the org_id for a single argument: for a Hash, it falls back to arg[:organization_id].presence || arg["organization_id"].presence; for any other arg, it calls arg.organization_id when the receiver responds to it. organization_id_from(job) uses Array#find over job.arguments to return the first non-nil match, with a top-level rescue StandardError so a buggy accessor on an argument cannot break logging. merge_organization_id(payload, job) calls the extractor and only adds the organization_id key when the value is non-nil.

Organization#organization_id is added as a self-id accessor next to the existing #organization self-returning helper, so the walker can rely on the uniform respond_to?(:organization_id) interface without special-casing the model.

Argument truncation, backtrace inclusion, and the related sensitive-payload scrubbing are intentionally left out of this PR. They are tracked separately so the truncation thresholds and the scrubbing policy can be decided together.

A Sidekiq client/server middleware was considered and dropped. It would have duplicated what the existing subscriber already covers through ActiveJob's Rails instrumentation, since the only producer of Sidekiq jobs in this codebase is ActiveJob.

The spec adds full-hash assertions for every new field on every affected event, plus organization_id extraction across all four sources (symbol-key hash, string-key hash, Organization instance, respond_to?(:organization_id)), the 0 falsy-value preservation, false and nil fall-through to the next argument, multi-argument walks, the job.arguments == nil defensive guard, and the rescue behaviour when an argument's organization_id accessor raises. 183 examples, 0 failures (subscriber + Organization model specs).

groyoh added 7 commits May 18, 2026 09:49
## Context

Production incident INC-157 (Redis OOM) caused Sidekiq client pushes to be silently rejected. We had no queryable Loki/Grafana record of either enqueue failures or perform failures with enough metadata to reconcile lost jobs — only Sentry traces (sampled, slow to search) and Sidekiq dead jobs (also lossy). The existing JSON log subscriber already covered the log surface, but error entries lacked the fields needed to correlate failures with affected organizations and jobs.

## Description

Error log entries emitted by ActiveJob::JsonLogSubscriber are now enriched with:

- `arguments`: the formatted job arguments (respecting `log_arguments?`), truncated to 1000 chars with a `"… (truncated)"` suffix when the formatted string exceeds the cap.
- `attempt_count`: `job.executions` on perform error, retry stopped, and discard events. The previous `retries` key on `retry_stopped` was renamed to `attempt_count` for consistency across error events.
- `organization_id`: best-effort extraction from `job.arguments` — hash with `:organization_id`/`"organization_id"`, an `Organization` instance, or any arg responding to `organization_id`. The key is omitted when extraction returns nil and the whole lookup is wrapped to never raise.
- `exception.backtrace`: first 10 frames of the exception backtrace. The key is omitted entirely when the backtrace is nil or empty.
- `job_id` and `queue` added to enqueue error and discard log entries respectively.

The exception payload construction was extracted into a private `exception_payload(ex)` helper used at every error site.
## Context

The `organization_id_from` helper used `||` and `if value` checks to short-circuit the lookup. When a hash argument carried `organization_id: false` or `organization_id: 0`, the lookup silently fell through to the next argument or returned `nil`, producing wrong-value or missing-key log entries. The downstream `merge_organization_id` helper had the same problem and dropped non-nil falsy values from the payload.

The spec file also introduced a top-level `describe "organization_id extraction"` block sitting outside the per-method describes, and several new contexts asserted on individual keys (`logs.first["organization_id"]`, `logs.first["exception"].not_to have_key("backtrace")`) instead of the full-hash `eq({...})` shape used everywhere else in the file. Job-setup boilerplate was repeated across the new contexts.

## Description

The `organization_id_from` helper now distinguishes "key present but the value is nil" from "key absent" using `Hash#key?` and an explicit `unless value.nil?` guard on every return path, including the `respond_to?(:organization_id)` branch. An explicit `return nil if job.arguments.nil?` guard was added at the top. The existing `rescue StandardError` is kept as a final safety net.

The `merge_organization_id` helper switches from `if org_id` to `unless org_id.nil?` so that a deliberate `false` or `0` is preserved in the log entry while `nil` continues to skip the key.

The new organization_id contexts in the spec were consolidated into `describe "#perform"` (alongside the existing per-method contexts in `#retry_stopped` and `#discard`) and the standalone top-level block was removed. Every new assertion now compares the full log entry against a single expected hash, matching the convention established by the other tests. A `build_job_with_args` helper was added next to `build_job` to replace the repeated `TestLogJobWithArgs.new(...).tap { ... }` boilerplate. Coverage for the falsy-value contract was added: `organization_id: false` (preserved), `organization_id: 0` (preserved), and `organization_id: nil` (falls through to the next argument).
## Context

The ING-136 enrichments to `ActiveJob::JsonLogSubscriber` add `arguments`,
`attempt_count`, `organization_id`, and truncated `backtrace` fields to
every error log path (enqueue, enqueue_all per-job, enqueue_at, perform,
retry_stopped, discard). Several defensive guards and boundary
conditions in the implementation lacked regression coverage.

## Description

Additional specs were added to pin the previously untested behaviours:

- Boundary tests for `ARGUMENTS_MAX_LENGTH = 1000` (exactly 1000 chars
  passes through, 1001 chars triggers truncation).
- Boundary tests for `BACKTRACE_MAX_FRAMES = 10` (exactly 10 frames is
  not truncated, 11 frames is truncated to 10).
- `organization_id` extraction from an `Organization` instance argument
  via `arg.id`.
- Multi-argument walking when the first argument does not carry an
  `organization_id` and a later one does.
- The `rescue StandardError` path when an argument's `organization_id`
  accessor raises (log is still emitted without the key).
- The explicit `job.arguments.nil?` guard.
- Cross-event enrichment coverage for `enqueue` error, `enqueue_at`
  error, the per-job `enqueue_error` branch inside `enqueue_all`, plus
  backtrace truncation on `retry_stopped` and `discard`.
## Context

The previous commit renamed the `retries` key in the `retry_stopped`
JSON log entry to `attempt_count` for vocabulary consistency with the
other error events (`#perform` error, `#discard`). External Loki /
Grafana dashboards and alerts may already read the `retries` key, so
renaming it is a breaking change to a public log surface that the
codebase cannot fully verify from the inside.

## Description

The `retry_stopped` payload reverts to `retries: job.executions`,
restoring backward compatibility for any downstream consumer. The
`#perform` error and `#discard` events keep their new `attempt_count`
field — those keys did not exist before this branch, so adding them is
purely additive. The vocabulary inconsistency between `retries` (on
retry_stopped) and `attempt_count` (on perform / discard) is accepted
in exchange for not silently breaking existing dashboards. Spec
assertions for `retry_stopped` updated accordingly.
## Context

The first commit introduced an `attempt_count` field on `#perform`
error and `#discard` logs while keeping the existing `retries` key on
`#retry_stopped`. The vocabulary split (`retries` on one event,
`attempt_count` on others) was inconsistent and unnecessary.

## Description

The `#perform` error, `#retry_stopped`, and `#discard` payloads now
all emit `retries: job.executions`, matching the key name that
`#retry_stopped` already used before this branch. Backward
compatibility for external dashboards reading `retries` is preserved,
and consumers reading the new error events get a uniform vocabulary
across all error-status entries. Spec assertions updated to match.
## Context

The first commit added an `ARGUMENTS_MAX_LENGTH`-based truncation on
the formatted arguments string and a `BACKTRACE_MAX_FRAMES`-based
truncation on the exception backtrace. Both belong to a separate
follow-up that should also decide on the truncation thresholds, the
scrubbing policy for sensitive payload fields, and whether the
backtrace should be included on error log entries at all.

## Description

The two constants and the truncation branches in `args_info` and
`exception_payload` are removed. `args_info` reverts to the original
behaviour (formatted string with no length cap). `exception_payload`
returns only the `class` and `message` keys; the `backtrace` is no
longer added. The corresponding spec contexts (argument length
boundaries, backtrace truncation on `#perform` / `#retry_stopped` /
`#discard`, and the empty-/nil-backtrace omissions) are removed. 38
examples, 0 failures.
## Context

The earlier commits enriched the error-status entries on `#enqueue`,
`#perform`, `#retry_stopped`, and `#discard`, but `#enqueue_retry`
was left untouched. The retry log is fired with an active error
context (`event.payload[:error]`) on every retry attempt, so it
matters as much as the other error events for reconciliation. It was
also missing `queue`, `arguments`, and `organization_id`.

The `organization_id_from` walker also had a special-case branch
for `Organization` instances (`arg.is_a?(Organization) ? arg.id`)
that disappears once `Organization` itself exposes `#organization_id`
the same way it already exposes `#organization`.

## Description

`Organization#organization_id` is added next to the existing
`#organization` self-returning helper. The walker drops the
`is_a?(Organization)` branch and uniformly relies on
`respond_to?(:organization_id)`, which `Organization` now satisfies.
The Hash lookup is collapsed to `arg[:organization_id].presence ||
arg["organization_id"].presence` and the walker itself is rewritten
around `Array#find`. The spec context that pinned `false` as a valid
organization_id is updated to reflect the `.presence` semantics
(blank values fall through to the next argument).

`#enqueue_retry` now emits `queue`, `arguments`, and
`organization_id` on both its error and success branches, sourced
through the same `args_info` / `exception_payload` /
`merge_organization_id` helpers used by the other error events. The
`execution` key is kept untouched on this event (it's a per-retry
counter, distinct from the `retries` total on terminal events).
Existing spec assertions updated and a new context added for the
organization_id extraction on retry-error logs.
Copy link
Copy Markdown
Collaborator

@vincent-pochet vincent-pochet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants