feat(observability): enrich ActiveJob error logs (ING-136)#5521
Draft
groyoh wants to merge 7 commits into
Draft
Conversation
## Context Production incident INC-157 (Redis OOM) caused Sidekiq client pushes to be silently rejected. We had no queryable Loki/Grafana record of either enqueue failures or perform failures with enough metadata to reconcile lost jobs — only Sentry traces (sampled, slow to search) and Sidekiq dead jobs (also lossy). The existing JSON log subscriber already covered the log surface, but error entries lacked the fields needed to correlate failures with affected organizations and jobs. ## Description Error log entries emitted by ActiveJob::JsonLogSubscriber are now enriched with: - `arguments`: the formatted job arguments (respecting `log_arguments?`), truncated to 1000 chars with a `"… (truncated)"` suffix when the formatted string exceeds the cap. - `attempt_count`: `job.executions` on perform error, retry stopped, and discard events. The previous `retries` key on `retry_stopped` was renamed to `attempt_count` for consistency across error events. - `organization_id`: best-effort extraction from `job.arguments` — hash with `:organization_id`/`"organization_id"`, an `Organization` instance, or any arg responding to `organization_id`. The key is omitted when extraction returns nil and the whole lookup is wrapped to never raise. - `exception.backtrace`: first 10 frames of the exception backtrace. The key is omitted entirely when the backtrace is nil or empty. - `job_id` and `queue` added to enqueue error and discard log entries respectively. The exception payload construction was extracted into a private `exception_payload(ex)` helper used at every error site.
## Context
The `organization_id_from` helper used `||` and `if value` checks to short-circuit the lookup. When a hash argument carried `organization_id: false` or `organization_id: 0`, the lookup silently fell through to the next argument or returned `nil`, producing wrong-value or missing-key log entries. The downstream `merge_organization_id` helper had the same problem and dropped non-nil falsy values from the payload.
The spec file also introduced a top-level `describe "organization_id extraction"` block sitting outside the per-method describes, and several new contexts asserted on individual keys (`logs.first["organization_id"]`, `logs.first["exception"].not_to have_key("backtrace")`) instead of the full-hash `eq({...})` shape used everywhere else in the file. Job-setup boilerplate was repeated across the new contexts.
## Description
The `organization_id_from` helper now distinguishes "key present but the value is nil" from "key absent" using `Hash#key?` and an explicit `unless value.nil?` guard on every return path, including the `respond_to?(:organization_id)` branch. An explicit `return nil if job.arguments.nil?` guard was added at the top. The existing `rescue StandardError` is kept as a final safety net.
The `merge_organization_id` helper switches from `if org_id` to `unless org_id.nil?` so that a deliberate `false` or `0` is preserved in the log entry while `nil` continues to skip the key.
The new organization_id contexts in the spec were consolidated into `describe "#perform"` (alongside the existing per-method contexts in `#retry_stopped` and `#discard`) and the standalone top-level block was removed. Every new assertion now compares the full log entry against a single expected hash, matching the convention established by the other tests. A `build_job_with_args` helper was added next to `build_job` to replace the repeated `TestLogJobWithArgs.new(...).tap { ... }` boilerplate. Coverage for the falsy-value contract was added: `organization_id: false` (preserved), `organization_id: 0` (preserved), and `organization_id: nil` (falls through to the next argument).
## Context The ING-136 enrichments to `ActiveJob::JsonLogSubscriber` add `arguments`, `attempt_count`, `organization_id`, and truncated `backtrace` fields to every error log path (enqueue, enqueue_all per-job, enqueue_at, perform, retry_stopped, discard). Several defensive guards and boundary conditions in the implementation lacked regression coverage. ## Description Additional specs were added to pin the previously untested behaviours: - Boundary tests for `ARGUMENTS_MAX_LENGTH = 1000` (exactly 1000 chars passes through, 1001 chars triggers truncation). - Boundary tests for `BACKTRACE_MAX_FRAMES = 10` (exactly 10 frames is not truncated, 11 frames is truncated to 10). - `organization_id` extraction from an `Organization` instance argument via `arg.id`. - Multi-argument walking when the first argument does not carry an `organization_id` and a later one does. - The `rescue StandardError` path when an argument's `organization_id` accessor raises (log is still emitted without the key). - The explicit `job.arguments.nil?` guard. - Cross-event enrichment coverage for `enqueue` error, `enqueue_at` error, the per-job `enqueue_error` branch inside `enqueue_all`, plus backtrace truncation on `retry_stopped` and `discard`.
## Context The previous commit renamed the `retries` key in the `retry_stopped` JSON log entry to `attempt_count` for vocabulary consistency with the other error events (`#perform` error, `#discard`). External Loki / Grafana dashboards and alerts may already read the `retries` key, so renaming it is a breaking change to a public log surface that the codebase cannot fully verify from the inside. ## Description The `retry_stopped` payload reverts to `retries: job.executions`, restoring backward compatibility for any downstream consumer. The `#perform` error and `#discard` events keep their new `attempt_count` field — those keys did not exist before this branch, so adding them is purely additive. The vocabulary inconsistency between `retries` (on retry_stopped) and `attempt_count` (on perform / discard) is accepted in exchange for not silently breaking existing dashboards. Spec assertions for `retry_stopped` updated accordingly.
## Context The first commit introduced an `attempt_count` field on `#perform` error and `#discard` logs while keeping the existing `retries` key on `#retry_stopped`. The vocabulary split (`retries` on one event, `attempt_count` on others) was inconsistent and unnecessary. ## Description The `#perform` error, `#retry_stopped`, and `#discard` payloads now all emit `retries: job.executions`, matching the key name that `#retry_stopped` already used before this branch. Backward compatibility for external dashboards reading `retries` is preserved, and consumers reading the new error events get a uniform vocabulary across all error-status entries. Spec assertions updated to match.
## Context The first commit added an `ARGUMENTS_MAX_LENGTH`-based truncation on the formatted arguments string and a `BACKTRACE_MAX_FRAMES`-based truncation on the exception backtrace. Both belong to a separate follow-up that should also decide on the truncation thresholds, the scrubbing policy for sensitive payload fields, and whether the backtrace should be included on error log entries at all. ## Description The two constants and the truncation branches in `args_info` and `exception_payload` are removed. `args_info` reverts to the original behaviour (formatted string with no length cap). `exception_payload` returns only the `class` and `message` keys; the `backtrace` is no longer added. The corresponding spec contexts (argument length boundaries, backtrace truncation on `#perform` / `#retry_stopped` / `#discard`, and the empty-/nil-backtrace omissions) are removed. 38 examples, 0 failures.
## Context The earlier commits enriched the error-status entries on `#enqueue`, `#perform`, `#retry_stopped`, and `#discard`, but `#enqueue_retry` was left untouched. The retry log is fired with an active error context (`event.payload[:error]`) on every retry attempt, so it matters as much as the other error events for reconciliation. It was also missing `queue`, `arguments`, and `organization_id`. The `organization_id_from` walker also had a special-case branch for `Organization` instances (`arg.is_a?(Organization) ? arg.id`) that disappears once `Organization` itself exposes `#organization_id` the same way it already exposes `#organization`. ## Description `Organization#organization_id` is added next to the existing `#organization` self-returning helper. The walker drops the `is_a?(Organization)` branch and uniformly relies on `respond_to?(:organization_id)`, which `Organization` now satisfies. The Hash lookup is collapsed to `arg[:organization_id].presence || arg["organization_id"].presence` and the walker itself is rewritten around `Array#find`. The spec context that pinned `false` as a valid organization_id is updated to reflect the `.presence` semantics (blank values fall through to the next argument). `#enqueue_retry` now emits `queue`, `arguments`, and `organization_id` on both its error and success branches, sourced through the same `args_info` / `exception_payload` / `merge_organization_id` helpers used by the other error events. The `execution` key is kept untouched on this event (it's a per-retry counter, distinct from the `retries` total on terminal events). Existing spec assertions updated and a new context added for the organization_id extraction on retry-error logs.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Context
After the load-test incident, reconciling which jobs were lost from the Sidekiq dead-jobs queue and the Sentry traces was harder than it should have been because the structured JSON log entries our
ActiveJob::JsonLogSubscriberalready emits did not include enough metadata to map a failed job back to its inputs. The error log entries did not carry the job'sarguments, so the entries we could see told us that a job class had failed for a queue but not which row, customer, or invoice it had been called with. Theorganization_idwas also missing, which meant we could not slice reconciliation queries per organization.The logging surface itself already exists. The JSON subscriber attaches in
config/application.rband fires on every enqueue, perform, retry, and discard event. The gap is on the fields included in those entries, not on whether the events are captured at all.Description
The error-status entries from
ActiveJob::JsonLogSubscribernow carry the missing fields.enqueueerrors,enqueue_allper-job errors,enqueue_aterrors,performerrors,enqueue_retry,retry_stopped, anddiscardall emitqueue,arguments, andorganization_id(when extractable). The execution-count field is now emitted under the same key (retries) on#performerror,#retry_stopped, and#discard;#retry_stoppedalready usedretriesbefore this branch, so the other two events were aligned to that name.#enqueue_retrykeeps its existingexecutionkey (it is a per-attempt retry counter, distinct from the terminalretriestotal).#discardalso gained the previously-missingqueuefield.The organization_id walker is sourced through two new private helpers.
organization_id_in(arg)returns the org_id for a single argument: for aHash, it falls back toarg[:organization_id].presence || arg["organization_id"].presence; for any other arg, it callsarg.organization_idwhen the receiver responds to it.organization_id_from(job)usesArray#findoverjob.argumentsto return the first non-nil match, with a top-levelrescue StandardErrorso a buggy accessor on an argument cannot break logging.merge_organization_id(payload, job)calls the extractor and only adds theorganization_idkey when the value is non-nil.Organization#organization_idis added as a self-id accessor next to the existing#organizationself-returning helper, so the walker can rely on the uniformrespond_to?(:organization_id)interface without special-casing the model.Argument truncation, backtrace inclusion, and the related sensitive-payload scrubbing are intentionally left out of this PR. They are tracked separately so the truncation thresholds and the scrubbing policy can be decided together.
A Sidekiq client/server middleware was considered and dropped. It would have duplicated what the existing subscriber already covers through ActiveJob's Rails instrumentation, since the only producer of Sidekiq jobs in this codebase is ActiveJob.
The spec adds full-hash assertions for every new field on every affected event, plus
organization_idextraction across all four sources (symbol-key hash, string-key hash,Organizationinstance,respond_to?(:organization_id)), the0falsy-value preservation,falseandnilfall-through to the next argument, multi-argument walks, thejob.arguments == nildefensive guard, and the rescue behaviour when an argument'sorganization_idaccessor raises. 183 examples, 0 failures (subscriber + Organization model specs).