fix: downgrade library err instrumentation from ERROR to WARN #712
Merged
… to WARN
Velocity limit enforcement failures ("Velocity limit exceeded") are
expected business logic rejections — a user hitting a spending cap, not
an operational failure. The `#[instrument(..., err)]` attribute on
`update_balances_with_limit_enforcement_in_op` and its child functions
`velocity_limit.enforce` and `velocity_limit.window_for_enforcement`
were emitting ERROR-level span events for these rejections. Our
Honeycomb trigger (`galoy-staging-lana-errors`) filters on
`error = true AND level = ERROR` and forwards matches to Zenduty, so
every legitimate velocity rejection paged on-call (e.g. incident #1505,
2026-04-22).
The root cause: cala-ledger, as a library, was making an opinionated
decision about error severity that belongs to the application layer.
Changing `err` to `err(level = tracing::Level::WARN)` on the parent and
both child spans preserves full trace visibility for debugging while
keeping these events below the alerting threshold.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
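As a rough illustration of why lowering the level is enough to stop the pages, the sketch below models the trigger filter quoted in the commit message (`error = true AND level = ERROR`) in plain Rust. The `Level`, `SpanEvent`, and `trigger_matches` names are hypothetical and exist only for this example; they are not part of `tracing`, Honeycomb, or cala. It also assumes the WARN-level event still carries `error = true`, which the commit message does not state.

```rust
// Hypothetical model of a span event and the alert filter. These types are
// illustrative only and do not come from tracing, Honeycomb, or cala.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Level {
    Warn,
    Error,
}

#[derive(Debug)]
struct SpanEvent {
    // Assumption: the event is still flagged as an error after the downgrade.
    error: bool,
    level: Level,
}

/// Mirrors the trigger filter from the commit message:
/// `error = true AND level = ERROR`.
fn trigger_matches(event: &SpanEvent) -> bool {
    event.error && event.level == Level::Error
}

fn main() {
    // Modeled after what bare `err` emits for a velocity rejection.
    let before = SpanEvent { error: true, level: Level::Error };
    // Modeled after `err(level = tracing::Level::WARN)` for the same rejection.
    let after = SpanEvent { error: true, level: Level::Warn };

    println!("bare err matches trigger: {}", trigger_matches(&before));
    println!("WARN err matches trigger: {}", trigger_matches(&after));
}
```

Under these assumptions the first event matches the filter and the second does not, while both remain visible in the trace.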
… cala
Extends the velocity enforcement fix to every remaining `#[instrument(..., err)]` in cala-ledger, cala-cel-parser, and cala-cel-interpreter.
A library should not unilaterally decide that its errors are operational emergencies; that decision belongs to the application layer. Bare `err` emits ERROR-level span events that match the `galoy-staging-lana-errors` Honeycomb trigger and page on-call via Zenduty, even for expected conditions like validation failures or CEL parse errors.
All 12 remaining bare `err` attributes are now `err(level = tracing::Level::WARN)`. Traces retain full error detail for debugging; only the severity drops below the alerting threshold.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor
📊 Performance Report
Commit: 8723a61
Cala Performance Benchmark Results (non-representative)
Criterion Benchmark Results (single-threaded)
Load Testing Results (parallel-execution)
Note: Performance results may vary based on system resources and database state. Last updated by commit 8723a61
nicolasburtey
approved these changes
Apr 23, 2026
pmartincalvo
approved these changes
Apr 23, 2026
Summary
Zenduty incident #1505 (2026-04-22T06:33:14Z) paged on-call for a velocity limit enforcement rejection: a user hitting a spending cap, not an infrastructure failure. The alert fired because cala-ledger uses `#[instrument(..., err)]` on functions like `update_balances_with_limit_enforcement_in_op` and its children (`velocity_limit.enforce`, `velocity_limit.window_for_enforcement`). Bare `err` emits ERROR-level span events. The `galoy-staging-lana-errors` Honeycomb trigger filters on `error = true AND level = ERROR` and forwards to Zenduty, so the library's severity decision became an on-call page.
The deeper problem: cala is a library. It shouldn't unilaterally decide that its errors are operational emergencies. Whether a velocity rejection, a CEL parse failure, or a template validation error is page-worthy depends on application context that only the consumer (lana-bank) has. An audit found 14 functions across cala-ledger, cala-cel-parser, and cala-cel-interpreter using bare `err`, all potential sources of spurious alerts that the application layer cannot suppress.
lana-bank already classifies velocity enforcement errors correctly at its layer: `ManualTransactionLedgerError` maps `VelocityError::Enforcement` to `Level::WARN` via the `ErrorSeverity` trait, and the `#[record_error_severity]` macro on `manual_transaction.post` emits WARN-level events accordingly. The problem was that cala's `#[instrument(..., err)]` emitted its own ERROR-level span events underneath, before lana-bank's macro ever got a chance to classify the error. Both layers produced span events for the same error, but cala's was ERROR while lana-bank's was WARN, and the trigger matched cala's.
This PR changes every bare `err` to `err(level = tracing::Level::WARN)`. Error details remain in traces for debugging; only the severity drops below the alerting threshold. The application layer can re-instrument at ERROR where it has the context to distinguish operational failures from expected business rejections.
Test plan
- No bare `err` remains: `rg '#\[instrument.*\berr\)\]' --type rust` returns zero matches
- `galoy-staging-lana-errors` trigger no longer fires for velocity limit rejections
🤖 Generated with Claude Code
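The application-layer classification mentioned in the summary could look roughly like the sketch below. It is a guess at the shape of the idea, not lana-bank's actual code: the real `ErrorSeverity` trait signature and the variants of `ManualTransactionLedgerError` and `VelocityError` are not shown in this PR, so the `severity` method, the local `Level` enum, and the `Database` variant are all hypothetical.

```rust
use std::fmt;

// Hypothetical stand-in for tracing::Level, kept local so the sketch is
// self-contained.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Level {
    Warn,
    Error,
}

// Hypothetical library-side error; only `Enforcement` is named in the PR.
#[derive(Debug)]
enum VelocityError {
    Enforcement,
}

// Hypothetical application-side error wrapping the library error.
#[derive(Debug)]
enum ManualTransactionLedgerError {
    Velocity(VelocityError),
    // Assumed variant representing a genuine operational failure.
    Database(String),
}

impl fmt::Display for ManualTransactionLedgerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::Velocity(VelocityError::Enforcement) => write!(f, "Velocity limit exceeded"),
            Self::Database(msg) => write!(f, "database error: {msg}"),
        }
    }
}

// Assumed shape of the trait: the application decides how severe each error is.
trait ErrorSeverity {
    fn severity(&self) -> Level;
}

impl ErrorSeverity for ManualTransactionLedgerError {
    fn severity(&self) -> Level {
        match self {
            // Expected business rejection: visible in traces, below the alert threshold.
            Self::Velocity(VelocityError::Enforcement) => Level::Warn,
            // Operational failure: the application chooses to alert on it.
            Self::Database(_) => Level::Error,
        }
    }
}

fn main() {
    let errors = [
        ManualTransactionLedgerError::Velocity(VelocityError::Enforcement),
        ManualTransactionLedgerError::Database("connection reset".to_string()),
    ];
    for e in &errors {
        println!("{:?}: {}", e.severity(), e);
    }
}
```

The design point is that the mapping from error to severity lives with the consumer, which knows which failures are expected; the library only reports that something went wrong.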