Skip to content

fix: downgrade library err instrumentation from ERROR to WARN#712

Merged
vindard merged 2 commits into
mainfrom
fix/downgrade-err-instrumentation-to-warn
Apr 23, 2026
Merged

fix: downgrade library err instrumentation from ERROR to WARN#712
vindard merged 2 commits into
mainfrom
fix/downgrade-err-instrumentation-to-warn

Conversation

@vindard
Copy link
Copy Markdown
Contributor

@vindard vindard commented Apr 23, 2026

Summary

Zenduty incident #1505 (2026-04-22T06:33:14Z) paged on-call for a velocity limit enforcement rejection — a user hitting a spending cap, not an infrastructure failure. The alert fired because cala-ledger uses #[instrument(..., err)] on functions like update_balances_with_limit_enforcement_in_op and its children (velocity_limit.enforce, velocity_limit.window_for_enforcement). Bare err emits ERROR-level span events. The galoy-staging-lana-errors Honeycomb trigger filters on error = true AND level = ERROR and forwards to Zenduty — so the library's severity decision became an on-call page.

The deeper problem: cala is a library. It shouldn't unilaterally decide that its errors are operational emergencies. Whether a velocity rejection, a CEL parse failure, or a template validation error is page-worthy depends on application context that only the consumer (lana-bank) has. An audit found 14 functions across cala-ledger, cala-cel-parser, and cala-cel-interpreter using bare err — all potential sources of spurious alerts that the application layer cannot suppress.

lana-bank already classifies velocity enforcement errors correctly at its layer — ManualTransactionLedgerError maps VelocityError::Enforcement to Level::WARN via the ErrorSeverity trait, and the #[record_error_severity] macro on manual_transaction.post emits WARN-level events accordingly. The problem was that cala's #[instrument(..., err)] emitted its own ERROR-level span events underneath, before lana-bank's macro ever got a chance to classify the error. Both layers produced span events for the same error, but cala's was ERROR while lana-bank's was WARN — and the trigger matched cala's.

This PR changes every bare err to err(level = tracing::Level::WARN). Error details remain in traces for debugging; only the severity drops below the alerting threshold. The application layer can re-instrument at ERROR where it has the context to distinguish operational failures from expected business rejections.

Test plan

  • Verify no bare err remains: rg '#\[instrument.*\berr\)\]' --type rust returns zero matches
  • Confirm velocity enforcement traces still appear in Honeycomb at WARN level after deploy
  • Confirm galoy-staging-lana-errors trigger no longer fires for velocity limit rejections

🤖 Generated with Claude Code

vindard and others added 2 commits April 23, 2026 11:01
… to WARN

Velocity limit enforcement failures ("Velocity limit exceeded") are
expected business logic rejections — a user hitting a spending cap, not
an operational failure. The `#[instrument(..., err)]` attribute on
`update_balances_with_limit_enforcement_in_op` and its child functions
`velocity_limit.enforce` and `velocity_limit.window_for_enforcement`
were emitting ERROR-level span events for these rejections. Our
Honeycomb trigger (`galoy-staging-lana-errors`) filters on
`error = true AND level = ERROR` and forwards matches to Zenduty, so
every legitimate velocity rejection paged on-call (e.g. incident #1505,
2026-04-22).

The root cause: cala-ledger, as a library, was making an opinionated
decision about error severity that belongs to the application layer.
Changing `err` to `err(level = tracing::Level::WARN)` on the parent and
both child spans preserves full trace visibility for debugging while
keeping these events below the alerting threshold.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… cala

Extends the velocity enforcement fix to every remaining `#[instrument(..., err)]`
in cala-ledger, cala-cel-parser, and cala-cel-interpreter. A library should not
unilaterally decide that its errors are operational emergencies — that decision
belongs to the application layer. Bare `err` emits ERROR-level span events that
match the `galoy-staging-lana-errors` Honeycomb trigger and page on-call via
Zenduty, even for expected conditions like validation failures or CEL parse errors.

All 12 remaining bare `err` attributes are now `err(level = tracing::Level::WARN)`.
Traces retain full error detail for debugging; only the severity drops below the
alerting threshold.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vindard vindard requested a review from pmartincalvo April 23, 2026 15:09
@github-actions
Copy link
Copy Markdown
Contributor

📊 Performance Report

Commit: 8723a61
Updated: 2026-04-23 15:25:11 UTC

Cala Performance Benchmark Results (non-representative)

Criterion Benchmark Results (single-threaded)

Benchmark Time per Run Throughput % vs Baseline
post_simple_transaction 6.904ms 144 tx/s 0 (baseline)
post_and_recalculate_ec_account_set 17.563ms 56 tx/s -154.0%
post_and_batch_recalculate_ec_account_set 13.116ms 76 tx/s -89.0%
post_multi_layer_transaction 9.713ms 102 tx/s -40.0%
post_simple_transaction_with_effective_balances 9.021ms 110 tx/s -30.0%
post_simple_transaction_with_skipped_velocity 6.972ms 143 tx/s -0.0%
post_simple_transaction_with_velocity 8.100ms 123 tx/s -17.0%
post_simple_transaction_with_hit_velocity 4.152ms 240 tx/s +39.0%
post_simple_transaction_with_one_account_set 7.360ms 135 tx/s -6.0%
post_simple_transaction_with_five_account_sets 8.709ms 114 tx/s -26.0%
post_simple_transaction_with_ec_account_set 6.441ms 155 tx/s +6.0%

Load Testing Results (parallel-execution)

Scenario tx/s
1 parallel 86.27
2 parallel 125.89
5 parallel 158.23
10 parallel 159.26
20 parallel 154.24
2 contention 116.51
5 contention 140.78
2 acct_sets 111.40
5 acct_sets 122.50

Note: Performance results may vary based on system resources and database state.

Last updated by commit 8723a61

@vindard vindard requested a review from a team April 23, 2026 15:37
@vindard vindard merged commit 67253d1 into main Apr 23, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants