Remove fallback cache; fix fallback deadlock, duplicate egress, and Aria hot paths #37

Merged
kPsarakis merged 6 commits into main from remove-fallback-cache on May 15, 2026

@kPsarakis
Member

@kPsarakis kPsarakis commented May 13, 2026

Summary

Started as a cleanup to drop the fallback result cache, then expanded into a series of correctness and performance fixes uncovered by TPC-C 1-warehouse runs.

Correctness

  • Remove the fallback cache to simplify the codebase (fe2dd4e).
  • Fix fallback deadlock on local chain branches. Fallback chain participants suspended on fallback_locking_event_map[d] while holding a slot in AIOTaskScheduler's 64-slot semaphore; those events fire only when other (queued) participants run, producing a cluster-wide stall under load. Adds create_unbounded_task and routes fallback dispatch and other always-sleepy handler paths through it (4315b37, 385633c).
  • Fix duplicate egress on rw-set rescheduling. When a fallback rw-set changed, the txn was marked failed but the egress block still fired because client_responses was populated. Skip egress when rw_changed=True; batch the post-fallback egress with a single send_batch() (385633c).
  • Relax has_fallback_rw_set_changed. Writes/reads of keys absent from committed state are insert-safe and no longer trigger reschedule (385633c).
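
The deadlock described above can be reproduced in miniature. The sketch below is an illustration under stated assumptions, not the Styx implementation: `TinyScheduler`, `run_demo`, and the 2-slot limit are hypothetical stand-ins, and only the name `create_unbounded_task` comes from this PR. Waiters hold a semaphore slot while suspended on an event that only a task queued behind the same semaphore can set.

```python
import asyncio
from collections.abc import Awaitable, Callable

Job = Callable[[], Awaitable[None]]


class TinyScheduler:
    """Hypothetical bounded scheduler; a stand-in, not the real AIOTaskScheduler."""

    def __init__(self, slots: int) -> None:
        self._sem = asyncio.Semaphore(slots)

    def create_task(self, job: Job) -> asyncio.Task:
        async def bounded() -> None:
            # The slot stays held for the whole job, including while it sleeps.
            async with self._sem:
                await job()

        return asyncio.create_task(bounded())

    def create_unbounded_task(self, job: Job) -> asyncio.Task:
        async def unbounded() -> None:
            # No slot is taken, so a sleeping job cannot starve other jobs.
            await job()

        return asyncio.create_task(unbounded())


async def run_demo(waiters_bounded: bool, slots: int = 2) -> bool:
    """Return True if every task finished, False if the run stalled."""
    sched = TinyScheduler(slots)
    unlocked = asyncio.Event()

    async def waiter() -> None:
        await unlocked.wait()  # suspended until the releaser runs

    async def releaser() -> None:
        unlocked.set()

    spawn_waiter = sched.create_task if waiters_bounded else sched.create_unbounded_task
    tasks = [spawn_waiter(waiter) for _ in range(slots)]
    # The releaser always goes through the bounded path, like ordinary work.
    tasks.append(sched.create_task(releaser))

    _, pending = await asyncio.wait(tasks, timeout=0.2)
    for task in pending:
        task.cancel()
    if pending:
        await asyncio.gather(*pending, return_exceptions=True)
    return not pending


if __name__ == "__main__":
    print("bounded waiters finish:", asyncio.run(run_demo(waiters_bounded=True)))
    print("unbounded waiters finish:", asyncio.run(run_demo(waiters_bounded=False)))
```

With bounded waiters, every slot is occupied by a task that is asleep, so the releaser never starts and the run times out. Moving only the always-sleepy path off the semaphore is enough to let the releaser through, which mirrors the shape of the fix without claiming to match its details.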

Performance

  • Drop redundant networking_locks from handlers without internal awaits (Ack, ChainAbort, ResponseToRoot, Unlock, DeterministicReordering, AriaCommit, AriaProcessingDone, SyncCleanup, AriaFallback*, RunFunRemote, WrongPartitionRequest, AsyncMigration, and the migration key-fetch trio in worker_service). Single-threaded asyncio already gives atomicity for sync critical sections (385633c).
  • Replace fractions.Fraction with float for chain ACK accounting. Roughly 100× cheaper per add; accumulated rounding error stays well within the ε=1e-9 tolerance for realistic chain depth/fan-out (safe to ~4.5M leaves; TPC-C chains have <1k) (385633c).
  • Replace one msgpack encode/decode roundtrip with a shallow dict(...) copy on the hot path; switch fallback egress from send_immediate to batched send (385633c).
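
The Fraction-to-float change above can be illustrated with a small model. This is a sketch under assumptions, not the code in this PR: it assumes each participant splits its ACK share evenly among its children and that the root treats a chain as complete once the acknowledged shares sum to 1 within ε. The names `leaf_shares` and `chain_complete` are hypothetical.

```python
import sys
from fractions import Fraction

EPSILON = 1e-9  # tolerance quoted in the PR description


def leaf_shares(share, fan_out: int, depth: int) -> list:
    """Split `share` evenly among `fan_out` children, `depth` levels deep.

    Works for both float and Fraction inputs; returns the leaf shares.
    """
    if depth == 0:
        return [share]
    leaves = []
    for _ in range(fan_out):
        leaves.extend(leaf_shares(share / fan_out, fan_out, depth - 1))
    return leaves


def chain_complete(acked: float) -> bool:
    """Float replacement for the exact `acked == 1` check (assumed form)."""
    return abs(acked - 1.0) < EPSILON


if __name__ == "__main__":
    exact = leaf_shares(Fraction(1), fan_out=3, depth=6)  # 3**6 = 729 leaves
    approx = leaf_shares(1.0, fan_out=3, depth=6)

    print("exact sum equals 1:", sum(exact) == 1)
    print("float sum within tolerance:", chain_complete(sum(approx)))
    print("one missing ACK detected:", not chain_complete(sum(approx[:-1])))
    # Tolerance divided by double-precision machine epsilon, for comparison
    # with the leaf bound quoted in the PR.
    print("EPSILON / machine epsilon:", EPSILON / sys.float_info.epsilon)
```

The final line lands close to the ~4.5M figure quoted above, though whether the bound was derived this way is a guess. The model also shows the other side of the trade-off: a missing ACK is only detectable while the smallest leaf share stays larger than ε.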

Misc

  • Version bumps for ruff, msgspec, aiokafka, confluent-kafka, boto3, prometheus-client, aiohttp, and base Python image (37bb8dc).

Test plan

  • Unit tests: pytest tests/unit/ — 726 passing
  • ruff check clean for all modified files
  • TPC-C 1-warehouse end-to-end run (manual)
  • YCSB-T end-to-end run (manual)
  • Verify no duplicate egress messages and exactly_once_output: true in result JSON

@codecov

codecov Bot commented May 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.03%. Comparing base (0d9fcab) to head (37bb8dc).

@@            Coverage Diff             @@
##             main      #37      +/-   ##
==========================================
- Coverage   88.14%   88.03%   -0.12%     
==========================================
  Files          45       45              
  Lines        2616     2590      -26     
==========================================
- Hits         2306     2280      -26     
  Misses        310      310              
| Flag | Coverage Δ |
|---|---|
| coordinator | 93.40% <ø> (ø) |
| integration | 9.30% <0.00%> (+0.09%) ⬆️ |
| styx-package | 84.60% <100.00%> (-0.37%) ⬇️ |
| worker | 83.69% <100.00%> (+0.19%) ⬆️ |

| Files with missing lines | Coverage Δ |
|---|---|
| styx-package/styx/common/base_networking.py | 94.73% <100.00%> (-0.84%) ⬇️ |
| styx-package/styx/common/message_types.py | 100.00% <ø> (ø) |
| styx-package/styx/common/operator.py | 100.00% <100.00%> (ø) |
| styx-package/styx/common/run_func_payload.py | 100.00% <100.00%> (ø) |
| styx-package/styx/common/stateful_function.py | 99.13% <100.00%> (-0.07%) ⬇️ |
| ...tyx-package/styx/common/util/aio_task_scheduler.py | 93.87% <100.00%> (+1.19%) ⬆️ |
| worker/operator_state/aria/base_aria_state.py | 92.56% <100.00%> (+0.52%) ⬆️ |

kPsarakis and others added 5 commits May 13, 2026 22:32
Setting internal=True on local recursive run_fallback_function calls so
they don't await waited_ack_events[t_id] from inside the same gather the
root is using to drive it. Surfaced after removing the fallback cache,
which previously short-circuited __call__ before __send_async_calls ran.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds explicit markers around: epoch boundaries, chain-ack gather,
sync_workers barriers, run_fallback_function entry/exit, fallback
strategy entry/exit, ack accumulation. Temporary — to pinpoint the
post-snapshot stall reported on TPC-C 1-warehouse.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When the rw-set check returns True we now print which partition
disagreed and the symmetric-difference of keys. The TPC-C 1-warehouse
stall on origin/main shows the reschedule list growing unboundedly, so
the question is: which keys are showing up in one phase but not the
other? Temporary.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Fix cluster-wide fallback deadlock: AIOTaskScheduler 64-slot semaphore
  was saturated by fallback chain participants waiting on locking events
  that fire only when other (queued) participants run. Add
  create_unbounded_task and use it for fallback dispatch and other
  handler paths that previously held a bounded slot while sleeping.

- Fix duplicate egress on rw-set rescheduling: when a fallback rw-set
  changed, the transaction was marked failed but the egress block still
  fired because client_responses was populated. Skip egress when
  rw_changed=True; batch fallback egress with a single send_batch().

- Relax has_fallback_rw_set_changed: treat writes/reads of keys absent
  from committed state as safe (insert-safe relaxation).

- Drop redundant networking_locks from handlers without internal awaits
  (Ack, ChainAbort, ResponseToRoot, Unlock, DeterministicReordering,
  AriaCommit, AriaProcessingDone, SyncCleanup, AriaFallback*,
  RunFunRemote, WrongPartitionRequest, AsyncMigration, and the migration
  key-fetch trio in worker_service). Single-threaded asyncio already
  gives atomicity for sync critical sections.

- Replace fractions.Fraction with float for chain ACK accounting.
  ~100x cheaper per add; ε=1e-9 tolerance is safely below realistic
  chain depth/fan-out limits (safe to ~4.5M leaves; TPC-C chains
  have <1k).

- Tests: cover create_unbounded_task, float-format ACK strings, both
  write- and read-side rw-set relaxations (new-key safe vs
  contended-key reschedule).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@kPsarakis kPsarakis changed the title from "Remove the fallback cache to simplify the codebase" to "Remove fallback cache; fix fallback deadlock, duplicate egress, and Aria hot paths" May 15, 2026
@kPsarakis kPsarakis self-assigned this May 15, 2026
@kPsarakis kPsarakis merged commit b1d735d into main May 15, 2026
8 checks passed
@kPsarakis kPsarakis deleted the remove-fallback-cache branch May 15, 2026 07:48