Remove fallback cache; fix fallback deadlock, duplicate egress, and Aria hot paths #37
Merged
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #37 +/- ##
==========================================
- Coverage 88.14% 88.03% -0.12%
==========================================
Files 45 45
Lines 2616 2590 -26
==========================================
- Hits 2306 2280 -26
Misses 310 310
Setting internal=True on local recursive run_fallback_function calls so they don't await waited_ack_events[t_id] from inside the same gather the root is using to drive it. Surfaced after removing the fallback cache, which previously short-circuited __call__ before __send_async_calls ran. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds explicit markers around: epoch boundaries, chain-ack gather, sync_workers barriers, run_fallback_function entry/exit, fallback strategy entry/exit, ack accumulation. Temporary — to pinpoint the post-snapshot stall reported on TPC-C 1-warehouse. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When the rw-set check returns True we now print which partition disagreed and the symmetric-difference of keys. The TPC-C 1-warehouse stall on origin/main shows the reschedule list growing unboundedly, so the question is: which keys are showing up in one phase but not the other? Temporary. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
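A minimal sketch of the kind of diagnostic this commit describes, assuming each phase's rw-set is available as a mapping from partition id to a set of keys. The function name `rw_set_disagreements` and the sample keys are hypothetical and do not come from the repository.

```python
from typing import Dict, Set


def rw_set_disagreements(
    first: Dict[int, Set[str]], second: Dict[int, Set[str]]
) -> Dict[int, Set[str]]:
    """Return, per partition, the keys that appear in exactly one of the two phases."""
    report: Dict[int, Set[str]] = {}
    for partition in sorted(first.keys() | second.keys()):
        # Symmetric difference: keys present in one phase but not the other.
        diff = first.get(partition, set()) ^ second.get(partition, set())
        if diff:
            report[partition] = diff
    return report


# Hypothetical rw-sets: the second phase touches one extra key on partition 0.
first_pass = {0: {"w:1", "d:1:3"}, 1: {"s:42"}}
second_pass = {0: {"w:1", "d:1:3", "o:1:3:77"}, 1: {"s:42"}}

for partition, keys in rw_set_disagreements(first_pass, second_pass).items():
    print(f"partition {partition} disagreed on keys: {sorted(keys)}")
```

Partitions whose key sets match in both phases are left out of the report, so an empty result means no partition disagreed.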
- Fix cluster-wide fallback deadlock: AIOTaskScheduler 64-slot semaphore was saturated by fallback chain participants waiting on locking events that fire only when other (queued) participants run. Add create_unbounded_task and use it for fallback dispatch and other handler paths that previously held a bounded slot while sleeping.
- Fix duplicate egress on rw-set rescheduling: when a fallback rw-set changed, the transaction was marked failed but the egress block still fired because client_responses was populated. Skip egress when rw_changed=True; batch fallback egress with a single send_batch().
- Relax has_fallback_rw_set_changed: treat writes/reads of keys absent from committed state as safe (insert-safe relaxation).
- Drop redundant networking_locks from handlers without internal awaits (Ack, ChainAbort, ResponseToRoot, Unlock, DeterministicReordering, AriaCommit, AriaProcessingDone, SyncCleanup, AriaFallback*, RunFunRemote, WrongPartitionRequest, AsyncMigration, and the migration key-fetch trio in worker_service). Single-threaded asyncio already gives atomicity for sync critical sections.
- Replace fractions.Fraction with float for chain ACK accounting. ~100x cheaper per add; ε=1e-9 tolerance is safely below realistic chain depth/fan-out limits (safe to ~4.5M leaves; TPC-C chains have <1k).
- Tests: cover create_unbounded_task, float-format ACK strings, both write- and read-side rw-set relaxations (new-key safe vs contended-key reschedule).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
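The cluster-wide fallback deadlock described in this commit can be reproduced in miniature. The sketch below is not the project's AIOTaskScheduler: `BoundedScheduler` is a hypothetical stand-in with a single slot instead of 64, and only the method name `create_unbounded_task` is taken from the commit message. A waiter that holds the only slot while sleeping on an event starves the task that would set that event.

```python
import asyncio


class BoundedScheduler:
    """Hypothetical stand-in for a semaphore-bounded task scheduler."""

    def __init__(self, slots: int) -> None:
        self._sem = asyncio.Semaphore(slots)
        self._tasks: set = set()

    def _track(self, coro) -> asyncio.Task:
        # Keep a strong reference so the task is not garbage collected early.
        task = asyncio.create_task(coro)
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return task

    def create_task(self, coro) -> asyncio.Task:
        async def bounded():
            try:
                async with self._sem:  # the slot is held for the whole run
                    return await coro
            finally:
                coro.close()  # no-op if finished; avoids a never-awaited warning

        return self._track(bounded())

    def create_unbounded_task(self, coro) -> asyncio.Task:
        # Bypass the semaphore for work that mostly sleeps on events.
        return self._track(coro)


async def waiter(event: asyncio.Event) -> str:
    await event.wait()
    return "woken"


async def setter(event: asyncio.Event) -> str:
    event.set()
    return "set"


async def run(use_unbounded: bool) -> bool:
    """Return True if both tasks finish, False if they stall until the timeout."""
    sched = BoundedScheduler(slots=1)
    event = asyncio.Event()
    spawn = sched.create_unbounded_task if use_unbounded else sched.create_task
    w = spawn(waiter(event))
    s = sched.create_task(setter(event))  # always needs the single slot
    try:
        await asyncio.wait_for(asyncio.gather(w, s), timeout=0.2)
        return True
    except asyncio.TimeoutError:
        return False
```

Calling `asyncio.run(run(False))` stalls until the timeout because the waiter occupies the slot the setter needs, while `asyncio.run(run(True))` completes. The trade-off is that unbounded tasks are no longer rate-limited, so this route only suits paths that spend their time sleeping rather than doing work.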
Summary
Started as a cleanup to drop the fallback result cache, then expanded into a series of correctness and performance fixes uncovered by TPC-C 1-warehouse runs.
Correctness
- [...] (fe2dd4e).
- Cluster-wide fallback deadlock. Fallback chain participants waited on fallback_locking_event_map[d] while holding a slot in AIOTaskScheduler's 64-slot semaphore; those events fire only when other (queued) participants run, producing a cluster-wide stall under load. Adds create_unbounded_task and routes fallback dispatch and other always-sleepy handler paths through it (4315b37, 385633c).
- Duplicate egress on rw-set rescheduling. When a fallback rw-set changed, the transaction was marked failed but the egress block still fired because client_responses was populated. Skip egress when rw_changed=True; batch the post-fallback egress with a single send_batch() (385633c).
- Relax has_fallback_rw_set_changed. Writes/reads of keys absent from committed state are insert-safe and no longer trigger reschedule (385633c).
Performance
- Drop redundant networking_locks from handlers without internal awaits (Ack, ChainAbort, ResponseToRoot, Unlock, DeterministicReordering, AriaCommit, AriaProcessingDone, SyncCleanup, AriaFallback*, RunFunRemote, WrongPartitionRequest, AsyncMigration, and the migration key-fetch trio in worker_service). Single-threaded asyncio already gives atomicity for sync critical sections (385633c).
- Replace fractions.Fraction with float for chain ACK accounting. ~100× cheaper per add; the ε=1e-9 tolerance leaves ample headroom for realistic chain depth/fan-out (safe to ~4.5M leaves; TPC-C chains have <1k) (385633c).
- [...] dict(...) copy on the hot path; switch fallback egress from send_immediate to batched send (385633c).
Misc
- [...] (37bb8dc).
Test plan
- pytest tests/unit/: 726 passing
- ruff check clean for all modified files
- exactly_once_output: true in result JSON
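As an illustration of the Fraction-to-float change, here is a minimal sketch that assumes ACK shares are split evenly at each hop and that a chain counts as complete once the accumulated share is within ε of 1. The helpers `fan_out` and `is_complete` are hypothetical, not the project's API. The final line shows one plausible reading of the ~4.5M figure: the number of additions that fit under ε if each addition contributes at most one machine epsilon of error.

```python
import sys
from fractions import Fraction

EPSILON = 1e-9  # tolerance quoted in the PR description


def fan_out(shares, ways):
    """Split every share evenly across `ways` children (hypothetical helper)."""
    return [share / ways for share in shares for _ in range(ways)]


def is_complete(acked) -> bool:
    """Treat the chain as fully acknowledged once the total is within EPSILON of 1."""
    return abs(acked - 1) <= EPSILON


# A root share of 1 fans out 3 ways, then each child fans out 7 ways: 21 leaves.
exact_leaves = fan_out(fan_out([Fraction(1)], 3), 7)
float_leaves = fan_out(fan_out([1.0], 3), 7)

exact_total = sum(exact_leaves)  # exact rational arithmetic
float_total = sum(float_leaves)  # may differ from 1.0 by rounding error

print(exact_total == 1, is_complete(float_total))

# Rough bound (an assumption, not taken from the PR): additions that fit under
# EPSILON when each one adds at most one machine epsilon of error.
max_leaves = int(EPSILON / sys.float_info.epsilon)
print(max_leaves)
```

The float total is compared with a tolerance rather than for equality, which is what makes the cheaper arithmetic safe for chains far smaller than the bound.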