Fix four correctness bugs (exception hang, reentrancy, termination race, busy-poll)#7
Open
Roach wants to merge 1 commit into
Open
Fix four correctness bugs (exception hang, reentrancy, termination race, busy-poll)#7Roach wants to merge 1 commit into
Roach wants to merge 1 commit into
Conversation
1. User-function exceptions hung the generator forever `on_completed` called `future.result()` unguarded; if the user function raised, the callback aborted before putting anything on the result queue, and the consumer waited indefinitely for the missing index. Wrap in try/except, push a `_TaskError` sentinel, re-raise in the caller. 2. Module-level globals made fast_map non-reentrant `enqueueing_finished` and `enqueued_items_count` were module globals, so two concurrent calls (the README even encourages this) clobbered each other's state. Move both into per-call locals closed over by the enqueuer. 3. Flag/count race at termination The consumer's exit check read `enqueueing_finished` and `enqueued_items_count` in an order that could observe a stale count alongside finished=True. Write the final count BEFORE setting the Event so any observer that sees the flag also sees the final total. 4. Busy-poll + unreliable Queue.empty() in consumer loop The main loop polled `result_queue.empty()` and `is_alive()` without blocking, burning a full CPU between results and racing with documented-unreliable `Queue.empty()`. Replace with `result_queue.get(timeout=...)` and an Event-based termination check; bail out if all workers die without producing the expected index. Also adds a regression test module covering each fix plus empty-input and single-task edge cases.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
I was reading through again and spotted four correctness bugs that all live in the same producer/consumer path, so figured I'd bundle the fixes into one PR rather than spam you with four.
Each one is independently real; together they cover most of the silent-hang / wrong-result scenarios I could reproduce.
1. Exceptions in the user function hang the generator forever
on_completedinprocess_chunkcallsfuture.result()with no try/except. If the user's function raises, the exception re-raises inside theadd_done_callbackthread,result_queue.put(...)never runs, and the consumer waits forever for the missing index.Fix: catch it, push a
_TaskError(exc)sentinel through the queue, re-raise it in the caller when that slot would have been yielded.2.
fast_mapisn't reentrantThese are module globals — written by the enqueuer thread, read by the consumer. Two concurrent
fast_mapcalls (which the README explicitly says people do) clobber each other. Second call resets the counter to 0 while the first is still consuming → early return or hang.Moved both into per-call state closed over by
enqueuer. No more shared mutable module state.3. Flag/count race at termination
Termination check is
expected_index == enqueued_items_count and enqueueing_finished. The enqueuer previously incremented the counter per-item and set the flag at the end with no ordering between the two writes (and no atomic read on the consumer side). So a consumer that observesfinished=Truecould still be reading a stale count.Fix: write the final count before setting the
Event. Anyone who sees the flag is now guaranteed to see the final total too.4. Busy-poll + unreliable
Queue.empty()in the consumerTwo issues here:
multiprocessing.Queue.empty()is documented as unreliable, so a result produced betweenempty()andis_alive()can get dropped on the floor and never yieldedReplaced with
result_queue.get(timeout=0.1)plus anEvent-based termination check. Also added a defensive bailout if every worker has died without producing the expected index, so a worker crash can't deadlock the generator.Tests
Added
test/test_p0_regressions.py— one case per fix plus empty-input and single-task edges:The concurrency and empty-input cases both hang on
master(verified locally), the others fail with the exception or get wrong ordering under load.Scope
I kept this strictly correctness — didn't want to balloon the diff. Side note: I've got separate notes on a handful of other things I'd be happy to send follow-up PRs for if you're interested:
atexit.registerleaks one entry per callspawnportability — the existingexamples/*.pyand script-style tests don't run on macOS as-is (missingif __name__ == "__main__":guard)on_errorcallback forfast_map_async, context-manager / explicit close, saferthreads_limitdefaultsetup.cfg+pyproject.tomlduplication, the committedfast_map.py.bakLet me know if you have questions or want me to split this up differently! :man-bowing: