[SPR-842]: Fix deadlock between log reader and copy thread #529
rsdcbabu
commented
Jun 26, 2025
- Make the copy thread consume all requests from the sync queue before releasing the log reader.
- Make table_copy aware of drop_table, so it can proceed with the drop instead of the copy.
- Add a resync map to track tables that are queued for resync, so that DDLs can be skipped.
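The first bullet can be illustrated with a minimal sketch. It is not the code from this PR: `CopyCoordinator`, `SyncRequest`, `enqueue`, and `run_copy_pass` are hypothetical names, and it assumes that queuing a resync is what blocks the log reader. The point it shows is that the copy thread keeps taking requests until the queue is observed empty, and only then releases the reader, with both steps done under the same lock.

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>

// Hypothetical request type; the real sync queue entries are not shown in this PR.
struct SyncRequest {
    uint64_t db_id;
    uint64_t table_id;
};

// Hypothetical coordinator between the log reader and the copy thread.
class CopyCoordinator {
public:
    // Assumption: queuing a resync request is what blocks the log reader.
    void enqueue(const SyncRequest& req) {
        std::lock_guard<std::mutex> guard(_mutex);
        _queue.push_back(req);
        _reader_blocked = true;
    }

    // Copy thread pass: handle every queued request, then release the reader once.
    template <typename CopyFn>
    std::size_t run_copy_pass(CopyFn copy_table) {
        std::size_t handled = 0;
        std::unique_lock<std::mutex> lock(_mutex);
        while (!_queue.empty()) {
            SyncRequest req = _queue.front();
            _queue.pop_front();
            lock.unlock();  // do not hold the lock during the copy itself
            copy_table(req);
            ++handled;
            lock.lock();
        }
        // The queue is observed empty and the reader is released under the same
        // lock, so no request can slip in between the check and the release.
        _reader_blocked = false;
        _released.notify_all();
        return handled;
    }

    // Log reader side: wait until the copy thread has released it.
    void wait_until_released() {
        std::unique_lock<std::mutex> lock(_mutex);
        _released.wait(lock, [this] { return !_reader_blocked; });
    }

    bool reader_blocked() const {
        std::lock_guard<std::mutex> guard(_mutex);
        return _reader_blocked;
    }

private:
    mutable std::mutex _mutex;
    std::condition_variable _released;
    std::deque<SyncRequest> _queue;
    bool _reader_blocked = false;
};
```

Requests that arrive while a copy is running are picked up by the same pass, because the loop re-checks the queue after every copy.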
SPR-842 copy_thread - logger_thread - internal-thread deadlock
copy_thread - logger_thread - internal-thread deadlock
…eue, increase nightly iteration
…p for later checks
…et to running in ddl mgr after init.
…R-842-logger-deadlock-2
…to SPR-842-logger-deadlock-2
craigsoules
left a comment
Logic seems fine, but we need additional comments in the code explaining when it's safe to skip setting the inflight state and when it's not.
fi
exit 1
fi
#- name: Run Proxy Tests
Looks like these proxy tests are commented out... do we need to re-enable them?
Yes, need to fix the proxy tests:
https://linear.app/springtail/issue/SPR-955/fix-proxy-tests
/** db -> table indicating that a resync was issued but it hasn't been picked up by the copy
    thread yet. */
std::map<uint64_t, std::map<uint64_t, std::set<XidLsn>>> _resync_map;

/** db -> table -> xid indicating the table @ XID is selected for resync but not yet in flight. */
std::map<uint64_t, std::map<uint64_t, XidLsn>> _resync_picked_map;
Would be more readable if we gave these various map types names.
Created aliases; let me know if it needs further changes.
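As an illustration of what such aliases could look like (the names actually chosen in the PR are not shown in this conversation), here is a minimal sketch. `DbId`, `TableId`, `ResyncMap`, `ResyncPickedMap`, and `is_resync_queued` are hypothetical, and `XidLsn` is a stand-in struct defined only so the example compiles.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <tuple>

// Hypothetical stand-in for the project's XidLsn type; ordered so it can live in a std::set.
struct XidLsn {
    uint64_t xid;
    uint64_t lsn;
    bool operator<(const XidLsn& other) const {
        return std::tie(xid, lsn) < std::tie(other.xid, other.lsn);
    }
};

// Hypothetical alias names; the aliases actually added in the PR may differ.
using DbId = uint64_t;
using TableId = uint64_t;

/** table -> XIDs at which a resync was issued but not yet picked up by the copy thread. */
using TableResyncMap = std::map<TableId, std::set<XidLsn>>;
/** db -> table -> pending resync XIDs. */
using ResyncMap = std::map<DbId, TableResyncMap>;

/** table -> XID at which the table was selected for resync but is not yet in flight. */
using TablePickedMap = std::map<TableId, XidLsn>;
/** db -> table -> picked XID. */
using ResyncPickedMap = std::map<DbId, TablePickedMap>;

// Example lookup: is any resync still queued for this table?
inline bool is_resync_queued(const ResyncMap& resync_map, DbId db, TableId table) {
    auto db_it = resync_map.find(db);
    if (db_it == resync_map.end()) {
        return false;
    }
    auto table_it = db_it->second.find(table);
    return table_it != db_it->second.end() && !table_it->second.empty();
}
```

With the aliases in place, the member declarations shrink to `ResyncMap _resync_map;` and `ResyncPickedMap _resync_picked_map;`, and the doc comments can live on the alias definitions.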
CopyQueuePtr copy_queue,
PgCopyResultPtr result);
PgCopyResultPtr result,
bool skip_setting_inflight);
Add an explanation of this variable and when it should be set.
Yeah, in general this is needed only for resyncs triggered while ingestion is running. We should see if we can avoid this specific behaviour; I will probably note that in the comment as well, as room for improvement.
@craigsoules I'm wondering whether I should always call the method (mark_inflight) and have an if-clause inside the method that skips, instead of having CHECKs in that method. I will fix it that way and remove this param. Let me know if you think otherwise.
Removed this flag altogether as it makes assumptions.
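The idea discussed in this thread, always calling `mark_inflight` and letting the method decide whether to skip, can be sketched as follows. This is not the PR's implementation: `InflightTracker` is a hypothetical class, and the skip condition used here (the table is already in flight) is an assumption, since the real condition is not shown in the conversation.

```cpp
#include <cstdint>
#include <set>

// Hypothetical tracker; the real mark_inflight lives in code not shown here.
class InflightTracker {
public:
    // Always safe to call. Returns true if the table was newly marked in flight,
    // false if the call was skipped because the table is already in flight
    // (assumed skip condition). No caller-supplied skip flag is needed.
    bool mark_inflight(uint64_t table_id) {
        if (_inflight.count(table_id) != 0) {
            return false;  // skip quietly instead of failing a CHECK
        }
        _inflight.insert(table_id);
        return true;
    }

    void clear_inflight(uint64_t table_id) { _inflight.erase(table_id); }

    bool is_inflight(uint64_t table_id) const { return _inflight.count(table_id) != 0; }

private:
    std::set<uint64_t> _inflight;
};
```

Keeping the decision inside the method means callers cannot pass a flag that contradicts the tracker's actual state, which is the concern behind removing `skip_setting_inflight`.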
std::optional<std::set<uint32_t>> table_oids = std::nullopt,
std::optional<nlohmann::json> include_json = std::nullopt);
std::optional<nlohmann::json> include_json = std::nullopt,
bool skip_setting_inflight=true);
Same here, a comment would be invaluable.
Removed the flag as mentioned in the previous comment
…ceived for the table" This reverts commit 0073478.
craigsoules
left a comment
Seems fine if it's fixing the problem.
My only concern is that we used to have a resync map like this, and I ended up removing it. It turned out that if I didn't block when a resync was requested, there was some kind of race condition in which mutations to the table that should have been skipped were being passed through.
It's possible that would no longer happen with the picked-for-sync stuff, but I just wanted to mention it in case you were seeing something like that in the nightly run.
The current issue in the nightly is happening in both this branch and in main (Ella's branch), so I don't think it is related to these changes.


