Skip to content

[SPR-842]: Fix deadlock between log reader and copy thread#529

Merged
garthgoodson merged 36 commits into
mainfrom
SPR-842-logger-deadlock-2
Aug 20, 2025
Merged

[SPR-842]: Fix deadlock between log reader and copy thread#529
garthgoodson merged 36 commits into
mainfrom
SPR-842-logger-deadlock-2

Conversation

@rsdcbabu

Copy link
Copy Markdown
Contributor
  1. Make copy thread to consume all the requests from sync queue before releasing log reader
  2. Make table_copy aware of drop_table, and so it can proceed with drop table instead of copy
  3. Have resync map to track tables those are queued for resync and so DDLs can be skipped.

@linear

linear Bot commented Jun 26, 2025

Copy link
Copy Markdown
SPR-842 copy_thread - logger_thread - internal-thread deadlock

copy_thread - logger_thread - internal-thread deadlock

Comment thread src/pg_log_mgr/pg_log_mgr.cc
@sonarqubecloud

Copy link
Copy Markdown

@rsdcbabu rsdcbabu marked this pull request as ready for review August 11, 2025 15:32

@craigsoules craigsoules left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic seems fine, but need additional comments in the code explaining about when it's safe to skip the inflight vs. not.

Comment thread .github/workflows/nightly.yml Outdated
fi
exit 1
fi
#- name: Run Proxy Tests

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like these proxy tests are commented out... do we need to re-enable them?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +281 to +287
/** db-> table indicating that a resync was issued but it hasn't been picked up by the copy
thread yet. */
std::map<uint64_t, std::map<uint64_t, std::set<XidLsn>>> _resync_map;

/** db -> table -> xid indicating the table @ XID is selected for resync, yet to in-flight*/
std::map<uint64_t, std::map<uint64_t, XidLsn>> _resync_picked_map;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be more readable if we named these various map types types.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created aliases, let me know if needs further change

Comment thread include/pg_repl/pg_copy_table.hh Outdated
CopyQueuePtr copy_queue,
PgCopyResultPtr result);
PgCopyResultPtr result,
bool skip_setting_inflight);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an explanation of this variable and when it should be set.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is needed only in resyncs triggered while ingestion is running in general. Should see if we can avoid this specific behaviour, will probably add that as well in the comment for room-to-improvise.

@rsdcbabu rsdcbabu Aug 18, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@craigsoules I'm thinking if I should always call the method (mark_inflight) and have if-clause to skip inside the method instead of having CHECKs in that method. Will fix that way and remove this param. Let me know if you think otherwise.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this flag altogether as it makes assumptions.

Comment thread include/pg_repl/pg_copy_table.hh Outdated
std::optional<std::set<uint32_t>> table_oids = std::nullopt,
std::optional<nlohmann::json> include_json = std::nullopt);
std::optional<nlohmann::json> include_json = std::nullopt,
bool skip_setting_inflight=true);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, a comment would be invaluable.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the flag as mentioned in the previous comment

@rsdcbabu rsdcbabu requested a review from craigsoules August 19, 2025 13:50

@craigsoules craigsoules left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine if it's fixing the problem.

My only concern is that we used to have a resync map like this, and I ended up removing it because it turned out that if I didn't block when a resync was requested there was some kind of race condition in which mutations to the table that should have been skipped were being passed through.

It's possible that would no longer happen with the picked-for-sync stuff, but just wanted to mention it in case you were seeing something like that happening in the nighty run.

@garthgoodson

Copy link
Copy Markdown
Contributor

Seems fine if it's fixing the problem.

My only concern is that we used to have a resync map like this, and I ended up removing it because it turned out that if I didn't block when a resync was requested there was some kind of race condition in which mutations to the table that should have been skipped were being passed through.

It's possible that would no longer happen with the picked-for-sync stuff, but just wanted to mention it in case you were seeing something like that happening in the nighty run.

The current issue in the nightly is happening in both this branch and in main (Ella's branch) so don't think it is related to these changes.

@garthgoodson garthgoodson merged commit 9319007 into main Aug 20, 2025
1 check passed
@garthgoodson garthgoodson deleted the SPR-842-logger-deadlock-2 branch August 20, 2025 19:00
@sonarqubecloud

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants