Skip to content

fuse: fix multiple issues during conn abort#138

Open
hbirth wants to merge 2 commits intoDDNStorage:redfs-ubuntu-noble-6.8.0-58.60from
hbirth:fix-race-in-get-req
Open

fuse: fix multiple issues during conn abort#138
hbirth wants to merge 2 commits intoDDNStorage:redfs-ubuntu-noble-6.8.0-58.60from
hbirth:fix-race-in-get-req

Conversation

@hbirth
Copy link
Copy Markdown
Collaborator

@hbirth hbirth commented Mar 26, 2026

No description provided.

@hbirth
Copy link
Copy Markdown
Collaborator Author

hbirth commented Mar 26, 2026

we have to verify this with John Gu ... since I cannot reproduce the state on my system

@hbirth hbirth force-pushed the fix-race-in-get-req branch from b665e47 to 8fb4203 Compare March 27, 2026 11:47
@hbirth hbirth changed the title fuse: fix lost wakeup in request allocation during connection abort fuse: add debug prints when request hang in conn abort Mar 27, 2026
@hbirth
Copy link
Copy Markdown
Collaborator Author

hbirth commented Mar 27, 2026

just added debug print for all requests that are in the system ... printed after 10 seconds when we hang in fuse_wait_aborted()

@hbirth hbirth force-pushed the fix-race-in-get-req branch 2 times, most recently from 2ebd093 to 00bac22 Compare March 27, 2026 12:00
@hbirth hbirth force-pushed the fix-race-in-get-req branch 3 times, most recently from cf3e5eb to 4a7ae2f Compare March 27, 2026 14:55
bsbernd
bsbernd previously approved these changes Mar 27, 2026
Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
@hbirth hbirth force-pushed the fix-race-in-get-req branch from 4a7ae2f to 16c0fc2 Compare April 7, 2026 07:54
@hbirth hbirth changed the title fuse: add debug prints when request hang in conn abort fuse: fix multiple issues during conn abort Apr 7, 2026
*/
for (qid = 0; qid < ring->max_nr_queues; qid++) {
queue = READ_ONCE(ring->queues[qid]);
if (!queue)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't see why two runs are required. With one single lock, let's say we have two competing tasks, task-1 tries to add the request, task-2 tries to abort the queue.

Scenario1:
task-1 wins and is in the process to add to the queue -> takes the lock, task-2 (abort) is blocked. Once task-2 continues, It blocks the queue and flushes requests. After that another task cannot add new requests, because queue->stopped` is set.

Scenario2:
task-2 (abort) wins, takes the lock, set queue->lock and flushes the queue. Once it releases the lock, task-1 wakes up, sees that the queue is stopped and fails the request with -ENOTCONN.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it really that bad? this only gets called when the connection is destroyed.

* queued even when no ring entries are active.
*/
fuse_uring_flush_bg(fc);
fuse_uring_stop_queues(ring);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is not right, queue_refs == 0 means there are no ring_ent on any queue - an impossible condition. And with queue_refs == 0, ring->ready will not be true.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is that an impossible condition? I have seen it many times in debug prints

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At startup time - > ring gets created, first queue gets created, then a ring-entry gets created (increases queue_refs), then ring->ready is set.

teardown:

fuse_uring_teardown_entries
fuse_uring_stop_list_entries() decreases queue_refs

Btw, if there would be a problem with FRRS_FUSE_REQ and FRRS_COMMIT, queue refs would never go to 0 - exactly here.

Anyway, fuse_uring_teardown_entries() is called by fuse_uring_stop_queues, which is supposed to be called in fuse_uring_abort() after fuse_uring_abort_end_requests(). Which means it is after queue->stopped = true

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes ... that's why I had done those above (maybe in a bit of dispair) ... we have queue_refs == 0, and queue ready, and still requests active that will never terminate.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, all I want here is to call fuse_flush_bg() and fuse_uring_stop_queues() no matter what. What is wrong with that?

@hbirth hbirth force-pushed the fix-race-in-get-req branch 2 times, most recently from 78f26c1 to 177aa69 Compare April 7, 2026 13:22
Fix uninterruptible sleep (D state) hangs during FUSE filesystem
teardown when using io_uring. The issue manifests as processes stuck
waiting for requests that are never completed, particularly affecting
force requests like FUSE_FLUSH or when requests are created after
fuse_abort_conn() already finished.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
@hbirth hbirth force-pushed the fix-race-in-get-req branch from 177aa69 to e3bbfe2 Compare April 7, 2026 13:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants