
fix: prevent I/O errors during disk cleanup by ensuring devices are unmounted #234

Merged: brownzebra merged 1 commit into main from EAI-5787_bloom_cleanup on May 4, 2026

Conversation

Q-Dub (Contributor) commented Apr 28, 2026

EAI-5787_bloom_cleanup

Critical bug fix for EAI-5787: the cleanup process made nodes unresponsive with I/O errors because it ran wipefs/mkfs on devices that were still mounted or in use.

Changes:

  • Add final forced unmount pass for all CLUSTER_DISKS before wiping
  • Add 10-second delay to allow kernel to fully release devices
  • Verify each device is unmounted with findmnt before wipefs/mkfs
  • Skip devices that are still mounted to prevent system corruption

This prevents the catastrophic I/O errors that were making GPU worker nodes unresponsive when running bloom cleanup on systems with multiple NVMe devices.
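
The four changes listed above can be sketched as follows. This is a minimal illustration in Python, not the code from this PR: the function names, the injectable `runner` and the return value are all hypothetical, and only `CLUSTER_DISKS`, the 10-second delay, and the use of `findmnt` and `wipefs` come from the description. The `mkfs` step is left out because the description does not name a filesystem type.

```python
import subprocess
import time
from typing import Callable, Iterable, List, Tuple

# A runner takes an argv list and returns the exit code (hypothetical helper type).
Runner = Callable[[List[str]], int]


def run(argv: List[str]) -> int:
    """Run a command, discard its output, and return its exit code."""
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode


def is_mounted(device: str, runner: Runner = run) -> bool:
    """findmnt exits 0 when it finds a mount whose source is `device`."""
    return runner(["findmnt", "--source", device]) == 0


def wipe_disks(
    disks: Iterable[str],
    runner: Runner = run,
    settle_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[List[str], List[str]]:
    """Unmount, wait, verify, then wipe. Returns (wiped, skipped)."""
    disks = list(disks)

    # 1. Final forced unmount pass. A non-zero exit is expected for
    #    devices that were not mounted, so the result is ignored.
    for device in disks:
        runner(["umount", "--force", device])

    # 2. Give the kernel time to release the devices.
    sleep(settle_seconds)

    wiped: List[str] = []
    skipped: List[str] = []
    for device in disks:
        # 3. Verify the device is unmounted; 4. skip it if it is not.
        if is_mounted(device, runner):
            skipped.append(device)
            continue
        if runner(["wipefs", "--all", device]) == 0:
            wiped.append(device)
        else:
            skipped.append(device)
    return wiped, skipped
```

A caller would pass the contents of `CLUSTER_DISKS` as `disks` and log the skipped list. Note that this sketch checks only the exact device path it is given; a mounted partition on the same disk would need its own check.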

Q-Dub requested review from a team, brownzebra, mramdgh, oskarasbrink and silokimmo on April 28, 2026 at 17:32

brownzebra (Contributor) left a comment

LGTM

brownzebra merged commit aa340ef into main on May 4, 2026
3 checks passed
brownzebra deleted the EAI-5787_bloom_cleanup branch on May 4, 2026 at 07:05

2 participants