@llllssss94

What is the purpose of the change

This pull request fixes a critical bug (FLINK-38909) that causes checkpoint cleanup to fail with a PathIsNotEmptyDirectoryException. The root cause was an incorrect, non-recursive delete call on a checkpoint's storage location, which by design contains multiple files.

A completed Flink checkpoint always consists of multiple data files and a metadata file, grouped under a common path (exclusiveCheckpointDir). This logical location is never empty. Attempting to delete it with a non-recursive delete(path, false) command is fundamentally incorrect and guaranteed to fail on any compliant file system. This bug leads to orphaned checkpoint data and storage leaks.

This fix corrects the logic by using a recursive delete, ensuring that all files and objects associated with a checkpoint's location are properly removed, regardless of the underlying filesystem's architecture.

Brief change log

  • In FsCompletedCheckpointStorageLocation.disposeStorageLocation(), the filesystem call was changed to fs.delete(exclusiveCheckpointDir, true). This enables recursive deletion, ensuring the entire directory tree of a checkpoint is properly removed.

Verifying this change

This change added tests and can be verified as follows:

  • Added a new test case to FsCompletedCheckpointStorageLocationTest to specifically reproduce the bug and validate the fix. This test simulates a real, non-empty checkpoint by creating a storage location with subdirectories and files. It then calls the disposeStorageLocation() method and asserts that no exception is thrown and the location is completely removed.
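
The shape of such a test can be sketched with plain java.nio.file as a stand-in for Flink's FileSystem abstraction. Everything below is hypothetical: the class name CheckpointDirDisposalSketch, the helper methods, and the file layout (chk-1/_metadata, chk-1/state/part-0) are illustrative only and are not taken from FsCompletedCheckpointStorageLocationTest. The sketch only shows the difference between a non-recursive and a recursive delete on a non-empty directory.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CheckpointDirDisposalSketch {

    /** Builds a non-empty, checkpoint-like directory (hypothetical layout). */
    public static Path createFakeCheckpointDir() throws IOException {
        Path root = Files.createTempDirectory("chk-sketch");
        Path dir = Files.createDirectory(root.resolve("chk-1"));
        Files.write(dir.resolve("_metadata"), new byte[] {1, 2, 3});
        Path sub = Files.createDirectory(dir.resolve("state"));
        Files.write(sub.resolve("part-0"), new byte[] {4, 5, 6});
        return dir;
    }

    /** Analogue of delete(path, false): only succeeds on an empty directory. */
    public static boolean deleteNonRecursive(Path dir) throws IOException {
        try {
            Files.delete(dir);
            return true;
        } catch (DirectoryNotEmptyException e) {
            // The directory still has children, so nothing was removed.
            return false;
        }
    }

    /** Analogue of delete(path, true): removes children before their parents. */
    public static void deleteRecursive(Path dir) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts the deepest entries first.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = createFakeCheckpointDir();
        System.out.println("non-recursive delete succeeded: " + deleteNonRecursive(dir));
        deleteRecursive(dir);
        System.out.println("directory still exists: " + Files.exists(dir));
        Files.deleteIfExists(dir.getParent()); // clean up the temp root
    }
}
```

A real test would go through FsCompletedCheckpointStorageLocation and a Flink file system rather than java.nio.file; the assertions (no exception, location gone) would be the same in spirit.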

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes (This directly impacts the correctness of the checkpoint cleanup lifecycle.)
  • The S3 file system connector: yes (While the fix is in core Flink, the bug is most frequently observed on object storage systems, and this change ensures correct behavior on them.)

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

flinkbot commented Jan 15, 2026

CI report:

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot run azure: re-run the last Azure build

  }
- fs.delete(exclusiveCheckpointDir, false);
+ // Recursively delete the checkpoint directory and all its contents
+ fs.delete(exclusiveCheckpointDir, true);

The change looks good.

I was wondering: if the deletion fails (e.g. for permission reasons), should we catch and log that error?
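
One way to read this suggestion is a thin wrapper that logs the failure together with the affected location and then rethrows, so callers still see that cleanup failed. The sketch below is an assumption about what that could look like, not code from this PR: LoggedDisposalSketch, DeleteAction, and disposeAndLog are hypothetical names, and java.util.logging stands in for whatever logger the surrounding code uses.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LoggedDisposalSketch {

    private static final Logger LOG = Logger.getLogger(LoggedDisposalSketch.class.getName());

    /** Hypothetical stand-in for the file system delete call. */
    @FunctionalInterface
    public interface DeleteAction {
        void run() throws IOException;
    }

    /**
     * Runs the delete, logs any IOException with the affected location,
     * and rethrows so the caller still learns that cleanup failed.
     */
    public static void disposeAndLog(String location, DeleteAction delete) throws IOException {
        try {
            delete.run();
        } catch (IOException e) {
            LOG.log(Level.WARNING, "Could not dispose checkpoint location " + location, e);
            throw e;
        }
    }

    public static void main(String[] args) {
        try {
            // Simulated failure; the path is a placeholder, not a real location.
            disposeAndLog("/hypothetical/chk-1", () -> {
                throw new IOException("permission denied (simulated)");
            });
        } catch (IOException e) {
            System.out.println("caller saw: " + e.getMessage());
        }
    }
}
```

Whether to rethrow or swallow the exception after logging is a design choice that depends on how callers of disposeStorageLocation() already handle failures; the sketch rethrows so that existing error handling is not bypassed.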

github-actions bot added the community-reviewed label (PR has been reviewed by the community) on Jan 16, 2026