[HDFS-17897] Handle InvalidEncryptionKeyException during striped file checksum #8364

Open
JHSUYU wants to merge 1 commit into apache:trunk from JHSUYU:fix-striped-checksum-encryption-key
@JHSUYU JHSUYU commented Mar 22, 2026

Description of PR

Jira: HDFS-17897

HDFS-12931 added handling for InvalidEncryptionKeyException in ReplicatedFileChecksumComputer.checksumBlock() but missed the parallel striped file path StripedFileNonStripedChecksumComputer.checksumBlockGroup().

Both paths call DFSClient.connectToDN(), which performs a SASL handshake using a cached DataEncryptionKey (DEK). When the DEK references a BlockKey that has been removed from the DataNode (there is a time gap when the DataNode isn't updated with the new keys after key rotation, as described in HDFS-12931), the handshake fails with InvalidEncryptionKeyException.

In the replicated path, this exception is caught, clearDataEncryptionKey() is called to invalidate the cached DEK, and the block is retried. In the striped path, the exception falls through to the generic catch (IOException) block, which only logs a warning. The stale DEK is never cleared, so every DataNode in the block group fails with the same error. The operation fails permanently: even user-level retries reuse the same stale cached DEK.

Fix: Add catch (InvalidEncryptionKeyException) in checksumBlockGroup(), mirroring the existing handling in checksumBlock().
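
For reviewers who want to see the control flow in isolation, here is a minimal, self-contained sketch of the catch, clear, and retry pattern described above. It is not the patch itself and does not use Hadoop classes: FakeClient, its integer key ids, the handleInvalidKey flag, and the local InvalidEncryptionKeyException are all hypothetical stand-ins, and the real checksumBlock() retry bookkeeping may differ.

```java
import java.io.IOException;
import java.util.List;

class StripedChecksumRetrySketch {

  /** Local stand-in for Hadoop's exception so the sketch compiles on its own. */
  static class InvalidEncryptionKeyException extends IOException {
    InvalidEncryptionKeyException(String msg) { super(msg); }
  }

  /** Hypothetical client with a cached key id. This is NOT the real DFSClient. */
  static class FakeClient {
    private final int currentKeyId; // key id the DataNodes still accept
    private Integer cachedKeyId;    // cached DEK key id; null means "fetch a fresh one"

    FakeClient(int currentKeyId, int cachedKeyId) {
      this.currentKeyId = currentKeyId;
      this.cachedKeyId = cachedKeyId;
    }

    /** Drops the cached key so the next connection fetches a fresh one. */
    void clearDataEncryptionKey() { cachedKeyId = null; }

    /** Simulates the handshake: fails if the cached key is stale. */
    String connectToDN(String dn) throws IOException {
      if (cachedKeyId == null) {
        cachedKeyId = currentKeyId; // simulate fetching a new key
      }
      if (cachedKeyId != currentKeyId) {
        throw new InvalidEncryptionKeyException("stale key " + cachedKeyId + " at " + dn);
      }
      return "checksum-from-" + dn;
    }
  }

  /**
   * Tries each DataNode in turn. With handleInvalidKey set, a stale-key failure
   * clears the cached key once and retries the same DataNode; without it, the
   * failure is treated like any other IOException and the next DataNode is tried.
   */
  static String checksumBlockGroup(FakeClient client, List<String> datanodes,
                                   boolean handleInvalidKey) throws IOException {
    boolean retried = false;
    for (int j = 0; j < datanodes.size(); j++) {
      try {
        return client.connectToDN(datanodes.get(j));
      } catch (InvalidEncryptionKeyException e) {
        if (handleInvalidKey && !retried) {
          retried = true;                  // guard against retrying forever
          client.clearDataEncryptionKey(); // invalidate the stale cached key
          j--;                             // retry the same DataNode
        }
      } catch (IOException e) {
        // Generic failure: a real client would log a warning and move on.
      }
    }
    throw new IOException("all DataNodes failed");
  }

  public static void main(String[] args) throws IOException {
    List<String> dns = List.of("dn1", "dn2", "dn3");
    System.out.println(checksumBlockGroup(new FakeClient(2, 1), dns, true));
    try {
      checksumBlockGroup(new FakeClient(2, 1), dns, false);
    } catch (IOException e) {
      System.out.println("without handling: " + e.getMessage());
    }
  }
}
```

In this toy model, the run with handling enabled succeeds on the first DataNode after one key refresh, while the run without it exhausts every DataNode because the stale key is never cleared.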

How was this patch tested?

  • Added testStripedFileChecksumWithInvalidEncryptionKey in TestEncryptedTransfer, which creates an EC file, invalidates the encryption key on all DataNodes, and verifies that getFileChecksum() still succeeds, which requires the client to catch the exception and refresh the key.
