HDFS-11161. Incorporate Baidu BOS file system implementation. #8347

yangdong2398 wants to merge 2 commits into apache:trunk
Conversation
Add hadoop-bos module to hadoop-cloud-storage-project providing Baidu Object Storage (BOS) FileSystem implementation using the bos:// URI scheme. The module includes:

- BaiduBosFileSystem: core FileSystem implementation for BOS
- Support for both flat and hierarchical namespace modes
- Multipart upload for large files
- CRC32C checksum compatible with HDFS for DistCp
- Pluggable credential providers (configuration-based and env-variable-based)
- Hadoop FileSystem contract tests
- Shaded BOS SDK and transitive dependencies (jackson, httpcomponents, guava, commons-logging, commons-lang, commons-codec) under bfs.* prefix to avoid classpath conflicts
- Integration into hadoop-project, hadoop-cloud-storage, and hadoop-cloud-storage-dist parent POMs
- Standard core-site.xml + XInclude auth-keys.xml test credential pattern
- tests-off/tests-on profiles to auto-skip tests when credentials absent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🎊 +1 overall
This message was automatically generated.
cc @steveloughran Could you help recommend someone who can review this PR? Thanks ~
I notice that all the commits under this directory are tagged with
@yangdong2398 @LuciferYang Thank you for your contribution! As this PR is fairly large, we will need to initiate a DISCUSS and VOTE thread in accordance with the project process before proceeding to the next steps. You may refer to this email for the general format and workflow: https://lists.apache.org/thread/t4xopqkcd2r3gttlfpwgk6cbf0fx2tbm
@yangdong2398 I have conducted some analysis on the dependencies of this module with cc and summarized them as follows:

Compile-Scope Dependency Tree

License Summary

Result: 12 of 14 dependencies are Apache License 2.0. Two (slf4j-api, bcprov-jdk15on) are MIT — ASF Category A, fully compatible. One dependency (

Open-Source License Risks

Risk 1:
    -->
    <FindBugsFilter>
      <Match>
        <Class name="org.apache.hadoop.fs.bos.BosInputStream"/>
Which issues can be fixed? Additionally, it's better to suppress issues as precisely as possible rather than using broad suppressions.
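A narrower suppression pins the class, method, and bug pattern together rather than silencing the whole class. A sketch of the shape (the method name and bug pattern below are placeholders, not taken from the PR):

```xml
<FindBugsFilter>
  <Match>
    <Class name="org.apache.hadoop.fs.bos.BosInputStream"/>
    <!-- Placeholder method/pattern: suppress one known finding, not all of them. -->
    <Method name="read"/>
    <Bug pattern="IS2_INCONSISTENT_SYNC"/>
  </Match>
</FindBugsFilter>
```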
    BaiduBosConstants.BOS_STS_TOKEN_ENV);

    if (LOG.isDebugEnabled()) {
      LOG.debug("accessKey:{}", accessKey);
Access key, secret key, and session token are logged at DEBUG level. Credentials must never appear in logs.
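If some trace of the credential is needed for troubleshooting, one option is to log only a masked form. A minimal sketch (the `CredentialMasker` helper and its masking policy are illustrative, not part of the PR):

```java
// Hypothetical helper: log a masked form of a credential, never the raw value.
public class CredentialMasker {

  /** Keep at most the first four characters; replace the rest with "****". */
  public static String mask(String secret) {
    if (secret == null || secret.isEmpty()) {
      return "(unset)";
    }
    int keep = Math.min(4, secret.length());
    return secret.substring(0, keep) + "****";
  }

  public static void main(String[] args) {
    // Instead of LOG.debug("accessKey:{}", accessKey), log the masked form:
    System.out.println("accessKey:" + mask("AKIDEXAMPLE1234567890"));
  }
}
```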
     * @throws IOException if an I/O error occurs during close
     */
    @Override
    public synchronized void close() throws IOException {
When close() encounters an exception (e.g., an upload part failure), the multipart upload never seems to be aborted. Will the uploadId remain active on the server side, leading to storage leakage?
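The usual fix is to abort the upload before rethrowing. A sketch of the pattern, with a simplified stand-in interface rather than the real hadoop-bos store API:

```java
import java.io.IOException;

// Sketch of the abort-on-failure pattern for close(); BosStore and its
// method names are simplified stand-ins for the module's real store.
public class MultipartCloseSketch {

  public interface BosStore {
    void completeMultipartUpload(String uploadId) throws IOException;
    void abortMultipartUpload(String uploadId);
  }

  public static void close(BosStore store, String uploadId) throws IOException {
    try {
      store.completeMultipartUpload(uploadId);
    } catch (IOException | RuntimeException e) {
      // Abort so the uploadId does not stay active server-side (storage leak).
      store.abortMultipartUpload(uploadId);
      throw e instanceof IOException ? (IOException) e : new IOException(e);
    }
  }
}
```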
        index);
      }
    } catch (Exception e) {
      LOG.warn(
Exceptions from future.get() are caught and only logged at WARN. Execution seems to continue to completeMultipartUpload, which will either succeed with incomplete data or throw IllegalStateException (unchecked, violating the OutputStream.close() contract).
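A fail-fast alternative is to translate any part-upload failure into an IOException before completion is attempted. A self-contained sketch (the method name is illustrative):

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Sketch: fail fast instead of logging WARN and continuing to
// completeMultipartUpload with missing parts.
public class PartUploadWait {

  public static void waitForParts(List<Future<?>> parts) throws IOException {
    for (Future<?> part : parts) {
      try {
        part.get();
      } catch (InterruptedException e) {
        // Restore the interrupt flag before converting to IOException.
        Thread.currentThread().interrupt();
        throw new IOException("Interrupted waiting for part upload", e);
      } catch (ExecutionException e) {
        // Propagate as IOException so close() honors the OutputStream contract.
        throw new IOException("Part upload failed", e.getCause());
      }
    }
  }
}
```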
    LoggerFactory.getLogger(BosOutputStream.class);

    /** The backing native file system store. */
    public final BosNativeFileSystemStore store;
Nearly every field (store, key, uploadId, eTags, closed, etc.) is public. They could be made private (or private final).
    @Override
    public boolean truncate(Path f, long newLength)
        throws IOException {
      throw new IOException("Not supported");
This should throw UnsupportedOperationException rather than a generic IOException.
    String key = pathToKey(absolutePath);
    FileStatus status = this.getFileStatus(f);
    if (status.isFile()) {
      return new ContentSummary(status.getLen(), 1L, 0L);
Can we use ContentSummary.Builder? The 3-arg ContentSummary constructor is already deprecated.
     * multipart upload. Data is written to an output buffer and
     * then moved to an input buffer for uploading.
     */
    public class BosBlockBuffer {
I think BosBlockBuffer should implement Closeable and close outBuffer and inBuffer in its close() method.
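A sketch of what that could look like; the stream types and the `startReading()` helper are placeholders modeled on the javadoc, not the PR's actual fields:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;

// Sketch of a Closeable block buffer; field names follow the review comment
// (outBuffer/inBuffer), the concrete stream types are placeholders.
public class BlockBufferSketch implements Closeable {

  private ByteArrayOutputStream outBuffer = new ByteArrayOutputStream();
  private ByteArrayInputStream inBuffer;

  /** Flip the written bytes into the input side, as the javadoc describes. */
  public void startReading() {
    inBuffer = new ByteArrayInputStream(outBuffer.toByteArray());
  }

  @Override
  public void close() throws IOException {
    try {
      if (outBuffer != null) {
        outBuffer.close();
      }
    } finally {
      // Close inBuffer even if closing outBuffer throws; null out both
      // so a second close() is a harmless no-op.
      if (inBuffer != null) {
        inBuffer.close();
      }
      outBuffer = null;
      inBuffer = null;
    }
  }
}
```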
    boundedThreadPool =
        BlockingThreadPoolExecutorService.newInstance(
            activeTasks,
            waitingTasks,
Integer.MAX_VALUE / 2 for the queue? Won't that just increase the risk of OOM?
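For comparison, a plain-JDK sketch of a pool whose queue is genuinely bounded: a small queue plus CallerRunsPolicy applies back-pressure to the writer instead of buffering blocks without limit. The sizes are illustrative, not taken from the PR:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch of a bounded upload pool using only java.util.concurrent.
public class BoundedUploadPool {

  public static ThreadPoolExecutor create(int activeTasks, int waitingTasks) {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        activeTasks, activeTasks,
        60L, TimeUnit.SECONDS,
        new LinkedBlockingQueue<>(waitingTasks),     // bounded, not MAX_VALUE / 2
        new ThreadPoolExecutor.CallerRunsPolicy());  // back-pressure on overflow
    pool.allowCoreThreadTimeOut(true);
    return pool;
  }
}
```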
      }
    }
    return response;
Does this method silently return null on an invalid upload ID?
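One way to surface the failure is to check the response before returning. A minimal sketch with illustrative names, not the PR's actual method:

```java
import java.io.IOException;

// Sketch: turn a null completion response into an IOException instead of
// letting the caller dereference null later.
public class CompleteUploadCheck {

  public static String completeOrThrow(String uploadId, String response)
      throws IOException {
    if (response == null) {
      throw new IOException(
          "completeMultipartUpload failed: unknown uploadId " + uploadId);
    }
    return response;
  }
}
```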
There is also a risk of JAR conflicts, such as compatibility between Log4j 1.x and 2.x; it is worth checking whether there are conflicts with common packages and whether shading has been applied.
- Exclude EPL-licensed paho and CVE-affected bcprov-jdk15on deps
- Relocate shaded packages to org.apache.hadoop.fs.bos.shaded.*
- Remove jackson/commons-logging from shade (unused/bridged)
- Use skipITs instead of maven.test.skip following AWS model
- Fix BosOutputStream close() to abort multipart upload on failure
- Fix BceCredentialsProvider mutual recursion causing StackOverflow
- Fix BosInputStream close() synchronization and skip statistics
- Restore interrupt flag on InterruptedException in retry loops
- Cancel all futures on rename failure in NonHierarchyStore
- Shut down thread pool in BosNativeFileSystemStore.close()
- Add abortMultipartUpload to BosClientProxy interface
- Throw IOException on invalid upload ID in completeMultipartUpload
- Remove credential values from debug logs
- Fix delete() to check existence in hierarchy+recursive path
- Change BosOutputStream fields from public to private/private final
- Use UnsupportedOperationException for append() and truncate()
- Use ContentSummary.Builder instead of deprecated constructor
- Implement Closeable on BosBlockBuffer
- Use ThreadLocalRandom instead of Random in BosRandomRetryPolicy
- Remove misleading constructor from BaiduBosFileSystemAdapter
- Make findbugs-exclude.xml suppressions more precise
- Use getDeclaredConstructor().newInstance() instead of deprecated API
- Fix EnvironmentVariableCredentialsProvider logger class

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
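The "Relocate shaded packages to org.apache.hadoop.fs.bos.shaded.*" item corresponds to a maven-shade-plugin stanza along these lines; the source package pattern shown is an assumption about the BOS SDK, not copied from the PR:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <!-- Assumed SDK package; the real pattern depends on the BOS SDK layout. -->
        <pattern>com.baidubce</pattern>
        <shadedPattern>org.apache.hadoop.fs.bos.shaded.com.baidubce</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```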
💔 -1 overall
This message was automatically generated.
Description of PR
JIRA: https://issues.apache.org/jira/browse/HDFS-11161
Add `hadoop-bos` module to `hadoop-cloud-storage-project` providing a Baidu Object Storage (BOS) FileSystem implementation using the `bos://` URI scheme.

Key features:

- `BaiduBosFileSystem`: core Hadoop FileSystem implementation for BOS
- Shaded dependencies under the `bfs.*` prefix to avoid classpath conflicts

Build integration:

- `hadoop-project/pom.xml` (dependency management), `hadoop-cloud-storage-project/pom.xml` (module), `hadoop-cloud-storage/pom.xml` (umbrella dependency), `hadoop-cloud-storage-dist/pom.xml` (dist packaging)
- Standard `core-site.xml` + XInclude `auth-keys.xml` test credential pattern
- `tests-off`/`tests-on` profiles to auto-skip tests when credentials are absent

How was this patch tested?

- `TestConfigurationCredentialsProvider` validates credential provider loading from Hadoop configuration
- FileSystem contract tests (`TestBosContractCreate`, `TestBosContractDelete`, `TestBosContractRename`, `TestBosContractSeek`, etc.) — require valid BOS credentials in `auth-keys.xml`

Dependencies and Licensing

All new dependencies (Baidu BOS SDK) and transitive dependencies are properly shaded under the `bfs.*` prefix. BOS SDK is under Apache License 2.0, compatible with ASF 2.0 license requirements. No changes required to LICENSE, LICENSE-binary, or NOTICE-binary files as dependencies are shaded and not exposed in distribution.
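For readers trying the module, the test credential pattern described above would look roughly like the following `auth-keys.xml` fragment included from `core-site.xml` via XInclude. The property names here are hypothetical placeholders; check the module's documentation for the actual keys:

```xml
<!-- auth-keys.xml: hypothetical property names for illustration only. -->
<configuration>
  <property>
    <name>fs.bos.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.bos.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
  <property>
    <name>fs.bos.endpoint</name>
    <value>bj.bcebos.com</value>
  </property>
</configuration>
```

With credentials absent, the `tests-off` profile skips the contract tests automatically.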