Description
There's an inconsistency in how FileSystems.get_filesystem() handles missing optional dependencies between GCS and S3.
Current Behavior
S3 (without aws extra):
>>> from apache_beam.io import filesystems
>>> filesystems.FileSystems.get_filesystem("s3://blah")
<apache_beam.io.aws.s3filesystem.S3FileSystem at 0x11a0af750>
Returns the filesystem object; validation happens later, when the filesystem is actually used.
GCS (without gcp extra):
>>> from apache_beam.io import filesystems
>>> filesystems.FileSystems.get_filesystem("gcs://blah")
ValueError: Unable to get filesystem from specified path, please use the correct path or ensure the required dependency is installed, e.g., pip install apache-beam[gcp]. Path specified: gcs://blah
Raises immediately because GCSFileSystem isn't registered as a subclass.
Proposed Behavior
Both should behave consistently. GCSFileSystem should be returned from get_filesystem() like S3FileSystem, allowing callers to validate dependencies when the filesystem is actually used rather than at lookup time.
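To make the proposal concrete, here is a minimal, self-contained sketch of lookup that always succeeds and a dependency check that runs on first use. It is a toy model, not Beam's implementation: `LazyFileSystem`, `ToyGCSFileSystem`, `required_module`, `extra`, and this standalone `get_filesystem` are hypothetical names, the `gcs` scheme string is simply copied from the example path above, and the missing dependency is simulated with a module name that should not exist.

```python
import importlib


class LazyFileSystem:
    """Toy base class: registration is unconditional, dependency checks are deferred."""

    # Hypothetical attributes for illustration only; not Beam's actual fields.
    scheme = None
    required_module = None
    extra = None

    def _require_dependency(self):
        # Import the optional dependency only when it is really needed.
        try:
            return importlib.import_module(self.required_module)
        except ImportError as exc:
            raise ImportError(
                f"{type(self).__name__} requires '{self.required_module}'; "
                f"try: pip install apache-beam[{self.extra}]"
            ) from exc

    def open(self, path):
        # Usage time: this is where a missing dependency surfaces.
        self._require_dependency()
        raise NotImplementedError("actual I/O is out of scope for this sketch")


class ToyGCSFileSystem(LazyFileSystem):
    scheme = "gcs"  # copied from the example path in this issue
    # Stand-in for a GCP client library that is not installed.
    required_module = "hypothetical_gcs_client_that_is_not_installed"
    extra = "gcp"


def get_filesystem(path):
    """Lookup time: match on scheme only, never touch optional dependencies."""
    scheme = path.split("://", 1)[0]
    for cls in LazyFileSystem.__subclasses__():
        if cls.scheme == scheme:
            return cls()
    raise ValueError(f"No filesystem registered for path: {path}")


fs = get_filesystem("gcs://blah")  # succeeds even though the dependency is missing
try:
    fs.open("gcs://blah")
except ImportError as exc:
    print(f"Deferred failure at use time: {exc}")
```

With this shape, an unknown scheme still raises `ValueError` at lookup, while a missing extra raises `ImportError` at use, so callers can tell the two cases apart.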
Why This Matters
- Inconsistent API behavior is confusing
- Code that handles multiple filesystem types can't handle a missing GCS dependency gracefully, because the lookup itself raises
- Dependency validation at usage time (not lookup time) allows for better error handling and lazy loading patterns
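The second point can be illustrated with a small sketch of caller code under the current eager behavior. The helper `resolve_all` and the stand-in `fake_get_filesystem` are hypothetical and exist only so the example runs without Beam installed; the stand-in mimics the behavior reported above by raising `ValueError` for anything that is not an `s3://` path.

```python
def resolve_all(paths, get_filesystem):
    """Try to resolve every path, collecting lookup failures instead of aborting."""
    resolved, failed = {}, {}
    for path in paths:
        try:
            resolved[path] = get_filesystem(path)
        except ValueError as exc:
            # With eager validation, a missing extra and a bad scheme both land here.
            failed[path] = str(exc)
    return resolved, failed


def fake_get_filesystem(path):
    # Hypothetical stand-in for the eager lookup described in this issue.
    if path.startswith("s3://"):
        return "S3FileSystem placeholder"
    raise ValueError(
        "Unable to get filesystem from specified path. "
        f"Path specified: {path}"
    )


resolved, failed = resolve_all(
    ["s3://blah", "gcs://blah", "nope://blah"], fake_get_filesystem
)
print(sorted(resolved))
print(sorted(failed))
```

In this sketch the `gcs://` path (dependency missing) and the `nope://` path (scheme unknown) are indistinguishable to the caller except by parsing the error message, which is the kind of handling that usage-time validation would avoid.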
Environment
- Apache Beam version: 2.70.0
- Python version: 3.11
Generated by Claude Code, confirmed by @hjtran