GCSFileSystem requires gcp extra at lookup time while S3FileSystem does not #37445

@hjtran

FileSystems.get_filesystem() handles a missing optional dependency differently for GCS than for S3.

Current Behavior

S3 (without aws extra):

>>> from apache_beam.io import filesystems
>>> filesystems.FileSystems.get_filesystem("s3://blah")
<apache_beam.io.aws.s3filesystem.S3FileSystem at 0x11a0af750>

Returns the filesystem object; validation happens later when the filesystem is actually used.

GCS (without gcp extra):

>>> from apache_beam.io import filesystems
>>> filesystems.FileSystems.get_filesystem("gcs://blah")
ValueError: Unable to get filesystem from specified path, please use the correct path or ensure the required dependency is installed, e.g., pip install apache-beam[gcp]. Path specified: gcs://blah

Raises immediately: without the gcp extra, GCSFileSystem isn't registered as a subclass, so the lookup finds no filesystem matching the path's scheme.
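The difference can be illustrated with a minimal, self-contained sketch of a subclass-based lookup. This is not Beam's actual code: FileSystem, S3LikeFileSystem, GcsLikeFileSystem, get_filesystem and the module name below are all hypothetical stand-ins, and the sketch assumes the lookup works by scanning the subclasses that have been defined so far.

```python
# Hypothetical stand-ins for illustration; not Apache Beam's actual classes.
class FileSystem:
    @classmethod
    def scheme(cls):
        raise NotImplementedError

    @classmethod
    def get_all_subclasses(cls):
        # Walk the tree of subclasses that have been defined so far.
        found = []
        for sub in cls.__subclasses__():
            found.append(sub)
            found.extend(sub.get_all_subclasses())
        return found


class S3LikeFileSystem(FileSystem):
    # Defined unconditionally: no optional dependency is imported here.
    @classmethod
    def scheme(cls):
        return "s3"


try:
    # Simulates a module whose top-level import needs a missing extra.
    import a_missing_optional_dependency_for_this_sketch  # hypothetical name

    class GcsLikeFileSystem(FileSystem):
        @classmethod
        def scheme(cls):
            return "gcs"  # mirrors the path used in this report
except ImportError:
    pass  # The class is never defined, so it never becomes a subclass.


def get_filesystem(path):
    # Pick the registered subclass whose scheme matches the path prefix.
    scheme = path.split("://", 1)[0]
    matches = [
        fs for fs in FileSystem.get_all_subclasses() if fs.scheme() == scheme
    ]
    if not matches:
        raise ValueError(
            f"Unable to get filesystem from specified path: {path}"
        )
    return matches[0]()
```

In this sketch, get_filesystem("s3://blah") returns an object while get_filesystem("gcs://blah") raises ValueError, matching the two behaviors reported above.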

Proposed Behavior

Both should behave consistently. GCSFileSystem should be returned from get_filesystem() like S3FileSystem, allowing callers to validate dependencies when the filesystem is actually used rather than at lookup time.
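One way the proposed behavior could look is sketched below. It is only an illustration under stated assumptions: LazyDependencyFileSystem, REQUIRED_MODULE, EXTRA, the module name and the client's exists() call are hypothetical and do not describe how GCSFileSystem or S3FileSystem are implemented.

```python
import importlib


class LazyDependencyFileSystem:
    """Hypothetical filesystem that defers its optional import to first use."""

    # Hypothetical values; a real filesystem would name its actual client
    # module and the extra that provides it.
    REQUIRED_MODULE = "a_missing_optional_dependency_for_this_sketch"
    EXTRA = "gcp"

    def __init__(self):
        # Nothing is imported at construction time, so lookup always succeeds.
        self._client = None

    def _get_client(self):
        # Import the optional dependency only when an operation needs it.
        if self._client is None:
            try:
                self._client = importlib.import_module(self.REQUIRED_MODULE)
            except ImportError as exc:
                raise ImportError(
                    "Missing optional dependency; try: "
                    f"pip install apache-beam[{self.EXTRA}]"
                ) from exc
        return self._client

    def exists(self, path):
        client = self._get_client()
        return client.exists(path)  # hypothetical client call
```

With this shape, constructing the object never fails, and a caller that works with several filesystem types can catch ImportError at the point where it actually performs I/O.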

Why This Matters

  • Inconsistent API behavior is confusing
  • Code that handles multiple filesystem types can't handle a missing GCS dependency the same way it handles a missing S3 dependency
  • Dependency validation at usage time (not lookup time) allows for better error handling and lazy loading patterns

Environment

  • Apache Beam version: 2.70.0
  • Python version: 3.11

Generated by Claude Code, confirmed by @hjtran
