feat: 19 - cloud storage#19
Conversation
joefrost01
left a comment
There was a problem hiding this comment.
Claude review complete: LGTM on cloud prefix coverage (including abfss), s3_profile plumbing, and conditional extension loading; ready to merge once CI passes.
PR #19 Self-Review: Cloud Storage Path and Credential SupportWhat looks good
Needs work before mergeImportant: No tests for new functionalityCLAUDE.md requires: "Every new feature needs tests — unit tests for logic, integration tests for CLI behaviour." There are zero tests covering:
At minimum, add unit tests for:
Non-blocking observations (for follow-up)
|
joefrost01
left a comment
There was a problem hiding this comment.
Code Review: Cloud Storage Path & Credential Support
What's done well
s3_profilewiring is complete end-to-end: CLI → config merge →FileResolverConfig→CloudSettings→SET s3_profilein DuckDB. Every hop is present.abfss://prefix added consistently across all fouris_cloud_pathcall sites.requires_cloud_extensions()correctly checks all three primary cloud path sources (input files, ref paths, output path).- Conditional extension loading is a good design choice — keeps local runs fast.
- All 91 existing tests pass.
Issues to address
1. is_cloud_path duplicated across 4 modules (Important)
The identical function body exists in:
src/file_resolution.rs:390src/reference_tables.rs:106src/query_pipeline.rs:690src/fingerprint.rs:46(takes&Path, converts viato_string_lossy)
This is exactly how abfss:// got missed in the first place — it wasn't in any of them. When the next prefix arrives (r2://, hdfs://, etc.), someone has to find and update all four copies.
Recommendation: Extract a single pub fn is_cloud_path(path: &str) -> bool into a shared module (e.g. src/util.rs or alongside error types). The fingerprint module calls it with path.to_string_lossy().as_ref(). Single-function extraction, zero risk.
2. No tests added (Important)
CLAUDE.md states: "Every new feature needs tests — unit tests for logic, integration tests for CLI behaviour." This PR adds a new function (requires_cloud_extensions), a new config field (s3_profile), and modified behavior (conditional extension loading) — none have test coverage.
At minimum:
requires_cloud_extensionsunit tests: no cloud paths → false, cloud input → true, cloud ref → true, cloud output → true,abfss://recognizedis_cloud_pathtests: each prefix includingabfss://, non-cloud paths return falses3_profilepropagation: one testQueryArgswiths3_profileset (the existing test helpers already buildQueryArgswiths3_profile: None~10 times — adding one that sets a value is straightforward)
Minor notes (non-blocking)
requires_cloud_extensionsdoesn't inspect--filter-sql/--post-sqlfor embedded cloud paths (e.g.read_parquet('s3://...')). Parsing SQL would be fragile, but a code comment noting this known limitation would help future maintainers.- The fingerprint
is_cloud_pathtakes&Path— on Windows,Pathcan normalize://in URIs. Low risk for this project's target audience, but another reason to centralize on&str. - If a user passes
--s3-profilewith no cloud paths,SET s3_profileis issued without httpfs loaded. Consider loading extensions when any cloud credential flag is set, not just when cloud paths are detected.
What problem are you trying to solve?
Cloud-storage support was incomplete in runtime behavior:
--s3-profilewas accepted by CLI but not applied to DuckDB,abfss://paths were not recognized as cloud paths in several modules, and cloud extensions were always disabled in the main query pipeline even when cloud inputs/refs/outputs were used.What does this PR change?
This wires
s3_profilethrough engine/file-resolver configuration into DuckDBSET s3_profile, addsabfss://cloud path detection across resolver/reference/fingerprint/pipeline modules, and enables DuckDB cloud extension loading automatically when cloud paths are detected in input files, refs, or output destination.Does this change align with DESIGN.md?
Yes. It keeps the existing architecture of passing cloud paths directly to DuckDB and configuring extensions/options at engine initialization; no pipeline ordering changes.
What alternatives did you consider?
I considered always loading cloud extensions unconditionally, but that adds avoidable startup overhead for purely local runs. Conditional enablement based on detected cloud paths keeps local performance unchanged.
Does this PR contain multiple unrelated changes?
No. All changes are directly tied to feature 19 cloud storage support.
Existing PRs
Testing
cargo testpassescargo clippypasses with no warningscargo fmthas been runEvaluation
s3_profile.abfss://handling.--s3-profilenot applied;abfss://not recognized; cloud extensions could remain disabled in cloud runs.Human review