feat: AWS S3 + Glue lakehouse backend with S3 Files mount tutorial (#36)

everettVT wants to merge 4 commits
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6db5140c73
```diff
 # dependencies = [
 #     "daft[iceberg,sql]>=0.7.8",
-#     "pyiceberg[sql-sqlite,gcsfs,bigquery]",
+#     "pyiceberg[sql-sqlite,gcsfs,bigquery,s3fs,glue]",
```
Add Glue dependencies to the lakehouse extra

When users follow the new README command `AWS_S3_BUCKET=... uv run --extra lakehouse -m pipelines.lakehouse_analytics.ingest`, uv installs the project's `lakehouse` extra from `pyproject.toml`, which still declares `pyiceberg[sql-sqlite,gcsfs,bigquery]` and not the glue/S3 extras shown only in this PEP 723 header. In a fresh checkout the new AWS branch can therefore select `type="glue"` without the Glue dependencies (notably boto3) being installed, so the AWS mode fails before it can connect. Update the project extra and lockfile as well as the inline metadata.
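A minimal sketch of the suggested fix, assuming the extra lives under `[project.optional-dependencies]` (the exact layout of this repo's `pyproject.toml` may differ):

```toml
[project.optional-dependencies]
# Mirror the PEP 723 header above: the s3fs and glue extras pull in
# boto3 and the S3 FileIO that the new AWS branch needs.
lakehouse = [
    "daft[iceberg,sql]>=0.7.8",
    "pyiceberg[sql-sqlite,gcsfs,bigquery,s3fs,glue]",
]
```

Running `uv lock` afterwards keeps the lockfile in sync, so a fresh checkout installs the same dependency set.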
```python
**{
    "type": "glue",
    "warehouse": f"s3://{AWS_S3_BUCKET}",
    "region_name": AWS_REGION,
```
Pass AWS region using PyIceberg's current keys

In the AWS branch, `region_name` is a removed/ignored PyIceberg catalog property in current versions; PyIceberg expects keys such as `glue.region` or `client.region` (and `s3.region` if configuring FileIO separately). Because the documented/default `AWS_REGION` value is only copied into this `region_name` entry, users who do not also have a region configured in their AWS profile or environment can get a missing or wrong-region Glue client, even though the pipeline claims to default to `us-east-1`.
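A minimal sketch of the fix, using the property keys named above; `AWS_S3_BUCKET`/`AWS_REGION` stand in for the pipeline's env vars, and the exact key set depends on the PyIceberg version in use:

```python
import os

from pyiceberg.catalog import load_catalog

AWS_S3_BUCKET = os.environ["AWS_S3_BUCKET"]
AWS_REGION = os.getenv("AWS_REGION", "us-east-1")

catalog = load_catalog(
    "lakehouse",
    **{
        "type": "glue",
        "warehouse": f"s3://{AWS_S3_BUCKET}",
        "glue.region": AWS_REGION,  # region for the Glue client
        "s3.region": AWS_REGION,    # region for warehouse FileIO
    },
)
```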
Force-pushed from 1532b3d to 0b497e1
…nt option

The lakehouse catalog factory now supports three interchangeable backends: local SQLite (default), GCS + BigLake, and AWS S3 + Glue. The S3 backend pairs with AWS S3 Files for POSIX-mounted lakehouse access.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Expands the S3 Files mount section into a full nine-step walkthrough: prerequisites, client install, service role with IAM JSON, EC2 instance role, security groups, file system + mount target creation, mount, fstab persistence, dual-access verification, and a troubleshooting table.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Force-pushed from 0b497e1 to 96f5231
Summary
- Adds an AWS S3 + Glue backend to the lakehouse catalog factory; one env var (`AWS_S3_BUCKET`) toggles it on; the rest of the pipeline runs unchanged.
- Adds a `daft.File` knowledge-base example showing how one glob + one DataFrame handles every file type (code, docs, PDFs, media) in a single bucket (see the sketch after this list).
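To make the second bullet concrete without pinning down the `daft.File` API, here is a minimal sketch of the one-glob idea using `daft.from_glob_path`; the bucket name is hypothetical:

```python
import daft
from daft import col

# One glob over the whole bucket: code, docs, PDFs, media, everything.
df = daft.from_glob_path("s3://my-knowledge-base/**")

# One DataFrame handles every file type: tag each row by extension.
is_pdf = col("path").str.endswith(".pdf")
is_doc = col("path").str.endswith(".md") | col("path").str.endswith(".txt")
df = df.with_column("kind", is_pdf.if_else("pdf", is_doc.if_else("doc", "other")))

# Rough inventory: total bytes per file kind.
df.groupby("kind").agg(col("size").sum()).show()
```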
Why

S3 Files turns an S3 bucket into a local filesystem with strong consistency. That's the missing piece for AI workloads that need shared mutable storage — agent memory, knowledge bases, intermediate artifacts. Daft already reads from S3 unchanged; this PR makes the lakehouse story land alongside it: one bucket, one source of truth, three access paths (POSIX mount, S3 API, Iceberg catalog).
What's in the PR
- `pipelines/catalog.py` — new `elif AWS_S3_BUCKET` branch using PyIceberg's Glue catalog with an `s3://` warehouse (sketched below)
- `pipelines/lakehouse_analytics/README.md` — expanded S3 Files mount walkthrough
- `examples/files/daft_file_knowledge_base.py`
- `.env.example` — `AWS_S3_BUCKET` + `AWS_REGION`
- `README.md`, `tests/registry.py`
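A sketch of the catalog-factory toggle described in the first bullet; names and paths are placeholders, and the real factory also carries the GCS + BigLake branch:

```python
import os

from pyiceberg.catalog import load_catalog

AWS_S3_BUCKET = os.getenv("AWS_S3_BUCKET")
AWS_REGION = os.getenv("AWS_REGION", "us-east-1")

if AWS_S3_BUCKET:
    # AWS branch: Glue catalog over an s3:// warehouse.
    props = {
        "type": "glue",
        "warehouse": f"s3://{AWS_S3_BUCKET}",
        "glue.region": AWS_REGION,
    }
else:
    # Local default: SQLite-backed SQL catalog.
    props = {
        "type": "sql",
        "uri": "sqlite:///lakehouse.db",
        "warehouse": "file:///tmp/warehouse",
    }

catalog = load_catalog("lakehouse", **props)
```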
Stacks on

This PR is based on `everettVT/ship-lakehouse` (#35). Merge that first.

Test plan
- `uv run examples/files/daft_file_knowledge_base.py` runs against public HF data (42 files across doc/pdf/audio/video/other)
- `catalog.py` constructs the PyIceberg call correctly; the runtime path needs a real bucket to validate

Marketing companion
Eventual-Inc/marketing#297 — blog post tracking the "one bucket for everything" angle.
🤖 Generated with Claude Code