
feat: AWS S3 + Glue lakehouse backend with S3 Files mount tutorial #36

Open

everettVT wants to merge 4 commits into main from everettVT/aws-s3-files-guide

Conversation

@everettVT
Contributor

Summary

  • Adds AWS S3 + Glue as a third interchangeable lakehouse backend alongside the existing SQLite (local) and GCS BigLake modes. One env var (AWS_S3_BUCKET) toggles it on; the rest of the pipeline runs unchanged.
  • Pairs naturally with the new AWS S3 Files capability (launched 2026-04-07): mount your lakehouse bucket locally as a POSIX filesystem, write to it from agents/scripts with strong consistency, read it back with Daft via the Glue catalog.
  • Ships a standalone daft.File knowledge-base example showing how one glob + one DataFrame handles every file type (code, docs, PDFs, media) in a single bucket.
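The "one glob, every file type" pattern from the knowledge-base example can be sketched in plain Python. This is not the actual daft.File API from the example file, just the branch-by-extension idea it demonstrates; the category names are illustrative.

```python
from pathlib import Path

# Map file extensions to processing branches; anything unrecognized
# falls through to the "other" branch, as in the example's description.
CATEGORIES = {
    ".py": "code", ".md": "docs", ".txt": "docs",
    ".pdf": "pdf", ".mp3": "media", ".mp4": "media",
}

def categorize(paths):
    """Assign every file a processing branch, defaulting to 'other'."""
    return {p: CATEGORIES.get(Path(p).suffix.lower(), "other") for p in paths}
```

In the real example this dispatch happens per-row inside a single DataFrame rather than over a Python dict.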

Why

S3 Files turns an S3 bucket into a local filesystem with strong consistency. That's the missing piece for AI workloads that need shared mutable storage — agent memory, knowledge bases, intermediate artifacts. Daft already reads from S3 unchanged; this PR makes the lakehouse story land alongside it: one bucket, one source of truth, three access paths (POSIX mount, S3 API, Iceberg catalog).

What's in the PR

| File | Change |
| --- | --- |
| pipelines/catalog.py | New elif AWS_S3_BUCKET branch using PyIceberg's Glue catalog with an s3:// warehouse |
| pipelines/lakehouse_analytics/README.md | Three-backend table, AWS quick start, nine-step end-to-end S3 Files mount tutorial with full IAM JSON, security group rules, fstab persistence, and a troubleshooting table |
| examples/files/daft_file_knowledge_base.py | Mixed-content processing example: one glob, every file type, branch by extension |
| .env.example | AWS_S3_BUCKET + AWS_REGION |
| README.md, tests/registry.py | Wired up the new example |
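The env-var toggle that makes the third backend interchangeable can be sketched as follows. The AWS_S3_BUCKET and AWS_REGION names come from the PR; GCS_BUCKET and the exact PyIceberg property keys are assumptions, not the PR's verbatim code.

```python
import os

def catalog_properties() -> dict:
    """Rough sketch of the three-way backend selection in pipelines/catalog.py.

    AWS_S3_BUCKET/AWS_REGION are the PR's env vars; GCS_BUCKET and the
    property keys for the non-AWS branches are illustrative assumptions.
    """
    aws_bucket = os.getenv("AWS_S3_BUCKET")
    gcs_bucket = os.getenv("GCS_BUCKET")  # assumed name for the BigLake path
    if aws_bucket:
        return {
            "type": "glue",
            "warehouse": f"s3://{aws_bucket}",
            "glue.region": os.getenv("AWS_REGION", "us-east-1"),
        }
    if gcs_bucket:
        return {"type": "sql", "warehouse": f"gs://{gcs_bucket}"}  # simplified
    # Default: local SQLite-backed catalog, no env vars required
    return {"type": "sql", "uri": "sqlite:///lakehouse.db"}
```

With no env vars set, the function falls through to the local SQLite default, which is why the rest of the pipeline runs unchanged.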

Stacks on

This PR is based on everettVT/ship-lakehouse (#35). Merge that first.

Test plan

  • uv run examples/files/daft_file_knowledge_base.py runs against public HF data (42 files across doc/pdf/audio/video/other)
  • Local SQLite mode still works (no env vars set)
  • BigLake mode untested in this PR (covered by parent [codex] Add lakehouse analytics pipeline #35)
  • AWS Glue mode requires real AWS infra — verified that catalog.py constructs the PyIceberg call correctly, runtime path needs a real bucket to validate
  • S3 Files mount tutorial commands need a real AWS account to validate end-to-end

Marketing companion

Eventual-Inc/marketing#297 — blog post tracking the "one bucket for everything" angle.

🤖 Generated with Claude Code


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6db5140c73


Comment thread: pipelines/catalog.py

```diff
 # dependencies = [
 # "daft[iceberg,sql]>=0.7.8",
-# "pyiceberg[sql-sqlite,gcsfs,bigquery]",
+# "pyiceberg[sql-sqlite,gcsfs,bigquery,s3fs,glue]",
```


P1 Badge Add Glue dependencies to the lakehouse extra

When users follow the new README command AWS_S3_BUCKET=... uv run --extra lakehouse -m pipelines.lakehouse_analytics.ingest, uv installs the project's lakehouse extra from pyproject.toml, which still has pyiceberg[sql-sqlite,gcsfs,bigquery] and not the glue/S3 extras shown only in this PEP 723 header. In a fresh checkout the new AWS branch can therefore select type="glue" without the Glue dependencies (notably boto3) being installed, so the AWS mode fails before it can connect; update the project extra/lockfile as well as the inline metadata.
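A fix along the lines the review suggests would extend the project's lakehouse extra in pyproject.toml to match the PEP 723 header. The extra's exact shape below is an assumption reconstructed from the snippet above, not the repo's actual file:

```toml
[project.optional-dependencies]
# Assumed shape — keep in sync with the PEP 723 header in pipelines/catalog.py
lakehouse = [
    "daft[iceberg,sql]>=0.7.8",
    "pyiceberg[sql-sqlite,gcsfs,bigquery,s3fs,glue]",
]
```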


Comment thread: pipelines/catalog.py (outdated)

```python
**{
    "type": "glue",
    "warehouse": f"s3://{AWS_S3_BUCKET}",
    "region_name": AWS_REGION,
```


P1 Badge Pass AWS region using PyIceberg's current keys

For the AWS branch, region_name is a removed/ignored PyIceberg catalog property in current versions; PyIceberg expects keys such as glue.region or client.region (and s3.region if configuring FileIO separately). With the documented/default AWS_REGION value only copied into this region_name entry, users who do not also have a region configured in their AWS profile/environment can get a missing or wrong-region Glue client even though the pipeline claims to default to us-east-1.
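A corrected property set, using the current key names the review points to (glue.region / s3.region instead of the removed region_name), might look like this. The bucket and region values are placeholders, and actually connecting still requires real AWS credentials:

```python
# Sketch of Glue catalog properties without the removed region_name key.
# Bucket and region values are placeholders, not the PR's configuration.
AWS_S3_BUCKET = "my-lakehouse-bucket"
AWS_REGION = "us-east-1"

props = {
    "type": "glue",
    "warehouse": f"s3://{AWS_S3_BUCKET}",
    "glue.region": AWS_REGION,  # region for the Glue client
    "s3.region": AWS_REGION,    # region for the S3 FileIO
}

# from pyiceberg.catalog import load_catalog
# catalog = load_catalog("lakehouse", **props)  # needs real AWS credentials
```

Setting the region on both the Glue client and the FileIO avoids depending on a region configured in the ambient AWS profile.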


@everettVT everettVT changed the base branch from everettVT/ship-lakehouse to main May 11, 2026 14:46
@everettVT everettVT force-pushed the everettVT/aws-s3-files-guide branch from 1532b3d to 0b497e1 Compare May 11, 2026 14:47
everettVT and others added 4 commits May 11, 2026 17:01
…nt option

The lakehouse catalog factory now supports three interchangeable backends:
local SQLite (default), GCS + BigLake, and AWS S3 + Glue. The S3 backend
pairs with AWS S3 Files for POSIX-mounted lakehouse access.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Expands the S3 Files mount section into a full nine-step walkthrough:
prerequisites, client install, service role with IAM JSON, EC2 instance
role, security groups, file system + mount target creation, mount, fstab
persistence, dual-access verification, and a troubleshooting table.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@everettVT everettVT force-pushed the everettVT/aws-s3-files-guide branch from 0b497e1 to 96f5231 Compare May 12, 2026 00:01