
feat: AWS S3 + Glue lakehouse backend with S3 Files mount tutorial #36

Open

everettVT wants to merge 4 commits into main from everettVT/aws-s3-files-guide

Conversation

@everettVT
Contributor

Summary

  • Adds AWS S3 + Glue as a third interchangeable lakehouse backend alongside the existing SQLite (local) and GCS BigLake modes. One env var (AWS_S3_BUCKET) toggles it on; the rest of the pipeline runs unchanged.
  • Pairs naturally with the new AWS S3 Files capability (launched 2026-04-07): mount your lakehouse bucket locally as a POSIX filesystem, write to it from agents/scripts with strong consistency, read it back with Daft via the Glue catalog.
  • Ships a standalone daft.File knowledge-base example showing how one glob + one DataFrame handles every file type (code, docs, PDFs, media) in a single bucket.
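The "one glob, every file type" pattern from the knowledge-base example can be sketched in plain Python. This is not the actual daft.File API from the example file, just the branch-by-extension idea it demonstrates; the category names are illustrative.

```python
from pathlib import Path

# Map file extensions to processing branches; anything unrecognized
# falls through to the "other" branch, as in the example's description.
CATEGORIES = {
    ".py": "code", ".md": "docs", ".txt": "docs",
    ".pdf": "pdf", ".mp3": "media", ".mp4": "media",
}

def categorize(paths):
    """Assign every file a processing branch, defaulting to 'other'."""
    return {p: CATEGORIES.get(Path(p).suffix.lower(), "other") for p in paths}
```

In the real example this dispatch happens per-row inside a single DataFrame rather than over a Python dict.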

Why

S3 Files turns an S3 bucket into a local filesystem with strong consistency. That's the missing piece for AI workloads that need shared mutable storage — agent memory, knowledge bases, intermediate artifacts. Daft already reads from S3 unchanged; this PR makes the lakehouse story land alongside it: one bucket, one source of truth, three access paths (POSIX mount, S3 API, Iceberg catalog).

What's in the PR

| File | Change |
| --- | --- |
| pipelines/catalog.py | New elif AWS_S3_BUCKET branch using PyIceberg's Glue catalog with an s3:// warehouse |
| pipelines/lakehouse_analytics/README.md | Three-backend table, AWS quick start, nine-step end-to-end S3 Files mount tutorial with full IAM JSON, security group rules, fstab persistence, and a troubleshooting table |
| examples/files/daft_file_knowledge_base.py | Mixed-content processing example: one glob, every file type, branch by extension |
| .env.example | AWS_S3_BUCKET + AWS_REGION |
| README.md, tests/registry.py | Wired up the new example |
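The env-var toggle that makes the third backend interchangeable can be sketched as follows. The AWS_S3_BUCKET and AWS_REGION names come from the PR; GCS_BUCKET and the exact PyIceberg property keys are assumptions, not the PR's verbatim code.

```python
import os

def catalog_properties() -> dict:
    """Rough sketch of the three-way backend selection in pipelines/catalog.py.

    AWS_S3_BUCKET/AWS_REGION are the PR's env vars; GCS_BUCKET and the
    property keys for the non-AWS branches are illustrative assumptions.
    """
    aws_bucket = os.getenv("AWS_S3_BUCKET")
    gcs_bucket = os.getenv("GCS_BUCKET")  # assumed name for the BigLake path
    if aws_bucket:
        return {
            "type": "glue",
            "warehouse": f"s3://{aws_bucket}",
            "glue.region": os.getenv("AWS_REGION", "us-east-1"),
        }
    if gcs_bucket:
        return {"type": "sql", "warehouse": f"gs://{gcs_bucket}"}  # simplified
    # Default: local SQLite-backed catalog, no env vars required
    return {"type": "sql", "uri": "sqlite:///lakehouse.db"}
```

With no env vars set, the function falls through to the local SQLite default, which is why the rest of the pipeline runs unchanged.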

Stacks on

This PR is based on everettVT/ship-lakehouse (#35). Merge that first.

Test plan

  • uv run examples/files/daft_file_knowledge_base.py runs against public HF data (42 files across doc/pdf/audio/video/other)
  • Local SQLite mode still works (no env vars set)
  • BigLake mode untested in this PR (covered by parent [codex] Add lakehouse analytics pipeline #35)
  • AWS Glue mode requires real AWS infra — verified that catalog.py constructs the PyIceberg call correctly, runtime path needs a real bucket to validate
  • S3 Files mount tutorial commands need a real AWS account to validate end-to-end

Marketing companion

Eventual-Inc/marketing#297 — blog post tracking the "one bucket for everything" angle.

🤖 Generated with Claude Code


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6db5140c73


Comment thread: pipelines/catalog.py

```diff
 # dependencies = [
 # "daft[iceberg,sql]>=0.7.8",
-# "pyiceberg[sql-sqlite,gcsfs,bigquery]",
+# "pyiceberg[sql-sqlite,gcsfs,bigquery,s3fs,glue]",
```


P1 Badge Add Glue dependencies to the lakehouse extra

When users follow the new README command AWS_S3_BUCKET=... uv run --extra lakehouse -m pipelines.lakehouse_analytics.ingest, uv installs the project's lakehouse extra from pyproject.toml, which still has pyiceberg[sql-sqlite,gcsfs,bigquery] and not the glue/S3 extras shown only in this PEP 723 header. In a fresh checkout the new AWS branch can therefore select type="glue" without the Glue dependencies (notably boto3) being installed, so the AWS mode fails before it can connect; update the project extra/lockfile as well as the inline metadata.
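A fix along the lines the review suggests would extend the project's lakehouse extra in pyproject.toml to match the PEP 723 header. The extra's exact shape below is an assumption reconstructed from the snippet above, not the repo's actual file:

```toml
[project.optional-dependencies]
# Assumed shape — keep in sync with the PEP 723 header in pipelines/catalog.py
lakehouse = [
    "daft[iceberg,sql]>=0.7.8",
    "pyiceberg[sql-sqlite,gcsfs,bigquery,s3fs,glue]",
]
```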


Comment thread: pipelines/catalog.py (outdated)

```python
**{
    "type": "glue",
    "warehouse": f"s3://{AWS_S3_BUCKET}",
    "region_name": AWS_REGION,
```


P1 Badge Pass AWS region using PyIceberg's current keys

For the AWS branch, region_name is a removed/ignored PyIceberg catalog property in current versions; PyIceberg expects keys such as glue.region or client.region (and s3.region if configuring FileIO separately). With the documented/default AWS_REGION value only copied into this region_name entry, users who do not also have a region configured in their AWS profile/environment can get a missing or wrong-region Glue client even though the pipeline claims to default to us-east-1.
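A corrected property set, using the current key names the review points to (glue.region / s3.region instead of the removed region_name), might look like this. The bucket and region values are placeholders, and actually connecting still requires real AWS credentials:

```python
# Sketch of Glue catalog properties without the removed region_name key.
# Bucket and region values are placeholders, not the PR's configuration.
AWS_S3_BUCKET = "my-lakehouse-bucket"
AWS_REGION = "us-east-1"

props = {
    "type": "glue",
    "warehouse": f"s3://{AWS_S3_BUCKET}",
    "glue.region": AWS_REGION,  # region for the Glue client
    "s3.region": AWS_REGION,    # region for the S3 FileIO
}

# from pyiceberg.catalog import load_catalog
# catalog = load_catalog("lakehouse", **props)  # needs real AWS credentials
```

Setting the region on both the Glue client and the FileIO avoids depending on a region configured in the ambient AWS profile.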


@everettVT everettVT changed the base branch from everettVT/ship-lakehouse to main May 11, 2026 14:46
@everettVT everettVT force-pushed the everettVT/aws-s3-files-guide branch from 1532b3d to 0b497e1 Compare May 11, 2026 14:47
everettVT and others added 4 commits May 11, 2026 17:01
…nt option

The lakehouse catalog factory now supports three interchangeable backends:
local SQLite (default), GCS + BigLake, and AWS S3 + Glue. The S3 backend
pairs with AWS S3 Files for POSIX-mounted lakehouse access.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Expands the S3 Files mount section into a full nine-step walkthrough:
prerequisites, client install, service role with IAM JSON, EC2 instance
role, security groups, file system + mount target creation, mount, fstab
persistence, dual-access verification, and a troubleshooting table.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@everettVT everettVT force-pushed the everettVT/aws-s3-files-guide branch from 0b497e1 to 96f5231 Compare May 12, 2026 00:01