feat(pre_seed): support .tar.gz/.tgz archives #1196

Merged: thepagent merged 3 commits into main from feat/pre-seed-tarball-support on Jun 25, 2026

@chaodu-agent

Summary

Adds tarball extraction to the pre_seed phase, so existing .tar.gz home backups can be consumed directly without migration to zip.

Format Detection

Based on S3 URI extension:

  • .zip → zip extraction (existing)
  • .tar.gz / .tgz → gzipped tarball extraction (new)
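A minimal sketch of the suffix rule described above, not the PR's actual code. `ArchiveFormat` and `format_from_uri` are hypothetical names chosen for illustration. Note that a plain `ends_with` match is case-sensitive and would miss URIs that carry anything after the extension.

```rust
/// Archive formats distinguished by the URI-extension rule.
#[derive(Debug, PartialEq, Eq)]
enum ArchiveFormat {
    Zip,
    TarGz,
}

/// Hypothetical helper: classify an S3 URI by its suffix.
/// Returns None for anything that is not .zip, .tar.gz, or .tgz.
fn format_from_uri(uri: &str) -> Option<ArchiveFormat> {
    if uri.ends_with(".zip") {
        Some(ArchiveFormat::Zip)
    } else if uri.ends_with(".tar.gz") || uri.ends_with(".tgz") {
        Some(ArchiveFormat::TarGz)
    } else {
        None
    }
}
```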

Example

[hooks.pre_seed]
sources = [
  "s3://openab-state-pahud/chaodu-home.tar.gz",    # existing tarball
  "s3://openab-state-pahud/shared/openab.zip",     # zip also works
]

Implementation

  • Uses flate2 (gzip decompression) + tar (archive extraction)
  • Same safety limits: max file count, max extracted bytes, cooperative deadline
  • Preserves Unix permissions
  • Feature-gated under pre-seed
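The three safety limits listed above can be pictured as a single budget object that is charged once per archive entry. This is a stdlib-only sketch under stated assumptions: `ExtractBudget`, `BudgetError`, and `charge` are hypothetical names, and the real pre_seed infrastructure may be structured differently.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
enum BudgetError {
    TooManyFiles,
    TooManyBytes,
    DeadlineExceeded,
}

/// Hypothetical budget tracker: file count, byte count, and a deadline
/// that is polled cooperatively every `check_every` entries.
struct ExtractBudget {
    max_files: u64,
    max_bytes: u64,
    deadline: Instant,
    check_every: u64,
    files: u64,
    bytes: u64,
}

impl ExtractBudget {
    fn new(max_files: u64, max_bytes: u64, timeout: Duration, check_every: u64) -> Self {
        Self {
            max_files,
            max_bytes,
            deadline: Instant::now() + timeout,
            check_every: check_every.max(1), // avoid a modulo by zero
            files: 0,
            bytes: 0,
        }
    }

    /// Account for one entry before it is written to disk.
    fn charge(&mut self, entry_bytes: u64) -> Result<(), BudgetError> {
        self.files += 1;
        if self.files > self.max_files {
            return Err(BudgetError::TooManyFiles);
        }
        self.bytes = self.bytes.saturating_add(entry_bytes);
        if self.bytes > self.max_bytes {
            return Err(BudgetError::TooManyBytes);
        }
        // Only look at the clock every `check_every` entries.
        if self.files % self.check_every == 0 && Instant::now() >= self.deadline {
            return Err(BudgetError::DeadlineExceeded);
        }
        Ok(())
    }
}
```

In an extraction loop, `charge` would be called with each entry's size before unpacking it, and any error would abort the whole extraction.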

Tests

All 10 existing tests pass. Tarball extraction shares the same budget/deadline infrastructure.

Adds tarball extraction using flate2 + tar crates. Format is detected
from the S3 URI extension:
- .zip → zip extraction (existing)
- .tar.gz / .tgz → gzipped tarball extraction (new)

Same safety limits apply: max file count, max extracted bytes,
cooperative deadline checks, permission preservation.

This enables pre_seed to consume existing home tarballs directly
without requiring migration to zip format.
@chaodu-agent requested a review from thepagent as a code owner on June 25, 2026 02:33

- Pin tar >= 0.4.45 (CVE-2026-33056 fix), disable default features
- Switch format detection from URI extension to magic bytes (0x1f 0x8b)
- Remove uri param from extract_and_apply (no longer needed)
- Disable set_preserve_permissions, use manual chmod stripping suid/sgid
- Reduce deadline check interval from 100 to 10 files
- Add filetime + tar to Cargo.lock
- Add tarball-specific unit tests: basic extraction, magic bytes
  detection, and deadline enforcement
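To illustrate the manual chmod step from this commit, here is a small sketch of masking a tar header mode before applying it. `sanitize_mode` and `apply_mode` are hypothetical names, not functions from the PR. The `0o777` mask keeps only the user/group/other rwx bits, so setuid, setgid, and the sticky bit are all dropped.

```rust
/// Keep only the rwx bits for user, group, and other.
/// Drops setuid (0o4000), setgid (0o2000), and sticky (0o1000).
fn sanitize_mode(mode: u32) -> u32 {
    mode & 0o777
}

/// Hypothetical helper: apply the sanitized mode to an extracted file.
#[cfg(unix)]
fn apply_mode(path: &std::path::Path, header_mode: u32) -> std::io::Result<()> {
    use std::os::unix::fs::PermissionsExt;
    let perms = std::fs::Permissions::from_mode(sanitize_mode(header_mode));
    std::fs::set_permissions(path, perms)
}
```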

@chaodu-agent

Note

LGTM ✅ — All findings from group review addressed in 8494cd8.

What This PR Does

Adds .tar.gz/.tgz extraction support to the pre_seed phase, allowing existing tarball home backups to be consumed directly without migration to zip format.

How It Works

The format is auto-detected from the gzip magic bytes (0x1f, 0x8b), so detection no longer depends on the URI extension. Decompression uses flate2, and extraction uses the tar crate (≥ 0.4.45, which includes the CVE-2026-33056 fix). The tarball path shares the budget/deadline infrastructure of the zip path and extracts atomically via a temp directory.
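A minimal sketch of magic-byte detection as described here. `SniffedFormat` and `sniff_format` are hypothetical names. The gzip bytes come from the PR; the zip signature check (`PK\x03\x04`, the local file header) is an assumption added for illustration, since the PR text does not say how non-gzip input is classified.

```rust
#[derive(Debug, PartialEq, Eq)]
enum SniffedFormat {
    Gzip,
    Zip,
    Unknown,
}

/// Hypothetical helper: classify an archive by its leading bytes.
fn sniff_format(head: &[u8]) -> SniffedFormat {
    if head.starts_with(&[0x1f, 0x8b]) {
        // gzip member header: ID1 = 0x1f, ID2 = 0x8b
        SniffedFormat::Gzip
    } else if head.starts_with(b"PK\x03\x04") {
        // zip local file header signature (assumed check, not from the PR)
        SniffedFormat::Zip
    } else {
        SniffedFormat::Unknown
    }
}
```

Because only the payload is inspected, the result is the same whatever the object key looks like, including upper-case extensions or keys with no extension at all.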

Review Summary

Group review identified 2 critical + 5 important findings. All were fixed in commit 8494cd8. Re-review confirmed all fixes.

Findings

| # | Severity | Finding | Resolution |
|---|----------|---------|------------|
| 1 | 🔴 | Cargo.lock missing tar crate; build not reproducible | ✅ Fixed: lockfile updated with tar 0.4.46 + filetime |
| 2 | 🔴 | CVE-2026-33056: tar = "0.4" allows vulnerable versions | ✅ Fixed: pinned "0.4.45", default-features = false |
| 3 | 🟡 | set_preserve_permissions(true) preserves suid/sgid bits | ✅ Fixed: disabled, manual chmod with & 0o0777 mask |
| 4 | 🟡 | URI extension detection fragile (case-sensitive, query params) | ✅ Fixed: magic bytes [0x1f, 0x8b] detection, uri param removed |
| 5 | 🟡 | No tarball-specific unit tests | ✅ Fixed: 3 tests added (basic, magic bytes, deadline) |
| 6 | 🟡 | Deadline check only every 100 files | ✅ Fixed: reduced to every 10 files |
| 7 | 🟡 | entry.size() relies on header-declared value | ℹ️ Accepted: low risk, tar crate limits reads internally |
| 8 | 🟢 | Atomic extraction via temp directory | Excellent design ✅ |
| 9 | 🟢 | unpack_in correctly prevents path traversal | Confirmed ✅ |
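For readers unfamiliar with the path traversal concern in finding 9, the following sketch shows the general idea of refusing entry paths that could escape the destination. It is not the tar crate's `unpack_in` implementation; `safe_join` is a hypothetical helper, and it does not address symlinks inside the archive.

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical helper: join an archive entry path onto `dest`,
/// rejecting `..`, absolute paths, and Windows prefixes.
fn safe_join(dest: &Path, entry: &Path) -> Option<PathBuf> {
    let mut out = dest.to_path_buf();
    for component in entry.components() {
        match component {
            Component::Normal(part) => out.push(part),
            Component::CurDir => {} // "./" is harmless, skip it
            Component::ParentDir | Component::RootDir | Component::Prefix(_) => return None,
        }
    }
    Some(out)
}
```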

Follow-up Suggestions (non-blocking)

  • Add extract_tarball_budgeted(..., max_file_count, max_extracted_bytes) helper + small tar limit tests to mirror zip test coverage
  • Add --features pre-seed to CI workflow for compile/test coverage
  • Consider post-extraction size verification via std::fs::metadata for defense-in-depth against gzip bombs
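The third suggestion could look roughly like the sketch below: walk the extracted tree with `std::fs` metadata and compare the total against the byte limit. `extracted_size` and `verify_extracted_size` are hypothetical names, and the `max_extracted_bytes` parameter is assumed to be the same limit used during extraction.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Sum the on-disk sizes of regular files under `root`.
/// Symlinks are not followed and do not count toward the total.
fn extracted_size(root: &Path) -> io::Result<u64> {
    let mut total: u64 = 0;
    let mut stack = vec![root.to_path_buf()];
    while let Some(dir) = stack.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            let meta = fs::symlink_metadata(&path)?;
            if meta.is_dir() {
                stack.push(path);
            } else if meta.is_file() {
                total = total.saturating_add(meta.len());
            }
        }
    }
    Ok(total)
}

/// Hypothetical post-extraction check against the byte budget.
fn verify_extracted_size(root: &Path, max_extracted_bytes: u64) -> io::Result<()> {
    let total = extracted_size(root)?;
    if total > max_extracted_bytes {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("extracted {total} bytes, limit is {max_extracted_bytes}"),
        ));
    }
    Ok(())
}
```

Running this against the temp directory before it is applied would catch cases where the header-declared sizes understate what was actually written.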

Baseline Check

  • PR opened: 2026-06-25
  • Main already has: zip-based pre_seed with full safety infrastructure (budget, deadline, path traversal protection)
  • Net-new value: tarball format support for existing .tar.gz backups without format migration

What's Good (🟢)

  • Atomic extraction pattern prevents corrupted state on failure
  • unpack_in correctly handles path traversal protection
  • Clean feature-gating under pre-seed
  • Reuses existing budget/deadline infrastructure
  • Magic bytes detection is robust against filename variations
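The atomic extraction pattern praised above can be sketched with the standard library alone: populate a staging directory next to the destination, then rename it into place only on success. `extract_atomically` is a hypothetical helper, and the sketch assumes the destination does not exist yet and lives on the same filesystem as the staging directory; how the PR applies the temp directory may differ.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: run `populate` against a staging directory and
/// move it to `dest` only if population succeeded.
fn extract_atomically<F>(dest: &Path, populate: F) -> io::Result<()>
where
    F: FnOnce(&Path) -> io::Result<()>,
{
    let parent = dest.parent().unwrap_or_else(|| Path::new("."));
    let name = dest.file_name().ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "destination has no file name")
    })?;
    let staging = parent.join(format!(
        ".{}.staging-{}",
        name.to_string_lossy(),
        std::process::id()
    ));
    fs::create_dir_all(&staging)?;

    // Rename is the commit point; nothing is visible at `dest` before it.
    let result = populate(&staging).and_then(|()| fs::rename(&staging, dest));
    if result.is_err() {
        // Best-effort cleanup so a failed run leaves no partial tree behind.
        let _ = fs::remove_dir_all(&staging);
    }
    result
}
```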

thepagent previously approved these changes Jun 25, 2026
- Update hooks.md and config-reference.md to document all supported
  formats: .zip, .tar.gz, .tgz (auto-detected via magic bytes)
- Update examples to show mixed format usage
- Document path traversal prevention and permission hardening for both
  zip and tarball paths
@thepagent merged commit 64350c5 into main on Jun 25, 2026
20 checks passed