Skip to content

Fix issues #4, #6, and #14: Documentation, error messages, and invalid dump generation#16

Merged
gmr merged 4 commits into
mainfrom
fix-issues-4-6-14
Nov 19, 2025
Merged

Fix issues #4, #6, and #14: Documentation, error messages, and invalid dump generation#16
gmr merged 4 commits into
mainfrom
fix-issues-4-6-14

Conversation

@gmr
Copy link
Copy Markdown
Owner

@gmr gmr commented Nov 19, 2025

Summary

This PR fixes three GitHub issues with comprehensive solutions:

Issues Resolved

Issue #4: Documentation Typo

Fixed incorrect reference to -Fd (directory format) in the load() function docstring. pgdumplib only supports custom format dumps created with pg_dump -Fc.

Issue #14: Better Error Messaging

Enhanced error messages when attempting to load invalid or incompatible dump files. The library now detects:

  • Plain SQL text files (starting with --, /*, CREATE, INSERT, etc.)
  • Invalid binary formats

Users now receive helpful guidance instead of generic "Invalid archive header" errors, making it clear that pgdumplib requires custom format dumps created with pg_dump -Fc.

Issue #6: Invalid Dump After Load/Save (Critical Bug)

Fixed a critical bug where loading and saving dumps with PostgreSQL v12+ format caused pg_restore to fail with:

ERROR: zero-length delimited identifier at or near ""
LINE 1: SET default_table_access_method = "";

Root Cause: The _write_str() method treated both None and empty string identically, writing length 0 for both. PostgreSQL requires a distinction between "not set" (length -1) and "empty string" (length 0).

Solution:

  1. Normalize empty tableam/tablespace strings to None during load
  2. Update add_entry() to default these fields to None instead of empty strings
  3. Fix _write_str() to write -1 for None and 0 for empty strings

Changes Made

Documentation (pgdumplib/__init__.py)

  • Fixed docstring typo: -Fd-Fc

Core Library (pgdumplib/dump.py)

  • Enhanced _read_header() with intelligent file type detection and helpful error messages
  • Normalized empty tableam/tablespace to None in _read_entry()
  • Updated add_entry() to default tableam/tablespace to None
  • Fixed _write_str() to properly distinguish None from empty string

Tests

Testing

Verification

Tested the fix for issue #6 with the original dump file:

# Before fix:
pg_restore output contained: SET default_table_access_method = "";
# Result: ERROR: zero-length delimited identifier

# After fix:
pg_restore output contains: SET default_table_access_method = heap;
# Result: Valid SQL, restoration succeeds

Resolves #4
Resolves #6
Resolves #14

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Clearer, contextual errors when loading incompatible or corrupted archives (detects plain SQL vs invalid binary).
    • Optional string fields are now treated as truly unset instead of empty, preventing empty settings during restore and ensuring correct table access method restoration.
  • Documentation

    • Updated loader documentation to reflect the supported archive format.
  • Tests

    • Added regression and error-validation tests covering unsupported formats and correct round-trip handling of optional fields.

✏️ Tip: You can customize this high-level summary in your review settings.

gmr and others added 2 commits November 19, 2025 15:47
Changed incorrect `-Fd` to `-Fc` in the docstring for the `load()`
function. pgdumplib only supports custom format dumps created with
`pg_dump -Fc`, not directory format (`-Fd`).

Fixes #4

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Enhanced the error message when attempting to load invalid or
incompatible dump files. The library now detects:
- Plain SQL text files (starting with --, /*, CREATE, INSERT, etc.)
- Invalid binary formats

Instead of the generic "Invalid archive header" error, users now
receive helpful guidance like:
"This appears to be a plain SQL text file. pgdumplib only supports
custom format dumps created with pg_dump -Fc"

Added comprehensive test coverage for various invalid file formats.

Fixes #14

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Nov 19, 2025

Walkthrough

Docstring corrected to state -Fc; serialization changed to represent unset optional strings as None (written as -1 length); reader normalizes empty tablespace/tableam to None; header validation now differentiates plain SQL and invalid binaries; tests added for error cases and a regression for tableam handling.

Changes

Cohort / File(s) Summary
Documentation Fix
pgdumplib/__init__.py
Corrected load() docstring to state it loads pg_dump files created with -Fc (was -Fd).
Dump serialization & parsing
pgdumplib/dump.py
Optional string fields (tablespace, tableam) now default to None; _write_str writes None as length -1; _read_entry normalizes empty-string fields to None; _read_header provides contextual ValueErrors detecting plain SQL text vs invalid archive.
Error handling tests
tests/test_dump.py
Added tests: test_plain_sql_file_with_comment, test_plain_sql_file_with_create, test_invalid_binary_format to assert descriptive ValueErrors referencing "plain SQL text file"/"Invalid archive" and suggesting "pg_dump -Fc".
Regression tests for tableam
tests/test_save_dump.py
Added Issue6RegressionTestCase with test_load_save_no_empty_tableam_sql() to ensure saved dumps do not emit SET default_table_access_method = "" and preserve non-empty tableam. (Duplication of the test class/method appears in diff context.)

Sequence Diagram(s)

sequenceDiagram
  participant U as User
  participant L as pgdumplib.load()
  participant R as dump._read_header/_read_entry
  participant S as Dump.save()/_write_str

  rect rgb(248,249,255)
    U->>L: open file path
    L->>R: read header bytes
    alt archive magic detected
      R->>R: parse entries\nnormalize '' -> None (tablespace/tableam)
      R-->>L: return Dump
    else plain SQL detected
      R-->>L: raise ValueError("plain SQL text file" + hint "pg_dump -Fc")
    else invalid binary
      R-->>L: raise ValueError("Invalid archive" + hint "pg_dump -Fc")
    end
  end

  rect rgb(249,255,250)
    U->>S: save Dump to file
    S->>S: write header\nfor optional strings: write -1 if None
    S-->>U: output file
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Inspect header detection heuristics in pgdumplib/dump.py for edge cases with different pg_dump versions.
  • Verify _write_str change (writing -1 for None) is correct and fully covered by tests, especially round-trip scenarios.
  • Review duplicated test class/method in tests/test_save_dump.py and cleanup subprocess/pg_restore usage for portability.

"I nibble bytes and sniff the head,
I turn empty strings to None instead.
I guard the dump from plain SQL fright,
and save tableam just right.
🐇✨"

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 77.78% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the three main changes: documentation fix for issue #4, error messages for issue #14, and invalid dump generation fix for issue #6.
Linked Issues check ✅ Passed All three linked issues are addressed: #4 updates documentation, #6 fixes None/empty-string distinction for tableam/tablespace, and #14 improves header validation with contextual error messages.
Out of Scope Changes check ✅ Passed All changes in the PR are directly related to the three linked issues; no unrelated modifications or scope creep detected.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix-issues-4-6-14

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
pgdumplib/dump.py (3)

656-680: Normalization of tablespace/tableam is good; consider aligning add_entry defaults

Normalizing empty tablespace and tableam to None in _read_entry() is a solid fix and pairs well with the new _write_str() semantics to avoid emitting invalid empty values.

One minor inconsistency is that add_entry() still coerces None to empty strings:

tablespace=tablespace or '',
tableam=tableam or '',

so new entries created via the API will never have None for these fields, only ''. That’s not a blocker for the load/save regression you’re fixing (since entries coming from real dumps are normalized to None), but it does mean the public API can’t distinguish “unset” from “explicitly empty” when creating dumps.

If you want the behavior to be fully consistent and future‑proof, you could let None pass through unchanged:

-                tablespace=tablespace or '',
-                tableam=tableam or '',
+                tablespace=tablespace,
+                tableam=tableam,

so that callers can choose between None (not set) and '' (explicitly empty), matching how _read_entry() and _write_str() now behave.

Also applies to: 193-282


713-743: Header detection logic matches tests; optional: always mention -Fc in generic error

The new _read_header() logic cleanly separates three cases:

  • Valid custom-format (MAGIC matches) → normal path.
  • Plain SQL (comments or leading SQL keywords in the first 5 bytes) → “plain SQL text file … pg_dump -Fc” message.
  • Non‑printable or empty prefix → “Invalid archive format … pg_dump -Fc” message.

This aligns with the new tests and gives much clearer guidance than the prior generic error.

If you’d like to make the UX even more consistent for other “printable but wrong” prefixes (e.g., a random text file starting with some other letters), consider updating the default error_msg to also reference pg_dump -Fc, e.g.:

-        if magic_bytes != constants.MAGIC:
-            # Provide helpful error messages based on file content
-            error_msg = 'Invalid archive header'
+        if magic_bytes != constants.MAGIC:
+            # Provide helpful error messages based on file content
+            error_msg = (
+                'Invalid archive header. '
+                'pgdumplib only supports custom format dumps '
+                'created with pg_dump -Fc'
+            )

leaving the more specific branches as-is. That way every “not a custom archive” error nudges users toward the correct pg_dump flag.


1082-1098: _write_str correctly distinguishes None vs empty string; update docstring to reflect

The new _write_str() behavior—writing -1 for None and 0 for ''—is exactly what’s needed to preserve the “not set vs empty” distinction in the archive format and resolves the invalid "" tableam issue when combined with the normalization in _read_entry().

The only small mismatch is the docstring, which still documents value as str only. To keep the API self‑documenting, you could adjust it along these lines:

-    def _write_str(self, value: str | None) -> None:
-        """Write a string
-
-        :param str value: The string to write
+    def _write_str(self, value: str | None) -> None:
+        """Write a string or a NULL marker.
+
+        :param value: The string to write, or None to write a -1 length
+            (indicating an unset/NULL field in the archive format).

Behavior itself looks good and is consistent with _read_bytes().

tests/test_save_dump.py (1)

3-6: Issue #6 regression test effectively covers empty tableam behavior

The new Issue6RegressionTestCase does a nice job of:

  • Creating a dump for a Postgres version where tableam is relevant (appear_as='13.1').
  • Ensuring at least one table has tableam='heap' while another object (the sequence) has no tableam set.
  • Saving to disk and validating, via pg_restore --schema-only, that:
    • The process succeeds.
    • No SET default_table_access_method = "" appears.
    • A valid SET default_table_access_method = heap is present.

The use of NamedTemporaryFile(delete=False) plus an explicit finally cleanup is correct and mirrors existing pg_restore-based tests in this file. If you ever decide to extend coverage further, you might optionally consider adding a similar test that starts from a real fixture dump file to mirror the exact load→save scenario from the original issue, but what’s here is already a solid regression guard.

Also applies to: 182-250

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3000438 and 2e66920.

📒 Files selected for processing (4)
  • pgdumplib/__init__.py (1 hunks)
  • pgdumplib/dump.py (3 hunks)
  • tests/test_dump.py (1 hunks)
  • tests/test_save_dump.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/test_dump.py (2)
pgdumplib/dump.py (1)
  • load (319-390)
pgdumplib/__init__.py (1)
  • load (20-37)
tests/test_save_dump.py (2)
pgdumplib/__init__.py (1)
  • new (40-60)
pgdumplib/dump.py (2)
  • add_entry (193-283)
  • save (411-423)
pgdumplib/dump.py (1)
tests/test_edge_cases.py (1)
  • _write_int (21-27)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: test (3.14, 17)
  • GitHub Check: test (3.14, 16)
  • GitHub Check: test (3.14, 18)
  • GitHub Check: test (3.13, 17)
  • GitHub Check: test (3.13, 18)
  • GitHub Check: test (3.12, 18)
  • GitHub Check: test (3.11, 15)
  • GitHub Check: test (3.13, 16)
  • GitHub Check: test (3.11, 18)
  • GitHub Check: test (3.12, 17)
  • GitHub Check: test (3.11, 16)
  • GitHub Check: test (3.12, 16)
  • GitHub Check: test (3.11, 17)
  • GitHub Check: test (3.13, 15)
  • GitHub Check: test (3.12, 15)
  • GitHub Check: test (3.13, 17)
  • GitHub Check: test (3.13, 18)
  • GitHub Check: test (3.12, 18)
  • GitHub Check: test (3.13, 16)
  • GitHub Check: test (3.13, 15)
  • GitHub Check: test (3.12, 16)
  • GitHub Check: test (3.11, 17)
  • GitHub Check: test (3.12, 17)
  • GitHub Check: test (3.11, 15)
  • GitHub Check: test (3.12, 15)
  • GitHub Check: test (3.11, 16)
  • GitHub Check: test (3.11, 18)
🔇 Additional comments (2)
pgdumplib/__init__.py (1)

24-34: Docstring now correctly documents -Fc custom format

The updated load() docstring correctly states that only pg_dump -Fc custom-format files are supported and is consistent with the new error messages elsewhere in the PR.

tests/test_dump.py (1)

193-238: New error‑handling tests accurately capture header detection behavior

The three added tests robustly cover plain SQL (commented and CREATE-prefixed) and invalid binary inputs, assert against stable substrings ("plain SQL text file", "Invalid archive", "pg_dump -Fc"), and clean up temp files correctly. They give good regression coverage for the new _read_header() behavior.

When pgdumplib loaded dumps with PostgreSQL v12+ format (version 1.14.0+),
entries with empty tableam fields were stored as empty strings. On save,
these empty strings were written as zero-length fields, which pg_restore
incorrectly interpreted as explicit empty values, generating invalid SQL:
  SET default_table_access_method = "";

This caused restoration failures with:
  ERROR: zero-length delimited identifier at or near ""

Root cause: The _write_str() method treated both None and empty string
identically, writing length 0 for both. PostgreSQL requires a distinction:
- None (not set) should be represented as length -1
- Empty string should be represented as length 0

Changes made:

1. Normalize empty tableam/tablespace to None during load (_read_entry)
   - Prevents empty strings from entering the data model
   - Maintains consistency with the Entry dataclass defaults

2. Update add_entry() to default tableam/tablespace to None
   - Changed from `tableam or ''` to `tableam or None`
   - Ensures consistency across manual entry creation

3. Fix _write_str() to distinguish None from empty string
   - Write -1 for None (field not set)
   - Write 0 for empty string (field explicitly empty)
   - Critical fix: prevents pg_restore from generating invalid SQL

Added comprehensive regression test that:
- Creates dump with table (tableam='heap') and sequence (no tableam)
- Verifies pg_restore doesn't generate invalid SET statements
- Confirms valid SET statements are preserved

Fixes #6

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@gmr gmr force-pushed the fix-issues-4-6-14 branch from 2e66920 to e0216be Compare November 19, 2025 21:29
- Update default error message to always mention pg_dump -Fc
- Improve _write_str() docstring to document None handling
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/test_save_dump.py (1)

182-250: Consider adding a skip decorator for missing pg_restore.

The test depends on pg_restore being available in the system PATH, which may not be true in all CI/CD environments or developer machines. Without pg_restore, this test will fail rather than skip.

Consider adding a check to skip the test gracefully:

import shutil
import unittest

class Issue6RegressionTestCase(unittest.TestCase):
    @unittest.skipIf(
        shutil.which('pg_restore') is None,
        'pg_restore not found in PATH'
    )
    def test_load_save_no_empty_tableam_sql(self):
        # ... existing test code

Alternatively, document in the test suite README that pg_restore must be installed for the full test suite to pass.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2e66920 and e0216be.

📒 Files selected for processing (2)
  • pgdumplib/dump.py (4 hunks)
  • tests/test_save_dump.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/test_save_dump.py (2)
pgdumplib/__init__.py (1)
  • new (40-60)
pgdumplib/dump.py (2)
  • add_entry (193-283)
  • save (411-423)
pgdumplib/dump.py (1)
tests/test_edge_cases.py (1)
  • _write_int (21-27)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: test (3.11, 16)
  • GitHub Check: test (3.13, 18)
  • GitHub Check: test (3.12, 17)
  • GitHub Check: test (3.11, 15)
  • GitHub Check: test (3.13, 16)
  • GitHub Check: test (3.11, 18)
  • GitHub Check: test (3.11, 17)
  • GitHub Check: test (3.12, 16)
  • GitHub Check: test (3.12, 15)
  • GitHub Check: test (3.13, 17)
  • GitHub Check: test (3.13, 15)
  • GitHub Check: test (3.12, 18)
🔇 Additional comments (8)
tests/test_save_dump.py (3)

3-6: LGTM!

The added imports are all properly utilized in the new regression test.


222-226: LGTM!

The subprocess call is secure (using list args without shell) and the noqa flags are appropriate. The command correctly uses --schema-only to validate SQL generation without requiring a database connection.


229-247: LGTM!

The test assertions comprehensively validate the fix:

  1. Ensures pg_restore succeeds
  2. Confirms no invalid empty tableam SQL is generated
  3. Verifies correct tableam SQL is present for tables with explicit access methods

This properly tests the regression described in issue #6.

pgdumplib/dump.py (5)

275-276: LGTM! Consistent normalization to None.

The change to use or None ensures that empty strings and None values are normalized to None, preventing the generation of invalid SQL statements like SET default_table_access_method = "";. This is consistent with the corresponding changes in _read_entry and _write_str.


672-673: LGTM! Proper normalization of tablespace.

The normalization ensures consistency by converting empty strings to None when reading entries, which aligns with the serialization logic in _write_str.


674-680: LGTM! Critical fix for issue #6.

The normalization of tableam from empty string to None, combined with the explanatory comment, directly addresses the bug where empty tableam values caused pg_restore to generate invalid SQL. The version gating is correct for backward compatibility.


721-743: LGTM! Excellent UX improvement for issue #14.

The enhanced error detection provides clear, actionable error messages when users attempt to load incompatible file formats. The heuristics for detecting plain SQL files (checking for common SQL keywords and comment markers) are reasonable and will catch the vast majority of cases where users mistakenly try to load plain SQL dumps.


1090-1098: LGTM! Critical serialization fix for issue #6.

The distinction between None (written as -1 length) and empty string (written as 0 length) is essential for the fix. Combined with the normalization in _read_entry and add_entry, this ensures that tablespace and tableam are consistently represented as None (not set) rather than empty strings, preventing invalid SQL generation.

The change maintains backward compatibility since both -1 and 0 length are read as empty bytes and normalized to None.

@gmr gmr merged commit 96a9f80 into main Nov 19, 2025
32 of 33 checks passed
@gmr gmr deleted the fix-issues-4-6-14 branch November 19, 2025 21:38
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
pgdumplib/dump.py (1)

721-743: Consider case-insensitive plain-text detection.
Lowercase SQL headers (create table, insert into, etc.) won’t trigger the specialized message today because the prefix list is upper-case only. Normalizing once before the startswith checks would capture those cases without affecting the fallback path.

-                if file_start.startswith(('--', '/*', 'SE', 'CR', 'IN', 'DR')):
+                file_start_upper = file_start.upper()
+                if file_start_upper.startswith(('--', '/*', 'SE', 'CR', 'IN', 'DR')):
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e0216be and 06f913b.

📒 Files selected for processing (1)
  • pgdumplib/dump.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
pgdumplib/dump.py (1)
tests/test_edge_cases.py (1)
  • _write_int (21-27)
🪛 Ruff (0.14.5)
pgdumplib/dump.py

1094-1094: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (26)
  • GitHub Check: test (3.13, 17)
  • GitHub Check: test (3.14, 17)
  • GitHub Check: test (3.13, 18)
  • GitHub Check: test (3.13, 16)
  • GitHub Check: test (3.14, 16)
  • GitHub Check: test (3.11, 15)
  • GitHub Check: test (3.12, 18)
  • GitHub Check: test (3.12, 15)
  • GitHub Check: test (3.11, 18)
  • GitHub Check: test (3.11, 16)
  • GitHub Check: test (3.13, 15)
  • GitHub Check: test (3.11, 17)
  • GitHub Check: test (3.12, 16)
  • GitHub Check: test (3.12, 17)
  • GitHub Check: test (3.13, 16)
  • GitHub Check: test (3.13, 15)
  • GitHub Check: test (3.13, 17)
  • GitHub Check: test (3.13, 18)
  • GitHub Check: test (3.11, 18)
  • GitHub Check: test (3.11, 15)
  • GitHub Check: test (3.12, 17)
  • GitHub Check: test (3.12, 15)
  • GitHub Check: test (3.11, 17)
  • GitHub Check: test (3.12, 18)
  • GitHub Check: test (3.12, 16)
  • GitHub Check: test (3.11, 16)
🔇 Additional comments (3)
pgdumplib/dump.py (3)

275-276: Thanks for normalizing optional storage fields.
Propagating None into the in-memory models.Entry keeps _write_str emitting the null sentinel and prevents another round of zero-length identifiers when the dump is rewritten. Nicely aligned with the regression fix.


673-679: Solid read-path normalization.
Coalescing empty strings back to None on load keeps the model symmetric with the writer and closes the loop on the default_table_access_method regression.


1087-1103: Appreciate the _write_str null sentinel fix.
Writing -1 for unset fields matches the archive format and directly addresses the zero-length identifier bug—great coordination with the read-path changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ValueError: Invalid archive header with pg_dump version 16.2 files Loading & saving dump file produced invalid dump file Confusing -Fd in the docs

1 participant