Skip to content

fix(fbmessenger): don't collapse multiple attachments to one row#406

Merged
wesm merged 11 commits into
kenn-io:mainfrom
njt:fix/fbmessenger-attachment-dedup
Jun 25, 2026
Merged

fix(fbmessenger): don't collapse multiple attachments to one row#406
wesm merged 11 commits into
kenn-io:mainfrom
njt:fix/fbmessenger-attachment-dedup

Conversation

@njt

@njt njt commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Facebook Messenger messages that carry multiple attachments without downloaded bytes (files missing from the DYI export, or no attachments dir configured) were recording only one attachment row — the rest were silently dropped.

UpsertAttachment dedupes rows with an empty content_hash to one per message. These link/missing attachments now get a stable synthetic content_hash derived from the export-relative URI, so siblings coexist and re-imports stay idempotent. storage_path stays empty, so no bytes are implied and file-cleanup paths are unaffected.

🤖 Generated with Claude Code

…e row

UpsertAttachment dedupes attachments with an empty content_hash to a single
row per message (SELECT-then-insert on message_id). The fbmessenger importer
stored an empty content_hash whenever an attachment file was missing or no
attachments dir was configured, so a Messenger message with several photos
whose files were absent recorded only ONE attachment row — the rest were
silently dropped.

Give hashless attachments a stable synthetic content_hash (sha256 of the
export-relative URI, not file bytes) so siblings coexist and re-imports stay
idempotent. storage_path stays empty, so no stored content is implied and the
file-cleanup paths (which filter on non-empty storage_path) are unaffected.

Audit of the other UpsertAttachment callers found no further instances: gmail
and the generic ingest path always carry a real MIME content hash, synctechsms
always hashes the part bytes, whatsapp stores at most one attachment per
message, and teams already uses a synthetic link hash.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_012Ri7QdvXXUMQPke9wrLjSS
@roborev-ci

roborev-ci Bot commented Jun 22, 2026

Copy link
Copy Markdown

roborev: Combined Review (7cab66b)

Summary verdict: One medium issue remains; no high or critical findings were reported.

Medium

  • internal/fbmessenger/importer.go:701: Missing, rejected, or no-directory attachments now store a 64-character hex value in content_hash even though it is not a real byte hash and storage_path is empty. Existing consumers treat valid-looking SHA-256 values as exportable content, so show-message --json, export-attachment, export-attachments, and MCP attachment reads may advertise or attempt to fetch files that were never stored.
    • Fix: Keep content_hash reserved for real content hashes and use a distinguishable synthetic key for idempotency, such as a prefixed non-SHA value, or add a separate store-level identity for hashless attachments.

Panel: ci_default_security | Synthesis: codex, 11s | Members: codex_default (codex/default, done, 2m24s), codex_security (codex/security, done, 56s) | Total: 3m31s

@njt

njt commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

Split out of #398 (Teams ingestion) to keep that PR scoped. Found via an audit of UpsertAttachment's hashless-attachment dedup path. Independent of #398 — branches off current main.

@wesm

wesm commented Jun 25, 2026

Copy link
Copy Markdown
Member

looking at this

Messenger imports still need a stable per-attachment identity when files are missing, rejected, or skipped because no attachments directory is configured. Without that identity, the store's empty-hash fallback collapses multiple hashless attachments on the same message into one row.

The previous fallback identity looked exactly like a SHA-256 content hash even though no bytes were stored. Prefix the synthetic key so JSON output and export flows cannot confuse it with content-addressed attachment data, while preserving idempotent re-import behavior.

Validation: focused Messenger attachment tests were run red/green; the new assertions failed against bare 64-hex fallback keys before the importer change and passed after prefixing them.

Generated with Codex (GPT-5)
Co-authored-by: Codex <codex@openai.com>
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (0da95ca)

Medium issue found; security review found no additional issues.

Medium

  • internal/fbmessenger/importer.go:701: Re-importing a Facebook export previously imported by older code can leave the legacy empty-content_hash attachment row in place while inserting the new synthetic-key row beside it. Because the new non-empty key no longer hits UpsertAttachment’s empty-hash dedupe path, upgrade idempotency breaks and attachment rows/counts can be inflated for missing, rejected, or no-attachment-dir imports.

    Suggested fix: When writing synthetic Facebook attachment keys, migrate or delete existing empty-hash attachment rows for that message before inserting synthetic rows. Add a regression test that seeds a legacy empty-hash row before re-import.


Panel: ci_default_security | Synthesis: codex, 8s | Members: codex_default (codex/default, done, 2m52s), codex_security (codex/security, done, 13s) | Total: 3m13s

wesm and others added 2 commits June 24, 2026 20:41
Older Messenger imports could leave an empty content_hash attachment row for a missing or skipped attachment. After synthetic keys became non-empty, re-importing the same export no longer hit the store's empty-hash dedupe path, so the legacy row could survive beside the new synthetic-key row.

Remove those legacy empty-hash, empty-storage rows when writing a synthetic Messenger attachment key so upgrade re-imports keep one attachment row per referenced attachment without restoring SHA-looking fake content hashes.

Validation: focused regression test was run red/green; it failed with two rows before the importer cleanup and passed after deleting the legacy empty-hash row.

Generated with Codex (GPT-5)
Co-authored-by: Codex <codex@openai.com>
Legacy Messenger imports could leave an empty content_hash row when attachment storage was skipped. After the synthetic-key fix, cleanup ran only for hashless reimports, so upgrading the same export with an attachments directory and present file still left the legacy row beside the newly stored content-hash row.

Run the narrow legacy empty-hash, empty-storage cleanup before every Messenger attachment upsert. This preserves the synthetic-key upgrade path and also covers upgrades that now produce real stored attachment content.

Validation: focused real-storage upgrade regression failed with two rows before moving the cleanup and passed after it ran before all attachment upserts.

Generated with Codex (GPT-5)
Co-authored-by: Codex <codex@openai.com>
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (665d0b0)

Medium-risk issue remains: synthetic Facebook Messenger attachment placeholders can survive after the real attachment is later imported.

Medium

  • internal/fbmessenger/importer.go:701 — Synthetic placeholder rows are only removed when the current import is still hashless. If a previously missing/skipped attachment later becomes readable, or an import done without an attachments dir is rerun with storage enabled, handleAttachment returns a real hash and the old synthetic row remains alongside the newly inserted real attachment row.
    Fix: When a real contentHash is available, also remove the matching synthetic placeholder for fbAttachmentHash(att, ai) and any legacy empty-hash placeholder for that message, preferably after the real upsert succeeds or inside a transaction.

Panel: ci_default_security | Synthesis: codex, 6s | Members: codex_default (codex/default, done, 3m19s), codex_security (codex/security, done, 1m10s) | Total: 4m35s

wesm and others added 3 commits June 24, 2026 22:38
Messenger imports can first record a synthetic attachment placeholder when storage is disabled or a referenced file is unavailable. If a later reimport stores the real attachment bytes, that deterministic synthetic row must be removed or the message advertises both the placeholder and the real content.

Delete the matching empty-storage synthetic placeholder after a real content-hash upsert succeeds, while keeping the legacy empty-hash cleanup on the same successful-upsert path.

Validation: focused synthetic-placeholder upgrade regression failed with two rows before the cleanup and passed after deleting the deterministic placeholder.

Generated with Codex (GPT-5)
Co-authored-by: Codex <codex@openai.com>
Messenger attachment storage can compute a real content hash before failing to create or write the content-addressed file. Recording that real hash with an empty storage_path blocks a later successful reimport because the store treats the same message/hash pair as already present.

Keep the synthetic placeholder identity unless a real storage path was produced. The deterministic placeholder can then remain visible after a failed storage attempt and still be replaced by the real content row once storage succeeds.

Validation: focused storage-failure regression failed by recording a real hash with empty storage_path before the fix and passed after preserving the synthetic placeholder through the failed import.

Generated with Codex (GPT-5)
Co-authored-by: Codex <codex@openai.com>
A prior failed storage attempt can leave a Messenger attachment row with a real content_hash but an empty storage_path. Because attachments conflict on message_id and content_hash, a later successful import would hit the existing row and leave it unrepaired.

When a Messenger attachment is actually stored, remove any same-message same-hash empty-storage row before the upsert so the stored path and size can be written. Keep synthetic placeholder behavior for imports that still do not have stored content.

Add a regression test that seeds the stale real-hash empty-path state, reimports with attachment storage enabled, and verifies the stored row and bytes are repaired.

Validation: the focused real-hash empty-path regression failed before this cleanup with an empty storage_path and size 0, then passed after deleting the stale row before the stored upsert.

Generated with Codex (GPT-5)

Co-authored-by: Codex <codex@openai.com>
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (f653b91)

Summary verdict: One medium correctness issue remains; no security findings were reported.

Medium

  • internal/fbmessenger/importer.go:710 - Legacy cleanup misses stale rows with real hashes and empty storage paths. The old importer could create rows with a real content_hash and empty storage_path when hashing succeeded but storage failed. Those rows survive the new synthetic placeholder cleanup, so later successful imports can conflict on the same real hash without repairing storage_path or size.

    Fix: Also migrate or delete legacy rows where content_hash = contentHash AND storage_path = '', or make the stored-content upsert repair empty-storage conflicts. Add a regression test that seeds this legacy row shape.


Panel: ci_default_security | Synthesis: codex, 7s | Members: codex_default (codex/default, done, 3m36s), codex_security (codex/security, done, 2m5s) | Total: 5m48s

@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (285c5bd)

Summary verdict: One Medium issue remains; no High or Critical findings were reported.

Medium

  • internal/fbmessenger/importer.go:703
    If a legacy import already has a real SHA content_hash row with an empty storage_path from a previous storage failure, and storage still fails on reimport, this branch writes the synthetic placeholder but never removes the old real-hash empty-path row. That leaves duplicate attachment rows for one attachment and keeps a SHA-looking hash that downstream export code can treat as a stored file.

    Fix: After a successful synthetic placeholder upsert, also delete any empty-path row for the computed real contentHash when contentHash is non-empty, and add a regression test seeded with a real-hash/empty-path row while storage still fails.


Panel: ci_default_security | Synthesis: codex, 8s | Members: codex_default (codex/default, done, 3m9s), codex_security (codex/security, done, 35s) | Total: 3m52s

A Messenger reimport can compute a real content hash even when storage still fails, then fall back to the synthetic placeholder row. If an older failed import already left the same real hash with an empty storage_path, keeping that row leaves duplicate attachment identities and exposes a SHA-looking hash with no stored file.

After the synthetic placeholder upsert succeeds, remove the same-message same-hash empty-storage row so the placeholder remains the only hashless-storage representation until a later successful import can replace it with real content.

Validation: seeded the stale real-hash empty-path row into the storage-failure regression; the focused test failed with two rows before the cleanup and passed after removing the stale row on placeholder reimport.

Generated with Codex (GPT-5)

Co-authored-by: Codex <codex@openai.com>
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (c5a0df3)

Summary verdict: One Medium issue remains; no High or Critical findings were reported.

Medium

  • internal/fbmessenger/importer.go:719 - Failed real-hash rows are only cleaned up when the current import can recompute contentHash. If reimporting without AttachmentsDir or after the media file disappears, the importer writes a synthetic placeholder but can leave a legacy empty-storage_path real-hash row beside it.
    • Fix: In the synthetic-placeholder path, also remove empty-storage real-hash rows for the current message/attachment identity, or delete empty-storage rows not in the expected hash set after processing the message.

Panel: ci_default_security | Synthesis: codex, 7s | Members: codex_default (codex/default, done, 4m19s), codex_security (codex/security, done, 1m20s) | Total: 5m46s

A Messenger reimport without an attachments directory, or with a now-missing media file, cannot recompute the real content hash that an older failed storage attempt may have recorded. In that state the importer can still write the deterministic synthetic placeholder, but the stale SHA-looking empty-storage row needs to be removed by the remaining attachment metadata.

When a synthetic placeholder import has no content hash available, remove empty-storage 64-character hash rows for the same message, filename, and MIME type. Stored attachments and prefixed synthetic placeholders are left alone.

Validation: added a no-attachments-dir reimport regression seeded with a stale real-hash empty-path row; it failed with two rows before the metadata cleanup and passed after the fallback cleanup.

Generated with Codex (GPT-5)

Co-authored-by: Codex <codex@openai.com>
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (3db6f3a)

Summary verdict: one medium correctness issue remains; no security issues were reported.

Medium

  • internal/fbmessenger/importer.go:703: When handleAttachment can compute contentHash but cannot write storagePath, the importer upserts a synthetic placeholder even if the message already has a valid real-hash attachment row from an earlier successful import. The cleanup at lines 719-721 only removes empty-path rows for that real hash, so an existing stored row remains and reimport adds a duplicate placeholder for the same attachment.
    • Fix: Before falling back to the synthetic hash for contentHash != "", check whether a non-empty storage_path row already exists for (message_id, contentHash) and skip inserting the placeholder in that case. Add a regression test for storage failure after a prior successful attachment import.

Panel: ci_default_security | Synthesis: codex, 9s | Members: codex_default (codex/default, done, 3m22s), codex_security (codex/security, done, 1m21s) | Total: 4m52s

A Messenger reimport can compute the real content hash before failing to write to the current attachment store. If that same content was already stored by an earlier successful import, falling back to the synthetic placeholder creates a duplicate row beside valid stored content.

Check for an existing non-empty storage_path row for the computed hash before inserting a placeholder. When stored content already exists, keep it as the attachment identity and remove any stale hashless or synthetic placeholder rows for that Messenger attachment.

Validation: added a regression where a successful stored import is followed by a storage-failing reimport; it failed with two rows before the stored-row guard and passed after skipping the placeholder.

Generated with Codex (GPT-5)

Co-authored-by: Codex <codex@openai.com>
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (3e5fc2b)

Medium issue found; no Critical or High findings.

Medium

  • internal/fbmessenger/importer.go:703 - Re-importing after a successful attachment import with AttachmentsDir later unset can add a synthetic placeholder beside the existing stored attachment. handleAttachment returns an empty contentHash when AttachmentsDir == "", so the stored-attachment check is skipped and UpsertAttachment writes a different synthetic content_hash, leaving duplicate rows for the same attachment.

    Suggested fix: Compute the source file hash independently from copying when att.AbsPath is available, and use that hash to detect existing stored rows even when no attachment storage dir is configured. Add a regression test for stored import followed by re-import without AttachmentsDir.


Panel: ci_default_security | Synthesis: codex, 6s | Members: codex_default (codex/default, done, 2m25s), codex_security (codex/security, done, 2m7s) | Total: 4m38s

A Messenger reimport with AttachmentsDir unset can still resolve the source media path, but handleAttachment intentionally returns no content_hash because no bytes were stored in this run. That made the existing stored-row check blind to content already imported by an earlier successful run and allowed a synthetic placeholder beside valid stored content.

Hash the source attachment only for existing-stored-row detection when the current import did not produce stored content. The importer still writes synthetic keys for unstored attachments unless a stored row for the same source bytes already exists.

Validation: added a stored-import then no-attachments-dir reimport regression; it failed with two rows before source-hash detection and passed after using the source hash to skip the placeholder.

Generated with Codex (GPT-5)

Co-authored-by: Codex <codex@openai.com>
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (8da80c1)

No issues found.


Panel: ci_default_security | Synthesis: codex | Members: codex_default (codex/default, done, 6m18s), codex_security (codex/security, done, 1m0s) | Total: 7m18s

@wesm wesm merged commit bcec734 into kenn-io:main Jun 25, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants