fix(fbmessenger): don't collapse multiple attachments to one row #406
…e row

UpsertAttachment dedupes attachments with an empty content_hash to a single row per message (SELECT-then-insert on message_id). The fbmessenger importer stored an empty content_hash whenever an attachment file was missing or no attachments dir was configured, so a Messenger message with several photos whose files were absent recorded only ONE attachment row; the rest were silently dropped.

Give hashless attachments a stable synthetic content_hash (sha256 of the export-relative URI, not file bytes) so siblings coexist and re-imports stay idempotent. storage_path stays empty, so no stored content is implied and the file-cleanup paths (which filter on non-empty storage_path) are unaffected.

Audit of the other UpsertAttachment callers found no further instances: gmail and the generic ingest path always carry a real MIME content hash, synctechsms always hashes the part bytes, whatsapp stores at most one attachment per message, and teams already uses a synthetic link hash.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_012Ri7QdvXXUMQPke9wrLjSS
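A minimal sketch of the synthetic identity described in this commit, assuming nothing beyond "sha256 of the export-relative URI". The helper name `syntheticContentHash` and the example paths are hypothetical and not taken from the importer.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// syntheticContentHash is a hypothetical helper. It hashes the
// export-relative URI (not the file bytes), so every hashless attachment
// on a message gets its own key and the key is the same on every re-import.
func syntheticContentHash(exportRelativeURI string) string {
	sum := sha256.Sum256([]byte(exportRelativeURI))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Illustrative paths only; real exports may be laid out differently.
	first := syntheticContentHash("messages/inbox/example_thread/photos/1.jpg")
	second := syntheticContentHash("messages/inbox/example_thread/photos/2.jpg")

	fmt.Println("siblings differ:", first != second)
	fmt.Println("stable on re-import:",
		first == syntheticContentHash("messages/inbox/example_thread/photos/1.jpg"))
}
```

Because the key depends only on the URI, two missing photos on the same message no longer share the empty-hash dedupe slot.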
roborev: Combined Review (
looking at this
Messenger imports still need a stable per-attachment identity when files are missing, rejected, or skipped because no attachments directory is configured. Without that identity, the store's empty-hash fallback collapses multiple hashless attachments on the same message into one row.

The previous fallback identity looked exactly like a SHA-256 content hash even though no bytes were stored. Prefix the synthetic key so JSON output and export flows cannot confuse it with content-addressed attachment data, while preserving idempotent re-import behavior.

Validation: focused Messenger attachment tests were run red/green; the new assertions failed against bare 64-hex fallback keys before the importer change and passed after prefixing them.

Generated with Codex (GPT-5)
Co-authored-by: Codex <codex@openai.com>
roborev: Combined Review (
Older Messenger imports could leave an empty content_hash attachment row for a missing or skipped attachment. After synthetic keys became non-empty, re-importing the same export no longer hit the store's empty-hash dedupe path, so the legacy row could survive beside the new synthetic-key row.

Remove those legacy empty-hash, empty-storage rows when writing a synthetic Messenger attachment key so upgrade re-imports keep one attachment row per referenced attachment without restoring SHA-looking fake content hashes.

Validation: focused regression test was run red/green; it failed with two rows before the importer cleanup and passed after deleting the legacy empty-hash row.

Generated with Codex (GPT-5)
Co-authored-by: Codex <codex@openai.com>
Legacy Messenger imports could leave an empty content_hash row when attachment storage was skipped. After the synthetic-key fix, cleanup ran only for hashless reimports, so upgrading the same export with an attachments directory and present file still left the legacy row beside the newly stored content-hash row.

Run the narrow legacy empty-hash, empty-storage cleanup before every Messenger attachment upsert. This preserves the synthetic-key upgrade path and also covers upgrades that now produce real stored attachment content.

Validation: focused real-storage upgrade regression failed with two rows before moving the cleanup and passed after it ran before all attachment upserts.

Generated with Codex (GPT-5)
Co-authored-by: Codex <codex@openai.com>
roborev: Combined Review (
Messenger imports can first record a synthetic attachment placeholder when storage is disabled or a referenced file is unavailable. If a later reimport stores the real attachment bytes, that deterministic synthetic row must be removed or the message advertises both the placeholder and the real content.

Delete the matching empty-storage synthetic placeholder after a real content-hash upsert succeeds, while keeping the legacy empty-hash cleanup on the same successful-upsert path.

Validation: focused synthetic-placeholder upgrade regression failed with two rows before the cleanup and passed after deleting the deterministic placeholder.

Generated with Codex (GPT-5)
Co-authored-by: Codex <codex@openai.com>
Messenger attachment storage can compute a real content hash before failing to create or write the content-addressed file. Recording that real hash with an empty storage_path blocks a later successful reimport because the store treats the same message/hash pair as already present.

Keep the synthetic placeholder identity unless a real storage path was produced. The deterministic placeholder can then remain visible after a failed storage attempt and still be replaced by the real content row once storage succeeds.

Validation: focused storage-failure regression failed by recording a real hash with empty storage_path before the fix and passed after preserving the synthetic placeholder through the failed import.

Generated with Codex (GPT-5)
Co-authored-by: Codex <codex@openai.com>
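The rule in this commit (use the real hash only when a storage path was actually produced) can be illustrated with a small decision function. The `storeResult` type and `attachmentIdentity` name are assumptions made for the sketch, not the importer's real types.

```go
package main

import "fmt"

// storeResult is a hypothetical summary of one storage attempt.
type storeResult struct {
	ContentHash string // may be set even when the write later failed
	StoragePath string // non-empty only when bytes were actually stored
}

// attachmentIdentity picks the key to upsert under: the real content hash
// only when storage succeeded, otherwise the synthetic placeholder key.
func attachmentIdentity(res storeResult, syntheticKey string) string {
	if res.ContentHash != "" && res.StoragePath != "" {
		return res.ContentHash
	}
	return syntheticKey
}

func main() {
	placeholder := "synthetic-uri:example" // illustrative value only

	failed := storeResult{ContentHash: "abc123", StoragePath: ""}
	stored := storeResult{ContentHash: "abc123", StoragePath: "ab/abc123"}

	fmt.Println(attachmentIdentity(failed, placeholder))
	fmt.Println(attachmentIdentity(stored, placeholder))
}
```

Keying the failed attempt under the placeholder leaves the message/hash pair free for a later successful import.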
A prior failed storage attempt can leave a Messenger attachment row with a real content_hash but an empty storage_path. Because attachments conflict on message_id and content_hash, a later successful import would hit the existing row and leave it unrepaired.

When a Messenger attachment is actually stored, remove any same-message same-hash empty-storage row before the upsert so the stored path and size can be written. Keep synthetic placeholder behavior for imports that still do not have stored content.

Add a regression test that seeds the stale real-hash empty-path state, reimports with attachment storage enabled, and verifies the stored row and bytes are repaired.

Validation: the focused real-hash empty-path regression failed before this cleanup with an empty storage_path and size 0, then passed after deleting the stale row before the stored upsert.

Generated with Codex (GPT-5)
Co-authored-by: Codex <codex@openai.com>
roborev: Combined Review (
roborev: Combined Review (
A Messenger reimport can compute a real content hash even when storage still fails, then fall back to the synthetic placeholder row. If an older failed import already left the same real hash with an empty storage_path, keeping that row leaves duplicate attachment identities and exposes a SHA-looking hash with no stored file.

After the synthetic placeholder upsert succeeds, remove the same-message same-hash empty-storage row so the placeholder remains the only hashless-storage representation until a later successful import can replace it with real content.

Validation: seeded the stale real-hash empty-path row into the storage-failure regression; the focused test failed with two rows before the cleanup and passed after removing the stale row on placeholder reimport.

Generated with Codex (GPT-5)
Co-authored-by: Codex <codex@openai.com>
roborev: Combined Review (
A Messenger reimport without an attachments directory, or with a now-missing media file, cannot recompute the real content hash that an older failed storage attempt may have recorded. In that state the importer can still write the deterministic synthetic placeholder, but the stale SHA-looking empty-storage row needs to be removed by the remaining attachment metadata.

When a synthetic placeholder import has no content hash available, remove empty-storage 64-character hash rows for the same message, filename, and MIME type. Stored attachments and prefixed synthetic placeholders are left alone.

Validation: added a no-attachments-dir reimport regression seeded with a stale real-hash empty-path row; it failed with two rows before the metadata cleanup and passed after the fallback cleanup.

Generated with Codex (GPT-5)
Co-authored-by: Codex <codex@openai.com>
roborev: Combined Review (
A Messenger reimport can compute the real content hash before failing to write to the current attachment store. If that same content was already stored by an earlier successful import, falling back to the synthetic placeholder creates a duplicate row beside valid stored content.

Check for an existing non-empty storage_path row for the computed hash before inserting a placeholder. When stored content already exists, keep it as the attachment identity and remove any stale hashless or synthetic placeholder rows for that Messenger attachment.

Validation: added a regression where a successful stored import is followed by a storage-failing reimport; it failed with two rows before the stored-row guard and passed after skipping the placeholder.

Generated with Codex (GPT-5)
Co-authored-by: Codex <codex@openai.com>
roborev: Combined Review (
A Messenger reimport with AttachmentsDir unset can still resolve the source media path, but handleAttachment intentionally returns no content_hash because no bytes were stored in this run. That made the existing stored-row check blind to content already imported by an earlier successful run and allowed a synthetic placeholder beside valid stored content.

Hash the source attachment only for existing-stored-row detection when the current import did not produce stored content. The importer still writes synthetic keys for unstored attachments unless a stored row for the same source bytes already exists.

Validation: added a stored-import then no-attachments-dir reimport regression; it failed with two rows before source-hash detection and passed after using the source hash to skip the placeholder.

Generated with Codex (GPT-5)
Co-authored-by: Codex <codex@openai.com>
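A rough sketch of the stored-row guard with source-hash detection. The map stands in for a lookup of existing rows on the message, keyed by content_hash with storage_path as the value; `contentHash` and `needsPlaceholder` are hypothetical names, and the real importer presumably queries the store instead.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentHash returns the hex SHA-256 of the source bytes. In this sketch
// it is used only for detection, never as the identity of an unstored row.
func contentHash(sourceBytes []byte) string {
	sum := sha256.Sum256(sourceBytes)
	return hex.EncodeToString(sum[:])
}

// needsPlaceholder reports whether a synthetic placeholder should be
// written. storedPaths maps content_hash to storage_path for the message;
// a non-empty path means an earlier import already stored these bytes.
func needsPlaceholder(sourceBytes []byte, storedPaths map[string]string) bool {
	return storedPaths[contentHash(sourceBytes)] == ""
}

func main() {
	photo := []byte("example photo bytes")
	stored := map[string]string{
		contentHash(photo): "attachments/example/path", // illustrative path
	}

	fmt.Println(needsPlaceholder(photo, stored))                 // already stored
	fmt.Println(needsPlaceholder([]byte("other bytes"), stored)) // never stored
}
```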
roborev: Combined Review (
Facebook Messenger messages that carry multiple attachments without downloaded bytes (files missing from the DYI export, or no attachments dir configured) were recording only one attachment row; the rest were silently dropped.
UpsertAttachment dedupes rows with an empty content_hash to one per message. These link/missing attachments now get a stable synthetic content_hash derived from the export-relative URI, so siblings coexist and re-imports stay idempotent. storage_path stays empty, so no bytes are implied and file-cleanup paths are unaffected.
🤖 Generated with Claude Code