RFC: sync Gmail archived mail without duplicates via X-GM-LABELS (All Mail only) + paginated backfill #50

@jhayashi

Background

Following #45/#46 (thanks for the quick merge + the CI/dbus fixes — both shipped in 0.5.51), credentials persist and Gmail's [Gmail] container no longer aborts sync. But on Gmail/IMAP, archived mail still isn't synced: syncable_folders excludes the \All (All Mail) folder, and All Mail is the only place archived-only messages live.

Including All Mail naively isn't safe today. The store's identity is UNIQUE (account_id, provider_id) with provider_id = "mailbox:uid" and no RFC Message-ID dedup, so the same Gmail message in INBOX and All Mail (different UIDs per folder) becomes two rows. That duplication is presumably why \All is excluded — at the cost of archived mail. The Gmail API provider avoids this entirely by being message-centric (one entity, labelIds as attributes); the IMAP path can't naturally match that.
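
To make the duplication concrete, here is a minimal sketch of the identity rule described above. The helper names (`provider_id`, `rows_for`) are hypothetical and only model the UNIQUE (account_id, provider_id) constraint; they are not the store's actual code.

```rust
use std::collections::HashSet;

/// Hypothetical helper: builds the "mailbox:uid" provider_id described above.
fn provider_id(mailbox: &str, uid: u32) -> String {
    format!("{mailbox}:{uid}")
}

/// Counts the rows a store keyed on UNIQUE (account_id, provider_id) would
/// keep for the given (mailbox, uid) sightings of messages in one account.
fn rows_for(account_id: u64, sightings: &[(&str, u32)]) -> usize {
    let keys: HashSet<(u64, String)> = sightings
        .iter()
        .map(|(mailbox, uid)| (account_id, provider_id(mailbox, *uid)))
        .collect();
    keys.len()
}
```

With this model, one Gmail message seen as `("INBOX", 42)` and `("[Gmail]/All Mail", 9001)` yields two distinct keys and therefore two rows, while syncing All Mail alone yields one.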

I have a working implementation and wanted to check the approach with you before opening a sizeable PR.

Proposed approach: make Gmail/IMAP message-centric via X-GM-*

For servers advertising X-GM-EXT-1:

  • Sync All Mail only. Each message is in All Mail exactly once → one row per message (no folder duplicates), and archived mail is covered.
  • Derive labels/threads/flags from X-GM-LABELS / X-GM-THRID rather than folder membership: \Inbox → INBOX, \Sent → [Gmail]/Sent Mail, user labels by name (nested = Parent/Child), drop \All/unknown. (Note: Gmail quotes system labels inconsistently, e.g. \Sent vs \\Sent, so the mapping normalizes leading backslashes.)
  • Keep provider_id = "mailbox:uid" so the existing mutation path is unchanged.
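
The label mapping in the second bullet could look roughly like the sketch below. It covers only the mappings named above (\Inbox, \Sent, user labels, drop \All/unknown); the function names are hypothetical, and it assumes the label strings have already been unquoted by the IMAP parser.

```rust
/// Maps one raw X-GM-LABELS value to a local folder/label name.
/// Returns None for labels that should be dropped.
fn map_gmail_label(raw: &str) -> Option<String> {
    let trimmed = raw.trim();
    // Gmail may send \Sent or \\Sent, so strip every leading backslash.
    let name = trimmed.trim_start_matches('\\');
    if name.is_empty() {
        return None;
    }
    if name.len() == trimmed.len() {
        // No leading backslash: a user label, kept by name.
        // Nested labels already arrive as Parent/Child.
        return Some(name.to_string());
    }
    match name {
        "Inbox" => Some("INBOX".to_string()),
        "Sent" => Some("[Gmail]/Sent Mail".to_string()),
        // \All and any system label not listed above are dropped.
        _ => None,
    }
}

/// Maps every label on a message, skipping the dropped ones.
fn map_gmail_labels(raw: &[&str]) -> Vec<String> {
    raw.iter().filter_map(|label| map_gmail_label(label)).collect()
}
```

Treating "has a leading backslash" as the marker for system labels keeps a user label that happens to be named Inbox or Sent from being remapped.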

For all IMAP servers (independently useful):

  • Initial sync becomes a probe + paginated backfill. Today's single-shot UID FETCH 1:* loads an entire folder into memory (multi-GB on a large All Mail). Instead, a probe records per-folder watermarks and backfill_sync walks UIDs newest→oldest via UID SEARCH, fetching ≤400/batch through the existing has_more loop — flat memory, resumable from the per-batch cursor. Plus a generous per-batch timeout so Gmail throttling surfaces as retry/backoff, not a wedge.
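
As a rough illustration of the batching logic only (no network I/O), the newest→oldest walk with a resumable cursor might be sketched as follows. `BackfillCursor`, `Batch`, and `next_batch` are hypothetical names, not the actual backfill_sync internals, and a real implementation would presumably scope each UID SEARCH to a UID range rather than pass the full result in.

```rust
/// Maximum UIDs fetched per batch (the proposal uses 400).
const BATCH_SIZE: usize = 400;

/// Hypothetical resumable cursor: the lowest UID fetched so far.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BackfillCursor {
    lowest_fetched_uid: Option<u32>,
}

/// One page of work plus the state needed to resume after it.
#[derive(Debug)]
struct Batch {
    uids: Vec<u32>,
    cursor: BackfillCursor,
    has_more: bool,
}

/// Picks the next batch from the UIDs returned by a UID SEARCH, newest first.
fn next_batch(search_result: &[u32], cursor: BackfillCursor) -> Batch {
    // Keep only UIDs older than anything fetched so far.
    let mut pending: Vec<u32> = search_result
        .iter()
        .copied()
        .filter(|uid| cursor.lowest_fetched_uid.map_or(true, |low| *uid < low))
        .collect();
    // Newest (highest UID) first.
    pending.sort_unstable_by(|a, b| b.cmp(a));
    pending.dedup();
    let has_more = pending.len() > BATCH_SIZE;
    pending.truncate(BATCH_SIZE);
    // If nothing was pending, the cursor stays where it was.
    let lowest = pending.last().copied().or(cursor.lowest_fetched_uid);
    Batch {
        uids: pending,
        cursor: BackfillCursor { lowest_fetched_uid: lowest },
        has_more,
    }
}
```

Persisting `cursor` after each batch is what would make the walk resumable: after a restart, the next call simply continues below the last recorded UID.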

The one dependency: mxr-async-imap Gmail FETCH accessors

imap-proto already parses AttributeValue::GmailLabels/GmailMsgId/GmailThrId, but mxr-async-imap's Fetch only surfaces uid/size/modseq. It needs three small accessors on Fetch reading from self.response.parsed() (same shape as flags()):

pub fn gmail_labels(&self) -> Option<Vec<String>> { /* find AttributeValue::GmailLabels */ }
pub fn gmail_msgid(&self) -> Option<u64> { /* GmailMsgId */ }
pub fn gmail_thrid(&self) -> Option<u64> { /* GmailThrId */ }
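
For illustration, the bodies could be filled in along these lines. This sketch uses stand-in types so it compiles on its own: the `AttributeValue` enum mirrors only the variants named above (payload types assumed), and `Fetch` here is a hypothetical simplified struct holding already-parsed attributes, not the real mxr-async-imap type.

```rust
use std::borrow::Cow;

/// Stand-in for the relevant imap-proto variants (payload types assumed).
enum AttributeValue<'a> {
    Uid(u32),
    GmailLabels(Vec<Cow<'a, str>>),
    GmailMsgId(u64),
    GmailThrId(u64),
}

/// Hypothetical simplified Fetch holding already-parsed attributes.
struct Fetch<'a> {
    attrs: Vec<AttributeValue<'a>>,
}

impl<'a> Fetch<'a> {
    /// Returns the X-GM-LABELS values, if the server sent them.
    pub fn gmail_labels(&self) -> Option<Vec<String>> {
        self.attrs.iter().find_map(|attr| match attr {
            AttributeValue::GmailLabels(labels) => {
                Some(labels.iter().map(|l| l.to_string()).collect())
            }
            _ => None,
        })
    }

    /// Returns the X-GM-MSGID value, if present.
    pub fn gmail_msgid(&self) -> Option<u64> {
        self.attrs.iter().find_map(|attr| match attr {
            AttributeValue::GmailMsgId(id) => Some(*id),
            _ => None,
        })
    }

    /// Returns the X-GM-THRID value, if present.
    pub fn gmail_thrid(&self) -> Option<u64> {
        self.attrs.iter().find_map(|attr| match attr {
            AttributeValue::GmailThrId(id) => Some(*id),
            _ => None,
        })
    }
}
```

Each accessor returns None when the attribute is absent, so callers on non-Gmail servers can fall back to folder-based behavior without special-casing.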

Since mxr-async-imap is published from this project, would you prefer to add those accessors yourself, or take them as part of the change?

Status / validation

I have this implemented and running against a live Gmail account (~80k-message All Mail): full backfill with bounded memory, zero duplicate rows, labels resolving (Inbox/Sent/user labels), surviving daemon restarts. It includes unit tests (label mapping, the Gmail parse branch, backfill cursor round-trip, etc.) and updates the shared sync-conformance harness to drive has_more (verified across imap/gmail/fake).

Happy to open the PR once you're good with (a) the All-Mail-only + X-GM-LABELS model and (b) how you'd like the mxr-async-imap accessors landed. Are there constraints I'm missing (e.g. servers where you'd want to keep per-folder sync, or a preferred label-mapping source)?
