Skip to content

Bug: IMAP Scan Results Never Deduplicated — Same Emails Re-Inserted Every Scan, Table Grows ~144,000 Rows/Month Per User #223

Description

@devprashant19

Affected File

IMAP credential storebackend/email_connectors/imap_store.py


Bug Description

Every scheduled or manual IMAP scan fetches the last 50 emails from the inbox and inserts all of them into imap_scan_results unconditionally. There is no deduplication logic based on message_id, so the same emails are re-inserted on every scan cycle.

Problematic code (backend/email_connectors/imap_store.py, lines 117–141):

def save_scan_results(username, scanned_emails):
    now = datetime.now(timezone.utc).isoformat()
    with _connection() as conn:
        conn.executemany(
            """
            INSERT INTO imap_scan_results
                (username, message_id, subject, sender, date, prediction, risk_score, trust_level, scanned_at)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
            """,   # ← No ON CONFLICT / upsert logic
            [...]
        )

Growth rate calculation:

Scan interval Rows inserted per day Rows inserted per month
15 minutes 4,800 ~144,000
30 minutes 2,400 ~72,000
60 minutes 1,200 ~36,000
  • There is no index on (username, message_id) to prevent duplicates.
  • There is no retention policy, cleanup job, or TTL on old rows.
  • After 6 months of a single active user with 15-minute scanning, the table contains ~864,000 duplicate rows of the same 50 emails.
  • The GET /imap/scan-results endpoint returns the most recent limit rows (default 100), which will eventually all be duplicates of the same emails, making the history view useless.

Steps to Reproduce

  1. Connect an IMAP inbox with scan_interval_minutes: 15.
  2. Wait 30 minutes (two scan cycles complete).
  3. Query the SQLite database directly:
    sqlite3 imap_connections.db "SELECT message_id, COUNT(*) as cnt FROM imap_scan_results GROUP BY message_id ORDER BY cnt DESC LIMIT 5;"
  4. Observe that every message_id has cnt >= 2, growing by 1 on every scan cycle.

Expected Behavior

Each unique email (identified by message_id + username) is stored only once. Re-scanning updates the existing row rather than inserting a duplicate. Old scan results beyond a configurable retention period are automatically purged.


Actual Behavior

Every scan unconditionally inserts 50 new rows regardless of whether those emails have been scanned before. The table grows without bound with no deduplication or cleanup.


Proposed Fix

# imap_store.py — use upsert to prevent duplicates
conn.executemany(
    """
    INSERT INTO imap_scan_results
        (username, message_id, subject, sender, date, prediction, risk_score, trust_level, scanned_at)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
    ON CONFLICT(username, message_id) DO UPDATE SET
        prediction=excluded.prediction,
        risk_score=excluded.risk_score,
        trust_level=excluded.trust_level,
        scanned_at=excluded.scanned_at
    """,
    [...]
)

Also add a unique index and a cleanup job:

-- schema migration
CREATE UNIQUE INDEX IF NOT EXISTS uq_imap_results_user_msgid
    ON imap_scan_results(username, message_id);

-- periodic cleanup: keep only the last 90 days
DELETE FROM imap_scan_results
WHERE scanned_at < datetime('now', '-90 days');

Scope of Contribution

  • Frontend
  • Backend
  • Database
  • API
  • Authentication
  • AI/ML
  • DevOps / CI-CD
  • Documentation
  • UI/UX

Pre-submission Checklist

  • I have checked existing issues for duplicates.
  • I have verified this issue exists in the current codebase.

Metadata

Metadata

Assignees

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions