Bug: IMAP Scan Results Never Deduplicated — Same Emails Re-Inserted Every Scan, Table Grows ~144,000 Rows/Month Per User


## Affected File

**IMAP credential store** — `backend/email_connectors/imap_store.py`

---

## Bug Description

Every scheduled or manual IMAP scan fetches the last 50 emails from the inbox and inserts all of them into `imap_scan_results` unconditionally. There is no deduplication logic based on `message_id`, so the same emails are re-inserted on every scan cycle.

**Problematic code** (`backend/email_connectors/imap_store.py`, lines 117–141):

```python
def save_scan_results(username, scanned_emails):
    now = datetime.now(timezone.utc).isoformat()
    with _connection() as conn:
        conn.executemany(
            """
            INSERT INTO imap_scan_results
                (username, message_id, subject, sender, date, prediction, risk_score, trust_level, scanned_at)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
            """,   # ← No ON CONFLICT / upsert logic
            [...]
        )
```

**Growth rate calculation:**

| Scan interval | Rows inserted per day | Rows inserted per month |
|---|---|---|
| 15 minutes | 4,800 | ~144,000 |
| 30 minutes | 2,400 | ~72,000 |
| 60 minutes | 1,200 | ~36,000 |

- There is no index on `(username, message_id)` to prevent duplicates.
- There is no retention policy, cleanup job, or TTL on old rows.
- After 6 months of a single active user with 15-minute scanning, the table contains **~864,000 duplicate rows** of the same 50 emails.
- The `GET /imap/scan-results` endpoint returns the most recent `limit` rows (default 100), which will eventually all be duplicates of the same emails, making the history view useless.

---

## Steps to Reproduce

1. Connect an IMAP inbox with `scan_interval_minutes: 15`.
2. Wait 30 minutes (two scan cycles complete).
3. Query the SQLite database directly:
   ```bash
   sqlite3 imap_connections.db "SELECT message_id, COUNT(*) as cnt FROM imap_scan_results GROUP BY message_id ORDER BY cnt DESC LIMIT 5;"
   ```
4. Observe that every `message_id` has `cnt >= 2`, growing by 1 on every scan cycle.

---

## Expected Behavior

Each unique email (identified by `message_id` + `username`) is stored only once. Re-scanning updates the existing row rather than inserting a duplicate. Old scan results beyond a configurable retention period are automatically purged.

---

## Actual Behavior

Every scan unconditionally inserts 50 new rows regardless of whether those emails have been scanned before. The table grows without bound with no deduplication or cleanup.

---

## Proposed Fix

```python
# imap_store.py — use upsert to prevent duplicates
conn.executemany(
    """
    INSERT INTO imap_scan_results
        (username, message_id, subject, sender, date, prediction, risk_score, trust_level, scanned_at)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
    ON CONFLICT(username, message_id) DO UPDATE SET
        prediction=excluded.prediction,
        risk_score=excluded.risk_score,
        trust_level=excluded.trust_level,
        scanned_at=excluded.scanned_at
    """,
    [...]
)
```

Also add a unique index and a cleanup job:

```sql
-- schema migration
CREATE UNIQUE INDEX IF NOT EXISTS uq_imap_results_user_msgid
    ON imap_scan_results(username, message_id);

-- periodic cleanup: keep only the last 90 days
DELETE FROM imap_scan_results
WHERE scanned_at < datetime('now', '-90 days');
```

---

## Scope of Contribution

- [ ] Frontend
- [x] Backend
- [x] Database
- [x] API
- [ ] Authentication
- [ ] AI/ML
- [ ] DevOps / CI-CD
- [ ] Documentation
- [ ] UI/UX

---

## Pre-submission Checklist

- [x] I have checked existing issues for duplicates.
- [x] I have verified this issue exists in the current codebase.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: IMAP Scan Results Never Deduplicated — Same Emails Re-Inserted Every Scan, Table Grows ~144,000 Rows/Month Per User #223

Affected File

Bug Description

Steps to Reproduce

Expected Behavior

Actual Behavior

Proposed Fix

Scope of Contribution

Pre-submission Checklist

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Scan interval	Rows inserted per day	Rows inserted per month
15 minutes	4,800	~144,000
30 minutes	2,400	~72,000
60 minutes	1,200	~36,000

Bug: IMAP Scan Results Never Deduplicated — Same Emails Re-Inserted Every Scan, Table Grows ~144,000 Rows/Month Per User #223

Description

Affected File

Bug Description

Steps to Reproduce

Expected Behavior

Actual Behavior

Proposed Fix

Scope of Contribution

Pre-submission Checklist

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions