feat: Microsoft Teams ingestion (delegated Graph sync) #398

Open
njt wants to merge 1 commit into kenn-io:main from njt:feat/teams-ingestion
@njt

njt commented Jun 20, 2026

Contributor

What

Sync your own Microsoft Teams 1:1/group/meeting chats and channel messages into msgvault via delegated Microsoft Graph, searchable alongside mail through the existing TUI / FTS / Parquet analytics.

Highlights

  • New add-teams (delegated Graph OAuth) and sync-teams (full + incremental, with streamed per-conversation progress) commands. Teams also runs under serve scheduled syncs, and the daemon now syncs every source type registered for an identifier (so Teams and Outlook/IMAP on one address both run).
  • Reuses the existing chat schema — no new core tables: chats → direct/group conversations, channels → channel conversations with root+reply threading, plus reactions, sender + recipient (to) + @mention rows, identity resolution (AAD object id → email dedup, unifying with mail identities), inline images downloaded to content-addressed storage, and shared-file links recorded.
  • Incremental sync: chats via lastModifiedDateTime list filtering (no delegated per-chat delta endpoint exists), channels via /messages/delta; per-conversation cursors persisted in sync_runs.cursor_after, flushed after each conversation so an interrupted long backfill resumes mid-stream.
  • Microsoft OAuth kept independent from the existing Outlook/IMAP manager — separate teams_<email>.json token with Graph scopes only, so IMAP and Teams can each be used alone or together.

Use

  1. Register an Entra app with delegated Graph permissions (Chat.Read, ChannelMessage.Read.All, Team.ReadBasic.All, Channel.ReadBasic.All, User.Read) and grant admin consent.
  2. Add to config.toml:
    [microsoft]
    client_id = "<app-client-id>"
    tenant_id = "<directory-tenant-id>"
  3. msgvault add-teams you@tenant.com then msgvault sync-teams you@tenant.com (--no-channels / --limit for scoped runs). Press a inside msgvault tui to filter to the Teams account.

Notes

  • Validated on a live tenant: ~313k messages spanning 2017–2026 (chats, channels, reactions, threaded replies) — confirms full-history backfill beyond Graph's 8-month delta window.
  • Meeting transcripts are intentionally out of scope (separate spec; delegated transcript access is effectively organizer-only). The only remaining follow-up is preserving a departing user's shared SharePoint/OneDrive files before account removal.

🤖 Generated with Claude Code

@roborev-ci

roborev-ci Bot commented Jun 20, 2026

roborev: Review Unavailable (e58e9fc)

The review agent repeatedly failed to run (likely an agent or configuration error). roborev will try again on the next commit.

Last error: agent: claude-code failed stream: stream errors: You've hit your session limit · resets 5:50am (UTC): exit status 1

@roborev-ci

roborev-ci Bot commented Jun 20, 2026

roborev: Review Unavailable (7d3cff4)

The review agent repeatedly failed to run (likely an agent or configuration error). roborev will try again on the next commit.

Last error: agent: claude-code failed stream: stream errors: You've hit your session limit · resets 5:50am (UTC): exit status 1

@roborev-ci

roborev-ci Bot commented Jun 22, 2026

roborev: Combined Review (d4652cb)

Summary verdict: changes are close, but there is one high-severity data-loss issue and several medium-severity Teams integration correctness gaps.

High

  • internal/teams/importer.go:266: If fetching replies for a channel root fails during the initial backfill or delta-token fallback, the importer only increments sum.Errors, continues, primes a new delta cursor, and later stores it. The missed historical replies will not be retried on the next sync.
    • Fix: Treat reply-fetch failures during backfill/fallback as channel failure, or avoid advancing that channel’s delta cursor until all roots and replies were fetched successfully.

Medium

  • internal/teams/mapping.go:42: Inline hosted-content images are downloaded as attachment rows, but has_attachments and attachment_count only count Graph attachments and recordings. Messages with only inline images will appear as having no attachments in APIs/search/UI.

    • Fix: Count hosted-content URLs before UpsertMessage, or update the message attachment fields after successful inline downloads/backfill.
  • internal/teams/importer.go:454: Mention rows are only replaced when the current Graph message has at least one resolvable mention. If a message is edited to remove mentions, or a --full repair re-imports it with no mentions, stale recipient_type='mention' rows remain.

    • Fix: Always call ReplaceMessageRecipients(messageID, "mention", mentionIDs, mentionNames) after processing the current mentions, even when the resulting slice is empty.
  • cmd/msgvault/cmd/remove_account.go:155: remove-account has no sourceTypeTeams credential cleanup case, so deleting a Teams account removes database rows but leaves the teams_<email>.json Graph OAuth token on disk.

    • Fix: Add a Teams case that uses microsoft.NewGraphManager(...).DeleteToken(source.Identifier).
  • cmd/msgvault/cmd/serve.go:72: serve still requires Google [oauth] config before starting. A Teams-only setup with [microsoft] client_id and a Teams token cannot run scheduled Teams syncs through the daemon.

    • Fix: Defer Google OAuth validation to Gmail sync paths, or allow startup when Microsoft OAuth is configured and scheduled sources are Teams-only.
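
The "always replace, even when empty" rule from the mention finding above can be sketched with an in-memory store. `recipientStore` and its `replace` method are hypothetical stand-ins for the real recipient table and `ReplaceMessageRecipients`, whose actual signature is not shown in this thread.

```go
package main

import "fmt"

// recipientStore is a hypothetical in-memory stand-in for the message
// recipient table, keyed by message ID and then by recipient type.
type recipientStore map[int64]map[string][]string

// replace overwrites all rows of one recipient type for a message.
// An empty ids slice removes the existing rows, which is what clears
// stale mentions after an edit.
func (s recipientStore) replace(messageID int64, kind string, ids []string) {
	if s[messageID] == nil {
		s[messageID] = map[string][]string{}
	}
	if len(ids) == 0 {
		delete(s[messageID], kind)
		return
	}
	s[messageID][kind] = append([]string(nil), ids...)
}

func main() {
	store := recipientStore{}

	// First import: the message mentions two users.
	store.replace(1, "mention", []string{"alice", "bob"})

	// Re-import after the mentions were edited out. The call is made
	// unconditionally, so the old rows are cleared.
	store.replace(1, "mention", nil)

	fmt.Println(len(store[1]["mention"]))
}
```

Guarding the call with a non-empty check would skip the second `replace` and leave both rows in place, which is the bug the review describes.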

Panel: ci_default_security | Synthesis: codex, 16s | Members: codex_default (codex/default, done, 7m17s), codex_security (codex/security, done, 5m24s) | Total: 12m57s

@njt

njt commented Jun 22, 2026

Contributor Author

Two unrelated bug fixes that surfaced while building this were split out into their own PRs to keep this one scoped to Teams ingestion:

Neither depends on this PR; both branch off current main.

@roborev-ci

roborev-ci Bot commented Jun 22, 2026

roborev: Combined Review (13c4591)

High-level verdict: changes need fixes before merge due to one high-risk Teams backfill data-loss path and several medium issues around auth cleanup, startup validation, embeddings, and stale child data.

High

  • internal/teams/importer.go:270
    A channel backfill can fail to fetch replies for a root message, count the error, and still prime/save the channel delta cursor later. Older replies missed during that first backfill can then be skipped permanently because future runs resume from the delta link instead of retrying the full replies walk.
    Fix: Treat reply-fetch failures during full channel backfill as channel-incomplete: do not save the channel delta/checkpoint for that channel, or return an error so the next run retries the full roots+replies pass.

Medium

  • cmd/msgvault/cmd/serve.go:72
    serve still requires Google OAuth config via cfg.OAuth.HasAnyConfig(). A Teams-only setup with only [microsoft].client_id configured cannot start the daemon, even though this change adds scheduled Teams sync support.
    Fix: Allow the daemon to start when either Google OAuth or Microsoft Graph OAuth is configured, e.g. include cfg.Microsoft.ClientID != "" in the validation.

  • cmd/msgvault/cmd/remove_account.go:155
    Removing a Teams source does not remove the new teams_<email>.json Graph token. The database source is deleted, but the credential remains and can silently re-authorize a future Teams sync. This also leaves a still-valid Graph refresh token behind after an operator believes the account was scrubbed.
    Fix: Add a sourceTypeTeams case that calls microsoft.NewGraphManager(...).DeleteToken(source.Identifier), and add a regression test confirming remove-account --type teams deletes teams_<email>.json.

  • cmd/msgvault/cmd/serve.go:677
    Teams sync ignores vectorFeatures, and sync-teams does not set up vector enqueueing either. New Teams messages are written to the store but never added to pending_embeddings, so semantic/vector search misses them until a full embedding rebuild.
    Fix: Add optional embedding enqueue support to the Teams importer, return/collect persisted message IDs, and wire it from both sync-teams and scheduled Teams sync.

  • internal/teams/importer.go:458
    Re-imported Teams messages do not clear child metadata when Graph returns an empty collection. Mentions are only replaced when len(gm.Mentions) > 0, and reactions/attachments are append-only, so edits that remove mentions/reactions/attachments leave stale rows.
    Fix: Use replace/delete-then-insert semantics for Teams-managed child collections, including calling ReplaceMessageRecipients with empty slices for mentions and adding equivalent replacement paths for reactions and attachments.


Panel: ci_default_security | Synthesis: codex, 14s | Members: codex_default (codex/default, done, 8m21s), codex_security (codex/security, done, 5m45s) | Total: 14m20s

wesm pushed a commit that referenced this pull request Jun 22, 2026
## What

Several importers built `time.Time` values from epoch timestamps with `time.Unix`/`time.UnixMilli` but **without `.UTC()`**, leaving them in the runner's local zone — while the rest of each importer stores dates in UTC. Any code reading the calendar day (or the Parquet year partition) is then off by one in zones east of UTC.

Fixes:
- `internal/sync/sync.go` — `processBatch` oldest-message date (progress tracking).
- `internal/whatsapp/mapping.go` — message `SentAt`.
- `internal/whatsapp/importer.go` — reaction `createdAt`.

## Why it matters

`TestProcessBatch_OldestDatePropagation` fails on any machine east of UTC (e.g. NZ): the fixture `2024-01-10T12:00:00Z` reads back as Jan 11 local. The tests are correct; the production code was the bug. Adds `TestMapMessageSentAtIsUTC` (asserts the stored zone is UTC, machine-independent).

## Possible later fixes (out of scope here)

The same `time.Unix(...)`-without-`.UTC()` pattern also appears in the embedding-generation status timestamps, but these are **operator-facing status values** round-tripped from unix-int columns (not message dates), so they don't affect partitioning/dedup/cross-system date semantics. Local-time display is arguably fine; normalizing them to UTC would be a consistency-only follow-up. Sites:
- `cmd/msgvault/cmd/embeddings_manage.go` — `StartedAt`, `SeededAt`, `CompletedAt`, `ActivatedAt`.
- `internal/vector/pgvector/backend.go` — `StartedAt`, `CompletedAt`, `ActivatedAt`.
- `internal/vector/sqlitevec/backend.go` — `StartedAt`, `CompletedAt`, `ActivatedAt`.

Left unchanged here to avoid churning working code on a style call; documented so a future pass can decide.

## Scope

Independent of the Teams PR (#398) — branched from `main`, touches only `internal/sync` and `internal/whatsapp`.

Co-authored-by: Nat Torkington <njt@users.noreply.github.com>
@wesm

wesm commented Jun 25, 2026

Member

looking at this

wesm force-pushed the feat/teams-ingestion branch from 13c4591 to 7e1ecd2 on June 25, 2026 03:46
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

roborev: Combined Review (7e1ecd2)

Summary verdict: Two Medium issues need attention; no Critical or High findings were reported.

Medium

  • internal/teams/mapping.go:43: Inline hostedContent images are later stored as attachment rows, but HasAttachments / AttachmentCount only count gm.Attachments. Messages with only inline images are therefore marked as having no attachments.

    • Fix: Include hostedContent references in the attachment count before UpsertMessage, or update the message flags after successful inline-image downloads.
  • internal/teams/importer.go:531: Teams reference/recording links are stored in attachments with synthetic content hashes and URL storage_path, but existing attachment query/export paths treat content_hash as a local content-addressed file. These rows will appear exportable while export/read flows look for nonexistent local files and do not expose the URL.

    • Fix: Add explicit support for URL-backed attachments in query/export/show paths, or store these links separately from file-backed attachments.

Panel: ci_default_security | Synthesis: codex, 8s | Members: codex_default (codex/default, done, 5m57s), codex_security (codex/security, done, 5m32s) | Total: 11m37s

@roborev-ci

roborev-ci Bot commented Jun 25, 2026

roborev: Combined Review (d06220d)

High: Teams inline media can exfiltrate Graph bearer tokens.
Medium: Incremental chat sync can permanently skip same-timestamp updates.

High

  • internal/teams/importer.go:607, bearer-token sink at internal/teams/client.go:47
    A Teams participant controls message body HTML and can include a crafted hosted-content URL such as:

    https://graph.microsoft.com/v1.0https://attacker.example/hostedContents/1/$value
    

    hostedRe matches it, hostedFetchPath strips the base prefix with strings.TrimPrefix, and the result becomes an attacker-controlled absolute URL. Client.get then treats any http URL as absolute and sends Authorization: Bearer <Graph token> to that host.

    Fix: enforce same-origin Graph URLs before attaching Authorization. For hosted content, parse the URL, require scheme/host to match the configured Graph base URL, require the path prefix on a path-segment boundary, and return only a path-absolute request target. Also make Client.get reject absolute URLs whose scheme/host differ from baseURL, including @odata.nextLink and delta links.
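
One way to sketch the same-origin check with Go's `net/url` is shown below. `graphRequestTarget` is a hypothetical helper, not the project's function. It assumes the base URL is `https://graph.microsoft.com/v1.0` and returns a path relative to that base.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// graphRequestTarget checks that raw points at the same scheme and host as
// base, and that its path sits under the base path on a segment boundary.
// On success it returns a path-absolute target relative to base.
func graphRequestTarget(base, raw string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if u.User != nil || u.Scheme != b.Scheme || !strings.EqualFold(u.Host, b.Host) {
		return "", errors.New("graph url: origin does not match base")
	}
	prefix := strings.TrimSuffix(b.Path, "/")
	// Require "/v1.0/" rather than "/v1.0" so "/v1.0https://..." is rejected.
	if !strings.HasPrefix(u.Path, prefix+"/") {
		return "", errors.New("graph url: path is outside the base prefix")
	}
	target := strings.TrimPrefix(u.EscapedPath(), prefix)
	if u.RawQuery != "" {
		target += "?" + u.RawQuery
	}
	return target, nil
}

func main() {
	base := "https://graph.microsoft.com/v1.0"
	for _, raw := range []string{
		"https://graph.microsoft.com/v1.0/chats/1/hostedContents/2/$value",
		"https://graph.microsoft.com/v1.0https://attacker.example/hostedContents/1/$value",
		"https://attacker.example/v1.0/chats/1/hostedContents/2/$value",
	} {
		target, err := graphRequestTarget(base, raw)
		fmt.Printf("target=%q err=%v\n", target, err)
	}
}
```

Because the helper returns only a path, the caller never has an attacker-supplied absolute URL to attach the bearer token to. Pagination and delta links would need the same check before being followed.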

Medium

  • internal/teams/client.go:172
    Chat incremental sync filters with lastModifiedDateTime gt <cursor> after saving the max timestamp seen. If Graph returns second-level or otherwise non-unique timestamps, later messages or updates with the same lastModifiedDateTime as the saved cursor are skipped permanently.

    Fix: use an inclusive boundary such as ge with upsert/deduplication, or store a tie-breaker cursor such as timestamp plus message IDs at the boundary.


Panel: ci_default_security | Synthesis: codex, 11s | Members: codex_default (codex/default, done, 6m23s), codex_security (codex/security, done, 5m22s) | Total: 11m56s

@roborev-ci

roborev-ci Bot commented Jun 25, 2026

roborev: Combined Review (7db407a)

Medium-risk issues remain in the Teams importer/OAuth changes; no Critical or High findings were reported.

Medium

  • internal/teams/importer.go:290 and internal/teams/importer.go:323
    When channel root/reply listing succeeds but delta priming fails, the continue skips phase 1, so none of the already-collected channel messages are persisted. Initial channel backfill becomes dependent on delta succeeding and loses the intended list-based fallback.
    Fix: Count the delta-prime error, leave newDelta empty, and continue to persistence so fetched messages are stored without advancing the channel cursor. Add a regression test for list/replies success with delta failure.

  • internal/microsoft/graph_oauth.go:17
    The importer resolves AAD participants via /users/{id}, but GraphScopes only requests User.Read, which covers the signed-in user rather than directory lookups for other users. In real tenants this can cause participant email resolution to fall back to Teams-only identifiers, breaking cross-platform participant unification.
    Fix: Add a directory-read scope such as User.ReadBasic.All or User.Read.All to GraphScopes, and update setup docs/consent guidance.


Panel: ci_default_security | Synthesis: codex, 9s | Members: codex_default (codex/default, done, 7m55s), codex_security (codex/security, done, 5m16s) | Total: 13m20s

@roborev-ci

roborev-ci Bot commented Jun 25, 2026

roborev: Combined Review (c74f849)

Code review verdict: one Medium issue needs to be addressed before merge.

Medium

  • internal/teams/importer.go:261 — Channel backfill de-dupes messages with first-wins semantics before applying delta-prime results. If a root/reply is edited or deleted after it was fetched by the list/replies pass but before the delta prime, the newer delta copy is ignored while the new delta cursor is saved, so that update can be skipped permanently.
    • Fix: Track collected messages by ID and replace the existing entry when a duplicate from delta has a newer lastModifiedDateTime or deletion marker, or persist delta-prime messages after the backfill so their upsert wins before saving the cursor.

Panel: ci_default_security | Synthesis: codex, 7s | Members: codex_default (codex/default, done, 13m5s), codex_security (codex/security, done, 5m5s) | Total: 18m17s

Squash the Teams ingestion branch into a single commit before rebasing onto origin/main. The branch adds delegated Microsoft Graph OAuth, Teams source commands, chat/channel import, sync state, hosted-content media handling, daemon scheduling, and the recovery/backfill paths needed to repair already-imported inline media.

After rebasing onto origin/main, Teams messages are also included in the new message_type search/help surface and text-mode message-type allowlists so `message_type:teams` works consistently with the main-branch query changes.

Included branch commits:
- fix(teams): close ingestion review gaps
- fix(teams): migrate legacy raw message ids
- fix(teams): repair legacy id migration references
- fix(teams): make Teams tests portable across CI backends
- fix(teams): keep URL attachments as links
- fix(teams): constrain Graph URL requests
- fix(teams): preserve channel backfill on delta prime errors
- fix(teams): reject stale Graph token scopes

Co-authored-by: Wes McKinney <wesmckinn+git@gmail.com>
Co-authored-by: Codex <codex@openai.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
wesm force-pushed the feat/teams-ingestion branch from c74f849 to 546a95c on June 25, 2026 21:06
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

roborev: Combined Review (546a95c)

Teams ingestion has one medium correctness issue; no security findings were reported.

Medium

  • internal/teams/importer.go:90: Teams imports complete without populating conversation_participants or recomputing conversation stats, so conversations.message_count, participant_count, last_message_at, and previews remain stale or empty for imported Teams conversations.
    • Fix: Add resolved senders/members to conversation_participants and call RecomputeConversationStats(sourceID) before CompleteSync.

Panel: ci_default_security | Synthesis: codex, 6s | Members: codex_default (codex/default, done, 9m40s), codex_security (codex/security, done, 2m58s) | Total: 12m44s
