Skip to content

Historical sessions missing user_id after unified user model upgrade #555

@ColeMurray

Description

@ColeMurray

Context

The unified user model migration (Phases 1–6, #523) introduced canonical user tracking via users and user_identities tables. New sessions created after the upgrade automatically get a user_id via resolveOrCreateUser. However, sessions created before the upgrade have user_id = NULL and are grouped by scm_login in analytics instead of the canonical user identity.

This causes the same person to appear as two entries in the Per-User Breakdown on the analytics page — one from historical sessions (keyed by scm_login) and one from new sessions (keyed by canonical user_id).

Impact

  • Analytics only — session functionality is unaffected
  • Temporary — historical sessions age out of the analytics window (7/14/30/90 days)
  • Cosmetic — the duplicate entries show the same person with slightly different labels (e.g., GitHub login vs. display name)

Workaround

After upgrading, each user's first new session (from any source — web, Slack, GitHub, Linear) creates their canonical user record in the users table. Once that exists, you can backfill historical sessions with a targeted D1 query.

Step 1: Find canonical user IDs

npx wrangler d1 execute <database-name> --remote \
  --command "SELECT u.id, u.display_name, ui.provider_login FROM users u JOIN user_identities ui ON u.id = ui.user_id WHERE ui.provider = 'github'"

Step 2: Backfill each user's historical sessions

For each user, link their old sessions by scm_login:

npx wrangler d1 execute <database-name> --remote \
  --command "UPDATE sessions SET user_id = '<user_id>' WHERE scm_login = '<github_login>' AND user_id IS NULL"

Step 3: Verify

npx wrangler d1 execute <database-name> --remote \
  --command "SELECT COUNT(*) total, COUNT(user_id) linked, COUNT(*)-COUNT(user_id) unlinked FROM sessions WHERE spawn_source IN ('user','slack-bot','linear-bot','github-bot')"

Why not an automatic migration?

A generic backfill migration was drafted (PR #554) but removed because it could not safely match historical scm_login values to canonical users without risking identity fragmentation. The scm_login field stores GitHub usernames, but canonical users are keyed by numeric GitHub user ID — and the user's display_name may differ from their login. A migration that creates users from scm_login values could produce duplicate user records for the same person.

The manual approach above is safe because it only runs after canonical users exist (from new sessions) and uses the known user_id directly.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions