Server-side token refresh for long-running async jobs #215

@AjayThorve

Background

Async deep-research jobs frequently outlive the auth token captured at submission time, causing tool calls inside the worker to fail with 401. The current flow:

  1. The HTTP request carries the user's ID token (cookie or Authorization header).
  2. aiq_api.auth.middleware validates it and attaches the user dict to a request ContextVar.
  3. submit_agent_job (frontends/aiq_api/src/aiq_api/jobs/submit.py:174-178) captures the token via get_auth_token() and serializes it into Dask job args alongside the input text, data sources, etc.
  4. The Dask worker reads the token in runner.py:243-250 and pushes it into the job_auth_token ContextVar so tools that need authenticated calls can retrieve it via get_auth_token() during the run.
  5. Because the token is frozen at submission, there is no refresh path. Once the ID token expires mid-job, every subsequent authenticated tool call fails with an "authentication token expired or invalid" error and the job ends in failure.
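The failure mode in the flow above can be reproduced with a small sketch. It is not the code in runner.py or _auth_context.py: `FakeToken`, `call_tool`, and `run_job` are hypothetical stand-ins, and the 5-minute lifetime is an arbitrary assumption. Only the names `job_auth_token` and `get_auth_token()` come from the description above.

```python
import contextvars
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical stand-in for the worker-side ContextVar described in step 4.
job_auth_token: contextvars.ContextVar = contextvars.ContextVar(
    "job_auth_token", default=None
)


@dataclass(frozen=True)
class FakeToken:
    value: str
    expires_at: float  # seconds since job submission, for illustration


def get_auth_token() -> Optional[FakeToken]:
    """Tool-facing accessor: returns whatever was frozen at submission."""
    return job_auth_token.get()


def call_tool(now: float) -> int:
    """Simulate an authenticated tool call; returns an HTTP-like status."""
    token = get_auth_token()
    if token is None or now >= token.expires_at:
        return 401  # "authentication token expired or invalid"
    return 200


def run_job(token: FakeToken, tool_call_times: List[float]) -> List[int]:
    # The token is set once at job start and never replaced.
    job_auth_token.set(token)
    return [call_tool(t) for t in tool_call_times]


# Assumed 5-minute token; tool calls at 1, 4, and 10 minutes into the job.
token = FakeToken("id-token", expires_at=300.0)
statuses = run_job(token, [60.0, 240.0, 600.0])
print(statuses)
```

Every call made after the assumed expiry returns 401, because nothing in the worker can replace the value held by the ContextVar.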

Deep-research workflows commonly run 10+ minutes; jobs that get queued behind others or hit retry loops can run far longer. Any ID token whose lifetime is shorter than the worst-case job duration will expire mid-run and fail the job.

Proposal

Move token-refresh responsibility from the client to the server, and have workers fetch a fresh token on demand instead of carrying a frozen one:

  1. Login flow. The OAuth callback receives {access_token, id_token, refresh_token, expiries} from the IdP. Today the refresh token is dropped. Persist all three server-side, keyed by an opaque session id stored in an httpOnly cookie.
  2. Token store. Add a TokenStore interface in aiq_agent.auth with a default Postgres-backed implementation reusing NAT_JOB_STORE_DB_URL. Suggested schema: (session_id, user_sub, refresh_token_encrypted, id_token_expires_at, refresh_token_expires_at, updated_at). Encryption key from a new env var (AIQ_TOKEN_ENCRYPTION_KEY). The interface should be pluggable so deployments can swap in another backing store.
  3. Refresh-on-demand. Replace the captured-string auth_token arg in submit_agent_job / run_agent_job with a session_id. Inside the worker, get_auth_token() consults the store: if the cached ID token is within N seconds of expiry, exchange the refresh token for a new one against the IdP token endpoint, persist, return the fresh token. The tool-facing API (get_auth_token()) stays unchanged.
  4. Client contract. Browsers no longer need to send or refresh the ID token themselves — the session cookie is enough. Programmatic callers using short-lived bearer tokens keep the existing path; they don't have the same long-job exposure because their workflows don't typically hold tokens across multi-step async runs.
  5. Refresh-failure surfacing. Emit a structured job event when refresh fails (refresh token revoked, IdP unreachable) so the UI can prompt re-login mid-job rather than letting the job silently fail.
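Steps 2 and 3 could look roughly like the sketch below. It is one possible shape, not a settled design: `SessionTokens`, `InMemoryTokenStore`, `TokenRefreshError`, `get_fresh_id_token`, and the `skew_seconds` parameter are hypothetical names. It also assumes the ID token itself is cached next to the refresh token, it leaves out encryption, and it uses an in-memory dict where the proposal calls for Postgres. The IdP call is injected as a function so the logic can be exercised without a network.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Protocol, Tuple


@dataclass
class SessionTokens:
    session_id: str
    user_sub: str
    id_token: str               # assumption: cached alongside the refresh token
    refresh_token: str          # encrypted at rest in the proposal; plaintext here
    id_token_expires_at: float  # epoch seconds


class TokenStore(Protocol):
    """Pluggable store interface (hypothetical method names)."""

    def get(self, session_id: str) -> Optional[SessionTokens]: ...
    def put(self, tokens: SessionTokens) -> None: ...


class InMemoryTokenStore:
    """Test double standing in for the Postgres-backed default."""

    def __init__(self) -> None:
        self._rows: Dict[str, SessionTokens] = {}

    def get(self, session_id: str) -> Optional[SessionTokens]:
        return self._rows.get(session_id)

    def put(self, tokens: SessionTokens) -> None:
        self._rows[tokens.session_id] = tokens


class TokenRefreshError(Exception):
    """Raised when no usable token can be produced for a session."""


# refresh_token -> (new_id_token, new_refresh_token, new_expires_at)
RefreshFn = Callable[[str], Tuple[str, str, float]]


def get_fresh_id_token(store: TokenStore, session_id: str, refresh: RefreshFn,
                       now: float, skew_seconds: float = 60.0) -> str:
    tokens = store.get(session_id)
    if tokens is None:
        raise TokenRefreshError("unknown session")
    # Cached token is still comfortably valid: return it untouched.
    if tokens.id_token_expires_at - now > skew_seconds:
        return tokens.id_token
    try:
        new_id, new_refresh, new_exp = refresh(tokens.refresh_token)
    except Exception as exc:  # revoked refresh token, IdP unreachable, ...
        raise TokenRefreshError(str(exc)) from exc
    # Persist before returning so later calls see the refreshed pair.
    store.put(SessionTokens(session_id, tokens.user_sub, new_id, new_refresh, new_exp))
    return new_id


store = InMemoryTokenStore()
store.put(SessionTokens("sess-1", "user-42", "id-1", "rt-1", id_token_expires_at=100.0))


def fake_refresh(refresh_token: str) -> Tuple[str, str, float]:
    return ("id-2", "rt-2", 400.0)


early = get_fresh_id_token(store, "sess-1", fake_refresh, now=10.0)  # 90 s left
late = get_fresh_id_token(store, "sess-1", fake_refresh, now=50.0)   # 50 s left
```

With the assumed 60-second skew, the first call returns the cached token and the second triggers a refresh. A real worker-side get_auth_token() would wrap a call like this, so tools keep the unchanged API from step 3.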

This is the standard backend-for-frontend pattern: short-lived access token at the edge, long-lived refresh token held by the server only.
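For reference, the server-side exchange uses the OAuth 2.0 refresh grant: a form-encoded POST to the IdP token endpoint with `grant_type=refresh_token`. The sketch below only builds the request body and a possible refresh-failure event payload. The helper names, the event type string, and the event fields are hypothetical. Confidential clients would also need to authenticate to the token endpoint, which is omitted here.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_refresh_request(refresh_token: str, client_id: str) -> bytes:
    """Form body for a refresh-token grant (sent as application/x-www-form-urlencoded)."""
    return urlencode({
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": client_id,
    }).encode("ascii")


def refresh_failed_event(job_id: str, reason: str) -> str:
    """Hypothetical structured job event the UI could use to prompt re-login."""
    return json.dumps({
        "type": "auth.refresh_failed",  # assumed event name
        "job_id": job_id,
        "reason": reason,               # e.g. "refresh_token_revoked", "idp_unreachable"
        "action": "reauthenticate",
    }, sort_keys=True)


body = build_refresh_request("rt-1", "aiq-web")
event = refresh_failed_event("job-123", "refresh_token_revoked")
print(parse_qs(body.decode("ascii")))
print(event)
```

The event deliberately carries no session id or token material, so it can be passed to the browser as-is.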

Out of scope

  • IdP-specific refresh adapters beyond a reference OIDC implementation — additional providers can be added as separate TokenStore / refresh-flow adapters.
  • Token-handoff for synchronous (non-Dask) requests — the freeze-at-request-time model is fine for sub-minute calls.
  • CLI flows that already manage their own refresh-token cache locally.

References

  • Job submit / token capture: frontends/aiq_api/src/aiq_api/jobs/submit.py:174-201
  • Worker token propagation: frontends/aiq_api/src/aiq_api/jobs/runner.py:243-250, _auth_context.py
  • Public auth contract: frontends/aiq_api/src/aiq_api/auth/middleware.py, auth/base.py
