Background
Async deep-research jobs frequently outlive the auth token captured at submission time, causing tool calls inside the worker to fail with 401. The current flow:
- The HTTP request carries the user's ID token (cookie or Authorization header).
- aiq_api.auth.middleware validates it and attaches the user dict to a request ContextVar.
- submit_agent_job (frontends/aiq_api/src/aiq_api/jobs/submit.py:174-178) captures the token via get_auth_token() and serializes it into Dask job args alongside the input text, data sources, etc.
- The Dask worker reads the token in runner.py:243-250 and pushes it into the job_auth_token ContextVar so tools that need authenticated calls can retrieve it via get_auth_token() during the run.
- Because the token is frozen at submission, there is no refresh path. Once the ID token expires mid-job, every subsequent authenticated tool call fails with an "authentication token expired or invalid" error and the job ends in failure.
Deep-research workflows commonly run 10+ minutes; jobs that get queued behind others or hit retry loops can run far longer. Any ID token whose lifetime is shorter than the worst-case job duration will eventually expire mid-job and fail the run.
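The failure mode can be reproduced in a few lines. This is a minimal sketch rather than the code in runner.py: the IdToken dataclass, call_tool, run_job, and the numeric timestamps are all hypothetical stand-ins, and only the names job_auth_token and get_auth_token() come from the flow described above.

```python
import contextvars
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class IdToken:
    """Illustrative token: a value plus its expiry, in seconds on a job clock."""
    value: str
    expires_at: float


# Stand-in for the job_auth_token ContextVar (assumed shape, not the real one).
job_auth_token: contextvars.ContextVar[Optional[IdToken]] = contextvars.ContextVar(
    "job_auth_token", default=None
)


def get_auth_token() -> Optional[IdToken]:
    """Tools call this during the run; it can only return the submitted token."""
    return job_auth_token.get()


def call_tool(now: float) -> int:
    """Pretend authenticated tool call that returns an HTTP-like status code."""
    token = get_auth_token()
    if token is None or now >= token.expires_at:
        return 401  # expired or missing, and there is no refresh path
    return 200


def run_job(token: IdToken, tool_call_times: List[float]) -> List[int]:
    """Worker entry point: push the frozen token, then make the tool calls."""
    reset_handle = job_auth_token.set(token)
    try:
        return [call_tool(t) for t in tool_call_times]
    finally:
        job_auth_token.reset(reset_handle)


# A token with a 5-minute lifetime; tool calls at 1, 4, 6, and 12 minutes.
statuses = run_job(IdToken("id-token", expires_at=300.0), [60.0, 240.0, 360.0, 720.0])
print(statuses)  # [200, 200, 401, 401]
```

Every call after the 5-minute mark fails, and nothing inside the worker can recover because the only credential it holds is the token captured at submission.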
Proposal
Move token-refresh responsibility from the client to the server, and have workers fetch a fresh token on demand instead of carrying a frozen one:
- Login flow. The OAuth callback receives {access_token, id_token, refresh_token, expiries} from the IdP. Today the refresh token is dropped. Persist all three server-side, keyed by an opaque session id stored in an httpOnly cookie.
- Token store. Add a TokenStore interface in aiq_agent.auth with a default Postgres-backed implementation reusing NAT_JOB_STORE_DB_URL. Suggested schema: (session_id, user_sub, refresh_token_encrypted, id_token_expires_at, refresh_token_expires_at, updated_at). Encryption key from a new env var (AIQ_TOKEN_ENCRYPTION_KEY). The interface should be pluggable so deployments can swap in another backing store.
- Refresh-on-demand. Replace the captured-string auth_token arg in submit_agent_job / run_agent_job with a session_id. Inside the worker, get_auth_token() consults the store: if the cached ID token is within N seconds of expiry, exchange the refresh token for a new one against the IdP token endpoint, persist, return the fresh token. The tool-facing API (get_auth_token()) stays unchanged.
- Client contract. Browsers no longer need to send or refresh the ID token themselves — the session cookie is enough. Programmatic callers using short-lived bearer tokens keep the existing path; they don't have the same long-job exposure because their workflows don't typically hold tokens across multi-step async runs.
- Refresh-failure surfacing. Emit a structured job event when refresh fails (refresh token revoked, IdP unreachable) so the UI can prompt re-login mid-job rather than letting the job silently fail.
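The token store and refresh-on-demand steps above could look roughly like the following. This is a sketch under stated assumptions, not a proposed final API: SessionTokens, InMemoryTokenStore, TokenRefreshError, get_fresh_id_token, and the refresh callback signature are hypothetical names. The in-memory dict stands in for the Postgres-backed store, encryption with AIQ_TOKEN_ENCRYPTION_KEY is omitted, and the refresh callback is a stub rather than a real call to the IdP token endpoint. The sketch also caches the ID token itself, which the suggested schema does not list as a column.

```python
import time
from dataclasses import dataclass, replace
from typing import Callable, Dict, Optional, Protocol, Tuple


@dataclass(frozen=True)
class SessionTokens:
    """One row of the suggested schema (encryption omitted in this sketch)."""
    session_id: str
    user_sub: str
    id_token: str  # assumption: the cached ID token lives next to the refresh token
    refresh_token: str
    id_token_expires_at: float
    refresh_token_expires_at: float


class TokenStore(Protocol):
    """Pluggable interface; a Postgres implementation would satisfy the same shape."""
    def get(self, session_id: str) -> Optional[SessionTokens]: ...
    def put(self, tokens: SessionTokens) -> None: ...


class InMemoryTokenStore:
    """Dict-backed stand-in, for illustration and tests only."""
    def __init__(self) -> None:
        self._rows: Dict[str, SessionTokens] = {}

    def get(self, session_id: str) -> Optional[SessionTokens]:
        return self._rows.get(session_id)

    def put(self, tokens: SessionTokens) -> None:
        self._rows[tokens.session_id] = tokens


class TokenRefreshError(Exception):
    """Raised when no usable token can be produced for the session."""


# Hypothetical callback: refresh_token -> (id_token, refresh_token, lifetime_seconds)
RefreshFn = Callable[[str], Tuple[str, str, float]]


def get_fresh_id_token(
    store: TokenStore,
    session_id: str,
    refresh: RefreshFn,
    *,
    skew_seconds: float = 60.0,  # the "N seconds" threshold from the proposal
    now: Callable[[], float] = time.time,
) -> str:
    tokens = store.get(session_id)
    if tokens is None:
        raise TokenRefreshError("unknown session")
    current = now()
    if tokens.id_token_expires_at - current > skew_seconds:
        return tokens.id_token  # still comfortably valid, no IdP round trip
    if current >= tokens.refresh_token_expires_at:
        raise TokenRefreshError("refresh token expired")
    id_token, refresh_token, lifetime = refresh(tokens.refresh_token)
    # Persist before returning so other workers on the same session see the new token.
    store.put(replace(tokens, id_token=id_token, refresh_token=refresh_token,
                      id_token_expires_at=current + lifetime))
    return id_token


store = InMemoryTokenStore()
store.put(SessionTokens("sess-1", "user-42", "id-0", "rt-0",
                        id_token_expires_at=300.0, refresh_token_expires_at=86400.0))
refresh_calls = []

def fake_refresh(refresh_token: str) -> Tuple[str, str, float]:
    refresh_calls.append(refresh_token)
    n = len(refresh_calls)
    return (f"id-{n}", f"rt-{n}", 300.0)

first = get_fresh_id_token(store, "sess-1", fake_refresh, now=lambda: 100.0)
second = get_fresh_id_token(store, "sess-1", fake_refresh, now=lambda: 250.0)
print(first, second)  # id-0 id-1
```

At t=100 the cached token has 200 seconds left and is returned as-is. At t=250 it is inside the 60-second window, so the stub refresh runs once and the store is updated. A real implementation would also need to guard against two workers refreshing the same session concurrently, which this sketch does not attempt.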
This is the standard backend-for-frontend pattern: short-lived access token at the edge, long-lived refresh token held by the server only.
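For the refresh-failure surfacing item in the proposal, one possible shape for the structured job event is sketched below. The event type string, field names, and reason values are illustrative assumptions, not an existing schema. The rule that only a revoked refresh token forces re-login, while an unreachable IdP might be retried, is likewise a design guess. The session id is deliberately left out of the payload, since it is a secret held in the httpOnly cookie.

```python
import json
from dataclasses import dataclass
from enum import Enum


class RefreshFailureReason(Enum):
    """The two failure causes named in the proposal (values are hypothetical)."""
    REFRESH_TOKEN_REVOKED = "refresh_token_revoked"
    IDP_UNREACHABLE = "idp_unreachable"


@dataclass(frozen=True)
class AuthRefreshFailedEvent:
    job_id: str
    reason: RefreshFailureReason
    requires_relogin: bool
    event_type: str = "auth.refresh_failed"  # hypothetical event name

    def to_json(self) -> str:
        """Serialize to a stable JSON payload for the job event stream."""
        return json.dumps(
            {
                "event_type": self.event_type,
                "job_id": self.job_id,
                "reason": self.reason.value,
                "requires_relogin": self.requires_relogin,
            },
            sort_keys=True,
        )


def build_refresh_failure_event(job_id: str, reason: RefreshFailureReason) -> AuthRefreshFailedEvent:
    # Assumption: a revoked token needs a new login; an unreachable IdP may recover.
    return AuthRefreshFailedEvent(
        job_id=job_id,
        reason=reason,
        requires_relogin=reason is RefreshFailureReason.REFRESH_TOKEN_REVOKED,
    )


event = build_refresh_failure_event("job-123", RefreshFailureReason.REFRESH_TOKEN_REVOKED)
payload = event.to_json()
print(payload)
```

A UI subscribed to job events could key off event_type and requires_relogin to show a re-login prompt while the job is still running.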
Out of scope
- IdP-specific refresh adapters beyond a reference OIDC implementation. Additional providers can be added as separate TokenStore / refresh-flow adapters.
- Token-handoff for synchronous (non-Dask) requests — the freeze-at-request-time model is fine for sub-minute calls.
- CLI flows that already manage their own refresh-token cache locally.
References
- Job submit / token capture: frontends/aiq_api/src/aiq_api/jobs/submit.py:174-201
- Worker token propagation: frontends/aiq_api/src/aiq_api/jobs/runner.py:243-250, _auth_context.py
- Public auth contract: frontends/aiq_api/src/aiq_api/auth/middleware.py, auth/base.py