openab-agent: multi-vendor OAuth ADR + cross-process auth.json locking (codex + MCP)#1190
openab-agent: multi-vendor OAuth ADR + cross-process auth.json locking (codex + MCP)#1190brettchien wants to merge 18 commits into
Conversation
Proposed ADR for the openab-agent LLM-provider OAuth revamp: a two-axis OAuthVendor adapter (auth flow vs inference transport), a cross-process flock-guarded credential-store invariant for auth.json, the CLAUDE_CODE_OAUTH_TOKEN env route, a 14-variant vendor feasibility matrix, and the /auth (PR openabdev#1185) auth-trigger model. Surfaced while reviewing PR openabdev#1187 (first OAuth vendor). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… names - Build the OAuthVendor driver on the official `oauth2` crate (already in-tree via the MCP side) instead of a hand-rolled PKCE/exchange/refresh flow; the Anthropic JSON-token-body quirk is applied via the crate's custom http-client hook. Reframe §8 accordingly (hand-rolled flow is the rejected alternative). - Remove the project Rollout-plan section (internal sequencing, not ADR material); keep the race-window mitigation in §5.4. - Use vendor names only; drop internal fleet-agent references from the matrix. - Replace unexplained "ToS-gray" with "ToS-risk" + a definition. - Fix cross-references (crate-qualified paths, §-refs, line numbers) and move the settled model-default decision under "Decisions & open questions". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…w r1) Fold in two release-blocker-class concurrency fixes from Mira's review: - §5.4 refresh-token rotation: the prior "refresh outside lock, re-read on commit" claim was wrong — N processes each send a refresh with the same RT_old before committing, tripping OAuth 2.1 §10.4 reuse -> token-family revocation. Replace with a per-tenant exclusive lock: network refresh held under the tenant lock only (not the global lock), so exactly one refresh per tenant per expiry with no head-of-line blocking across tenants. Extends to mcp:<server> tenants. - §7 /auth: key the pending PKCE verifier+state by the initiating Discord user id instead of a single global entry — prevents concurrent-user overwrite (PKCE mismatch) and session hijack. - §5.1 OAuthVendor::redirect() -> Option, since device flows have no redirect. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- §6/§9 Q2: correct the stale "git-safe" claim on the gemini GOCSPX- secret. The value is non-confidential by RFC 8252 / Google docs, but GitHub now push-protects Google secrets by default (changelog 2026-03) and partner-scans them for auto-revoke, so a raw literal is not safe in a public repo. Decide: encode-at-rest (scanner-evasion for a non-secret, NOT a security control) or env-inject at runtime — not raw text. - §7: pending PKCE entries get created_at + a 15-min GC sweep in with_auth_locked so abandoned /auth attempts don't accumulate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rett decisions r3) - §9 Q2 DECIDED: env-injection (b) is the default for the gemini/agy bundled client_secret; encode-at-rest (a) is the fallback for bundled zero-config binaries only, framed as scanner-evasion (not security). Cite rclone rcloneEncryptedClientSecret + obscure.MustReveal() as the canonical precedent (§10). - §9 Q3 DECIDED: GO gemini/grok (first wave) + agy (experimental, opt-in, ToS caveat — shares gemini's Code-Assist provider, residual risk is ToS not secret storage); No-Go cursor/kiro. Mirror as a build-decision line under §6. - §10: add rclone + GitHub/Google secret-scanning references. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
GitHub survey 2026-06-24 grounds the agy GO decision: - §6/§9 Q2: agy client_secret requirement CONFIRMED — it needs a GOCSPX- secret, a public constant >=20 antigravity-auth repos hardcode verbatim (NoeFabris/opencode-antigravity-auth, router-for-me/CLIProxyAPI, ...). Literal deliberately NOT pasted into the doc — would trip the very §9 Q2 push-protection, so we dogfood the env/encode decision. Redirect confirmed localhost:51121/oauth-callback. - §9 Q3: ecosystem evidence — agy OAuth is widely ported (opencode/pi/hermes/ openclaw plugins + proxies), proving the integration, while the same ecosystem's anti-ban / quota-locking / multi-account-rotation tooling empirically confirms the ToS-ban + 429 risks, reinforcing the opt-in gate. - §10: add the antigravity OAuth ecosystem reference set. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… in scope (r5) Grounded by a codebase survey (2026-06-24): - §4/§6: clarify "agy as a GO vendor" means a native OAuthVendor + Code-Assist inference provider — it does NOT run the agy CLI. agy speaks no ACP; the existing `antigravity` runtime variant (Mira/ECS) only works via a dedicated agy-acp adapter that shells out to the agy binary per prompt and polls its SQLite DB. The provider path sidesteps ACP entirely and supersedes the CLI-wrapper for native use, so agy's lack of ACP doesn't block the GO. - §5.4/§9 Q4: make the MCP CredentialStore revamp explicitly in-scope. auth.json has NO lock today (only atomic rename); provider save_tokens (auth.rs:234) and McpCredentialStore::save/clear (auth.rs:284-328) are two independent unlocked RMW callers, so with_auth_locked must wrap BOTH or the race persists. Also correct the stale save_tokens_for name and note with_auth_locked is new. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The §9 Q4 tail referenced 'openab-agent-mcp.md open items openabdev#1 (reqwest 0.12/0.13 split) and openabdev#8 (doctor/runtime two-store split)' — but that ADR's §10 Open Questions has only two items (mcp.json location; native-vs-broker parity), neither matching, and no such numbered items / terms exist anywhere in it. The phantom reference dated to the original draft. Replace with an accurate statement: McpCredentialStore reuses the same TokenStore/auth.json storage (openab-agent-mcp.md §6.1), so the lock lands once and serves both; the reqwest version conflict is the rmcp-OAuth dependency issue surfaced on the feat/openab-agent-mcp-resilience PR, not an mcp-ADR open item. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ive (Brett) Brett's call given the 93%-plaintext ecosystem survey: keep the bundled zero-config UX via encode-at-rest (obscure, rclone obscure.MustReveal style) as the DEFAULT, with env-injection as the alternative for fleet/pod. Reverses the prior (b)-default ordering. Framing unchanged: encode-at-rest is scanner-evasion for a non-confidential value, not a security control. Record the survey numbers (99/107 plaintext; auto-revoke largely unrealized) and that the real risk mitigated is org-repo push-protection friction, not credential loss. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Implements the ADR §5.4 invariant. auth.json had NO lock: provider save_tokens and McpCredentialStore::save/clear each did an independent unlocked read-modify-write, so a concurrent codex refresh and MCP save last-writer-wins the whole map, and N processes (one openab-agent per Discord thread) could each refresh the codex token with the same RT_old → OAuth 2.1 §10.4 token-family revocation (fleet-wide logout). Two locks, flock(2) on sidecar files (kernel auto-releases on death), cfg(unix) with a non-unix no-op: - with_auth_locked: global exclusive lock across the re-read -> mutate -> atomic-write. ALL writers funnel through it — save_tokens (codex) and McpCredentialStore::save/clear (MCP) — so writers merge onto the latest on-disk state instead of lost-updating. Held only for the fast file RMW, never across network I/O. - lock_tenant_refresh: per-tenant refresh serialisation. get_valid_token / force_refresh take the codex tenant lock (non-blocking try + async backoff + 10s timeout, held across the network refresh), with a double-checked re-read so a process that waited adopts the token another already refreshed → exactly one real refresh per tenant per expiry, no RT_old reuse. Uses libc::flock (already a cfg(unix) dep); rustix is NOT in-tree (ADR text was optimistic). New test asserts the locked RMW merges codex + MCP tenants without lost-update. fmt + clippy -D warnings + test (191 passed) green. Known gap (for review): the MCP *network* refresh is owned by rmcp's AuthorizationManager (it refreshes then calls CredentialStore::save), so MCP writes get the file-integrity lock but MCP refreshes are not yet tenant- serialised. Closing that needs an rmcp-level hook — proposed as follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Round-2 code review (Mira): - §7 pending-entry GC: add created_at to PendingPasteLogin and sweep AuthEntry::Pending older than 15 min inside with_auth_locked on every write, so abandoned /auth two-step attempts don't accumulate. PendingPasteLogin is currently a legacy tombstone with no live writer (PKCE state lives in rmcp's in-memory StateStore); the field + GC land now per ADR §7 and are forward-compatible with the forthcoming /auth two-step flow, and meanwhile sweep legacy stray entries (created_at default 0 reads as ancient). New test covers stale-swept / fresh-kept / real-tenant-untouched. - flock_try_exclusive: match std::io::ErrorKind::WouldBlock instead of a single raw EWOULDBLOCK errno, covering EAGAIN/EWOULDBLOCK across libc/BSD. - MCP network-refresh serialisation stays a follow-up (rmcp owns the refresh) — Mira concurs. fmt + clippy -D warnings + test (192 passed) green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… gap) Closes the known gap from f370110: MCP refreshes now get the same per-tenant serialisation as codex. rmcp's CredentialStore exposes no pre-refresh hook, but openab drives the MCP refresh explicitly in McpRuntimeManager::resolve_oauth_dial via client.get_access_token(). Wrap that call (per-server) with the existing auth::lock_tenant_refresh so only one process refreshes a given server at a time. No explicit double-check needed: rmcp's get_access_token re-load()s auth.json each call and returns the cached token without a network refresh when remaining >= REFRESH_BUFFER (rmcp auth.rs:1238), so a process that loses the race adopts the token the winner wrote to the shared file — no second RT_old presentation, no OAuth 2.1 §10.4 family revocation. rmcp already single-flights within one process via its AuthorizationManager Mutex; this closes the cross-process gap. - auth.rs: lock_tenant_refresh + AuthFileLock made pub(crate) for the mcp module. - ADR §5.4: document the resolve_oauth_dial serialisation point. fmt + clippy -D warnings + test (192 passed) green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…h fd Quality-only cleanups from a 4-angle /simplify pass (no behaviour change): - Extract lock_global(path) -> Result<Option<AuthFileLock>> and route both with_auth_locked and McpCredentialStore::clear through it, so the "global" sidecar name + the cfg(unix) acquire live in one place instead of two copy-pasted blocks. clear flattens (drop the inner run closure). - lock_tenant_refresh opens the lock fd once and re-issues flock on it each retry, instead of re-opening (and re-create_dir_all-ing) the file every 100ms under contention. Removes the now-unused flock_try_exclusive helper. - Document lock_tenant_refresh's double-check contract (the lock only serialises; callers must re-check freshness after acquiring). fmt + clippy -D warnings + test (192 passed) green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
Doc-only clarifications from a /code-review pass (no behaviour change): - PendingPasteLogin.created_at: warn loudly that any writer of a fresh Pending entry MUST stamp created_at, else gc_stale_pending sweeps it on the next locked write (latent footgun for the forthcoming /auth two-step writer). - lock_tenant_refresh: correct the contract — reuse-safety comes from loading the refresh token INSIDE the lock (so force_refresh, which always refreshes on a 401, is reuse-safe too); the post-lock expiry re-check is only an optimisation to skip a redundant refresh. Review also surfaced a pre-existing, out-of-scope namespacing gap (an MCP server literally named "codex" collides with the codex tenant's auth.json key and refresh lock — committed MCP creds use the bare server name, not mcp:<server> as ADR §5.4 describes) — flagged for a separate change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ci-openab-agent.yml runs cargo fmt/clippy/test/build with working-directory: openab-agent, but the crate is not a member of the parent openab workspace (members = crates/openab-core, openab-gateway) and was not excluded, so cargo errors 'current package believes it's in a workspace when it's not' and every step fails before doing any work. openab-agent is intentionally standalone (own version + dual reqwest 0.12/0.13 for rmcp), so the correct fix is an empty [workspace] table making it its own root. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This reverts commit b40693c.
ci-openab-agent.yml ran cargo from working-directory: openab-agent without the [workspace] table the crate needs (it's standalone, not a parent-workspace member), so every step failed with 'believes it's in a workspace when it's not' — the workflow had been red independently of any PR. Dockerfile.unified already works around this by appending the table at build time; replicate that in the workflow rather than committing it to Cargo.toml (which would double-append in the Dockerfile and break the image build). Also harden the ACP smoke test: build the release binary in its own step and bump the response timeout 5s -> 30s so a loaded runner doesn't flake the agent's first-response window (the binary itself responds in <1s locally, verified incl. a clean-HOME release build). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…osis The smoke test produces empty stdout only in CI (the binary responds with agentInfo in <1s locally across debug/release/clean-HOME/CI-env-var runs). Capture stderr + exit code so the next CI run reveals why stdout is empty.
|
LGTM ✅ — Well-designed cross-process locking for auth.json with thorough ADR documentation and correct concurrency semantics. What This PR DoesIntroduces a multi-vendor OAuth Architecture Decision Record and implements cross-process How It WorksTwo-lock architecture: (a) a global Findings
Finding Details🟢 F1: Two-lock designThe global lock ( 🟢 F2: Double-check after tenant lock acquisition
🟢 F3: MCP cross-process single-flight without rmcp forkrmcp's 🟢 F4: Pending-entry GC
🟢 F5: ADR qualityThe ADR documents: (1) the two-axis auth/inference model, (2) a 14-variant vendor feasibility matrix with concrete OAuth values, (3) the concurrency invariant with pseudocode, (4) prior-art survey (OpenClaw, Hermes, Pi, rclone), (5) rejected alternatives with clear rationale, (6) explicit decisions with owner attribution. 🟢 F6: Test coverageTwo focused tests verify the core invariants: Baseline Check
What's Good (🟢)
|
What problem does this solve?
openab-agent'sauth.jsonis a shared, multi-writer credential file with an unlocked read-modify-write, and openab-agent runs one process per Discord thread. So ordinary concurrent usage = concurrent processes refreshing the same OAuth token → refresh-token-rotation reuse → worst case OAuth 2.1 §10.4 token-family revocation = fleet-wide logout. API-key users never hit this; OAuth adoption (PR #1187, Anthropic) is what activates it.This PR lands the ADR that sets the multi-vendor OAuth pattern (so every future provider stops hand-rolling its own flow) and the concurrency fix for the release-blocker-class storage race — for both the codex tenant and the MCP
CredentialStore.Closes #
Discord Discussion URL: https://discord.com/channels/1490643539682529300/1519372543377539173
At a Glance
Prior Art & Industry Research
OpenClaw: API keys + subscription OAuth; stores per-profile
{access,refresh,expires,accountId}and treats the profile file as a token sink refreshed under a file lock — direct corroboration for the locked-RMW decision.Hermes Agent (NousResearch):
PROVIDER_REGISTRYdataclasses declare each provider's auth type + URLs + env vars behind oneresolve_runtime_provider()entry point;auth.jsonguarded withfcntl/msvcrtfile locks — corroborates both the registry shape (OAuthVendor) and the file-lock decision.Other references:
earendil-works/pi) — ported flow for feat(native-agent): Anthropic OAuth (Claude Pro/Max) login for openab-agent #1187;CLAUDE_CODE_OAUTH_TOKEN(#3591); Kiro/Cursor/xAI provider extensions.rcloneEncryptedClientSecret+ runtimeobscure.MustReveal(): precedent for the ADR §9 Q2 encode-at-rest of a non-confidential bundled OAuth secret (scanner-evasion, explicitly not encryption).Proposed Solution
ADR (
docs/adr/openab-agent-oauth.md): a two-axis model — auth (OAuthVendordescriptor + a shared driver on the officialoauth2crate) kept orthogonal to inference (one provider per wire format) — plus a 14-variant vendor feasibility matrix, theCLAUDE_CODE_OAUTH_TOKENenv route, the/authtwo-step UX (per-user-keyed PKCE pending state), and the storage invariant below.Implementation (this PR):
with_auth_locked— globalflock(2)on anauth.json.global.locksidecar across re-read → mutate → atomic write. All writers funnel through it (codexsave_tokens+ MCPMcpCredentialStore::save/clear), so they merge onto the latest on-disk state instead of lost-updating.lock_tenant_refresh— per-tenantflock(non-blocking try + async backoff + 10s timeout) held across the network refresh, with a double-checked re-read, so concurrent processes do exactly one real refresh per tenant. Codex uses it inget_valid_token/force_refresh; MCP uses it inresolve_oauth_dialaroundget_access_token().flock(2)on sidecar files (kernel auto-releases on fd close / process death — no stale lock);libc::flockundercfg(unix), non-unix no-op.created_at+ a 15-min sweep insidewith_auth_locked.Why this approach?
flock(2)over a sentinel lockfile — the kernel releases it on process death, so a crashed holder never wedges the file (no orphan cleanup / manual timeout).CredentialStoreexposes no pre-refresh hook, but openab drives the MCP refresh explicitly inresolve_oauth_dial; rmcp'sget_access_tokenre-load()sauth.jsonand skips the network refresh when the token is already fresh, so the per-tenant lock + rmcp's own fresh-load gives cross-process single-flight with no extra double-check code.rustixis not in-tree (the ADR text was optimistic), so this useslibc::flockdirectly.Alternatives Considered
Mutex/ tokio single-flight — useless across the per-thread processes.flock(2)auto-releases.RT_oldbefore it commits, so RTR still fires.CredentialStore— lossy; the shared layer sits below both stores (file RMW).Validation
Rust changes:
cargo checkpassescargo testpasses (192 passed, 0 failed, 11 ignored) — incl. newwith_auth_locked_merges_concurrent_tenants_no_lost_updateandwith_auth_locked_gcs_stale_pending_but_keeps_fresh_and_tokenscargo clippy -- -D warningscleancargo fmt --checkcleanAll PRs:
openab-agent/(the crate is built standalone, matchingci-openab-agent.yml). Reviewed end-to-end with a second agent across two ADR-review rounds and two code-review rounds; all findings (RTR refresh race, per-user PKCE keying, pending GC, portableWouldBlock, MCP cross-process refresh) addressed and re-verified.