Skip to content

fix(codex): force real token refresh on 401 and retry transient refresh failures#587

Merged
awsl233777 merged 5 commits into
mainfrom
fix/codex-refresh-token
Jun 18, 2026
Merged

fix(codex): force real token refresh on 401 and retry transient refresh failures#587
awsl233777 merged 5 commits into
mainfrom
fix/codex-refresh-token

Conversation

@Bowl42

@Bowl42 Bowl42 commented May 31, 2026

Copy link
Copy Markdown
Collaborator

问题

Codex adapter 的 getAccessToken 没有 forceRefresh 路径。收到 401 时它只清空内存 cache,然后重新读取持久化的 access_token——也就是上游刚刚拒绝的那个凭据——于是重试又撞上同一个 401。刷新失败时还会回退到这个失效 token,并且没有任何重试,也不区分永久性错误(refresh_token_reused / invalid_grant)。

参考 CLIProxyAPI(internal/auth/codex/openai_auth.goRefreshTokensWithRetry / isNonRetryableRefreshErr)以及同仓库已正确实现的 Claude adapter(getAccessToken(ctx, forceRefresh bool))对齐修复。

改动

internal/adapter/provider/codex/oauth.go

  • 新增 RefreshAccessTokenWithRetry(线性退避,支持 ctx 取消)
  • 新增 isNonRetryableRefreshErr(对 refresh_token_reused / invalid_grant / invalid_request / invalid_client 快速失败)

internal/adapter/provider/codex/adapter.go

  • getAccessToken(ctx)getAccessToken(ctx, forceRefresh bool):forceRefresh 时跳过 cache 和持久化 token,且刷新失败不回退被拒 token
  • 新增 refreshMu 串行化刷新(避免并发 refresh_token 复用),取锁后二次复查 cache
  • 401 重试路径改用 getAccessToken(ctx, true)
  • access token TTL 加 >= 1s 下限保护

internal/service/codex_task.go / internal/handler/codex.go

  • 配额路径统一改用 RefreshAccessTokenWithRetry,与请求路径保持一致

测试

  • go build ./internal/...
  • go vet ./internal/adapter/provider/codex/... ./internal/service/... ./internal/handler/...
  • go test ./internal/adapter/provider/codex/... ./internal/service/... ./internal/handler/... ✅(全部 ok)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • 新增刷新互斥锁以序列化同一账户/refresh_token 的刷新;新增 provider 重载注入点与 SetProviderReloadFunc。
    • 新增带重试与线性退避的 RefreshAccessTokenWithRetry,并在刷新成功后更改过期计算与最佳努力持久化行为。
    • 将 provider 更新回调类型调整以支持路由侧注入。
  • Bug Fixes

    • 引入“强制刷新”与锁内双重检查,避免复用被拒绝或短期过期令牌;强制刷新失败不再回退持久化令牌;统一内存缓存与 60s 缓冲策略,改进重试/并发稳定性。
  • Tests

    • 增加刷新锁、适配器注入与重载相关单元测试。

@coderabbitai

coderabbitai Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

为 Codex 引入带重试的刷新函数、进程级刷新互斥锁;重构适配器令牌复用与强制刷新流程(锁内重读 provider、二次检查、重试刷新并持久化);调用方在 quota/定时任务路径按锁串行化刷新并处理失败。

Changes

Codex 令牌重试与并发控制

Layer / File(s) Summary
令牌重试函数与不可重试错误检测
internal/adapter/provider/codex/oauth.go
新增 RefreshAccessTokenWithRetry(ctx, refreshToken, maxRetries)isNonRetryableRefreshErr,实现线性退避重试、context 取消处理及对不可重试错误的快速失败。
全进程刷新互斥锁与测试
internal/adapter/provider/codex/refresh_lock.go, internal/adapter/provider/codex/refresh_lock_test.go
新增进程内锁注册表、AcquireRefreshLock(key) func()RefreshLockKey(accountID,refreshToken);添加锁行为单元测试(键格式、互斥、清理、幂等、不同键不阻塞)。
适配器强制刷新与双重检查逻辑
internal/adapter/provider/codex/adapter.go
重构 getAccessToken(ctx, forceRefresh, rejectedToken)Execute/WarmToken 改为显式非强制调用;收到 401 时使用强制刷新且绕过被拒绝凭据复用;刷新在锁内通过 providerReload 重读 prov 并调用 RefreshAccessTokenWithRetry(...,3),成功后更新内存并对锁内快照做最佳努力持久化。
处理器与定时任务的刷新调用替换
internal/handler/codex.go, internal/service/codex_task.go
在 quota 获取与定时刷新路径中按 AccountID/RefreshToken 获取刷新锁并在锁内二次检查 token;需刷新时调用 RefreshAccessTokenWithRetry(...,3),成功则更新并持久化 provider,失败时解锁并跳过该 provider。
路由注入与适配器接线测试
internal/router/router.go, internal/adapter/provider/codex/reload_wiring_test.go
路由注入新增对 SetProviderReloadFunc 能力的检测并注入回调;新增适配器 wiring 测试验证注入回调被设置并按预期返回 provider 快照。

Sequence Diagram

sequenceDiagram
  participant Client
  participant Execute
  participant getAccessToken
  participant AcquireLock
  participant RefreshWithRetry
  participant CodexAPI
  participant Store

  Client->>Execute: 发起请求
  Execute->>getAccessToken: getAccessToken(ctx, false, "")
  getAccessToken-->>Execute: 若缓存/持久化令牌在缓冲期内则返回令牌
  alt 需要刷新
    getAccessToken->>AcquireLock: AcquireRefreshLock(key)
    AcquireLock-->>getAccessToken: 锁已获得
    getAccessToken->>RefreshWithRetry: 调用 RefreshAccessTokenWithRetry(refreshToken,3)
    RefreshWithRetry->>CodexAPI: 发起刷新请求(可能多次重试)
    CodexAPI-->>RefreshWithRetry: 返回新令牌或错误
    RefreshWithRetry-->>Store: 更新缓存与持久化(若成功)
    getAccessToken->>AcquireLock: Unlock()
  end
  Execute->>CodexAPI: 使用令牌调用 API
  alt API 返回 401
    CodexAPI-->>Execute: 401 Unauthorized
    Execute->>getAccessToken: getAccessToken(ctx, true, 被拒绝token)
    getAccessToken->>AcquireLock: AcquireRefreshLock(key)
    getAccessToken->>RefreshWithRetry: 强制刷新(绕过被拒凭据复用)
    RefreshWithRetry->>CodexAPI: 刷新请求
    CodexAPI-->>RefreshWithRetry: 返回新令牌
    RefreshWithRetry-->>Store: 更新
    getAccessToken->>AcquireLock: Unlock()
    Execute->>CodexAPI: 使用新令牌重试请求
  end
Loading

估计代码审查工作量

🎯 4 (Complex) | ⏱️ ~45 分钟

可能相关的 PRs

建议审查者

  • awsl233777
  • ymkiux
  • whhjdi

🐰 小兔报喜:
锁护令牌免重争,三次退避稳刷新,
401 呼唤强制换,快照锁内再复审。

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed 标题准确总结了本次变更的主要目标:在收到 401 响应时强制执行真实的 token 刷新,并针对可重试的刷新失败进行重试机制
Docstring Coverage ✅ Passed Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/codex-refresh-token

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/adapter/provider/codex/adapter.go (2)

322-342: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

锁内仍然沿用了锁前读取的 refresh_token 快照。

这里拿到全局刷新锁后只重查了 a.tokenCache,但 Line 342 真正用于刷新的是锁前的 config.RefreshToken。如果等待锁期间是 quota handler / task 完成了刷新并轮换了 refresh_token,这个旧 adapter 实例并不会像 internal/handler/codex.go 那样在锁内重读 provider,于是仍可能拿已经失效的 refresh_token 去调刷新,refresh_token_reused 会在跨路径并发下继续出现。建议在锁内重新获取最新 provider 配置,或引入能把最新 refresh token 同步回当前 adapter 的共享状态。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/adapter/provider/codex/adapter.go` around lines 322 - 342, After
acquiring the refresh lock via AcquireRefreshLock(RefreshLockKey(...)) you
currently still call RefreshAccessTokenWithRetry using the pre-lock snapshot
config.RefreshToken; instead, inside the locked section re-read the
authoritative refresh token from the provider/shared state (e.g., reload
a.provider or a.provider.Config or call the same lookup used by
internal/handler/codex.go) or sync the latest token into a.tokenCache before
calling RefreshAccessTokenWithRetry, and then pass that fresh token to
RefreshAccessTokenWithRetry (and update a.tokenCache/a.provider with the new
token on success) so you never attempt refresh with a stale config.RefreshToken.

365-404: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

不要把适配器内存态的 token 更新绑在 providerUpdate 上。

providerUpdate 在这个类型里是可选的,但现在 config.AccessTokenconfig.ExpiresAtconfig.RefreshToken 只会在它非空时更新。这样一旦本次刷新返回了新的 refresh_token,没有回调的 adapter 下一次过期后仍会继续使用旧 token 刷新,等于把轮换语义丢掉了。内存态应始终先更新,再按需持久化。

建议修改
-	// Persist token to database if update function is set
-	if a.providerUpdate != nil {
-		config.AccessToken = tokenResp.AccessToken
-		config.ExpiresAt = expiresAt.Format(time.RFC3339)
-		if tokenResp.RefreshToken != "" {
-			config.RefreshToken = tokenResp.RefreshToken
-		}
+	config.AccessToken = tokenResp.AccessToken
+	config.ExpiresAt = expiresAt.Format(time.RFC3339)
+	if tokenResp.RefreshToken != "" {
+		config.RefreshToken = tokenResp.RefreshToken
+	}
+
+	// Persist token to database if update function is set
+	if a.providerUpdate != nil {
 		if tokenResp.IDToken != "" {
 			if claims, parseErr := ParseIDToken(tokenResp.IDToken); parseErr == nil && claims != nil {
 				if v := strings.TrimSpace(claims.GetAccountID()); v != "" {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/adapter/provider/codex/adapter.go` around lines 365 - 404, The code
currently only updates in-memory token fields when a.providerUpdate is non-nil,
which loses refresh_token rotation for adapters without a callback; change the
logic in the refresh handling (the block that references config, tokenResp,
expiresAt, ParseIDToken and a.providerUpdate) so that you always update the
in-memory config.AccessToken, config.ExpiresAt, and—if
present—config.RefreshToken and all ID token-derived fields (claims.Email, Name,
Picture, AccountID, UserID, PlanType, SubscriptionStart, SubscriptionEnd) before
checking providerUpdate; then, if a.providerUpdate != nil, call
a.providerUpdate(a.provider) to persist and log any error (same best-effort
behavior), leaving ParseIDToken usage and error handling intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/handler/codex.go`:
- Around line 628-632: The call to h.svc.GetByID does not exist on the service
interface and causes typecheck failure; replace it with the existing service
method used elsewhere in this file (e.g. h.svc.GetProvider or
h.svc.GetProviders) so the code compiles. Locate the block using
codex.AcquireRefreshLock and change the call to use the same service method
signature/return values used by other paths in this file (match variable names
and nil checks to the existing GetProvider/GetProviders return types),
preserving the logic that refreshes provider and config when a fresh record is
returned.
- Around line 646-653: The UpdateProvider call's error is ignored which can drop
a rotated refresh_token; change the block after setting config
(AccessToken/ExpiresAt/RefreshToken) to check the error returned by
h.svc.UpdateProvider(provider) and treat it as a refresh failure: if
UpdateProvider returns an error, do not proceed as if refresh succeeded —
log/return the error (propagate it from the current refresh function), avoid
releasing the lock as if success until cleanup, and ensure any adapter cache
refresh that depends on persistence is not considered done; reference the
h.svc.UpdateProvider call, the local config/refresh_token assignment, and the
unlock() call to implement this control flow change.

---

Outside diff comments:
In `@internal/adapter/provider/codex/adapter.go`:
- Around line 322-342: After acquiring the refresh lock via
AcquireRefreshLock(RefreshLockKey(...)) you currently still call
RefreshAccessTokenWithRetry using the pre-lock snapshot config.RefreshToken;
instead, inside the locked section re-read the authoritative refresh token from
the provider/shared state (e.g., reload a.provider or a.provider.Config or call
the same lookup used by internal/handler/codex.go) or sync the latest token into
a.tokenCache before calling RefreshAccessTokenWithRetry, and then pass that
fresh token to RefreshAccessTokenWithRetry (and update a.tokenCache/a.provider
with the new token on success) so you never attempt refresh with a stale
config.RefreshToken.
- Around line 365-404: The code currently only updates in-memory token fields
when a.providerUpdate is non-nil, which loses refresh_token rotation for
adapters without a callback; change the logic in the refresh handling (the block
that references config, tokenResp, expiresAt, ParseIDToken and a.providerUpdate)
so that you always update the in-memory config.AccessToken, config.ExpiresAt,
and—if present—config.RefreshToken and all ID token-derived fields
(claims.Email, Name, Picture, AccountID, UserID, PlanType, SubscriptionStart,
SubscriptionEnd) before checking providerUpdate; then, if a.providerUpdate !=
nil, call a.providerUpdate(a.provider) to persist and log any error (same
best-effort behavior), leaving ParseIDToken usage and error handling intact.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 84b83891-9c9c-4f86-a9fb-7d7646b1815b

📥 Commits

Reviewing files that changed from the base of the PR and between e934c89 and 641fbc9.

📒 Files selected for processing (3)
  • internal/adapter/provider/codex/adapter.go
  • internal/adapter/provider/codex/refresh_lock.go
  • internal/handler/codex.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Backend Checks
  • GitHub Check: Frontend Checks
  • GitHub Check: e2e
  • GitHub Check: playwright
🧰 Additional context used
🪛 golangci-lint (2.12.2)
internal/handler/codex.go

[error] 629-629: : # github.com/awsl-project/maxx/internal/handler [github.com/awsl-project/maxx/internal/handler.test]
internal/handler/codex.go:629:28: h.svc.GetByID undefined (type *service.AdminService has no field or method GetByID)

(typecheck)

🔇 Additional comments (1)
internal/adapter/provider/codex/refresh_lock.go (1)

9-42: LGTM!

Comment thread internal/handler/codex.go Outdated
Comment thread internal/handler/codex.go Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/router/router.go`:
- Around line 582-588: The function injectProviderUpdate currently references an
undefined variable p; change its signature to injectProviderUpdate(a
provider.ProviderAdapter, p *domain.Provider) and update its internal use of p
(used to capture tenantID and id and to create the reload closure via
repo.GetByID) so the provider pointer is supplied rather than relying on a
nonexistent outer variable; then update the two call sites that invoke
injectProviderUpdate—inside InitAdapters (the loop at the previous line ~113)
and inside RefreshAdapter (the call at the previous line ~129)—to pass the
corresponding *domain.Provider instance into injectProviderUpdate, ensuring
types match provider.ProviderAdapter and *domain.Provider and keeping the
providerReloader/SetProviderReloadFunc logic unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3db96733-dbfc-4c85-8645-af22cfad0b25

📥 Commits

Reviewing files that changed from the base of the PR and between 641fbc9 and fecba1c.

📒 Files selected for processing (7)
  • internal/adapter/provider/codex/adapter.go
  • internal/adapter/provider/codex/refresh_lock.go
  • internal/adapter/provider/codex/refresh_lock_test.go
  • internal/adapter/provider/codex/reload_wiring_test.go
  • internal/handler/codex.go
  • internal/router/router.go
  • internal/service/codex_task.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/handler/codex.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: e2e
  • GitHub Check: playwright
🧰 Additional context used
🪛 GitHub Actions: PR Checks / 0_Backend Checks.txt
internal/router/router.go

[error] 582-582: go vet failed: internal/router/router.go:582:42: undefined: p

🪛 GitHub Actions: PR Checks / Backend Checks
internal/router/router.go

[error] 582-582: go vet reported undefined identifier 'p' at internal/router/router.go:582:42. Step: go vet ./cmd/... ./internal/...

🪛 GitHub Check: Backend Checks
internal/router/router.go

[failure] 584-584:
undefined: p


[failure] 582-582:
undefined: p

🪛 golangci-lint (2.12.2)
internal/router/router.go

[error] 582-582: : # github.com/awsl-project/maxx/internal/router
internal/router/router.go:582:42: undefined: p
internal/router/router.go:584:19: undefined: p

(typecheck)

🔇 Additional comments (11)
internal/adapter/provider/codex/refresh_lock_test.go (1)

9-73: LGTM!

internal/adapter/provider/codex/reload_wiring_test.go (1)

9-48: LGTM!

internal/adapter/provider/codex/adapter.go (5)

57-64: LGTM!

Also applies to: 81-85


133-133: LGTM!

Also applies to: 215-215, 267-267


279-315: LGTM!


341-373: LGTM!


375-454: LGTM!

internal/adapter/provider/codex/refresh_lock.go (3)

8-19: LGTM!


35-59: LGTM!


68-76: LGTM!

internal/service/codex_task.go (1)

150-185: LGTM!

Comment thread internal/router/router.go
@Bowl42

Bowl42 commented May 31, 2026

Copy link
Copy Markdown
Collaborator Author

多方多轮 Review 总结

对本 PR 进行了多方、多轮 review(本地 Claude + 本地 Codex + CodeRabbit),逐轮修复。最终 go build ./internal/... / go vet / go test / go test -race(codex 包)全部通过。

各方结论

Reviewer 状态
CodeRabbit 首轮(commit a2087138)"No actionable comments 🎉"。其后额度耗尽(usage credits run out),本会话无法再触发复审。
本地 Claude (Opus) 最终轮 NO ACTIONABLE FINDINGS(独立验证:无死锁、无锁泄漏、无指针竞争、wiring 正确)。
本地 Codex (gpt-5.5) 多轮中发现并修复多个真实 bug;最终仅余 1 项 pre-existing 竞争(见下「已知后续项」)。

修复要点(在原始 401 刷新修复之上,经 review 迭代)

  1. per-account 刷新锁refresh_lock.go):进程级、按账户(ChatGPT account ID,回退 refresh_token)加锁,带 refcount 空闲驱逐、sync.Once 幂等 unlock。OpenAI 每次刷新轮换 refresh_token 并拒绝复用,故所有可能并发刷新同一账户的路径必须串行。
  2. 覆盖全部 5 条 refresh_token 轮换路径:请求适配器 401 重试、后台配额任务、批量配额 handler、单 provider 用量、provider-info 刷新——全部「加锁 → 锁内重读最新 provider → 复用或刷新一次 → 持久化」。
  3. 修复死的 token 持久化 wiringProviderUpdateFunc/ProviderReloadFunc 改为 type alias,使 router 的 duck-typed 断言真正匹配(此前 defined type 静默失配,刷新后的 token 从未落库——main 上即存在的潜在 bug)。新增 reload_wiring_test.go 回归覆盖。
  4. 适配器锁内重读 provider:等锁期间别的路径可能已轮换并持久化新 token,重读后采用,避免用过期 refresh_token 再次刷新。
  5. 消除 a.provider 指针竞争:getAccessToken 改用局部快照,不再重写共享字段。
  6. isNonRetryableRefreshErr 收窄到 refresh_token_reused/invalid_grant;TTL 钳制;移除重复 60s buffer。

已验证安全(曾被怀疑、经核实不成立)

  • 无死锁:锁内调用的 repo.GetByID/ValidateRefreshToken/RefreshAccessTokenWithRetry 均不重入该锁;refreshLocksMu/entry.mutokenMu/刷新锁均无持有顺序倒置。
  • 无锁泄漏:每处加锁点每条分支恰好 unlock 一次(sync.Once 兜底)。

已知后续项(建议单独 PR)

本地 Codex 指出 adapter.goconfig := ensureCodexConfig(prov) 仍指向共享provider.Config.Codex:锁外对 token 字段的读取与另一 goroutine 锁内的写入存在 field 级 data race。这是先于本 PR 即存在的竞争类型(Execute/applyCodexHeaders 等多处都无锁读取该共享结构),彻底修复需「锁内 deep-copy provider + 同步 Execute 的字段读取」的更大重构,超出本 PR「修复 codex refresh token」范围,记为后续项。本 PR 已将指针竞争与刷新串行化处理到位,相对 main 是明确改进。

🤖 Generated with Claude Code

The Codex adapter never truly refreshed on a 401: it cleared the in-memory
cache then re-read the persisted access_token — the same credential the upstream
had just rejected — so the retry hit the same 401, and refresh failures fell back
to that dead token. There was no retry, no handling of permanent errors, and
concurrent refreshes could reuse a rotated refresh_token (refresh_token_reused).

Aligns with CLIProxyAPI and the sibling Claude adapter:

- oauth.go: add RefreshAccessTokenWithRetry (linear backoff, ctx-aware) and
  isNonRetryableRefreshErr (fast-fail on refresh_token_reused / invalid_grant).
- adapter.go: getAccessToken(ctx, forceRefresh, rejectedToken) forces a real
  refresh on 401, never falling back to the rejected token; works against a local
  provider snapshot (no a.provider mutation race); re-reads the freshest provider
  under the lock to adopt a token another path just rotated.
- refresh_lock.go: process-wide per-account refresh lock (keyed by ChatGPT
  account ID, falling back to refresh_token) with refcount eviction and an
  idempotent unlock. OpenAI rotates the refresh_token on every refresh and
  rejects reuse, so all paths that may refresh the same account serialize.
- All five refresh_token rotation paths now take the lock and re-read under it:
  request adapter, quota task, batch-quota handler, per-provider usage, and
  provider-info refresh.
- router.go: inject a provider-reload callback (repo.GetByID) alongside the
  existing update hook. ProviderUpdateFunc/ProviderReloadFunc are type aliases so
  the router's duck-typed assertions match — previously the defined ProviderUpdateFunc
  silently failed the assertion, leaving refreshed tokens unpersisted.
- handler/codex.go, service/codex_task.go: quota paths use the retry helper and
  persist rotated refresh tokens.
- tests: refresh_lock_test.go (key derivation, mutual exclusion, eviction,
  idempotent unlock) and reload_wiring_test.go (router wiring type assertions).

Reviewed across multiple rounds (local Claude + local Codex + CodeRabbit).
go build ./internal/... / vet / test / -race all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Bowl42 Bowl42 force-pushed the fix/codex-refresh-token branch from 3061bf3 to 210e2be Compare May 31, 2026 16:36
The cached ProviderRepository hands the same *domain.Provider pointer to every
caller, and the request hot path reads token fields off provider.Config.Codex
without a lock (Execute reads AccountID; getAccessToken's fast path reads
AccessToken/ExpiresAt/RefreshToken). The refresh paths mutated that shared
struct in place under the refresh lock — which serializes writers but not the
lock-free readers — a data race flagged in review.

Switch every token-persisting path to copy-on-write: clone the provider
(CloneForTokenPersist), mutate the clone, and persist it via repo.Update, which
atomically swaps the cache pointer. Readers holding the old pointer see a
consistent (if briefly stale) struct; no reader and writer ever touch the same
struct. Applied in the adapter (refresh + fallback), the quota task, and all
three handler paths (RefreshProviderInfo, GetProviderUsage, GetBatchQuotas);
each post-persist read is repointed at the clone.

Tests: persist_test.go covers clone isolation, nil-safety, and a concurrent
persist-vs-read scenario that is clean under -race. go build / vet / test /
-race all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@awsl233777

Copy link
Copy Markdown
Collaborator

Review note from checking the current head (164c71f):

The refresh-token serialization/copy-on-write direction looks good, and the current CI is green. I also verified locally:

go test ./internal/adapter/provider/codex ./internal/handler ./internal/service ./internal/router
go test -race ./internal/adapter/provider/codex

One blocker remains before I would merge this, though: several paths still treat persistence of a newly rotated refresh token as best-effort.

Current examples:

// internal/adapter/provider/codex/adapter.go
if err := a.providerUpdate(cp); err != nil {
    log.Printf("[Codex] failed to persist refreshed token: %v", err)
}

// internal/handler/codex.go
_ = h.svc.UpdateProvider(tenantID, cp) // Best effort update
_ = h.svc.UpdateProvider(tenantID, cp)

Because Codex/OpenAI refresh tokens rotate, once RefreshAccessTokenWithRetry / ValidateRefreshToken returns a new refresh_token, failing to persist it is not just a harmless cache miss: the in-memory access token may work briefly, but the provider record still contains the old refresh token, so the next refresh path can hit invalid_grant / token reuse after restart or after cache expiry.

Suggested fix: if the refresh response includes a new refresh token (or generally after a successful refresh that updates credentials), treat UpdateProvider failure as a refresh failure for that path, release the lock, and return/skip without pretending the provider state is healthy. The background task already does this; the request adapter and handler quota paths should match that stricter behavior.

@awsl233777

Copy link
Copy Markdown
Collaborator

Update from taking over the persistence blocker:

Fixed the refresh-token persistence failure path so a successful Codex token refresh is no longer treated as success when the rotated token cannot be persisted.

Changes in dc58113:

  • Adapter 401/refresh path now persists the refreshed token before updating the in-memory token cache; persistence failure returns an error instead of continuing with an in-memory-only rotated refresh token.
  • Single-provider usage refresh now returns failed to persist refreshed token if UpdateProvider fails.
  • Batch quota refresh now releases the refresh lock, logs the persistence failure, and skips that provider instead of using an unpersisted token.
  • Added regression tests for swallowed provider update failures and missing update wiring when a refresh token rotates.

Validated locally:

go test ./internal/adapter/provider/codex ./internal/handler ./internal/service ./internal/router
go test -race ./internal/adapter/provider/codex
go test ./internal/...
go vet ./internal/...

@coderabbitai re-review current head for regression risk around Codex refresh token persistence, lock release paths, and token cache ordering.

@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Read the current state of the key files at HEAD
cat internal/adapter/provider/codex/adapter.go
echo "=== END adapter.go ==="
cat internal/adapter/provider/codex/oauth.go
echo "=== END oauth.go ==="
cat internal/adapter/provider/codex/refresh_lock.go
echo "=== END refresh_lock.go ==="
cat internal/handler/codex.go
echo "=== END handler/codex.go ==="
cat internal/service/codex_task.go
echo "=== END codex_task.go ==="

Length of output: 108188



@awsl233777 — reviewed dc58113 against all three axes you called out. Here's the assessment:


✅ Persistence-before-cache ordering (adapter.go)

persistRefreshedToken is called and must return nil before tokenCache is written. On any persistence error the function returns the wrapped error, the cache is never touched, and getAccessToken propagates the error to the caller. The order is correct.

persistRefreshedToken itself correctly distinguishes the two cases:

  • providerUpdate is nil and the refresh token rotated → hard error (a rotated token that can't be saved is a trap)
  • providerUpdate is nil and no rotation → returns nil (the same secret, losing only the new access token TTL — acceptable)

✅ Lock release paths — all branches covered

Path Release mechanism All branches?
adapter.go getAccessToken defer unlock() ✅ automatic
handler GetProviderUsage manual unlock() ✅ reuse / ValidateRefreshToken error / invalid / UpdateProvider error / success
handler GetBatchQuotas manual unlock() ✅ reuse / RefreshAccessTokenWithRetry error / UpdateProvider error / success
service refreshAllQuotas manual unlock() ✅ reuse / RefreshAccessTokenWithRetry error / Update error / success
handler RefreshProviderInfo defer unlock() ✅ automatic

No leak paths found.


✅ Hard error on persistence failure — all 5 paths

Every refresh path now treats UpdateProvider/providerRepo.Update failure as non-recoverable for that refresh attempt: adapter returns error, usage handler returns error, batch-quota and task service release the lock and skip the provider. The background task was already the reference implementation; the other paths now match it.


Minor items to track

1. ValidateRefreshToken vs RefreshAccessTokenWithRetry inconsistency

GetProviderUsage and RefreshProviderInfo call codex.ValidateRefreshToken directly, which has no retry loop. The other three paths call RefreshAccessTokenWithRetry(ctx, ..., 3). If ValidateRefreshToken doesn't retry internally, these two user-triggered paths remain vulnerable to transient network failures that the background paths would survive. Probably low-impact (user can retry manually), but worth making consistent.

2. No ExpiresIn == 0 guard

In getAccessToken (and identically in the handler/task paths):

expiresAt := time.Now().Add(time.Duration(tokenResp.ExpiresIn) * time.Second)

If the OAuth server returns ExpiresIn: 0 (malformed response), expiresAt is set to time.Now(), which immediately fails the time.Now().Add(60*time.Second).Before(expiresAt) cache check on the next call. Every subsequent request would attempt a refresh — a hot loop against the token endpoint. A one-line clamp (if tokenResp.ExpiresIn < 60 { tokenResp.ExpiresIn = 60 }) or an early error on zero would prevent this.

3. Non-deferred manual unlocks in handler paths

The manual unlock() calls are correct today, and AcquireRefreshLock's sync.Once provides idempotency as a safety net. However, a future code change inside the lock body (e.g., adding an early return) could silently leak the lock. Since the unlock function is already idempotent, adding a defer unlock() immediately after acquisition (and keeping the early unlock() calls where needed for ordering) costs nothing and makes the pattern panic-safe:

unlock := codex.AcquireRefreshLock(...)
defer unlock() // safety net; explicit unlock() calls below still release early

if config.AccessToken != "" && !h.isTokenExpired(...) {
    accessToken = config.AccessToken
    unlock() // release early so quota fetch is outside the lock
    // ...

Items 1 and 2 are the only actionable ones for this PR; item 3 is a style suggestion for robustness. Overall, the persistence blocker is properly fixed.

@awsl233777

Copy link
Copy Markdown
Collaborator

Update after CodeRabbit's actionable follow-up:

Fixed both items in 65b021d:

  • ValidateRefreshToken now uses RefreshAccessTokenWithRetry(ctx, refreshToken, 3), so RefreshProviderInfo and GetProviderUsage get the same transient-refresh retry behavior as the adapter, batch quota handler, and background quota task.
  • Added codex.TokenExpiresAt(expiresIn) and switched Codex token expiry persistence paths to it. Malformed zero/negative expires_in values are clamped to 60 seconds instead of persisting an already-expired token and risking a refresh hot loop.
  • Added regression tests for malformed/valid expires_in handling.

Validated locally:

go test ./internal/adapter/provider/codex ./internal/handler ./internal/service ./internal/router
go test -race ./internal/adapter/provider/codex
go test ./internal/...
go vet ./internal/...

@coderabbitai re-review current head 65b021d, especially the retry consistency and expires_in clamp paths.

@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
cat internal/adapter/provider/codex/oauth.go
echo "=== END oauth.go ==="
cat internal/adapter/provider/codex/adapter.go
echo "=== END adapter.go ==="
cat internal/handler/codex.go
echo "=== END handler/codex.go ==="
cat internal/service/codex_task.go
echo "=== END codex_task.go ==="
# Look for TokenExpiresAt and test files
rg -n "TokenExpiresAt" --type=go
echo "=== TokenExpiresAt usages ==="
fd "expires_in" --type=f
echo "=== fd result ==="
fd "_test.go" internal/adapter/provider/codex/ --type=f
echo "=== test files ==="

Length of output: 109546


🏁 Script executed:

#!/bin/bash
# Check ValidateRefreshToken implementation
rg -n "ValidateRefreshToken" --type=go -A 30 | head -120
echo "=== END ValidateRefreshToken ==="
# Read test files
cat internal/adapter/provider/codex/oauth_test.go
echo "=== END oauth_test.go ==="
cat internal/adapter/provider/codex/persist_test.go
echo "=== END persist_test.go ==="

Length of output: 12181


🏁 Script executed:

#!/bin/bash
cat internal/adapter/provider/codex/service.go
echo "=== END service.go ==="
# Also check if there's a specific test for UpdateProvider failure regression
rg -n "UpdateProvider\|persist.*fail\|swallow" --type=go internal/adapter/provider/codex/
rg -n "persist_test\|provider.*update.*fail\|TestPersist" --type=go internal/adapter/provider/codex/
echo "=== persist regression tests ==="
# Check adapter_test.go for the persistence failure regression test
cat internal/adapter/provider/codex/adapter_test.go
echo "=== END adapter_test.go ==="

Length of output: 30158



@awsl233777 — reviewed 65b021d against both axes. Everything checks out.


✅ Retry consistency — fully aligned across all 5 paths

ValidateRefreshToken (internal/adapter/provider/codex/service.go) now opens with:

tokenResp, err := RefreshAccessTokenWithRetry(ctx, refreshToken, 3)

This means GetProviderUsage and RefreshProviderInfo — which call ValidateRefreshToken — inherit the same 3-attempt linear-backoff retry and non-retryable fast-fail behaviour as the adapter's 401 path, the batch quota handler, and the background task. All five paths are now consistent.

ValidateRefreshToken also calls TokenExpiresAt(tokenResp.ExpiresIn) internally before returning result.ExpiresAt, so callers that store result.ExpiresAt directly (like GetProviderUsage) automatically get the clamped value without a separate guard.


expires_in clamp — all call sites covered

Call site How
adapter.go getAccessToken TokenExpiresAt(tokenResp.ExpiresIn) direct
service.go ValidateRefreshToken TokenExpiresAt(tokenResp.ExpiresIn) direct (used by GetProviderUsage + RefreshProviderInfo)
handler/codex.go OAuth callback + exchange codex.TokenExpiresAt(tokenResp.ExpiresIn)
handler/codex.go GetBatchQuotas codex.TokenExpiresAt(tokenResp.ExpiresIn)
service/codex_task.go refreshAllQuotas codex.TokenExpiresAt(tokenResp.ExpiresIn)

No call site computes time.Now().Add(time.Duration(expiresIn) * time.Second) directly anymore.


✅ Regression tests

  • TestTokenExpiresAtClampsMalformedExpiresInexpiresIn=0 → result ≥ now + 60s. Tight upper bound (now+61s) prevents the clamp from silently over-extending.
  • TestTokenExpiresAtPreservesValidExpiresInexpiresIn=3600 passes through correctly.
  • TestPersistRefreshedTokenFailsOnUpdateErrorproviderUpdate returning an error causes persistRefreshedToken to return a wrapped error, and the original provider struct is not mutated.
  • TestPersistRefreshedTokenRequiresUpdaterForRotatedRefreshToken — missing providerUpdate callback + rotated refresh token → hard error (not a silent no-op).

No further actionable items. The PR is clean on the two axes requested. 🎉

🐇 ૮ ˶ᵔ ᵕ ᵔ˶ ა

@awsl233777 awsl233777 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved after all required checks passed and CodeRabbit reported no further actionable items on head 65b021d.

@awsl233777 awsl233777 merged commit b002f57 into main Jun 18, 2026
5 checks passed
@awsl233777 awsl233777 deleted the fix/codex-refresh-token branch June 18, 2026 07:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants