fix(server): clamp CoreRequest.MaxTokens to per-route max_output_tokens #57

Open
thezzisu wants to merge 1 commit into ZhiYi-R:main from thezzisu:fix/clamp-coreReq-max-tokens-to-route

Problem

When an inbound OpenAI Responses request omits max_output_tokens (or sets a value above the upstream model's actual ceiling), moonbridge falls back to defaults.max_tokens from the global config and forwards that value verbatim to the upstream protocol adapters. Each protocol adapter (anthropic, chat, openai, google) currently has its own defaultMaxTokens helper that only consults the inbound request and the global default — none of them clamp to the per-route models.<slug>.max_output_tokens declared in config.
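To make the failure mode concrete, here is a minimal sketch of the fallback described above. The function name matches the helper mentioned in the description, but its signature and body are assumptions for illustration, not the actual adapter code.

```go
package main

import "fmt"

// defaultMaxTokens is a hypothetical stand-in for the per-adapter helpers:
// it consults only the inbound value and the global default, and knows
// nothing about per-route caps.
func defaultMaxTokens(requested, globalDefault int) int {
	if requested > 0 {
		return requested
	}
	return globalDefault
}

func main() {
	// Global default sized for the largest upstream cap in the deployment.
	const globalDefault = 320000

	// The client omitted max_output_tokens, so the global default is
	// forwarded unchanged, even to an upstream whose ceiling is 65536.
	fmt.Println(defaultMaxTokens(0, globalDefault)) // 320000

	// An explicit inbound value is passed through as-is.
	fmt.Println(defaultMaxTokens(4096, globalDefault)) // 4096
}
```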

In a deployment that routes Claude (max 64K out), Qwen DashScope (32K / 65K), Gemini (65K) and DeepSeek V4 Pro (320K) through a single moonbridge instance, the global default has to be sized for the largest cap (DeepSeek). As a result, every smaller-cap upstream returns 400 Invalid request or 400 Range of max_tokens should be [1, 65536] whenever the client doesn't supply an explicit cap.

Real production trace excerpt (anonymised):

level=ERROR msg=提供商错误 model=claude-sonnet-4-6 status=400 ... req_max_tokens=320000
level=ERROR msg=提供商错误 model=qwen3.6-plus    status=400 ... req_max_tokens=320000
    error="<400> Range of max_tokens should be [1, 65536]"

Fix

The fix is a single-point clamp in internal/service/server/adapter_dispatch.go::handleWithAdapters, applied right after client.ToCoreRequest and the upstream model alias resolution, and before providerAdapter.FromCoreRequest. The clamp uses a new helper, (*Server).routeMaxOutputTokens(modelAlias, preferred), that prefers config.Routes[alias].MaxOutputTokens and falls back to config.ProviderDefs[providerKey].Models[upstreamModel].MaxOutputTokens. The helper returns 0 (no clamp) when neither is set, preserving prior behavior for configs that don't declare per-model caps.
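The lookup order can be sketched as follows. This is a simplified illustration, not the code in this PR: the types `route`, `modelMeta`, `providerDef`, and `config`, the function `routeCap`, the alias and provider names, and the cap values are all hypothetical and only mirror the field names mentioned above.

```go
package main

import "fmt"

// Hypothetical, simplified config shapes; the real moonbridge types differ.
type route struct {
	ProviderKey     string
	UpstreamModel   string
	MaxOutputTokens int
}

type modelMeta struct{ MaxOutputTokens int }

type providerDef struct{ Models map[string]modelMeta }

type config struct {
	Routes       map[string]route
	ProviderDefs map[string]providerDef
}

// routeCap returns the output-token cap for alias: the route-level value
// first, then the provider catalog entry, then 0 meaning "no cap declared".
func routeCap(cfg config, alias string) int {
	r, ok := cfg.Routes[alias]
	if !ok {
		return 0
	}
	if r.MaxOutputTokens > 0 {
		return r.MaxOutputTokens
	}
	// Indexing a missing key (or a nil map) yields the zero value, so this
	// evaluates to 0 when the provider or model is absent.
	return cfg.ProviderDefs[r.ProviderKey].Models[r.UpstreamModel].MaxOutputTokens
}

func main() {
	cfg := config{
		Routes: map[string]route{
			// Route declares its own cap (illustrative value).
			"claude": {ProviderKey: "anthropic", UpstreamModel: "claude-sonnet-4-6", MaxOutputTokens: 64000},
			// Route has no cap; the provider catalog supplies one.
			"qwen": {ProviderKey: "dashscope", UpstreamModel: "qwen3.6-plus"},
			// Neither level declares a cap.
			"other": {ProviderKey: "unknown", UpstreamModel: "some-model"},
		},
		ProviderDefs: map[string]providerDef{
			"anthropic": {Models: map[string]modelMeta{"claude-sonnet-4-6": {MaxOutputTokens: 32000}}},
			"dashscope": {Models: map[string]modelMeta{"qwen3.6-plus": {MaxOutputTokens: 65536}}},
		},
	}

	fmt.Println(routeCap(cfg, "claude")) // 64000 (route entry wins)
	fmt.Println(routeCap(cfg, "qwen"))   // 65536 (provider catalog fallback)
	fmt.Println(routeCap(cfg, "other"))  // 0 (no clamp)
}
```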

Because this lives at the protocol-agnostic format.CoreRequest boundary, all four protocol adapters (anthropic, chat, openai, google) inherit the clamp without changes.

Tests

Adds three tests in internal/service/server/adapter_dispatch_test.go:

  • TestRouteMaxOutputTokensPrefersRouteEntry — route-level cap wins over provider-meta.
  • TestRouteMaxOutputTokensFallsBackToProviderModelMeta — provider catalog metadata is consulted when the route doesn't declare its own.
  • TestRouteMaxOutputTokensReturnsZeroWhenUnset — both unset → 0 (no clamp).

Full go test ./... passes locally on golang:1.26-bookworm.

Backwards compatibility

Purely additive behavior: installations that don't declare max_output_tokens per model see no change. Installations that do declare it now get an automatic clamp instead of relying on the upstream to reject oversized values with an HTTP 400.
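The compatibility claim follows from treating a cap of 0 as "no clamp". A minimal sketch of that rule, assuming a hypothetical helper named `clampMaxTokens` (the name and signature are illustrative and do not come from the PR):

```go
package main

import "fmt"

// clampMaxTokens is a hypothetical helper: it lowers requested to limit
// only when a positive limit is declared and requested exceeds it.
func clampMaxTokens(requested, limit int) int {
	if limit <= 0 || requested <= limit {
		return requested
	}
	return limit
}

func main() {
	// Cap declared: the oversized global default is reduced to the cap.
	fmt.Println(clampMaxTokens(320000, 65536)) // 65536

	// No cap declared (0): the value passes through, as before the patch.
	fmt.Println(clampMaxTokens(320000, 0)) // 320000

	// Value already within the cap: unchanged.
	fmt.Println(clampMaxTokens(4096, 65536)) // 4096
}
```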

Operational note

This patch ran in production on the PKU CCLab moonbridge instance for ~10 minutes before this PR was opened. During that time the recurring 400 errors stopped and the Claude / Qwen / DeepSeek / Gemini routes operated normally.

When the inbound OpenAI Responses request omits max_output_tokens (or sets
it above the upstream limit), moonbridge previously injected the global
defaults.max_tokens unchanged. Anthropic / Qwen / Gemini upstreams reject
oversized values with 400.

Resolve a per-alias cap from config.Routes[<alias>].MaxOutputTokens, with
fallback to provider catalog ModelMeta.MaxOutputTokens, and clamp
coreReq.MaxTokens before the protocol adapter serializes the upstream
request. Adds three unit tests covering route, fallback, and unset paths.