fix(server): clamp CoreRequest.MaxTokens to per-route max_output_tokens#57
Open
thezzisu wants to merge 1 commit into
Open
fix(server): clamp CoreRequest.MaxTokens to per-route max_output_tokens#57thezzisu wants to merge 1 commit into
thezzisu wants to merge 1 commit into
Conversation
When the inbound OpenAI Responses request omits max_output_tokens (or sets it above the upstream limit), moonbridge previously injected the global defaults.max_tokens unchanged. Anthropic / Qwen / Gemini upstreams reject oversized values with 400. Resolve a per-alias cap from config.Routes[<alias>].MaxOutputTokens, with fallback to provider catalog ModelMeta.MaxOutputTokens, and clamp coreReq.MaxTokens before the protocol adapter serializes the upstream request. Adds three unit tests covering route, fallback, and unset paths.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When an inbound OpenAI Responses request omits
max_output_tokens(or sets a value above the upstream model's actual ceiling), moonbridge falls back todefaults.max_tokensfrom the global config and forwards that value verbatim to the upstream protocol adapters. Each protocol adapter (anthropic, chat, openai, google) currently has its owndefaultMaxTokenshelper that only consults the inbound request and the global default — none of them clamp to the per-routemodels.<slug>.max_output_tokensdeclared in config.In a deployment that routes Claude (max 64K out), Qwen DashScope (32K / 65K), Gemini (65K) and DeepSeek V4 Pro (320K) through a single moonbridge instance, the global default has to be sized for the largest cap (DeepSeek), which then makes every smaller-cap upstream return
400 Invalid request/400 Range of max_tokens should be [1, 65536]whenever the client doesn't supply an explicit cap.Real production trace excerpt (anonymised):
Fix
Single-point clamp in
internal/service/server/adapter_dispatch.go::handleWithAdapters, applied right afterclient.ToCoreRequestand the upstream model alias resolution and beforeproviderAdapter.FromCoreRequest. The clamp uses a new helper(*Server).routeMaxOutputTokens(modelAlias, preferred)that prefersconfig.Routes[alias].MaxOutputTokensand falls back toconfig.ProviderDefs[providerKey].Models[upstreamModel].MaxOutputTokens. Returns 0 (no clamp) when neither is set, preserving prior behavior for configs that don't declare per-model caps.Because this lives at the protocol-agnostic
format.CoreRequestboundary, all four protocol adapters (anthropic, chat, openai, google) inherit the clamp without changes.Tests
Adds three tests in
internal/service/server/adapter_dispatch_test.go:TestRouteMaxOutputTokensPrefersRouteEntry— route-level cap wins over provider-meta.TestRouteMaxOutputTokensFallsBackToProviderModelMeta— provider catalog metadata is consulted when the route doesn't declare its own.TestRouteMaxOutputTokensReturnsZeroWhenUnset— both unset → 0 (no clamp).Full
go test ./...passes locally ongolang:1.26-bookworm.Backwards compatibility
Pure additive behavior — installations that don't declare
max_output_tokensper model see no change. Installations that do declare it now get an automatic clamp instead of relying on the upstream HTTP response.Operational note
This patch was deployed in production at the PKU CCLab moonbridge instance for ~10 minutes before this PR was opened, where it eliminated the 400 cycle and verified normal operation across Claude / Qwen / DeepSeek / Gemini routes.