bench_serving: fallback when vLLM get_tokenizer hits removed API #1573

Merged: functionstackx merged 1 commit into main from chang/bench-serving-tokenizer-fallback on May 27, 2026

@ChangLiu0709 (Collaborator) commented May 27, 2026

Summary

Unblocks Qwen3.5 FP8 MI355X SGLang disagg smoke / bench runs that use --use-chat-template without changing inference server code.

Test plan

  • python utils/bench_serving/benchmark_serving.py --help (import sanity)
  • Load tokenizer for a Qwen3.5 model id with random dataset + --use-chat-template (no server required for prompt generation path)
  • Qwen3.5 disagg CI smoke after [AMD] Add Qwen3.5 FP8 MI355X SGLang disaggregated benchmark #1570 lands (or locally with matching container)

Related

Out of scope

  • SGLang runtime / server tokenizer changes
  • Benchmark config or image bumps

Made with Cursor


Note

Low Risk
Client-only benchmark prompt-generation path; no auth, serving, or inference changes.

Overview
Adds _load_tokenizer() in benchmark_serving.py and routes tokenizer loading through it in main() and multiprocessing _init_tokenizer_worker, instead of calling vLLM’s get_tokenizer directly.

When vLLM’s wrapper fails with AttributeError on removed all_special_tokens_extended (e.g. Qwen3.5 + newer transformers), the helper retries backend_request_func.get_tokenizer so bench client tokenization stays aligned with the SGLang server; if that import fails, it falls back to AutoTokenizer.from_pretrained.

This unblocks random-dataset / --use-chat-template benchmark runs that previously crashed during prompt generation, without changing inference server code.

Reviewed by Cursor Bugbot for commit e74f133. Bugbot is set up for automated code reviews on this repo.

vLLM's tokenizer wrapper can raise AttributeError on
all_special_tokens_extended with newer transformers (e.g. Qwen3.5).
Use backend_request_func.get_tokenizer on fallback so client
tokenization keeps the sglang v5 fix (#1381); bare AutoTokenizer only
if backend is unavailable.

Co-authored-by: Cursor <cursoragent@cursor.com>

@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

tokenizer_mode=tokenizer_mode,
trust_remote_code=trust_remote_code,
)
except ImportError:

Fallback catches only ImportError, missing re-raised AttributeError

Low Severity

The inner except ImportError at line 122 only catches the import failure, not an AttributeError from _backend_get_tokenizer(...). The module-level get_tokenizer (lines 48–51) preferentially resolves to backend_request_func.get_tokenizer. If that function ever raises AttributeError with all_special_tokens_extended, the fallback re-imports and calls the same function, producing the same uncaught AttributeError. The AutoTokenizer.from_pretrained final fallback becomes unreachable. Catching (ImportError, AttributeError) would keep the full fallback chain intact.
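
The difference between the narrow and the widened handler can be shown with a small, self-contained sketch. The function names below are hypothetical and stand in for the backend call and the `AutoTokenizer.from_pretrained` fallback; nothing here is taken from the actual diff beyond the two exception tuples being compared.

```python
from typing import Any, Callable, Tuple, Type


def retry_then_final(
    backend_call: Callable[[], Any],
    final_call: Callable[[], Any],
    caught: Tuple[Type[BaseException], ...],
) -> Any:
    """Try the backend loader; use the final fallback only for `caught` errors."""
    try:
        return backend_call()
    except caught:
        return final_call()


def backend_hits_removed_api() -> Any:
    # Simulates the backend loader failing the same way as the first attempt.
    raise AttributeError("no attribute 'all_special_tokens_extended'")


def final_fallback() -> str:
    # Placeholder for the last-resort loader.
    return "final fallback result"


# Narrow handler: the AttributeError escapes and the final fallback never runs.
try:
    retry_then_final(backend_hits_removed_api, final_fallback, (ImportError,))
    narrow_outcome = "reached final fallback"
except AttributeError:
    narrow_outcome = "AttributeError propagated"

# Widened handler: the same failure now reaches the final fallback.
wide_outcome = retry_then_final(
    backend_hits_removed_api, final_fallback, (ImportError, AttributeError)
)

print(narrow_outcome)
print(wide_outcome)
```

One trade-off to keep in mind: a bare `AttributeError` in the tuple would also route unrelated attribute errors from the backend loader to the final fallback, so a substring check like the one on the first attempt may be worth repeating there.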

@claude (bot, Contributor) left a comment

LGTM — small, defensive client-side tokenizer fallback in a benchmark utility.

Extended reasoning...

Overview

This PR adds a _load_tokenizer() helper in utils/bench_serving/benchmark_serving.py that wraps the existing get_tokenizer call with a narrow fallback path. When the call raises AttributeError whose message contains all_special_tokens_extended (a known incompatibility between newer transformers and vLLM's tokenizer wrapper for Qwen3.5), it retries via backend_request_func.get_tokenizer (which uses plain AutoTokenizer + sglang fixups), with a final fallback to AutoTokenizer.from_pretrained. The helper is used in both main() and the multiprocessing worker initializer.

Security risks

None. This is a client-side benchmark utility for generating random prompts; no auth, crypto, permissions, networking, or inference path changes. trust_remote_code is plumbed through unchanged from the existing CLI argument.

Level of scrutiny

Low. This is benchmark tooling under utils/bench_serving/, not production serving code. The change is mechanical and self-contained (~35 lines, one file), the fallback is gated on a specific error-message substring so it won't silently swallow unrelated AttributeErrors, and the cascading fallback preserves the client/server tokenizer alignment from #1381.

Other factors

The bug hunting agent found no issues. The new helper follows the existing import-order convention from #1428. The only minor stylistic note is that the top-level import already prefers backend_request_func.get_tokenizer, so the nested re-import in the fallback is a belt-and-suspenders measure for environments where the vLLM wrapper still gets resolved — the PR description acknowledges this explicitly. Not worth blocking on.

@functionstackx merged commit ec05272 into main on May 27, 2026 (8 checks passed)
@functionstackx deleted the chang/bench-serving-tokenizer-fallback branch on May 27, 2026 16:08