bench_serving: fallback when vLLM get_tokenizer hits removed API #1573
vLLM's tokenizer wrapper can raise AttributeError on all_special_tokens_extended with newer transformers (e.g. Qwen3.5). Use backend_request_func.get_tokenizer on fallback so client tokenization keeps the sglang v5 fix (#1381); bare AutoTokenizer only if backend is unavailable. Co-authored-by: Cursor <cursoragent@cursor.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit e74f133.
        tokenizer_mode=tokenizer_mode,
        trust_remote_code=trust_remote_code,
    )
except ImportError:
Fallback catches only ImportError, missing re-raised AttributeError
Low Severity
The inner except ImportError at line 122 only catches the import failure, not an AttributeError from _backend_get_tokenizer(...). The module-level get_tokenizer (lines 48–51) preferentially resolves to backend_request_func.get_tokenizer. If that function ever raises AttributeError with all_special_tokens_extended, the fallback re-imports and calls the same function, producing the same uncaught AttributeError. The AutoTokenizer.from_pretrained final fallback becomes unreachable. Catching (ImportError, AttributeError) would keep the full fallback chain intact.
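The difference between the two handlers can be sketched as follows. This is a minimal illustration, not the PR's code: `first_working_loader`, `backend_loader`, and `auto_loader` are hypothetical names, and the two loaders are stand-ins for `backend_request_func.get_tokenizer` and `AutoTokenizer.from_pretrained`. Unlike the outer handler described in the PR, this sketch does not gate on the error message.

```python
from typing import Any, Callable, Optional, Sequence, Tuple, Type


def first_working_loader(
    loaders: Sequence[Callable[[], Any]],
    recoverable: Tuple[Type[BaseException], ...] = (ImportError, AttributeError),
) -> Any:
    """Return the result of the first loader that does not raise a recoverable error."""
    last_error: Optional[BaseException] = None
    for loader in loaders:
        try:
            return loader()
        except recoverable as exc:
            # Remember why this loader failed, then try the next one.
            last_error = exc
    raise RuntimeError("no tokenizer loader succeeded") from last_error


def backend_loader() -> Any:
    # Hypothetical stand-in: the backend loader hits the removed attribute.
    raise AttributeError("object has no attribute 'all_special_tokens_extended'")


def auto_loader() -> str:
    # Hypothetical stand-in for the final AutoTokenizer fallback.
    return "auto-tokenizer"
```

With the default `recoverable` tuple, `first_working_loader([backend_loader, auto_loader])` reaches the final loader. Passing `recoverable=(ImportError,)` reproduces the reported behavior: the `AttributeError` escapes and the last loader is never tried.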
LGTM — small, defensive client-side tokenizer fallback in a benchmark utility.
Overview
This PR adds a _load_tokenizer() helper in utils/bench_serving/benchmark_serving.py that wraps the existing get_tokenizer call with a narrow fallback path. When the call raises AttributeError whose message contains all_special_tokens_extended (a known incompatibility between newer transformers and vLLM's tokenizer wrapper for Qwen3.5), it retries via backend_request_func.get_tokenizer (which uses plain AutoTokenizer + sglang fixups), with a final fallback to AutoTokenizer.from_pretrained. The helper is used in both main() and the multiprocessing worker initializer.
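The control flow described above might look roughly like the sketch below. It is an approximation built from this description, not the code in the PR: `load_tokenizer_sketch` and its injected `primary` and `fallbacks` callables are hypothetical, standing in for the vLLM `get_tokenizer` call, `backend_request_func.get_tokenizer`, and `AutoTokenizer.from_pretrained`. The inner handler catches only `ImportError`, matching the description.

```python
from typing import Any, Callable, Sequence

# Substring that identifies the known incompatibility (taken from the PR text).
REMOVED_ATTR = "all_special_tokens_extended"


def load_tokenizer_sketch(
    primary: Callable[[], Any],
    fallbacks: Sequence[Callable[[], Any]],
) -> Any:
    """Call `primary`; on the known AttributeError, try each fallback in order."""
    try:
        return primary()
    except AttributeError as exc:
        # Gate on the message so unrelated AttributeErrors still propagate.
        if REMOVED_ATTR not in str(exc):
            raise
        last_error: Exception = exc

    for loader in fallbacks:
        try:
            return loader()
        except ImportError as err:
            # This loader's module is unavailable; move on to the next one.
            last_error = err
    # Every fallback failed: surface the most recent error.
    raise last_error
```

Injecting the loaders as callables keeps the sketch runnable without vLLM or transformers installed; the real helper would call the library functions directly.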
Security risks
None. This is a client-side benchmark utility for generating random prompts; no auth, crypto, permissions, networking, or inference path changes. trust_remote_code is plumbed through unchanged from the existing CLI argument.
Level of scrutiny
Low. This is benchmark tooling under utils/bench_serving/, not production serving code. The change is mechanical and self-contained (~35 lines, one file), the fallback is gated on a specific error-message substring so it won't silently swallow unrelated AttributeErrors, and the cascading fallback preserves the client/server tokenizer alignment from #1381.
Other factors
The bug hunting agent found no issues. The new helper follows the existing import-order convention from #1428. The only minor stylistic note is that the top-level import already prefers backend_request_func.get_tokenizer, so the nested re-import in the fallback is a belt-and-suspenders measure for environments where the vLLM wrapper still gets resolved — the PR description acknowledges this explicitly. Not worth blocking on.
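The import-order convention mentioned here (prefer one module, fall back to another only on `ImportError`) can be illustrated generically. The helper name `resolve_first` is hypothetical, and the module paths in the usage note are assumptions about the repo layout and vLLM package rather than something this page confirms.

```python
import importlib
from typing import Any, Sequence, Tuple


def resolve_first(candidates: Sequence[Tuple[str, str]]) -> Any:
    """Return the first importable `module.attr` from an ordered list of candidates."""
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr)
        except (ImportError, AttributeError):
            # `from m import a` raises ImportError when either the module or
            # the name is missing; treat both cases the same way here.
            continue
    raise ImportError(f"none of the candidates could be imported: {list(candidates)}")
```

Under those assumptions, the preference order would be expressed as `resolve_first([("backend_request_func", "get_tokenizer"), ("vllm.transformers_utils.tokenizer", "get_tokenizer")])`.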


Summary
- Adds _load_tokenizer() in utils/bench_serving/benchmark_serving.py for random-prompt generation (main() and multiprocessing workers).
- When get_tokenizer raises AttributeError on all_special_tokens_extended (removed in newer transformers, seen with Qwen3.5), fall back to backend_request_func.get_tokenizer so the client keeps the sglang v5 tokenizer alignment from "bench: fix v5 tokenizer fix when --model is an HF Hub repo id" #1381.
- Top-level import order is unchanged (backend_request_func first, vLLM only on ImportError); this handles environments that still resolve vLLM's wrapper or hit the error on the vLLM fallback path.

Unblocks Qwen3.5 FP8 MI355X SGLang disagg smoke / bench runs that use --use-chat-template without changing inference server code.

Test plan
- python utils/bench_serving/benchmark_serving.py --help (import sanity)
- random dataset + --use-chat-template (no server required for prompt generation path)

Related
- backend_request_func.get_tokenizer over vLLM
- backend_request_func

Out of scope
Made with Cursor
Note
Low Risk
Client-only benchmark prompt-generation path; no auth, serving, or inference changes.
Overview
Adds _load_tokenizer() in benchmark_serving.py and routes tokenizer loading through it in main() and multiprocessing _init_tokenizer_worker, instead of calling vLLM’s get_tokenizer directly.

When vLLM’s wrapper fails with AttributeError on removed all_special_tokens_extended (e.g. Qwen3.5 + newer transformers), the helper retries backend_request_func.get_tokenizer so bench client tokenization stays aligned with the SGLang server; if that import fails, it falls back to AutoTokenizer.from_pretrained.

This unblocks random-dataset / --use-chat-template benchmark runs that previously crashed during prompt generation, without changing inference server code.