bench_serving: fallback when vLLM get_tokenizer hits removed API by ChangLiu0709 · Pull Request #1573 · SemiAnalysisAI/InferenceX

ChangLiu0709 · 2026-05-27T15:48:32Z

Summary

Add _load_tokenizer() in utils/bench_serving/benchmark_serving.py for random-prompt generation (main() and multiprocessing workers).
When vLLM's get_tokenizer raises AttributeError on all_special_tokens_extended (removed in newer transformers, seen with Qwen3.5), fall back to backend_request_func.get_tokenizer so the client keeps the SGLang v5 tokenizer alignment introduced in #1381 (bench: fix v5 tokenizer fix when --model is an HF Hub repo id).
Keeps the import order from #1428 (Update dsr1-fp8-mi325x-sglang SGLang image to v0.5.12-rocm700-mi30x): backend_request_func first, vLLM only on ImportError. This change handles environments that still resolve vLLM's wrapper or hit the error on the vLLM fallback path.

Unblocks Qwen3.5 FP8 MI355X SGLang disagg smoke / bench runs that use --use-chat-template without changing inference server code.
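
The fallback order described in the summary can be sketched as follows. This is a minimal illustration under stated assumptions, not the code merged in this PR: the three loader callables (`primary`, `backend`, `last_resort`) are hypothetical stand-ins for vLLM's `get_tokenizer`, `backend_request_func.get_tokenizer`, and `AutoTokenizer.from_pretrained`, injected as parameters so the sketch runs without those packages installed. A missing backend module is modeled here as an `ImportError` raised by the `backend` callable.

```python
from typing import Any, Callable

# Hypothetical loader type: takes a model id, returns a tokenizer object.
Loader = Callable[[str], Any]

# Attribute name taken from the PR description.
REMOVED_ATTR = "all_special_tokens_extended"


def load_tokenizer(
    model_id: str,
    primary: Loader,
    backend: Loader,
    last_resort: Loader,
) -> Any:
    """Try `primary`; fall back only on the known removed-attribute error."""
    try:
        return primary(model_id)
    except AttributeError as exc:
        # Any other AttributeError is re-raised unchanged, so unrelated
        # bugs are not hidden behind the fallback.
        if REMOVED_ATTR not in str(exc):
            raise

    try:
        # Preferred fallback: keeps client tokenization aligned with the server.
        return backend(model_id)
    except ImportError:
        # Backend module unavailable: use the plain loader.
        return last_resort(model_id)
```

Passing the loaders as arguments is a choice made for this sketch only; it keeps the control flow visible and testable in isolation.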

Test plan

python utils/bench_serving/benchmark_serving.py --help (import sanity)
Load tokenizer for a Qwen3.5 model id with random dataset + --use-chat-template (no server required for prompt generation path)
Qwen3.5 disagg CI smoke after #1570 ([AMD] Add Qwen3.5 FP8 MI355X SGLang disaggregated benchmark) lands (or locally with a matching container)

Related

#1428 (Update dsr1-fp8-mi325x-sglang SGLang image to v0.5.12-rocm700-mi30x): prefer backend_request_func.get_tokenizer over vLLM
#1381 (bench: fix v5 tokenizer fix when --model is an HF Hub repo id): v5 tokenizer client/server alignment in backend_request_func
#1570 ([AMD] Add Qwen3.5 FP8 MI355X SGLang disaggregated benchmark): Qwen3.5 FP8 MI355X SGLang disagg (CI; excludes this fix intentionally)

Out of scope

SGLang runtime / server tokenizer changes
Benchmark config or image bumps

Note

Low Risk
Client-only benchmark prompt-generation path; no auth, serving, or inference changes.

Overview
Adds _load_tokenizer() in benchmark_serving.py and routes tokenizer loading through it in main() and multiprocessing _init_tokenizer_worker, instead of calling vLLM’s get_tokenizer directly.

When vLLM’s wrapper fails with AttributeError on removed all_special_tokens_extended (e.g. Qwen3.5 + newer transformers), the helper retries backend_request_func.get_tokenizer so bench client tokenization stays aligned with the SGLang server; if that import fails, it falls back to AutoTokenizer.from_pretrained.

This unblocks random-dataset / --use-chat-template benchmark runs that previously crashed during prompt generation, without changing inference server code.

Reviewed by Cursor Bugbot for commit e74f133. Bugbot is set up for automated code reviews on this repo.

vLLM's tokenizer wrapper can raise AttributeError on all_special_tokens_extended with newer transformers (e.g. Qwen3.5). Use backend_request_func.get_tokenizer on fallback so client tokenization keeps the sglang v5 fix (#1381); bare AutoTokenizer only if backend is unavailable.

Co-authored-by: Cursor <cursoragent@cursor.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-05-27T15:53:45Z

+                tokenizer_mode=tokenizer_mode,
+                trust_remote_code=trust_remote_code,
+            )
+        except ImportError:


Fallback catches only ImportError, missing re-raised AttributeError

Low Severity

The inner except ImportError at line 122 only catches the import failure, not an AttributeError from _backend_get_tokenizer(...). The module-level get_tokenizer (lines 48–51) preferentially resolves to backend_request_func.get_tokenizer. If that function ever raises AttributeError with all_special_tokens_extended, the fallback re-imports and calls the same function, producing the same uncaught AttributeError. The AutoTokenizer.from_pretrained final fallback becomes unreachable. Catching (ImportError, AttributeError) would keep the full fallback chain intact.

Additional Locations (1)

utils/bench_serving/benchmark_serving.py#L114-L121
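
The difference between the two `except` clauses can be illustrated with a small sketch. It is not the merged code: `backend_get_tokenizer` and `auto_tokenizer_load` are hypothetical callables standing in for `backend_request_func.get_tokenizer` and `AutoTokenizer.from_pretrained`, and the function names are invented for this example.

```python
from typing import Any, Callable

# Hypothetical loader type: takes a model id, returns a tokenizer object.
Loader = Callable[[str], Any]


def fallback_narrow(
    model_id: str, backend_get_tokenizer: Loader, auto_tokenizer_load: Loader
) -> Any:
    """Catches only ImportError, as described in the review comment."""
    try:
        return backend_get_tokenizer(model_id)
    except ImportError:
        return auto_tokenizer_load(model_id)


def fallback_wide(
    model_id: str, backend_get_tokenizer: Loader, auto_tokenizer_load: Loader
) -> Any:
    """Also catches AttributeError, so the last fallback stays reachable."""
    try:
        return backend_get_tokenizer(model_id)
    except (ImportError, AttributeError):
        return auto_tokenizer_load(model_id)


def failing_backend(model_id: str) -> Any:
    # Simulates the backend loader hitting the removed attribute.
    raise AttributeError("no attribute 'all_special_tokens_extended'")


def plain_loader(model_id: str) -> str:
    return "auto:" + model_id


print(fallback_wide("m", failing_backend, plain_loader))  # prints auto:m
try:
    fallback_narrow("m", failing_backend, plain_loader)
except AttributeError as exc:
    print("narrow variant propagated:", exc)
```

Note that the widened clause in this sketch catches any `AttributeError` raised by the backend call, not only the one mentioning `all_special_tokens_extended`.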

claude

LGTM — small, defensive client-side tokenizer fallback in a benchmark utility.

Extended reasoning...

Overview

This PR adds a _load_tokenizer() helper in utils/bench_serving/benchmark_serving.py that wraps the existing get_tokenizer call with a narrow fallback path. When the call raises AttributeError whose message contains all_special_tokens_extended (a known incompatibility between newer transformers and vLLM's tokenizer wrapper for Qwen3.5), it retries via backend_request_func.get_tokenizer (which uses plain AutoTokenizer + sglang fixups), with a final fallback to AutoTokenizer.from_pretrained. The helper is used in both main() and the multiprocessing worker initializer.

Security risks

None. This is a client-side benchmark utility for generating random prompts; no auth, crypto, permissions, networking, or inference path changes. trust_remote_code is plumbed through unchanged from the existing CLI argument.

Level of scrutiny

Low. This is benchmark tooling under utils/bench_serving/, not production serving code. The change is mechanical and self-contained (~35 lines, one file), the fallback is gated on a specific error-message substring so it won't silently swallow unrelated AttributeErrors, and the cascading fallback preserves the client/server tokenizer alignment from #1381.

Other factors

The bug hunting agent found no issues. The new helper follows the existing import-order convention from #1428. The only minor stylistic note is that the top-level import already prefers backend_request_func.get_tokenizer, so the nested re-import in the fallback is a belt-and-suspenders measure for environments where the vLLM wrapper still gets resolved — the PR description acknowledges this explicitly. Not worth blocking on.

ChangLiu0709 requested a review from a team May 27, 2026 15:48

github-project-automation Bot added this to InferenceMAX Board May 27, 2026

cursor Bot reviewed May 27, 2026

claude Bot reviewed May 27, 2026

functionstackx merged commit ec05272 into main May 27, 2026
8 checks passed

functionstackx deleted the chang/bench-serving-tokenizer-fallback branch May 27, 2026 16:08

github-project-automation Bot moved this to Done in InferenceMAX Board May 27, 2026
