Gemma4: max_num_batched_tokens error for 26B MoE on 1× A100/H100 (BF16) command

Recipe file: [Google/Gemma4.md](https://github.com/vllm-project/recipes/blob/main/Google/Gemma4.md)
Relevant section: https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html#multi-gpu-deployment

Hardware: 1x NVIDIA A100 80 GB, AMD EPYC 7643 48-Core Processor

When running the following command for 26B MoE on 1× A100/H100 (BF16):
`vllm serve google/gemma-4-26B-A4B-it \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90`

The following error appears:
File "/home/ubuntu/.venv/lib/python3.12/site-packages/vllm/v1/core/encoder_cache_manager.py", line 302, in compute_mm_encoder_budget
  raise ValueError(
ValueError: Chunked MM input disabled but max_tokens_per_mm_item (2496) is larger than max_num_batched_tokens (2048). Please increase max_num_batched_tokens.

Setting `--max-num-batched-tokens` to any value higher than 2496 resolves the issue.
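For illustration, the comparison that the error message describes can be sketched as below. This is a minimal reconstruction from the wording of the message only, not vLLM's actual code: the function name `budget_ok` is hypothetical, 2496 and 2048 are the values from the error above, and 4096 is an arbitrary choice above 2496.

```python
def budget_ok(max_tokens_per_mm_item: int, max_num_batched_tokens: int) -> bool:
    """Hypothetical stand-in for the check implied by the error message.

    The message complains when a single multimodal item needs more tokens
    than one scheduler batch may hold, so the sketch accepts a configuration
    only if the item fits within the batch budget.
    """
    return max_tokens_per_mm_item <= max_num_batched_tokens


# Values reported in the error: one multimodal item needs 2496 tokens,
# while the batch budget in effect was 2048.
MM_ITEM_TOKENS = 2496
REPORTED_BUDGET = 2048

# Arbitrary illustrative budget above 2496, as the workaround suggests.
RAISED_BUDGET = 4096

for budget in (REPORTED_BUDGET, RAISED_BUDGET):
    status = "ok" if budget_ok(MM_ITEM_TOKENS, budget) else "too small"
    print(f"max_num_batched_tokens={budget}: {status}")
```

Applied to the command from this report, the raised budget corresponds to appending `--max-num-batched-tokens 4096` (or any other value above 2496) to the `vllm serve` invocation. Whether a value of exactly 2496 is accepted by vLLM is not confirmed here.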

Gemma4: max_num_batched_tokens error for 26B MoE on 1× A100/H100 (BF16) command #441
