Recipe file: Google/Gemma4.md
Relevant section: https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html#multi-gpu-deployment
Hardware: 1x NVIDIA A100 80 GB, AMD EPYC 7643 48-Core Processor
When running the following command for 26B MoE on 1× A100/H100 (BF16):
vllm serve google/gemma-4-26B-A4B-it \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
The following error appears:
  File "/home/ubuntu/.venv/lib/python3.12/site-packages/vllm/v1/core/encoder_cache_manager.py", line 302, in compute_mm_encoder_budget
    raise ValueError(
ValueError: Chunked MM input disabled but max_tokens_per_mm_item (2496) is larger than max_num_batched_tokens (2048). Please increase max_num_batched_tokens.
Setting --max_num_batched_tokens to any value higher than 2496 (the max_tokens_per_mm_item reported in the error), for example --max_num_batched_tokens 4096, solves the issue.
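
To make the failing condition concrete, here is a minimal sketch of the comparison implied by the error message. It is not vLLM's actual implementation: `check_mm_budget` and `fits` are hypothetical names, and the logic is inferred only from the wording of the ValueError above.

```python
def check_mm_budget(max_tokens_per_mm_item: int,
                    max_num_batched_tokens: int,
                    chunked_mm_input: bool = False) -> None:
    """Hypothetical stand-in for the check suggested by the error message.

    Assumption: with chunked multimodal input disabled, a single multimodal
    item must fit entirely within one batch of tokens.
    """
    if not chunked_mm_input and max_tokens_per_mm_item > max_num_batched_tokens:
        raise ValueError(
            "Chunked MM input disabled but max_tokens_per_mm_item "
            f"({max_tokens_per_mm_item}) is larger than "
            f"max_num_batched_tokens ({max_num_batched_tokens}). "
            "Please increase max_num_batched_tokens."
        )


def fits(max_tokens_per_mm_item: int, max_num_batched_tokens: int) -> bool:
    """Return True if the (assumed) budget check would pass."""
    try:
        check_mm_budget(max_tokens_per_mm_item, max_num_batched_tokens)
    except ValueError:
        return False
    return True


# Values taken from the error above: the default budget of 2048 is too small.
print(fits(2496, 2048))
# 4096 is an arbitrary example value above 2496.
print(fits(2496, 4096))
```

Under this reading, the recipe command fails because the batch budget in effect (2048) is smaller than the 2496 tokens a single multimodal item can occupy for this model, so the recipe may need to pass --max_num_batched_tokens explicitly.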