Gemma4: max_num_batched_tokens error for 26B MoE on 1× A100/H100 (BF16) command #441

@suleymanzahi

Description

Recipe file: Google/Gemma4.md
Relevant section: https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html#multi-gpu-deployment

Hardware: 1x NVIDIA A100 80 GB, AMD EPYC 7643 48-Core Processor

When running the following command for 26B MoE on 1× A100/H100 (BF16):
    vllm serve google/gemma-4-26B-A4B-it \
      --max-model-len 32768 \
      --gpu-memory-utilization 0.90

The following error appears:
      File "/home/ubuntu/.venv/lib/python3.12/site-packages/vllm/v1/core/encoder_cache_manager.py", line 302, in compute_mm_encoder_budget
        raise ValueError(
    ValueError: Chunked MM input disabled but max_tokens_per_mm_item (2496) is larger than max_num_batched_tokens (2048). Please increase max_num_batched_tokens.

Passing --max_num_batched_tokens with any value above 2496 (the max_tokens_per_mm_item reported in the error) resolves the issue.
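For anyone scripting the launch, here is a minimal sketch of the workaround in Python. It builds the same `vllm serve` command with the extra flag and refuses a budget that the error above would reject. The helper name `build_serve_command` and the pre-check are hypothetical and are not part of vLLM. The comparison only mirrors the wording of the error message ("larger than"), so whether a budget exactly equal to 2496 passes in vLLM itself is an assumption. The sketch uses the dashed spelling `--max-num-batched-tokens`, as in the vLLM CLI documentation.

```python
import shlex

# Taken from the error message above. It depends on the model's multimodal
# configuration, so treat it as an input rather than a fixed constant.
MAX_TOKENS_PER_MM_ITEM = 2496


def build_serve_command(
    max_num_batched_tokens: int,
    max_tokens_per_mm_item: int = MAX_TOKENS_PER_MM_ITEM,
) -> list[str]:
    """Return the argv for `vllm serve` with an explicit batched-token budget."""
    # Hypothetical pre-check that mirrors the condition described by the error
    # text; this is not vLLM's own implementation.
    if max_tokens_per_mm_item > max_num_batched_tokens:
        raise ValueError(
            f"max_num_batched_tokens ({max_num_batched_tokens}) must be at least "
            f"max_tokens_per_mm_item ({max_tokens_per_mm_item})"
        )
    return [
        "vllm", "serve", "google/gemma-4-26B-A4B-it",
        "--max-model-len", "32768",
        "--gpu-memory-utilization", "0.90",
        "--max-num-batched-tokens", str(max_num_batched_tokens),
    ]


if __name__ == "__main__":
    # Print a shell-ready command line; 4096 is an arbitrary example value.
    print(shlex.join(build_serve_command(4096)))
```

The value 4096 is only an example; any budget above the reported 2496 matches what was observed to work in this report.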
