Bug: beam search crashes with HybridMambaAttentionDynamicCache (4 bugs in modeling_nemotron_h.py) (NemotronH 30B A3B) #142

@FahdSeddik


Beam search (num_beams > 1, use_cache=True) is broken due to 4 bugs in modeling_nemotron_h.py. Tested with transformers 5.5.0, causal-conv1d 1.6.1, mamba-ssm 2.3.1, PyTorch 2.6, CUDA 12.6, trust_remote_code=True (commit cbd3fa9f).
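A minimal reproduction sketch, assuming the standard `transformers` loading and `generate` calls. The `model_id` argument is a placeholder for the checkpoint under test (this report does not spell out the repository id), and the prompt and token count are arbitrary illustrative choices.

```python
def reproduce(model_id: str, num_beams: int = 4, max_new_tokens: int = 8):
    """Run cached beam search; with the unpatched modeling file this is expected to crash."""
    # Imported lazily so the function can be defined without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,  # required: the cache class lives in the remote modeling file
        device_map="auto",       # multi-GPU placement is what exposes Bug 4
    )

    inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
    # num_beams > 1 together with use_cache=True is the failing combination.
    return model.generate(
        **inputs,
        num_beams=num_beams,
        use_cache=True,
        max_new_tokens=max_new_tokens,
    )
```

Calling `reproduce("<checkpoint id>")` on an affected setup should surface the first of the errors listed below.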

Bug 1 -- HybridMambaAttentionDynamicCache.__init__ (line 177) computes conv_kernel_size = config.conv_kernel as a local variable but never stores it on self. cuda_kernels_forward (line 461) accesses cache_params.conv_kernel_size and crashes:

AttributeError: 'HybridMambaAttentionDynamicCache' object has no attribute 'conv_kernel_size'

Fix: add self.conv_kernel_size = config.conv_kernel in __init__.
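The pattern can be sketched in isolation. The two classes below are simplified stand-ins, not the real cache, and the kernel width of 4 is an illustrative value only.

```python
from types import SimpleNamespace


class BuggyCache:
    """Mimics the reported pattern: the value only lives in a local variable."""

    def __init__(self, config):
        conv_kernel_size = config.conv_kernel  # local only, discarded on return


class FixedCache:
    """Same constructor, but the value is stored on the instance."""

    def __init__(self, config):
        self.conv_kernel_size = config.conv_kernel


# Hypothetical config object; only the conv_kernel field matters here.
config = SimpleNamespace(conv_kernel=4)

buggy_has_attr = hasattr(BuggyCache(config), "conv_kernel_size")
fixed_value = FixedCache(config).conv_kernel_size
```

Here `buggy_has_attr` is `False`, which is why any later read of `cache_params.conv_kernel_size` raises `AttributeError`.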

Bug 2 -- update_conv_state (lines 249, 252) and update_ssm_state (line 256) read self.conv_states.device / self.ssm_states.device, but both attributes are Python lists of per-layer tensors, not tensors:

AttributeError: 'list' object has no attribute 'device'

Fix: use self.conv_states[layer_idx].device / self.ssm_states[layer_idx].device.
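A small sketch of the difference, using a stand-in `FakeTensor` class (an assumption made so the example runs without a GPU or PyTorch). The helper names `device_buggy` and `device_fixed` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Stand-in for a tensor; it only carries a device name."""
    device: str


# One entry per layer, mirroring the list-based layout described above.
conv_states = [FakeTensor("cuda:0"), FakeTensor("cuda:1")]


def device_buggy(states, layer_idx):
    # The list itself has no .device attribute.
    return states.device


def device_fixed(states, layer_idx):
    # Index into the list first, then ask the per-layer tensor.
    return states[layer_idx].device


try:
    device_buggy(conv_states, 1)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

fixed_device = device_fixed(conv_states, 1)
```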

Bug 3 -- __init__ allocates conv_states with intermediate_size = mamba_num_heads * mamba_head_dim (4096 for this model) but the mixer stores hidden_states_B_C of size conv_dim = intermediate_size + 2 * n_groups * ssm_state_size (6144). The CUDA kernel detects the mismatch:

RuntimeError: weight must have shape (dim, width)

Fix: allocate conv_states with conv_dim instead of intermediate_size.
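The size arithmetic can be checked with a short sketch. Only the totals 4096 and 6144 come from this report; the individual factors below (64 heads of dimension 64, 8 groups, state size 128) are assumptions chosen to reproduce those totals and should be replaced with the values from the actual config.

```python
def intermediate_size(mamba_num_heads: int, mamba_head_dim: int) -> int:
    """Channel count the cache currently allocates for conv_states."""
    return mamba_num_heads * mamba_head_dim


def conv_dim(mamba_num_heads: int, mamba_head_dim: int,
             n_groups: int, ssm_state_size: int) -> int:
    """Channel count the mixer actually writes (hidden_states_B_C)."""
    return (intermediate_size(mamba_num_heads, mamba_head_dim)
            + 2 * n_groups * ssm_state_size)


# Assumed factors; see the note above.
allocated = intermediate_size(64, 64)
required = conv_dim(64, 64, n_groups=8, ssm_state_size=128)
missing_channels = required - allocated
```

With these values the cache is short by `2 * n_groups * ssm_state_size` channels, which is the gap between the allocated and required sizes.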

Bug 4 -- With device_map="auto" across multiple GPUs, update_conv_state(cache_init=True) calls .to(self.conv_states[layer_idx].device) which moves the tensor back to the initialisation device (cuda:0), even for layers running on cuda:1. The next decode step then runs causal_conv1d_update with x/weight on cuda:1 and conv_state on cuda:0. The CUDA kernel reports this as the same shape error from Bug 3, which made it hard to diagnose.

Fix: for cache_init=True, assign directly without .to() so the tensor stays on the device where the mixer ran. Same applies to update_ssm_state.
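The device round-trip can be illustrated with stand-ins. `FakeTensor` and `SketchCache` are hypothetical simplifications, not the real classes, and only model which device a buffer ends up on.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Stand-in for a tensor; .to() returns a copy on the requested device."""
    device: str

    def to(self, device: str) -> "FakeTensor":
        return FakeTensor(device)


class SketchCache:
    def __init__(self, num_layers: int):
        # All buffers are pre-allocated on the initialisation device.
        self.conv_states = [FakeTensor("cuda:0") for _ in range(num_layers)]

    def init_buggy(self, layer_idx: int, new_state: FakeTensor) -> FakeTensor:
        # Moves the state back to the pre-allocated buffer's device.
        self.conv_states[layer_idx] = new_state.to(self.conv_states[layer_idx].device)
        return self.conv_states[layer_idx]

    def init_fixed(self, layer_idx: int, new_state: FakeTensor) -> FakeTensor:
        # Direct assignment keeps the state where the mixer produced it.
        self.conv_states[layer_idx] = new_state
        return self.conv_states[layer_idx]


# Layer 1 is assumed to run on cuda:1 under device_map="auto".
mixer_output = FakeTensor("cuda:1")
buggy_device = SketchCache(2).init_buggy(1, mixer_output).device
fixed_device = SketchCache(2).init_fixed(1, mixer_output).device
```

In the buggy variant the cached state lands on `cuda:0` while the layer's weights stay on `cuda:1`; in the fixed variant both sit on `cuda:1`.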

All four bugs are in the incremental decode path (cache_position[0] > 0), which is only hit during cached multi-step generation. Greedy decoding and sampling with a fresh cache never reach it.
