ImportError: libcudart.so.12 missing in CUDA 13 nightly image when running Nemotron-3-Super-120B-A12B-NVFP4 #150

@jinho2020

Describe the bug

When serving the nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 model with the vllm/vllm-openai:cu130-nightly Docker image, the server crashes immediately on startup with ImportError: libcudart.so.12: cannot open shared object file: No such file or directory.

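This occurs because the nixl_ep library (used for Expert Parallelism in MoE models) is built against CUDA 12: its nixl_ep_cpp extension module depends on the CUDA 12 runtime library, libcudart.so.12. The cu130-nightly image contains only the CUDA 13 runtime, so the import fails. As the traceback under Additional context shows, nixl_ep is pulled in while vLLM imports its fused_moe package, so the crash happens during CLI startup, before any model is loaded, even though the command below passes no expert-parallel flag.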
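To check the diagnosis above from inside the container, a minimal sketch along these lines could probe which CUDA runtime sonames the dynamic loader can actually open. The helper name probe_shared_libs is hypothetical, and libcudart.so.13 is assumed to be the soname shipped by the CUDA 13 runtime; neither comes from the vLLM or NIXL code bases.

```python
import ctypes


def probe_shared_libs(sonames):
    """Return {soname: bool}: True if the dynamic loader can open the library."""
    results = {}
    for name in sonames:
        try:
            # CDLL asks the system loader (dlopen) to resolve the soname.
            ctypes.CDLL(name)
            results[name] = True
        except OSError:
            # Raised when the library is missing from the loader search path.
            results[name] = False
    return results


if __name__ == "__main__":
    # Assumption: CUDA 13 ships its runtime as libcudart.so.13.
    for name, ok in probe_shared_libs(["libcudart.so.12", "libcudart.so.13"]).items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

If the diagnosis is right, running this with the image's Python interpreter should report libcudart.so.12 as missing.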

Steps/Code to reproduce bug

Download the reasoning parser plugin, then run the vLLM Docker container using the cu130-nightly tag:

wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/raw/main/super_v3_reasoning_parser.py

docker run --rm -it --gpus all \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -e VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e HF_TOKEN=$HF_TOKEN \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $(pwd)/super_v3_reasoning_parser.py:/app/super_v3_reasoning_parser.py \
  -p 8000:8000 \
  vllm/vllm-openai:cu130-nightly \
    --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --served-model-name nemotron-3-super \
    --host 0.0.0.0 \
    --port 8000 \
    --async-scheduling \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 1 \
    --data-parallel-size 1 \
    --trust-remote-code \
    --gpu-memory-utilization 0.90 \
    --enable-chunked-prefill \
    --max-num-seqs 4 \
    --max-model-len 1000000 \
    --moe-backend marlin \
    --mamba_ssm_cache_dtype float32 \
    --quantization fp4 \
    --speculative_config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}' \
    --reasoning-parser-plugin /app/super_v3_reasoning_parser.py \
    --reasoning-parser super_v3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder

Expected behavior

The vLLM server should start successfully and begin serving the nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 model without throwing dependency import errors related to the CUDA runtime environment.
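One way the expected behavior could be achieved is to treat nixl_ep as an optional dependency and guard its import. The following is only a sketch of that general pattern; it is not vLLM's actual code and not a confirmed fix, and the names try_import and HAS_NIXL_EP are hypothetical.

```python
import importlib


def try_import(module_name):
    """Import module_name, returning None instead of raising if it cannot load."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # A missing shared library (such as libcudart.so.12) also surfaces
        # as ImportError, so this covers the failure reported in this issue.
        print(f"optional dependency {module_name!r} unavailable: {exc}")
        return None


# Callers would check the flag before using any expert-parallel code path.
nixl_ep = try_import("nixl_ep")
HAS_NIXL_EP = nixl_ep is not None
```

With a guard like this, a single-GPU run that never uses expert parallelism would not depend on nixl_ep loading successfully.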

Additional context

I encountered this issue on a DGX Spark while following the deployment instructions in the unreleased Spark Deployment Guide.

Full traceback:

Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 4, in <module>
    from vllm.entrypoints.cli.main import main
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/__init__.py", line 4, in <module>
    from vllm.entrypoints.cli.benchmark.mm_processor import (
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/benchmark/mm_processor.py", line 5, in <module>
    from vllm.benchmarks.mm_processor import add_cli_args, main
  File "/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/mm_processor.py", line 25, in <module>
    from vllm.benchmarks.datasets import (
  File "/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/datasets/__init__.py", line 4, in <module>
    from vllm.benchmarks.datasets.datasets import (
  File "/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/datasets/datasets.py", line 44, in <module>
    from vllm.lora.utils import get_adapter_absolute_path
  File "/usr/local/lib/python3.12/dist-packages/vllm/lora/utils.py", line 18, in <module>
    from vllm.lora.layers import (
  File "/usr/local/lib/python3.12/dist-packages/vllm/lora/layers/__init__.py", line 4, in <module>
    from vllm.lora.layers.column_parallel_linear import (
  File "/usr/local/lib/python3.12/dist-packages/vllm/lora/layers/column_parallel_linear.py", line 20, in <module>
    from .base_linear import BaseLinearLayerWithLoRA
  File "/usr/local/lib/python3.12/dist-packages/vllm/lora/layers/base_linear.py", line 28, in <module>
    from .utils import _get_lora_device
  File "/usr/local/lib/python3.12/dist-packages/vllm/lora/layers/utils.py", line 10, in <module>
    from vllm.model_executor.layers.fused_moe.fused_moe import try_get_optimal_moe_config
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/__init__.py", line 19, in <module>
    from vllm.model_executor.layers.fused_moe.layer import (
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 47, in <module>
    from vllm.model_executor.layers.fused_moe.unquantized_fused_moe_method import (
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py", line 26, in <module>
    from vllm.model_executor.layers.fused_moe.oracle.unquantized import (
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/oracle/unquantized.py", line 14, in <module>
    from vllm.model_executor.layers.fused_moe.all2all_utils import (
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/all2all_utils.py", line 46, in <module>
    from .nixl_ep_prepare_finalize import (
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/nixl_ep_prepare_finalize.py", line 5, in <module>
    import nixl_ep
  File "/usr/local/lib/python3.12/dist-packages/nixl_ep/__init__.py", line 23, in <module>
    from . import nixl_ep_cpp as _nixl_ep_cpp
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
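The traceback ends at the nixl_ep_cpp extension module, so its unresolved shared-library dependencies can be listed without importing it. The sketch below runs ldd on a shared object and extracts the entries reported as "not found". The function names are hypothetical, and it assumes ldd is available in the image.

```python
import subprocess


def parse_missing_libs(ldd_output):
    """Return the sonames that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved entries look like: "libcudart.so.12 => not found"
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def missing_libs_for(path):
    """Run ldd on a shared object and return its unresolved dependencies."""
    proc = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return parse_missing_libs(proc.stdout)
```

The extension file would sit in /usr/local/lib/python3.12/dist-packages/nixl_ep/ according to the traceback; its exact file name is not shown there, so it has to be looked up in that directory before calling missing_libs_for.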
