WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving

¹Fudan University  ²Yinwang Intelligent Technology Co., Ltd.


📰 News

  • 2025/12/06: 🎉🎉🎉 Paper submitted to arXiv.

📅️ Roadmap

| Status | Milestone | ETA |
|--------|-----------|-----|
| 🚀 | Release the inference source code | 2025.12.21 |
| 🚀 | Release the SFT and inference code | 2025.12.21 |
| 🚀 | Release pretrained models on Hugging Face | TBD |
| 🚀 | Release NAVSIM evaluation code | TBD |
| 🚀 | Release the RL code | TBD |

🔧️ Framework

[Figure: framework]

🏆 Qualitative Results on NAVSIM

NAVSIM-v1 benchmark results

[Figure: navsim-v1]

NAVSIM-v2 benchmark results

[Figure: navsim-v2]

Quick Inference Demo

The WAM-Diff model will be available on the Hugging Face Hub soon. To quickly test the model, follow these steps:

  1. Clone the repository

    git clone https://github.com/fudan-generative-vision/WAM-Diff
    cd WAM-Diff
  2. Initialize the environment
    If you prefer conda, run the environment setup script to install the necessary dependencies:

    bash init_env.sh

    Or you can use uv to create the environment:

    uv venv && uv sync
  3. Prepare the Model
    Download the pretrained WAM-Diff model from Hugging Face (pending release) to the ./model/WAM-Diff directory:

    https://huggingface.co/fudan-generative-ai/WAM-Diff
    

    Download the pretrained SigLIP2 model from Hugging Face to the ./model/siglip2-so400m-patch14-384 directory:

    https://huggingface.co/google/siglip2-so400m-patch14-384
    
  4. Run the demo script
    Execute the demo script to test WAM-Diff on an example image:

    bash inf.sh
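
The model download in step 3 can also be scripted. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and that the WAM-Diff weights have been published under the repository ID listed above (still pending release). The names `MODEL_TARGETS` and `download_models` are illustrative and are not part of this repository.

```python
from pathlib import Path

# Repository IDs and target directories as listed in step 3.
MODEL_TARGETS = {
    "fudan-generative-ai/WAM-Diff": "./model/WAM-Diff",
    "google/siglip2-so400m-patch14-384": "./model/siglip2-so400m-patch14-384",
}


def download_models(targets=MODEL_TARGETS, dry_run=False):
    """Download each repo into its directory; return the (repo_id, path) plan."""
    plan = [(repo_id, Path(local_dir)) for repo_id, local_dir in targets.items()]
    if dry_run:
        return plan
    # Imported lazily so a dry run works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in plan:
        local_dir.mkdir(parents=True, exist_ok=True)
        snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
    return plan


if __name__ == "__main__":
    # Print the plan only; call download_models() to actually fetch the files.
    for repo_id, local_dir in download_models(dry_run=True):
        print(f"{repo_id} -> {local_dir}")
```

Running with `dry_run=True` only lists what would be fetched, which is useful while the WAM-Diff weights are not yet available.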

Training

To fine-tune WAM-Diff, please follow these steps:

  1. Set Up the Environment
    Follow the same environment setup steps as in the Quick Inference Demo section.
  2. Prepare the Data
    Prepare your training dataset in JSON format, for example:
    [
        {
            "image": ["path/to/image1.png"],
            "conversations": [
                {
                    "from": "human",
                    "value": "Here is front views of a driving vehicle:\n<image>\nThe navigation information is: straight\nThe current position is (0.00,0.00)\nCurrent velocity is: (13.48,-0.29)  and current accelerate is: (0.19,0.05)\nPredict the optimal driving action for the next 4 seconds with 8 new waypoints."
                },
                {
                    "from": "gpt",
                    "value": "6.60,-0.01,13.12,-0.03,19.58,-0.04,25.95,-0.03,32.27,-0.03,38.56,-0.05,44.88,-0.06,51.16,-0.09"
                }
            ]
        },
        ...
    ]
  3. Run the Training Script
    Execute the training script with the following command:
    cd train
    bash ./scripts/llada_v_finetune.sh
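
Before launching training, it can help to sanity-check the dataset from step 2. The sketch below is illustrative and not part of the repository: it assumes each entry holds one human turn followed by one gpt turn, that the gpt answer is a flat comma-separated list of x,y values for 8 waypoints (as in the example above), and that the number of `<image>` placeholders matches the length of the image list. The helper names `parse_waypoints`, `validate_record`, and `validate_file` are hypothetical.

```python
import json

NUM_WAYPOINTS = 8  # the example prompt asks for 8 waypoints over 4 seconds


def parse_waypoints(answer, expected=NUM_WAYPOINTS):
    """Turn 'x1,y1,x2,y2,...' into a list of (x, y) float pairs."""
    values = [float(v) for v in answer.split(",")]
    if len(values) != 2 * expected:
        raise ValueError(f"expected {2 * expected} numbers, got {len(values)}")
    return list(zip(values[0::2], values[1::2]))


def validate_record(record):
    """Check one dataset entry against the assumed layout; return its waypoints."""
    images = record["image"]
    human, gpt = record["conversations"]
    if human["from"] != "human" or gpt["from"] != "gpt":
        raise ValueError("conversations must be a human turn followed by a gpt turn")
    # Assumption: one <image> placeholder per listed image path.
    if human["value"].count("<image>") != len(images):
        raise ValueError("number of <image> placeholders must match the image list")
    return parse_waypoints(gpt["value"])


def validate_file(path):
    """Validate every entry of a JSON dataset file; return the entry count."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    for index, record in enumerate(records):
        try:
            validate_record(record)
        except (KeyError, ValueError) as exc:
            raise ValueError(f"entry {index}: {exc}") from exc
    return len(records)


if __name__ == "__main__":
    sample = {
        "image": ["path/to/image1.png"],
        "conversations": [
            {"from": "human", "value": "Here is front views of a driving vehicle:\n<image>\n..."},
            {"from": "gpt", "value": "6.60,-0.01,13.12,-0.03,19.58,-0.04,25.95,-0.03,"
                                     "32.27,-0.03,38.56,-0.05,44.88,-0.06,51.16,-0.09"},
        ],
    }
    print(validate_record(sample)[0])
```

The checks are deliberately strict, so a malformed entry fails with its index before any training time is spent on it.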

📝 Citation

If you find our work useful for your research, please consider citing the paper:

@article{xu2025wam,
  title={WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving},
  author={Xu, Mingwang and Cui, Jiahao and Cai, Feipeng and Shang, Hanlin and Zhu, Zhihao and Luan, Shan and Xu, Yifang and Zhang, Neng and Li, Yaoyi and Cai, Jia and others},
  journal={arXiv preprint arXiv:2512.11872},
  year={2025}
}

🤗 Acknowledgements

We gratefully acknowledge the contributors to the LLaDA-V repository, whose commitment to open source has provided us with an excellent codebase and pretrained models.
