Highlights
- Critic- and likelihood-free online RL for flow-based VLAs (no auxiliary value networks).
- Wider Space: SDE-based sampling expands the exploration manifold beyond deterministic ODE trajectories.
- Finer Steps: step-wise supervision targets the immediate next denoising step with a noise-aware regression signal.
- Penalty-free preference learning: a logistic contrastive ranking loss enforces push–pull dynamics (promote successes, suppress failures) to stabilize on-policy learning.
- Efficient: a single forward pass per optimization step keeps training overhead and per-step latency low.
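The wider-space sampling and the push–pull ranking objective above can be sketched as follows. This is a schematic NumPy illustration under assumed conventions, not the paper's exact formulation: the function names, the `sigma` noise scale, and the error inputs are all illustrative.

```python
import numpy as np

def sde_sampling_step(x, velocity, t, dt, sigma=0.5, rng=None):
    """One denoising step for a flow-based policy (schematic).

    The deterministic Euler/ODE update follows the learned velocity field;
    the SDE variant adds Gaussian noise, so repeated rollouts cover a wider
    region of action space than the single deterministic ODE trajectory.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    drift = x + velocity(x, t) * dt  # deterministic ODE (Euler) update
    return drift + sigma * np.sqrt(dt) * rng.standard_normal(np.shape(x))

def step_nft_loss(err_success, err_failure):
    """Penalty-free logistic contrastive ranking loss (schematic).

    `err_*` are step-wise regression errors against the noise-aware target
    for the immediate next denoising step. Scoring samples by negative
    error and ranking successes above failures yields push-pull gradients:
    success errors are driven down (promote) while failure errors are
    driven up (suppress), with no explicit KL or clipping penalty.
    """
    margin = err_failure - err_success                 # failures should score worse
    return float(np.mean(np.logaddexp(0.0, -margin)))  # -log sigmoid(margin)
```

For example, `step_nft_loss` is small when success rollouts already fit their step-wise targets better than failure rollouts, and large when the ranking is violated, so minimizing it pulls the policy toward success-side targets and away from failure-side ones.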
Main Results (see paper for full protocol)
- LIBERO (few-shot): $\pi$-StepNFT improves performance by 32.9% over SFT.
- ManiSkill (OOD generalization): $\pi$-StepNFT improves OOD success by 11.1% over critic-based baselines by mitigating multimodal overfitting.
For environment setup and simulator configuration details, please refer to the RLinf repository.
Run experiments using the Docker image.
```bash
docker run -it --rm --gpus all \
    --shm-size 20g \
    --network host \
    --name rlinf \
    -v .:/workspace/RLinf \
    rlinf/rlinf:agentic-rlinf0.1-maniskill_libero

# For faster mirror downloads in mainland China, you can use:
# docker.1ms.run/rlinf/rlinf:agentic-rlinf0.1-maniskill_libero
```

Switch to the corresponding virtual environment using the built-in switch_env tool:

```bash
source switch_env openpi
```

For ManiSkill, download the simulator assets:

```bash
cd <path_to_pi_StepNFT>/rlinf/envs/maniskill

# For faster downloads in mainland China, you can set:
# export HF_ENDPOINT=https://hf-mirror.com
hf download --repo-type dataset RLinf/maniskill_assets --local-dir ./assets
```
Launch training:

```bash
bash examples/embodiment/run_embodiment.sh libero_object_nft_actor_openpi
```

Batch eval on embodiment checkpoints (auto-scans global_step_* in descending order):

```bash
TIMESTAMP=YOUR_TIMESTAMP \
EXP_SUBPATH=maniskill_nft_actor_openpi/checkpoints \
EVAL_NAME=embodiment_${TIMESTAMP} \
MIN_STEP=160 \
bash examples/embodiment/batch_eval_embodiment.sh maniskill_ppo_openvlaoft
```

Batch eval on ManiSkill OOD tasks across multiple envs:

```bash
TIMESTAMP=YOUR_TIMESTAMP \
EXP_SUBPATH=maniskill_nft_actor_openpi/checkpoints \
CONFIG_NAME=YOUR_CFG_NAME \
EVAL_NAME=mani_ood_${TIMESTAMP} \
MIN_STEP=160 \
bash examples/embodiment/batch_eval_mani_ood.sh
```

This project builds on RLinf:

@article{yu2025rlinf,
title={RLinf: Flexible and Efficient Large-scale Reinforcement Learning via Macro-to-Micro Flow Transformation},
author={Yu, Chao and Wang, Yuanqing and Guo, Zhen and Lin, Hao and Xu, Si and Zang, Hongzhi and Zhang, Quanlu and Wu, Yongji and Zhu, Chunyang and Hu, Junhao and others},
journal={arXiv preprint arXiv:2509.15965},
year={2025}
}

- DiffusionNFT: https://github.com/NVlabs/DiffusionNFT

@article{zheng2025diffusionnft,
title={DiffusionNFT: Online Diffusion Reinforcement with Forward Process},
author={Zheng, Kaiwen and Chen, Huayu and Ye, Haotian and Wang, Haoxiang and Zhang, Qinsheng and Jiang, Kai and Su, Hang and Ermon, Stefano and Zhu, Jun and Liu, Ming-Yu},
journal={arXiv preprint arXiv:2509.16117},
year={2025}
}

If you find $\pi$-StepNFT useful, please cite:

@misc{wang2026pistepnftwiderspaceneeds,
title={$\pi$-StepNFT: Wider Space Needs Finer Steps in Online RL for Flow-based VLAs},
author={Siting Wang and Xiaofeng Wang and Zheng Zhu and Minnan Pei and Xinyu Cui and Cheng Deng and Jian Zhao and Guan Huang and Haifeng Zhang and Jun Wang},
year={2026},
eprint={2603.02083},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2603.02083},
}