🤖 Awesome World Action Models

📜 A Curated List of World Action Models, Vision-Language-Action (VLA), and Embodied AI Research

Overview

This repository aims to provide a comprehensive, curated, and continuously updated list of research papers, resources, and tools related to World Action Models (WAM), Vision-Language-Action (VLA) models, and Embodied AI. The goal is to help researchers and engineers navigate the rapidly evolving field of robotics foundation models.

World Action Models (WAMs) are robotics policies that use world modeling, i.e., predicting future states, to inform action prediction. They mark a shift from reactive policies to predictive, world-aware decision-making.

Vision-Language-Action (VLA) models combine the rich language grounding and visual understanding of Vision-Language Models (VLMs) with action prediction, offering a scalable route toward general-purpose, language-conditioned robot policies.

Comparison Methods & Baselines


Quick Reference (from experimental tables)

Category	Key Baselines
VLA	RT-1, RT-2, OpenVLA, Octo, π0, X-VLA, UniVLA, SmolVLA, VLANeXt
Policy	Diffusion Policy, ACT, BeT, PerAct, MVP, R3M, CQL, IQL
World Model	DreamerV1/V2/V3, I-JEPA, V-JEPA, DreamZero

Documentation

📋 Complete Paper List (129 papers, sorted newest first)
📊 Baseline Methods (detailed comparison methods from experimental tables)

🆕 Latest Papers (Auto-updated)

Papers are automatically fetched daily from arXiv. Last updated: 2026-06-09

VLA

Paper	Authors	Date	Code
LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination	Taishan Li, Jiwen Zhang et al.	2026-06-09
VeriSpace: Spatially Grounded Action Verification for Vision-Language-Action Models	Guiyu Zhao, Longteng Guo et al.	2026-06-09
Uncovering Vulnerability of Vision-Language-Action Models under Joint-Level Physical Faults	Minsoo Jo, Taeju Kwon et al.	2026-06-09
Act on What You See: Unlocking Safe Social Navigation in Vision-Language-Action Models	Qingzi Wang, Xiyang Wu et al.	2026-06-09
A Practical Recipe Towards Improving Sim-and-Real Correlation for VLA Evaluation	Shuo Wang, Hanyuan Xu et al.	2026-06-09
What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents	Jiaheng Hu, Mohit Shridhar et al.	2026-06-09
Flow Control: Steering Vision-Language-Action Models with Simple Real-Time Inputs	Jonathan C. Kao, Jason Chan et al.	2026-06-08
MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models	Hao Shi, Weiye Li et al.	2026-06-08
Your Model Already Knows: Attention-Guided Safety Filter for Vision-Language-Action Models	Seongbin Park, Fan Zhang et al.	2026-06-08
ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models	Fan Zhang, Seongbin Park et al.	2026-06-08

World Model

Paper	Authors	Date	Code
HiMem-WAM: Hierarchical Memory-Gated World Action Models for Robotic Manipulation	Xiaoquan Sun, Ruijian Zhang et al.	2026-06-09
MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation	Jia Zheng, Teli Ma et al.	2026-06-08
C$^3$ache: Accelerating World Action Models with Cross Inference Chunk Cache	Weisen Zhao, Lam Nguyen et al.	2026-06-08
Dream-Tac: A Unified Tactile World Action Model for Contact-Rich Robot Manipulation	Yunfan Lou, Yifan Ye et al.	2026-06-07
FAWAM: Force-Aware World Action Models for Closed-Loop Contact-Rich Manipulation	Haotian He, Zeyu Yan et al.	2026-06-07
Light-WAM: Efficient World Action Models with State-Fusion Action Decoding	Ziang Li, Dongzhou Cheng et al.	2026-06-06
Dreaming when Necessary: Advancing World Action Models with Adaptive Multi-Modal Reasoning	Yinzhou Tang, Jingbo Xu et al.	2026-06-05
Flash-WAM: Modality-Aware Distillation for World Action Models	Arman Akbari, Ci Zhang et al.	2026-06-03
OSCAR: Omni-Embodiment Action-Conditioned World Model for Robotics	Zhuoyuan Wu, Jun Gao	2026-06-03
GeoSem-WAM: Geometry- and Semantic-Aware World Action Models	Fulong Ma, Daojie Peng et al.	2026-06-02

Policy

Paper	Authors	Date	Code
Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination	Jiajun Li, Tiecheng Guo et al.	2026-06-08
AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing	Jisong Cai, Long Ling et al.	2026-06-08
WAM-Nav: Asymmetric Latent World-Action Modeling for Unified Visual Navigation	Ning Yang, Yan Huang et al.	2026-06-03

Key Definitions

Vision-Language-Action (VLA) Models

VLA models are robotics policies built on pretrained VLMs. They inherit the VLM's language grounding and visual understanding and add action prediction, offering a scalable route toward general-purpose, language-conditioned robot policies.

Key Paper: RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | Brohan et al. (2023) | Project

World Action Models (WAM)

WAMs are robotics policies that use a world modeling capability (i.e., predicting future states) for action prediction.

Key Paper: DreamZero: World Action Models are Zero-shot Policies | Chen et al. (2026) | Project

Note: VLA and WAM overlap. A WAM built on a pretrained VLM is both a VLA and a WAM.
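To make the WAM definition concrete, here is a minimal sketch of a policy that predicts future states and uses those predictions to choose an action. Everything in it is a hypothetical illustration: the 1-D dynamics, the names `predict_next_state` and `wam_policy`, and the candidate action set are assumptions, not taken from DreamZero or any paper listed here. A reactive policy would map the current state straight to an action; this one scores each candidate action by the state the world model imagines.

```python
from typing import Callable, Sequence


def predict_next_state(state: float, action: float) -> float:
    """Hypothetical world model: a 1-D point that moves by `action` each step."""
    return state + action


def wam_policy(
    state: float,
    goal: float,
    candidate_actions: Sequence[float],
    world_model: Callable[[float, float], float] = predict_next_state,
    horizon: int = 3,
) -> float:
    """Return the candidate action whose imagined rollout ends closest to the goal."""

    def rollout_error(action: float) -> float:
        imagined = state
        # Imagine repeating the same action for `horizon` steps.
        for _ in range(horizon):
            imagined = world_model(imagined, action)
        return abs(goal - imagined)

    return min(candidate_actions, key=rollout_error)


best = wam_policy(state=0.0, goal=3.0, candidate_actions=[-1.0, 0.0, 0.5, 1.0])
print(best)  # 1.0: three imagined steps of +1.0 land exactly on the goal
```

Real WAMs predict images, video latents, or other rich state representations rather than a scalar, and usually produce actions jointly with the prediction instead of searching over candidates. The sketch only shows the dependency of the action on predicted future states.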

Surveys

Title	Authors	Year	Links
Vision-Language-Action (VLA) Models: Concepts, Progress, Applications and Challenges	Applied AI Research Lab	2025	arXiv
A Survey on Vision-Language-Action Models for Embodied AI	Ma et al.	2024	arXiv
Foundation Models for Embodied AI	Driess et al.	2024	arXiv

VLA Models

General VLA

#	Paper	Authors	Year	Links
1	AC2-VLA: Action-Context-Aware Adaptive Computation in VLA	Yu et al.	2026	arXiv
2	APPLV: Adaptive Planner Parameter Learning from VLA	Lu et al.	2026	arXiv
3	Act, Think or Abstain: Complexity-Aware Adaptive Inference for VLA	Izzo et al.	2026	arXiv
4	AnyCamVLA: Zero-Shot Camera Adaptation for Viewpoint Robust VLA	Heo et al.	2026	arXiv
5	CLARE: Continuous Learning for VLA via Adapter Routing	Römer et al.	2026	arXiv
6	DIAL: Decoupling Intent and Action via Latent World Modeling for VLA	Chen et al.	2026	arXiv
7	EAPruning: Adaptive Pruning with Interleaved Inference for VLA	Huang et al.	2026	arXiv
8	ETA-VLA: Efficient Token Adaptation	Wang et al.	2026	arXiv
9	FAVLA: Force-Adaptive Fast-Slow VLA	Li et al.	2026	arXiv
10	HarvestFlex: Harvesting via VLA Policy Adaptation	Zhao et al.	2026	arXiv
11	On-the-Fly VLA: VLA Adaptation via Test-Time RL	Liu et al.	2026	arXiv
12	ProbeFlow: Training-Free Adaptive Flow Matching for VLA	Fang et al.	2026	arXiv
13	RAFT: Adapting VLA Models via Force-aware Curriculum	Zhang et al.	2026	arXiv
14	ROBOGATE: Adaptive Failure Discovery for Safe Robot Policy	Kim et al.	2026	arXiv
15	SAMoE-VLA: Scene Adaptive Mixture-of-Experts VLA	You et al.	2026	arXiv
16	SCALE: Self-Uncertainty Adaptive Looking for VLA	Choi et al.	2026	arXiv
17	SOMA: Memory-Augmented System for VLA Robustness	Li et al.	2026	arXiv
18	VGAS: Adaptive Capacity Allocation for VLA	Kim et al.	2026	arXiv
19	VLA-Acceleration: Accelerate VLA through Visual Token Caching	Wei et al.	2026	arXiv
20	VGAS: Value-Guided Action-Chunk Selection for VLA	Xu et al.	2026	arXiv
21	VLANeXt: Recipes for Building Strong VLA Models	Liu et al.	2026	arXiv Project
22	HoloBrain-0: Technical Report	Horizon Robotics	2026	arXiv Project
23	FocusVLA: Focused Visual Utilization for VLA Models	Zhang et al.	2026	arXiv
24	StreamingVLA: Streaming VLA with Action Flow Matching	Shi et al.	2026	arXiv
25	ABot-M0: VLA with Action Manifold Learning	AMAP CVLab	2026	arXiv Project
26	SimVLA: A Simple VLA Baseline for Robotic Manipulation	FrontierRoBo	2026	arXiv Project
27	Lingbot-VLA: A Pragmatic VLA Foundation Model	Robbyant	2026	arXiv Project
28	AC-DiT: Adaptive Coordination Diffusion Transformer	Chen et al.	2025	arXiv
29	U-DiT: U-shaped Diffusion Transformers	Wu et al.	2025	arXiv
30	VLA-Adapter: Tiny-Scale VLA Paradigm	Wang et al.	2025	arXiv
31	Gemini Robotics: Bringing AI into the Physical World	DeepMind	2025	arXiv Project
32	π*0.6: A VLA That Learns From Experience	Black et al.	2025	arXiv Project
33	X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment VLA	Zheng et al.	2025	arXiv Project
34	UniVLA: Unified Vision-Language-Action Model	Wu et al.	2025	arXiv Project
35	SmolVLA: A VLA for Affordable and Efficient Robotics	LeRobot	2025	arXiv Project
36	NORA: A Small Open-Sourced Generalist VLA	Nandan et al.	2025	arXiv Project
37	VLA-0: Building State-of-the-Art VLAs with Zero Modification	VLA0	2025	arXiv Project
38	CronusVLA: Efficient Multi-Frame VLA	Li et al.	2025	arXiv Project
39	OpenVLA-OFT: Fine-Tuning VLAs: Optimizing Speed and Success	Kim et al.	2025	arXiv Project
40	AsyncVLA: Asynchronous Flow Matching for VLA	Jiang et al.	2025	arXiv Project
41	AVA-VLA: VLA with Active Visual Attention	Li et al.	2025	arXiv
42	A-VL: Adaptive Attention for Large VLA	Zhang et al.	2024	arXiv
43	ADEM-VL: Adaptive and Embedded Fusion for VLA	Hao et al.	2024	arXiv
44	OpenVLA: An Open-Source Vision-Language-Action Model	Kim et al.	2024	arXiv Project
45	Octo: An Open-Source Generalist Robot Policy	Ghosh et al.	2024	arXiv Project
46	π0: A Vision-Language-Action Flow Model for General Robot Control	Black et al.	2024	arXiv Project
47	RT-2: Vision-Language-Action Models	Brohan et al.	2023	arXiv Project
48	RT-1: Robotics Transformer for Real-World Control at Scale	Brohan et al.	2022	arXiv Project
49	VL-Adapter: Parameter-Efficient Transfer Learning	Sung et al.	2021	arXiv

VLA with Reasoning

#	Paper	Authors	Year	Links
1	ACoT-VLA: Action Chain-of-Thought for VLA Models	AgibotTech	2026	arXiv Project
2	Fast-ThinkAct: Efficient Vision-Language-Action Reasoning	Chen et al.	2026	arXiv
3	CoT-VLA: Visual Chain-of-Thought Reasoning for VLA	Chen et al.	2025	arXiv
4	ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning	Chen et al.	2025	arXiv
5	UniVLA: VLA with World Model	Wu et al.	2025	arXiv

VLA with 3D/4D Modeling

#	Paper	Authors	Year	Links
1	3D-VLA: A 3D Vision-Language-Action Generative World Model	Chen et al.	2024	arXiv
2	VoxPoser: 3D-Aware VLA	Huang et al.	2023	arXiv Project

Efficient VLA

#	Paper	Authors	Year	Links
1	FASTER: Rethinking Real-Time Flow VLAs	Liu et al.	2026	arXiv
2	SmolVLA: A Vision-Language-Action Model for Affordable Robotics	LeRobot	2025	arXiv
3	AsyncVLA: Asynchronous Flow Matching for VLA	Jiang et al.	2025	arXiv Project
4	OpenHelix: A Short Survey and Open-Source Dual-System VLA	Cui et al.	2025	arXiv
5	RTC: Running VLAs at Real-time Speed	Google DeepMind	2025	arXiv
6	AVA-VLA: VLA with Active Visual Attention	Li et al.	2025	arXiv

VLA with RL Fine-tuning

#	Paper	Authors	Year	Links
1	SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning	Pi Team	2025	arXiv
2	OpenVLA-OFT: Fine-Tuning VLAs: Optimizing Speed and Success	Kim et al.	2025	arXiv Project

WAM from Video Generation

#	Paper	Authors	Year	Links
1	DreamZero: World Action Models are Zero-shot Policies	Chen et al.	2026	arXiv Project
2	DiT4DiT: Jointly Modeling Video Dynamics and Actions	Ma et al.	2026	arXiv
3	Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning	NVIDIA	2026	arXiv Project
4	Video2Act: A Dual-System Video Diffusion Policy	Sun et al.	2025	arXiv

WAM from VLMs

#	Paper	Authors	Year	Links
1	Goal-VLA: Image-Generative VLMs as Object-Centric World Models Empower VLA	Lee et al.	2025	arXiv

WAM from Scratch

#	Paper	Authors	Year	Links
1	V-JEPA: Video Joint-Embedding Predictive Architecture	Bardes et al.	2024	arXiv
2	DreamerV3: Mastering Diverse Domains through World Models	Hafner et al.	2023	arXiv Project
3	I-JEPA: Image-based Joint-Embedding Predictive Architecture	Assran et al.	2023	arXiv

Action Representations

Discrete Tokenization

#	Paper	Authors	Year	Links
1	Behavior Transformers (BeT): Multimodal Action Discretization	Shafiullah et al.	2022	arXiv
2	Action Bins: Discretizing Continuous Actions	Brohan et al.	2023	arXiv
3	Action Tokens: Learning Discrete Action Spaces	Chi et al.	2023	arXiv
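As a rough illustration of what discretizing continuous actions means, the sketch below maps each action dimension to one of a fixed number of uniform bins and back again. The bin count of 256, the [-1, 1] action range, and the function names are assumptions made for this example, not the exact scheme of any paper in the table.

```python
from typing import List, Sequence

NUM_BINS = 256         # assumed number of bins per action dimension
LOW, HIGH = -1.0, 1.0  # assumed normalized action range


def discretize(action: Sequence[float]) -> List[int]:
    """Map each continuous value to a bin index in [0, NUM_BINS - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, LOW), HIGH)
        fraction = (clipped - LOW) / (HIGH - LOW)
        # The upper edge HIGH would give index NUM_BINS, so clamp it to the last bin.
        tokens.append(min(int(fraction * NUM_BINS), NUM_BINS - 1))
    return tokens


def undiscretize(tokens: Sequence[int]) -> List[float]:
    """Map bin indices back to the center of each bin."""
    width = (HIGH - LOW) / NUM_BINS
    return [LOW + (token + 0.5) * width for token in tokens]


tokens = discretize([-1.0, 0.0, 1.0, 2.0])
print(tokens)  # [0, 128, 255, 255]; the out-of-range 2.0 is clipped
print(undiscretize(tokens))
```

The round trip loses at most half a bin width per dimension (about 0.004 here), which is the precision cost of treating actions as tokens.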

Diffusion Policies

#	Paper	Authors	Year	Links
1	Diffusion Policy: Diffusion for Robot Control	Chi et al.	2023	arXiv Project
2	ACT/ALOHA: Action Chunking Transformer	Zhao et al.	2023	arXiv Project
3	Flow Matching Policy: Flow Matching for Action Generation	Zhou et al.	2024	arXiv Project
4	Transfusion: AR + Diffusion in One Transformer	Zhou et al.	2024	arXiv

Robotics Policies

#	Paper	Authors	Year	Links
1	RT-1: Robotics Transformer for Real-World Control	Brohan et al.	2022	arXiv Code
2	Diffusion Policy: Diffusion for Robot Control	Chi et al.	2023	arXiv
3	ACT/ALOHA: Action Chunking Transformer	Zhao et al.	2023	arXiv
4	Behavior Transformers (BeT): Multimodal Action	Shafiullah et al.	2022	arXiv
5	PerAct: A Multi-Task Transformer for Robotic Manipulation	Shridhar et al.	2022	arXiv

Resources

Datasets

Name	Description	Size	Links
OXE	Open X-Embodiment Dataset	500k+ episodes	Project
RT-1 Dataset	Real Robot Manipulation	130k episodes	Project
BridgeData	Robot Learning Dataset	70k episodes	Project
ALOHA	Bimanual Manipulation	10k+ episodes	Project
AgiBot World	Large-scale Robot Dataset	1M+ episodes	Project
UMI	Universal Manipulation Interface	15k episodes	Project
DROID	Distributed Robot Interaction Dataset	76k episodes	Project

Benchmarks

Name	Description	Links
LIBERO	Modular Benchmark for Robot Learning	Project
RLBench	Simulated Robot Learning Benchmark	Project
ManiSkill	Generalizable Manipulation	Project
CALVIN	Language-conditioned Manipulation	Project
RoboNet	Large-scale Robot Dataset	Project
Meta-World	Multi-task Benchmark	Project

Simulation Platforms

Name	Description	Links
Isaac Gym	NVIDIA GPU-accelerated Physics	Project
Isaac Lab	Robot Learning Framework	Project
Gazebo	Classic Robot Simulator	Project
MuJoCo	Physics Engine	Project
PyBullet	Physics Simulation	Project
Habitat	Embodied AI Simulation	Project
iGibson	Interactive Gibson Environment	Project

Tools & Frameworks

Name	Description	Links
LeRobot	Hugging Face Robotics Framework	Project
PyRobot	Robotics Learning Framework	Project
Robomimic	Imitation Learning Framework	Project
OmniGibson	Sim2Real Platform	Project
ManiSkill2	Manipulation Benchmark	Project

Contributing

Contributions are welcome! Paper discovery is automated, but manual contributions are also accepted:

Add a paper: Edit the README directly or open an issue
Fix errors: Submit a PR with corrections
Suggest improvements: Open an issue with your ideas

To run the paper scraper locally:

pip install -r requirements.txt
python scripts/arxiv_scraper.py --max-results 100 --days-back 180
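The scraper script itself is not shown in this README. As a rough sketch of what such a tool might do, the example below builds a query URL for the public arXiv API and parses an Atom response, keeping only entries newer than a cutoff. The function names, the example search query, and the way `--max-results` and `--days-back` map onto the arguments are assumptions; `scripts/arxiv_scraper.py` may work differently.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

ATOM = "{http://www.w3.org/2005/Atom}"  # XML namespace of Atom feeds


def build_query_url(query: str, max_results: int = 100) -> str:
    """Build an arXiv API URL that returns the newest submissions first."""
    params = {
        "search_query": query,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return "http://export.arxiv.org/api/query?" + urlencode(params)


def parse_recent(feed_xml: str, days_back: int, now: datetime) -> list:
    """Return (date, title) pairs published within `days_back` days, newest first."""
    cutoff = now - timedelta(days=days_back)
    papers = []
    for entry in ET.fromstring(feed_xml).iter(ATOM + "entry"):
        published = datetime.strptime(
            entry.findtext(ATOM + "published"), "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if published >= cutoff:
            # Titles in the feed can contain line breaks; collapse the whitespace.
            title = " ".join(entry.findtext(ATOM + "title").split())
            papers.append((published.date().isoformat(), title))
    return sorted(papers, reverse=True)


# Hypothetical search query, for illustration only.
print(build_query_url('all:"world action model"', max_results=100))
```

To use it against the live API, the response body could be fetched with `urllib.request.urlopen(url).read().decode()` and passed to `parse_recent` together with `datetime.now(timezone.utc)`.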

License

This repository is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Inspired by awesome-vla-wam
Inspired by awesome-physical-ai
Inspired by awesome-vla-study

If you find this repository useful, please consider giving it a ⭐
