This repo contains the CtrlAct solution for the Embodied Agent Interface Challenge at NeurIPS 2025.
- Official Website: https://neurips25-eai.github.io/
- Benchmark Dataset: https://huggingface.co/datasets/Inevitablevalor/EmbodiedAgentInterface
- 100 tasks in BEHAVIOR
- 338 tasks in VirtualHome
- Leaderboard: https://eval.ai/web/challenges/challenge-page/2621/leaderboard/6818
- 10 teams outperformed the baseline model
The competition includes four main tasks:
- Goal Interpretation: Understanding objectives and grounding them in environmental states.
- Subgoal Decomposition: Breaking complex goals into actionable steps.
- Action Sequencing: Planning coherent action sequences.
- Transition Modeling: Predicting environment state changes caused by actions.
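As a rough illustration of what the four tasks produce, they can be sketched as simple Python records. The class and field names here are hypothetical and do not reflect the benchmark's actual output format:

```python
from dataclasses import dataclass

# Hypothetical schemas for the four EAI tasks; names are illustrative only
# and do not match the benchmark's real JSON format.

@dataclass
class GoalInterpretation:
    # Ground a natural-language objective into symbolic goal conditions.
    instruction: str
    goal_conditions: list[str]  # e.g. ["inside(apple, fridge)"]

@dataclass
class SubgoalDecomposition:
    # Break a complex goal into an ordered list of intermediate states.
    goal: str
    subgoals: list[str]

@dataclass
class ActionSequencing:
    # A coherent sequence of grounded actions achieving the goal.
    actions: list[str]  # e.g. ["walk(kitchen)", "grab(apple)"]

@dataclass
class TransitionModeling:
    # Preconditions and effects predicted for a single action.
    action: str
    preconditions: list[str]
    effects: list[str]

plan = ActionSequencing(actions=["walk(kitchen)", "grab(apple)", "open(fridge)"])
```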
- Technical Report: https://openreview.net/forum?id=0dt9Ho6dXA
The goal of CtrlAct is to evaluate the performance of open-source models on the Embodied Agent Interface benchmark and analyze the performance gap between these models and top-ranked systems.
The following open-source models were evaluated:
- Qwen3-235B-A22B (thinking): https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
- Qwen3-30B-A3B (thinking): https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507-FP8
- Qwen3-Next-80B-A3B (thinking): https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking
- gpt-oss-120B (high): https://huggingface.co/openai/gpt-oss-120b
- gpt-oss-20B (high): https://huggingface.co/openai/gpt-oss-20b
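For convenience, the model list above can be kept in a small registry that assembles a `vllm serve` command line. This helper is hypothetical (not part of the repo); adjust `tensor_parallel` to your GPU count:

```python
# Hypothetical helper: maps the evaluated models to their Hugging Face repo
# IDs and builds a `vllm serve` invocation. Not part of the CtrlAct repo.

MODELS = {
    "qwen3-235b-a22b-thinking": "Qwen/Qwen3-235B-A22B-Thinking-2507-FP8",
    "qwen3-30b-a3b-thinking": "Qwen/Qwen3-30B-A3B-Thinking-2507-FP8",
    "qwen3-next-80b-a3b-thinking": "Qwen/Qwen3-Next-80B-A3B-Thinking",
    "gpt-oss-120b": "openai/gpt-oss-120b",
    "gpt-oss-20b": "openai/gpt-oss-20b",
}

def serve_command(alias: str, tensor_parallel: int = 4) -> str:
    """Build a `vllm serve` command for one of the evaluated models."""
    repo = MODELS[alias]
    return f"vllm serve {repo} --tensor-parallel-size {tensor_parallel}"

print(serve_command("gpt-oss-20b", tensor_parallel=1))
```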
Environment setup:

```shell
conda create -n ctrlact python=3.12
conda activate ctrlact
pip install vllm==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu128
pip install transformers==4.57.1
pip install flashinfer-python==0.4.1
pip install scikit-learn matplotlib pandas
```

Hardware:
- 4 NVIDIA H100 GPUs
- 8 NVIDIA L40S GPUs
We used Tinker for supervised fine-tuning (SFT) experiments as part of our evaluation pipeline.
- Tinker cookbook: https://github.com/thinking-machines-lab/tinker-cookbook
We thank the Tinker team for providing free credits that supported our large-scale model experiments.
- 2025-12-07: Technical report released.
- 2026-04-30: GitHub repository updated.
- 2026-05-03: vLLM inference code released.