MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft

🚧 TODO

Release the benchmark dataset on Hugging Face Datasets (expected within one week)
Release the Dockerfile for deploying the Minecraft sandbox environment (expected within one week)

MineExplorer is a benchmark for evaluating the open-world exploration capabilities of multimodal large language model (MLLM) agents in Minecraft. We first filter out atomic tasks whose solutions rely heavily on Minecraft-specific knowledge, so that the remaining tasks better reflect general open-world reasoning. We then organize the benchmark around a ReAct-style capability formulation and compose atomic tasks into implicit multi-hop tasks. To construct reliable instances, MineExplorer uses a multi-agent synthesis workflow that jointly designs task graphs, sandbox scenes, and rule-based milestone evaluators. Experiments show that open-world exploration remains challenging: strong models handle many single-hop tasks but degrade sharply when hidden prerequisites must be coordinated over longer trajectories, and neither larger models nor thinking modes consistently translate into better performance.

1. Environment Setup

Install the required Python packages:

pip install gymnasium numpy requests pillow loguru python-dotenv typer fastapi uvicorn pydantic imageio imageio-ffmpeg

Set the required environment variables:

export AGENT_API_KEY="your_api_key"
export AGENT_API_BASE="https://your-api-endpoint/v1/openai/native"
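
Before starting a long generation or evaluation run, it can help to confirm that both variables are visible to Python. The snippet below is a minimal sketch, not part of the repository: the helper name `missing_env_vars` is hypothetical, and only the two variable names come from the commands above.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = ("AGENT_API_KEY", "AGENT_API_BASE")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty.

    `missing_env_vars` is a hypothetical helper, not a function from this repo.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Report the state of the current shell environment without exiting.
missing = missing_env_vars()
print("Missing variables:", ", ".join(missing) if missing else "none")
```

An empty value counts as missing here, since an empty API key or base URL would most likely fail later anyway.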

2. Generating the Benchmark

Use generate_benchmark.py to generate Minecraft evaluation tasks. The benchmark directory contains the benchmark used in the paper, covering single-hop to 4-hop tasks.

python generate_benchmark.py multi \
    --model aws.claude-opus-4.6 \
    --num-samples 10 \
    --k-min 1 \
    --k-max 1 \
    --candidate-num 1 \
    --output benchmark_new

Key Arguments

Argument	Description
`multi` / `single`	Multi-agent or single-agent benchmark generation. The paper uses multi-agent mode, which produces more reliable instances but is slower due to sandbox interaction.
`--model`	Model name to use for generation
`--num-samples`	Number of samples to generate
`--k-min` / `--k-max`	Range of subtask hops per sample (e.g., set both to `1` for single-hop tasks only)
`--candidate-num`	Number of candidate atomic tasks
`--output`	Output directory
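
As an illustration of how `--k-min` and `--k-max` combine, the sketch below builds one command per hop count from 1 to 4, fixing both bounds to the same value each time. It uses only the arguments listed in the table. The helper `build_generation_command` and the `benchmark_{k}hop` directory names are hypothetical choices made for this example.

```python
import shlex


def build_generation_command(model, k, num_samples=10, candidate_num=1, mode="multi"):
    """Return the argv list for generating tasks with exactly k hops.

    Hypothetical helper: it only assembles the documented CLI arguments.
    """
    return [
        "python", "generate_benchmark.py", mode,
        "--model", model,
        "--num-samples", str(num_samples),
        "--k-min", str(k),  # lower bound on hops
        "--k-max", str(k),  # same upper bound, so every sample has k hops
        "--candidate-num", str(candidate_num),
        "--output", f"benchmark_{k}hop",  # hypothetical directory name
    ]


# Print one shell-quoted command per hop count, from single-hop to 4-hop.
for k in range(1, 5):
    print(shlex.join(build_generation_command("aws.claude-opus-4.6", k)))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to launch the run from Python instead of printing it.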

Output Structure

benchmark_new/
├── 0000/
│   └── multi-agent/
│       ├── metadata.json        # Scene configuration
│       ├── milestones.json      # Milestone definitions
│       ├── reasoning_graph.json # Dependency graph
│       └── debate_log.json      # Agent dialogue log
├── 0001/
│   └── multi-agent/
│       └── ...
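
The layout above can be read back with a few lines of Python. This is a minimal sketch under stated assumptions: the file names come from the tree above, but the JSON schemas are not documented here, so each file is loaded as an opaque object. The function name `load_instances` is hypothetical.

```python
import json
from pathlib import Path

# File names taken from the output tree above; contents are treated as opaque JSON.
INSTANCE_FILES = (
    "metadata.json",
    "milestones.json",
    "reasoning_graph.json",
    "debate_log.json",
)


def load_instances(benchmark_dir, mode="multi-agent"):
    """Map each sample ID (e.g. "0000") to the JSON files found for it.

    Hypothetical helper: entries without a `mode` subdirectory are skipped.
    """
    instances = {}
    for sample_dir in sorted(Path(benchmark_dir).iterdir()):
        mode_dir = sample_dir / mode
        if not mode_dir.is_dir():
            continue
        instances[sample_dir.name] = {
            name: json.loads((mode_dir / name).read_text(encoding="utf-8"))
            for name in INSTANCE_FILES
            if (mode_dir / name).is_file()
        }
    return instances
```

For example, `load_instances("benchmark_new")` returns a dictionary keyed by sample ID, which makes it easy to spot samples where one of the four files is missing.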

3. Evaluating the Benchmark

Use eval_benchmark.py to run an agent on the generated benchmark and evaluate its performance.

Using an OpenAI-compatible API

python eval_benchmark.py \
    --model aws.claude-opus-4.6 \
    --benchmark-dir benchmark_new \
    --output-dir results \
    --num-workers 10 \
    --resume

Using a Local vLLM Service

Start the vLLM server first:

python -m vllm.entrypoints.openai.api_server \
    --model Qwen2.5-7B \
    --port 8000
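
Before running the evaluation, it may be worth checking that the server is reachable and serving the expected model. The sketch below queries the `/v1/models` route of the OpenAI-compatible server. It assumes the server listens on `localhost` at port 8000, as in the command above. The helper names are hypothetical, and how `eval_benchmark.py` locates the server when `--use-vllm` is set is not documented here.

```python
import json
import urllib.request


def models_url(base_url):
    """Build the model-listing URL for an OpenAI-compatible server."""
    return base_url.rstrip("/") + "/v1/models"


def served_model_ids(base_url="http://localhost:8000", timeout=5.0):
    """Return the model IDs the server reports.

    Hypothetical helper; the default host and port are assumptions based on
    the launch command. Raises an error if the server is unreachable.
    """
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]
```

Calling `served_model_ids()` once the server has finished loading should list the name passed to `--model`; a connection error usually means the server is still starting.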

Then run evaluation:

python eval_benchmark.py \
    --model Qwen2.5-7B \
    --benchmark-dir benchmark_new \
    --output-dir results \
    --num-workers 10 \
    --use-vllm

Common Arguments

Argument	Description
`--model`	Model to use for evaluation
`--benchmark-dir`	Path to the benchmark directory
`--output-dir`	Directory to save results
`--num-workers`	Number of parallel sandbox workers
`--resume`	Resume from checkpoint (skip completed tasks)
`--limit`	Limit number of evaluation samples (for testing)

Output Structure

results/
└── aws.claude-opus-4.6/
    ├── 0000/
    │   ├── result.json       # Evaluation result
    │   ├── episode.mp4       # Episode replay video
    │   └── messages/         # Conversation logs
    ├── 0001/
    │   └── ...
    └── eval_summary.json     # Aggregated statistics
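
To inspect per-task outcomes alongside `eval_summary.json`, the per-task files can be gathered as sketched below. This is an illustration under stated assumptions: the schema of `result.json` is not documented here, so each file is loaded as an opaque object, and the helper name `collect_results` is hypothetical.

```python
import json
from pathlib import Path


def collect_results(model_dir):
    """Map each task ID to its parsed result.json.

    Hypothetical helper: task directories without a result.json are skipped.
    """
    results = {}
    for task_dir in sorted(Path(model_dir).iterdir()):
        result_file = task_dir / "result.json"
        if result_file.is_file():
            results[task_dir.name] = json.loads(result_file.read_text(encoding="utf-8"))
    return results
```

For example, `len(collect_results("results/aws.claude-opus-4.6"))` gives the number of tasks that currently have a result file, which is a quick way to gauge progress during a long run.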

Results

Citation

If you find this work useful, please cite:

@misc{ju2026mineexplorerevaluatingopenworldexploration,
      title={MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft}, 
      author={Tianjie Ju and Yueqing Sun and Zheng Wu and Wei Zhang and Yaqi Huo and Xi Su and Qi Gu and Xunliang Cai and Gongshen Liu and Zhuosheng Zhang},
      year={2026},
      eprint={2605.30931},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.30931}, 
}

License

This project is released under the MIT License.
