How to Evaluate Qwen3-VL on the TraceSpatial Benchmark?

Thanks for the great work.

I’m trying to reproduce your reported results by running Qwen3-VL inference on the TraceSpatial benchmark. However, when I use the system prompt provided in your repository, the model’s output format differs from what the evaluation script expects. As a result, the metric is either 0.0 or does not match the scores reported in the paper.

Could you please share the exact prompting or evaluation setup you used (e.g., system prompt, output format constraints, or post-processing steps) to obtain the reported results?
