Thanks for the great work.
I’m trying to reproduce your reported results by running Qwen3-VL inference on the TraceSpatial benchmark. However, when I use the system prompt provided in your repository, the model's output format differs from what the evaluation script expects, so the metric comes out as 0.0 or otherwise does not match the scores reported in the paper.
Could you please share the exact prompting or evaluation setup you used (e.g., system prompt, output format constraints, or post-processing steps) to obtain the reported results?