Some VLMs can now complete BabyVision-Gen tasks through actions such as writing code. Have you tested, or do you plan to test, non-image-generation VLMs on BabyVision-Gen?
For example, here is GPT-5.4 Thinking solving a maze:
https://chatgpt.com/share/69ae8b45-746c-8003-9b57-b8be05412d2a