Dear Siwei Wen,
I have a few questions regarding your work, and I would appreciate your insights:
- Have you compared your method with more widely recognized benchmarks or training-based methods, such as the GenImage test set or the AIGC Benchmark?
- In Table 2 of the paper, the results show significantly better performance than GPT-4o, with FakeVLM also performing far better than models like Qwen2-VL, DeepSeek, and InternVL. The last three LMMs are labeled using FakeClue, which raises some concerns about the quality of the FakeClue dataset. I examined the FakeClue test set and encountered severe quality issues. For example, for images from the Chameleon dataset such as:
- chameleon/fake/762482d5-d381-4586-8964-c3d567e233cf.jpg
- chameleon/fake/f1159a40-3475-4ffc-bd87-7ae9b73e925a.jpg
- chameleon/fake/939a1cf4-ba98-4570-abc7-a7a3f23c6b08.jpg
- chameleon/fake/71ec1b22-945b-4a6c-bfb7-36cb4e70bfa4.jpg
- chameleon/fake/99cd97e4-6c8f-4fe9-a80a-9b8d5f0971e8.jpg
- chameleon/fake/8f05fe0f-3b17-442b-9b8b-cde419445110.jpg
The explanations provided for these images offer very little information, and despite very different image contents, the reasoning is often generic: "Despite its seemingly authentic appearance, certain features, such as disproportionate textures or odd lighting, hint that this image was generated by AI." For the image chameleon/fake/492b2e59-5da3-4218-8ff0-ea6283c8bae3.jpg, the explanation merely repeats the content description, which looks like a response stitched together from multiple LMMs. In contrast, for images from the GenImage dataset, the explanations are overly verbose, sometimes exceeding 1500 characters for a single image. This leads to a further concern: given the questionable quality of these annotations, is it meaningful to compute ROUGE-L and CSS scores against them? (A short sketch of how I assume these metrics are computed follows below.)
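
To make the concern concrete, here is a minimal sketch of how I assume the two explanation metrics are computed: ROUGE-L via the `rouge_score` package and CSS as the cosine similarity between sentence embeddings (here with `all-MiniLM-L6-v2`). The exact setup in the paper may of course differ, and the `prediction` text below is a hypothetical model output, not taken from FakeVLM.

```python
# Sketch under assumptions: ROUGE-L from the `rouge_score` package, CSS taken
# to be cosine similarity between sentence embeddings; the paper's exact
# implementation may differ.
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

# Reference: the generic FakeClue-style explanation quoted above.
reference = ("Despite its seemingly authentic appearance, certain features, "
             "such as disproportionate textures or odd lighting, hint that "
             "this image was generated by AI.")
# Prediction: a hypothetical boilerplate "AI artifacts" answer.
prediction = ("The image shows unnatural lighting and inconsistent textures, "
              "suggesting it was generated by an AI model.")

# ROUGE-L: longest-common-subsequence overlap between reference and prediction.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, prediction)["rougeL"].fmeasure

# CSS (assumed): cosine similarity between sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb_ref, emb_pred = model.encode([reference, prediction], convert_to_tensor=True)
css = util.cos_sim(emb_ref, emb_pred).item()

print(f"ROUGE-L: {rouge_l:.3f}, CSS: {css:.3f}")
```

If the reference texts are this generic, almost any boilerplate explanation will score reasonably well on both metrics, which is exactly why I question how informative they are here.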
- Finally, I would like to request the release of the FakeVLM model weights to help clarify these issues; due to limited computational resources, I may not be able to fully reproduce the original training accuracy myself.
Thank you for your attention to these matters, and I look forward to your responses.
Best regards
P.S. Thank you for open-sourcing such an interesting piece of work!