
Questions Regarding Benchmark Comparisons, FakeClue Dataset Quality, and Request for FakeVLM Model Weights #10

@JoyHanson-34

Description

Dear Siwei Wen,

I have a few questions regarding your work, and I would appreciate your insights:

  1. Have you compared your method with more widely recognized benchmarks or training-based methods, such as the GenImage test set or AIGC Benchmark?

  2. In Table 2 of the paper, the results show significantly better performance than GPT-4o, and FakeVLM also performs far better than models such as Qwen2-VL, DeepSeek, and InternVL. Since the last three LMMs are labeled using FakeClue, this raises some concerns about the quality of the FakeClue dataset. I examined the FakeClue test set and encountered severe quality issues. For example, consider these images from the Chameleon dataset:

    • chameleon/fake/762482d5-d381-4586-8964-c3d567e233cf.jpg
    • chameleon/fake/f1159a40-3475-4ffc-bd87-7ae9b73e925a.jpg
    • chameleon/fake/939a1cf4-ba98-4570-abc7-a7a3f23c6b08.jpg
    • chameleon/fake/71ec1b22-945b-4a6c-bfb7-36cb4e70bfa4.jpg
    • chameleon/fake/99cd97e4-6c8f-4fe9-a80a-9b8d5f0971e8.jpg
    • chameleon/fake/8f05fe0f-3b17-442b-9b8b-cde419445110.jpg

The explanations provided for these images offer very little information, and despite varying image contents, the reasoning is often generic: "Despite its seemingly authentic appearance, certain features, such as disproportionate textures or odd lighting, hint that this image was generated by AI." For the image chameleon/fake/492b2e59-5da3-4218-8ff0-ea6283c8bae3.jpg, the explanation merely repeats the content description, which strongly suggests a response stitched together from multiple LMMs. In contrast, the explanations for images from the GenImage dataset are overly verbose, sometimes exceeding 1,500 characters for a single image.

This leads to a further concern: given the questionable quality of the annotations, is it meaningful to compute ROUGE_L and CSS scores under such conditions?
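To make the metric-validity concern concrete, here is a minimal ROUGE-L (LCS-based F-measure) sketch. The strings are illustrative placeholders, not actual FakeClue annotations: a model that always emits the boilerplate template scores perfectly against templated references, so a high ROUGE_L need not reflect any image-specific reasoning.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-measure over whitespace tokens (Lin, 2004)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)

template = ("despite its seemingly authentic appearance certain features "
            "such as disproportionate textures or odd lighting hint that "
            "this image was generated by ai")

# Always emitting the template yields a perfect score against a
# templated reference, regardless of what the image shows.
print(rouge_l(template, template))  # → 1.0

# A short but image-specific observation sharing no tokens with the
# template scores zero, even though it may be the better explanation.
print(rouge_l("the shadows fall in inconsistent directions", template))  # → 0.0
```

In other words, when references are near-identical boilerplate, ROUGE_L rewards regurgitating the template rather than genuine per-image reasoning.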

  3. Finally, I would like to request the release of the FakeVLM model weights, which would help clarify these issues; due to limited computational resources, I may not be able to fully replicate the original training accuracy.

Thank you for your attention to these matters, and I look forward to your responses.

Best regards,
P.S. Thank you for open-sourcing such an interesting piece of work!
