Performance variance in top-down (text-prompted) patch selection with custom images #14

@huhanwj

Hi everyone,

First, thank you for this incredible work on PS3! I'm exploring the model's capabilities, particularly the top-down selection feature.

I've noticed a significant difference in the precision of text-prompted patch selection when using my own images compared to the dock.jpg example provided in the README.md.

While the demo image shows a very clear and accurate shift in the attention heatmap based on the text prompt, the model's selection on my images is often diffuse or doesn't strongly correlate with the prompt, even for seemingly simple queries.
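To put a number on "diffuse", one option is the normalized Shannon entropy of the selection map. The sketch below is not part of the PS3 codebase: it assumes the per-patch selection scores have already been extracted as a 2D grid of non-negative values (how to obtain them from the model is not shown), and `normalized_entropy` is a hypothetical helper name.

```python
import math

def normalized_entropy(scores):
    """Shannon entropy of a 2D score map, scaled to the range [0, 1].

    `scores` is a list of rows of non-negative numbers (hypothetical input).
    A value near 0 means the mass sits on very few patches; a value near 1
    means the map is close to uniform, i.e. diffuse.
    """
    flat = [v for row in scores for v in row]
    total = sum(flat)
    if len(flat) < 2 or total <= 0:
        raise ValueError("need at least two patches and a positive total score")
    # Normalize scores into a probability distribution over patches.
    probs = [v / total for v in flat]
    # Zero-probability patches contribute nothing to the entropy.
    entropy = sum(p * math.log(1 / p) for p in probs if p > 0)
    # Divide by the maximum possible entropy, log(number of patches).
    return entropy / math.log(len(flat))

# Toy maps for illustration only, not real model output.
focused = [[0.0, 0.0], [0.0, 1.0]]
diffuse = [[0.25, 0.25], [0.25, 0.25]]
print(normalized_entropy(focused))
print(normalized_entropy(diffuse))
```

Comparing this value for the same prompt on dock.jpg and on a custom image would give a rough, prompt-independent measure of how concentrated the selection is. It says nothing about whether the concentrated region is the correct one.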

My main question is: Is this performance variance expected for images that might be "out-of-distribution" compared to the (currently unreleased) pre-training dataset?

The train/README.md mentions the training data is not yet public, so I suspect this might be the core reason. Could you provide any insights on the types of images or scenes the model is most proficient with?
