Performance variance in top-down (text-prompted) patch selection with custom images #14

Hi everyone,

First, thank you for this incredible work on PS3! I'm exploring the model's capabilities, particularly the top-down selection feature.

I've noticed a significant difference in the precision of text-prompted patch selection when using my own images compared to the `dock.jpg` example provided in the `README.md`.

While the demo image shows a very clear and accurate shift in the attention heatmap based on the text prompt, the model's selection on my images is often diffuse or doesn't strongly correlate with the prompt, even for seemingly simple queries.
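To make "diffuse" concrete, the spread of a selection map can be scored with two simple numbers. The snippet below is a minimal sketch and not part of the PS3 codebase: it assumes the selection scores have already been extracted as a 2D grid of non-negative values (one per patch), and the names `selection_map`, `normalized_entropy`, and `top_k_mass` are hypothetical. How to obtain that grid from the model is deliberately not shown.

```python
import math


def _flatten(selection_map):
    """Flatten a 2D grid of scores (list of rows) and validate it."""
    flat = [float(v) for row in selection_map for v in row]
    if len(flat) < 2:
        raise ValueError("need at least two patches")
    if any(v < 0 for v in flat):
        raise ValueError("scores must be non-negative")
    if sum(flat) <= 0:
        raise ValueError("scores must not all be zero")
    return flat


def normalized_entropy(selection_map):
    """Shannon entropy of the normalized scores, scaled to [0, 1].

    Values near 0 mean the mass sits on very few patches (sharp);
    values near 1 mean it is spread almost uniformly (diffuse).
    """
    flat = _flatten(selection_map)
    total = sum(flat)
    probs = [v / total for v in flat]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(flat))


def top_k_mass(selection_map, k):
    """Fraction of the total score carried by the k highest-scoring patches."""
    flat = sorted(_flatten(selection_map), reverse=True)
    return sum(flat[:k]) / sum(flat)


if __name__ == "__main__":
    # Hypothetical toy grids, not real model output.
    sharp = [[9.0, 0.5], [0.25, 0.25]]
    diffuse = [[1.0, 1.0], [1.0, 1.0]]
    print(normalized_entropy(sharp), top_k_mass(sharp, 1))
    print(normalized_entropy(diffuse), top_k_mass(diffuse, 1))
```

Comparing these two numbers for `dock.jpg` and for a custom image under the same prompt would show whether the gap is large or only looks large in the heatmap rendering.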

My main question is: Is this performance variance expected for images that might be "out-of-distribution" compared to the (currently unreleased) pre-training dataset?

The `train/README.md` mentions that the training data is not yet public, so I suspect a distribution mismatch might be the core reason. Could you share any insight into the types of images or scenes the model handles best?
