Hi everyone,
First, thank you for this incredible work on PS3! I'm exploring the model's capabilities, particularly the top-down selection feature.
I've noticed a significant difference in the precision of text-prompted patch selection when using my own images compared to the dock.jpg example provided in the README.md.
While the demo image shows a very clear and accurate shift in the attention heatmap based on the text prompt, the model's selection on my images is often diffuse or doesn't strongly correlate with the prompt, even for seemingly simple queries.
My main question is: Is this performance variance expected for images that might be "out-of-distribution" compared to the (currently unreleased) pre-training dataset?
The train/README.md mentions the training data is not yet public, so I can't verify this myself, but I suspect a distribution mismatch is the core reason. Could you provide any insights on the types of images or scenes the model is most proficient with?