Inconsistency between the paper and the code #6

Hi! The **Modality Interaction Task** section of the paper says that the **visual modality is used as the query** and the **text modality as the key and value**. However, in lines **497-499** of **`model_init.py`** in the code you provided, the first argument passed to _CrossAttention_ is `text_tokens`, and _CrossAttention_ uses its first input as the **query**. Is this an error in the provided code?
![image](https://github.com/user-attachments/assets/d9f2b04a-26a6-483a-a315-bc8542ffeee9)
![image](https://github.com/user-attachments/assets/34027bbf-8f6f-4a42-8bc3-27fe63b13097)
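
To show why the argument order matters, here is a minimal pure-Python sketch of single-head cross-attention. It is not the repository's `CrossAttention`: the function `cross_attention`, the variable `visual_tokens`, and the toy values are all hypothetical, and learned projections are omitted. It only shows that the output has one row per query token, so swapping the roles changes which modality the output is aligned with.

```python
import math

def cross_attention(query, key, value):
    """Scaled dot-product attention without learned projections (illustrative only).

    query: list of n_q vectors; key and value: lists of n_kv vectors.
    Returns n_q vectors, one per query token.
    """
    d = len(query[0])
    out = []
    for q in query:
        # Similarity of this query token to every key token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in key]
        # Numerically stable softmax over the key positions.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, value))
                    for j in range(len(value[0]))])
    return out

# Hypothetical toy inputs: 3 visual tokens and 5 text tokens, feature dimension 4.
visual_tokens = [[0.1 * (i + j) for j in range(4)] for i in range(3)]
text_tokens = [[0.05 * (i * j + 1) for j in range(4)] for i in range(5)]

# Paper's description: visual as query, text as key and value.
paper_variant = cross_attention(visual_tokens, text_tokens, text_tokens)
# Order reported in this issue: text as query (visual assumed as key and value).
code_variant = cross_attention(text_tokens, visual_tokens, visual_tokens)

print(len(paper_variant), len(code_variant))
```

With these toy inputs the first call yields one output vector per visual token and the second one per text token, so the two orderings are not interchangeable.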


