Hi! The Modality Interaction Task section of the paper says that the visual modality is used as the query and the text modality as the key and value. However, in the code you provided (lines 497-499 of “model_init.py”), the first argument passed to CrossAttention is 'text_tokens', and CrossAttention treats its first input as the query, so the text modality ends up as the query instead. Is this an error in the provided code?
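
To make the mismatch concrete, here is a minimal sketch of how I understand the two argument orders. It is not the repository's code: `cross_attention` is a hypothetical single-head helper without learned projections, and the toy token values are made up. It only illustrates that the output of cross-attention has one row per query token, so the two orders give different results.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, key, value):
    # Hypothetical helper: single-head scaled dot-product attention,
    # no learned projections. query is a list of n_q vectors,
    # key and value are lists of n_kv vectors.
    d = len(query[0])
    output = []
    for q in query:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in key]
        weights = softmax(scores)
        # Each output row is a weighted average of the value vectors.
        row = [sum(w * v[j] for w, v in zip(weights, value)) for j in range(len(value[0]))]
        output.append(row)
    return output


# Toy data (assumed): 3 visual tokens and 2 text tokens, dimension 2.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[0.5, 0.5], [1.0, -1.0]]

# Order described in the paper: visual as query, text as key and value.
paper_order = cross_attention(visual_tokens, text_tokens, text_tokens)
# Order I see in the code: text as query.
code_order = cross_attention(text_tokens, visual_tokens, visual_tokens)

print(len(paper_order), len(code_order))  # 3 2
```

With visual tokens as the query, the output has one row per visual token (text information gathered into the visual sequence); with text tokens as the query, it has one row per text token.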

