I'm reviewing your code for the Visual Transformer. I have some questions and hope you can answer them.
First, what does token_wV mean in your code? Why do you multiply wV with the input feature? In the paper, the authors don't include this step.

The second question is about how you define self.token_wA and self.token_wV. Why are they defined with the batch size as a dimension? Can I define them without the batch size and expand them in forward, as is done for cls_tokens? I don't want to be tied to a fixed batch size. I'm also not sure whether wA and wV are trainable in the model.
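
To make the second question concrete, here is a minimal sketch of what I have in mind. It is not taken from your repository: the class name `TokenizerSketch`, the tensor shapes, the initialization, and the way wA and wV are applied are all my assumptions. The weights have a leading dimension of 1, are wrapped in `nn.Parameter` so they are registered as trainable, and are expanded to the batch size inside forward.

```python
import torch
import torch.nn as nn


class TokenizerSketch(nn.Module):
    """Hypothetical tokenizer; names and shapes are assumptions, not the repo's code."""

    def __init__(self, num_tokens=8, in_channels=64, token_dim=64):
        super().__init__()
        # Leading dimension is 1 instead of the batch size.
        # nn.Parameter registers the tensors so the optimizer updates them.
        self.token_wA = nn.Parameter(torch.empty(1, num_tokens, in_channels))
        self.token_wV = nn.Parameter(torch.empty(1, in_channels, token_dim))
        nn.init.xavier_uniform_(self.token_wA)
        nn.init.xavier_uniform_(self.token_wV)

    def forward(self, x):
        # x: (batch, num_pixels, in_channels)
        b = x.shape[0]
        # expand() creates a broadcast view, so no weights are copied per sample.
        wA = self.token_wA.expand(b, -1, -1)  # (b, num_tokens, in_channels)
        wV = self.token_wV.expand(b, -1, -1)  # (b, in_channels, token_dim)
        # Attention of each token over all pixels (softmax over the pixel axis).
        attn = torch.bmm(wA, x.transpose(1, 2)).softmax(dim=-1)  # (b, num_tokens, num_pixels)
        values = torch.bmm(x, wV)  # (b, num_pixels, token_dim)
        return torch.bmm(attn, values)  # (b, num_tokens, token_dim)


model = TokenizerSketch()
for batch_size in (2, 5):
    tokens = model(torch.randn(batch_size, 49, 64))
    print(batch_size, tuple(tokens.shape))
```

If this is equivalent to your version, the same module should work for any batch size, and both weights should show up in `model.parameters()`.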