Is there any reason for using a Q-Former as the audio projection layer instead of a linear layer, as is done for images?