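Using (AxB)[0][0] = sum over i of A[0][i]*B[0][i] (with B stored transposed) instead of (AxB)[0][0] = sum over i of A[0][i]*B[i][0] might improve cache locality, because in row-major storage both operands are then read along contiguous rows rather than one of them being read down a column. I guess that's why torch.nn.Linear stores its weight transposed, with shape (out_features, in_features).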
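To make the two index patterns concrete, here is a rough pure-Python sketch. The function names are made up for illustration, and plain Python lists do not actually have the contiguous memory layout that makes the cache effect real, so this only shows that the two access orders give the same product when the second matrix is stored transposed. It is not a benchmark.

```python
def matmul_rows_by_cols(A, B):
    """C[r][c] = sum_i A[r][i] * B[i][c]: B is walked down a column."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[r][i] * B[i][c] for i in range(k)) for c in range(m)]
            for r in range(n)]


def matmul_rows_by_rows(A, Bt):
    """C[r][c] = sum_i A[r][i] * Bt[c][i]: Bt holds B transposed,
    so both operands are walked along a row."""
    n, k, m = len(A), len(A[0]), len(Bt)
    return [[sum(A[r][i] * Bt[c][i] for i in range(k)) for c in range(m)]
            for r in range(n)]


A = [[1, 2, 3], [4, 5, 6]]               # 2 x 3
B = [[7, 8], [9, 10], [11, 12]]          # 3 x 2
Bt = [list(col) for col in zip(*B)]      # 2 x 3, B stored transposed

C1 = matmul_rows_by_cols(A, B)
C2 = matmul_rows_by_rows(A, Bt)
print(C1 == C2)
```

In a real row-major array (C, NumPy default, torch tensors), the inner loop of the second version reads consecutive memory for both A and Bt, which is the locality argument above.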