Hi, I have the following question: is there a reason why SISO uses "interleaved" ordering (xoxoxoxo) while MIMO uses "stretched" ordering (xxxxoooo) when applying the data-dependent RoPE to the B/C projections?
I haven't looked into it closely. I'm vibe-porting Mamba-3 to another lib, and the AI apparently picked up on this difference. I'd guess it makes no difference (the channel ordering shouldn't matter as long as it stays consistent), but I was wondering whether there's a reason I'm just not aware of.
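To make my guess concrete, here is a minimal pure-Python sketch of what I mean. It is my own illustration, not code from the Mamba-3 repo: the function names are made up, and the fixed angles are arbitrary stand-ins for the data-dependent ones. It only checks that the B·C inner product comes out the same in both layouts once the channels are permuted consistently.

```python
import math

def rope_interleaved(x, angles):
    """Rotate adjacent pairs (x[2i], x[2i+1]) by angles[i] (xoxoxoxo layout)."""
    out = list(x)
    for i, theta in enumerate(angles):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

def rope_half_split(x, angles):
    """Rotate pairs (x[i], x[i + d/2]) by angles[i] (xxxxoooo layout)."""
    half = len(x) // 2
    out = list(x)
    for i, theta in enumerate(angles):
        a, b = x[i], x[i + half]
        c, s = math.cos(theta), math.sin(theta)
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

def to_half_split(x):
    """Permute an interleaved vector (xoxoxoxo) into half-split order (xxxxoooo)."""
    return x[0::2] + x[1::2]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Arbitrary example vectors standing in for one B and one C projection.
b = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 3.0, -2.0]
c = [1.0, 0.5, -0.5, 2.0, 0.1, -1.2, 0.7, 0.3]
# Arbitrary angles, one per channel pair (assumption: stand-ins for the
# data-dependent angles at two different positions).
angles_b = [0.1, 0.7, 1.3, 2.1]
angles_c = [0.4, -0.2, 0.9, 1.6]

score_interleaved = dot(rope_interleaved(b, angles_b), rope_interleaved(c, angles_c))
score_half_split = dot(
    rope_half_split(to_half_split(b), angles_b),
    rope_half_split(to_half_split(c), angles_c),
)

# Same pairs get the same rotations, so the inner product is unchanged.
assert math.isclose(score_interleaved, score_half_split)
```

If that's right, the two orderings differ only by a fixed channel permutation. The one place I'd expect it to matter is when loading pretrained weights, where the layout has to match whatever the checkpoint was trained with.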
Sorry in advance if the AI made a mistake and I didn't catch it, and thanks for both the papers and the project!