Hi,
Thank you for the hard work you've put into this project. I really appreciate the effort and dedication that has gone into maintaining Tora.
I have a question regarding the difference between the model architecture described in the paper and the current implementation.
The paper mentions that a cross-attention mechanism is used in both the S-DiT-B and T-DiT-B blocks.
However, from reviewing the code, I couldn't find where this cross-attention is actually implemented.
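
For concreteness, this is roughly the kind of layer I expected to find inside each block: queries come from the block's hidden states and keys/values come from the conditioning tokens. This is just a minimal sketch of generic cross-attention to illustrate my question; the class and parameter names (`CrossAttention`, `to_q`, `to_kv`, etc.) are my own, not from the repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """Minimal multi-head cross-attention sketch (not the repo's code):
    queries from the block's hidden states, keys/values from the
    conditioning sequence."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x:       (B, N, D) hidden states of the DiT block
        # context: (B, M, D) conditioning tokens
        B, N, D = x.shape
        h = self.num_heads
        q = self.to_q(x).view(B, N, h, D // h).transpose(1, 2)
        k, v = self.to_kv(context).chunk(2, dim=-1)
        k = k.view(B, -1, h, D // h).transpose(1, 2)
        v = v.view(B, -1, h, D // h).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)  # (B, h, N, D // h)
        out = out.transpose(1, 2).reshape(B, N, D)
        return self.proj(out)
```
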
If I’m missing something, please let me know.
Thanks again!