The accuracy and robustness of key point selection seem to be crucial, and they depend on the "Self-supervised Pretraining" step.
However, this step needs "a single subject and its set of aligned different-modality scans".
How do you get enough aligned images to train this module?
It looks like we would need to do a groupwise registration first, right?