Awesome work!
Do you have any insights on using DiTs instead of U-Nets for your Any2Any models?
As I understand it, remote sensing images carry more local structure and less global semantic information than general-domain images, so a convolution-based U-Net should already be a good fit.