Hi,
many thanks for your amazing work and for open-sourcing the code. While experimenting with the codebase, I found a critical bug in the `PositionalEncoding` class implementation ([here](https://github.com/DiffPoseTalk/DiffPoseTalk/blob/main/models/common.py#L22)).
The buggy line is:

```python
x = x + self.pe[:, x.shape[1], :]
```

and the corrected version is:

```python
x = x + self.pe[:, :x.shape[1], :].requires_grad_(False)
```

(Note the in-place setter is `requires_grad_`, not `requires_grad`, which is a property.)
The intent is to add the positional encodings of the first `x.shape[1]` positions to the input sequence, but the missing `:` turns the slice into an integer index. As a result, the sequence dimension is dropped and the single encoding at position `x.shape[1]` is broadcast onto every element of the input sequence.
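To make the difference concrete, here is a minimal sketch using NumPy (the indexing semantics are the same as PyTorch's); the shapes are assumed to match the usual `(1, max_len, d_model)` buffer layout of the class:

```python
import numpy as np

# Positional-encoding buffer of shape (1, max_len, d_model), as in the class.
max_len, d_model = 10, 4
pe = np.arange(max_len * d_model, dtype=float).reshape(1, max_len, d_model)

# Input batch with sequence length 3.
x = np.zeros((2, 3, d_model))

# Buggy version: the integer index x.shape[1] drops the sequence axis and
# selects only the single encoding at position 3, shape (1, d_model).
buggy = pe[:, x.shape[1], :]
assert buggy.shape == (1, d_model)
# Broadcasting then adds that one encoding to every sequence element.
assert (x + buggy).shape == (2, 3, d_model)

# Fixed version: the slice keeps the first x.shape[1] encodings,
# shape (1, 3, d_model), so each position gets its own encoding.
fixed = pe[:, :x.shape[1], :]
assert fixed.shape == (1, 3, d_model)
assert (x + fixed).shape == (2, 3, d_model)
```

Both versions broadcast without error, which is why the bug is silent: the output shape is identical, but in the buggy case all positions receive the same encoding.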