I am confused about this sentence in your paper "GPT Understands, Too":
Moreover, in the inference, we only need the output embedding h and can discard the LSTM head.
If the LSTM encoder is used during training, and the final embeddings are a combination of the LSTM encoder's outputs and the original embeddings, then discarding the LSTM at inference would mean the final embeddings are just the outputs of the two embedding layers. Wouldn't this change the performance?
So why can the LSTM be discarded at inference?
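To make my understanding concrete, here is a minimal PyTorch sketch of how I picture the prompt encoder (all names and dimensions are my own guesses, not from your released code):

```python
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """My guess at the training-time prompt encoder: trainable pseudo-token
    embeddings passed through a bi-LSTM and an MLP to produce h."""

    def __init__(self, num_prompt_tokens=8, dim=128):
        super().__init__()
        # trainable input embeddings for the pseudo prompt tokens
        self.input_embeds = nn.Parameter(torch.randn(num_prompt_tokens, dim))
        # bidirectional LSTM: hidden size dim//2 per direction -> output dim
        self.lstm = nn.LSTM(dim, dim // 2, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self):
        # (1, num_prompt_tokens, dim) -> LSTM -> MLP -> prompt embeddings h
        out, _ = self.lstm(self.input_embeds.unsqueeze(0))
        return self.mlp(out).squeeze(0)

enc = PromptEncoder()
h = enc()  # shape: (num_prompt_tokens, dim)
```

If this sketch is roughly right, h depends only on the trained parameters and not on the input text, so perhaps it can be computed once and cached. But I am not sure whether that is the reasoning in the paper, which is why I am asking.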
Thanks a lot.