Beyond External Guidance:
Unleashing the Semantic Richness Inside Diffusion Transformers for Improved Training
Lingchen Sun1,2 | Rongyuan Wu1,2 | Zhengqiang Zhang1 | Ruibin Li1 | Yujing Sun1,2 | Shuaizheng Liu1,2 | Lei Zhang1,2
1The Hong Kong Polytechnic University, 2OPPO Research Institute
Both shallow and deep layers gradually learn more discriminative patterns over time, but the shallow layers progress much more slowly. This indicates that the slow convergence of DiT is mainly due to the difficulty of learning clean and semantically rich features in its shallow layers.
We answer the question: Can internal features serve as effective semantic guidance signals for training the shallow layers? We introduce Self-Transcendence, a simple yet effective self-guided training strategy that achieves REPA-level performance without any external feature supervision. The proposed approach produces features that are more discriminative and semantically richer than those of the pre-trained DINO encoder used in REPA. Our method significantly improves training efficiency and generation quality, achieving FID=1.25 in just 400 epochs.
- 2026.1.12: The paper and this repo are released.
⭐ If Self-Transcendence is helpful to your images or projects, please help star this repo. Thanks! 🤗
We find that the most effective guiding features should meet two criteria:
(1) they should have a clean structure, in the sense that they can effectively help shallow blocks distinguish noise from signal.
(2) they should be semantically discriminative, making it easier for shallow layers to learn effective representations.
With these considerations, we propose a two-stage training framework.
(a) First, we use clean VAE features as guidance to help the model distinguish useful information from noise in the shallow layers. (b) After a certain number of iterations, once the model has learned more meaningful representations, we freeze it and use its representation as a fixed teacher. To enhance the semantic expressiveness of the features, we build a self-guided representation that better aligns with the target conditions.
VAE-based alignment accelerates SiT training, while leveraging this model for self-transcendence leads to further improvements.
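Below is a minimal sketch of how such a two-stage alignment objective could be wired up. It is illustrative only: names such as `alignment_loss`, `projector`, `shallow_feats`, `vae_feats`, and `frozen_teacher` are hypothetical and the actual loss weighting and feature extraction in this repo may differ.

```python
# A hedged sketch of the two-stage alignment objective described above (not the
# exact implementation). Stage 1 aligns shallow DiT features with clean VAE
# features; stage 2 aligns them with a frozen copy of the stage-1 model.
import torch
import torch.nn.functional as F


def alignment_loss(student_feats: torch.Tensor,
                   target_feats: torch.Tensor,
                   projector: torch.nn.Module) -> torch.Tensor:
    """Negative patch-wise cosine similarity between projected student
    features and detached target features (REPA-style alignment)."""
    pred = F.normalize(projector(student_feats), dim=-1)      # (B, N, D)
    target = F.normalize(target_feats.detach(), dim=-1)       # stop-gradient on the teacher/target
    return -(pred * target).sum(dim=-1).mean()


# Stage 1 (hypothetical training step): guide shallow blocks with clean VAE features
# so they learn to separate signal from noise.
#   loss = diffusion_loss + lambda_align * alignment_loss(shallow_feats, vae_feats, projector)
#
# Stage 2 (hypothetical training step): freeze the stage-1 model as a fixed teacher
# and align the student's shallow features with the teacher's representation,
# with no external encoder involved.
#   with torch.no_grad():
#       teacher_feats = frozen_teacher.extract_features(noisy_latents, t, cond)
#   loss = diffusion_loss + lambda_align * alignment_loss(shallow_feats, teacher_feats, projector)
```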
If our code helps your research or work, please consider citing our paper with the following BibTeX entry:
@article{sun2026beyond,
title={Beyond External Guidance: Unleashing the Semantic Richness Inside Diffusion Transformers for Improved Training},
author={Sun, Lingchen and Wu, Rongyuan and Zhang, Zhengqiang and Li, Ruibin and Sun, Yujing and Liu, Shuaizheng and Zhang, Lei},
journal={arXiv preprint arXiv:2601.07773},
year={2026}
}
This project is released under the Apache 2.0 license.
This project is based on REPA. Thanks for the awesome work.
If you have any questions, please contact: ling-chen.sun@connect.polyu.hk



