
Beyond External Guidance:
Unleashing the Semantic Richness Inside Diffusion Transformers for Improved Training

Lingchen Sun<sup>1,2</sup> | Rongyuan Wu<sup>1,2</sup> | Zhengqiang Zhang<sup>1</sup> | Ruibin Li<sup>1</sup> | Yujing Sun<sup>1,2</sup> | Shuaizheng Liu<sup>1,2</sup> | Lei Zhang<sup>1,2</sup>

<sup>1</sup>The Hong Kong Polytechnic University, <sup>2</sup>OPPO Research Institute

[Figure: Self-Transcendence]

🧡 Summary

Both shallow and deep layers gradually learn more discriminative patterns over time, but the shallow layers progress much more slowly. This indicates that the slow convergence of DiT mainly stems from the difficulty of learning clean, semantically rich features in the shallow layers.

We answer the question: *Can internal features serve as effective semantic guidance signals for training the shallow layers?* We introduce Self-Transcendence, a simple yet effective self-guided training strategy that achieves REPA-level performance without any external feature supervision. Our approach produces more discriminative and semantically richer features than the pre-trained DINO features used in REPA, and it significantly improves training efficiency and generation quality, achieving FID = 1.25 at just 400 epochs.

[Figure: Self-Transcendence]

⏰ Update

  • 2026.1.12: The paper and this repo are released.

⭐ If Self-Transcendence is helpful to your research or projects, please help star this repo. Thanks! 🤗

🌟 Overview of the framework

We find that the most effective guiding features should meet two criteria:

(1) they should have a clean structure, in the sense that they can effectively help shallow blocks distinguish noise from signal.

(2) they should be semantically discriminative, making it easier for shallow layers to learn effective representations.

With these considerations, we propose a two-stage training framework.

(a) First, we use clean VAE features as guidance to help the model distinguish useful information from noise in its shallow layers.

(b) After a certain number of iterations, the model has learned more meaningful representations. We then freeze this model and use its representations as a fixed teacher. To enhance the semantic expressiveness of the features, we build a self-guided representation that better aligns with the target conditions.

VAE-based alignment accelerates SiT training, while leveraging this model for self-transcendence leads to further improvements.

Citations

If our code helps your research or work, please consider citing our paper. The BibTeX reference is:

@article{sun2026beyond,
  title={Beyond External Guidance: Unleashing the Semantic Richness Inside Diffusion Transformers for Improved Training},
  author={Sun, Lingchen and Wu, Rongyuan and Zhang, Zhengqiang and Li, Ruibin and Sun, Yujing and Liu, Shuaizheng and Zhang, Lei},
  journal={arXiv preprint arXiv:2601.07773},
  year={2026}
}

License

This project is released under the Apache 2.0 license.

Acknowledgement

This project is based on REPA. Thanks for the awesome work.

Contact

If you have any questions, please contact: ling-chen.sun@connect.polyu.hk
