- Dec 16, 2025: 🚀 We are excited to announce the release of LongCat-Video-Avatar, a unified model that delivers expressive and highly dynamic audio-driven character animation, supporting native tasks including Audio-Text-to-Video, Audio-Text-Image-to-Video, and Video Continuation with seamless compatibility for both single-stream and multi-stream audio inputs. The release includes our Technical Report, code, model weights, and project page.
- Support Multiple Generation Modes: one unified model can be used for audio-text-to-video (AT2V) generation, audio-text-image-to-video (ATI2V) generation, and Video Continuation.
- Natural Human Dynamics: The disentangled unconditional guidance is designed to effectively decouple speech signals from motion dynamics for natural behavior.
- Avoid Repetitive Content: The reference skip attention is adopted to​ strategically incorporates reference cues to preserve identity while preventing excessive conditional image leakage
- Alleviate Error Accumulation from VAE: Cross-Chunk Latent Stitching is designed to eliminates redundant VAE decode-encode cycles to reduce pixel degradation in long sequences