10 utility nodes for generating long videos in ComfyUI by splitting them into overlapping chunks with rolling reference frames.
Designed to solve the "context window reversal" problem — where video generation models (Wan 2.1/2.2, FantasyPortrait, etc.) revert to the starting state after ~135 frames because the reference image embedding pulls the generation back.
Now with per-chunk text prompting — change the narrative as your video progresses.
Most image-to-video models have a limited context window (~81–137 frames). When you try to generate longer videos, the model "forgets" the accumulated motion and snaps back to looking like frame 1. The result is an obvious loop or identity shift.
Rolling Reference: Generate video in chunks, using the last frame of each chunk as the reference image for the next chunk. This keeps identity consistent while allowing pose/action to progress naturally.
Chunk 0: [ref=original portrait] → 81 frames → last frame becomes next ref
Chunk 1: [ref=chunk 0 last frame] → 81 frames → last frame becomes next ref
Chunk 2: [ref=chunk 1 last frame] → 81 frames → ...
Final: Concatenate all chunks → 240+ frame seamless video
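A minimal sketch of that loop in Python, assuming a hypothetical `i2v_generate` function standing in for whatever image-to-video sampler you drive:

```python
import torch

def rolling_reference(ref_image: torch.Tensor, num_chunks: int,
                      chunk_frames: int = 81) -> torch.Tensor:
    chunks, ref = [], ref_image
    for i in range(num_chunks):
        # `i2v_generate` is a placeholder for your I2V model call;
        # it returns a [frames, H, W, C] image batch.
        video = i2v_generate(ref, num_frames=chunk_frames)
        if i > 0:
            video = video[1:]  # drop the duplicate of the previous reference
        chunks.append(video)
        ref = video[-1:]       # last frame becomes the next chunk's reference
    return torch.cat(chunks, dim=0)
```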
| Node | Purpose |
|---|---|
| Extract Video Chunk | Pull chunk N from a longer driving video, with configurable overlap |
| Blend Video Chunks (Crossfade) | Crossfade-blend two overlapping pixel-space chunks into one seamless video |
| Blend Latent Chunks (Pre-Decode) | Join two latent chunks before VAE decode — supports slerp, hard_cut, crossfade |
| Concat Video Chunks | Simple concatenation with optional first-frame trim (for rolling reference) |
| Get Frame By Index | Extract a single frame by index (-1 = last frame = next reference) |
| Get Frame Range | Extract a range of frames with negative indexing support |
| Video Chunk Planner | Calculate chunking strategy — shows chunk count, frame ranges, workflow steps |
Wan-Specific Nodes (2 nodes — require ComfyUI-WanVideoWrapper)
| Node | Purpose |
|---|---|
| Wan Chunked I2V Sampler ♾️ | All-in-one node: encode→sample→decode→extract ref→repeat. Supports per-chunk text prompting. |
| Wan Chunk Calculator 🧮 | Calculate exact total_frames for N chunks with 4n+1 normalization |
| Node | Purpose |
|---|---|
| Chain Text Embeds 🔗 | Chain multiple WanVideoTextEncode outputs into an ordered sequence for per-chunk text conditioning |
The Wan nodes gracefully degrade — if WanVideoWrapper isn't installed, the 7 core nodes still load and work fine.
Search for VideoChunkTools in ComfyUI Manager and click Install.
cd ComfyUI/custom_nodes
git clone https://github.com/gregtee2/ComfyUI_VideoChunkTools.git
Alternatively, download the ZIP from GitHub and extract it to ComfyUI/custom_nodes/ComfyUI_VideoChunkTools/.
Restart ComfyUI after installation.
No pip dependencies required — uses only PyTorch (already in ComfyUI).
Use the core utility nodes to build a rolling-reference pipeline with any image-to-video model:
- Video Chunk Planner → See how many chunks you need
- Extract Video Chunk (index=0) → Get first driving chunk
- Run your I2V model with your original reference image
- Get Frame By Index (index=-1) → Extract last frame as new reference
- Extract Video Chunk (index=1) → Get next driving chunk
- Run your I2V model with the new reference
- Blend Video Chunks or Concat Video Chunks → Join the results
- Repeat for each chunk
The Wan Chunked I2V Sampler handles everything in a single node:
- Connect your Wan model, VAE, and start image
- Set `total_frames` (e.g., 241 for ~15 seconds at 16fps)
- Set `chunk_frames` (e.g., 81)
- Hit Queue — the node generates all chunks automatically

Features:
- Single-pass or two-pass sampling (connect `model_b` for split denoising)
- FLF (First-Last-Frame) — connect an `end_image` to guide the final frame
- Multi-keyframe FLF — provide a batch of end images to distribute across chunks
- Crossfade overlap — set `end_blend_chunks` for smooth FLF transitions
- Auto 4n+1 normalization — chunk sizes are automatically adjusted for Wan's requirements
- Per-chunk text prompts — connect a `ChainTextEmbeds` node to change the text conditioning per chunk
Change the narrative as your video progresses — each chunk can have its own text prompt:
- Add multiple WanVideoTextEncode nodes, each with a different prompt
- Connect them to a Chain Text Embeds 🔗 node (`embed_1`, `embed_2`, `embed_3`, ...)
- Connect the `embed_sequence` output to the sampler's `text_embed_sequence` input
- Chunk 1 uses embed_1, chunk 2 uses embed_2, etc.
- If you have fewer prompts than chunks, the last prompt repeats for remaining chunks
[WanVideoTextEncode: "A cat sleeps on a sofa"]──┐
[WanVideoTextEncode: "The cat wakes up"]─────────┼──▶ [Chain Text Embeds 🔗]──▶ text_embed_sequence
[WanVideoTextEncode: "The cat jumps off"]────────┘
│
[Wan Model + VAE + Image]──────────────────────────▶ [Wan Chunked I2V Sampler ♾️]
Tip: You can still use the single `text_embeds` input if you want the same prompt for all chunks. The sequence input takes priority when connected.
Divides a video into chunks with overlap. Adjacent chunks share overlap_frames at their boundary.
| Input | Type | Default | Description |
|---|---|---|---|
| images | IMAGE | — | Full driving video |
| chunk_index | INT | 0 | Which chunk to extract (0-based) |
| chunk_frames | INT | 81 | Frames per chunk |
| overlap_frames | INT | 16 | Frames shared between adjacent chunks |
| Output | Type | Description |
|---|---|---|
| chunk | IMAGE | Extracted chunk |
| total_chunks | INT | How many chunks cover the full video |
| chunk_index | INT | Pass-through for chaining |
| is_last_chunk | BOOLEAN | True if this is the final chunk |
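For intuition, here is one plausible way the chunk boundaries fall out of these inputs (a sketch; the node's internal math may differ):

```python
import math

def plan_chunks(total_frames: int, chunk_frames: int = 81, overlap_frames: int = 16):
    stride = chunk_frames - overlap_frames          # new frames per chunk
    total_chunks = max(1, math.ceil((total_frames - overlap_frames) / stride))
    for i in range(total_chunks):
        start = i * stride
        yield i, start, min(start + chunk_frames, total_frames)

# 240 frames, 81-frame chunks, 16-frame overlap ->
# chunk 0: [0, 81), chunk 1: [65, 146), chunk 2: [130, 211), chunk 3: [195, 240)
```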
Crossfade two overlapping pixel-space chunks. The last N frames of chunk_a smooth-transition into the first N frames of chunk_b.
| Input | Type | Default | Description |
|---|---|---|---|
| chunk_a | IMAGE | — | First (earlier) chunk |
| chunk_b | IMAGE | — | Second (later) chunk |
| overlap_frames | INT | 16 | Frames to crossfade |
| blend_curve | ENUM | ease_in_out | linear, ease_in_out, sigmoid |
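Roughly what the crossfade computes, sketched in PyTorch (the exact curve shapes are assumptions here):

```python
import torch

def crossfade(chunk_a: torch.Tensor, chunk_b: torch.Tensor,
              overlap: int = 16, curve: str = "ease_in_out") -> torch.Tensor:
    t = torch.linspace(0.0, 1.0, overlap)       # per-frame blend weight
    if curve == "ease_in_out":
        t = t * t * (3.0 - 2.0 * t)             # smoothstep
    elif curve == "sigmoid":
        t = torch.sigmoid((t - 0.5) * 10.0)     # steep S-curve (not exactly 0..1)
    w = t.view(-1, 1, 1, 1)                     # broadcast over H, W, C
    blended = chunk_a[-overlap:] * (1 - w) + chunk_b[:overlap] * w
    return torch.cat([chunk_a[:-overlap], blended, chunk_b[overlap:]], dim=0)
```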
Join two 5D video latents along the temporal dimension before VAE decode. Operates in latent space — overlap is in latent frames (Wan: pixel_overlap / 4).
| Input | Type | Default | Description |
|---|---|---|---|
| latent_a | LATENT | — | First latent chunk |
| latent_b | LATENT | — | Second latent chunk |
| overlap_frames | INT | 4 | Overlap in latent temporal frames |
| blend_curve | ENUM | hard_cut | hard_cut, slerp, linear, ease_in_out, sigmoid |
hard_cut (recommended for rolling reference): Clean cut at the overlap midpoint — no dissolve artifacts.
slerp: Spherical linear interpolation — preserves latent vector magnitude. Standard technique for diffusion model interpolation.
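For reference, a standard slerp between two latent tensors looks like this (a sketch, not the node's exact code):

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    a_flat, b_flat = a.flatten(), b.flatten()
    cos = torch.dot(a_flat, b_flat) / (a_flat.norm() * b_flat.norm())
    omega = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))  # angle between latents
    so = torch.sin(omega)
    if so.abs() < 1e-6:
        return (1 - t) * a + t * b  # nearly parallel: plain lerp is safe
    # Weighted by sines so the interpolated vector keeps its magnitude.
    return (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b
```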
Simple concatenation, intended for rolling-reference workflows where chunk B's first frame naturally matches chunk A's last frame.
| Input | Type | Default | Description |
|---|---|---|---|
| chunk_a | IMAGE | — | First chunk |
| chunk_b | IMAGE | — | Second chunk |
| trim_b_start | INT | 1 | Frames to trim from B's start (1 = drop duplicate ref frame) |
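The operation itself is one line; a sketch of the trim behavior:

```python
import torch

def concat_chunks(chunk_a: torch.Tensor, chunk_b: torch.Tensor,
                  trim_b_start: int = 1) -> torch.Tensor:
    # With rolling reference, B's first frame duplicates A's last frame,
    # so dropping one frame from B avoids a visible stutter.
    return torch.cat([chunk_a, chunk_b[trim_b_start:]], dim=0)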
Extract a single frame. Use -1 for the last frame (rolling reference).
Extract a range of frames. Supports negative indexing. end=0 means "to the end".
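Both nodes boil down to tensor indexing on ComfyUI's [frames, height, width, channels] IMAGE layout; a sketch:

```python
import torch

def get_frame(images: torch.Tensor, index: int = -1) -> torch.Tensor:
    # Negative indexing works as in Python: -1 is the last frame,
    # which is exactly the next rolling reference.
    return images[index].unsqueeze(0)   # keep the batch dim -> valid IMAGE

def get_frame_range(images: torch.Tensor, start: int = 0, end: int = 0) -> torch.Tensor:
    # end=0 mirrors the node's "to the end" convention.
    return images[start:] if end == 0 else images[start:end]
```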
Outputs the total number of chunks needed and a detailed text plan showing frame ranges and workflow steps.
All-in-one node for Wan I2V models. See the Workflows section above.
Simple math: calculates total_frames = chunk_frames + (num_chunks - 1) * (chunk_frames - 1) with 4n+1 normalization.
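Spelled out in Python (the rounding direction of the 4n+1 snap is an assumption):

```python
def wan_total_frames(num_chunks: int, chunk_frames: int = 81) -> int:
    chunk_frames = 4 * ((chunk_frames - 1) // 4) + 1   # snap to 4n+1
    # Each later chunk re-uses the previous last frame, so it adds
    # chunk_frames - 1 new frames.
    return chunk_frames + (num_chunks - 1) * (chunk_frames - 1)

# 3 chunks of 81 frames -> 81 + 2 * 80 = 241 (~15 s at 16fps)
```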
Chains up to 8 pre-encoded text embeddings into an ordered sequence for per-chunk text conditioning.
| Input | Type | Required | Description |
|---|---|---|---|
| embed_1 | WANVIDEOTEXTEMBEDS | Yes | Text embedding for chunk 1 |
| embed_2–embed_8 | WANVIDEOTEXTEMBEDS | No | Text embeddings for chunks 2–8 |
| Output | Type | Description |
|---|---|---|
| embed_sequence | TEXT_EMBED_SEQUENCE | Ordered list of embeddings — connect to sampler's text_embed_sequence input |
If you have fewer embeds than chunks, the last embed repeats for all remaining chunks. Non-connected slots are skipped (embed_1 + embed_3 = 2-entry sequence).
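The fallback rule amounts to clamping the chunk index into the sequence; a sketch:

```python
def embed_for_chunk(embed_sequence: list, chunk_index: int):
    # Chunks past the last provided embed re-use the final entry,
    # matching the "last embed repeats" behavior described above.
    return embed_sequence[min(chunk_index, len(embed_sequence) - 1)]
```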
- Overlap of 16 frames works well for most cases (about 1 second at 16fps)
- ease_in_out blend curve gives the smoothest pixel-space transitions
- hard_cut in latent space is usually best — rolling reference already makes the overlap zone match
- slerp is the gold standard for latent interpolation if you need blending
- For Wan models, chunk_frames must be 4n+1 (5, 9, 13, ..., 77, 81, 85, ...). The nodes auto-normalize this.
- Use Video Chunk Planner first to understand how your video will be divided
- ComfyUI (any recent version)
- PyTorch (included with ComfyUI)
- ComfyUI-WanVideoWrapper — only needed for the 2 Wan-specific nodes. The 7 core nodes work without it.
MIT — see LICENSE
Built by Greg Tee for the ComfyUI community.