
Enhancing Text-to-Video Editing with Motion Map Injection



Abstract

Building on the remarkable performance of text-to-image diffusion models, text-guided video editing has recently been studied extensively. Existing video editing studies introduce cross-frame attention to implicitly estimate inter-frame attention, producing temporally consistent videos. However, because these methods use models pre-trained on text-image pair data, they do not handle the unique property of video: motion. When editing a video with prompts, the attention map of a prompt word implying the motion of the video (e.g., 'running', 'moving') is prone to be poorly estimated, which causes inaccurate video editing. To address this problem, we propose the 'Motion Map Injection' (MMI) module, which considers motion explicitly. The MMI module provides text-to-video (T2V) models a simple but effective way to convey motion information in three steps: 1) extracting the motion map, 2) calculating the similarity between the motion map and the attention map of each prompt word, and 3) injecting the motion map into the attention maps. Experimental results show that input videos can be edited accurately with the MMI module. To the best of our knowledge, ours is the first method that utilizes the motion in a video for text-to-video editing.
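For intuition, here is a minimal PyTorch sketch of steps 2 and 3 on a single frame. The function name, the cosine-similarity measure, and the blending weight `alpha` are illustrative assumptions, not the exact formulation used in this repository.

```python
import torch
import torch.nn.functional as F

def inject_motion_map(attn_maps, motion_map, alpha=0.5):
    """Sketch of MMI steps 2-3 for one frame.

    attn_maps:  (num_tokens, H, W) cross-attention map of each prompt token
    motion_map: (H, W) motion magnitude (e.g. optical-flow norm), in [0, 1]
    alpha:      hypothetical blending weight (illustrative, not from the paper)
    """
    # Step 2: similarity between the motion map and each token's attention map.
    flat_attn = attn_maps.flatten(start_dim=1)                # (num_tokens, H*W)
    flat_motion = motion_map.flatten().unsqueeze(0)           # (1, H*W)
    sim = F.cosine_similarity(flat_attn, flat_motion, dim=1)  # (num_tokens,)

    # Step 3: inject the motion map into each attention map, weighted by its
    # similarity, so motion words (e.g. "flowing") get the strongest injection.
    weights = alpha * sim.clamp(min=0)                        # (num_tokens,)
    injected = attn_maps + weights[:, None, None] * motion_map[None]

    # Renormalize each map so it still sums to 1 over spatial positions.
    return injected / injected.flatten(start_dim=1).sum(dim=1)[:, None, None]
```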

Examples

You can find more experimental results on our project page.

| Input Video | Video-P2P | Video-P2P + Ours |
| --- | --- | --- |
| "There is a spreading water on the man" | "There is a colorful spreading water on the man" | "There is a colorful spreading water on the man" |
| "clouds flowing on the mountain" | "lava flowing on the mountain" | "lava flowing on the mountain" |
| "spinning wings of windmill are beside the river" | "yellow spinning wings of windmill are beside the river" | "yellow spinning wings of windmill are beside the river" |

Setup

The environment is very similar to Video-P2P.

The versions of the packages we installed are:

torch: 1.12.1
xformers: 0.0.15.dev0+0bad001.d20230712

For xformers, we installed it through the link provided by Video-P2P.

pip install -r requirements.txt
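After installation, you can sanity-check the pinned versions (a quick sketch):

```python
# Verify the installed versions against the pins listed above.
import torch
import xformers

print("torch:", torch.__version__)        # expected: 1.12.1 (a CUDA suffix may follow)
print("xformers:", xformers.__version__)  # expected: 0.0.15.dev0+0bad001.d20230712
```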

Weights

We use a pre-trained Stable Diffusion model. You can download it here.

Quickstart

Our code is built on the Video-P2P codebase, so you can refer to their GitHub repository if needed.

Please replace `pretrained_model_path` in the config files with the path to your Stable Diffusion weights.

To download the pre-trained model, please refer to diffusers.
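For example, a Stable Diffusion checkpoint can be fetched with diffusers and saved locally (the `CompVis/stable-diffusion-v1-4` checkpoint name here is an assumption; use the checkpoint linked above):

```python
# Sketch: download Stable Diffusion weights via diffusers and save them locally,
# then point pretrained_model_path in the configs at the saved directory.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe.save_pretrained("./checkpoints/stable-diffusion-v1-4")
```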

# Stage 1: Tuning for model initialization.
# You can reduce the number of tuning epochs to speed this up.
python run_tuning.py --config="configs/cloud-1-tune.yaml"

# Stage 2: Attention control.
python run_attention_flow.py --config="configs/cloud-1-p2p.yaml" --motion_prompt "<your motion prompt>"

# For example, if the editing prompt is "clouds flowing under a skyscraper",
# the motion prompt is "flowing":
python run_attention_flow.py --config="configs/cloud-1-p2p.yaml" --motion_prompt "flowing"
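The motion prompt names the word whose attention map should follow the video's motion. For illustration, here is a sketch of how a motion map (step 1 of MMI) can be extracted with dense optical flow; OpenCV's Farneback estimator is our assumption here, and the repository's actual flow estimator may differ.

```python
# Sketch: build a per-frame motion map from dense optical flow (MMI step 1).
import cv2
import numpy as np

def extract_motion_map(prev_frame, next_frame):
    """prev_frame, next_frame: (H, W, 3) uint8 BGR frames of the input video."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    magnitude = np.linalg.norm(flow, axis=2)      # per-pixel motion strength
    return magnitude / (magnitude.max() + 1e-8)   # normalize to [0, 1]
```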

Find your results in Video-P2P/outputs/xxx/results.
