How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach

(ICCV 2025)

Chirui Chang · Jiahui Liu · Zhengzhe Liu · Xiaoyang Lyu · Yi-Hua Huang · Xin Tao · Pengfei Wan · Di Zhang · Xiaojuan Qi
The University of Hong Kong | Kling Team, Kuaishou Technology | Lingnan University
Corresponding author

arXiv Paper · Project Page


[Figure: L3DE teaser]

L3DE Video Evaluation Pipeline

This repository provides a one-pass video evaluation pipeline: given one or more videos, it extracts frames, runs three external feature extractors (DINOv2, RAFT optical flow, UniDepth monocular depth) via small proxy scripts, packs the results into a fixed format, feeds them to the 3D-aware scorer (L3DE), and finally writes per-video scores.
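As a rough illustration of the "pack" step, the sketch below stacks per-frame feature files into one array per modality. The directory names follow the output layout shown in Section 4.1; the per-file format and the exact structure L3DE consumes are assumptions here, not the pipeline's actual API.

from pathlib import Path

import numpy as np

def load_modality(workdir: Path, name: str) -> np.ndarray:
    # Stack per-frame .npy features for one modality (layout from Section 4.1).
    files = sorted((workdir / name).glob("*.npy"))
    return np.stack([np.load(f) for f in files])  # shape: (num_frames, ...)

def pack_features(workdir: Path) -> dict:
    # Gather appearance / motion / geometry features into one dict.
    # The format L3DE actually expects is defined by l3de_pipeline.py.
    return {m: load_modality(workdir, m) for m in ("dinov2", "flows", "mdepth")}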

Note on third-party code
The proxy/ directory contains three vendored open-source tools (DINOv2, RAFT, UniDepth) together with the extractor scripts that call them.


1. Repository structure

A recommended layout:

.
├── l3de_pipeline.py        # main pipeline script (video → frames → features → L3DE score)
├── environment.yml         # conda environment to reproduce your setup
├── proxy/
│   ├── dinov2/             # Appearance proxy: DINOv2 code (third-party)
│   │   └── extract_*.py    # proxy script to call the corresponding model
│   ├── RAFT/               # Motion proxy: RAFT optical-flow code (third-party)
│   │   └── extract_*.py    # proxy script to call the corresponding model
│   └── UniDepth/           # Geometry proxy: UniDepth code (third-party)
│       └── extract_*.py    # proxy script to call the corresponding model
├── weights/
│   └── L3DE.pth            # your trained L3DE checkpoint
├── README.md
└── LICENSE

2. Installation

conda env create -f environment.yml
conda activate L3DE

  • The provided environment.yml pins the dependencies needed to run the pipeline.
  • If your CUDA / GPU setup is different, install a compatible torch build after activating the environment (see the example below).
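For example, for a CUDA 12.1 setup you could pull matching wheels from the official PyTorch index (swap the cu121 tag for your CUDA version):

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121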

3. Checkpoints

Obtain the trained L3DE checkpoint (L3DE.pth) and place it at:

weights/
└── L3DE.pth
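Before running the pipeline, you can sanity-check the file with a quick load; this is just a smoke test, assuming a standard torch-serialized checkpoint (the actual contents are defined by the training code):

import torch

# Smoke test: confirm the checkpoint deserializes on CPU and peek at its keys.
# Assumes a standard torch.save'd object; adjust if the repo stores it differently.
ckpt = torch.load("weights/L3DE.pth", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])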

4. Usage

4.1 Single video

Run:

python l3de_pipeline.py \
  --input ./examples/demo.mp4 \
  --work-root ./workdir \
  --l3de-weights ./weights/L3DE.pth

What happens:

  1. The script samples frames from the video (by default: 25 frames from the first 4 seconds; see the sketch after this list).

  2. The script calls the three proxy extractors under proxy/ (DINOv2, RAFT, UniDepth).

  3. The script packs the three modalities into the format that L3DE expects.

  4. The script runs the L3DE model and writes a score file.
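For reference, here is a minimal sketch of the default sampling in step 1, using OpenCV; it is an assumed re-implementation for illustration, and the actual script may select frames differently:

import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 25, max_seconds: float = 4.0):
    # Evenly sample num_frames frames from the first max_seconds of the video.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    limit = min(total, int(round(fps * max_seconds)))
    indices = np.linspace(0, max(limit - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames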

You should get a structure like:

workdir/
└── demo/
    ├── frames/          # extracted RGB frames
    ├── dinov2/          # appearance features
    ├── flows/           # RAFT optical flow
    ├── mdepth/          # depth / geometry features
    ├── l3de_scores.npy  # optional numpy dump
    └── scores.json      # {"l3de_score": ...}
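The final score can then be read back directly; the key name matches the scores.json shown above:

import json

# Load the per-video result written by the pipeline.
with open("workdir/demo/scores.json") as f:
    result = json.load(f)
print(result["l3de_score"])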

4.2 Directory of videos

python l3de_pipeline.py \
  --input ./videos \
  --video-glob "*.mp4" \
  --work-root ./workdir \
  --l3de-weights ./weights/L3DE.pth

  • Every matched video will be processed.
  • A CSV will be written, collecting all per-video scores.
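To post-process the batch results, the CSV can be loaded with the standard library; the file path and column name below are assumptions (inspect the header of the CSV the pipeline writes first):

import csv

# Assumed location and header; the actual path/columns are set by l3de_pipeline.py.
with open("workdir/scores.csv") as f:
    rows = list(csv.DictReader(f))
scores = [float(row["l3de_score"]) for row in rows]
print(f"{len(scores)} videos, mean L3DE score = {sum(scores) / len(scores):.4f}")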

Citation

Please cite our paper if you find our work helpful.

@inproceedings{chang2025far,
  title={How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach},
  author={Chang, Chirui and Liu, Jiahui and Liu, Zhengzhe and Lyu, Xiaoyang and Huang, Yi-Hua and Tao, Xin and Wan, Pengfei and Zhang, Di and Qi, Xiaojuan},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={10307--10317},
  year={2025}
}

Acknowledgement

This repository makes use of several excellent open-source projects:

  • DINOv2 for appearance/feature extraction.
  • RAFT for dense optical flow estimation.
  • UniDepth for geometry/depth cues.
