How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach
Chirui Chang
·
Jiahui Liu
·
Zhengzhe Liu
·
Xiaoyang Lyu
·
Yi-Hua Huang
·
Xin Tao
·
Pengfei Wan
·
Di Zhang
·
Xiaojuan Qi✉
The University of Hong Kong | Kling Team, Kuaishou Technology | Lingnan University
✉Corresponding author
This repository provides a one-pass video evaluation pipeline: given one or more videos, it extracts frames, runs three external feature extractors (DINOv2, RAFT optical flow, and UniDepth monocular depth) via small proxy scripts, packs the results into a fixed format, feeds them to a 3D-aware scorer (L3DE), and finally writes per-video scores.
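For reference, the same pipeline can also be driven from Python instead of the shell. The sketch below only uses the CLI flags documented in this README; the paths are placeholders.

# Minimal sketch: invoking the pipeline programmatically for one video.
# Only the CLI flags documented in this README are used; paths are examples.
import subprocess
from pathlib import Path

def score_video(video: Path, work_root: Path, weights: Path) -> None:
    """Run l3de_pipeline.py on a single video and wait for it to finish."""
    subprocess.run(
        [
            "python", "l3de_pipeline.py",
            "--input", str(video),
            "--work-root", str(work_root),
            "--l3de-weights", str(weights),
        ],
        check=True,  # raise CalledProcessError if the pipeline exits non-zero
    )

score_video(Path("examples/demo.mp4"), Path("workdir"), Path("weights/L3DE.pth"))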
Note on third-party code
The proxy/ directory contains three vendored open-source tools (DINOv2, RAFT, and UniDepth) plus the extractor scripts that call them.
A recommended layout:
.
├── l3de_pipeline.py       # main pipeline script (video → frames → features → L3DE score)
├── environment.yml        # conda environment to reproduce your setup
├── proxy/
│   ├── dinov2/            # Appearance proxy: DINOv2 code (third-party)
│   │   └── extract_*.py   # proxy scripts to call the corresponding models
│   ├── RAFT/              # Motion proxy: RAFT optical-flow code (third-party)
│   │   └── extract_*.py   # proxy scripts to call the corresponding models
│   └── UniDepth/          # Geometry proxy: UniDepth code (third-party)
│       └── extract_*.py   # proxy scripts to call the corresponding models
├── weights/
│   └── L3DE.pth           # your trained L3DE checkpoint
├── README.md
└── LICENSE
conda env create -f environment.yml
conda activate L3DE
- The provided environment.yml reproduces the environment used to run the pipeline.
- If your CUDA / GPU setup is different, install a compatible torch after activating the environment.
- Download the pre-trained L3DE model from: Google Drive
Then place it as:
weights/
└── L3DE.pth
Run:
python l3de_pipeline.py \
--input ./examples/demo.mp4 \
--work-root ./workdir \
--l3de-weights ./weights/L3DE.pth
What happens:
- The script samples frames from the video (by default: 25 frames from the first 4 seconds); see the frame-sampling sketch after this list.
- The script calls the three proxy extractors under proxy/ (DINOv2, RAFT, UniDepth).
- The script packs the three modalities into the format that L3DE expects.
- The script runs the L3DE model and writes a score file.
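As a reference for the first step, here is a minimal frame-sampling sketch, assuming OpenCV and NumPy; the output filename pattern is an illustrative choice, and the actual logic lives in l3de_pipeline.py.

# Sketch of the default sampling: 25 frames from the first 4 seconds.
import cv2
import numpy as np
from pathlib import Path

def sample_frames(video_path, out_dir, num_frames=25, max_seconds=4.0):
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Restrict sampling to the first `max_seconds` of the clip.
    limit = max(1, min(total, int(round(fps * max_seconds))))
    indices = np.linspace(0, limit - 1, num_frames).astype(int)
    for i, idx in enumerate(indices):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        # Illustrative filename pattern; the pipeline may name frames differently.
        cv2.imwrite(str(out_dir / f"{i:05d}.png"), frame)
    cap.release()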
You should get a structure like:
workdir/
└── demo/
├── frames/ # extracted RGB frames
├── dinov2/ # appearance features
├── flows/ # RAFT optical flow
├── mdepth/ # depth / geometry features
├── l3de_scores.npy # optional numpy dump
└── scores.json # {"l3de_score": ...}
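Once a run finishes, the results can be read back as follows; the JSON key matches the structure above, while the exact contents of the optional .npy dump depend on your run.

# Read back the per-video outputs of a single run.
import json
import numpy as np

with open("workdir/demo/scores.json") as f:
    result = json.load(f)
print("L3DE score:", result["l3de_score"])

# The optional numpy dump, if it was written.
scores = np.load("workdir/demo/l3de_scores.npy")
print("Raw score array shape:", scores.shape)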
To score every video in a directory (batch mode):
python l3de_pipeline.py \
--input ./videos \
--video-glob "*.mp4" \
--work-root ./workdir \
--l3de-weights ./weights/L3DE.pth
- Every video matching --video-glob will be processed.
- A CSV collecting all per-video scores will be written; see the snippet below for reading it back.
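A quick way to inspect the collected results; the CSV filename below is an assumption, so adjust it to whatever the script actually writes under your --work-root.

# Inspect the per-video score CSV produced by a batch run.
# The path below is an assumption; point it at the file the script writes.
import csv

with open("workdir/l3de_scores.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row)  # one entry per processed video, including its L3DE score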
Please cite our paper if you find our work helpful.
@inproceedings{chang2025far,
title={How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach},
author={Chang, Chirui and Liu, Jiahui and Liu, Zhengzhe and Lyu, Xiaoyang and Huang, Yi-Hua and Tao, Xin and Wan, Pengfei and Zhang, Di and Qi, Xiaojuan},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={10307--10317},
year={2025}
}
This repository makes use of several excellent open-source projects:
- DINOv2 (appearance features)
- RAFT (optical flow)
- UniDepth (monocular depth / geometry)