
DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding

ArXiv · Hugging Face Dataset

About

DynamicVL is a comprehensive framework for analyzing long-term urban dynamics through remote sensing imagery. This repository ships the DVL-Suite dataset, task-specific benchmarks, and evaluation scripts covering both vision-language tasks (multiple-choice QA, reports, and captions) and pixel-level referring change detection.

News

  • 2025/08   DynamicVL was accepted to NeurIPS 2025! We will add encoder-decoder-based semantic change detection implementations to this repo. Stay tuned!

Environment Setup

# Create the conda environment
conda create -n dvl python=3.10 -y
conda activate dvl

# Install the package
(dvl): pip install -e .

# Optional: manually install PyTorch if the vLLM dependency conflicts with your environment
# Note: switch to a lower cuXXX index URL if cu128 conflicts with your CUDA driver.
(dvl): pip install -U torch torchvision xformers --index-url https://download.pytorch.org/whl/cu128

# Optional: fix "version `GLIBCXX_3.4.32' not found" errors
(dvl): conda install -c conda-forge gcc=13 gxx=13 -y
(dvl): export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
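
A quick sanity check after installation can save debugging time later. The sketch below assumes the editable install exposes a top-level dvl package (inferred from the python -m dvl... commands used throughout this README) and that a CUDA-capable GPU is available:

import torch

# vLLM-based evaluation requires a working CUDA setup.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

# The editable install should make the benchmark package importable.
import dvl  # assumed package name; see the `python -m dvl...` commands below
print("dvl imported successfully")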

Data Setup

Download the DVL-Suite dataset and unzip the training and test archives:

mkdir data && cd data
unzip train.zip
unzip test.zip
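
If you prefer fetching the archives programmatically (the dataset badge above points to Hugging Face), a sketch along these lines should work; the repo_id below is a placeholder, so substitute the actual dataset id:

from huggingface_hub import snapshot_download

# Placeholder repo_id: replace with the actual DVL-Suite dataset id on Hugging Face.
snapshot_download(
    repo_id="weihao1115/DVL-Suite",
    repo_type="dataset",
    local_dir="data",
)
# Then unzip train.zip and test.zip inside data/ as shown above.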

Expected directory layout:

data/
├── train/                          # DVL-Instruct (Training Set)
│   ├── images/{city}/{region}/{image_id_timestamp}.tif
│   ├── cd_sem_masks/
│   ├── cd_refer_seg_masks/
│   ├── regional_caption/
│   ├── metadata.json
│   ├── basic_change_choice_qa.json
│   ├── basic_change_report_qa.json
│   ├── change_speed_choice_qa.json
│   ├── change_speed_report_qa.json
│   ├── change_referring_seg_qa.json
│   ├── eco_assessment.json
│   ├── dense_temporal_caption.json
│   └── regional_caption.json
└── test/                           # DVL-Bench (Test Set)
    └── [same structure as train/]
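
Given this layout, a small sketch like the following (paths and filename pattern taken from the tree above) enumerates the multi-temporal image series per region:

from pathlib import Path

data_root = Path("data/train/images")

# Each {city}/{region} directory holds one .tif per acquisition; sorting by
# filename assumes the trailing timestamp sorts chronologically.
for region_dir in sorted(data_root.glob("*/*")):
    tifs = sorted(region_dir.glob("*.tif"))
    city, region = region_dir.parts[-2], region_dir.name
    print(f"{city}/{region}: {len(tifs)} timestamps")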

Usage

Vision-Language Tasks

Load Data

from dvl.vqa.dataset import DynamicVLVQA

dataset = DynamicVLVQA(subset="BCA-QA", data_dir="data/train")
for item in dataset:
    # images: List[PIL.Image] across time
    # messages: multi-turn Q&A dicts
    # metadata: contains id, task_type, prompts, options_str, image_list, time_stamps
    print(item)

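Each item behaves like a dict whose keys match the comments above. A minimal sketch of pulling out the pieces (key names are taken from those comments, so treat them as assumptions):

item = next(iter(dataset))

images = item["images"]        # list of PIL.Image, one per timestamp
messages = item["messages"]    # multi-turn Q&A dicts
meta = item["metadata"]

print(meta["id"], meta["task_type"])
print(meta["time_stamps"])     # acquisition times aligned with `images`
print(messages[0])             # first conversation turn
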
Evaluate Open-Source Models (vLLM)

(dvl): python -m dvl.vqa.run_vllm \
    --model_id Qwen/Qwen2.5-VL-3B-Instruct \
    --subset BCA-QA

Available subsets:

  • BCA-QA - Basic Change Analysis (QA)
  • CSE-QA - Change Speed Estimation (QA)
  • BCA-Report - Basic Change Analysis (Report)
  • CSE-Report - Change Speed Estimation (Report)
  • DTC - Dense Temporal Caption
  • RCC - Regional Change Caption
  • EA - Environmental Assessment

Note: Set --batch_size 1 for llava-hf/llava-onevision-qwen2-7b-ov-hf to avoid GPU OOM.

Output: results/vqa/Qwen--Qwen2.5-VL-3B-Instruct/ stores .jsonl predictions and .json summaries.
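
The .jsonl files contain one prediction per line, so they can be inspected with the standard json module. The filename below is an assumption (one file per subset); adapt it to whatever appears in your results directory:

import json
from pathlib import Path

pred_file = Path("results/vqa/Qwen--Qwen2.5-VL-3B-Instruct/BCA-QA.jsonl")  # assumed filename

with pred_file.open() as f:
    predictions = [json.loads(line) for line in f]

print(len(predictions), "predictions")
print(predictions[0])  # inspect the record schema before any post-processing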

Evaluate Commercial Models (Azure OpenAI)

export AZURE_OPENAI_BASE="{your-azure-endpoint}"
export AZURE_OPENAI_KEY="{your-api-key}"
export AZURE_OPENAI_API_VERSION="{your-api-version}"

(dvl): python -m dvl.vqa.run_azure_openai \
    --model_id gpt-4o \
    --subset BCA-QA

Output: results/vqa/gpt-4o/ stores task-specific .jsonl predictions and .json metrics.

GPT-Based Evaluation for Reports and Captions

export AZURE_OPENAI_BASE="{your-azure-endpoint}"
export AZURE_OPENAI_KEY="{your-api-key}"
export AZURE_OPENAI_API_VERSION="{your-api-version}"

(dvl): python -m dvl.vqa.pretty_print.gpt_eval \
    --gpt_model_id gpt-4.1-mini \
    --eval_model_id "Qwen/Qwen2.5-VL-3B-Instruct" \
    --subset DTC

Supported subsets:

  • BCA-Report
  • CSE-Report
  • DTC
  • RCC

Output: results/vqa/Qwen--Qwen2.5-VL-3B-Instruct/ includes GPT-scored .jsonl files (for example DTC.gpt-4.1-mini.jsonl).

Aggregate Metrics

# Multi-choice QA tasks (BCA-QA, CSE-QA, EA)
(dvl): python -m dvl.vqa.pretty_print.acc_table

# Open-ended generation tasks (Reports & Captions)
(dvl): python -m dvl.vqa.pretty_print.gen_table --gpt_model_id gpt-4.1-mini

Tabulated metrics are printed to the console and saved in results/vqa/.

Referring Change Detection

Load Data

from dvl.vqa.dataset import DynamicVLReferSeg

dataset = DynamicVLReferSeg(data_dir="data/train")
for item in dataset:
    # t1_image, t2_image: np.ndarray of shape (1024, 1024, 3)
    # gt_mask: binary change mask
    # messages: instruction-response history
    # cd_info: source/target land-cover classes and indices
    # metadata: contains the unique evaluation id
    print(item)

Evaluate Predictions

Organize predicted masks using item["metadata"]["id"] as the filename stem:

{your-pred-dir}/
├── change_referring_seg_qa_0.png
├── change_referring_seg_qa_1.png
└── ...
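
A minimal sketch for writing masks in this layout, assuming your model returns a (1024, 1024) binary numpy array per example and that the evaluators accept 0/255 PNG masks (check the scripts below if your format differs). The my_model_predict call is a hypothetical placeholder for your own inference code:

import numpy as np
from pathlib import Path
from PIL import Image

from dvl.vqa.dataset import DynamicVLReferSeg

pred_dir = Path("preds")            # pass this as --pred_dir below
pred_dir.mkdir(parents=True, exist_ok=True)

dataset = DynamicVLReferSeg(data_dir="data/test")   # use the split you are evaluating
for item in dataset:
    mask = my_model_predict(item)   # hypothetical: your (1024, 1024) binary prediction
    out = mask.astype(np.uint8) * 255
    Image.fromarray(out).save(pred_dir / f'{item["metadata"]["id"]}.png')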

Run the evaluation utilities:

# LISA-style binary IoU metrics
(dvl): python -m dvl.vqa.pretty_print.referseg_iou --pred_dir "{your-pred-dir}"

# MambaCD-style semantic change detection metrics
(dvl): python -m dvl.vqa.pretty_print.referseg_cd --pred_dir "{your-pred-dir}"

Scores are printed to the console and stored alongside the submitted prediction masks.
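
For reference, LISA-style binary IoU is typically reported as gIoU (mean of per-image IoUs) and cIoU (cumulative intersection over cumulative union). The toy sketch below only illustrates that convention; treat the official script above as the reference implementation of the exact protocol:

import numpy as np

def binary_iou_stats(pred, gt):
    """Return (intersection, union) pixel counts for two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return np.logical_and(pred, gt).sum(), np.logical_or(pred, gt).sum()

# Toy example with two random (pred, gt) pairs; in practice these come from
# your prediction directory and the dataset's gt_mask field.
rng = np.random.default_rng(0)
pairs = [(rng.integers(0, 2, (1024, 1024)), rng.integers(0, 2, (1024, 1024))) for _ in range(2)]

per_image_iou, total_i, total_u = [], 0, 0
for pred, gt in pairs:
    i, u = binary_iou_stats(pred, gt)
    per_image_iou.append(i / u if u else 1.0)
    total_i, total_u = total_i + i, total_u + u

print(f"gIoU={np.mean(per_image_iou):.4f}  cIoU={total_i / total_u:.4f}")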

Citation

If you find DynamicVL useful, please cite:

@article{xuan2025dynamicvl,
  title={DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding},
  author={Xuan, Weihao and Wang, Junjue and Qi, Heli and Chen, Zihang and Zheng, Zhuo and Zhong, Yanfei and Xia, Junshi and Yokoya, Naoto},
  journal={arXiv preprint arXiv:2505.21076},
  year={2025}
}

License

DynamicVL is released under the Apache-2.0 License.

Acknowledgements

DynamicVL builds on NAIP aerial imagery and the open-source multimodal community. We appreciate all contributors who benchmarked cutting-edge MLLMs on our dataset and shared feedback during the public release.
