Toward Ambulatory Vision: Learning Visually-Grounded Active View Selection

Juil Koo* · Daehyeon Choi* · Sangwoo Youn* · Phillip Y. Lee · Minhyuk Sung

(* Equal Contribution)

KAIST

arXiv 2025

Paper PDF · Project Page · AVS-Dataset

[VG-AVS teaser figure]

TL;DR

We introduce the Visually Grounded Active View Selection (VG-AVS) framework, which enables embodied agents to actively adjust their viewpoint for better visual question answering using only current visual cues, achieving state-of-the-art performance on synthetic and real-world benchmarks.

Release Checklist

🚧 Pretrained model checkpoints (SFT, SFT+GRPO). (Expected: early January)

✅ AVS-ProcTHOR & AVS-HM3D datasets, training/inference/evaluation code. (12.24)

Code

1. Environment Setup

We tested our code with CUDA 12.8 on NVIDIA H200 GPUs, but it should also work with other CUDA versions and GPU devices.

Conda Environment

Clone this repository:

git clone https://github.com/KAIST-Visual-AI-Group/VG-AVS.git
cd VG-AVS
# Initialize a virtual environment (we use conda).
conda create --name avs python=3.11 -y 
conda activate avs

# First, install a PyTorch build that matches your GPU. We used 2.8.0+cu128.
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128 

# Install the remaining dependencies.
bash setup.sh 
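
As a quick, optional sanity check (not part of the official setup), you can verify that the installed PyTorch build detects CUDA before proceeding:

# Optional: confirm that PyTorch was installed with CUDA support
python -c "import torch; print(torch.__version__); print('CUDA available:', torch.cuda.is_available())"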

2. Download data

ProcTHOR

We release the ProcTHOR training and evaluation data on Hugging Face; please download the archives and move them into your project folder.
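
One convenient way to fetch the archives is the Hugging Face CLI. The sketch below is illustrative; <dataset-repo-id> is a placeholder for the actual AVS-Dataset repository linked above.

# Illustrative download via the Hugging Face CLI (replace <dataset-repo-id> with the actual dataset repo)
pip install -U "huggingface_hub[cli]"
huggingface-cli download <dataset-repo-id> --repo-type dataset --local-dir .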

# Move the downloaded archives into the 'data' folder
mv avs_procthor_train.tar.gz avs_procthor_existence.tar.gz avs_procthor_counting.tar.gz avs_procthor_state.tar.gz ./data/

# Extract the archives inside the 'data' folder
cd data
tar -xvf avs_procthor_train.tar.gz
tar -xvf avs_procthor_existence.tar.gz
tar -xvf avs_procthor_counting.tar.gz
tar -xvf avs_procthor_state.tar.gz
cd ..

HM3D

For the HM3D dataset, please first download the data by following the official instructions (Habitat-Matterport3D).

# After authorizing access, download the 'v0.2/val' splits and move them into the data folder.
mkdir -p ./data/hm3d/val
mv hm3d-val-semantic-configs-v0.2.tar hm3d-val-semantic-annots-v0.2.tar hm3d-val-habitat-v0.2.tar hm3d-val-glb-v0.2.tar ./data/hm3d/val/

# Extract the archives
cd ./data/hm3d/val 
tar -xvf hm3d-val-semantic-configs-v0.2.tar
tar -xvf hm3d-val-semantic-annots-v0.2.tar
tar -xvf hm3d-val-habitat-v0.2.tar
tar -xvf hm3d-val-glb-v0.2.tar
cd ../../..  # return to the project root

Then, additionally download our data snapshot from Hugging Face and move it into the 'data' folder.

# Move the snapshot into the 'data' folder and extract it
mv avs_hm3d.tar.gz ./data/
tar -xvf ./data/avs_hm3d.tar.gz -C ./data/

Finally, the folder structure should look like this:

data/
├── hm3d/
│   └── val/
│       ├── 00800-TEEsavR23oF/
│       └── 00YYY-zzzzzzzzzzz/
├── avs_procthor_train/
├── avs_procthor_existence/
├── avs_procthor_counting/
├── avs_procthor_state/
├── avs_hm3d/
├── avs_hm3d_overall.jsonl
└──...
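
Optionally, a quick shell check (ours, not part of the repository) can confirm that the expected folders are in place:

# Optional: verify that the expected data folders exist
for d in hm3d/val avs_procthor_train avs_procthor_existence avs_procthor_counting avs_procthor_state avs_hm3d; do
  [ -d "data/$d" ] && echo "ok:      data/$d" || echo "missing: data/$d"
done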

3. Setting Simulation Environment

Habitat-Sim

export CMAKE_POLICY_VERSION_MINIMUM=3.5

# Build habitat-sim from source; this may take several minutes.
git clone --branch stable https://github.com/facebookresearch/habitat-sim.git
cd habitat-sim
pip install . -v
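
To quickly confirm that the build succeeded, you can try importing the package (an informal check, not part of the official instructions):

# Optional: confirm habitat-sim is importable after the build
python -c "import habitat_sim; print('habitat-sim imported successfully')"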

Sanity Check

Our framework uses two different simulation environments (AI2-THOR and HM3D), so before running the code, please verify that each environment works properly in your setup.

Please follow notebook/environment_check.ipynb.
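
If you want a one-line smoke test before opening the notebook, something like the following may work for AI2-THOR (an informal check; on headless machines a virtual display such as Xvfb may be required):

# Optional: minimal AI2-THOR smoke test (the notebook above remains the authoritative check)
python -c "from ai2thor.controller import Controller; c = Controller(scene='FloorPlan1'); print(c.last_event.metadata['sceneName']); c.stop()"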

4. Download Pretrained model

We will release our pretrained models soon. Stay tuned. :)

5. Run

Before running training or evaluation scripts, you need to configure the following paths and API keys.

Configuration

Required Paths

| Variable | Description | Example |
| --- | --- | --- |
| PROJECT_ROOT | Root directory of the project | /home/user/VG-AVS |
| DATA_JSONL | Path to the training/evaluation JSONL file | /path/to/data/avs_procthor_train.jsonl |
| IMG_ROOT | Root directory containing images | ${PROJECT_ROOT}/data |
| MODEL_PATH | Path to the trained model (for evaluation) | ${PROJECT_ROOT}/src/open-r1-multimodal/output/grpo-procthor |
API Keys (for Evaluation)

The evaluation scripts use LLM APIs for the verifier model. Set these environment variables:

export GEMINI_API_KEY="your_gemini_api_key"   # Required for Gemini verifier
export OPENAI_API_KEY="your_openai_api_key"   # Optional, for GPT verifier
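
For reference, a typical configuration might look like the sketch below. Whether each run script reads these as environment variables or expects you to edit them inside the script depends on the script itself, so treat the paths as illustrative placeholders.

# Illustrative configuration (adjust paths to your setup)
export PROJECT_ROOT=/home/user/VG-AVS
export DATA_JSONL=${PROJECT_ROOT}/data/avs_procthor_train.jsonl
export IMG_ROOT=${PROJECT_ROOT}/data
export MODEL_PATH=${PROJECT_ROOT}/src/open-r1-multimodal/output/grpo-procthor
export GEMINI_API_KEY="your_gemini_api_key"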

Tutorial

You can easily test our framework in the ProcTHOR environment.

bash src/open-r1-multimodal/run_scripts/test_procthor_single_sample.sh

Training

SFT Training (Supervised Fine-tuning):

bash src/open-r1-multimodal/run_scripts/run_sft_procthor_active_qa.sh

GRPO Training (Reinforcement Learning):

# Set required paths
bash src/open-r1-multimodal/run_scripts/run_grpo_procthor_active_qa.sh 

Evaluation

ProcTHOR Evaluation:

# Set required paths and API keys
bash src/open-r1-multimodal/run_scripts/test_procthor_action_accuracy.sh

HM3D Evaluation:

bash src/open-r1-multimodal/run_scripts/test_hm3d_action_accuracy.sh

Acknowledgement

Our implementation is built upon amazing projects, including Qwen2.5-VL, VLM-R1, AI2-THOR, and Habitat-Sim. We sincerely thank all the authors and contributors for open-sourcing their code and model checkpoints.
