Juil Koo* · Daehyeon Choi* · Sangwoo Youn* · Phillip Y. Lee · Minhyuk Sung
(* Equal Contribution)
KAIST
arXiv 2025
We introduce the Visually Grounded Active View Selection (VG-AVS) framework, which enables embodied agents to actively adjust their viewpoint for better Visual Question Answering using only current visual cues, achieving state-of-the-art performance on synthetic and real-world benchmarks.
🚧 Pretrained (SFT, SFT+GRPO) model checkpoints. (Expected: early January)
✅ AVS-ProcTHOR & AVS-HM3D dataset, training/inference/evaluation code. (12.24)
We tested our code with CUDA 12.8 on NVIDIA H200 GPUs, but it should also work in other CUDA environments and on other GPU devices.
Clone this repository:
git clone https://github.com/KAIST-Visual-AI-Group/VG-AVS.git
cd VG-AVS
# initialize a virtual environment (we used conda)
conda create --name avs python=3.11 -y
conda activate avs
# First, install a PyTorch build that matches your GPU. We used 2.8.0+cu128.
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
# install other libraries.
bash setup.sh
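To confirm that the PyTorch installation can see your GPU, you can run a quick sanity check:

# print the installed torch version and whether CUDA is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"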
We release the training and evaluation data (both ProcTHOR-based) on Hugging Face, so please download these files and move them into your project folder.
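A minimal download sketch using huggingface-cli (the repository ID below is a placeholder, not the actual repo; substitute the dataset repository we publish on Hugging Face):

# <hf-dataset-repo> is a placeholder -- replace it with the released dataset repo ID
huggingface-cli download <hf-dataset-repo> --repo-type dataset --local-dir .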
# move the data archives into the 'data' folder
mv avs_procthor_train.tar.gz avs_procthor_existence.tar.gz avs_procthor_counting.tar.gz avs_procthor_state.tar.gz ./data/
# extract the archives where they now live
cd ./data
tar -xvf avs_procthor_train.tar.gz
tar -xvf avs_procthor_existence.tar.gz
tar -xvf avs_procthor_counting.tar.gz
tar -xvf avs_procthor_state.tar.gz
cd ..
For the HM3D dataset, please first download the data following the official instructions (Habitat-Matterport3D).
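The mv/tar workflow below assumes you downloaded the four archives manually from the Matterport portal. Alternatively, habitat-sim ships a downloader utility that fetches and extracts scenes directly; a rough sketch is shown below, but it requires habitat-sim to already be installed (see the build step further down), Matterport API credentials, and the exact uid may differ by release, so check the official HM3D instructions first:

# hedged sketch: verify the exact uid and flags against the official HM3D instructions
python -m habitat_sim.utils.datasets_download --username <api-token-id> --password <api-token-secret> --uids hm3d_val_v0.2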
# authorize yourself and download 'v0.2/val' splits.
mv hm3d-val-semantic-configs-v0.2.tar hm3d-val-semantic-annots-v0.2.tar hm3d-val-habitat-v0.2.tar hm3d-val-glb-v0.2.tar ./data/hm3d/val/
# extract the downloaded HM3D archives
cd ./data/hm3d/val
tar -xvf hm3d-val-semantic-configs-v0.2.tar
tar -xvf hm3d-val-semantic-annots-v0.2.tar
tar -xvf hm3d-val-habitat-v0.2.tar
tar -xvf hm3d-val-glb-v0.2.tar
cd ../../..  # back to the project root
Then download the additional data snapshot from Hugging Face and move it into the 'data' folder.
# move the snapshot into the 'data' folder and extract it
mv avs_hm3d.tar.gz ./data/
cd ./data && tar -xvf avs_hm3d.tar.gz && cd ..
After these steps, the folder structure should look like this:
data/
├── hm3d/
│   └── val/
│       ├── 00800-TEEsavR23oF/
│       └── 00YYY-zzzzzzzzzzz/
├── avs_procthor_train/
├── avs_procthor_existence/
├── avs_procthor_counting/
├── avs_procthor_state/
├── avs_hm3d/
├── avs_hm3d_overall.jsonl
└── ...
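A quick sanity check that the expected folders from the layout above are in place:

# verify the data folders listed above exist
for d in data/hm3d/val data/avs_procthor_train data/avs_procthor_existence \
         data/avs_procthor_counting data/avs_procthor_state data/avs_hm3d; do
  [ -d "$d" ] && echo "OK      $d" || echo "MISSING $d"
done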
export CMAKE_POLICY_VERSION_MINIMUM=3.5
# build habitat-sim from source; this may take several minutes
git clone --branch stable https://github.com/facebookresearch/habitat-sim.git
cd habitat-sim
pip install . -v
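To verify the build, you can run a minimal import check (a successful import does not guarantee rendering works, but it catches most installation problems):

# quick check that the habitat-sim build is importable
python -c "import habitat_sim; print('habitat-sim import OK')"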
Our framework uses two different simulation environments (AI2-THOR and HM3D), so before running the code, please check that each environment works properly in your setup.
Please follow notebook/environment_check.ipynb.
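If you cannot run the notebook, a minimal import check for the AI2-THOR side is sketched below (this only verifies the package is importable, not that the simulator actually launches):

# quick check that ai2thor is importable
python -c "from ai2thor.controller import Controller; print('AI2-THOR import OK')"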
We will release our pretrained models soon. Stay tuned :)
Before running training or evaluation scripts, you need to configure the following paths and API keys.
| Variable | Description | Example |
|---|---|---|
| `PROJECT_ROOT` | Root directory of the project | `/home/user/VG-AVS` |
| `DATA_JSONL` | Path to the training/evaluation JSONL file | `/path/to/data/avs_procthor_train.jsonl` |
| `IMG_ROOT` | Root directory containing images | `${PROJECT_ROOT}/data` |
| `MODEL_PATH` | Path to the trained model (for evaluation) | `${PROJECT_ROOT}/src/open-r1-multimodal/output/grpo-procthor` |
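For example, set them in your shell before launching any script (the values below mirror the examples in the table; adjust them to your setup):

export PROJECT_ROOT=/home/user/VG-AVS
export DATA_JSONL=${PROJECT_ROOT}/data/avs_procthor_train.jsonl
export IMG_ROOT=${PROJECT_ROOT}/data
export MODEL_PATH=${PROJECT_ROOT}/src/open-r1-multimodal/output/grpo-procthor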
The evaluation scripts use LLM APIs for the verifier model. Set these environment variables:
export GEMINI_API_KEY="your_gemini_api_key" # Required for Gemini verifier
export OPENAI_API_KEY="your_openai_api_key"  # Optional, for GPT verifier

You can easily test our framework in the ProcTHOR environment:

bash src/open-r1-multimodal/run_scripts/test_procthor_single_sample.sh

SFT Training (Supervised Fine-tuning):

bash src/open-r1-multimodal/run_scripts/run_sft_procthor_active_qa.sh

GRPO Training (Reinforcement Learning):

# Set required paths
bash src/open-r1-multimodal/run_scripts/run_grpo_procthor_active_qa.sh

ProcTHOR Evaluation:

# Set required paths and API keys
bash src/open-r1-multimodal/run_scripts/test_procthor_action_accuracy.sh

HM3D Evaluation:

bash src/open-r1-multimodal/run_scripts/test_hm3d_action_accuracy.sh

Our implementation is built upon amazing projects including Qwen2.5-VL, VLM-R1, AI2-THOR, and Habitat-Sim. We sincerely thank all authors and contributors for open-sourcing their code and model checkpoints.
