Question
Hello, I tried to reproduce the results of StreamVLN on an H20 machine with 96 GB of VRAM, but I still ran into a CUDA out-of-memory error. My configuration is:
```shell
torchrun --nnodes=$NNODES --nproc_per_node=$NPROC_PER_NODE \
--rdzv_id=12345 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT streamvln/streamvln_train.py \
--deepspeed scripts/zero2.json \
--model_name_or_path $PREV_STAGE_CHECKPOINT \
--version $PROMPT_VERSION \
--video_folder ${VIDEO_FOLDER} \
--group_by_task False \
\
--num_history 8 \
--num_future_steps 4 \
--num_frames 32 \
--data_augmentation True \
\
--mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
--vision_tower ${VISION_MODEL_VERSION} \
--mm_projector_type mlp2x_gelu \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio anyres_max_9 \
--image_grid_pinpoints "(1x1),...,(6x6)" \
--bf16 True \
--run_name $MID_RUN_NAME \
--output_dir checkpoints/$MID_RUN_NAME \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 2 \
--evaluation_strategy "no" \
--save_strategy "epoch" \
--save_total_limit 1 \
--learning_rate 2e-5 \
--mm_vision_tower_lr 5e-6 \
--weight_decay 0. \
--warmup_ratio 0.075 \
--lr_scheduler_type "cosine_with_min_lr" \
--lr_scheduler_kwargs '{"min_lr": 1.85e-05}' \
--logging_steps 10 \
--tf32 True \
--model_max_length 32768 \
--gradient_checkpointing True \
--dataloader_num_workers 8 \
--lazy_preprocess True \
--torch_compile True \
--torch_compile_backend "inductor" \
--dataloader_drop_last True \
--report_to tensorboard
```
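For context on the batch-size flags above, the effective global batch size per optimizer step is per_device_train_batch_size × gradient_accumulation_steps × world_size. A small sketch of that arithmetic (the world size of 8 GPUs is an assumption for illustration, not taken from the run above):

```python
def effective_batch_size(per_device_batch: int,
                         grad_accum_steps: int,
                         world_size: int) -> int:
    """Global number of samples contributing to one optimizer step."""
    return per_device_batch * grad_accum_steps * world_size

# Flags from the command above: batch 2, accumulation 2;
# a world size of 8 (one 8-GPU node) is hypothetical.
print(effective_batch_size(2, 2, 8))  # → 32
```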
However, changing mm_tunable_parts to "mm_vision_tower,mm_mlp_adapter" resolved the issue. I'd like to ask: was the model performance reported in the paper achieved with mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model", or not?
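In case the full tunable set turns out to be required, a hedged sketch of flag changes that sometimes reduce memory pressure without freezing the language model (a config fragment, not a complete command; scripts/zero3.json is an assumed ZeRO-3 config path, not confirmed to exist in the repo):

```shell
# Hypothetical memory-saving variants of the flags above.
# NOTE: scripts/zero3.json is an assumed DeepSpeed ZeRO-3 config path.

# Shard optimizer state, gradients, and parameters (ZeRO-3):
--deepspeed scripts/zero3.json
# Halve per-GPU activation memory:
--per_device_train_batch_size 1
# Compensate so the global batch size is unchanged:
--gradient_accumulation_steps 4
```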
And by the way, was the EnvDrop dataset used in this experiment?
