@Mr-Neutr0n
Summary

Fixes #441

When training Qwen2.5-VL with agent-lightning + verl, the model crashes in get_rope_index with a shape mismatch:

position_ids[..., attention_mask == 1] = llm_positions

This assignment fails because the number of positions in llm_positions differs from the number of entries where attention_mask == 1.

Root cause: In get_train_data_batch, prompt truncation (prompt_ids[:max_prompt_length]) shortens the token sequence and can remove image placeholder tokens. However, image_grid_thw is still computed from the original (untruncated) image_urls list. When get_rope_index processes the truncated sequence, it finds fewer <|vision_start|><|image_pad|> regions than there are image_grid_thw entries, so the number of generated position IDs no longer matches the attention mask count.
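The mismatch can be reproduced in isolation with a toy token sequence. This is a minimal sketch, not code from the PR: the token IDs, the `image_block` and `count_regions` helpers, and the file names are placeholders invented for illustration (real IDs come from the model config).

```python
# Placeholder token IDs for illustration only; real values come from the model config.
VISION_START, IMAGE_PAD, VISION_END, TEXT = 1, 2, 3, 9


def image_block(n_pad):
    """Build one toy image region: start marker, n_pad placeholders, end marker."""
    return [VISION_START] + [IMAGE_PAD] * n_pad + [VISION_END]


def count_regions(ids):
    """Count positions where a vision-start token is directly followed by an image token."""
    return sum(1 for a, b in zip(ids, ids[1:]) if a == VISION_START and b == IMAGE_PAD)


# Two images in the prompt, two entries in image_urls.
prompt_ids = [TEXT] * 3 + image_block(4) + [TEXT] * 2 + image_block(4) + [TEXT]
image_urls = ["img_0.png", "img_1.png"]

# Truncation keeps the first image region but cuts off the second one entirely.
max_prompt_length = 10
truncated = prompt_ids[:max_prompt_length]

regions_before = count_regions(prompt_ids)
regions_after = count_regions(truncated)

print(regions_before, regions_after, len(image_urls))
```

After truncation the sequence holds one image region while `image_urls` still has two entries, which is the inconsistency described above.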

Fix: After prompt truncation, count the remaining image regions in the truncated token sequence using the same vision_start_token_id + image_token_id pattern that get_rope_index uses, and slice image_urls to match before computing image_grid_thw.

  • Added _count_images_in_tokens() helper method to detect image regions in token sequences
  • Modified the transition-level mRoPE code path to reconcile image_urls with truncated prompts
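The reconciliation step could look roughly like the following. This is a hedged sketch written from the description above, not the PR's actual implementation: `count_images_in_tokens` and `reconcile_image_urls` are hypothetical standalone functions, and the token IDs in the usage lines are placeholders.

```python
from typing import List, Sequence, Tuple


def count_images_in_tokens(
    token_ids: Sequence[int],
    vision_start_token_id: int,
    image_token_id: int,
) -> int:
    """Count image regions: a vision-start token immediately followed by an image token."""
    count = 0
    for i in range(len(token_ids) - 1):
        if token_ids[i] == vision_start_token_id and token_ids[i + 1] == image_token_id:
            count += 1
    return count


def reconcile_image_urls(
    prompt_ids: Sequence[int],
    image_urls: Sequence[str],
    max_prompt_length: int,
    vision_start_token_id: int,
    image_token_id: int,
) -> Tuple[List[int], List[str]]:
    """Truncate the prompt, then keep only as many image URLs as regions survive."""
    truncated = list(prompt_ids[:max_prompt_length])
    n_images = count_images_in_tokens(truncated, vision_start_token_id, image_token_id)
    # Assumes images appear in the prompt in the same order as image_urls.
    return truncated, list(image_urls[:n_images])


# Usage with placeholder IDs (1 = vision start, 2 = image pad, 3 = vision end, 9 = text).
ids = [9, 1, 2, 2, 3, 9, 1, 2, 2, 3]
kept_ids, kept_urls = reconcile_image_urls(ids, ["a.png", "b.png"], 6, 1, 2)
print(kept_ids, kept_urls)
```

One caveat of this sketch: if truncation cuts through the middle of an image region, the region is still counted even though some of its placeholder tokens are gone. The PR description does not say how that case is handled.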

Test plan

  • Verify Qwen2.5-VL training with prompts that exceed max_prompt_length and contain images no longer crashes in get_rope_index
  • Verify Qwen2.5-VL training with prompts shorter than max_prompt_length is unaffected (no truncation, all images retained)
  • Verify non-VL model training paths are unaffected (_use_mrope is False)

When training Qwen2.5-VL with agent-lightning + verl, prompt truncation
changes the token count but image_grid_thw is computed from the original
(untruncated) image_urls. This causes get_rope_index to fail with a
shape mismatch because it finds fewer image tokens in the truncated
input_ids than entries in image_grid_thw.

After prompt truncation, count remaining image regions in the truncated
token sequence and slice image_urls to match before computing
image_grid_thw, ensuring consistency between the token content and the
mRoPE spatial metadata.

Fixes microsoft#441
@Mr-Neutr0n Mr-Neutr0n force-pushed the fix/qwen-vl-mrope-truncation branch from bdd1c8d to ca0be5a Compare February 9, 2026 22:01
@Mr-Neutr0n
Author

Friendly bump! Let me know if there's anything I should update or improve to help move this forward.
