🌟 This repository contains the training and evaluation code for Mull-Tokens, a method that compresses textual and visual reasoning information into modality-agnostic discrete latent tokens for improved multi-modal reasoning with Qwen2.5-VL.
📈 It also serves as a template for training Qwen-style models and evaluating them with lmms-eval, all in one place.
- [2026.2.4] Code for training and evaluation released! This initial release is a work in progress; if you find any errors or bugs, please email us or open an issue and we will do our best to fix them.
- Installation
- Pre-trained Models
- Minimal CLI Inference
- Dataset Setup
- Training
- Evaluation
- Configuration Reference
```bash
pip install -r requirements/requirements.txt
```

For systems with GLIBC >= 2.32:

```bash
pip install flash-attn --no-build-isolation
```

For systems with GLIBC < 2.32 (e.g., RHEL/CentOS 8):

```bash
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.9.post1/flash_attn-2.5.9.post1+cu122torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
pip install flash_attn-2.5.9.post1+cu122torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
```

Install the custom transformers package with Mull-Tokens vocabulary support:

```bash
git clone https://github.com/arijitray1993/Video-R1
cd Video-R1
pip install -e .
```

```bash
# CUDA setup (adjust for your system)
module load cuda/12.5
export CUDA_HOME=/path/to/cuda/12.5/install

# Optional: configure the HuggingFace cache
export HF_HOME="~/.cache/huggingface"

# Optional: WandB configuration
export WANDB_MODE="online"  # or "offline" for local logging
```

| Model | Description | HuggingFace |
|---|---|---|
| Qwen2.5-VL-Mull | Mull-Tokens Stage 2 (SFT) | array/Qwen2.5-VL-Mull |
| Qwen2.5-VL-MullGRPO | Mull-Tokens with GRPO | array/Qwen2.5-VL-MullGRPO |
Minimal inference prompt format with our pre-trained Mull-Tokens models.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Choose model: "array/Qwen2.5-VL-Mull" or "array/Qwen2.5-VL-MullGRPO"
MODEL_ID = "array/Qwen2.5-VL-Mull"
NUM_LATENTS = 20

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image_path = "path/to/your/image.jpg"
question = (
    "If you stand at the X marked point and turn left, will the table be to "
    "your left or right? Please choose between the following answer choices: "
    "A. left. B. right. "
)
question_type = "multiple choice"

QUESTION_TEMPLATE_LATENT = (
    "{Question}\n"
    "Please think about this question deeply. "
    "It's encouraged to include self-reflection or verification in the reasoning process. "
    "Provide your final answer between the <answer> </answer> tags."
)

TYPE_TEMPLATE = {
    "multiple choice": " Please provide only the single option letter (e.g., A, B, C, D, etc.) within the <answer> </answer> tags.",
    "numerical": " Please provide the numerical value (e.g., 42 or 3.14) within the <answer> </answer> tags.",
    "OCR": " Please transcribe text from the image/video clearly and provide your text answer within the <answer> </answer> tags.",
    "free-form": " Please provide your text answer within the <answer> </answer> tags.",
    "regression": " Please provide the numerical value (e.g., 42 or 3.14) within the <answer> </answer> tags.",
}

prompt = QUESTION_TEMPLATE_LATENT.format(Question=question) + TYPE_TEMPLATE[question_type]

# Prepare input with image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    },
    # IMPORTANT: Mull-Tokens requires latent thinking tokens before answer generation.
    # Append an assistant message with "<think>" + "<|latent_pad|>" * NUM_LATENTS + "</think>"
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "<think>" + "<|latent_pad|>" * NUM_LATENTS + "</think>\n",
            }
        ],
    },
]

# Process inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
text = text.replace("<|im_end|>\n", "")  # Remove end token so the model continues generating
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate response
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
    )

# Decode output (skip input tokens)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

Multi-image input:

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "image1.jpg"},
            {"type": "image", "image": "image2.jpg"},
            {"type": "text", "text": "Compare these two images."},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "<think>" + "<|latent_pad|>" * 20 + "</think>\n"}],
    },
]
```

Video input:

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "path/to/video.mp4", "max_pixels": 360 * 420, "fps": 1.0},
            {"type": "text", "text": "Describe what happens in this video."},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "<think>" + "<|latent_pad|>" * 20 + "</think>\n"}],
    },
]
```

Video reasoning dataset with 165K chain-of-thought examples.
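In every example, the assistant turn carries the same latent "thinking" prefix, and the chat template's trailing end-of-turn token is stripped so generation continues that turn. The construction can be factored into small helpers (our own sketch; these function names are not part of the repo):

```python
def latent_prefix(num_latents: int = 20) -> str:
    """Assistant-turn prefix reserving `num_latents` latent thinking slots."""
    return "<think>" + "<|latent_pad|>" * num_latents + "</think>\n"


def continue_assistant_turn(chat_text: str) -> str:
    """Strip <|im_end|> markers so generation continues the assistant turn."""
    return chat_text.replace("<|im_end|>\n", "")
```

`latent_prefix(20)` produces exactly the string used in the assistant messages above.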
```bash
# Clone the Video-R1 repository
git clone https://github.com/tulerfeng/Video-R1
cd Video-R1

# Build environment
conda create -n video-r1 python=3.11
conda activate video-r1
bash setup.sh

# Qwen video extraction settings (max frames, resolutions)
# Use the [decord] feature to improve speed
cd src/qwen-vl-utils
pip install -e .[decord]
cd ..

# Download the training dataset
git lfs install
git clone https://huggingface.co/datasets/Video-R1/Video-R1-data
```

Place the downloaded dataset in `/your_path/`, set `root_directory` in `./src/unzip.py` to that path, then unzip:

```bash
python ./src/unzip.py
```

Recommended: keep the online HuggingFace dataset links.
If you are using an offline local setup, update the dataset paths in the `lmms-eval/lmms_eval/tasks/*/` YAML files to match your local setup.
All training scripts should be run from the repository root directory.
Standard supervised fine-tuning without Mull-tokens.
Config: `google_scripts/exp_configs/sat_vidr1_zebra_sft.yaml`

Update the config with your Video-R1 location:

```yaml
video_r1_location: '/path/to/Video-R1-COT-165k.json'
```

Launch script:

```bash
bash google_scripts/launch_scripts/run_sat_vidr1_zebra_sft.sh
```

Trains the model to compress visual embeddings into 20 discrete latent tokens.
Config: `google_scripts/exp_configs/vidr1_mmlatent1_qwenbase.yaml`

Update the config with your Video-R1 location:

```yaml
video_r1_location: '/path/to/Video-R1-COT-165k.json'
```

Launch script:

```bash
bash google_scripts/launch_scripts/run_vidr1_zebra_mmlatent1_qwenbase.sh
```

Trains with discrete latent tokens from Stage 1.
Config: `google_scripts/exp_configs/vidr1_sat_zebra_sft_mmlatent2discrete_qwenlatent1.yaml`
Prerequisites:
- A completed Stage 1 checkpoint
- Update `model_path` in the config to point to the Stage 1 checkpoint

Launch script:

```bash
bash google_scripts/launch_scripts/run_sft_qwenlatent1_vidr1_SAT_zebra_mmlatent_stage2discrete.sh
```

Optimizes with Group Relative Policy Optimization (GRPO).
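As a refresher, GRPO scores each sampled response relative to the other responses sampled for the same prompt. A minimal sketch of the group-relative advantage computation (illustrative only; the trainer in this repo may differ in normalization details):

```python
from statistics import mean, pstdev


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Z-score rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Responses rewarded above the group mean get positive advantages and are reinforced; those below the mean are suppressed, so no separate value network is needed.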
Config: `google_scripts/exp_configs/vidr1_sat_zebra_grpo_mmlatent2discrete_qwenlatent1_new.yaml`
Prerequisites:
- A completed Stage 2 checkpoint
- Update `model_path` in the config to point to the Stage 2 checkpoint

Launch script:

```bash
bash google_scripts/launch_scripts/run_grpo_sat_vidr1_zebra_qwenlatent2discrete_1.sh
```

Run evaluations using the lmms-eval framework.
```bash
cd lmms-eval
sh examples/models/vidr1_sat_zebra_sft_mmlatent2discrete_qwenlatent1.sh
```

Or launch manually:

```bash
cd lmms-eval
MODEL_PATH="array/Qwen2.5-VL-Mull"  # or a local checkpoint path
MODEL_ARGS="pretrained=${MODEL_PATH},max_pixels=12845056,max_num_frames=16,attn_implementation=flash_attention_2,interleave_visuals=False"
accelerate launch --num_processes=4 -m lmms_eval \
    --model qwen2_5_vl_mmlatentdiscrete \
    --model_args="${MODEL_ARGS}" \
    --gen_kwargs=prompt_mode=mmlatent2,num_latents=20 \
    --tasks blink_iqtest,blink_sprel,sat_real,vsibench,erqa,mmsi_bench \
    --batch_size 1 \
    --output_path "./eval_outputs"
```

```yaml
# Run identifier
run_name: 'experiment_name'

# Dataset configuration
train_dataset_args:
  split: train
  mix_datas:
    'SAT': 0.6        # Dataset weight (0-1)
    'VideoR1': 0.2
    'ZebraCOT': 0.2
  sat_location: 'array/SAT'   # HF repo or local path
  video_r1_location: '/path/to/Video-R1.json'
  zebracot_location: 'multimodal-reasoning-lab/Zebra-CoT'
  mode: 'train'

# Mull-Tokens specific
mmlatent_mode_stage1: False   # Enable for Stage 1
mmlatent_mode_stage2: False   # Enable for Stage 2
mmlatent_rl_mode: False       # Enable for GRPO
num_latent_tokens: 20         # Number of latent tokens

# Model configuration
model_name: Qwen2.5-VL-7B     # or Qwen2.5-VL-7B-MMLatentDiscrete
model_path: 'path/to/model'

# Training options
freeze_vision: True           # Freeze the vision encoder
latent_size: 20               # Deprecated; not used
stage: stage1                 # stage1 or stage2
```

If you use this code, please cite:
```bibtex
@misc{ray2025mulltokensmodalityagnosticlatentthinking,
      title={Mull-Tokens: Modality-Agnostic Latent Thinking},
      author={Arijit Ray and Ahmed Abdelkader and Chengzhi Mao and Bryan A. Plummer and Kate Saenko and Ranjay Krishna and Leonidas Guibas and Wen-Sheng Chu},
      year={2025},
      eprint={2512.10941},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.10941},
}
```

This work builds upon the awesome work by:
