🌟 This repository contains the training and evaluation code for Mull-Tokens, a method that compresses textual and visual reasoning information into modality-agnostic discrete latent tokens for improved multi-modal reasoning with Qwen2.5-VL.
📈 It also serves as a template for training Qwen-style models and evaluating them with lmms-eval, all in one place.
- [2026.2.4] Code for training and evaluation released! This initial release is a work in progress; if you find any errors or bugs, please email us or open an issue and we will do our best to fix them.
- Installation
- Pre-trained Models
- Minimal CLI Inference
- Dataset Setup
- Training
- Evaluation
- Configuration Reference
```bash
pip install -r requirements/requirements.txt
```

For systems with GLIBC >= 2.32:

```bash
pip install flash-attn --no-build-isolation
```

For systems with GLIBC < 2.32 (e.g., RHEL/CentOS 8):

```bash
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.9.post1/flash_attn-2.5.9.post1+cu122torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
pip install flash_attn-2.5.9.post1+cu122torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
```

Install the custom transformers package with Mull-Tokens vocabulary support:

```bash
git clone https://github.com/arijitray1993/Video-R1
cd Video-R1
pip install -e .
```

```bash
# CUDA setup (adjust for your system)
module load cuda/12.5
export CUDA_HOME=/path/to/cuda/12.5/install

# Optional: configure the HuggingFace cache
export HF_HOME="~/.cache/huggingface"

# Optional: WandB configuration
export WANDB_MODE="online"  # or "offline" for local logging
```

| Model | Description | HuggingFace |
|---|---|---|
| Qwen2.5-VL-Mull | Mull-Tokens Stage 2 (SFT) | array/Qwen2.5-VL-Mull |
| Qwen2.5-VL-MullGRPO | Mull-Tokens with GRPO | array/Qwen2.5-VL-MullGRPO |
Minimal inference prompt format with our pre-trained Mull-Tokens models.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Choose model: "array/Qwen2.5-VL-Mull" or "array/Qwen2.5-VL-MullGRPO"
MODEL_ID = "array/Qwen2.5-VL-Mull"
NUM_LATENTS = 20

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image_path = "path/to/your/image.jpg"
question = (
    "If you stand at the X marked point and turn left, will the table be to "
    "your left or right? Please choose between the following answer choices: "
    "A. left. B. right. "
)
question_type = "multiple choice"

QUESTION_TEMPLATE_LATENT = (
    "{Question}\n"
    "Please think about this question deeply. "
    "It's encouraged to include self-reflection or verification in the reasoning process. "
    "Provide your final answer between the <answer> </answer> tags."
)

TYPE_TEMPLATE = {
    "multiple choice": " Please provide only the single option letter (e.g., A, B, C, D, etc.) within the <answer> </answer> tags.",
    "numerical": " Please provide the numerical value (e.g., 42 or 3.14) within the <answer> </answer> tags.",
    "OCR": " Please transcribe text from the image/video clearly and provide your text answer within the <answer> </answer> tags.",
    "free-form": " Please provide your text answer within the <answer> </answer> tags.",
    "regression": " Please provide the numerical value (e.g., 42 or 3.14) within the <answer> </answer> tags.",
}

prompt = QUESTION_TEMPLATE_LATENT.format(Question=question) + TYPE_TEMPLATE[question_type]

# Prepare input with image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    },
    # IMPORTANT: Mull-Tokens requires latent thinking tokens before answer generation.
    # Append an assistant message with "<think>" + "<|latent_pad|>" * NUM_LATENTS + "</think>"
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "<think>" + "<|latent_pad|>" * NUM_LATENTS + "</think>\n",
            }
        ],
    },
]

# Process inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
text = text.replace("<|im_end|>\n", "")  # Remove end token so the model continues generating
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate response
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
    )

# Decode output (skip input tokens)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

Multi-image input:

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "image1.jpg"},
            {"type": "image", "image": "image2.jpg"},
            {"type": "text", "text": "Compare these two images."},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "<think>" + "<|latent_pad|>" * 20 + "</think>\n"}],
    },
]
```

Video input:

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "path/to/video.mp4", "max_pixels": 360 * 420, "fps": 1.0},
            {"type": "text", "text": "Describe what happens in this video."},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "<think>" + "<|latent_pad|>" * 20 + "</think>\n"}],
    },
]
```

Video reasoning dataset with 165K chain-of-thought examples.
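In every example, the assistant turn carries the same latent "thinking" prefix, and the chat template's trailing end-of-turn token is stripped so generation continues that turn. The construction can be factored into small helpers (our own sketch; these function names are not part of the repo):

```python
def latent_prefix(num_latents: int = 20) -> str:
    """Assistant-turn prefix reserving `num_latents` latent thinking slots."""
    return "<think>" + "<|latent_pad|>" * num_latents + "</think>\n"


def continue_assistant_turn(chat_text: str) -> str:
    """Strip <|im_end|> markers so generation continues the assistant turn."""
    return chat_text.replace("<|im_end|>\n", "")
```

`latent_prefix(20)` produces exactly the string used in the assistant messages above.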
```bash
# Clone the Video-R1 repository
git clone https://github.com/tulerfeng/Video-R1
cd Video-R1

# Build environment
conda create -n video-r1 python=3.11
conda activate video-r1
bash setup.sh

# Qwen video extraction settings (max frames, resolutions)
# Use the [decord] feature to improve speed
cd src/qwen-vl-utils
pip install -e .[decord]
cd ..

# Download the training dataset
git lfs install
git clone https://huggingface.co/datasets/Video-R1/Video-R1-data
```

Place the downloaded dataset in `/your_path/`, set `root_directory` in `./src/unzip.py` to that path, then unzip:

```bash
python ./src/unzip.py
```

Recommended: keep the online HuggingFace dataset links.
If you are using an offline local setup, update the dataset paths in the `lmms-eval/lmms_eval/tasks/*/` YAML files to match your local setup.
All training scripts should be run from the repository root directory.
Standard supervised fine-tuning without Mull-tokens.
Config: `google_scripts/exp_configs/sat_vidr1_zebra_sft.yaml`

Update the config with your Video-R1 location:

```yaml
video_r1_location: '/path/to/Video-R1-COT-165k.json'
```

Launch script:

```bash
bash google_scripts/launch_scripts/run_sat_vidr1_zebra_sft.sh
```

Trains the model to compress visual embeddings into 20 discrete latent tokens.
Config: `google_scripts/exp_configs/vidr1_mmlatent1_qwenbase.yaml`

Update the config with your Video-R1 location:

```yaml
video_r1_location: '/path/to/Video-R1-COT-165k.json'
```

Launch script:

```bash
bash google_scripts/launch_scripts/run_vidr1_zebra_mmlatent1_qwenbase.sh
```

Trains with discrete latent tokens from Stage 1.
Config: `google_scripts/exp_configs/vidr1_sat_zebra_sft_mmlatent2discrete_qwenlatent1.yaml`
Prerequisites:
- A completed Stage 1 checkpoint
- Update `model_path` in the config to point to the Stage 1 checkpoint

Launch script:

```bash
bash google_scripts/launch_scripts/run_sft_qwenlatent1_vidr1_SAT_zebra_mmlatent_stage2discrete.sh
```

Optimizes with Group Relative Policy Optimization (GRPO).
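As a refresher, GRPO scores each sampled response relative to the other responses sampled for the same prompt. A minimal sketch of the group-relative advantage computation (illustrative only; the trainer in this repo may differ in normalization details):

```python
from statistics import mean, pstdev


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Z-score rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Responses rewarded above the group mean get positive advantages and are reinforced; those below the mean are suppressed, so no separate value network is needed.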
Config: `google_scripts/exp_configs/vidr1_sat_zebra_grpo_mmlatent2discrete_qwenlatent1_new.yaml`
Prerequisites:
- A completed Stage 2 checkpoint
- Update `model_path` in the config to point to the Stage 2 checkpoint

Launch script:

```bash
bash google_scripts/launch_scripts/run_grpo_sat_vidr1_zebra_qwenlatent2discrete_1.sh
```

Run evaluations using the lmms-eval framework.
```bash
cd lmms-eval
sh examples/models/vidr1_sat_zebra_sft_mmlatent2discrete_qwenlatent1.sh
```

Or launch manually:

```bash
cd lmms-eval
MODEL_PATH="array/Qwen2.5-VL-Mull"  # or a local checkpoint path
MODEL_ARGS="pretrained=${MODEL_PATH},max_pixels=12845056,max_num_frames=16,attn_implementation=flash_attention_2,interleave_visuals=False"
accelerate launch --num_processes=4 -m lmms_eval \
    --model qwen2_5_vl_mmlatentdiscrete \
    --model_args="${MODEL_ARGS}" \
    --gen_kwargs=prompt_mode=mmlatent2,num_latents=20 \
    --tasks blink_iqtest,blink_sprel,sat_real,vsibench,erqa,mmsi_bench \
    --batch_size 1 \
    --output_path "./eval_outputs"
```

```yaml
# Run identifier
run_name: 'experiment_name'

# Dataset configuration
train_dataset_args:
  split: train
  mix_datas:
    'SAT': 0.6        # Dataset weight (0-1)
    'VideoR1': 0.2
    'ZebraCOT': 0.2
  sat_location: 'array/SAT'   # HF repo or local path
  video_r1_location: '/path/to/Video-R1.json'
  zebracot_location: 'multimodal-reasoning-lab/Zebra-CoT'
  mode: 'train'

# Mull-Tokens specific
mmlatent_mode_stage1: False   # Enable for Stage 1
mmlatent_mode_stage2: False   # Enable for Stage 2
mmlatent_rl_mode: False       # Enable for GRPO
num_latent_tokens: 20         # Number of latent tokens

# Model configuration
model_name: Qwen2.5-VL-7B     # or Qwen2.5-VL-7B-MMLatentDiscrete
model_path: 'path/to/model'

# Training options
freeze_vision: True           # Freeze the vision encoder
latent_size: 20               # Deprecated; not used
stage: stage1                 # stage1 or stage2
```

If you use this code, please cite:
```bibtex
@misc{ray2025mulltokensmodalityagnosticlatentthinking,
      title={Mull-Tokens: Modality-Agnostic Latent Thinking},
      author={Arijit Ray and Ahmed Abdelkader and Chengzhi Mao and Bryan A. Plummer and Kate Saenko and Ranjay Krishna and Leonidas Guibas and Wen-Sheng Chu},
      year={2025},
      eprint={2512.10941},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.10941},
}
```

This work builds upon the awesome work by:
