Tianming Liang Qirui Du Jian-Fang Hu Haichao Jiang Zicheng Lin Wei-Shi Zheng
ISEE Lab, Sun Yat-sen University
Through multi-turn interleaved reasoning and web search, Seg-ReSearch is able to localize and segment any text-guided target in images or videos, even those involving new concepts or up-to-date information that lies beyond the internal knowledge of MLLMs.
- [2026/02/06] All training and inference code for Seg-ReSearch has been released. Check it out!
- [2026/02/06] We have released OK-VOS, a challenging VOS Benchmark explicitly requiring external knowledge.
- [2026/02/04] The paper is available on arXiv.
Seg-ReSearch conducts multi-turn interactions with the search engine throughout the dynamic Multi-modal Chain-of-Thought (MCoT). This capability is incentivized by a hierarchical reward design: IGR pilots the initial planning, TPR encourages extensive exploration, and OR ensures final task accuracy.
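To make the hierarchical reward design concrete, here is a toy sketch of how three reward signals might be combined into a single scalar for policy optimization. The weights and the simple weighted-sum form are illustrative assumptions, not the paper's actual formulation.

```python
# Toy sketch of a hierarchical reward. The weighted-sum form and the
# default weights are hypothetical, for illustration only.
def hierarchical_reward(igr: float, tpr: float, outcome: float,
                        w_igr: float = 0.2, w_tpr: float = 0.3,
                        w_or: float = 0.5) -> float:
    """Combine three reward signals into one scalar.

    igr     -- initial planning reward in [0, 1]
    tpr     -- per-turn exploration reward in [0, 1]
    outcome -- final task-accuracy (outcome) reward in [0, 1]
    """
    return w_igr * igr + w_tpr * tpr + w_or * outcome
```

With these (hypothetical) weights, `hierarchical_reward(igr=1.0, tpr=0.5, outcome=1.0)` yields 0.85: the outcome term dominates, while planning and exploration still contribute.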
To support Qwen3-VL, we use verl==0.7.0.dev0 and vllm==0.11.0, which require pytorch>=2.8.0 and cuda>=12.6.
git submodule update --init --recursive
conda create --name seg-research python=3.10
conda activate seg-research
pip install -e verl
pip install -e ".[vllm,search_tool]"
pip install "flash-attn==2.8.3" --no-build-isolation

We recommend creating a separate environment for the retrieval server to avoid dependency conflicts.
conda create --name retrieval python=3.10
conda activate retrieval
conda install -c pytorch -c nvidia faiss-gpu=1.8.0
pip install transformers datasets fastapi numpy torch uvicorn

Download the OK-VOS dataset:

bash data/download_okvos.sh

Run the preprocessing script to prepare the datasets for training and evaluation. This script extracts the bounding boxes and points.
# Training Set
python examples/data_preprocess/okvos.py --split train --num_frames 6 --max_size 448 --min_size 448 --model_type Qwen3VL
# Test set
python examples/data_preprocess/okvos.py --split test --num_frames 6 --max_size 448 --min_size 448 --model_type Qwen3VL

- Activate the retrieval environment and start the retrieval server.

bash examples/train/okvos/start_retrieval.sh

- Activate the seg-research environment and run one of the following scripts.
- Qwen3-VL-4B-Instruct:
# GRPO Training
bash examples/train/okvos/train_4b.sh
# DAPO Training
bash examples/train/okvos/train_4b_dapo.sh

- Qwen3-VL-8B-Instruct:
Note: We set tensor_model_parallel_size=2 for 48 GB GPU memory. You can reduce it to 1 if your GPUs have more memory.
# GRPO Training
bash examples/train/okvos/train_8b.sh
# DAPO Training
bash examples/train/okvos/train_8b_dapo.sh

Remember to replace the checkpoint path in the evaluation script before running.
# Serper API (Google Search)
bash examples/train/okvos/eval.sh
# DuckDuckGo (Free API)
bash examples/train/okvos/eval_ddg.sh

The prediction results will be saved in a JSONL file.
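If you want to sanity-check the predictions before post-segmentation, a minimal loader like the one below can help. It assumes only the standard one-JSON-object-per-line format; the actual field names depend on the evaluation script.

```python
# Minimal sketch for inspecting a prediction JSONL file.
# Only the one-JSON-object-per-line format is assumed; field
# names vary with the evaluation script that produced the file.
import json

def load_predictions(path: str) -> list[dict]:
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

For example, `load_predictions("path/to/predictions.jsonl")` (path hypothetical) returns all records, so you can quickly check the record count and available keys before running SAM 2.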
Based on the generated jsonl file, run the segmentation model to generate object masks:
python post_segmentation/sam2_okvos.py [path_to_jsonl]

You can integrate an auxiliary LLM (e.g., Qwen3-Next-80B-A3B-Instruct-FP8) to act as a summarizer, empowering Seg-ReSearch with web browsing capabilities for more precise retrieval results. Give it a try! 😊
To enable this feature,
- Launch the LLM service:
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
--port 8000 \
--tensor-parallel-size 2 \
--max-model-len 262144

- Then, set SUMM_MODEL_URL and SUMM_MODEL_PATH in eval.sh, for example:
export SUMM_MODEL_URL="http://localhost:8000/v1"
export SUMM_MODEL_PATH="Qwen/Qwen3-Next-80B-A3B-Instruct-FP8"

Our work is built upon verl-tool, Seg-Zero, and SeC. We sincerely appreciate these excellent works.
If you find our work helpful for your research, please consider citing our paper.
@article{liang2026segresearch,
title={Seg-ReSearch: Segmentation with Interleaved Reasoning and External Search},
author={Tianming Liang and Qirui Du and Jian-Fang Hu and Haichao Jiang and Zicheng Lin and Wei-Shi Zheng},
journal={arXiv preprint arXiv:2602.04454},
year={2026}
}