Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Generation
This is a PyTorch/GPU implementation of the paper "Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Generation", referred to as VFMTok, which establishes new state-of-the-art performance (gFID: 1.33, gIS: 317.4) for class-conditional image generation based on the RAR framework.
VFMTok presents the first experimental evidence that features from existing pre-trained vision foundation models (including DINOv2, SigLIP, SigLIP2, etc.) can be directly utilized to reconstruct the original images. To accomplish this, VFMTok introduces two innovative components: (1) a region-adaptive quantization framework that reduces the redundancy of pre-trained features laid out on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity.
When integrated into AR generative models, the trained VFMTok achieves remarkable performance in class-conditional image generation while converging three times faster. Additionally, it enables high-fidelity class-conditional synthesis without requiring classifier-free guidance (CFG).
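As a rough illustration of the semantic reconstruction objective described above, the sketch below aligns features decoded from the quantized tokens with the frozen foundation model's features through a cosine-similarity term. This is a minimal sketch under our own assumptions (tensor shapes, loss weighting, and function names are hypothetical), not the repository's actual implementation:

```python
import torch
import torch.nn.functional as F

def semantic_reconstruction_loss(pred_feats: torch.Tensor,
                                 vfm_feats: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: align tokenizer outputs with frozen VFM features.

    pred_feats: features predicted from the quantized tokens, shape (B, N, C)
    vfm_feats:  features from the frozen foundation model (e.g. DINOv2), shape (B, N, C)
    """
    pred = F.normalize(pred_feats, dim=-1)
    target = F.normalize(vfm_feats.detach(), dim=-1)   # the frozen VFM provides the target
    return (1.0 - (pred * target).sum(dim=-1)).mean()  # mean (1 - cosine similarity)

# Hypothetical total objective: pixel reconstruction + VQ losses + the semantic term, e.g.
# loss = recon_loss + vq_loss + lambda_sem * semantic_reconstruction_loss(pred, dino_feats)
```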
This repo contains:
- 🪐 A simple PyTorch implementation of VFMTok and new state-of-the-art AR generative models.
- ⚡️ Pre-trained tokenizer: VFMTok and AR generative models trained on ImageNet.
- 🛸 Training and evaluation scripts for the tokenizer and the generative models.
- 🎉 Pre-trained models hosted on Hugging Face for easy access.
- [2025/07/11] 🔥 The VFMTok paper has been released. Check out the paper for details. 🔥
- [2025/09/18] 🔥 VFMTok has been accepted by NeurIPS 2025! 🔥
- [2025/10/11] 🔥 Image tokenizers and AR models for class-conditional image generation are released. 🔥
- [2025/10/11] 🔥 All code of VFMTok has been released. 🔥
If you are not using Linux, do NOT proceed.
- Clone this repository and navigate to the VFMTok folder
git clone https://github.com/CVMI-Lab/VFMTok.git
cd VFMTok
- Install the package
conda create -n vfmtok python=3.10 -y
conda activate vfmtok
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training as required.
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
- Install deformable attention module
cd vfmtok/modules/ops
bash make.sh
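After running make.sh, a quick sanity check can confirm that both flash-attn and the compiled deformable attention extension import correctly. This is a sketch: the extension name MultiScaleDeformableAttention is an assumption based on the Deformable DETR ops layout this repo builds on, not something verified against this codebase:

```python
# Quick environment sanity check (sketch; the deformable extension name is assumed).
import torch

assert torch.cuda.is_available(), "CUDA is required for flash-attn and the deformable ops"

from flash_attn import flash_attn_func           # provided by the flash-attn package
import MultiScaleDeformableAttention as MSDA     # built by vfmtok/modules/ops/make.sh (assumed name)

print("flash-attn OK:", flash_attn_func is not None)
print("deformable attention ops OK:", MSDA is not None)
```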
In this repo, we release:
- One image tokenizer: VFMTok(DINOv2).
- State-of-the-art class-conditional autoregressive generative models ranging from 461M to 1.5B parameters.
In this repo, we release one image tokenizer: VFMTok(DINOv2). It directly utilizes features from the frozen pre-trained VFM, DINOv2, to reconstruct images. In addition, VFMTok introduces two key components, region-adaptive quantization and semantic reconstruction, to reduce redundancy in the pre-trained features and preserve semantic fidelity, respectively.
| Method | tokens | rFID (256x256) | rIS (256x256) | weight |
|---|---|---|---|---|
| VFMTok | 256 | 0.98 | 216.2 | vfmtok-tokenizer.pt |
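For reference, reconstruction metrics such as the rFID/rIS above can be computed from two image folders with the torch-fidelity package. This is a generic sketch with placeholder paths, not the evaluation protocol used to produce the numbers in the table:

```python
# pip install torch-fidelity
from torch_fidelity import calculate_metrics

# "recons" is e.g. the folder written by vqgan_test.py; "imagenet_val" holds reference images.
metrics = calculate_metrics(
    input1="recons",        # reconstructed images (placeholder path)
    input2="imagenet_val",  # ground-truth validation images (placeholder path)
    cuda=True,
    fid=True,               # Frechet Inception Distance -> rFID
    isc=True,               # Inception Score of input1 -> rIS
)
print(metrics["frechet_inception_distance"], metrics["inception_score_mean"])
```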
Once the trained VFMTok(DINOv2) is integrated into the autoregressive (AR) generative model RAR, it achieves new state-of-the-art image generation performance. We provide two variants of AR generative models: ultra and vanilla. The ultra variant sets a new state of the art in image synthesis, while the vanilla variants also deliver strong generation performance.
| Method | params | epochs | FID | sFID | IS | Pre. | Rec. |
|---|---|---|---|---|---|---|---|
| RAR-L-ultra | 461M | 400 | 1.33 | 5.72 | 317.4 | 0.78 | 0.65 |
| RAR-L-vanilla | 461M | 400 | 1.44 | 6.03 | 312.8 | 0.78 | 0.66 |
| RAR-XL-vanilla | 955M | 400 | 1.38 | 5.86 | 310.2 | 0.78 | 0.65 |
| RAR-XXL-vanilla | 1.5B | 400 | 1.36 | 5.86 | 301.3 | 0.78 | 0.66 |
The trained VFMTok(DINOv2), when integrated into the AR generative models, also achieves impressive image generation quality without classifier-free guidance (CFG-free).
| Method | params | epochs | FID | sFID | IS | Pre. | Rec. |
|---|---|---|---|---|---|---|---|
| RAR-L-ultra | 461M | 400 | 2.01 | 5.34 | 211.1 | 0.78 | 0.63 |
| RAR-L-vanilla | 461M | 400 | 2.02 | 5.51 | 210.4 | 0.79 | 0.63 |
| RAR-XL-vanilla | 955M | 400 | 1.74 | 5.33 | 233.0 | 0.80 | 0.63 |
| RAR-XXL-vanilla | 1.5B | 400 | 1.65 | 5.55 | 253.7 | 0.80 | 0.63 |
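For context, classifier-free guidance combines conditional and unconditional logits at sampling time, which doubles the forward passes; CFG-free sampling (the table above) simply samples from the conditional logits. A schematic sketch of the standard combination, not the repository's sampling code:

```python
import torch

def guided_logits(cond_logits: torch.Tensor,
                  uncond_logits: torch.Tensor,
                  guidance_scale: float) -> torch.Tensor:
    """Standard CFG mixing; guidance_scale = 1.0 reduces to the conditional logits."""
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

# CFG-free sampling: skip the unconditional forward pass entirely and
# sample the next token directly from cond_logits.
```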
- Download the DINOv2-L pre-trained foundation model from the official model zoo.
- Create symbolic links in this directory that point to the pre-trained DINOv2-L model folder and the ImageNet training dataset folder.
- Create a dataset script for your own dataset. Here, we provide a template for training tokenizers and AR generative models using the ImageNet dataset in LMDB format (a minimal LMDB dataset sketch follows the symlink commands below).
ln -s DINOv2-L_folder init_models
ln -s ImageNetFolder imagenet
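A minimal, hypothetical LMDB dataset sketch is shown below; the actual key and record format of the provided train_lmdb is not documented here, so adapt the parsing (keys, serialization) to your own data:

```python
import io
import pickle

import lmdb
from PIL import Image
from torch.utils.data import Dataset


class LMDBImageNet(Dataset):
    """Hypothetical LMDB-backed ImageNet dataset: each record is assumed to be a
    pickled (jpeg_bytes, class_label) pair stored under the key str(index)."""

    def __init__(self, lmdb_path: str, transform=None):
        self.env = lmdb.open(lmdb_path, readonly=True, lock=False, readahead=False)
        self.transform = transform
        with self.env.begin() as txn:
            self.length = txn.stat()["entries"]

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        with self.env.begin() as txn:
            jpeg_bytes, label = pickle.loads(txn.get(str(index).encode()))
        img = Image.open(io.BytesIO(jpeg_bytes)).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        return img, label
```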
- Training VFMTok(DINOv2) tokenizer (see scripts/tokenizer/train_tok.sh):
export NODE_COUNT=1
export NODE_RANK=0
export PROC_PER_NODE=8
scripts/autoregressive/torchrun.sh vq_train.py --image-size 336 --results-dir output --mixed-precision none --codebook-slots-embed-dim 12 \
--data-path imagenet/lmdb/train_lmdb --global-batch-size 8 --num-workers 4 --ckpt-every 5000 --epochs 50 \
--transformer-config configs/vit_transformer.yaml --log-every 1 --lr 1e-4 --ema --z-channels 512
- Training AR generative models (see scripts/autoregressive/run_train.sh):
config_file='configs/training/generator/rar.yaml'
accelerate launch --config_file $1 train_rar.py --config-file ${config_file} --image-size 336 --anno-file imagenet/lmdb/train_lmdb --num-workers 4
- Resume from an AR generative checkpoint:
config_file='configs/training/generator/rar.yaml'
accelerate launch --config_file $1 train_rar.py --config-file ${config_file} --image-size 336 --anno-file imagenet/lmdb/train_lmdb --num-workers 4
- Evaluate a pretrained tokenizer (see scripts/tokenizer/run_tok.sh):
scripts/autoregressive/torchrun.sh vqgan_test.py --vq-model VQ-16 --image-size 336 --output_dir recons --batch-size $1 \
--z-channels 512 --vq-ckpt tokenizer/vfmtok-tokenizer.pt --codebook-slots-embed-dim 12
- Evaluate a pretrained AR generative model (see scripts/autoregressive/run_test.sh):
config_file='configs/training/generator/rar.yaml'
iters="checkpoint-$(printf "%06d" "$1")"
scripts/autoregressive/torchrun.sh test_net.py --config-file ${config_file} --compile \
--gpt-ckpt snapshot/RAR-L/${iters}/model.safetensors --image-size 256 --image-size-eval 256 --per-proc-batch-size $2 \
--guidance-scale $3 --sample-dir samples --guidance-scale-pow 1
If you find VFMTok useful for your research and applications, please kindly cite using this BibTeX:
@article{zheng2025vision,
title={Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation},
author={Zheng, Anlin and Wen, Xin and Zhang, Xuanyang and Ma, Chuofan and Wang, Tiancai and Yu, Gang and Zhang, Xiangyu and Qi, Xiaojuan},
journal={arXiv preprint arXiv:2507.08441},
year={2025}
}
The majority of this project is licensed under the Apache 2.0 License. Portions of the project are available under the separate licenses of the referenced projects, as detailed in the corresponding files.
Our codebase builds upon several excellent open-source projects, including LlamaGen, Deformable DETR, RAR, and AliTok. We are grateful to the communities behind them.
This codebase has been cleaned up but has not undergone extensive testing. If you encounter any issues or have questions, please open a GitHub issue. We appreciate your feedback!
