Official implementation of
ViT-5: Vision Transformers for the Mid-2020s
📄 Paper: https://arxiv.org/abs/2602.08071
🤗 Hugging Face: https://huggingface.co/FengWang3211/ViT-5
ViT-5 modernizes the canonical Vision Transformer architecture while preserving the clean Attention–FFN backbone design.
Rather than introducing a new paradigm, ViT-5 systematically upgrades core components of ViT using insights accumulated over the past several years in large-scale vision modeling.
It serves as:
- A strong ImageNet classification backbone
- A scalable Transformer foundation
- A drop-in upgrade for modern vision pipelines
- A competitive backbone for generative modeling
ViT-5 also improves performance when used as a backbone in diffusion-style generative frameworks.
| Model | Input Resolution | Params | Top-1 (ImageNet-1K) | HF Link |
|---|---|---|---|---|
| ViT-5-Small | 224 | 22M | 82.2% | Download |
| ViT-5-Base | 224 | 87M | 84.2% | Download |
| ViT-5-Base | 384 | 87M | 85.4% | Download |
| ViT-5-Large | 224 | 304M | 84.9% | Download |
| ViT-5-Large | 384 | 304M | 86.0% | Download |
# Install PyTorch (CUDA 12.4)
pip install torch==2.4.1 torchvision --index-url https://download.pytorch.org/whl/cu124
# Install core dependencies
pip install timm==0.4.12 numpy==1.26.4 wandb einops
# Install NVIDIA Apex (required for fused optimizers)
git clone https://github.com/NVIDIA/apex
cd apex
APEX_CPP_EXT=1 APEX_CUDA_EXT=1 pip install -v --no-build-isolation .
# (Optional) Install Flash Attention for faster training
pip install flash-attn==2.6.3 --no-build-isolation# ImageNet Pretraining (8 GPUs example)
# ViT-5-Small
torchrun --nproc_per_node 8 main.py \
--model vit5_small --input-size 224 --data-path YOUR_IMAGENET_PATH --output_dir DIR_TO_SAVE_LOG_AND_CKPT \
--batch 256 --accum_iter 1 --lr 4e-3 --weight-decay 0.05 --epochs 800 --opt fusedlamb --unscale-lr \
--mixup .8 --cutmix 1.0 --color-jitter 0.3 --drop-path 0.05 --reprob 0.0 --smoothing 0.0 --ThreeAugment \
--repeated-aug --bce-loss --warmup-epochs 5 --eval-crop-ratio 1.0 --dist-eval --disable_wandb
# ViT-5-Base
torchrun --nproc_per_node 8 main.py \
--model vit5_base --input-size 192 --data-path YOUR_IMAGENET_PATH --output_dir DIR_TO_SAVE_LOG_AND_CKPT \
--batch 256 --accum_iter 1 --lr 3e-3 --weight-decay 0.05 --epochs 800 --opt fusedlamb --unscale-lr \
--mixup .8 --cutmix 1.0 --color-jitter 0.3 --drop-path 0.2 --reprob 0.0 --smoothing 0.0 --ThreeAugment \
--repeated-aug --bce-loss --warmup-epochs 5 --eval-crop-ratio 1.0 --dist-eval --disable_wandb
# ViT-5-Large
torchrun --nproc_per_node 8 main.py \
--model vit5_large --input-size 192 --data-path YOUR_IMAGENET_PATH --output_dir DIR_TO_SAVE_LOG_AND_CKPT \
--batch 256 --accum_iter 1 --lr 3e-3 --weight-decay 0.05 --epochs 400 --opt fusedlamb --unscale-lr \
--mixup .8 --cutmix 1.0 --color-jitter 0.3 --drop-path 0.35 --reprob 0.0 --smoothing 0.0 --ThreeAugment \
--repeated-aug --bce-loss --warmup-epochs 5 --eval-crop-ratio 1.0 --dist-eval --disable_wandb
# Fine-tuning from Pretrained Checkpoint
# ViT-5-Small
torchrun --nproc_per_node 8 main.py \
--model vit5_small --finetune PATH_TO_YOUR_CKPT --data-path YOUR_DATASET_PATH --output_dir DIR_TO_SAVE_LOG_AND_CKPT \
--batch 64 --lr 1e-5 --weight-decay 0.1 --epochs 20 --unscale-lr --aa rand-m9-mstd0.5-inc1 --drop-path 0.05 \
--reprob 0.0 --smoothing 0.1 --no-repeated-aug --dist-eval --load_ema --eval-crop-ratio 1.0 --disable_wandb
# ViT-5-Base
torchrun --nproc_per_node 8 main.py \
--model vit5_base --finetune PATH_TO_YOUR_CKPT --data-path YOUR_DATASET_PATH --output_dir DIR_TO_SAVE_LOG_AND_CKPT \
--batch 64 --lr 1e-5 --weight-decay 0.1 --epochs 20 --unscale-lr --aa rand-m9-mstd0.5-inc1 --drop-path 0.25 \
--reprob 0.0 --smoothing 0.1 --no-repeated-aug --dist-eval --load_ema --eval-crop-ratio 1.0 --disable_wandb
# ViT-5-Large
torchrun --nproc_per_node 8 main.py \
--model vit5_large --finetune PATH_TO_YOUR_CKPT --data-path YOUR_DATASET_PATH --output_dir DIR_TO_SAVE_LOG_AND_CKPT \
--batch 64 --lr 1e-5 --weight-decay 0.1 --epochs 20 --unscale-lr --aa rand-m9-mstd0.5-inc1 --drop-path 0.5 \
--reprob 0.0 --smoothing 0.1 --no-repeated-aug --dist-eval --load_ema --eval-crop-ratio 1.0 --disable_wandbIf you use ViT-5, please cite:
@article{wang2026vit5,
title={ViT-5: Vision Transformers for The Mid-2020s},
author={Wang, Feng and Ren, Sucheng and Zhang, Tiezheng and Neskovic, Predrag and Bhattad, Anand and Xie, Cihang and Yuille, Alan},
journal={arXiv preprint arXiv:2602.08071},
year={2026}
}
This work builds upon the strong foundation of Vision Transformers and recent advances in scalable Transformer architectures. The code is basically built upon DeiT.



