This is the official repository for the NeurIPS 2025 paper "Towards Implicit Aggregation: Robust Image Representation for Place Recognition in the Transformer Era".
ImAge is an implicit aggregation method that produces robust global image descriptors for visual place recognition. It neither modifies the backbone nor requires an extra aggregator: it only adds a few aggregation tokens before a specific block of the transformer backbone and leverages the inherent self-attention mechanism to implicitly aggregate patch features. Our method offers a novel perspective that differs from the previous paradigm and achieves SOTA performance effectively and efficiently.
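The idea of inserting aggregation tokens partway through a transformer can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the ImAge implementation: `ToyImplicitAggregation`, its layer sizes, and the use of `nn.TransformerEncoderLayer` in place of the DINOv2 backbone are all hypothetical, and forming the descriptor by concatenating and L2-normalizing the output aggregation tokens is an assumption made for illustration.

```python
import torch
import torch.nn as nn


class ToyImplicitAggregation(nn.Module):
    """Toy stand-in for a ViT backbone with aggregation tokens (illustrative only)."""

    def __init__(self, dim=64, depth=4, insert_at=2, num_agg_tokens=8, heads=4):
        super().__init__()
        # Generic encoder layers stand in for the real backbone blocks.
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                dropout=0.0, batch_first=True, norm_first=True,
            )
            for _ in range(depth)
        ])
        self.insert_at = insert_at
        # Learnable aggregation tokens, shared across all images.
        self.agg_tokens = nn.Parameter(torch.zeros(1, num_agg_tokens, dim))
        nn.init.normal_(self.agg_tokens, std=0.02)

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, dim)
        x = patch_tokens
        for i, block in enumerate(self.blocks):
            if i == self.insert_at:
                # Add the aggregation tokens just before this block.
                # Prepending (rather than appending) is an arbitrary choice here.
                agg = self.agg_tokens.expand(x.shape[0], -1, -1)
                x = torch.cat([agg, x], dim=1)
            # Self-attention lets the aggregation tokens attend to patch tokens.
            x = block(x)
        # Assumption: the output aggregation tokens form the global descriptor.
        agg_out = x[:, : self.agg_tokens.shape[1]]
        descriptor = agg_out.flatten(1)
        return nn.functional.normalize(descriptor, dim=-1)


if __name__ == "__main__":
    model = ToyImplicitAggregation().eval()
    with torch.no_grad():
        descriptors = model(torch.randn(2, 16, 64))
    print(descriptors.shape)
```

No separate aggregation module appears in this sketch: the only added parameters are the tokens themselves, and the mixing of patch information happens inside the existing attention layers.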
The difference between ImAge and the previous paradigm is shown in this figure:
To quickly use our model, you can use Torch Hub:
```python
import torch

model = torch.hub.load("Lu-Feng/ImAge", "ImAge")
```
This repo follows the framework of GSV-Cities for training, and the Visual Geo-localization Benchmark for evaluation. You can download the GSV-Cities datasets HERE, and refer to VPR-datasets-downloader to prepare test datasets.
The test dataset should be organized in a directory tree as follows:

```
├── datasets_vg
    └── datasets
        └── pitts30k
            └── images
                ├── train
                │   ├── database
                │   └── queries
                ├── val
                │   ├── database
                │   └── queries
                └── test
                    ├── database
                    └── queries
```
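Before launching training or evaluation, it can help to confirm that a dataset folder matches this layout. The following is a minimal sketch using only the Python standard library; `missing_dataset_dirs` is a hypothetical helper and is not part of this repository.

```python
from pathlib import Path

SPLITS = ("train", "val", "test")
SUBSETS = ("database", "queries")


def missing_dataset_dirs(datasets_folder, dataset_name):
    """Return the expected split/subset directories that do not exist."""
    images_root = Path(datasets_folder) / dataset_name / "images"
    missing = []
    for split in SPLITS:
        for subset in SUBSETS:
            candidate = images_root / split / subset
            if not candidate.is_dir():
                missing.append(candidate)
    return missing


if __name__ == "__main__":
    # Replace with your own datasets folder (the value passed to --eval_datasets_folder).
    for path in missing_dataset_dirs("/path/to/your/datasets_vg/datasets", "pitts30k"):
        print(f"missing: {path}")
```

An empty list means all six `database` and `queries` folders are present for the `train`, `val`, and `test` splits.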
Before training, download the pre-trained foundation model DINOv2-register (ViT-B/14) HERE.
```bash
python3 train.py \
  --eval_datasets_folder=/path/to/your/datasets_vg/datasets \
  --eval_dataset_name=pitts30k \
  --backbone=dinov2 \
  --freeze_te=8 \
  --num_learnable_aggregation_tokens=8 \
  --train_batch_size=120 \
  --lr=0.00005 \
  --epochs_num=20 \
  --patience=20 \
  --initialization_dataset=msls_train \
  --training_dataset=gsv_cities \
  --foundation_model_path=/path/to/pre-trained/dinov2_vitb14_reg4_pretrain.pth
```
If you don't have the MSLS-train dataset, you can also set --initialization_dataset=gsv_cities.
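For scripted runs, the training command can also be assembled in Python so that the initialization dataset is switched in one place. This is a sketch only: `build_train_command` is a hypothetical helper that is not part of this repository, and the flag names and values are copied from the training command in this README.

```python
import shlex


def build_train_command(eval_datasets_folder, foundation_model_path, has_msls_train=True):
    """Return the README's training command as an argument list."""
    # Fall back to GSV-Cities for initialization when MSLS-train is unavailable.
    initialization_dataset = "msls_train" if has_msls_train else "gsv_cities"
    # Values are kept as strings so they appear exactly as written in the README.
    flags = {
        "eval_datasets_folder": eval_datasets_folder,
        "eval_dataset_name": "pitts30k",
        "backbone": "dinov2",
        "freeze_te": "8",
        "num_learnable_aggregation_tokens": "8",
        "train_batch_size": "120",
        "lr": "0.00005",
        "epochs_num": "20",
        "patience": "20",
        "initialization_dataset": initialization_dataset,
        "training_dataset": "gsv_cities",
        "foundation_model_path": foundation_model_path,
    }
    return ["python3", "train.py"] + [f"--{name}={value}" for name, value in flags.items()]


if __name__ == "__main__":
    command = build_train_command(
        "/path/to/your/datasets_vg/datasets",
        "/path/to/pre-trained/dinov2_vitb14_reg4_pretrain.pth",
        has_msls_train=False,
    )
    # Print a shell-safe command line; pass `command` to subprocess.run to execute it.
    print(shlex.join(command))
```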
```bash
python3 eval.py \
  --eval_datasets_folder=/path/to/your/datasets_vg/datasets \
  --eval_dataset_name=pitts30k \
  --backbone=dinov2 \
  --freeze_te=8 \
  --num_learnable_aggregation_tokens=8 \
  --resume=/path/to/trained/model/ImAge_GSV.pth
```
| Training set | Pitts30k | MSLS-val | Nordland | Download |
|---|---|---|---|---|
| GSV-Cities | 94.0 | 93.0 | 93.2 | LINK |
| Unified dataset | 94.1 | 94.5 | 97.7 | LINK |
Note: the code for merging previous VPR datasets into the unified (merged) dataset is still being refined and will be released alongside the code of SelaVPR++. Please wait patiently.
This repository also supports training NetVLAD, SALAD, and BoQ on the GSV-Cities dataset with plain PyTorch (rather than PyTorch Lightning, as used in other repos) and Automatic Mixed Precision.
Parts of this repo are inspired by the following repositories:
Visual Geo-localization Benchmark
If you find this repo useful for your research, please consider leaving a star⭐️ and citing the paper:
@inproceedings{ImAge,
title={Towards Implicit Aggregation: Robust Image Representation for Place Recognition in the Transformer Era},
author={Feng Lu and Tong Jin and Canming Ye and Xiangyuan Lan and Yunpeng Liu and Chun Yuan},
booktitle={The Annual Conference on Neural Information Processing Systems},
year={2025}
}

