
Using other ViT models yields lower mIoU score #9

@yiheng003

Description


Thanks for open-sourcing this great work!

I am able to reproduce the reported metrics using the ViT-B/16 encoder backbone on the Cityscapes and VOC20 datasets. However, replacing the vision encoder with ViT-B/32 or ViT-L/14@336px in configs/base_config.py, while keeping all other configuration unchanged, results in lower scores. Below is a summary table of dataset, pretrained vision encoder, and mIoU:

| dataset | encoder | mIoU |
|---|---|---|
| Cityscapes | ViT-B/16 | 32.35 |
| Cityscapes | ViT-B/32 | 22.62 |
| Cityscapes | ViT-L/14@336px | 12.44 |
| VOC20 | ViT-B/16 | 81.53 |
| VOC20 | ViT-B/32 | 76.67 |
| VOC20 | ViT-L/14@336px | 50.35 |

There seems to be something off with the other pretrained CLIP vision encoders, especially ViT-L/14@336px. Are there any parameters or configuration settings that need to be adjusted for other vision encoders? Could you suggest possible reasons for the lower performance? Thank you so much!
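One possible factor worth checking: each CLIP ViT variant has a different patch size and native training resolution, so the patch-token grid the segmentation head works from changes shape when the encoder is swapped. A minimal sketch (the model names are OpenAI CLIP's; the resolutions are the checkpoints' native training sizes, used here purely for illustration):

```python
# Sketch: how a CLIP ViT's patch size and input resolution determine
# the size of the patch-token grid that dense prediction builds on.

def patch_grid(input_size: int, patch_size: int) -> tuple[int, int]:
    """Return the (height, width) of the patch-token grid."""
    if input_size % patch_size != 0:
        raise ValueError(
            f"input size {input_size} not divisible by patch size {patch_size}"
        )
    side = input_size // patch_size
    return side, side

# Native training resolutions of the OpenAI CLIP checkpoints:
encoders = {
    "ViT-B/16": (224, 16),        # 14x14 tokens
    "ViT-B/32": (224, 32),        # 7x7 tokens: much coarser features
    "ViT-L/14@336px": (336, 14),  # 24x24 tokens, but expects 336px input
}

for name, (res, patch) in encoders.items():
    h, w = patch_grid(res, patch)
    print(f"{name}: {h}x{w} patch tokens at {res}px input")
```

If the evaluation pipeline keeps the ViT-B/16 input resolution (or does not interpolate positional embeddings for the new grid size), ViT-B/32's much coarser 7x7 grid and ViT-L/14@336px's resolution mismatch could plausibly explain drops like the ones tabulated above; this is a guess, not a confirmed diagnosis.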
