
Using other ViT models yields lower mIoU score #9

@yiheng003

Description


Thanks for open-sourcing this great work!

I am able to reproduce the reported metrics using the ViT-B/16 encoder backbone on the Cityscapes and VOC20 datasets. However, replacing the vision encoder with ViT-B/32 or ViT-L/14@336px in configs/base_config.py, while keeping all other configuration unchanged, results in lower scores. Below is a summary table of dataset, pretrained vision encoder, and mIoU:

| dataset | encoder | mIoU |
|---|---|---|
| Cityscapes | ViT-B/16 | 32.35 |
| Cityscapes | ViT-B/32 | 22.62 |
| Cityscapes | ViT-L/14@336px | 12.44 |
| VOC20 | ViT-B/16 | 81.53 |
| VOC20 | ViT-B/32 | 76.67 |
| VOC20 | ViT-L/14@336px | 50.35 |

There seems to be something off with the other pretrained CLIP vision encoders, especially ViT-L/14@336px. Are there any parameters or configuration settings that need to be adjusted for other vision encoders? Could you suggest possible reasons for the lower performance? Thank you so much!
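One possible factor worth checking: each CLIP ViT variant has a different patch size and native training resolution, so the patch-token grid the segmentation head works from changes shape when the encoder is swapped. A minimal sketch (the model names are OpenAI CLIP's; the resolutions are the checkpoints' native training sizes, used here purely for illustration):

```python
# Sketch: how a CLIP ViT's patch size and input resolution determine
# the size of the patch-token grid that dense prediction builds on.

def patch_grid(input_size: int, patch_size: int) -> tuple[int, int]:
    """Return the (height, width) of the patch-token grid."""
    if input_size % patch_size != 0:
        raise ValueError(
            f"input size {input_size} not divisible by patch size {patch_size}"
        )
    side = input_size // patch_size
    return side, side

# Native training resolutions of the OpenAI CLIP checkpoints:
encoders = {
    "ViT-B/16": (224, 16),        # 14x14 tokens
    "ViT-B/32": (224, 32),        # 7x7 tokens: much coarser features
    "ViT-L/14@336px": (336, 14),  # 24x24 tokens, but expects 336px input
}

for name, (res, patch) in encoders.items():
    h, w = patch_grid(res, patch)
    print(f"{name}: {h}x{w} patch tokens at {res}px input")
```

If the evaluation pipeline keeps the ViT-B/16 input resolution (or does not interpolate positional embeddings for the new grid size), ViT-B/32's much coarser 7x7 grid and ViT-L/14@336px's resolution mismatch could plausibly explain drops like the ones tabulated above; this is a guess, not a confirmed diagnosis.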
