- pytorch>=2.0.1
- torchvision>=0.15.2
- munkres>=1.1.4
- scikit-learn>=1.2.2
- clip>=1.0
- timm>=0.9.2
- faiss-gpu>=1.7.4
CIFAR-10 will be automatically downloaded by PyTorch (via torchvision).
To improve the readability and extensibility of the code, we split the steps of our method into separate .py files. Below is a step-by-step tutorial. Note that intermediate results are saved to the ./data folder.
We first need to compute the image embedding with the CLIP model by running
```bash
python image_embedding.py
```
and the embedding of WordNet nouns (provided in the ./data folder) for text space construction by running
```bash
python text_embedding.py
```
Next, we aim to find discriminative nouns that describe the image semantic centers. Motivated by the zero-shot classification paradigm of CLIP, we reversely classify all nouns into the image semantic centers and filter out the indiscriminative ones by running
```bash
python filter_nouns.py
```
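Conceptually, this reverse classification treats the image semantic centers as "classes" and assigns each candidate noun to its nearest center by cosine similarity, discarding nouns that no center matches confidently. The following is a minimal pure-Python sketch of that idea; the function name, threshold, and toy vectors are illustrative, not the actual implementation:

```python
import math

def cosine(u, v):
    # cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def filter_nouns(noun_embs, centers, threshold=0.5):
    """Reversely classify each noun embedding into its nearest image
    semantic center; keep nouns whose best similarity passes the threshold.
    (Toy sketch -- the threshold and selection rule are assumptions.)"""
    selected = []
    for noun, emb in noun_embs.items():
        sims = [cosine(emb, c) for c in centers]
        best = max(sims)
        if best >= threshold:
            selected.append((noun, sims.index(best)))
    return selected

# toy example: two image centers, three candidate nouns;
# the generic noun "entity" matches no center confidently and is dropped
centers = [[1.0, 0.0], [0.0, 1.0]]
nouns = {"airplane": [0.9, 0.1], "truck": [0.1, 0.95], "entity": [0.5, 0.5]}
selected = filter_nouns(nouns, centers, threshold=0.8)
```

In the actual pipeline, the noun embeddings come from the text encoder of CLIP and the image semantic centers are presumably obtained by clustering the image embeddings computed in the previous step.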
The selected nouns compose a text space tailored to the input images. Then, we retrieve nouns for each image to compute its counterpart in the text modality by running
```bash
python retrieve_text.py
```
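The retrieval step can be pictured as a nearest-neighbor search from each image into the selected noun set, with the retrieved noun embeddings aggregated into a text-modality counterpart of the image. A toy sketch (the top-k value and the simple averaging are illustrative assumptions):

```python
import math

def cosine(u, v):
    # cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def retrieve_text(image_emb, noun_embs, k=2):
    """Rank nouns by similarity to the image and average the top-k
    noun embeddings to form the image's text-modality counterpart."""
    ranked = sorted(noun_embs, key=lambda e: cosine(image_emb, e), reverse=True)
    top = ranked[:k]
    dim = len(image_emb)
    return [sum(e[i] for e in top) / k for i in range(dim)]

# toy: an image near the first axis retrieves the two closest nouns
text_feat = retrieve_text([1.0, 0.1],
                          [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0]], k=2)
```

In practice, the search runs over CLIP embeddings (the faiss-gpu dependency suggests an approximate nearest-neighbor index rather than the brute-force loop shown here).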
For better collaboration between image and text features, we train additional cluster heads to further improve the clustering performance by running
```bash
python train_head.py
```
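A cluster head can be thought of as a small classifier on top of the frozen features: image and text features are combined and mapped to soft cluster assignments. A pure-Python forward-pass sketch (the concatenation and the linear-plus-softmax head are illustrative assumptions; the actual head is trained, which is why the step involves a training script):

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of logits
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cluster_assign(image_feat, text_feat, weights):
    """Toy cluster head: concatenate image and text features, apply a
    linear layer (one weight row per cluster), and softmax the logits."""
    feat = image_feat + text_feat  # list concatenation = feature concat
    logits = [sum(w * f for w, f in zip(row, feat)) for row in weights]
    return softmax(logits)

# toy: 2-D image feature + 2-D text feature, two clusters
probs = cluster_assign([1.0, 0.0], [0.0, 1.0],
                       weights=[[2.0, 0.0, 0.0, 2.0],
                                [0.0, 2.0, 2.0, 0.0]])
```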
The training is extremely efficient, taking only about one minute on the CIFAR-10 dataset.
Our implementation uses the codebase of TAC.