
nagara214/CS282_Project_PIC


Dependencies

  • pytorch>=2.0.1
  • torchvision>=0.15.2
  • munkres>=1.1.4
  • scikit-learn>=1.2.2
  • clip>=1.0
  • timm>=0.9.2
  • faiss-gpu>=1.7.4

Dataset

CIFAR-10 will be automatically downloaded by PyTorch (via torchvision).

Usage

To improve the readability and extensibility of the code, we split the different steps of our method into separate .py files. Below is a step-by-step tutorial. Note that intermediate results are saved to the ./data folder.

Image and Text Embedding Inference

We first compute the image embeddings with the CLIP model by running

python image_embedding.py

and the embeddings of WordNet nouns (provided in the ./data folder) for text space construction by running

python text_embedding.py
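Both scripts follow the same basic pattern: encode the inputs with CLIP, L2-normalize the features so that inner products become cosine similarities, and save the result to ./data for the later steps. A minimal sketch of that pattern, using random numpy arrays as a stand-in for the real CLIP encoder outputs (array shapes and file names here are illustrative, not the repository's exact ones):

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Normalize each row to unit length so that x @ y.T is cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Stand-ins for CLIP features: in the real scripts these come from
# model.encode_image on CIFAR-10 images and model.encode_text on WordNet nouns.
rng = np.random.default_rng(0)
image_emb = l2_normalize(rng.normal(size=(8, 512)))   # (num_images, feature_dim)
text_emb = l2_normalize(rng.normal(size=(20, 512)))   # (num_nouns, feature_dim)

# The real scripts persist these to ./data for the following steps, e.g.:
# np.save("data/image_emb.npy", image_emb)  # illustrative file name
```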

Text Counterpart Construction

Next, we aim to find discriminative nouns that describe the image semantic centers. Motivated by the zero-shot classification paradigm of CLIP, we classify all nouns into the $k$ image semantic centers in reverse and select the top-confidence nouns for each center by running

python filter_nouns.py
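The reverse classification can be sketched as follows: treat the $k$ image semantic centers as the "classes", softmax each noun's similarities to the centers, and keep the most confident nouns per center. A minimal numpy sketch, with random arrays standing in for the CLIP features and illustrative values of $k$ and the per-center cutoff (not the repository's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
k, top = 3, 2                                   # number of centers / nouns kept per center
nouns = rng.normal(size=(30, 64))               # stand-in noun embeddings
centers = rng.normal(size=(k, 64))              # stand-in image semantic centers
nouns /= np.linalg.norm(nouns, axis=1, keepdims=True)
centers /= np.linalg.norm(centers, axis=1, keepdims=True)

# Reverse zero-shot classification: each noun is "classified" into one of
# the k image semantic centers via softmax over cosine similarities.
logits = nouns @ centers.T                      # (num_nouns, k)
logits -= logits.max(axis=1, keepdims=True)     # numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# For each center, keep the `top` nouns with the highest confidence.
selected = {c: np.argsort(-probs[:, c])[:top] for c in range(k)}
```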

The selected nouns form a text space tailored to the input images. We then retrieve nouns for each image to compute its counterpart in the text modality by running

python retrieve_text.py
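One common way to realize this retrieval step, sketched below under assumed details: for each image, take the nearest nouns in the selected text space by cosine similarity, and average their embeddings as the image's text counterpart. The retrieval count `r` and the re-normalization are illustrative choices, not necessarily the repository's exact ones:

```python
import numpy as np

rng = np.random.default_rng(1)
images = rng.normal(size=(5, 64))               # stand-in image embeddings
selected_nouns = rng.normal(size=(12, 64))      # stand-in selected-noun embeddings
images /= np.linalg.norm(images, axis=1, keepdims=True)
selected_nouns /= np.linalg.norm(selected_nouns, axis=1, keepdims=True)

r = 3                                           # nouns retrieved per image (illustrative)
sim = images @ selected_nouns.T                 # cosine similarities, (5, 12)
top_idx = np.argsort(-sim, axis=1)[:, :r]       # indices of the r nearest nouns

# Text counterpart: mean of the retrieved noun embeddings, re-normalized.
counterparts = selected_nouns[top_idx].mean(axis=1)
counterparts /= np.linalg.norm(counterparts, axis=1, keepdims=True)
```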

Cluster Head Training and Evaluation

For better collaboration between image and text features, we train additional cluster heads to further improve the clustering performance by running

python train_head.py
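As a rough illustration only: a cluster head is a small trainable layer that maps fixed embeddings to $k$ cluster logits. The toy sketch below trains a linear softmax head with plain gradient descent on cross-entropy against pseudo-labels; the pseudo-labels, the loss, and the single-layer architecture are all placeholder assumptions, not the actual training objective used by the repository:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 32, 4
feats = rng.normal(size=(n, d))                 # stand-in (frozen) embeddings
pseudo = rng.integers(0, k, size=n)             # stand-in pseudo-labels (illustrative)

W = np.zeros((d, k))                            # the linear cluster head
for _ in range(200):                            # gradient descent on cross-entropy
    logits = feats @ W
    logits -= logits.max(axis=1, keepdims=True) # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    grad = feats.T @ (probs - np.eye(k)[pseudo]) / n
    W -= 0.5 * grad

# Cluster assignment = argmax over the head's logits.
assignments = np.argmax(feats @ W, axis=1)
```

Because only the small head is trained while the CLIP backbone stays frozen, each step is cheap, which is consistent with the one-minute training time reported below.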

The training is extremely efficient, taking only one minute for the CIFAR-10 dataset.

LICENSE

Our implementation uses the codebase of TAC.

About

ShanghaiTech CS282 Machine Learning Course Project: Preserving Intra-Modal Consistency in Externally Guided Image Clustering.
