# tinyvlm

A dual-encoder model with a shared embedding space, based on CLIP.

## Set up environment and install dependencies

Create a virtual environment (optional):

```shell
python -m venv .venv

# Activate (Windows)
.venv/Scripts/activate
# Activate (Linux/macOS)
source .venv/bin/activate
```

Install dependencies:

```shell
pip install -r requirements.txt
```

If using a GPU, install the CUDA 11.8 build of PyTorch:

```shell
pip install torch==2.2.2+cu118 torchvision==0.17.2+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
```

## Logic

### Dataset

- Uses Flickr8k from Hugging Face (`datasets` library).
- Each image comes with 5 human-written captions.
- Only a small subset (e.g., 2,000 samples) is used for quick training.
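The subsetting step above can be sketched as follows. This is a minimal illustration with in-memory stand-in records; the field names (`image`, `captions`) are assumptions and may differ from the actual Flickr8k schema on Hugging Face:

```python
# Sketch: take a small subset and flatten each image's 5 captions
# into individual (image, caption) training pairs.
# Field names "image" and "captions" are hypothetical.

def make_pairs(records, max_samples=2000):
    subset = records[:max_samples]
    pairs = []
    for rec in subset:
        for caption in rec["captions"]:
            pairs.append((rec["image"], caption))
    return pairs

# Tiny in-memory stand-in for the dataset
records = [
    {"image": f"img_{i}.jpg",
     "captions": [f"caption {j} of image {i}" for j in range(5)]}
    for i in range(3)
]

pairs = make_pairs(records, max_samples=2)
print(len(pairs))  # 2 images x 5 captions = 10 pairs
```

Flattening each image into five pairs means the same image can appear as a positive for several captions within one epoch, which is the usual way multi-caption datasets are handled in contrastive training.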

### Model

- Image Encoder: ResNet18 (pretrained on ImageNet, classification head removed).
- Text Encoder: BERT-base-uncased; the CLS token embedding is used as the text representation.
- Projection layers: linear mappings align both encoders' outputs in the same latent space.
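A minimal sketch of the projection layers, assuming encoder output sizes of 512 (ResNet18) and 768 (BERT-base) and an assumed shared latent dimension of 256. The pretrained encoders themselves are stood in by random features here; only the dimension bookkeeping is shown:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedProjection(nn.Module):
    """Linear projections mapping both encoders into one latent space.

    image_dim=512 matches ResNet18's pooled features; text_dim=768
    matches BERT-base's CLS embedding; latent_dim=256 is an assumption.
    """

    def __init__(self, image_dim=512, text_dim=768, latent_dim=256):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, latent_dim)
        self.text_proj = nn.Linear(text_dim, latent_dim)

    def forward(self, image_feats, text_feats):
        # Project both modalities into the shared space and L2-normalize,
        # so cosine similarity reduces to a dot product.
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        return img, txt

proj = SharedProjection()
img, txt = proj(torch.randn(4, 512), torch.randn(4, 768))
print(img.shape, txt.shape)  # both torch.Size([4, 256])
```

Normalizing the projected embeddings keeps the similarity matrix bounded, which stabilizes the temperature-scaled contrastive loss.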

### Loss function

Symmetric contrastive loss (as in CLIP): compute pairwise similarities between the projected image and text embeddings, then average the image→text and text→image cross-entropy losses, treating the matched pair as the correct class in each direction.
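The loss above can be sketched in NumPy. This assumes L2-normalized embeddings of shape `(batch, dim)` where row `i` of each matrix is a matched image–caption pair; the temperature value 0.07 is an assumption, not taken from this repo:

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Average of image->text and text->image cross-entropy losses."""
    logits = img_emb @ txt_emb.T / temperature  # (batch, batch) similarities
    labels = np.arange(len(logits))             # diagonal entries are matches

    def cross_entropy(l):
        # Numerically stable row-wise log-softmax, then pick the
        # diagonal (the true pair) as the target class.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Symmetric: classify the right caption per image, and vice versa
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# With orthonormal matched embeddings the loss is near zero
emb = np.eye(4, 8)
print(contrastive_loss(emb, emb))
```

Averaging the two directions makes the loss symmetric in images and text, so neither encoder is favored during training.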
