Image Captioning with Transformer

This project demonstrates an advanced image captioning system built on a Transformer model. The model is trained on the COCO 2017 dataset and generates descriptive captions for images. The implementation uses TensorFlow and Keras to build and train the model.

Project Overview

The goal of this project is to generate captions for images using a Transformer-based architecture. The system consists of the following key components:

  • Dataset Preprocessing: The COCO 2017 dataset is used to train the model. Captions are preprocessed by lowercasing, removing punctuation, and adding start and end tokens.

  • Model Architecture:

    • CNN Encoder: The project uses an InceptionV3 model pre-trained on ImageNet to extract image features. These features are then reshaped and passed to the Transformer encoder.
    • Transformer Encoder and Decoder: The encoder processes the image features, while the decoder generates the caption word by word. Multi-head attention mechanisms are employed in both the encoder and decoder to capture relationships in the data.
    • Embeddings: Word embeddings are used to convert words into dense vectors, and positional embeddings are applied to retain the order of the words.
  • Training Strategy:

    • Checkpointing: Model checkpoints are saved during training to allow resuming from the last saved point.
    • Data Augmentation: Image augmentation techniques such as random flipping, rotation, and contrast adjustment are used to improve the model's robustness.
    • Loss and Metrics: A custom training loop calculates the loss and accuracy during training and validation. The loss function used is sparse categorical cross-entropy.
  • Model Inference: The trained model can generate captions for new images by passing the image through the encoder and generating words sequentially through the decoder.

Key Features

  • GPU Memory Management: The implementation restricts TensorFlow to allocate a limited amount of GPU memory to prevent out-of-memory errors.
  • Custom Training Loop: The model is trained using a custom training loop that allows for more flexibility in handling data, applying augmentations, and updating metrics.
  • Checkpointing and Resuming Training: The model saves checkpoints during training and can resume from the latest checkpoint, ensuring progress is not lost due to interruptions.
  • Image Augmentation: To improve generalization, various image augmentation techniques are applied during training.
  • Inference on Custom Images: The model can generate captions for any input image, whether provided via a URL or a local file.
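The sequential, word-by-word decoding used at inference time can be sketched independently of TensorFlow. Here `decoder_step` is a hypothetical stand-in for a forward pass of the trained decoder (which would also receive the encoded image features):

```python
def greedy_decode(decoder_step, start_id, end_id, max_len=20):
    """Generate token ids one at a time, feeding each prediction back in.

    `decoder_step(tokens)` is assumed to return the id of the most likely
    next token given the tokens generated so far.
    """
    tokens = [start_id]
    for _ in range(max_len):
        next_id = decoder_step(tokens)
        if next_id == end_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the start token

# Toy stub that emits a fixed sequence of ids and then the end token (1).
script = iter([5, 7, 9, 1])
caption_ids = greedy_decode(lambda toks: next(script), start_id=0, end_id=1)
```

The resulting ids are mapped back to words with the tokenizer's vocabulary to produce the final caption string.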

Evaluation

The model's performance is evaluated using the following metrics:

  • BLEU (Bilingual Evaluation Understudy): Measures the n-gram overlap between the generated and reference captions. Higher BLEU scores indicate better alignment with ground truth captions.

  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation):

    • ROUGE-1: Measures the unigram (single-word) overlap between the predicted and reference captions.
    • ROUGE-L: Evaluates the longest common subsequence (LCS) between generated and actual captions, reflecting sequence accuracy.
  • CIDEr (Consensus-based Image Description Evaluation): Measures how well a generated caption agrees with multiple human-written references, using TF-IDF-weighted n-gram similarity so that informative phrases shared with the references are rewarded more than common filler words.

The evaluation process involves generating captions for a sample of images and comparing them against reference captions using these metrics.
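As an illustration of how ROUGE-L works, a minimal LCS-based F-score can be computed as below. Real evaluations typically use a dedicated library (e.g. the `rouge-score` package); this sketch is for intuition only:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

For instance, "a man riding a bike" against the reference "a man rides a bike" shares the subsequence "a man a bike" (length 4 of 5 tokens on each side), giving an F1 of 0.8.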

Example output:

Avg BLEU: 0.0569
Avg ROUGE-1: 0.3254
Avg ROUGE-L: 0.2934
Avg CIDEr: 0.6290

Getting Started

  1. Dataset Preparation: Ensure the COCO 2017 dataset is available and the necessary annotations are loaded and preprocessed.
  2. Model Training: Train the model by running the cells in the notebook. The training process includes saving checkpoints, which can be used to resume training if interrupted.
  3. Evaluation: Evaluate the model's performance using BLEU, ROUGE, and CIDEr metrics to assess caption quality.
  4. Inference: Use the trained model to generate captions for new images. You can provide an image via a URL or a local file.

Results

The model is capable of generating coherent captions for various images. Example captions generated by the model include:

  • "A man riding a bike down a street"
  • "A herd of cattle standing on top of a grass-covered field"

Future Work

Potential improvements include:

  • Fine-tuning the model on a larger vocabulary or incorporating external datasets to improve caption quality.
  • Experimenting with different architectures and hyperparameters to enhance model performance.
  • Implementing advanced techniques like beam search during inference to generate more accurate captions.
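Beam search, mentioned above, keeps the k highest-scoring partial captions at each step instead of committing to the single most likely word. A minimal sketch, assuming a hypothetical `step_log_probs(tokens)` function that returns per-token log-probabilities for the next position (a stand-in for the decoder forward pass):

```python
def beam_search(step_log_probs, start_id, end_id, beam_width=3, max_len=20):
    """Return the highest-scoring token sequence found with beam search."""
    # Each beam is (tokens, cumulative log-probability, finished flag).
    beams = [([start_id], 0.0, False)]
    for _ in range(max_len):
        candidates = []
        for tokens, score, done in beams:
            if done:
                candidates.append((tokens, score, True))
                continue
            # Expand this beam with every possible next token.
            for tok, logp in enumerate(step_log_probs(tokens)):
                candidates.append((tokens + [tok], score + logp, tok == end_id))
        # Keep only the top-k partial captions by cumulative score.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(done for _, _, done in beams):
            break
    best_tokens, _, _ = max(beams, key=lambda b: b[1])
    return [t for t in best_tokens if t not in (start_id, end_id)]
```

Because scores are sums of log-probabilities, longer captions are naturally penalized; production implementations often add a length-normalization term to counteract this.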

Conclusion

This project demonstrates the power of Transformers in the field of image captioning. The combination of a pre-trained CNN for feature extraction and a Transformer for caption generation provides a robust framework for generating descriptive image captions.

