Image Captioning with Transformer

This project demonstrates an advanced image captioning system built on a Transformer model. The model is trained on the COCO 2017 dataset and generates descriptive captions for images. The implementation uses TensorFlow and Keras to build and train the model.

Project Overview

The goal of this project is to generate captions for images using a Transformer-based architecture. The system consists of the following key components:

  • Dataset Preprocessing: The COCO 2017 dataset is used to train the model. Captions are preprocessed by lowercasing, removing punctuation, and adding start and end tokens.

  • Model Architecture:

    • CNN Encoder: The project uses an InceptionV3 model pre-trained on ImageNet to extract image features. These features are then reshaped and passed to the Transformer encoder.
    • Transformer Encoder and Decoder: The encoder processes the image features, while the decoder generates the caption word by word. Multi-head attention mechanisms are employed in both the encoder and decoder to capture relationships in the data.
    • Embeddings: Word embeddings are used to convert words into dense vectors, and positional embeddings are applied to retain the order of the words.
  • Training Strategy:

    • Checkpointing: Model checkpoints are saved during training to allow resuming from the last saved point.
    • Data Augmentation: Image augmentation techniques such as random flipping, rotation, and contrast adjustment are used to improve the model's robustness.
    • Loss and Metrics: A custom training loop calculates the loss and accuracy during training and validation. The loss function used is sparse categorical cross-entropy.
  • Model Inference: The trained model can generate captions for new images by passing the image through the encoder and generating words sequentially through the decoder.

Key Features

  • GPU Memory Management: The implementation restricts TensorFlow to allocate a limited amount of GPU memory to prevent out-of-memory errors.
  • Custom Training Loop: The model is trained using a custom training loop that allows for more flexibility in handling data, applying augmentations, and updating metrics.
  • Checkpointing and Resuming Training: The model saves checkpoints during training and can resume from the latest checkpoint, ensuring progress is not lost due to interruptions.
  • Image Augmentation: To improve generalization, various image augmentation techniques are applied during training.
  • Inference on Custom Images: The model can generate captions for any input image, whether provided via a URL or a local file.
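The sequential, word-by-word decoding used at inference time can be sketched independently of TensorFlow. Here `decoder_step` is a hypothetical stand-in for a forward pass of the trained decoder (which would also receive the encoded image features):

```python
def greedy_decode(decoder_step, start_id, end_id, max_len=20):
    """Generate token ids one at a time, feeding each prediction back in.

    `decoder_step(tokens)` is assumed to return the id of the most likely
    next token given the tokens generated so far.
    """
    tokens = [start_id]
    for _ in range(max_len):
        next_id = decoder_step(tokens)
        if next_id == end_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the start token

# Toy stub that emits a fixed sequence of ids and then the end token (1).
script = iter([5, 7, 9, 1])
caption_ids = greedy_decode(lambda toks: next(script), start_id=0, end_id=1)
```

The resulting ids are mapped back to words with the tokenizer's vocabulary to produce the final caption string.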

Evaluation

The model's performance is evaluated using the following metrics:

  • BLEU (Bilingual Evaluation Understudy): Measures the n-gram overlap between the generated and reference captions. Higher BLEU scores indicate better alignment with ground truth captions.

  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation):

    • ROUGE-1: Measures the unigram (single-word) overlap between the predicted and reference captions.
    • ROUGE-L: Evaluates the longest common subsequence (LCS) between generated and actual captions, reflecting sequence accuracy.
  • CIDEr (Consensus-based Image Description Evaluation): Measures how well a generated caption agrees with multiple human-written references, using TF-IDF-weighted n-gram similarity so that informative phrases shared with the references are rewarded more than common filler words.

The evaluation process involves generating captions for a sample of images and comparing them against reference captions using these metrics.
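As an illustration of how ROUGE-L works, a minimal LCS-based F-score can be computed as below. Real evaluations typically use a dedicated library (e.g. the `rouge-score` package); this sketch is for intuition only:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

For instance, "a man riding a bike" against the reference "a man rides a bike" shares the subsequence "a man a bike" (length 4 of 5 tokens on each side), giving an F1 of 0.8.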

Example output:

Avg BLEU: 0.0569
Avg ROUGE-1: 0.3254
Avg ROUGE-L: 0.2934
Avg CIDEr: 0.6290

Getting Started

  1. Dataset Preparation: Ensure the COCO 2017 dataset is available and the necessary annotations are loaded and preprocessed.
  2. Model Training: Train the model by running the cells in the notebook. The training process includes saving checkpoints, which can be used to resume training if interrupted.
  3. Evaluation: Evaluate the model's performance using BLEU, ROUGE, and CIDEr metrics to assess caption quality.
  4. Inference: Use the trained model to generate captions for new images. You can provide an image via a URL or a local file.

Results

The model is capable of generating coherent captions for various images. Example captions generated by the model include:

  • "A man riding a bike down a street"
  • "A herd of cattle standing on top of a grass-covered field"

Future Work

Potential improvements include:

  • Fine-tuning the model on a larger vocabulary or incorporating external datasets to improve caption quality.
  • Experimenting with different architectures and hyperparameters to enhance model performance.
  • Implementing advanced techniques like beam search during inference to generate more accurate captions.
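Beam search, mentioned above, keeps the k highest-scoring partial captions at each step instead of committing to the single most likely word. A minimal sketch, assuming a hypothetical `step_log_probs(tokens)` function that returns per-token log-probabilities for the next position (a stand-in for the decoder forward pass):

```python
def beam_search(step_log_probs, start_id, end_id, beam_width=3, max_len=20):
    """Return the highest-scoring token sequence found with beam search."""
    # Each beam is (tokens, cumulative log-probability, finished flag).
    beams = [([start_id], 0.0, False)]
    for _ in range(max_len):
        candidates = []
        for tokens, score, done in beams:
            if done:
                candidates.append((tokens, score, True))
                continue
            # Expand this beam with every possible next token.
            for tok, logp in enumerate(step_log_probs(tokens)):
                candidates.append((tokens + [tok], score + logp, tok == end_id))
        # Keep only the top-k partial captions by cumulative score.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(done for _, _, done in beams):
            break
    best_tokens, _, _ = max(beams, key=lambda b: b[1])
    return [t for t in best_tokens if t not in (start_id, end_id)]
```

Because scores are sums of log-probabilities, longer captions are naturally penalized; production implementations often add a length-normalization term to counteract this.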

Conclusion

This project demonstrates the power of Transformers in the field of image captioning. The combination of a pre-trained CNN for feature extraction and a Transformer for caption generation provides a robust framework for generating descriptive image captions.

