This repository implements an image captioning system using encoder–decoder architectures with multiple improvements such as beam search, teacher forcing, and attention mechanisms.
It is developed as part of coursework at the University of Tehran and is structured to provide a modular pipeline from data preparation to model evaluation.
- `config/`: YAML files for training configurations and logging
- `data/`: Data loader
- `models/`: Model architecture definitions (EncoderCNN, DecoderRNN, Attention)
- `notebooks/`: Jupyter notebook for exploration and experiments
- `scripts/`: Training, evaluation, and entry-point scripts
- `utils/`: Metrics computation and visualization tools
- `report.pdf`: Original report (in Persian)
- Uses the Flickr8k dataset.
- Preprocessing includes (a minimal sketch follows this list):
  - Tokenization of captions (without pre-built tokenizers).
  - Adding `<START>` and `<END>` tokens to sequences.
  - Padding sequences for batching.
- Custom DataLoader implementation for efficient batching.
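A minimal sketch of this preprocessing in plain Python; the function names, special tokens, and padding scheme are illustrative and may differ from the repository's actual `data/` code:

```python
# Illustrative caption preprocessing sketch (names and tokens are placeholders).
from collections import Counter

def tokenize(caption):
    # Simple whitespace tokenization, without a pre-built tokenizer.
    return caption.lower().strip().split()

def build_vocab(captions, min_freq=1):
    counts = Counter(tok for cap in captions for tok in tokenize(cap))
    itos = ["<PAD>", "<START>", "<END>", "<UNK>"] + \
           [t for t, c in counts.items() if c >= min_freq]
    return {t: i for i, t in enumerate(itos)}

def encode(caption, vocab, max_len):
    tokens = ["<START>"] + tokenize(caption) + ["<END>"]
    ids = [vocab.get(t, vocab["<UNK>"]) for t in tokens]
    # Pad (or truncate) to a fixed length so captions can be stacked into a batch.
    return ids[:max_len] + [vocab["<PAD>"]] * max(0, max_len - len(ids))

captions = ["A dog runs through the grass", "A child plays on the beach"]
vocab = build_vocab(captions)
print(encode(captions[0], vocab, max_len=10))
```

Fixed-length sequences like these are what the custom DataLoader then groups into batches.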
- Encoder: Pre-trained ResNet-101 CNN backbone used as a feature extractor.
- Decoder: RNN-based (LSTM / GRU) with an embedding layer to generate captions; a minimal sketch of this encoder–decoder pair follows the list below.
- Loss: Cross-Entropy
- Optimizer: Adam
- Techniques:
- Early Stopping
- Hyperparameter tuning (batch size, learning rate, embedding dimension, etc.)
- Training curves for loss and validation BLEU score are logged.
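The following is a minimal encoder–decoder sketch of the architecture above, assuming PyTorch and a recent torchvision. The class names echo the `models/` modules (EncoderCNN, DecoderRNN), but the dimensions and details are placeholders rather than the repository's exact implementation:

```python
# Illustrative encoder–decoder sketch (assumes PyTorch and torchvision >= 0.13).
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        resnet = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
        # Drop the classification head; keep the convolutional feature extractor.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        for p in self.backbone.parameters():
            p.requires_grad = False  # frozen, pre-trained feature extractor
        self.fc = nn.Linear(resnet.fc.in_features, embed_dim)

    def forward(self, images):
        feats = self.backbone(images).flatten(1)   # (B, 2048)
        return self.fc(feats)                      # (B, embed_dim)

class DecoderRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feats, captions):
        # Prepend the image feature as the first input step of the sequence.
        emb = self.embed(captions)                               # (B, T, E)
        inputs = torch.cat([img_feats.unsqueeze(1), emb], dim=1)
        out, _ = self.lstm(inputs)
        return self.fc(out)                                      # (B, T+1, vocab)
```

Feeding the ground-truth caption tokens to the decoder during training, as in the forward pass above, corresponds to the teacher-forcing setup listed under the improvements below.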
- Implemented BLEU scoring for caption evaluation.
- Training stops early when BLEU stops improving on the validation set, as sketched below.
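A sketch of this BLEU-driven stopping rule, assuming NLTK's `corpus_bleu`; the patience value and class name are illustrative assumptions, not values taken from `config/config.yaml`:

```python
# Illustrative BLEU-based early stopping (assumes nltk; patience is a placeholder).
from nltk.translate.bleu_score import corpus_bleu

class BleuEarlyStopping:
    """Signal a stop when validation BLEU has not improved for `patience` epochs."""
    def __init__(self, patience=3):
        self.patience, self.best, self.bad_epochs = patience, 0.0, 0

    def step(self, bleu):
        if bleu > self.best:
            self.best, self.bad_epochs = bleu, 0   # improvement: reset the counter
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience    # True means stop training

# corpus_bleu takes, per image, a list of reference token lists and one hypothesis token list.
refs = [[["a", "dog", "runs", "through", "the", "grass"]]]
hyps = [["a", "dog", "is", "running", "in", "the", "grass"]]
stopper = BleuEarlyStopping(patience=3)
should_stop = stopper.step(corpus_bleu(refs, hyps, weights=(0.5, 0.5)))
```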
- Beam search decoding instead of the greedy approach for better captions (a decoding sketch appears after this list).
- Teacher Forcing to stabilize RNN training.
- Attention Mechanism:
- Implemented Bahdanau-style attention.
- Improves caption accuracy and interpretability.
- Visualizations of attention maps are included.
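Below is a minimal sketch of Bahdanau-style (additive) attention over spatial encoder features, assuming PyTorch; the dimensions and names are illustrative and may not match the repository's Attention module:

```python
# Illustrative Bahdanau (additive) attention over encoder feature-map regions.
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    def __init__(self, encoder_dim, decoder_dim, attn_dim):
        super().__init__()
        self.enc_proj = nn.Linear(encoder_dim, attn_dim)   # project image region features
        self.dec_proj = nn.Linear(decoder_dim, attn_dim)   # project decoder hidden state
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, enc_feats, dec_hidden):
        # enc_feats: (B, num_regions, encoder_dim); dec_hidden: (B, decoder_dim)
        energy = torch.tanh(self.enc_proj(enc_feats) + self.dec_proj(dec_hidden).unsqueeze(1))
        alpha = torch.softmax(self.score(energy).squeeze(-1), dim=1)   # (B, num_regions)
        context = (alpha.unsqueeze(-1) * enc_feats).sum(dim=1)         # weighted sum of regions
        return context, alpha
```

In the usual show-attend-and-tell formulation, the encoder returns the spatial feature map (e.g., a 7×7 grid of 2048-dim vectors) instead of a single pooled vector; the context vector is concatenated with the word embedding at each decoding step, and `alpha`, reshaped to that grid, is what the attention heatmaps visualize.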
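The beam-search improvement can be sketched framework-agnostically as below; `step_log_probs` is a hypothetical callable that maps a partial token sequence to a list of log-probabilities over the vocabulary, and no length normalization is applied:

```python
# Illustrative beam search decoding sketch (no length normalization).
def beam_search(step_log_probs, start_id, end_id, beam_size=3, max_len=20):
    beams = [([start_id], 0.0)]                    # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:                  # finished beams are carried over unchanged
                candidates.append((seq, score))
                continue
            log_probs = step_log_probs(seq)        # sequence of floats over the vocabulary
            top = sorted(range(len(log_probs)),
                         key=lambda i: log_probs[i], reverse=True)[:beam_size]
            for tok in top:
                candidates.append((seq + [tok], score + log_probs[tok]))
        # Keep only the `beam_size` highest-scoring partial captions.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == end_id for seq, _ in beams):
            break
    return beams[0][0]                             # token ids of the best-scoring caption
```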
- Alternative backbones: Replace ResNet-101 with others (e.g., ResNet-50, EfficientNet); see the snippet after this list.
- Alternative decoders: Try GRU vs. LSTM and compare performance.
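Swapping the backbone in torchvision is typically a one-line change; the main thing to adjust is the encoder's projection layer, since the pooled feature dimension differs between models. A small illustrative check (assumes torchvision ≥ 0.13):

```python
# Illustrative backbone swap; the printed values show the feature dimensions to project from.
import torchvision.models as models

# ResNet-50 keeps the same 2048-dim pooled features as ResNet-101.
resnet50 = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
print(resnet50.fc.in_features)              # 2048

# EfficientNet-B0 exposes 1280-dim features, so the encoder's linear layer must change.
effnet = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.DEFAULT)
print(effnet.classifier[1].in_features)     # 1280
```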
Clone the repository and install the dependencies:

    git clone https://github.com/omidnaeej/Image2Caption-EncoderDecoder-Attention.git
    cd Image2Caption-EncoderDecoder-Attention
    pip install -r requirements.txt

Adjust the configuration in `config/config.yaml` if needed, then run the main script:

    python -m scripts.main

- Baseline: Encoder–Decoder (ResNet + LSTM/GRU).
- Improvements: BLEU scores improved with beam search, teacher forcing, and attention.
- Visualizations: Attention heatmaps show regions of images aligned with words in captions.