# End‑to‑end NLP Pipeline with T5‑Small Transformer

Python 3.8+ · FastAPI · Docker · AWS · MLOps
## Table of Contents

- Project Overview
- Key Features
- Tech Stack
- Installation & Setup
- Usage
- API Documentation
- Project Structure
- Pipeline Stages
- Model Information
- Dataset
- Docker Deployment
- AWS Deployment
- Contributing
- License
- Contact
## Project Overview

This project implements a full text summarization system using the T5‑Small transformer. It establishes an MLOps pipeline covering data ingestion, validation, transformation, training, evaluation, and deployment with FastAPI and Docker.
**Objective:** Automatically produce concise, coherent summaries of long text using a production-grade transformer model.
## Key Features

- Modular Pipeline Architecture – separate stages for easy maintenance
- T5‑Small Transformer – efficient, powerful pretrained text-to-text model
- Data Validation – ensures correct data format and structure
- ROUGE Evaluation – standard summarization metrics
- FastAPI Deployment – high-performance API with built-in docs
- Cloud-Native Design – Dockerized and AWS-compatible, with CI/CD
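ROUGE scoring in the project is done with standard tooling (the ROUGE package listed in the tech stack below). Purely as an illustration of what the metric measures, here is a minimal pure-Python sketch of ROUGE-1 F1; the real metric also applies stemming and computes ROUGE-2/ROUGE-L, which this sketch omits:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: clipped unigram overlap between reference and candidate."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Each candidate unigram counts at most as often as it occurs in the reference
    overlap = sum(min(cand_counts[w], ref_counts[w]) for w in cand_counts)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

A perfect copy of the reference scores 1.0; a summary sharing no words scores 0.0.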
## Tech Stack

- Machine Learning: Hugging Face Transformers, PyTorch, NLTK, ROUGE, Datasets
- Backend / API: FastAPI, Uvicorn, Jinja2, PyYAML, Python 3.8+
- DevOps / Cloud: Docker, AWS (EC2/ECR/S3), GitHub Actions, Boto3
## Installation & Setup

```bash
git clone https://github.com/MOHD-AFROZ-ALI/textsummarize.git
cd textsummarize
conda create -n textsummarizer python=3.8 -y
conda activate textsummarizer
pip install -r requirements.txt
```

## Usage

Run the full training pipeline:

```bash
python main.py
```

Or run the pipeline stages from Python:
```python
from textSummarizer.pipeline.stage_01_data_ingestion import DataIngestionTrainingPipeline
# ... import other pipeline stages ...
# Execute complete pipeline
```

Start the API server:

```bash
python app.py
```

Visit http://localhost:8080 for the live API, or /docs for the Swagger UI.
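The stage-per-class pattern behind those imports can be sketched with stand-in stages (class names, methods, and the artifacts dict here are illustrative, not the project's actual API):

```python
class PipelineStage:
    """Base class: each MLOps stage implements main() and reports its name."""
    name = "stage"

    def main(self, artifacts):
        raise NotImplementedError

class DataIngestion(PipelineStage):
    name = "data_ingestion"

    def main(self, artifacts):
        # Stand-in for downloading & extracting the dataset
        artifacts["raw_data"] = ["Dialogue 1 ...", "Dialogue 2 ..."]

class DataValidation(PipelineStage):
    name = "data_validation"

    def main(self, artifacts):
        assert artifacts.get("raw_data"), "ingestion produced no data"

def run_pipeline(stages):
    """Run stages in order, passing a shared artifacts dict between them."""
    artifacts = {}
    for stage in stages:
        print(f">>> stage {stage.name} started")
        stage.main(artifacts)
        print(f">>> stage {stage.name} completed")
    return artifacts

result = run_pipeline([DataIngestion(), DataValidation()])
```

Each real stage (ingestion through evaluation) reads its settings from `config/config.yaml` and `params.yaml` and writes its outputs under `artifacts/`.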
Run a single prediction:

```python
from textSummarizer.pipeline.prediction import PredictionPipeline

predictor = PredictionPipeline()
text = """
Machine learning is a method of data analysis...
"""
summary = predictor.predict(text)
print(summary)
```

## API Documentation

- GET / → Redirects to API docs
- GET /train → Triggers model training pipeline
- POST /predict → Runs summarization:
```bash
curl -X POST "http://localhost:8080/predict" \
  -H "Content-Type: application/json" \
  -d '{"text": "Your long text..."}'
```
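The same request can be issued from Python with only the standard library. This sketch just builds the POST (the URL and payload shape mirror the curl example; the helper name is made up for illustration):

```python
import json
import urllib.request

def build_predict_request(text, url="http://localhost:8080/predict"):
    """Build the POST /predict request; send it with urllib.request.urlopen(req)."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_predict_request("Your long text...")
```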
## Project Structure

```
textsummarize/
├─ .github/workflows/        # CI/CD config
├─ artifacts/                # Model & data outputs
│  ├─ data_ingestion/
│  ├─ data_validation/
│  ├─ data_transformation/
│  ├─ model_trainer/
│  └─ model_evaluation/
├─ config/config.yaml
├─ src/textSummarizer/
│  ├─ components/
│  ├─ pipeline/              # Training & prediction
│  ├─ utils/
│  └─ config/
├─ app.py                    # API server
├─ main.py                   # Training entry point
├─ Dockerfile
├─ params.yaml
├─ requirements.txt
└─ setup.py
```
## Pipeline Stages

- Data Ingestion – downloads & extracts the SAMSum JSON dataset
- Data Validation – checks splits, formats, and schemas
- Data Transformation – tokenizes and prepares model inputs
- Model Training – fine-tunes T5‑Small with the configured parameters
- Model Evaluation – computes ROUGE metrics & reports
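The validation stage's checks (required splits present, required keys per record) can be sketched in plain Python. The schema below assumes only the SAMSum-style `dialogue`/`summary` fields shown in the Dataset section; the project's actual checks live in its data-validation component:

```python
REQUIRED_SPLITS = {"train", "validation", "test"}
REQUIRED_KEYS = {"dialogue", "summary"}

def validate_dataset(dataset):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    missing = REQUIRED_SPLITS - dataset.keys()
    if missing:
        problems.append(f"missing splits: {sorted(missing)}")
    for split, records in dataset.items():
        for i, record in enumerate(records):
            absent = REQUIRED_KEYS - record.keys()
            if absent:
                problems.append(f"{split}[{i}] missing keys: {sorted(absent)}")
    return problems
```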
## Model Information

- Model: T5‑Small (~60M parameters, encoder-decoder)
- Sequence limits: 512 input tokens / 150 summary tokens
- Training config: 1 epoch, batch size=4, LR=5e-5 (AdamW), warmup=500, weight decay=0.01
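The warmup=500 setting implies a learning-rate schedule like the sketch below, assuming linear warmup followed by linear decay (the shape of the default Hugging Face linear scheduler). The `total_steps` value is an assumption derived from the config above: ~14K training samples at batch size 4 for one epoch is ~3,500 optimizer steps:

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup=500, total_steps=3500):
    """LR at a given step: ramp up over `warmup` steps, then decay linearly to 0."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup))
```

The LR peaks at 5e-5 exactly at step 500 and reaches 0 at the final step.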
## Dataset

SAMSum: ~16K conversation–summary pairs

- Training: ~14K
- Validation: ~818
- Test: ~819
- Domain: daily English chats, JSON formatted
Sample entry:

```json
{
  "dialogue": "John: Hey, ...",
  "summary": "John and Sarah confirm their lunch..."
}
```

## Docker Deployment

```bash
docker build -t textsummarizer .
docker run -p 8080:8080 textsummarizer
```

Or with docker-compose (optional):
```yaml
services:
  textsummarizer:
    build: .
    ports:
      - "8080:8080"
    environment:
      - PYTHONPATH=/app
    volumes:
      - ./artifacts:/app/artifacts
```

## AWS Deployment

- Create an IAM user (ECR, EC2, S3 full access)
- Create an ECR repo (e.g., `textsummarizer`)
- Launch an EC2 instance (Ubuntu, open port 8080)
- Set GitHub Secrets for AWS credentials and ECR info
- Configure the EC2 instance as a self-hosted runner
- On push, the CI/CD pipeline builds the image, pushes it to ECR, and deploys it to a container on EC2
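That flow roughly corresponds to a workflow like the following sketch. Job names, secret names, and the registry path are placeholders; the repo's actual `.github/workflows/` config is authoritative:

```yaml
name: deploy
on:
  push:
    branches: [main]
jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push image to ECR
        env:
          AWS_REGION: ${{ secrets.AWS_REGION }}
          ECR_REGISTRY: ${{ secrets.ECR_REGISTRY }}
        run: |
          aws ecr get-login-password --region "$AWS_REGION" \
            | docker login --username AWS --password-stdin "$ECR_REGISTRY"
          docker build -t "$ECR_REGISTRY/textsummarizer:latest" .
          docker push "$ECR_REGISTRY/textsummarizer:latest"
  deploy:
    needs: build-and-push
    runs-on: self-hosted   # the EC2 runner configured above
    steps:
      - name: Pull and run the new image
        run: |
          docker pull ${{ secrets.ECR_REGISTRY }}/textsummarizer:latest
          docker run -d -p 8080:8080 ${{ secrets.ECR_REGISTRY }}/textsummarizer:latest
```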
## Contributing

We welcome contributions! 🎉

- Fork the repo
- Create a branch (`feature/your-idea`)
- Commit & push
- Submit a pull request
Ideas for contributions:

- Model & performance upgrades
- Additional evaluation metrics
- Preprocessing enhancements
- UI/UX or docs improvements
Code standards:
- Follow PEP 8
- Add docstrings & tests
- Ensure CI checks pass
## License

MIT License © 2025 MOHD AFROZ ALI
## Contact

- MOHD AFROZ ALI – Aspiring SDE / AIML Intern
- B.Tech (IT), Muffakham Jah College of Engineering & Technology
- Email: afrozali3001.aa@gmail.com
- GitHub & LinkedIn in repo profile
- Phone: +91 9959786710
Enjoy using and contributing to this Text Summarization pipeline!