# End‑to‑end NLP Pipeline with T5‑Small Transformer

Python 3.8+ · FastAPI · Docker · AWS · MLOps
## Table of Contents

- Project Overview
- Key Features
- Tech Stack
- Installation & Setup
- Usage
- API Documentation
- Project Structure
- Pipeline Stages
- Model Information
- Dataset
- Docker Deployment
- AWS Deployment
- Contributing
- License
- Contact
## Project Overview

This project implements a full text summarization system using the T5‑Small transformer. It establishes an MLOps pipeline covering data ingestion, validation, transformation, training, evaluation, and deployment with FastAPI and Docker.
**Objective:** Automatically produce concise, coherent summaries of long text using a production-grade transformer model.
## Key Features

- Modular Pipeline Architecture – separate stages for easy maintenance
- T5‑Small Transformer – efficient, powerful pretrained text-to-text model
- Data Validation – ensures correct data format and structure
- ROUGE Evaluation – standard summarization metrics
- FastAPI Deployment – high-performance API with built-in docs
- Cloud-Native Design – Dockerized and AWS-compatible, with CI/CD
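ROUGE scoring in the project is done with standard tooling (the ROUGE package listed in the tech stack below). Purely as an illustration of what the metric measures, here is a minimal pure-Python sketch of ROUGE-1 F1; the real metric also applies stemming and computes ROUGE-2/ROUGE-L, which this sketch omits:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: clipped unigram overlap between reference and candidate."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Each candidate unigram counts at most as often as it occurs in the reference
    overlap = sum(min(cand_counts[w], ref_counts[w]) for w in cand_counts)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

A perfect copy of the reference scores 1.0; a summary sharing no words scores 0.0.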
## Tech Stack

- Machine Learning: Hugging Face Transformers, PyTorch, NLTK, ROUGE, Datasets
- Backend / API: FastAPI, Uvicorn, Jinja2, PyYAML, Python 3.8+
- DevOps / Cloud: Docker, AWS (EC2/ECR/S3), GitHub Actions, Boto3
## Installation & Setup

```bash
git clone https://github.com/MOHD-AFROZ-ALI/textsummarize.git
cd textsummarize
conda create -n textsummarizer python=3.8 -y
conda activate textsummarizer
pip install -r requirements.txt
```

## Usage

Run the full training pipeline:

```bash
python main.py
```

Or run the pipeline stages from Python:
```python
from textSummarizer.pipeline.stage_01_data_ingestion import DataIngestionTrainingPipeline
# ... import other pipeline stages ...
# Execute complete pipeline
```

Start the API server:

```bash
python app.py
```

Visit http://localhost:8080 for the live API, or /docs for the Swagger UI.
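The stage-per-class pattern behind those imports can be sketched with stand-in stages (class names, methods, and the artifacts dict here are illustrative, not the project's actual API):

```python
class PipelineStage:
    """Base class: each MLOps stage implements main() and reports its name."""
    name = "stage"

    def main(self, artifacts):
        raise NotImplementedError

class DataIngestion(PipelineStage):
    name = "data_ingestion"

    def main(self, artifacts):
        # Stand-in for downloading & extracting the dataset
        artifacts["raw_data"] = ["Dialogue 1 ...", "Dialogue 2 ..."]

class DataValidation(PipelineStage):
    name = "data_validation"

    def main(self, artifacts):
        assert artifacts.get("raw_data"), "ingestion produced no data"

def run_pipeline(stages):
    """Run stages in order, passing a shared artifacts dict between them."""
    artifacts = {}
    for stage in stages:
        print(f">>> stage {stage.name} started")
        stage.main(artifacts)
        print(f">>> stage {stage.name} completed")
    return artifacts

result = run_pipeline([DataIngestion(), DataValidation()])
```

Each real stage (ingestion through evaluation) reads its settings from `config/config.yaml` and `params.yaml` and writes its outputs under `artifacts/`.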
Run a single prediction:

```python
from textSummarizer.pipeline.prediction import PredictionPipeline

predictor = PredictionPipeline()
text = """
Machine learning is a method of data analysis...
"""
summary = predictor.predict(text)
print(summary)
```

## API Documentation

- GET / → Redirects to API docs
- GET /train → Triggers model training pipeline
- POST /predict → Runs summarization:
```bash
curl -X POST "http://localhost:8080/predict" \
  -H "Content-Type: application/json" \
  -d '{"text": "Your long text..."}'
```
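The same request can be issued from Python with only the standard library. This sketch just builds the POST (the URL and payload shape mirror the curl example; the helper name is made up for illustration):

```python
import json
import urllib.request

def build_predict_request(text, url="http://localhost:8080/predict"):
    """Build the POST /predict request; send it with urllib.request.urlopen(req)."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_predict_request("Your long text...")
```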
## Project Structure

```
textsummarize/
├─ .github/workflows/        # CI/CD config
├─ artifacts/                # Model & data outputs
│  ├─ data_ingestion/
│  ├─ data_validation/
│  ├─ data_transformation/
│  ├─ model_trainer/
│  └─ model_evaluation/
├─ config/config.yaml
├─ src/textSummarizer/
│  ├─ components/
│  ├─ pipeline/              # Training & prediction
│  ├─ utils/
│  └─ config/
├─ app.py                    # API server
├─ main.py                   # Training entry point
├─ Dockerfile
├─ params.yaml
├─ requirements.txt
└─ setup.py
```
## Pipeline Stages

- Data Ingestion – downloads & extracts the SAMSum JSON dataset
- Data Validation – checks splits, formats, and schemas
- Data Transformation – tokenizes and prepares model inputs
- Model Training – fine-tunes T5‑Small with the configured parameters
- Model Evaluation – computes ROUGE metrics & reports
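The validation stage's checks (required splits present, required keys per record) can be sketched in plain Python. The schema below assumes only the SAMSum-style `dialogue`/`summary` fields shown in the Dataset section; the project's actual checks live in its data-validation component:

```python
REQUIRED_SPLITS = {"train", "validation", "test"}
REQUIRED_KEYS = {"dialogue", "summary"}

def validate_dataset(dataset):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    missing = REQUIRED_SPLITS - dataset.keys()
    if missing:
        problems.append(f"missing splits: {sorted(missing)}")
    for split, records in dataset.items():
        for i, record in enumerate(records):
            absent = REQUIRED_KEYS - record.keys()
            if absent:
                problems.append(f"{split}[{i}] missing keys: {sorted(absent)}")
    return problems
```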
## Model Information

- Model: T5‑Small (~60M parameters, encoder-decoder)
- Sequence limits: 512 input tokens / 150 summary tokens
- Training config: 1 epoch, batch size=4, LR=5e-5 (AdamW), warmup=500, weight decay=0.01
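The warmup=500 setting implies a learning-rate schedule like the sketch below, assuming linear warmup followed by linear decay (the shape of the default Hugging Face linear scheduler). The `total_steps` value is an assumption derived from the config above: ~14K training samples at batch size 4 for one epoch is ~3,500 optimizer steps:

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup=500, total_steps=3500):
    """LR at a given step: ramp up over `warmup` steps, then decay linearly to 0."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup))
```

The LR peaks at 5e-5 exactly at step 500 and reaches 0 at the final step.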
## Dataset

SAMSum: ~16K conversation–summary pairs

- Training: ~14K
- Validation: ~818
- Test: ~819
- Domain: daily English chats, JSON formatted
Sample entry:

```json
{
  "dialogue": "John: Hey, ...",
  "summary": "John and Sarah confirm their lunch..."
}
```

## Docker Deployment

```bash
docker build -t textsummarizer .
docker run -p 8080:8080 textsummarizer
```

Or with docker-compose (optional):
```yaml
services:
  textsummarizer:
    build: .
    ports:
      - "8080:8080"
    environment:
      - PYTHONPATH=/app
    volumes:
      - ./artifacts:/app/artifacts
```

## AWS Deployment

- Create an IAM user (ECR, EC2, S3 full access)
- Create an ECR repo (e.g., `textsummarizer`)
- Launch an EC2 instance (Ubuntu, open port 8080)
- Set GitHub Secrets for AWS credentials and ECR info
- Configure the EC2 instance as a self-hosted runner
- On push, the CI/CD pipeline builds the image, pushes it to ECR, and deploys it to a container on EC2
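That flow roughly corresponds to a workflow like the following sketch. Job names, secret names, and the registry path are placeholders; the repo's actual `.github/workflows/` config is authoritative:

```yaml
name: deploy
on:
  push:
    branches: [main]
jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push image to ECR
        env:
          AWS_REGION: ${{ secrets.AWS_REGION }}
          ECR_REGISTRY: ${{ secrets.ECR_REGISTRY }}
        run: |
          aws ecr get-login-password --region "$AWS_REGION" \
            | docker login --username AWS --password-stdin "$ECR_REGISTRY"
          docker build -t "$ECR_REGISTRY/textsummarizer:latest" .
          docker push "$ECR_REGISTRY/textsummarizer:latest"
  deploy:
    needs: build-and-push
    runs-on: self-hosted   # the EC2 runner configured above
    steps:
      - name: Pull and run the new image
        run: |
          docker pull ${{ secrets.ECR_REGISTRY }}/textsummarizer:latest
          docker run -d -p 8080:8080 ${{ secrets.ECR_REGISTRY }}/textsummarizer:latest
```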
## Contributing

We welcome contributions! 🎉

- Fork the repo
- Create a branch (`feature/your-idea`)
- Commit & push
- Submit a pull request
Ideas for contributions:

- Model & performance upgrades
- Additional evaluation metrics
- Preprocessing enhancements
- UI/UX or docs improvements
Code standards:
- Follow PEP 8
- Add docstrings & tests
- Ensure CI checks pass
## License

MIT License © 2025 MOHD AFROZ ALI
## Contact

- MOHD AFROZ ALI – Aspiring SDE / AIML Intern
- B.Tech (IT), Muffakham Jah College of Engineering & Technology
- Email: afrozali3001.aa@gmail.com
- GitHub & LinkedIn in repo profile
- Phone: +91 9959786710
Enjoy using and contributing to this Text Summarization pipeline!