Skip to content

Built an abstractive text summarizer using T5and Hugging Face, evaluated with ROUGE/BLEU to ensure high-quality, human-like summaries of long documents.

License

Notifications You must be signed in to change notification settings

MOHD-AFROZ-ALI/textsummarize

Repository files navigation

Text Summarization Project 🚀

End‑to‑end NLP Pipeline with T5‑Small Transformer
Python 3.8+ · FastAPI · Docker · AWS · MLOps :contentReference[oaicite:1]{index=1}


📘 Table of Contents

  1. Project Overview
  2. Key Features
  3. Tech Stack
  4. Installation & Setup
  5. Usage
  6. API Documentation
  7. Project Structure
  8. Pipeline Stages
  9. Model Information
  10. Dataset
  11. Docker Deployment
  12. AWS Deployment
  13. Contributing
  14. License
  15. Contact

Project Overview

This project implements a full text summarization system using the T5‑Small transformer. It establishes an MLOps pipeline covering data ingestion, validation, transformation, training, evaluation, and deployment using FastAPI and Docker :contentReference[oaicite:2]{index=2}.

Objective: Automatically produce concise, coherent summaries from long text using a production-grade transformer model.


Key Features

  • Modular Pipeline Architecture – Separate stages for easy maintenance :contentReference[oaicite:3]{index=3}
  • T5‑Small Transformer – Efficient and powerful text-to-text pretrained model :contentReference[oaicite:4]{index=4}
  • Data Validation – Ensures correct data format and structure :contentReference[oaicite:5]{index=5}
  • ROUGE Evaluation – Standard summarization metrics :contentReference[oaicite:6]{index=6}
  • FastAPI Deployment – High-performance API with built-in docs :contentReference[oaicite:7]{index=7}
  • Cloud-Native Design – Dockerized and AWS-compatible with CI/CD :contentReference[oaicite:8]{index=8}

Tech Stack

  • Machine Learning: Hugging Face Transformers, PyTorch, NLTK, ROUGE, Datasets :contentReference[oaicite:9]{index=9}
  • Backend / API: FastAPI, Uvicorn, Jinja2, PyYAML, Python 3.8+ :contentReference[oaicite:10]{index=10}
  • DevOps / Cloud: Docker, AWS (EC2/ECR/S3), GitHub Actions, Boto3 :contentReference[oaicite:11]{index=11}

Installation & Setup

git clone https://github.com/MOHD-AFROZ-ALI/textsummarize.git
cd textsummarize

conda create -n textsummarizer python=3.8 -y
conda activate textsummarizer

pip install -r requirements.txt

Usage

Train the Model

python main.py

Or run full pipeline in Python:

from textSummarizer.pipeline.stage_01_data_ingestion import DataIngestionTrainingPipeline
# ... import other pipeline stages ...
# Execute complete pipeline

Run the API Server

python app.py

Visit http://localhost:8080 for live API or /docs for Swagger UI.

Example Prediction

from textSummarizer.pipeline.prediction import PredictionPipeline

predictor = PredictionPipeline()

text = """
Machine learning is a method of data analysis...
"""
summary = predictor.predict(text)
print(summary)

API Documentation

  • GET / → Redirects to API docs
  • GET /train → Triggers model training pipeline
  • POST /predict → Runs summarization:
curl -X POST "http://localhost:8080/predict" \
-H "Content-Type: application/json" \
-d '{"text": "Your long text..."}'

([qwjfxnkd.gensparkspace.com][1])


Project Structure

textsummarize/
├─ .github/workflows/      # CI/CD config
├─ artifacts/              # Model & data outputs
│   ├─ data_ingestion/
│   ├─ data_validation/
│   ├─ data_transformation/
│   ├─ model_trainer/
│   └─ model_evaluation/
├─ config/config.yaml
├─ src/textSummarizer/
│   ├─ components/
│   ├─ pipeline/  # Training & prediction
│   ├─ utils/
│   └─ config/
├─ app.py               # API server
├─ main.py              # Training entry point
├─ Dockerfile
├─ params.yaml
├─ requirements.txt
└─ setup.py

Pipeline Stages

  1. Data Ingestion – Downloads & extracts SAMSum JSON dataset ([qwjfxnkd.gensparkspace.com][1])
  2. Data Validation – Checks splits, formats, schemas ([qwjfxnkd.gensparkspace.com][1])
  3. Data Transformation – Tokenizes and prepares inputs
  4. Model Training – Fine-tunes T5‑Small with config params ([qwjfxnkd.gensparkspace.com][1])
  5. Model Evaluation – Computes ROUGE metrics & reports ([qwjfxnkd.gensparkspace.com][1])

Model Information

  • Model: T5‑Small (~60M parameters, encoder-decoder)
  • Sequence Limits: 512 input / 150 summary tokens
  • Training Config: 1 epoch, batch size=4, LR=5e-5 (AdamW), warmup=500, weight decay=0.01 ([qwjfxnkd.gensparkspace.com][1])

Dataset

SAMSum

  • ~16K conversation-summary pairs

    • Training: ~14K
    • Validation: ~818
    • Test: ~819
  • Domain: daily English chats, JSON formatted ([qwjfxnkd.gensparkspace.com][1])

Sample Entry:

{
  "dialogue": "John: Hey, ...",
  "summary": "John and Sarah confirm their lunch..."
}

Docker Deployment

docker build -t textsummarizer .
docker run -p 8080:8080 textsummarizer

Or with docker-compose (optional):

services:
  textsummarizer:
    build: .
    ports:
      - "8080:8080"
    environment:
      - PYTHONPATH=/app
    volumes:
      - ./artifacts:/app/artifacts

AWS Deployment

  1. Create IAM user (ECR, EC2, S3 full access)
  2. Create ECR repo (e.g., textsummarizer)
  3. Launch EC2 (Ubuntu, open port 8080)
  4. Set GitHub Secrets for AWS creds and ECR info
  5. Configure EC2 as self-hosted runner
  6. CI/CD pipeline builds image, pushes to ECR, and deploys to EC2 container on push ([qwjfxnkd.gensparkspace.com][1])

Contributing

We welcome contributions! 🎉

  1. Fork the repo
  2. Branch (feature/your-idea)
  3. Commit & push
  4. Submit Pull Request

Areas to improve

  • Model & performance upgrades
  • Additional evaluation metrics
  • Preprocessing enhancements
  • UI/UX or docs improvements

Code standards:

  • Follow PEP 8
  • Add docstrings & tests
  • Ensure CI checks pass

🛡️ License

MIT License © 2025 MOHD AFROZ ALI


Contact

  • MOHD AFROZ ALI – Aspiring SDE / AIML Intern
  • B.Tech (IT), Muffakham Jah College of Engineering & Technology
  • Email: afrozali3001.aa@gmail.com
  • GitHub & LinkedIn in repo profile
  • Phone: +91 9959786710

Enjoy using and contributing to this Text Summarization pipeline!

About

Built an abstractive text summarizer using T5and Hugging Face, evaluated with ROUGE/BLEU to ensure high-quality, human-like summaries of long documents.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published