This repository contains code, notes, and fine-tuned models for building abstractive text summarizers using the Hugging Face Transformers ecosystem. The goal is to produce concise, human-like summaries of news articles using sequence-to-sequence Transformer models (BART).
Two BART-base models were fine-tuned:
- `atneel/bart-base-summarizer` — trained on CNN/DailyMail (multi-sentence summaries; some extractive bias).
- `atneel/bart-base-summarizer-xsum` — trained on XSUM (single-sentence, highly abstractive summaries).
Platform: Kaggle Notebooks with NVIDIA T4. Framework: PyTorch + Hugging Face Transformers + Datasets + Evaluate.
- Full training & evaluation scripts compatible with the Hugging Face Trainer API.
- Data processing pipelines for CNN/DailyMail and XSUM (tokenization, truncation, label shifting).
- Evaluation using ROUGE (rouge1, rouge2, rougeL) with standard preprocessing (sentence tokenization, detokenization).
Minimum Python environment:
- Python 3.8+
- torch
- transformers
- datasets
- evaluate
- nltk
- rouge_score (optional but recommended)

Install quickly:

```bash
pip install torch transformers datasets evaluate nltk rouge_score
python -m nltk.downloader punkt
```

Datasets used:
- CNN/DailyMail — long news articles with multi-sentence highlights
- XSUM — short, single-sentence highly-abstractive summaries
Typical preprocessing steps:
- Lowercasing is optional (original casing was kept for BART).
- Truncate articles to `max_source_length` (e.g., 1024 for BART if fine-tuned that way; for `bart-base` use 768–1024 carefully depending on memory).
- Truncate summaries to `max_target_length` (e.g., 64–128 for XSUM).
- Tokenize with the model tokenizer and set `padding='max_length'` for batching, or use dynamic padding in a DataCollator.
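For context, dynamic padding pads each batch only to its longest sequence rather than to a global `max_length`, saving compute on short examples. A minimal pure-Python sketch of the idea (BART's pad token id is 1; in practice `DataCollatorForSeq2Seq` does this for you and also pads labels with -100 so padding is ignored by the loss):

```python
def pad_batch(batch_ids, pad_id=1):
    """Pad a batch of token-id lists to the longest sequence in the batch.

    Returns padded ids and an attention mask (1 = real token, 0 = padding),
    mirroring what a dynamic-padding data collator produces for inputs.
    """
    max_len = max(len(ids) for ids in batch_ids)
    padded, mask = [], []
    for ids in batch_ids:
        pad = max_len - len(ids)
        padded.append(ids + [pad_id] * pad)
        mask.append([1] * len(ids) + [0] * pad)
    return padded, mask
```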
Example preprocessing pseudocode:
```python
def preprocess(example, tokenizer, max_src=1024, max_tgt=64):
    inputs = tokenizer(example["article"], max_length=max_src, truncation=True)
    # Tokenize summaries as targets; text_target routes them through the
    # target-side tokenizer settings.
    targets = tokenizer(text_target=example["highlights"], max_length=max_tgt, truncation=True)
    inputs["labels"] = targets["input_ids"]
    return inputs
```

Typical training hyperparameters used for final runs (adjust for GPU memory):
- model: facebook/bart-base
- optimizer: AdamW
- batch_size: 8–16 per device (use gradient_accumulation if needed)
- learning_rate: 3e-5 to 5e-5 (linear warmup + decay)
- weight_decay: 0.01
- epochs: 2–4 (monitor validation ROUGE)
- max_source_length: 512–1024
- max_target_length: 64–128
- seed: set for reproducibility
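The linear warmup + decay schedule mentioned above can be sketched directly. This standalone function illustrates the shape that `transformers`' linear scheduler produces; it is not the library API itself:

```python
def linear_warmup_decay(step, warmup_steps, total_steps, base_lr=3e-5):
    """Linear warmup from 0 to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```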
Example CLI (Trainer-based):
```bash
python src/train.py \
  --model_name_or_path facebook/bart-base \
  --dataset_name cnn_dailymail --dataset_config "3.0.0" \
  --output_dir ./outputs/bart-cnn \
  --per_device_train_batch_size 4 \
  --per_device_eval_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --learning_rate 3e-5 \
  --num_train_epochs 3 \
  --max_source_length 512 \
  --max_target_length 128 \
  --evaluation_strategy steps \
  --eval_steps 2500 \
  --save_steps 2500 \
  --predict_with_generate
```

Adjust `per_device_train_batch_size` and `gradient_accumulation_steps` together: the per-device batch must fit in GPU memory while the effective batch size stays at your target.
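As a quick sanity check, the effective batch size is the per-device batch times the accumulation steps times the number of GPUs. A hypothetical helper (not part of the training script):

```python
def effective_batch_size(per_device, grad_accum_steps, num_gpus=1):
    """Number of examples contributing to each optimizer update."""
    return per_device * grad_accum_steps * num_gpus
```

With the CLI above (4 per device, 8 accumulation steps, one T4), each optimizer step sees 32 examples.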
Kaggle notes:
- Use the provided GPU (T4) and set runtime to GPU.
- Checkpoint and log frequently so long runs can resume after the session time limit; training is often split across sessions by restarting from the latest checkpoint.
- Metric: ROUGE (rouge1, rouge2, rougeL). Use the `evaluate` or `rouge_score` package.
- Post-process predictions:
  - Strip extra whitespace.
  - Use the NLTK sentence tokenizer for consistent ROUGE reference segmentation as needed.
- With `predict_with_generate=True`, set `num_beams` (e.g., 4) for deterministic, high-quality outputs.
Example evaluation snippet:
```python
from evaluate import load

rouge = load("rouge")
preds = [postprocess(p) for p in raw_preds]
refs = [postprocess(r) for r in raw_refs]
scores = rouge.compute(predictions=preds, references=refs,
                       rouge_types=["rouge1", "rouge2", "rougeL"], use_stemmer=True)
```

Use the transformers pipeline for quick inference:
```python
from transformers import pipeline

summarizer = pipeline("summarization", model="atneel/bart-base-summarizer-xsum")
article = "..."  # your document
summary = summarizer(article, max_length=80, min_length=15, do_sample=False,
                     num_beams=4, length_penalty=1.0)
print(summary[0]["summary_text"])
```

Tuning inference:
- Use `num_beams=4` or higher for better search quality.
- `length_penalty` > 1 favors longer summaries; < 1 favors shorter outputs.
- `no_repeat_ngram_size=3` helps reduce repetition.
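To illustrate what `no_repeat_ngram_size` does, the sketch below computes which next tokens would be banned at a decoding step because they would complete an already-generated n-gram. This is a simplified, standalone version of the idea; the library applies equivalent logic per beam hypothesis:

```python
def banned_next_tokens(generated, n=3):
    """Tokens that would repeat an n-gram already present in `generated`.

    If the last n-1 generated tokens match an earlier (n-1)-gram, the token
    that followed that earlier occurrence is banned at this step.
    """
    if len(generated) < n:
        return set()
    prefix = tuple(generated[-(n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned
```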
Final reported scores on a held-out test subset (5,000 examples):
| Model/Dataset | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| BART-base on CNN/DailyMail | 39.75 | 17.29 | 26.94 |
| BART-base on XSUM | 38.98 | 16.21 | 31.45 |
Qualitative observations:
- CNN/DailyMail model shows lead bias (extractive tendency).
- XSUM model produces more abstractive single-sentence summaries.
Example:
- Article: NASA's James Webb captured first direct image of exoplanet HIP 65426 b...
- CNN/DailyMail summary: more extractive and multi-sentence.
- XSUM summary: concise, abstractive single-sentence summary.
- Fix random seeds (PyTorch, numpy, random).
- Log hyperparameters and checkpoint model weights.
- Save tokenizer and model config along with checkpoints.
- Use `predict_with_generate` in evaluation to obtain full text outputs for ROUGE.
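The seed-fixing tip above, as a sketch. The torch import is guarded so the snippet also runs in environments without PyTorch installed:

```python
import random

import numpy as np


def set_seed(seed=42):
    """Fix seeds for random, numpy, and (if available) torch."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # torch not installed; random/numpy reproducibility still holds
```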
- Models can hallucinate facts — verify critical information with the source.
- Data biases from news sources and summarizers will reflect in the model output.
- Open an issue for bugs or feature requests.
- Submit PRs with tests or notebook reproductions.
- Include reference to training logs and exact command used.
Specify an appropriate license for your code and models (MIT, Apache 2.0, etc.). Add a LICENSE file in the repo root.
If you use these models, please cite this repository and include a note about the datasets used (CNN/DailyMail, XSUM) and the model base (facebook/bart-base).