Abstractive Text Summarizer using Transformers

This repository contains code, notes, and fine-tuned models for building abstractive text summarizers using the Hugging Face Transformers ecosystem. The goal is to produce concise, human-like summaries of news articles using sequence-to-sequence Transformer models (BART).

Project summary

Two BART-base models were fine-tuned:

  • atneel/bart-base-summarizer — trained on CNN/DailyMail (multi-sentence summaries; some extractive bias).
  • atneel/bart-base-summarizer-xsum — trained on XSUM (single-sentence, highly abstractive summaries).

Platform: Kaggle Notebooks with NVIDIA T4. Framework: PyTorch + Hugging Face Transformers + Datasets + Evaluate.
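
Both fine-tuned checkpoints can be loaded directly from the Hugging Face Hub; a minimal loading sketch (swap the model id for the CNN/DailyMail checkpoint as needed):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the XSUM-fine-tuned checkpoint; use "atneel/bart-base-summarizer" for the CNN/DailyMail model
tokenizer = AutoTokenizer.from_pretrained("atneel/bart-base-summarizer-xsum")
model = AutoModelForSeq2SeqLM.from_pretrained("atneel/bart-base-summarizer-xsum")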

Key features

  • Full training & evaluation scripts compatible with the Hugging Face Trainer API.
  • Data processing pipelines for CNN/DailyMail and XSUM (tokenization, truncation, label shifting).
  • Evaluation using ROUGE (rouge1, rouge2, rougeL) with standard preprocessing (sentence tokenization, detokenization).

Requirements

Minimum Python environment:

  • Python 3.8+
  • torch
  • transformers
  • datasets
  • evaluate
  • nltk
  • rouge_score (optional but recommended)

Install quickly:
pip install torch transformers datasets evaluate nltk rouge_score
python -m nltk.downloader punkt

Data

Datasets used:

  • CNN/DailyMail — long news articles with multi-sentence highlights
  • XSUM — short, single-sentence highly-abstractive summaries
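
Both datasets can be pulled with the Hugging Face datasets library; a minimal loading sketch (the config and Hub ids shown are the commonly used ones and may need adjusting):

from datasets import load_dataset

# CNN/DailyMail requires an explicit config version; "3.0.0" is the standard one
cnn_dm = load_dataset("cnn_dailymail", "3.0.0")
# XSUM (the Hub id may also appear as "EdinburghNLP/xsum")
xsum = load_dataset("xsum")

print(cnn_dm["train"][0].keys())  # article, highlights, id
print(xsum["train"][0].keys())    # document, summary, id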

Typical preprocessing steps:

  • Lowercasing is optional (original casing was kept for BART).
  • Truncate articles to max_source_length (e.g., 1024 for BART; with bart-base, choose 768–1024 depending on available GPU memory).
  • Truncate summaries to max_target_length (e.g., 64–128 for XSUM).
  • Tokenize with the model tokenizer and set padding='max_length' for batching or use dynamic padding in DataCollator.

Example preprocessing function:

def preprocess(example, tokenizer, max_src=1024, max_tgt=64):
    # CNN/DailyMail uses "article"/"highlights"; XSUM uses "document"/"summary"
    inputs = tokenizer(example["article"], max_length=max_src, truncation=True)
    # text_target (transformers >= 4.22) handles target-side tokenization for the summaries
    targets = tokenizer(text_target=example["highlights"], max_length=max_tgt, truncation=True)
    inputs["labels"] = targets["input_ids"]
    return inputs
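
For the dynamic-padding option mentioned above, DataCollatorForSeq2Seq pads each batch to its longest member and pads labels with -100 so they are ignored by the loss. A minimal sketch:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

# Pads inputs and labels per batch; label padding uses -100 so the loss skips padded positions
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, label_pad_token_id=-100)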

Training (high-level)

Typical training hyperparameters used for final runs (adjust for GPU memory):

  • model: facebook/bart-base
  • optimizer: AdamW
  • batch_size: 8–16 per device (use gradient_accumulation if needed)
  • learning_rate: 3e-5 to 5e-5 (linear warmup + decay)
  • weight_decay: 0.01
  • epochs: 2–4 (monitor validation ROUGE)
  • max_source_length: 512–1024
  • max_target_length: 64–128
  • seed: set for reproducibility

Example CLI (Trainer-based):

python src/train.py \
  --model_name_or_path facebook/bart-base \
  --dataset_name cnn_dailymail --dataset_config "3.0.0" \
  --output_dir ./outputs/bart-cnn \
  --per_device_train_batch_size 4 \
  --per_device_eval_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --learning_rate 3e-5 \
  --num_train_epochs 3 \
  --max_source_length 512 \
  --max_target_length 128 \
  --evaluation_strategy steps \
  --eval_steps 2500 \
  --save_steps 2500 \
  --predict_with_generate

Adjust per_device_train_batch_size and gradient_accumulation_steps so that effective batch size fits GPU memory.
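
The same configuration can also be written in Python against the Seq2SeqTrainer. A minimal sketch, assuming tokenized_train / tokenized_val splits produced by the preprocessing function above, plus the model, tokenizer, and data_collator from the Data section:

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

args = Seq2SeqTrainingArguments(
    output_dir="./outputs/bart-cnn",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size = 4 * 8 = 32 per device
    learning_rate=3e-5,
    weight_decay=0.01,
    num_train_epochs=3,
    evaluation_strategy="steps",
    eval_steps=2500,
    save_steps=2500,
    predict_with_generate=True,
    seed=42,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()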

Kaggle notes:

  • Use the provided GPU (T4) and set runtime to GPU.
  • Checkpoint and log at frequent step intervals so long runs can be split across sessions and resumed from the last checkpoint before the session time limit is hit.

Evaluation

  • Metric: ROUGE (rouge1, rouge2, rougeL). Use the evaluate or rouge_score package.
  • Post-process predictions:
    • Strip extra whitespace.
    • Use NLTK sentence tokenizer for consistent ROUGE reference segmentation as needed.
  • With predict_with_generate=True, set num_beams (e.g., 4) for deterministic, higher-quality outputs.

Example evaluation snippet:

import nltk
from evaluate import load

def postprocess(text):
    # strip whitespace and put one sentence per line for consistent ROUGE segmentation
    return "\n".join(nltk.sent_tokenize(text.strip()))

rouge = load("rouge")
preds = [postprocess(p) for p in raw_preds]
refs = [postprocess(r) for r in raw_refs]
scores = rouge.compute(predictions=preds, references=refs, rouge_types=["rouge1", "rouge2", "rougeL"], use_stemmer=True)
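
Hooked into the Trainer, the same ROUGE computation becomes a compute_metrics callback. A minimal sketch, assuming predict_with_generate=True and the tokenizer from the training setup above (recent versions of evaluate return plain floats from rouge.compute):

import numpy as np
import nltk
from evaluate import load

rouge = load("rouge")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # Replace the -100 label padding before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # One sentence per line for consistent ROUGE segmentation
    decoded_preds = ["\n".join(nltk.sent_tokenize(p.strip())) for p in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(r.strip())) for r in decoded_labels]
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    return {k: round(v * 100, 2) for k, v in result.items()}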

Inference / Usage

Use the transformers pipeline for quick inference:

from transformers import pipeline

summarizer = pipeline("summarization", model="atneel/bart-base-summarizer-xsum")
article = "..."  # your document
summary = summarizer(article, max_length=80, min_length=15, do_sample=False, num_beams=4, length_penalty=1.0)
print(summary[0]['summary_text'])

Tuning inference:

  • Use num_beams=4 or higher for better search quality.
  • length_penalty >1 favors longer summaries; <1 favors shorter outputs.
  • no_repeat_ngram_size=3 helps reduce repetition.
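
The same knobs apply when calling generate() directly instead of going through the pipeline. A minimal sketch using the XSUM checkpoint named above (article is the document string from the pipeline example):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "atneel/bart-base-summarizer-xsum"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt")
summary_ids = model.generate(
    **inputs,
    num_beams=4,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
    max_length=80,
    min_length=15,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))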

Results

Final reported scores on a held-out test subset (5,000 examples):

Model / Dataset                 ROUGE-1   ROUGE-2   ROUGE-L
BART-base on CNN/DailyMail        39.75     17.29     26.94
BART-base on XSUM                 38.98     16.21     31.45

Qualitative observations:

  • CNN/DailyMail model shows lead bias (extractive tendency).
  • XSUM model produces more abstractive single-sentence summaries.

Example:

  • Article: NASA's James Webb captured its first direct image of exoplanet HIP 65426 b...
  • CNN/DailyMail summary: more extractive and multi-sentence.
  • XSUM summary: concise, abstractive single-sentence summary.

Tips for reproducible training

  • Fix random seeds (PyTorch, numpy, random).
  • Log hyperparameters and checkpoint model weights.
  • Save tokenizer and model config along with checkpoints.
  • Use predict_with_generate in evaluation to obtain full text outputs for ROUGE.
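
For the seed bullet, transformers provides a helper that seeds Python, NumPy, and PyTorch in one call. A minimal sketch:

from transformers import set_seed

set_seed(42)  # seeds random, numpy, and torch (including CUDA) for reproducible runs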

Limitations & Ethics

  • Models can hallucinate facts — verify critical information with the source.
  • Biases in the source news articles and reference summaries will be reflected in the model output.

Contributing

  • Open an issue for bugs or feature requests.
  • Submit PRs with tests or notebook reproductions.
  • Include a reference to the training logs and the exact command used.

License

Specify an appropriate license for your code and models (MIT, Apache 2.0, etc.). Add a LICENSE file in the repo root.

Contact / Citation

If you use these models, please cite this repository and include a note about the datasets used (CNN/DailyMail, XSUM) and the model base (facebook/bart-base).
