This repository contains code, notes, and fine-tuned models for building abstractive text summarizers using the Hugging Face Transformers ecosystem. The goal is to produce concise, human-like summaries of news articles using sequence-to-sequence Transformer models (BART).
Two BART-base models were fine-tuned:
- `atneel/bart-base-summarizer` — trained on CNN/DailyMail (multi-sentence summaries; some extractive bias).
- `atneel/bart-base-summarizer-xsum` — trained on XSUM (single-sentence, highly abstractive summaries).
Platform: Kaggle Notebooks with NVIDIA T4. Framework: PyTorch + Hugging Face Transformers + Datasets + Evaluate.
- Full training & evaluation scripts compatible with the Hugging Face Trainer API.
- Data processing pipelines for CNN/DailyMail and XSUM (tokenization, truncation, label shifting).
- Evaluation using ROUGE (rouge1, rouge2, rougeL) with standard preprocessing (sentence tokenization, detokenization).
Minimum Python environment:
- Python 3.8+
- torch
- transformers
- datasets
- evaluate
- nltk
- rouge_score (optional but recommended)

Install quickly:

```bash
pip install torch transformers datasets evaluate nltk rouge_score
python -m nltk.downloader punkt
```

Datasets used:
- CNN/DailyMail — long news articles with multi-sentence highlights
- XSUM — short, single-sentence highly-abstractive summaries
Typical preprocessing steps:
- Lowercasing is optional (original casing was kept for BART).
- Truncate articles to `max_source_length` (e.g., 1024 for BART if fine-tuned that way; for `bart-base` use 768–1024 carefully depending on memory).
- Truncate summaries to `max_target_length` (e.g., 64–128 for XSUM).
- Tokenize with the model tokenizer and set `padding='max_length'` for batching, or use dynamic padding in a DataCollator.
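For context, dynamic padding pads each batch only to its longest sequence rather than to a global `max_length`, saving compute on short examples. A minimal pure-Python sketch of the idea (BART's pad token id is 1; in practice `DataCollatorForSeq2Seq` does this for you and also pads labels with -100 so padding is ignored by the loss):

```python
def pad_batch(batch_ids, pad_id=1):
    """Pad a batch of token-id lists to the longest sequence in the batch.

    Returns padded ids and an attention mask (1 = real token, 0 = padding),
    mirroring what a dynamic-padding data collator produces for inputs.
    """
    max_len = max(len(ids) for ids in batch_ids)
    padded, mask = [], []
    for ids in batch_ids:
        pad = max_len - len(ids)
        padded.append(ids + [pad_id] * pad)
        mask.append([1] * len(ids) + [0] * pad)
    return padded, mask
```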
Example preprocessing pseudocode:
```python
def preprocess(example, tokenizer, max_src=1024, max_tgt=64):
    inputs = tokenizer(example["article"], max_length=max_src, truncation=True)
    # Tokenize summaries as targets; text_target routes them through the
    # target-side tokenizer settings.
    targets = tokenizer(text_target=example["highlights"], max_length=max_tgt, truncation=True)
    inputs["labels"] = targets["input_ids"]
    return inputs
```

Typical training hyperparameters used for final runs (adjust for GPU memory):
- model: facebook/bart-base
- optimizer: AdamW
- batch_size: 8–16 per device (use gradient_accumulation if needed)
- learning_rate: 3e-5 to 5e-5 (linear warmup + decay)
- weight_decay: 0.01
- epochs: 2–4 (monitor validation ROUGE)
- max_source_length: 512–1024
- max_target_length: 64–128
- seed: set for reproducibility
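The linear warmup + decay schedule mentioned above can be sketched directly. This standalone function illustrates the shape that `transformers`' linear scheduler produces; it is not the library API itself:

```python
def linear_warmup_decay(step, warmup_steps, total_steps, base_lr=3e-5):
    """Linear warmup from 0 to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```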
Example CLI (Trainer-based):
```bash
python src/train.py \
  --model_name_or_path facebook/bart-base \
  --dataset_name cnn_dailymail --dataset_config "3.0.0" \
  --output_dir ./outputs/bart-cnn \
  --per_device_train_batch_size 4 \
  --per_device_eval_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --learning_rate 3e-5 \
  --num_train_epochs 3 \
  --max_source_length 512 \
  --max_target_length 128 \
  --evaluation_strategy steps \
  --eval_steps 2500 \
  --save_steps 2500 \
  --predict_with_generate
```

Adjust `per_device_train_batch_size` and `gradient_accumulation_steps` together: the per-device batch must fit in GPU memory while the effective batch size stays at your target.
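As a quick sanity check, the effective batch size is the per-device batch times the accumulation steps times the number of GPUs. A hypothetical helper (not part of the training script):

```python
def effective_batch_size(per_device, grad_accum_steps, num_gpus=1):
    """Number of examples contributing to each optimizer update."""
    return per_device * grad_accum_steps * num_gpus
```

With the CLI above (4 per device, 8 accumulation steps, one T4), each optimizer step sees 32 examples.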
Kaggle notes:
- Use the provided GPU (T4) and set runtime to GPU.
- Checkpoint and log frequently so long runs can resume after the session time limit; training is often split across sessions by restarting from the latest checkpoint.
- Metric: ROUGE (rouge1, rouge2, rougeL). Use the `evaluate` or `rouge_score` package.
- Post-process predictions:
  - Strip extra whitespace.
  - Use the NLTK sentence tokenizer for consistent ROUGE reference segmentation as needed.
- With `predict_with_generate=True`, set `num_beams` (e.g., 4) for deterministic, high-quality outputs.
Example evaluation snippet:
```python
from evaluate import load

rouge = load("rouge")
preds = [postprocess(p) for p in raw_preds]
refs = [postprocess(r) for r in raw_refs]
scores = rouge.compute(predictions=preds, references=refs,
                       rouge_types=["rouge1", "rouge2", "rougeL"], use_stemmer=True)
```

Use the transformers pipeline for quick inference:
```python
from transformers import pipeline

summarizer = pipeline("summarization", model="atneel/bart-base-summarizer-xsum")
article = "..."  # your document
summary = summarizer(article, max_length=80, min_length=15, do_sample=False,
                     num_beams=4, length_penalty=1.0)
print(summary[0]["summary_text"])
```

Tuning inference:
- Use `num_beams=4` or higher for better search quality.
- `length_penalty` > 1 favors longer summaries; < 1 favors shorter outputs.
- `no_repeat_ngram_size=3` helps reduce repetition.
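To illustrate what `no_repeat_ngram_size` does, the sketch below computes which next tokens would be banned at a decoding step because they would complete an already-generated n-gram. This is a simplified, standalone version of the idea; the library applies equivalent logic per beam hypothesis:

```python
def banned_next_tokens(generated, n=3):
    """Tokens that would repeat an n-gram already present in `generated`.

    If the last n-1 generated tokens match an earlier (n-1)-gram, the token
    that followed that earlier occurrence is banned at this step.
    """
    if len(generated) < n:
        return set()
    prefix = tuple(generated[-(n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned
```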
Final reported scores on a held-out test subset (5,000 examples):
| Model/Dataset | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| BART-base on CNN/DailyMail | 39.75 | 17.29 | 26.94 |
| BART-base on XSUM | 38.98 | 16.21 | 31.45 |
Qualitative observations:
- CNN/DailyMail model shows lead bias (extractive tendency).
- XSUM model produces more abstractive single-sentence summaries.
Example:
- Article: NASA's James Webb captured first direct image of exoplanet HIP 65426 b...
- CNN/DailyMail summary: more extractive and multi-sentence.
- XSUM summary: concise, abstractive single-sentence summary.
- Fix random seeds (PyTorch, numpy, random).
- Log hyperparameters and checkpoint model weights.
- Save tokenizer and model config along with checkpoints.
- Use `predict_with_generate` in evaluation to obtain full text outputs for ROUGE.
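The seed-fixing tip above, as a sketch. The torch import is guarded so the snippet also runs in environments without PyTorch installed:

```python
import random

import numpy as np


def set_seed(seed=42):
    """Fix seeds for random, numpy, and (if available) torch."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # torch not installed; random/numpy reproducibility still holds
```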
- Models can hallucinate facts — verify critical information with the source.
- Data biases from news sources and summarizers will reflect in the model output.
- Open an issue for bugs or feature requests.
- Submit PRs with tests or notebook reproductions.
- Include reference to training logs and exact command used.
Specify an appropriate license for your code and models (MIT, Apache 2.0, etc.). Add a LICENSE file in the repo root.
If you use these models, please cite this repository and include a note about the datasets used (CNN/DailyMail, XSUM) and the model base (facebook/bart-base).