# Misinformation-Classifier

A high-performance fake news detection pipeline built on RoBERTa with LoRA (Low-Rank Adaptation) fine-tuning. The project demonstrates how parameter-efficient fine-tuning can deliver strong results even on limited compute, with a focus on cross-domain generalization across the FakeNews-Kaggle and LIAR datasets.
## Table of Contents

- Overview
- Project Goals
- Pipeline Overview
- Repository Structure
- Setup Instructions
- Usage
- Results
- Key Insights
- Future Work
- License
## Overview

This project applies Parameter-Efficient Fine-Tuning (PEFT) with LoRA adapters on RoBERTa-base, enabling efficient adaptation of large language models for misinformation detection. It evaluates both in-domain performance (FakeNews-Kaggle) and out-of-domain generalization (LIAR) to measure model robustness.
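For orientation, the sketch below shows how RoBERTa-base can be wrapped with LoRA adapters via the PEFT library. The rank, scaling factor, dropout, and target modules are illustrative assumptions, not the project's actual values; the real configuration lives in `src/models/text_model.py`.

```python
# Sketch: wrap RoBERTa-base with LoRA adapters using the PEFT library.
# Hyperparameters here (r, lora_alpha, target_modules) are illustrative.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=2,  # real vs. fake
)

config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                                 # low-rank dimension
    lora_alpha=16,                       # scaling factor
    lora_dropout=0.1,
    target_modules=["query", "value"],   # attention projections in RoBERTa
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically <1% of all weights are trainable
```

Only the adapter matrices are updated during training, which is what keeps the GPU memory footprint small.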
## Project Goals

- Develop an end-to-end reproducible pipeline for fake news detection.
- Fine-tune RoBERTa efficiently using LoRA adapters.
- Benchmark cross-domain performance using both FakeNews and LIAR datasets.
- Maintain lightweight training with minimal GPU memory footprint.
- Provide structured evaluation metrics and results visualization.
## Pipeline Overview

| Stage | Description |
|---|---|
| 1. Data Fetching | Downloads and preprocesses FakeNews-Kaggle and LIAR datasets. |
| 2. Preprocessing | Cleans text, encodes labels, and prepares train/val/test splits (see the sketch after this table). |
| 3. Model Setup | Initializes RoBERTa-base with LoRA adapters using the PEFT library. |
| 4. Training | Fine-tunes model using mixed precision and checkpoint saving. |
| 5. Evaluation | Computes metrics (Accuracy, Precision, Recall, F1) across domains. |
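To make stage 2 concrete, here is a rough sketch of cleaning and tokenizing a single example for RoBERTa. `clean_text` is a hypothetical stand-in; the actual logic lives in `src/data/preprocess.py` and may differ.

```python
# Sketch of the preprocessing stage: light cleaning plus RoBERTa tokenization.
# clean_text() is a hypothetical stand-in for the logic in src/data/preprocess.py.
import re
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def clean_text(text: str) -> str:
    """Strip URLs and collapse whitespace before tokenization."""
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    return re.sub(r"\s+", " ", text).strip()   # normalize whitespace

sample = "BREAKING: scientists confirm shocking result https://example.com/story"
encoded = tokenizer(
    clean_text(sample),
    truncation=True,
    max_length=256,       # illustrative; the project may use a different limit
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # (1, sequence_length)
```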
## Repository Structure

```
Misinformation-Classifier/
├── src/
│ ├── data/
│ │ ├── fetch_datasets.py # Fetch and prepare datasets (FakeNews, LIAR)
│ │ ├── datasets.py # PyTorch Dataset class
│ │ └── preprocess.py # Cleaning, tokenization, and encoding
│ │
│ ├── models/
│ │ ├── text_model.py # RoBERTa + LoRA architecture
│ │ └── train.py # Training loop and checkpoint saving
│ │
│ ├── evaluate.py # Evaluation script for performance metrics
│ └── utils/
│ ├── metrics.py # Metric computation utilities
│ └── seed.py # Reproducibility and seed setup
│
├── notebooks/
│ ├── 01_data_exploration.ipynb # Dataset analysis and visualization
│ ├── 02_model_training.ipynb # LoRA fine-tuning workflow
│ └── 03_model_evaluation.ipynb # Performance evaluation and confusion matrix
│
├── data/
│ ├── raw/ # Raw data (downloaded automatically)
│ └── processed/ # Cleaned and tokenized data splits
│
├── models/
│ ├── roberta_lora_multidomain_best/ # Best-performing model checkpoint
│ ├── roberta_lora_multidomain_steps/ # Step-based checkpoints
│ ├── roberta_lora_multidomain_merged/ # Final merged adapter model
│ ├── roberta_lora_multidomain_history.csv # Training history log
│ └── roberta_lora_multidomain_evaluation_summary.csv # Evaluation summary
│
├── requirements.txt # Python dependencies
├── Makefile # Reproducible training commands
└── README.md                                        # Project documentation
```

## Setup Instructions

Clone the repository:

```bash
git clone https://github.com/ashbix23/Misinformation-Classifier.git
cd Misinformation-Classifier
```

Create and activate a conda environment, then install the dependencies:

```bash
conda create -n fakenews python=3.10 -y
conda activate fakenews
pip install -r requirements.txt
```

## Usage

Fetch and prepare the datasets:

```bash
python -m src.data.fetch_datasets
```

This script downloads, cleans, and processes both the FakeNews-Kaggle and LIAR datasets, saving ready-to-use files in `data/processed/`.

Explore the data:

```bash
jupyter notebook notebooks/01_data_exploration.ipynb
```

Train the model:

```bash
python -m src.train --model roberta-base --lora --datasets FakeNews LIAR
```

Evaluate the model:

```bash
jupyter notebook notebooks/03_model_evaluation.ipynb
```

This notebook generates accuracy, precision, recall, F1, and confusion matrices for each dataset.
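The reported metrics can be reproduced with scikit-learn. Below is a minimal sketch assuming weighted averaging across classes; the project's own implementation is in `src/utils/metrics.py` and may differ in detail.

```python
# Sketch: compute the four reported metrics from gold labels and predictions.
# Weighted averaging is an assumption; src/utils/metrics.py may differ.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(y_true, y_pred):
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

print(compute_metrics([0, 1, 1, 0], [0, 1, 0, 0]))
```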
## Results

| Dataset | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| FakeNews | 0.773 | 0.846 | 0.773 | 0.763 |
| LIAR | 0.443 | 0.754 | 0.443 | 0.273 |
## Key Insights

- Strong in-domain performance on FakeNews-Kaggle.
- Sharp drop in cross-domain generalization on LIAR, which consists of short factual claims rather than full articles.
- Model learns strong contextual and linguistic cues for misinformation.
- LoRA fine-tuning reduces memory and compute cost while preserving model quality.
- Cross-domain adaptation remains challenging due to dataset structural differences.
- Clean preprocessing and normalization dramatically improve results.
- Mixed-precision training (AMP) improves training efficiency on GPUs (see the sketch after this list).
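As a reference for the AMP point above, here is a minimal sketch of one mixed-precision training step in PyTorch, reusing the `model` from the LoRA sketch; the actual loop (checkpointing, scheduling, logging) lives in `src/models/train.py`.

```python
# Sketch of a single mixed-precision (AMP) training step in PyTorch.
# `model` is the LoRA-wrapped classifier from the earlier sketch; the real
# training loop lives in src/models/train.py.
import torch

def train_step(model, batch, optimizer, scaler):
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda"):  # forward pass in reduced precision
        loss = model(**batch).loss
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)         # unscale gradients, then apply the update
    scaler.update()                # adapt the loss scale for the next step
    return loss.item()

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)  # illustrative LR
scaler = torch.cuda.amp.GradScaler()
```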
## Future Work

- Domain-adaptive fine-tuning for short-form factual claims.
- Add more datasets for multi-domain robustness.
- Build an inference API using FastAPI or Gradio for quick model testing (a minimal sketch follows).
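As a starting point for that idea, here is a hedged Gradio sketch; the checkpoint path assumes the merged model directory from the repository layout, and the interface details are illustrative.

```python
# Sketch: minimal Gradio demo around the merged checkpoint.
# The model path and label names are assumptions; adapt them to the real artifacts.
import gradio as gr
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="models/roberta_lora_multidomain_merged",  # merged LoRA checkpoint
)

def predict(text: str) -> dict:
    result = clf(text)[0]
    return {result["label"]: result["score"]}

gr.Interface(
    fn=predict,
    inputs=gr.Textbox(lines=4, label="News text"),
    outputs=gr.Label(label="Prediction"),
    title="Fake News Detector (RoBERTa + LoRA)",
).launch()
```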
## License

This project is licensed under the MIT License. You are free to use, modify, and distribute it for academic or research purposes.