A configurable pipeline for OCR'ing, analyzing, and visualizing historical newspapers. Extract entities, build networks, analyze text, and generate interactive dashboards - all customizable through YAML configuration files.
Inspired by a conversation about Jack the Ripper
- OCR Processing: Multiple OCR engines (Tesseract, PaddleOCR, Surya, Google Gemini, ocrmac)
- Article Segmentation: Automatic detection of article boundaries
- Configurable Tagging: Define your own research topics and keywords
- Timeline Analysis: Correlate local publications with reference events
- Text Analysis: Comparative linguistics and sensationalism metrics
- Entity Extraction: Extract people, places, organizations with normalization
- Network Analysis: Co-occurrence networks, temporal analysis, community detection
- Interactive Dashboard: Single-page HTML dashboard with D3.js visualizations
git clone https://github.com/XLabCU/newspaper_utilities.git
cd newspaper_utilities
pip install -r scripts/requirements.txt
# For spaCy entity extraction
python -m spacy download en_core_web_sm
# Install system dependencies
apt-get install -y poppler-utils tesseract-ocr
- Place your PDF files in the pdfs/ folder
- Run the complete pipeline:
python scripts/run_pipeline.py
This will:
- Preprocess PDFs to high-quality image snippets
- Run OCR (default: Tesseract)
- Segment articles
- Tag articles by theme
- Generate timeline
- Analyze text
- Extract entities and build networks
- Generate interactive dashboard
Right now, ocrmac gives good results; Gemini requires an API key; and I'm experimenting with Groq and the Llama 4 Maverick model (see [the Groq vision docs](https://console.groq.com/docs/vision)). If you want to try Groq, get an API key and set it in your terminal with export GROQ_API_KEY=<your-api-key-here> (note that you do not put the actual key between < and >!). Then run python scripts/process_pdfs_groq.py. You may want to play with the prompt to get the best results: start with what is in the script, experiment in the Groq playground for this model until you find something that works well, then put it into the script.
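For a starting point outside the playground, here is a minimal sketch of what a Groq vision call can look like. It uses the groq Python client's OpenAI-style chat API; the model id, prompt, and snippet path are placeholders, and the real logic lives in scripts/process_pdfs_groq.py, so treat this as illustration rather than the script's implementation.

```python
import base64
import os

from groq import Groq  # pip install groq

# Assumes GROQ_API_KEY is exported in your shell (see above).
client = Groq(api_key=os.environ["GROQ_API_KEY"])

def ocr_image(image_path: str, prompt: str) -> str:
    """Send one newspaper snippet to a Groq vision model and return the transcription."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        # Placeholder model id -- check https://console.groq.com/docs/vision for the current one.
        model="meta-llama/llama-4-maverick-17b-128e-instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Hypothetical snippet path; iterate on the prompt in the playground first.
text = ocr_image("data/preprocessed/page_001_col_1.png",
                 "Transcribe this newspaper column exactly, preserving line breaks.")
```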
All of the OCR methods take time; Tesseract is the fastest.
- Open dashboard/index.html in your browser, or serve it locally:
python serve.py
Create a custom project configuration:
# Run with Whitechapel Ripper project config
python scripts/run_pipeline.py --config config/projects/whitechapel_ripper.yaml
# Run with a different OCR engine
python scripts/run_pipeline.py --ocr-engine surya --config config/projects/my_project.yaml
If you've already OCR'd your documents and want to re-run just the analysis steps (tagging, timeline, text analysis, entity extraction, dashboard generation), use the data analysis pipeline:
# Auto-detects the most recent OCR output file in data/raw/
python scripts/run_data_analysis.py --config config/projects/your_project.yaml
# Or specify a particular OCR file
python scripts/run_data_analysis.py --config config/projects/your_project.yaml --ocr-file ocr_output_vision.jsonl
This will:
- Segment articles from OCR output (auto-detects the most recent ocr_output*.jsonl or *.json)
- Tag articles by theme
- Generate timeline correlations
- Analyze text patterns
- Extract entities and build networks
- Generate interactive dashboard
Use cases:
- Experimenting with different configuration parameters
- Re-generating the dashboard after config changes
- Running analysis on previously OCR'd documents
- Faster iteration during research development
The system uses YAML configuration files to customize analysis for different research projects.
project:
  name: "My Research Project"
  description: "Analysis of historical newspapers"
  date_range: ["1880-01-01", "1890-12-31"]

tags:
  - id: "politics"
    label: "Political News"
    keywords: ["election", "parliament", "government"]
    weight: 10
    color: "#3498db"

timeline:
  reference_events:
    - id: "event_01"
      date: "1885-03-15"
      title: "Major Historical Event"
      location: "City, Country"
      type: "event"
      correlation_tags: ["politics"]

text_analysis:
  comparison_groups:
    - id: "political_coverage"
      label: "Political Coverage"
      filter:
        tags: ["politics"]
    - id: "other_news"
      label: "Other News"
      filter:
        exclude_tags: ["politics"]

entity_extraction:
  entity_types:
    - name: "PERSON"
      enabled: true
      color: "#FF6B6B"
    - name: "GPE"
      enabled: true
      color: "#45B7D1"
  normalization:
    enabled: true
    aliases:
      "Queen Victoria": ["Victoria", "Her Majesty"]
    fuzzy_matching:
      enabled: true
      threshold: 0.85

network_analysis:
  graphs:
    - name: "entity_cooccurrence"
      enabled: true
      type: "cooccurrence"
    - name: "temporal_network"
      enabled: true
      type: "temporal"
      parameters:
        time_slices: "month"
  metrics:
    node_metrics:
      - "degree_centrality"
      - "betweenness_centrality"
      - "pagerank"
  community_detection:
    enabled: true
    algorithms:
      - name: "louvain"
        enabled: true

See config/projects/whitechapel_ripper.yaml for a complete example.
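Every script consumes this same YAML structure. As a quick illustration (not taken from the pipeline code), loading and querying a project config with PyYAML looks roughly like this:

```python
import yaml

# Load a project configuration (path is the example shipped with the repo).
with open("config/projects/whitechapel_ripper.yaml") as f:
    config = yaml.safe_load(f)

# Pull out the pieces most scripts care about.
project_name = config["project"]["name"]
tags = config.get("tags", [])

# e.g. build a keyword lookup for the tagging step
keywords_by_tag = {t["id"]: t["keywords"] for t in tags}
print(project_name, keywords_by_tag)
```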
Converts PDFs to 300 DPI image snippets, detecting columns and article boundaries.
python scripts/preprocess.py
Output: data/preprocessed/ directory with image snippets
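The heart of this step is rendering pages at high resolution before snippet detection. A simplified sketch using pdf2image (which relies on the poppler-utils installed earlier) might look like the following; the actual column and article-boundary detection in scripts/preprocess.py is more involved:

```python
from pathlib import Path

from pdf2image import convert_from_path  # requires poppler-utils

out_dir = Path("data/preprocessed")
out_dir.mkdir(parents=True, exist_ok=True)

for pdf in Path("pdfs").glob("*.pdf"):
    # Render each page at 300 DPI; column/article snippet detection happens downstream.
    pages = convert_from_path(str(pdf), dpi=300)
    for i, page in enumerate(pages, start=1):
        page.save(out_dir / f"{pdf.stem}_page_{i:03d}.png")
```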
Choose your OCR engine:
# Tesseract (default, most stable)
python scripts/process_pdfs_tesseract.py
# PaddleOCR (better for degraded text)
python scripts/process_pdfs.py
# Surya (highest accuracy, GPU recommended)
python scripts/process_images_surya_batch.py
# Google Gemini API (cloud-based)
export GEMINI_API_KEY=your_key
python scripts/process_pdfs_gemini.py
Output: data/raw/ocr_output_*.jsonl
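As a rough sketch of what the Tesseract route does, each preprocessed snippet is OCR'd and written as one JSONL record; the field names below are illustrative rather than the exact schema used by process_pdfs_tesseract.py:

```python
import json
from pathlib import Path

import pytesseract
from PIL import Image

with open("data/raw/ocr_output_tesseract.jsonl", "w") as out:
    for snippet in sorted(Path("data/preprocessed").glob("*.png")):
        text = pytesseract.image_to_string(Image.open(snippet))
        # One JSON record per snippet; downstream steps read this JSONL file.
        out.write(json.dumps({"source": snippet.name, "text": text}) + "\n")
```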
Groups OCR text into coherent articles.
python scripts/segment_articles.py
Output: data/processed/articles.json
Classifies articles by configurable themes.
python scripts/tag_articles.py --config config/projects/my_project.yaml
Output: data/processed/tagged_articles.json
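Tagging is keyword-driven, using the id, keywords, and weight fields from the tags section of the config. A simplified sketch of the matching idea (not the script's exact scoring) is:

```python
def score_article(text: str, tags: list[dict]) -> list[str]:
    """Return the ids of all tags whose keywords appear in the article text."""
    matched = []
    lowered = text.lower()
    for tag in tags:
        # weight could rank or threshold matches; here we just count keyword hits.
        hits = sum(lowered.count(kw.lower()) for kw in tag["keywords"])
        if hits > 0:
            matched.append(tag["id"])
    return matched

# tags would come from config["tags"] as loaded in the YAML example above.
example_tags = [{"id": "politics", "keywords": ["election", "parliament"], "weight": 10}]
print(score_article("The election results reached the village by wire.", example_tags))
```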
Correlates reference events with local publications.
python scripts/generate_timeline.py --config config/projects/my_project.yaml
Output: data/processed/timeline.json
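Conceptually, each reference event from the config is matched against articles that share one of its correlation_tags and fall near its date. The sketch below illustrates the idea with an assumed 30-day window and assumed article fields (date, tags); the real script's matching rules may differ:

```python
from datetime import date

def correlate(event: dict, articles: list[dict], window_days: int = 30) -> list[dict]:
    """Find articles that share a correlation tag with the event and appear within the window."""
    event_date = date.fromisoformat(event["date"])
    wanted = set(event.get("correlation_tags", []))
    return [
        a for a in articles
        if wanted & set(a["tags"])
        and abs((date.fromisoformat(a["date"]) - event_date).days) <= window_days
    ]

event = {"date": "1885-03-15", "correlation_tags": ["politics"]}
articles = [{"date": "1885-03-29", "tags": ["politics"], "title": "Parliament dissolved"}]
print(correlate(event, articles))
```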
Comparative linguistic analysis across article groups.
python scripts/analyze_text.py --config config/projects/my_project.yaml
Output: data/processed/text_analysis.json
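A toy illustration of the kind of per-group measures involved (vocabulary richness and sentence length; the actual script computes more, including sensationalism-style metrics) is:

```python
import re

def basic_stats(texts: list[str]) -> dict:
    """Crude comparative measures for one group of article texts."""
    words = [w.lower() for t in texts for w in re.findall(r"[A-Za-z']+", t)]
    sentences = [s for t in texts for s in re.split(r"[.!?]+", t) if s.strip()]
    return {
        "tokens": len(words),
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }

print(basic_stats(["A dreadful outrage! The town is in an uproar.",
                   "Council met Tuesday. Routine business was transacted."]))
```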
Extract entities, build networks, calculate metrics, detect communities.
python scripts/extract_entities_enhanced.py --config config/projects/my_project.yaml
Outputs:
- data/processed/entity_network.json - Complete dataset
- data/processed/entities.json - Legacy format
- data/processed/*.graphml - Network files for Gephi
- data/processed/*_d3.json - D3.js visualization format
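Under the hood this step pairs spaCy NER with NetworkX graph building. A minimal sketch of constructing a co-occurrence graph from article texts, much simplified relative to extract_entities_enhanced.py, looks like:

```python
from itertools import combinations

import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")  # downloaded during setup
G = nx.Graph()

articles = ["Queen Victoria addressed Parliament in London."]
for text in articles:
    doc = nlp(text)
    ents = {e.text for e in doc.ents if e.label_ in {"PERSON", "GPE", "ORG"}}
    # Entities that co-occur in the same article get a weighted edge.
    for a, b in combinations(sorted(ents), 2):
        w = G.get_edge_data(a, b, {}).get("weight", 0)
        G.add_edge(a, b, weight=w + 1)

print(G.number_of_nodes(), G.number_of_edges())
```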
See the topic_model.py script. In the config, set:
topic_modeling:
  n_topics: 8
  n_top_words: 15
  iterations: 1000
The topic model script also reads these elements from the config file:
text_analysis: comparison_groups
This is the most critical section for the script.
- id & label: Used to identify and name the resulting models (e.g., "Ripper Coverage").
- filter: tags: The script uses this list to pull only articles that have these specific tags (e.g., whitechapel_ripper).
- filter: exclude_tags: The script uses this to ensure certain articles are left out of a specific model (e.g., modeling "General News" by excluding anything tagged as "Advertisement").
text_analysis: custom_stopwords
custom_stopwords: The script pulls this list and merges it with the words from mallet.txt.
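As an illustration of how n_topics, n_top_words, and iterations can drive a topic model, here is a compact sketch using scikit-learn's LDA; the backend and preprocessing in topic_model.py may differ, and in practice the stopword list would also include custom_stopwords and mallet.txt:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the election and parliament news", "a murder most foul in whitechapel",
        "parliament debates the election bill", "police hunt the whitechapel murderer"]
n_topics, n_top_words, iterations = 2, 5, 1000  # values as configured above

vectorizer = CountVectorizer(stop_words="english")  # plus custom_stopwords / mallet.txt in practice
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=n_topics, max_iter=iterations, random_state=0)
lda.fit(X)

vocab = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [vocab[j] for j in topic.argsort()[::-1][:n_top_words]]
    print(f"topic {i}: {', '.join(top)}")
```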
Create interactive HTML dashboard.
python scripts/generate_dashboard.py --config config/projects/my_project.yaml
Output: dashboard/index.html
Use default configuration for basic analysis:
python scripts/run_pipeline.py
Analyzes Jack the Ripper coverage in 1880s Canadian newspapers:
python scripts/run_pipeline.py --config config/projects/whitechapel_ripper.yaml
- Copy a config template:
cp config/projects/whitechapel_ripper.yaml config/projects/my_project.yaml
- Edit my_project.yaml with your research topics, keywords, and events
- Run the pipeline:
python scripts/run_pipeline.py --config config/projects/my_project.yaml
The generated dashboard (dashboard/index.html) includes:
- Statistics: Quick stats (total articles, entities, time span)
- Timeline: Interactive timeline correlating events and publications
- Entity Network: D3.js force-directed graph showing entity relationships
- Drag nodes to rearrange
- Hover for details
- Zoom with scroll
- Text Analysis: Comparative statistics across article groups
- Article Browser: Searchable, filterable table of all articles
Configure aliases to merge variant entity names:
entity_extraction:
  normalization:
    enabled: true
    aliases:
      "Jack the Ripper": ["the Ripper", "Whitechapel Fiend"]
    fuzzy_matching:
      enabled: true
      threshold: 0.85

Filter out unwanted entities and OCR artifacts from your network:
entity_extraction:
  filtering:
    min_mentions: 2          # Minimum times entity must appear
    min_entity_length: 3     # Minimum character length
    max_entity_length: 100   # Maximum character length
    skip_single_char: true   # Filter single characters
    skip_all_caps: true      # Filter all-caps (OCR errors)
    blacklist:               # Custom blacklist
      - "Advertisement"
      - "Continued"

Default blacklist (automatically filtered):
- Untitled Snippet - System-generated placeholder
- Untitled - Generic placeholder
- Unknown - Generic unknown value
These filters help clean your network visualization by removing common OCR artifacts and system-generated text.
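Putting the normalization and filtering options above together, here is a simplified sketch of how aliases, fuzzy matching, and the mention/blacklist rules might be applied to raw entity strings (using difflib for the similarity score; the pipeline's actual matcher may differ):

```python
from collections import Counter
from difflib import SequenceMatcher

ALIASES = {"the Ripper": "Jack the Ripper", "Whitechapel Fiend": "Jack the Ripper"}
BLACKLIST = {"Untitled Snippet", "Untitled", "Unknown", "Advertisement", "Continued"}

def normalize(name: str, canonical: list[str], threshold: float = 0.85) -> str:
    """Map a raw entity string to a canonical form via aliases, then fuzzy matching."""
    name = ALIASES.get(name, name)
    for target in canonical:
        if SequenceMatcher(None, name.lower(), target.lower()).ratio() >= threshold:
            return target
    return name

def keep(name: str, counts: Counter, min_mentions: int = 2) -> bool:
    """Apply the filtering rules from the config."""
    return (name not in BLACKLIST
            and 3 <= len(name) <= 100
            and not name.isupper()
            and counts[name] >= min_mentions)

raw = ["Jack the Ripper", "the Ripper", "Jack the Riper", "Advertisement", "Jack the Ripper"]
normalized = [normalize(n, ["Jack the Ripper"]) for n in raw]
counts = Counter(normalized)
print([n for n in counts if keep(n, counts)])
```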
Identify clusters of related entities:
network_analysis:
  community_detection:
    enabled: true
    algorithms:
      - name: "louvain"
        enabled: true
        resolution: 1.0

Track how entity relationships evolve over time:
network_analysis:
  graphs:
    - name: "temporal_network"
      enabled: true
      type: "temporal"
      parameters:
        time_slices: "month"  # day, week, month, quarter, year

Entity networks are exported in formats for various tools:
- JSON: For dashboards and custom analysis
- GraphML: Import into Gephi for advanced visualization
- GEXF: Another Gephi-compatible format
- CSV: Edge/node lists for R, Python, Excel
- D3.js: Optimized for web visualization
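As a rough stand-in for what the analysis scripts write out, NetworkX can produce both the Louvain partition configured above and the Gephi-ready exports:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Toy co-occurrence graph; in the pipeline this comes from the entity extraction step.
G = nx.Graph()
G.add_weighted_edges_from([("Jack the Ripper", "Whitechapel", 5),
                           ("Whitechapel", "London", 3),
                           ("Queen Victoria", "London", 2)])

# Louvain community detection, with the resolution value from the config above.
communities = louvain_communities(G, resolution=1.0, seed=42)
for i, community in enumerate(communities):
    for node in community:
        G.nodes[node]["community"] = i

# Node metrics referenced in the config.
nx.set_node_attributes(G, nx.degree_centrality(G), "degree_centrality")
nx.set_node_attributes(G, nx.pagerank(G), "pagerank")

# Export for Gephi and other tools.
nx.write_graphml(G, "entity_network.graphml")
nx.write_gexf(G, "entity_network.gexf")
```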
Quick start using Google Colab:
# Clone repository
!git clone https://github.com/XLabCU/newspaper_utilities.git
# Install dependencies
!pip install -r /content/newspaper_utilities/scripts/requirements.txt
!apt-get install -y poppler-utils tesseract-ocr
# Download spaCy model
!python -m spacy download en_core_web_sm
# After runtime restart
%cd newspaper_utilities
# Run pipeline
!python scripts/run_pipeline.py --ocr-engine surya
The repository includes a 2-page sample from The Shawville Equity (1888) via BANQ.
Access the full archive: BANQ Shawville Equity Collection
Note: The original BANQ PDFs have poor OCR quality. This pipeline improves accuracy through preprocessing and modern OCR engines. See Ian Milligan's work on the importance of accurate newspaper OCR for historical research.
For newspapers with unusual layouts, adjust preprocessing parameters. See scripts/readme.md for detailed tuning instructions.
Contributions welcome! This pipeline was built for flexibility across different historical newspaper research projects.
MIT License - See LICENSE file for details
If you use this pipeline in your research, please cite:
XLabCU. (2026). Newspaper Utilities: Configurable Pipeline for Historical Newspaper Analysis.
https://github.com/XLabCU/newspaper_utilities
- Original Whitechapel in Shawville research project
- BANQ for historical newspaper archives
- spaCy, NetworkX, D3.js, and other open-source libraries