Newspaper Utilities

A configurable pipeline for OCR'ing, analyzing, and visualizing historical newspapers. Extract entities, build networks, analyze text, and generate interactive dashboards - all customizable through YAML configuration files.

Inspired by a conversation about Jack the Ripper

Features

  • OCR Processing: Multiple OCR engines (Tesseract, PaddleOCR, Surya, Google Gemini, ocrmac)
  • Article Segmentation: Automatic detection of article boundaries
  • Configurable Tagging: Define your own research topics and keywords
  • Timeline Analysis: Correlate local publications with reference events
  • Text Analysis: Comparative linguistics and sensationalism metrics
  • Entity Extraction: Extract people, places, organizations with normalization
  • Network Analysis: Co-occurrence networks, temporal analysis, community detection
  • Interactive Dashboard: Single-page HTML dashboard with D3.js visualizations

(screenshot: netviz)

Quick Start

Installation

git clone https://github.com/XLabCU/newspaper_utilities.git
cd newspaper_utilities
pip install -r scripts/requirements.txt

# For spaCy entity extraction
python -m spacy download en_core_web_sm

# Install system dependencies
apt-get install -y poppler-utils tesseract-ocr

Basic Usage

  1. Place your PDF files in the pdfs/ folder
  2. Run the complete pipeline:
python scripts/run_pipeline.py

This will:

  • Preprocess PDFs to high-quality image snippets
  • Run OCR (default: Tesseract)
  • Segment articles
  • Tag articles by theme
  • Generate timeline
  • Analyze text
  • Extract entities and build networks
  • Generate interactive dashboard

Right now, ocrmac gives good results; Gemini requires an API key; and I'm experimenting with Groq and the Llama 4 Maverick model (see the Groq vision docs at https://console.groq.com/docs/vision). If you want to try Groq, get an API key and set it in your terminal with export GROQ_API_KEY=<your-api-key-here> (note: do not include the < and > around the actual key). Then run python scripts/process_pdfs_groq.py. You might want to play with the prompt to get the best results: start with what is in the script, experiment in the Groq playground with this model until you find something that works well, then put it into the script.
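
For reference, a Groq vision call on a single image snippet looks roughly like the sketch below. This is not the exact logic of process_pdfs_groq.py; the model id, prompt, and file path are placeholders to check against the Groq docs and the script itself:

import base64
import os

from groq import Groq  # pip install groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

# Hypothetical snippet path; any preprocessed image works here
with open("data/preprocessed/page_001.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="meta-llama/llama-4-maverick-17b-128e-instruct",  # verify against Groq's current model list
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this newspaper column exactly as printed."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)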

All of the OCR methods take time; Tesseract is the fastest.

  3. Open dashboard/index.html in your browser, or serve it locally:
python serve.py

(screenshot: snippetviz)

Using Custom Configurations

Create a custom project configuration:

# Run with Whitechapel Ripper project config
python scripts/run_pipeline.py --config config/projects/whitechapel_ripper.yaml

# Run with a different OCR engine
python scripts/run_pipeline.py --ocr-engine surya --config config/projects/my_project.yaml

Running Analysis Only (Skip OCR)

If you've already OCR'd your documents and want to re-run just the analysis steps (tagging, timeline, text analysis, entity extraction, dashboard generation), use the data analysis pipeline:

# Auto-detects the most recent OCR output file in data/raw/
python scripts/run_data_analysis.py --config config/projects/your_project.yaml

# Or specify a particular OCR file
python scripts/run_data_analysis.py --config config/projects/your_project.yaml --ocr-file ocr_output_vision.jsonl
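
The auto-detection is essentially a "newest matching file wins" lookup, roughly like this sketch (the script's actual logic may differ):

import glob
import os

candidates = glob.glob("data/raw/ocr_output*.jsonl") + glob.glob("data/raw/ocr_output*.json")
latest = max(candidates, key=os.path.getmtime)  # newest by modification time
print(f"Using OCR file: {latest}")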

The data analysis pipeline will then:

  • Segment articles from OCR output (auto-detects most recent ocr_output*.jsonl or *.json)
  • Tag articles by theme
  • Generate timeline correlations
  • Analyze text patterns
  • Extract entities and build networks
  • Generate interactive dashboard

Use cases:

  • Experimenting with different configuration parameters
  • Re-generating the dashboard after config changes
  • Running analysis on previously OCR'd documents
  • Faster iteration during research development

Configuration

The system uses YAML configuration files to customize analysis for different research projects.

Project Configuration Structure

project:
  name: "My Research Project"
  description: "Analysis of historical newspapers"
  date_range: ["1880-01-01", "1890-12-31"]

tags:
  - id: "politics"
    label: "Political News"
    keywords: ["election", "parliament", "government"]
    weight: 10
    color: "#3498db"

timeline:
  reference_events:
    - id: "event_01"
      date: "1885-03-15"
      title: "Major Historical Event"
      location: "City, Country"
      type: "event"

  correlation_tags: ["politics"]

text_analysis:
  comparison_groups:
    - id: "political_coverage"
      label: "Political Coverage"
      filter:
        tags: ["politics"]
    - id: "other_news"
      label: "Other News"
      filter:
        exclude_tags: ["politics"]

entity_extraction:
  entity_types:
    - name: "PERSON"
      enabled: true
      color: "#FF6B6B"
    - name: "GPE"
      enabled: true
      color: "#45B7D1"

  normalization:
    enabled: true
    aliases:
      "Queen Victoria": ["Victoria", "Her Majesty"]
    fuzzy_matching:
      enabled: true
      threshold: 0.85

network_analysis:
  graphs:
    - name: "entity_cooccurrence"
      enabled: true
      type: "cooccurrence"
    - name: "temporal_network"
      enabled: true
      type: "temporal"
      parameters:
        time_slices: "month"

  metrics:
    node_metrics:
      - "degree_centrality"
      - "betweenness_centrality"
      - "pagerank"

  community_detection:
    enabled: true
    algorithms:
      - name: "louvain"
        enabled: true

See config/projects/whitechapel_ripper.yaml for a complete example.
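
Pipeline scripts read these files with a standard YAML loader; a minimal sketch of loading a project config (field access assumes the structure shown above):

import yaml  # pip install pyyaml

with open("config/projects/my_project.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

print(config["project"]["name"])
for tag in config["tags"]:
    print(tag["id"], tag["keywords"])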

Pipeline Steps

Step 1: Preprocessing (preprocess.py)

Converts PDFs to 300 DPI image snippets, detecting columns and article boundaries.

python scripts/preprocess.py

Output: data/preprocessed/ directory with image snippets
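
The core conversion can be illustrated with pdf2image (which relies on the poppler-utils installed earlier); preprocess.py additionally detects columns and article boundaries, which this sketch omits:

from pathlib import Path

from pdf2image import convert_from_path  # pip install pdf2image

pdf_path = "pdfs/sample_issue.pdf"  # hypothetical filename
out_dir = Path("data/preprocessed")
out_dir.mkdir(parents=True, exist_ok=True)

# Render each page at 300 DPI and save as PNG
for i, page in enumerate(convert_from_path(pdf_path, dpi=300), start=1):
    page.save(out_dir / f"page_{i:03d}.png", "PNG")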

Step 2: OCR Processing

Choose your OCR engine:

# Tesseract (default, most stable)
python scripts/process_pdfs_tesseract.py

# PaddleOCR (better for degraded text)
python scripts/process_pdfs.py

# Surya (highest accuracy, GPU recommended)
python scripts/process_images_surya_batch.py

# Google Gemini API (cloud-based)
export GEMINI_API_KEY=your_key
python scripts/process_pdfs_gemini.py

Output: data/raw/ocr_output_*.jsonl
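
As a rough illustration of what the Tesseract path does per snippet (the real scripts' record fields may differ), it OCRs an image and appends a JSON line to the output file:

import json

import pytesseract  # pip install pytesseract
from PIL import Image

snippet = "data/preprocessed/page_001.png"  # hypothetical filename
text = pytesseract.image_to_string(Image.open(snippet), lang="eng")

with open("data/raw/ocr_output_tesseract.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps({"image": snippet, "text": text}) + "\n")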

Step 3: Article Segmentation (segment_articles.py)

Groups OCR text into coherent articles.

python scripts/segment_articles.py

Output: data/processed/articles.json

Step 4: Article Tagging (tag_articles.py)

Classifies articles by configurable themes.

python scripts/tag_articles.py --config config/projects/my_project.yaml

Output: data/processed/tagged_articles.json
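
Conceptually, tagging is keyword matching scaled by each tag's weight from the config; a simplified sketch (the actual scoring in tag_articles.py may differ):

def tag_article(text, tags):
    """Return {tag_id: score} for every configured tag with at least one keyword hit."""
    text_lower = text.lower()
    scores = {}
    for tag in tags:
        hits = sum(text_lower.count(kw.lower()) for kw in tag["keywords"])
        if hits:
            scores[tag["id"]] = hits * tag.get("weight", 1)
    return scores

tags = [{"id": "politics", "keywords": ["election", "parliament"], "weight": 10}]
print(tag_article("The election dominated parliament this week.", tags))  # {'politics': 20}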

Step 5: Timeline Generation (generate_timeline.py)

Correlates reference events with local publications.

python scripts/generate_timeline.py --config config/projects/my_project.yaml

Output: data/processed/timeline.json
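
The correlation idea, stripped to its essentials, is counting tagged articles that fall within a window after each reference event; a sketch (the window size and matching rules here are assumptions):

from datetime import date, timedelta

def articles_near_event(event_date, articles, window_days=30):
    """Return articles published between the event date and window_days after it."""
    end = event_date + timedelta(days=window_days)
    return [a for a in articles if event_date <= a["date"] <= end]

event = date(1888, 8, 31)  # hypothetical reference event date
articles = [{"date": date(1888, 9, 14), "title": "Whitechapel horror"}]
print(len(articles_near_event(event, articles)))  # 1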

Step 6: Text Analysis (analyze_text.py)

Comparative linguistic analysis across article groups.

python scripts/analyze_text.py --config config/projects/my_project.yaml

Output: data/processed/text_analysis.json

Step 7: Entity Extraction & Network Analysis (extract_entities_enhanced.py)

Extract entities, build networks, calculate metrics, detect communities.

python scripts/extract_entities_enhanced.py --config config/projects/my_project.yaml

Outputs:

  • data/processed/entity_network.json - Complete dataset
  • data/processed/entities.json - Legacy format
  • data/processed/*.graphml - Network files for Gephi
  • data/processed/*_d3.json - D3.js visualization format
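
At its core this step runs spaCy NER over each article and adds a co-occurrence edge for entities mentioned in the same article; a bare-bones sketch (the real script adds normalization, filtering, metrics, and community detection):

from itertools import combinations

import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")
G = nx.Graph()

articles = ["Queen Victoria addressed Parliament in London."]  # sample text
for text in articles:
    ents = {e.text for e in nlp(text).ents if e.label_ in {"PERSON", "GPE", "ORG"}}
    for a, b in combinations(sorted(ents), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

print(G.edges(data=True))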

Step 8: Topic Modeling

See the topic_model.py script. In the config, set:

topic_modeling:
  n_topics: 8
  n_top_words: 15
  iterations: 1000

The topic model script also reads these elements from the config file:

text_analysis: comparison_groups. This is the most critical section for the script:

  • id & label: Used to identify and name the resulting models (e.g., "Ripper Coverage").
  • filter: tags: The script uses this list to pull only articles that have these specific tags (e.g., whitechapel_ripper).
  • filter: exclude_tags: The script uses this to ensure certain articles are left out of a specific model (e.g., modeling "General News" by excluding anything tagged as "Advertisement").

text_analysis: custom_stopwords

  • custom_stopwords: The script pulls this list and merges it with the words from mallet.txt.
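
For orientation, here is roughly what those settings map to in a scikit-learn LDA run (a sketch only; topic_model.py may use a different library, and the merge of custom_stopwords with mallet.txt is simulated with a tiny list):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "parliament passed the election bill",
    "police searched whitechapel streets at night",
    "the market price of wheat rose sharply",
]
stopwords = ["the"]  # stands in for custom_stopwords merged with mallet.txt

vec = CountVectorizer(stop_words=stopwords)
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=8, max_iter=1000, random_state=0)  # n_topics, iterations
lda.fit(X)

terms = vec.get_feature_names_out()
for topic in lda.components_:
    print([terms[i] for i in topic.argsort()[::-1][:15]])  # n_top_words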

Step 9: Dashboard Generation (generate_dashboard.py)

Create interactive HTML dashboard.

python scripts/generate_dashboard.py --config config/projects/my_project.yaml

Output: dashboard/index.html

Example Projects

1. Generic Newspaper Analysis

Use default configuration for basic analysis:

python scripts/run_pipeline.py

2. Whitechapel in Shawville (Included)

Analyzes Jack the Ripper coverage in 1880s Canadian newspapers:

python scripts/run_pipeline.py --config config/projects/whitechapel_ripper.yaml

3. Create Your Own Project

  1. Copy a config template:
cp config/projects/whitechapel_ripper.yaml config/projects/my_project.yaml
  2. Edit my_project.yaml with your research topics, keywords, and events

  3. Run the pipeline:

python scripts/run_pipeline.py --config config/projects/my_project.yaml

Dashboard Features

The generated dashboard (dashboard/index.html) includes:

  • Statistics: Quick stats (total articles, entities, time span)
  • Timeline: Interactive timeline correlating events and publications
  • Entity Network: D3.js force-directed graph showing entity relationships
    • Drag nodes to rearrange
    • Hover for details
    • Zoom with scroll
  • Text Analysis: Comparative statistics across article groups
  • Article Browser: Searchable, filterable table of all articles

Advanced Features

Entity Normalization

Configure aliases to merge variant entity names:

entity_extraction:
  normalization:
    enabled: true
    aliases:
      "Jack the Ripper": ["the Ripper", "Whitechapel Fiend"]
    fuzzy_matching:
      enabled: true
      threshold: 0.85
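
A minimal sketch of that normalization pass, using the standard-library SequenceMatcher for the fuzzy step (the pipeline's actual matcher may differ):

from difflib import SequenceMatcher

ALIASES = {"the Ripper": "Jack the Ripper", "Whitechapel Fiend": "Jack the Ripper"}
CANONICAL = ["Jack the Ripper", "Queen Victoria"]

def normalize(name, threshold=0.85):
    """Map a raw entity string to its canonical form via aliases, then fuzzy matching."""
    if name in ALIASES:
        return ALIASES[name]
    for canon in CANONICAL:
        if SequenceMatcher(None, name.lower(), canon.lower()).ratio() >= threshold:
            return canon
    return name

print(normalize("Jack the Ripperr"))  # "Jack the Ripper" via the fuzzy match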

Entity Filtering

Filter out unwanted entities and OCR artifacts from your network:

entity_extraction:
  filtering:
    min_mentions: 2                # Minimum times entity must appear
    min_entity_length: 3           # Minimum character length
    max_entity_length: 100         # Maximum character length
    skip_single_char: true         # Filter single characters
    skip_all_caps: true            # Filter all-caps (OCR errors)
    blacklist:                     # Custom blacklist
      - "Advertisement"
      - "Continued"

Default blacklist (automatically filtered):

  • Untitled Snippet - System-generated placeholder
  • Untitled - Generic placeholder
  • Unknown - Generic unknown value

These filters help clean your network visualization by removing common OCR artifacts and system-generated text.
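
A simplified filter pass matching those options might look like this (illustrative, not the actual implementation):

BLACKLIST = {"Untitled Snippet", "Untitled", "Unknown", "Advertisement", "Continued"}

def keep(entity, mentions, min_mentions=2, min_len=3, max_len=100):
    """Apply the blacklist, mention-count, length, and all-caps filters."""
    if entity in BLACKLIST:
        return False
    if mentions < min_mentions or not (min_len <= len(entity) <= max_len):
        return False
    if len(entity) == 1 or entity.isupper():  # skip_single_char / skip_all_caps
        return False
    return True

counts = {"Queen Victoria": 5, "THE": 12, "Advertisement": 9}
print([e for e, n in counts.items() if keep(e, n)])  # ['Queen Victoria']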

Community Detection

Identify clusters of related entities:

network_analysis:
  community_detection:
    enabled: true
    algorithms:
      - name: "louvain"
        enabled: true
        resolution: 1.0
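
With a recent networkx (2.8 or newer) this can be reproduced directly; a sketch (the pipeline may instead use the python-louvain package):

import networkx as nx

G = nx.karate_club_graph()  # stand-in for the entity co-occurrence graph
communities = nx.community.louvain_communities(G, resolution=1.0, seed=42)
for i, members in enumerate(communities):
    print(i, sorted(members))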

Temporal Networks

Track how entity relationships evolve over time:

network_analysis:
  graphs:
    - name: "temporal_network"
      enabled: true
      type: "temporal"
      parameters:
        time_slices: "month"  # day, week, month, quarter, year

Multiple Export Formats

Entity networks are exported in formats for various tools:

  • JSON: For dashboards and custom analysis
  • GraphML: Import into Gephi for advanced visualization
  • GEXF: Another Gephi-compatible format
  • CSV: Edge/node lists for R, Python, Excel
  • D3.js: Optimized for web visualization
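
All of these can be produced from the same networkx graph with standard writers; for example (filenames are placeholders):

import json

import networkx as nx
from networkx.readwrite import json_graph

G = nx.Graph([("Jack the Ripper", "Whitechapel")])
nx.write_graphml(G, "data/processed/example.graphml")  # Gephi
nx.write_gexf(G, "data/processed/example.gexf")        # Gephi (GEXF)
with open("data/processed/example_d3.json", "w") as f:
    json.dump(json_graph.node_link_data(G), f)          # D3.js node-link JSON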

Google Colab

Quick start using Google Colab:

# Clone repository
!git clone https://github.com/XLabCU/newspaper_utilities.git

# Install dependencies
!pip install -r /content/newspaper_utilities/scripts/requirements.txt
!apt-get install -y poppler-utils tesseract-ocr

# Download spaCy model
!python -m spacy download en_core_web_sm

# After runtime restart
%cd newspaper_utilities

# Run pipeline
!python scripts/run_pipeline.py --ocr-engine surya

Sample Data

The repository includes a 2-page sample from The Shawville Equity (1888) via BANQ.

Access the full archive: BANQ Shawville Equity Collection

Note: The original BANQ PDFs have poor OCR quality. This pipeline improves accuracy through preprocessing and modern OCR engines. See Ian Milligan's work on the importance of accurate newspaper OCR for historical research.

Tuning Preprocessing

For newspapers with unusual layouts, adjust preprocessing parameters. See scripts/readme.md for detailed tuning instructions.

Contributing

Contributions welcome! This pipeline was built for flexibility across different historical newspaper research projects.

License

MIT License - See LICENSE file for details

Citation

If you use this pipeline in your research, please cite:

XLabCU. (2026). Newspaper Utilities: Configurable Pipeline for Historical Newspaper Analysis.
https://github.com/XLabCU/newspaper_utilities

Acknowledgments

  • Original Whitechapel in Shawville research project
  • BANQ for historical newspaper archives
  • spaCy, NetworkX, D3.js, and other open-source libraries
