GTRex Scraper

License: GPL v3

A Python package for extracting, enriching, and analyzing metadata from scholarly articles across various sources.

Overview

GTRex Scraper searches for academic articles, extracts metadata from the pages it finds, and enriches that metadata with additional information from external APIs. This version is a complete refactor of the previous one, with a more modular, extensible, and maintainable architecture.

Key Features

  • Search for academic articles using Google Scholar
  • Extract metadata from various sources (HTML, JSON-LD, domain-specific extractors)
  • Enrich metadata with external APIs (Crossref, OpenAlex, JSTOR, SSRN)
  • Handle CAPTCHAs and JavaScript-heavy websites
  • Priority-based metadata merging for optimal data quality
  • Backward compatibility with the previous version
  • Optimized fetching with domain-specific caching and fast Playwright mode
  • Improved error handling and resilience
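
To make priority-based merging concrete, here is a minimal sketch of the idea. The source names, the priority values, and the `merge_by_priority` function are assumptions for illustration; the package's actual priorities live in core/priorities.py and may work differently.

```python
from typing import Any

# Hypothetical priorities: a higher number wins. These values are
# illustrative and are not taken from core/priorities.py.
SOURCE_PRIORITY = {"crossref": 3, "jsonld": 2, "metatag": 1}


def merge_by_priority(records: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Merge per-source metadata, letting higher-priority sources win per field."""
    merged: dict[str, Any] = {}
    # Visit sources from lowest to highest priority so later writes override.
    for source in sorted(records, key=lambda s: SOURCE_PRIORITY.get(s, 0)):
        for field, value in records[source].items():
            # Skip empty values so a high-priority source with a missing
            # field does not erase data from a lower-priority one.
            if value not in (None, "", [], {}):
                merged[field] = value
    return merged


records = {
    "metatag": {"title": "A Study", "abstract": "Short abstract."},
    "crossref": {"title": "A Study of Giving", "doi": "10.1234/example", "abstract": None},
}
merged = merge_by_priority(records)
```

In this example the title comes from the higher-priority source, while the abstract survives from the lower-priority one because the higher-priority source left it empty.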

Architecture

The package has been refactored from a flat structure to a modular, component-based architecture:

Core Components

  • MetadataService: The central facade that coordinates the entire workflow
  • FetcherPipeline: Manages the sequence of fetchers to retrieve content
  • ExtractionPipeline: Manages the sequence of extractors to process content
  • EnrichmentPipeline: Manages the sequence of enrichers to enhance metadata

Specialized Components

  • Fetchers: Retrieve content from URLs

    • HTTP Fetcher: Basic HTTP requests
    • Playwright Fetcher: For JavaScript-heavy sites and CAPTCHA handling
    • Fast Playwright Fetcher: Optimized version for bulk fetching from the same domain
    • Curl Fetcher: Uses curl_cffi for advanced HTTP requests
    • PDF Fetcher: For extracting content from PDFs
    • Fallback Fetcher: Tries multiple fetchers in sequence
    • Caching Fetcher Factory: Remembers which fetcher works best for each domain
  • Extractors: Extract metadata from content

    • Metatag Extractor: Extracts metadata from HTML meta tags
    • JSON-LD Extractor: Extracts metadata from JSON-LD scripts
    • Zotero Extractor: Uses Zotero translation server
    • Domain-specific Extractors: Specialized for specific websites
  • Enrichers: Enhance metadata with external APIs

    • Crossref Enricher: Academic citation metadata
    • OpenAlex Enricher: Open access academic data
    • JSTOR Enricher: Journal articles metadata
    • SSRN Enricher: Social Science Research Network data
  • Models: Define data structures for metadata

  • Adapters: Provide backward compatibility with the old API

  • Utils: Utility functions and logging setup
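
As an illustration of how the Fallback Fetcher and the Caching Fetcher Factory might work together, the sketch below tries fetchers in order and remembers the first one that succeeds for each domain. The class name, the fetcher signature, and the stub fetchers are assumptions; the real implementations may differ.

```python
import asyncio
from typing import Awaitable, Callable, Optional
from urllib.parse import urlparse

# A fetcher here is any async callable that returns page content or raises.
Fetcher = Callable[[str], Awaitable[str]]


class CachingFallbackSketch:
    """Try fetchers in order; remember the first one that works per domain."""

    def __init__(self, fetchers: dict[str, Fetcher]) -> None:
        self._fetchers = fetchers        # insertion order is the fallback order
        self._best: dict[str, str] = {}  # domain -> name of the fetcher that worked

    async def fetch(self, url: str) -> str:
        domain = urlparse(url).netloc
        names = list(self._fetchers)
        # Move the remembered fetcher to the front, if there is one.
        cached = self._best.get(domain)
        if cached is not None:
            names.remove(cached)
            names.insert(0, cached)
        last_error: Optional[Exception] = None
        for name in names:
            try:
                content = await self._fetchers[name](url)
            except Exception as exc:  # a real implementation would be narrower
                last_error = exc
                continue
            self._best[domain] = name
            return content
        raise RuntimeError(f"all fetchers failed for {url}") from last_error


calls: list[str] = []


async def http_fetch(url: str) -> str:
    calls.append("http")
    raise ConnectionError("blocked")


async def browser_fetch(url: str) -> str:
    calls.append("browser")
    return "<html>ok</html>"


async def demo() -> str:
    fetcher = CachingFallbackSketch({"http": http_fetch, "browser": browser_fetch})
    await fetcher.fetch("https://example.org/a")         # tries http, then browser
    return await fetcher.fetch("https://example.org/b")  # goes straight to browser


page = asyncio.run(demo())
```

The second request skips the failing fetcher entirely, which is the saving that per-domain caching is meant to provide.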

Installation

Clone the repository

git clone --recurse-submodules git@github.com:Giving-Tuesday/gtrex-scraper.git
cd gtrex-scraper

Install dependencies

pip install poetry
poetry install

Set up environment variables in a .env file

OXYLABS_USERNAME=
OXYLABS_PASSWORD=
OPENAI_API_KEY=
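
A quick way to confirm the variables are set before running anything is sketched below. It assumes all three variables are required and that the .env values have already been loaded into the process environment; how the package itself reads .env is not described here, and `missing_env_vars` is a hypothetical helper.

```python
import os

# Variable names are taken from the .env template above. Treating all
# three as required is an assumption.
REQUIRED_VARS = ("OXYLABS_USERNAME", "OXYLABS_PASSWORD", "OPENAI_API_KEY")


def missing_env_vars(env=os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
missing = missing_env_vars({"OXYLABS_USERNAME": "user"})
```

Calling `missing_env_vars()` with no argument checks the current process environment instead.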

Apply the patch to fix compatibility with newer Node.js versions

./setup.sh

Start the Zotero translation server

cd translation-server
npm start

Development Environment

macOS Quick Start

For macOS users, we provide a convenience script that automatically sets up a development environment:

./macos_dev_environment.sh

This script will:

  • Open a terminal (iTerm2 if installed, otherwise the default Terminal app)
  • Launch VSCode with the project
  • Start Jupyter notebook in the first tab/window
  • Start the translation server in the second tab/window
  • Set up log monitoring in the third tab/window

Note: This script requires macOS. It works with either iTerm2 (using tabs) or the default Terminal app (using separate windows).

Manual Setup (All Platforms)

For non-macOS users or those who prefer manual setup:

  1. Open the project in your preferred editor
  2. Start Jupyter notebook: poetry run jupyter-notebook
  3. In a separate terminal, start the translation server: cd translation-server && npm start
  4. Optionally, in another terminal, monitor logs: tail -f scraper.log

Improvements Over Previous Version

The refactored GTRex Scraper offers several improvements over the previous version:

  1. Modular Architecture: Clear separation of concerns with specialized components
  2. Extensibility: Easy to add new fetchers, extractors, and enrichers
  3. Priority-Based Merging: Intelligent merging of metadata from different sources
  4. Domain-Specific Extractors: Better accuracy for specific websites
  5. Improved Error Handling: More robust error handling and logging
  6. Backward Compatibility: Legacy adapters for existing code
  7. Better Documentation: Comprehensive documentation and examples
  8. Async Implementation: Asynchronous throughout for better performance
  9. Proxy Rotation: Automatic proxy rotation to avoid rate limiting
  10. Smart Fetcher Caching: Learns which fetcher works best for each domain
  11. Optimized Bulk Fetching: Fast mode for fetching multiple URLs from the same domain
  12. Resilient Retries: Automatic retries with exponential backoff using tenacity

Project Structure

src/gtrex_scraper/
├── __init__.py                      # Package exports
├── core/                            # Core components
│   ├── extraction_pipeline.py       # Manages extractors
│   ├── enrichment_pipeline.py       # Manages enrichers
│   ├── metadata_service.py          # Main service facade
│   └── priorities.py                # Metadata source priorities
├── enrichers/                       # Metadata enrichers
│   ├── base_enricher.py             # Base class for enrichers
│   ├── crossref_enricher.py         # Crossref API enricher
│   ├── openalex_enricher.py         # OpenAlex API enricher
│   └── domain_specific/             # Domain-specific enrichers
├── extractors/                      # Metadata extractors
│   ├── base_extractor.py            # Base class for extractors
│   ├── google_scholar_cache_extractor.py  # Google Scholar cache extractor
│   ├── pdf_extractor.py             # PDF metadata extractor
│   ├── adobe_digital_extractor.py   # Adobe metadata extractor
│   ├── enhanced_abstract_extractor.py  # Enhanced abstract extractor
│   ├── datalayer_extractor.py       # Data layer extractor
│   ├── metatag_extractor.py         # HTML meta tag extractor
│   ├── jsonld_extractor.py          # JSON-LD extractor
│   ├── zotero_extractor.py          # Zotero translation server
│   └── domain_specific/             # Domain-specific extractors
├── fetchers/                        # Content fetchers
│   ├── base_fetcher.py              # Base class for fetchers
│   ├── curl_fetcher.py              # cURL-based fetcher
│   ├── http_fetcher.py              # Basic HTTP fetcher
│   ├── playwright_fetcher.py        # Browser automation
│   ├── fast_playwright_fetcher.py   # Optimized browser automation
│   ├── caching_fetcher_factory.py   # Smart fetcher selection
│   ├── pdf_fetcher.py               # PDF content fetcher
│   └── fallback_fetcher.py          # Multi-strategy fetcher
├── models/                          # Data models
│   ├── __init__.py                  # Package exports
│   ├── metadata.py                  # Metadata models
│   ├── response.py                  # Response models
│   └── schemas.py                   # Schema definitions
├── scholar/                         # Scholar functionality
│   ├── __init__.py                  # Package exports
│   ├── scholar_client.py            # Google Scholar HTTP client
│   ├── scholar_models.py            # Scholar data models
│   ├── scholar_service.py           # Scholar service facade
│   └── scholar_utils.py             # Scholar utilities
└── utils/                           # Utilities
    ├── __init__.py                  # Package exports
    ├── logging_setup.py             # Logging configuration
    └── utils.py                     # Common utility functions
