A Python package for extracting, enriching, and analyzing metadata from scholarly articles across various sources.
GTRex Scraper searches for academic articles, extracts metadata from a variety of sources, and enriches it with additional information from external APIs. The package has been completely refactored from its previous version to provide a more modular, extensible, and maintainable architecture.
- Search for academic articles using Google Scholar
- Extract metadata from various sources (HTML, JSON-LD, domain-specific extractors)
- Enrich metadata with external APIs (Crossref, OpenAlex, JSTOR, SSRN)
- Handle CAPTCHAs and JavaScript-heavy websites
- Priority-based metadata merging for optimal data quality
- Backward compatibility with the previous version
- Optimized fetching with domain-specific caching and fast Playwright mode
- Improved error handling and resilience
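The priority-based merging mentioned above can be pictured as follows. This is an illustrative sketch only: the source names, priority order, and field names are assumptions for the example, not the package's actual implementation in `core/priorities.py`.

```python
# Illustrative sketch of priority-based metadata merging.
# Sources earlier in the list win when they supply a non-empty value.
SOURCE_PRIORITY = ["crossref", "openalex", "jsonld", "metatag"]  # assumed order

def merge_by_priority(records: dict) -> dict:
    """Merge per-source metadata dicts, preferring higher-priority sources."""
    merged = {}
    # Walk from lowest to highest priority so higher-priority values overwrite.
    for source in reversed(SOURCE_PRIORITY):
        for key, value in records.get(source, {}).items():
            if value not in (None, "", []):
                merged[key] = value
    return merged

records = {
    "metatag": {"title": "A Title", "authors": []},          # empty list is skipped
    "crossref": {"title": "A Title (canonical)", "doi": "10.1234/abc"},
}
merged = merge_by_priority(records)
# crossref's title wins over metatag's; metatag's empty author list is ignored
```

The key property is that a lower-priority source can fill a gap but never overwrite a higher-priority value.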
The package has been refactored from a flat structure to a modular, component-based architecture:
- MetadataService: The central facade that coordinates the entire workflow
- FetcherPipeline: Manages the sequence of fetchers to retrieve content
- ExtractionPipeline: Manages the sequence of extractors to process content
- EnrichmentPipeline: Manages the sequence of enrichers to enhance metadata
- Fetchers: Retrieve content from URLs
  - HTTP Fetcher: Basic HTTP requests
  - Playwright Fetcher: For JavaScript-heavy sites and CAPTCHA handling
  - Fast Playwright Fetcher: Optimized version for bulk fetching from the same domain
  - Curl Fetcher: Uses curl_cffi for advanced HTTP requests
  - PDF Fetcher: For extracting content from PDFs
  - Fallback Fetcher: Tries multiple fetchers in sequence
  - Caching Fetcher Factory: Remembers which fetcher works best for each domain
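The fallback pattern these fetchers follow can be sketched in a few lines. The `Fetcher` signature below is an assumption for illustration, not the package's actual base class:

```python
# Illustrative sketch of the fallback-fetcher pattern: try each fetcher in
# order and return the first successful result. In the real package the
# entries would be the HTTP, Playwright, and curl fetchers.
from typing import Callable, List, Optional

Fetcher = Callable[[str], Optional[str]]  # returns page content or None

def fallback_fetch(url: str, fetchers: List[Fetcher]) -> Optional[str]:
    for fetch in fetchers:
        try:
            content = fetch(url)
        except Exception:
            continue  # a failing fetcher simply hands off to the next one
        if content:
            return content
    return None

# Toy fetchers standing in for the real HTTP / Playwright implementations
def flaky(url):
    raise TimeoutError("blocked")

def working(url):
    return "<html>ok</html>"

result = fallback_fetch("https://example.org", [flaky, working])
# the TimeoutError from `flaky` is swallowed; `working` supplies the content
```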
- Extractors: Extract metadata from content
  - Metatag Extractor: Extracts metadata from HTML meta tags
  - JSON-LD Extractor: Extracts metadata from JSON-LD scripts
  - Zotero Extractor: Uses the Zotero translation server
  - Domain-specific Extractors: Specialized for specific websites
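Meta-tag extraction is the simplest of these. A minimal stdlib-only sketch (the real metatag extractor may use a different parser and a richer tag mapping):

```python
# Illustrative meta-tag extraction using only the standard library.
# Scholarly pages commonly expose citation_* (Highwire) and dc.* (Dublin
# Core) meta tags, which is what this toy parser collects.
from html.parser import HTMLParser

class MetaTagCollector(HTMLParser):
    """Collect citation_* and Dublin Core meta tags into a dict."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        name, content = a.get("name"), a.get("content")
        if name and content and name.startswith(("citation_", "dc.")):
            self.meta.setdefault(name, content)  # first occurrence wins

html_page = """
<html><head>
  <meta name="citation_title" content="An Example Paper">
  <meta name="citation_doi" content="10.5555/example">
  <meta name="viewport" content="width=device-width">
</head></html>
"""
parser = MetaTagCollector()
parser.feed(html_page)
# parser.meta now holds only the citation_* tags; viewport is ignored
```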
- Enrichers: Enhance metadata with external APIs
  - Crossref Enricher: Academic citation metadata
  - OpenAlex Enricher: Open-access academic data
  - JSTOR Enricher: Journal article metadata
  - SSRN Enricher: Social Science Research Network data
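As an example of the enrichment step, Crossref exposes a public REST endpoint at `https://api.crossref.org/works/{doi}`. The sketch below follows that endpoint's documented response shape, but the field mapping and fill-the-gaps policy are assumptions about what GTRex keeps, not its actual code:

```python
# Illustrative Crossref enrichment: fetch a work record by DOI and fold
# selected fields into existing metadata without overwriting anything.
import json
import urllib.request

def fetch_crossref(doi: str) -> dict:
    """Fetch a Crossref work record; the payload lives under 'message'."""
    url = f"https://api.crossref.org/works/{doi}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["message"]

def enrich_from_crossref(metadata: dict, work: dict) -> dict:
    """Fill gaps in `metadata` from a Crossref work record (never overwrite)."""
    mapped = {
        "title": (work.get("title") or [None])[0],  # Crossref titles are lists
        "doi": work.get("DOI"),
        "publisher": work.get("publisher"),
    }
    # Existing metadata takes precedence over the mapped Crossref fields.
    return {**{k: v for k, v in mapped.items() if v}, **metadata}

# A canned record, so the example runs without a network call
sample_work = {"title": ["A Paper"], "DOI": "10.5555/x", "publisher": "ACME"}
enriched = enrich_from_crossref({"title": "Existing Title"}, sample_work)
# the existing title is kept; DOI and publisher are filled in
```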
- Models: Define data structures for metadata
- Adapters: Provide backward compatibility with the old API
- Utils: Utility functions and logging setup
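Putting the components together, the flow the service facade coordinates can be sketched as below. All names here are illustrative stand-ins; the real pipelines are async and pluggable:

```python
# Toy sketch of the fetch -> extract -> enrich flow behind the facade.
def run_pipeline(url, fetchers, extractors, enrichers):
    content = None
    for fetch in fetchers:            # FetcherPipeline: first success wins
        content = fetch(url)
        if content:
            break
    metadata = {}
    for extract in extractors:        # ExtractionPipeline: accumulate fields
        metadata.update(extract(content) or {})
    for enrich in enrichers:          # EnrichmentPipeline: fill remaining gaps
        metadata = enrich(metadata)
    return metadata

# Stub stages so the example is self-contained
meta = run_pipeline(
    "https://example.org/article",
    fetchers=[lambda u: "<html>stub</html>"],
    extractors=[lambda c: {"title": "Stub Title"}],
    enrichers=[lambda m: {**m, "doi": m.get("doi", "10.5555/stub")}],
)
```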
Clone the repository
```shell
git clone --recurse-submodules git@github.com:Giving-Tuesday/gtrex-scraper.git
cd gtrex-scraper
```

Install dependencies

```shell
pip install poetry
poetry install
```
Set up environment variables in .env file:
```
OXYLABS_USERNAME=
OXYLABS_PASSWORD=
OPENAI_API_KEY=
```
Apply the patch to fix compatibility with newer Node.js versions

```shell
./setup.sh
```

Start the Zotero translation server

```shell
cd translation-server
npm start
```

For macOS users, we provide a convenience script that automatically sets up a development environment:
```shell
./macos_dev_environment.sh
```

This script will:
- Open a terminal (iTerm2 if installed, otherwise the default Terminal app)
- Launch VSCode with the project
- Start Jupyter notebook in the first tab/window
- Start the translation server in the second tab/window
- Set up log monitoring in the third tab/window
Note: This script requires macOS. It works with either iTerm2 (using tabs) or the default Terminal app (using separate windows).
For non-macOS users or those who prefer manual setup:
- Open the project in your preferred editor
- Start Jupyter notebook:

  ```shell
  poetry run jupyter-notebook
  ```

- In a separate terminal, start the translation server:

  ```shell
  cd translation-server && npm start
  ```

- Optionally, in another terminal, monitor logs:

  ```shell
  tail -f scraper.log
  ```
The refactored GTRex Scraper offers several improvements over the previous version:
- Modular Architecture: Clear separation of concerns with specialized components
- Extensibility: Easy to add new fetchers, extractors, and enrichers
- Priority-Based Merging: Intelligent merging of metadata from different sources
- Domain-Specific Extractors: Better accuracy for specific websites
- Improved Error Handling: More robust error handling and logging
- Backward Compatibility: Legacy adapters for existing code
- Better Documentation: Comprehensive documentation and examples
- Async Implementation: Fully async implementation for better performance
- Proxy Rotation: Automatic proxy rotation to avoid rate limiting
- Smart Fetcher Caching: Learns which fetcher works best for each domain
- Optimized Bulk Fetching: Fast mode for fetching multiple URLs from the same domain
- Resilient Retries: Automatic retries with exponential backoff using tenacity
```
src/gtrex_scraper/
├── __init__.py                            # Package exports
├── core/                                  # Core components
│   ├── extraction_pipeline.py             # Manages extractors
│   ├── enrichment_pipeline.py             # Manages enrichers
│   ├── metadata_service.py                # Main service facade
│   └── priorities.py                      # Metadata source priorities
├── enrichers/                             # Metadata enrichers
│   ├── base_enricher.py                   # Base class for enrichers
│   ├── crossref_enricher.py               # Crossref API enricher
│   ├── openalex_enricher.py               # OpenAlex API enricher
│   └── domain_specific/                   # Domain-specific enrichers
├── extractors/                            # Metadata extractors
│   ├── base_extractor.py                  # Base class for extractors
│   ├── google_scholar_cache_extractor.py  # Google Scholar cache extractor
│   ├── pdf_extractor.py                   # PDF metadata extractor
│   ├── adobe_digital_extractor.py         # Adobe metadata extractor
│   ├── enhanced_abstract_extractor.py     # Enhanced abstract extractor
│   ├── datalayer_extractor.py             # Data layer extractor
│   ├── metatag_extractor.py               # HTML meta tag extractor
│   ├── jsonld_extractor.py                # JSON-LD extractor
│   ├── zotero_extractor.py                # Zotero translation server
│   └── domain_specific/                   # Domain-specific extractors
├── fetchers/                              # Content fetchers
│   ├── base_fetcher.py                    # Base class for fetchers
│   ├── curl_fetcher.py                    # cURL-based fetcher
│   ├── http_fetcher.py                    # Basic HTTP fetcher
│   ├── playwright_fetcher.py              # Browser automation
│   ├── fast_playwright_fetcher.py         # Optimized browser automation
│   ├── caching_fetcher_factory.py         # Smart fetcher selection
│   ├── pdf_fetcher.py                     # PDF content fetcher
│   └── fallback_fetcher.py                # Multi-strategy fetcher
├── models/                                # Data models
│   ├── __init__.py                        # Package exports
│   ├── metadata.py                        # Metadata models
│   ├── response.py                        # Response models
│   └── schemas.py                         # Schema definitions
├── scholar/                               # Scholar functionality
│   ├── __init__.py                        # Package exports
│   ├── scholar_client.py                  # Google Scholar HTTP client
│   ├── scholar_models.py                  # Scholar data models
│   ├── scholar_service.py                 # Scholar service facade
│   └── scholar_utils.py                   # Scholar utilities
└── utils/                                 # Utilities
    ├── __init__.py                        # Package exports
    ├── logging_setup.py                   # Logging configuration
    └── utils.py                           # Common utility functions
```