GTRex Scraper

A Python package for extracting, enriching, and analyzing metadata from scholarly articles across various sources.

Overview

GTRex Scraper searches for academic articles, extracts metadata from several kinds of sources, and enriches that metadata with additional information from external APIs. The package has been completely refactored from its previous version to provide a more modular, extensible, and maintainable architecture.

Key Features

Search for academic articles using Google Scholar
Extract metadata from various sources (HTML, JSON-LD, domain-specific extractors)
Enrich metadata with external APIs (Crossref, OpenAlex, JSTOR, SSRN)
Handle CAPTCHAs and JavaScript-heavy websites
Priority-based metadata merging for optimal data quality
Backward compatibility with the previous version
Optimized fetching with domain-specific caching and fast Playwright mode
Improved error handling and resilience
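
As an illustration of the priority-based merging listed above, here is a minimal sketch. The source names, the priority values, and the `merge_by_priority` function are hypothetical and are not the package's actual API; the real priorities are defined in `core/priorities.py`.

```python
from typing import Any

# Hypothetical priorities: a higher number wins when two sources disagree.
SOURCE_PRIORITY = {"crossref": 3, "jsonld": 2, "metatag": 1}


def merge_by_priority(records: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Merge per-source metadata, keeping each field from the highest-priority source."""
    merged: dict[str, Any] = {}
    # Visit sources from lowest to highest priority so later writes override earlier ones.
    for source in sorted(records, key=lambda s: SOURCE_PRIORITY.get(s, 0)):
        for field, value in records[source].items():
            if value not in (None, "", []):  # skip empty values so they never erase real data
                merged[field] = value
    return merged


records = {
    "metatag": {"title": "A Study", "abstract": "Short abstract."},
    "crossref": {"title": "A Study of Giving", "doi": "10.1234/example", "abstract": None},
}
result = merge_by_priority(records)
print(result)
```

In this sketch the Crossref title replaces the meta tag title, while the meta tag abstract survives because the Crossref record has no abstract to offer.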

Architecture

The package has been refactored from a flat structure to a modular, component-based architecture:

Core Components

MetadataService: The central facade that coordinates the entire workflow
FetcherPipeline: Manages the sequence of fetchers to retrieve content
ExtractionPipeline: Manages the sequence of extractors to process content
EnrichmentPipeline: Manages the sequence of enrichers to enhance metadata
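
The way the facade chains the three pipelines can be sketched roughly as follows. This is a toy model under stated assumptions: `SketchMetadataService`, its `get_metadata` method, and the plain-callable stages are hypothetical stand-ins, not the real `MetadataService` interface.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical stage signatures; the real pipeline classes are richer than plain callables.
Fetch = Callable[[str], Awaitable[str]]
Extract = Callable[[str], Awaitable[dict]]
Enrich = Callable[[dict], Awaitable[dict]]


class SketchMetadataService:
    """Toy facade: fetch content, extract metadata, then enrich it."""

    def __init__(self, fetch: Fetch, extractors: list[Extract], enrichers: list[Enrich]):
        self.fetch = fetch
        self.extractors = extractors
        self.enrichers = enrichers

    async def get_metadata(self, url: str) -> dict:
        content = await self.fetch(url)
        metadata: dict = {"url": url}
        for extract in self.extractors:  # each extractor contributes fields
            metadata.update(await extract(content))
        for enrich in self.enrichers:  # each enricher refines the running result
            metadata = await enrich(metadata)
        return metadata


async def fake_fetch(url: str) -> str:
    # Stand-in for a real fetcher: returns canned HTML instead of hitting the network.
    return "<title>Example Article</title>"


async def title_extractor(html: str) -> dict:
    start, end = html.find("<title>") + len("<title>"), html.find("</title>")
    return {"title": html[start:end]}


async def flag_enricher(metadata: dict) -> dict:
    return {**metadata, "enriched": True}


service = SketchMetadataService(fake_fetch, [title_extractor], [flag_enricher])
result = asyncio.run(service.get_metadata("https://example.org/article"))
print(result)
```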

Specialized Components

Fetchers: Retrieve content from URLs
- HTTP Fetcher: Basic HTTP requests
- Playwright Fetcher: For JavaScript-heavy sites and CAPTCHA handling
- Fast Playwright Fetcher: Optimized version for bulk fetching from the same domain
- Curl Fetcher: Uses curl_cffi for advanced HTTP requests
- PDF Fetcher: For extracting content from PDFs
- Fallback Fetcher: Tries multiple fetchers in sequence
- Caching Fetcher Factory: Remembers which fetcher works best for each domain
Extractors: Extract metadata from content
- Metatag Extractor: Extracts metadata from HTML meta tags
- JSON-LD Extractor: Extracts metadata from JSON-LD scripts
- Zotero Extractor: Uses the Zotero translation server
- Domain-specific Extractors: Specialized for specific websites
Enrichers: Enhance metadata with external APIs
- Crossref Enricher: Academic citation metadata
- OpenAlex Enricher: Open access academic data
- JSTOR Enricher: Journal articles metadata
- SSRN Enricher: Social Science Research Network data
Models: Define data structures for metadata
Adapters: Provide backward compatibility with the old API
Utils: Utility functions and logging setup
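
The fallback and per-domain caching ideas described for the fetchers can be combined in one small sketch. The class name `SketchFallbackFetcher`, the fetcher names, and the choice to treat any exception as a signal to move on are all assumptions made for illustration; the package's own Fallback Fetcher and Caching Fetcher Factory may work differently.

```python
import asyncio
from typing import Awaitable, Callable, Optional
from urllib.parse import urlparse

Fetcher = Callable[[str], Awaitable[Optional[str]]]


class SketchFallbackFetcher:
    """Tries fetchers in order and remembers which one worked for each domain."""

    def __init__(self, fetchers: dict[str, Fetcher]):
        self.fetchers = fetchers  # insertion order is the default fallback order
        self.best_for_domain: dict[str, str] = {}

    async def fetch(self, url: str) -> Optional[str]:
        domain = urlparse(url).netloc
        names = list(self.fetchers)
        cached = self.best_for_domain.get(domain)
        if cached in names:
            # Try the fetcher that last succeeded for this domain first.
            names.remove(cached)
            names.insert(0, cached)
        for name in names:
            try:
                content = await self.fetchers[name](url)
            except Exception:
                continue  # treat any failure as "try the next fetcher"
            if content:
                self.best_for_domain[domain] = name
                return content
        return None


calls: list[str] = []


async def failing_http(url: str) -> Optional[str]:
    calls.append("http")
    raise ConnectionError("blocked")


async def working_browser(url: str) -> Optional[str]:
    calls.append("browser")
    return "<html>ok</html>"


async def demo() -> None:
    fetcher = SketchFallbackFetcher({"http": failing_http, "browser": working_browser})
    await fetcher.fetch("https://example.org/a")  # http fails, then browser succeeds
    await fetcher.fetch("https://example.org/b")  # browser is tried first this time


asyncio.run(demo())
print(calls)
```

The second request skips the failing HTTP fetcher entirely, which is the saving that remembering a per-domain choice is meant to provide.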

Installation

Clone the repository

git clone --recurse-submodules git@github.com:Giving-Tuesday/gtrex-scraper.git
cd gtrex-scraper

Install dependencies

pip install poetry
poetry install

Set up environment variables in a .env file:

OXYLABS_USERNAME=
OXYLABS_PASSWORD=
OPENAI_API_KEY=

Apply the patch to fix compatibility with newer Node.js versions

./setup.sh

Start the Zotero translation server

cd translation-server
npm start
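
To confirm that the .env file from the steps above is filled in, a small check like the following may help. It is a sketch that uses only the standard library and assumes all three variables are required, which this README does not state; `missing_env_keys` is a hypothetical helper, not part of the package.

```python
from pathlib import Path

# Assumption: every variable listed in the installation steps must be non-empty.
REQUIRED = ("OXYLABS_USERNAME", "OXYLABS_PASSWORD", "OPENAI_API_KEY")


def missing_env_keys(env_text: str) -> list[str]:
    """Return required keys that are absent or empty in the given .env text."""
    values = {}
    for line in env_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # ignore blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return [key for key in REQUIRED if not values.get(key)]


env_path = Path(".env")
env_text = env_path.read_text() if env_path.exists() else ""
print("Missing keys:", missing_env_keys(env_text))
```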

Development Environment

macOS Quick Start

For macOS users, we provide a convenience script that automatically sets up a development environment:

./macos_dev_environment.sh

This script will:

Open a terminal (iTerm2 if installed, otherwise the default Terminal app)
Launch VSCode with the project
Start Jupyter notebook in the first tab/window
Start the translation server in the second tab/window
Set up log monitoring in the third tab/window

Note: This script requires macOS. It works with either iTerm2 (using tabs) or the default Terminal app (using separate windows).

Manual Setup (All Platforms)

For non-macOS users or those who prefer manual setup:

Open the project in your preferred editor
Start Jupyter notebook: poetry run jupyter-notebook
In a separate terminal, start the translation server: cd translation-server && npm start
Optionally, in another terminal, monitor logs: tail -f scraper.log

Improvements Over Previous Version

The refactored GTRex Scraper offers several improvements over the previous version:

Modular Architecture: Clear separation of concerns with specialized components
Extensibility: Easy to add new fetchers, extractors, and enrichers
Priority-Based Merging: Intelligent merging of metadata from different sources
Domain-Specific Extractors: Better accuracy for specific websites
Improved Error Handling: More robust error handling and logging
Backward Compatibility: Legacy adapters for existing code
Better Documentation: Comprehensive documentation and examples
Async Implementation: Fully async implementation for better performance
Proxy Rotation: Automatic proxy rotation to avoid rate limiting
Smart Fetcher Caching: Learns which fetcher works best for each domain
Optimized Bulk Fetching: Fast mode for fetching multiple URLs from the same domain
Resilient Retries: Automatic retries with exponential backoff using tenacity
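
The retry behaviour named in the last item can be pictured with a standard-library sketch. Note that this deliberately does not use tenacity, which the package relies on; `retry_with_backoff` and its parameters are hypothetical and only illustrate the general exponential backoff pattern.

```python
import asyncio
import random


async def retry_with_backoff(operation, attempts=4, base_delay=0.01, max_delay=1.0):
    """Call `operation` until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return await operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little random jitter keeps parallel retries from firing in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay / 10))


attempt_log = []


async def flaky_fetch():
    # Stand-in for a network call that fails twice before succeeding.
    attempt_log.append(len(attempt_log) + 1)
    if len(attempt_log) < 3:
        raise TimeoutError("temporary failure")
    return "content"


result = asyncio.run(retry_with_backoff(flaky_fetch))
print(result, attempt_log)
```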

Project Structure

src/gtrex_scraper/
├── __init__.py                      # Package exports
├── core/                            # Core components
│   ├── extraction_pipeline.py       # Manages extractors
│   ├── enrichment_pipeline.py       # Manages enrichers
│   ├── metadata_service.py          # Main service facade
│   └── priorities.py                # Metadata source priorities
├── enrichers/                       # Metadata enrichers
│   ├── base_enricher.py             # Base class for enrichers
│   ├── crossref_enricher.py         # Crossref API enricher
│   ├── openalex_enricher.py         # OpenAlex API enricher
│   └── domain_specific/             # Domain-specific enrichers
├── extractors/                      # Metadata extractors
│   ├── base_extractor.py            # Base class for extractors
│   ├── google_scholar_cache_extractor.py  # Google Scholar cache extractor
│   ├── pdf_extractor.py             # PDF metadata extractor
│   ├── adobe_digital_extractor.py   # Adobe metadata extractor
│   ├── enhanced_abstract_extractor.py  # Enhanced abstract extractor
│   ├── datalayer_extractor.py       # Data layer extractor
│   ├── metatag_extractor.py         # HTML meta tag extractor
│   ├── jsonld_extractor.py          # JSON-LD extractor
│   ├── zotero_extractor.py          # Zotero translation server
│   └── domain_specific/             # Domain-specific extractors
├── fetchers/                        # Content fetchers
│   ├── base_fetcher.py              # Base class for fetchers
│   ├── curl_fetcher.py              # cURL-based fetcher
│   ├── http_fetcher.py              # Basic HTTP fetcher
│   ├── playwright_fetcher.py        # Browser automation
│   ├── fast_playwright_fetcher.py   # Optimized browser automation
│   ├── caching_fetcher_factory.py   # Smart fetcher selection
│   ├── pdf_fetcher.py               # PDF content fetcher
│   └── fallback_fetcher.py          # Multi-strategy fetcher
├── models/                          # Data models
│   ├── __init__.py                  # Package exports
│   ├── metadata.py                  # Metadata models
│   ├── response.py                  # Response models
│   └── schemas.py                   # Schema definitions
├── scholar/                         # Scholar functionality
│   ├── __init__.py                  # Package exports
│   ├── scholar_client.py            # Google Scholar HTTP client
│   ├── scholar_models.py            # Scholar data models
│   ├── scholar_service.py           # Scholar service facade
│   └── scholar_utils.py             # Scholar utilities
└── utils/                           # Utilities
    ├── __init__.py                  # Package exports
    ├── logging_setup.py             # Logging configuration
    └── utils.py                     # Common utility functions
