Scholar API

JSON data of published papers with DOIs and Journal Impact Factors for professors.

This project uses a home-server cron job to scrape Google Scholar data via the scholarly Python package, and the Google Sheets API to look up each journal's impact factor (IF) and each publication's DOI.

The resulting JSON can then be published, for example by copying it to a publicly accessible server via Secure Copy (scp) or rsync, where a Flask application serves it.
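
For illustration, a minimal sketch of such a Flask app (not the project's actual server.py) that serves the JSON files from a data directory:

import json
import os

from flask import Flask, abort, jsonify

app = Flask(__name__)
DATA_DIR = os.environ.get("SCHOLAR_DATA_DIR", "scholar_data")

@app.route("/scholar/<scholar_id>")
def scholar(scholar_id):
    # Assumed file layout: one <scholar_id>.json file per author.
    path = os.path.join(DATA_DIR, f"{scholar_id}.json")
    if not os.path.exists(path):
        abort(404)
    with open(path) as f:
        return jsonify(json.load(f))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)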

Meltwater is a news-gathering tool used by some universities. See also Isentia Mediaportal, and Zotero, an open-source citation manager.


Installation

Prerequisites

  • Python 3.6+
  • Flask
  • scholarly
  • wikipedia

Setup

  1. Clone this repository and change into the project directory.

  2. Create a virtual environment and install the required packages:

    git clone https://github.com/Luen/scholarly-api
    cd scholarly-api
    python -m venv scholar
    source scholar/bin/activate
    pip install -r requirements.txt

  3. Test run:

    python main.py ynWS968AAAAJ

Docker

The stack includes:

  • hero – Ulixee Hero Cloud (browser automation for scraping)
  • hero-scraper – HTTP API wrapper that sends URLs to Hero and returns HTML
  • web – Flask API serving scholar data
  • cron – Runs the main scraper on a schedule

Build the base image (required once; no container is created):

docker compose build base

Start all services (hero, hero-scraper, web, cron; base is build-only and does not run):

docker compose up -d

For browser-based scraping (DOIs, etc.), the hero-scraper service must be running. Set HERO_SCRAPER_URL (default: http://hero-scraper:3000 in Docker, http://localhost:3000 locally) to point at it.
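
As a sketch, code that needs rendered HTML might call the service like this (the /scrape endpoint and its query parameter are assumptions, not the service's documented API):

import os
import requests

HERO_SCRAPER_URL = os.environ.get("HERO_SCRAPER_URL", "http://localhost:3000")

def fetch_html(url: str) -> str:
    # Ask hero-scraper to load the URL in a Hero browser and return the HTML.
    # The endpoint name and parameter are hypothetical.
    resp = requests.get(f"{HERO_SCRAPER_URL}/scrape", params={"url": url}, timeout=60)
    resp.raise_for_status()
    return resp.text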

Caching

HTTP responses are cached with requests-cache in cache/ (SQLite). This includes:

  • Scholarly (Google Scholar) requests
  • DOI API requests (doi.org, shortdoi.org)
  • Hero scraper (browser-fetched HTML)
  • Web page fetches for DOI extraction

Set CACHE_DIR to change the cache location, and CACHE_EXPIRE_SECONDS (default: 30 days) to control expiry.
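
A minimal sketch of how such a cached session can be set up with requests-cache, mirroring the variables above:

import os
import requests_cache

cache_dir = os.environ.get("CACHE_DIR", "cache")
expire = int(os.environ.get("CACHE_EXPIRE_SECONDS", str(30 * 24 * 3600)))  # 30 days

# SQLite-backed cache; responses are reused until they expire.
session = requests_cache.CachedSession(
    os.path.join(cache_dir, "http_cache"),
    backend="sqlite",
    expire_after=expire,
)

response = session.get("https://doi.org/api/handles/10.1000/182")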

Wait for containers to be ready (check status with docker compose ps). Then to manually run the script:

# First check if containers are ready
docker compose ps

# If containers are running, execute the script
docker compose exec cron python main.py ynWS968AAAAJ

# If you get a "container is restarting" error, check logs
docker compose logs web

Project structure

  • main.py – Orchestration only: loads config, runs pipeline, handles idempotency
  • src/scholar_fetcher.py – Author, coauthors, publications from scholarly (with retries)
  • src/doi_resolver.py – DOI lookup and resolution (with retries)
  • src/output.py – Load/save JSON, schema_version, last_fetched, resume indices
  • src/config.py – Config loaded from .env
  • src/retry.py – Retry decorator with exponential backoff (see the sketch below)
  • src/logging_config.py – Structured logging (text or JSON)

Output JSON includes schema_version, last_fetched, and _last_successful_*_index for resume support.
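
For illustration, the retry decorator in src/retry.py might look something like this sketch (names and defaults here are assumptions, not the actual implementation):

import functools
import time

def retry(max_retries: int = 3, base_delay: float = 1.0):
    # Retry the wrapped function, doubling the delay after each failure.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_retries:
                        raise
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator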

Development

Linting and formatting

This project uses Ruff for linting and formatting. Run after code changes:

ruff check . --fix && ruff format .

Testing

pip install -r requirements.txt
pytest tests/ -v

Tests marked integration require network access. Tests that need google-credentials.json or the Hero scraper will skip when unavailable. Run lint and format before committing:

ruff check . && ruff format --check .

Starting the Flask server

Navigate to the project directory and run:

python server.py

Or with Docker:

docker compose up web -d

The API is available at http://localhost:8000 (Docker maps 8000→5000).

API Endpoints

URL Method Description
/ GET Welcome message
/health GET Health check
/scholars GET List available scholar IDs
/scholar/<id> GET Get scholar data by ID (e.g. /scholar/ynWS968AAAAJ)
/altmetric/<doi> GET Altmetric score for a DOI (cached for 2 weeks); returns 401 unless the DOI belongs to Rummer, Bergseth, or Wu
/scholar-citations/<doi> GET Google Scholar citation count for a DOI (cached for 2 weeks); returns 401 unless the DOI belongs to Rummer, Bergseth, or Wu
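
For example, querying the API from Python (assuming the Docker port mapping above and JSON response bodies):

import requests

BASE = "http://localhost:8000"  # use port 5000 when running server.py directly

print(requests.get(f"{BASE}/health").json())               # health check
ids = requests.get(f"{BASE}/scholars").json()              # available scholar IDs
data = requests.get(f"{BASE}/scholar/ynWS968AAAAJ").json() # one scholar's data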

Environment variables

  • FLASK_HOST – Bind host (default: 0.0.0.0)
  • FLASK_PORT – Bind port (default: 5000)
  • SCHOLAR_DATA_DIR – Path to scholar JSON files (default: scholar_data)
  • CACHE_DIR – HTTP cache directory (default: cache)
  • CACHE_EXPIRE_SECONDS – Cache expiry (default: 30 days)
  • FRESH_DATA_SECONDS – Skip full fetch if data is newer (default: 7 days)
  • MAX_RETRIES, RETRY_BASE_DELAY – Retry settings for Scholar/DOI APIs
  • COAUTHOR_DELAY, PUBLICATION_DELAY – Rate limiting (seconds)
  • LOG_FORMAT – Set to json for structured JSON logs (e.g. in Docker)
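
A sketch of how src/config.py might read these values (assuming python-dotenv; the actual module may differ):

import os

from dotenv import load_dotenv

load_dotenv()  # read .env into the process environment

FLASK_HOST = os.environ.get("FLASK_HOST", "0.0.0.0")
FLASK_PORT = int(os.environ.get("FLASK_PORT", "5000"))
SCHOLAR_DATA_DIR = os.environ.get("SCHOLAR_DATA_DIR", "scholar_data")
CACHE_DIR = os.environ.get("CACHE_DIR", "cache")
CACHE_EXPIRE_SECONDS = int(os.environ.get("CACHE_EXPIRE_SECONDS", str(30 * 24 * 3600)))
FRESH_DATA_SECONDS = int(os.environ.get("FRESH_DATA_SECONDS", str(7 * 24 * 3600)))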
