This project runs a cron job on a home server that scrapes Google Scholar data via the scholarly Python package and uses Google Sheets to look up each journal's impact factor (IF) and each publication's DOI.
The resulting JSON can then be uploaded to a publicly accessible server, for example via Secure Copy (SCP) or rsync, where a Flask application serves it.
Meltwater is a news-gathering tool used by some universities. See also Isentia Mediaportal, Zotero (an open-source citation manager), and altmetrics.
Requirements:

- Python 3.6+
- Flask
- scholarly
- Wikipedia

Setup:

1. Clone this repository to your local machine:

   ```
   git clone https://github.com/Luen/scholarly-api
   ```

2. Create and activate a virtual environment:

   ```
   python -m venv scholar
   source scholar/bin/activate
   ```

3. Install the required packages:

   ```
   pip install -r requirements.txt
   ```
Test run:

```
python main.py ynWS968AAAAJ
```
The stack includes:
- hero – Ulixee Hero Cloud (browser automation for scraping)
- hero-scraper – HTTP API wrapper that sends URLs to Hero and returns HTML
- web – Flask API serving scholar data
- cron – Runs the main scraper on a schedule
Build the base image (required once; no container is created):

```
docker compose build base
```

Start all services (hero, hero-scraper, web, cron; base is build-only and does not run):

```
docker compose up -d
```

For browser-based scraping (DOIs, etc.), the hero-scraper service must be running. Set `HERO_SCRAPER_URL` (default: `http://hero-scraper:3000` in Docker, `http://localhost:3000` locally) to point at it.
HTTP responses are cached with requests-cache in cache/ (SQLite). This includes:
- Scholarly (Google Scholar) requests
- DOI API requests (doi.org, shortdoi.org)
- Hero scraper (browser-fetched HTML)
- Web page fetches for DOI extraction
Set CACHE_DIR to change the cache location; CACHE_EXPIRE_SECONDS (default: 30 days) to control expiry.
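requests-cache handles all of this transparently; conceptually, the cache is a URL-keyed SQLite table with an expiry window. The stdlib-only sketch below illustrates the idea (the table layout and class name are illustrative, not requests-cache internals):

```python
import os
import sqlite3
import time

CACHE_EXPIRE_SECONDS = int(os.getenv("CACHE_EXPIRE_SECONDS", str(30 * 24 * 3600)))

class SqliteHttpCache:
    """Minimal URL -> response-body cache with expiry, mimicking requests-cache."""

    def __init__(self, path=":memory:", expire=CACHE_EXPIRE_SECONDS):
        self.expire = expire
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS http_cache "
            "(url TEXT PRIMARY KEY, body BLOB, fetched_at REAL)"
        )

    def get(self, url):
        row = self.db.execute(
            "SELECT body, fetched_at FROM http_cache WHERE url = ?", (url,)
        ).fetchone()
        if row is None or time.time() - row[1] > self.expire:
            return None  # cache miss, or entry is older than the expiry window
        return row[0]

    def put(self, url, body):
        self.db.execute(
            "INSERT OR REPLACE INTO http_cache VALUES (?, ?, ?)",
            (url, body, time.time()),
        )
        self.db.commit()
```

In the real project, wrapping the session once (e.g. `requests_cache.CachedSession`) means every Scholar, DOI, and Hero request benefits without per-call changes.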
Wait for the containers to be ready (check status with `docker compose ps`), then run the script manually:

```
# First check if the containers are ready
docker compose ps

# If the containers are running, execute the script
docker compose exec cron python main.py ynWS968AAAAJ

# If you get a "container is restarting" error, check the logs
docker compose logs web
```

- `main.py` – Orchestration only: loads config, runs the pipeline, handles idempotency
- `src/scholar_fetcher.py` – Author, coauthors, and publications from scholarly (with retries)
- `src/doi_resolver.py` – DOI lookup and resolution (with retries)
- `src/output.py` – Load/save JSON, `schema_version`, `last_fetched`, resume indices
- `src/config.py` – Config loaded from `.env`
- `src/retry.py` – Retry decorator with exponential backoff
- `src/logging_config.py` – Structured logging (text or JSON)
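The retry-with-exponential-backoff pattern used for the Scholar and DOI calls can be sketched as follows (a minimal version; the actual parameter names in `src/retry.py` may differ):

```python
import functools
import time

def retry(max_retries=3, base_delay=1.0, exceptions=(Exception,)):
    """Retry a flaky call, sleeping base_delay * 2**attempt between attempts."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == max_retries - 1:
                        raise  # out of attempts: surface the original error
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator
```

Applied as `@retry(max_retries=MAX_RETRIES, base_delay=RETRY_BASE_DELAY)`, this keeps transient Scholar/DOI failures from aborting a long fetch run.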
Output JSON includes `schema_version`, `last_fetched`, and `_last_successful_*_index` for resume support.
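The resume check can be sketched like this. The key names come from the README above; the freshness logic and the assumption that `last_fetched` is stored as a Unix timestamp are illustrative:

```python
import time

FRESH_DATA_SECONDS = 7 * 24 * 3600  # documented default: skip full fetch if newer

def resume_state(data, now=None):
    """Decide whether to skip a fetch and where to resume, from saved JSON."""
    now = time.time() if now is None else now
    fresh = now - data.get("last_fetched", 0) < FRESH_DATA_SECONDS
    return {
        "skip_fetch": fresh,
        # fall back to -1 (start from the beginning) when no index was saved
        "publication_index": data.get("_last_successful_publication_index", -1),
    }
```

A run interrupted mid-way through a long publication list can then continue from `publication_index + 1` instead of re-fetching everything.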
This project uses Ruff for linting and formatting. Run after code changes:
```
ruff check . --fix && ruff format .
```

To run the tests, install the requirements and invoke pytest:

```
pip install -r requirements.txt
pytest tests/ -v
```

Tests marked `integration` require network access. Tests that need google-credentials.json or the Hero scraper are skipped when those are unavailable. Run the lint and format checks before committing:

```
ruff check . && ruff format --check .
```

Navigate to the project directory and run:

```
python server.py
```

Or with Docker:

```
docker compose up web -d
```

The API is available at http://localhost:8000 (Docker maps 8000→5000).
| URL | Method | Description |
|---|---|---|
| `/` | GET | Welcome message |
| `/health` | GET | Health check |
| `/scholars` | GET | List available scholar IDs |
| `/scholar/<id>` | GET | Get scholar data by ID (e.g. `/scholar/ynWS968AAAAJ`) |
| `/altmetric/<doi>` | GET | Altmetric score for a DOI (cached 2 weeks). 401 if not Rummer/Bergseth/Wu |
| `/scholar-citations/<doi>` | GET | Google Scholar citation count for a DOI (cached 2 weeks). 401 if not Rummer/Bergseth/Wu |
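A small client for these routes might look like the following sketch, using only the stdlib; the base URL assumes the default Docker port mapping described above:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # Docker maps host 8000 -> Flask's 5000

def endpoint(kind, value=""):
    """Build a URL for one of the documented API routes."""
    paths = {
        "root": "/",
        "health": "/health",
        "scholars": "/scholars",
        "scholar": f"/scholar/{value}",
        "altmetric": f"/altmetric/{value}",
        "scholar-citations": f"/scholar-citations/{value}",
    }
    return BASE_URL + paths[kind]

def get_json(url):
    """Fetch a route and decode the JSON body."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())

# e.g. get_json(endpoint("scholar", "ynWS968AAAAJ"))
```

Note the altmetric and scholar-citations routes return 401 unless the DOI belongs to the allowed authors, so a real client should handle `urllib.error.HTTPError`.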
- `FLASK_HOST` – Bind host (default: `0.0.0.0`)
- `FLASK_PORT` – Bind port (default: `5000`)
- `SCHOLAR_DATA_DIR` – Path to scholar JSON files (default: `scholar_data`)
- `CACHE_DIR` – HTTP cache directory (default: `cache`)
- `CACHE_EXPIRE_SECONDS` – Cache expiry (default: 30 days)
- `FRESH_DATA_SECONDS` – Skip full fetch if data is newer (default: 7 days)
- `MAX_RETRIES`, `RETRY_BASE_DELAY` – Retry settings for Scholar/DOI APIs
- `COAUTHOR_DELAY`, `PUBLICATION_DELAY` – Rate limiting (seconds)
- `LOG_FORMAT` – Set to `json` for structured JSON logs (e.g. in Docker)
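`src/config.py` is described as loading these settings from `.env`; a sketch of the same defaults via `os.getenv`-style lookups (the repo may additionally use a dotenv loader, and only a subset of the variables is shown):

```python
import os

def load_config(env=os.environ):
    """Read API/scraper settings, falling back to the documented defaults."""
    return {
        "FLASK_HOST": env.get("FLASK_HOST", "0.0.0.0"),
        "FLASK_PORT": int(env.get("FLASK_PORT", "5000")),
        "SCHOLAR_DATA_DIR": env.get("SCHOLAR_DATA_DIR", "scholar_data"),
        "CACHE_DIR": env.get("CACHE_DIR", "cache"),
        "CACHE_EXPIRE_SECONDS": int(env.get("CACHE_EXPIRE_SECONDS", str(30 * 24 * 3600))),
        "FRESH_DATA_SECONDS": int(env.get("FRESH_DATA_SECONDS", str(7 * 24 * 3600))),
        "LOG_FORMAT": env.get("LOG_FORMAT", "text"),
    }
```

Passing `env` explicitly keeps the loader testable without mutating the real process environment.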