relabeledPDFs.py renames academic PDF files using metadata extracted directly from each PDF.
The new filenames follow a standardized, human-readable convention:
FirstAuthorLastName + Year + FirstSixContentWordsInCamelCase.pdf
For example, the file 029401_1_online.pdf becomes:
Thompson2022LAMMPSFlexibleSimulationToolParticleBased.pdf
This naming convention makes it possible to identify the paper at a glance in a file browser without opening it.
Downloaded academic PDFs arrive with opaque filenames: publisher-assigned hashes (029401_1_online.pdf), DOI-encoded strings (10.1515_bmc.2011.016.pdf), or truncated titles.
Manually renaming dozens or hundreds of papers is tedious, error-prone, and time-consuming.
Using AI for all repetitive tasks of this nature is unethical; use scripts instead.
relabeledPDFs.py automates the process by extracting author, year, and title metadata from each PDF and constructing a consistent filename.
The script uses a three-tier metadata extraction strategy, attempting each in order and stopping at the first success:
- CrossRef API lookup — If a DOI is found (in PDF metadata, the filename, or the page text), the script queries the CrossRef REST API for authoritative metadata.
- Embedded PDF metadata (pypdf) — The script reads the
/Title,/Author, and/CreationDatefields stored inside the PDF itself. - Text parsing fallback — As a last resort, the script extracts the first two pages of text (preferring the
pdftotextCLI from poppler-utils for its superior handling of multi-column layouts) and applies heuristic regex patterns to locate the title, first author, and publication year.
The CamelCase title conversion applies several domain-aware rules:
- Stop word filtering: articles, prepositions, and conjunctions (a, an, the, of, for, in, ...) are skipped so that only content words count toward the six-word limit.
- Acronym preservation: a
PRESERVE_CASEdictionary ensures that terms like RNA, DNA, 3D, LAMMPS, PyMOL, GROMACS, and SARS-CoV-2 retain their canonical casing rather than being lowered to Rna, Dna, Lammps, etc. - Single-letter hyphen prefix merging: "G-quadruplex" becomes
GQuadruplex(one word), because splitting it would waste one of the six word slots on the single letter "G". - Multi-character hyphenated word splitting: "Particle-Based" becomes
Particle+Based(two separate words).
Required:
- Python 3.10 or later
- pypdf (
pip install pypdf)
Optional (recommended):
pdftotextCLI from poppler-utils — produces much cleaner text extraction than pure-Python alternatives, especially for multi-column academic paper layouts. Install withbrew install poppler(macOS),sudo apt install poppler-utils(Debian/Ubuntu), orconda install -c conda-forge poppler.- pdfplumber (
pip install pdfplumber) — used as a tertiary fallback if bothpdftotextand pypdf fail to extract usable text.
For running tests:
- pytest (
pip install pytest)
Clone the repository and install the dependencies:
git clone https://github.com/YourUsername/relabeledPDFs.git
cd relabeledPDFs
pip install pypdf pytestOptionally install poppler-utils for improved text extraction:
# macOS
brew install poppler
# Debian / Ubuntu
sudo apt install poppler-utils
# Conda
conda install -c conda-forge popplerRename all PDFs in a directory:
python relabeledPDFs.py /path/to/pdf/folderThe script prints a progress log, renames files in place, and reports a summary at the end.
Preview the proposed filenames without modifying anything on disk:
python relabeledPDFs.py /path/to/pdf/folder --dry-runor equivalently:
python relabeledPDFs.py /path/to/pdf/folder -nSample output:
Found 48 PDF(s) in /Users/blaine/papers
[1/48] 029401_1_online.pdf
-> Thompson2022LAMMPSFlexibleSimulationToolParticleBased.pdf
[2/48] gkac123.pdf
-> Zok2022RNApolisStructuralBioinformaticsPlatformRNA.pdf
...
============================================================
Ready to rename: 45 | Need manual review: 3
(dry-run mode — no files were renamed)
Export the full metadata mapping as JSON (useful for scripting or post-processing):
python relabeledPDFs.py /path/to/pdf/folder --jsonJSON mode implies --dry-run. Redirect the output to a file if desired:
python relabeledPDFs.py /path/to/pdf/folder --json > rename_map.jsonPrint detailed extraction information (DOIs found, CrossRef responses):
python relabeledPDFs.py /path/to/pdf/folder --dry-run --verboseThe CrossRef API routes requests with a contact email to a faster "polite" pool. Provide your email for better performance when processing large batches:
python relabeledPDFs.py /path/to/pdf/folder --email you@university.eduFlags can be combined freely:
python relabeledPDFs.py ~/Downloads/papers --dry-run --verbose --email you@university.eduWhen called without a directory argument, the script processes the current working directory:
cd /path/to/pdf/folder
python relabeledPDFs.py --dry-runusage: relabeledPDFs.py [-h] [--dry-run] [--json] [--verbose] [--email EMAIL]
[directory]
Rename academic PDF files to LastNameYearFirst6TitleWordsCamelCase.pdf
positional arguments:
directory Directory containing PDF files (default: current directory)
optional arguments:
--dry-run, -n Preview renames without actually renaming files
--json, -j Output the mapping as JSON (implies --dry-run)
--verbose, -v Print detailed extraction information
--email EMAIL Email address for CrossRef polite API pool
(default: user@example.com)
To preserve the casing of additional domain-specific terms, add entries to the PRESERVE_CASE dictionary near the top of the script:
PRESERVE_CASE = {
...
'crispr': 'CRISPR',
'alphafold': 'AlphaFold',
'openai': 'OpenAI',
}The key must be the all-lowercase form. The value is the canonical casing that will appear in the filename.
To skip additional words during title conversion, add them (in lowercase) to the STOP_WORDS set:
STOP_WORDS = {
...
'using', 'based', 'toward',
}The default of six content words can be changed globally by modifying the n=6 default in the title_to_camel() function signature, or on a per-call basis:
camel = title_to_camel("Some Very Long Title ...", n=8)The file test_relabeledPDFs.py contains 95 tests organized into unit tests and integration tests.
Run the full suite with:
pytest test_relabeledPDFs.py -vor:
python -m pytest test_relabeledPDFs.py -v --tb=shortAll 95 tests pass in approximately 0.3 seconds.
The tests are grouped into 15 classes covering every public function. The table below summarizes the class, the function or feature it targets, and the count of test methods.
| Test class | Target function / feature | Tests | Type |
|---|---|---|---|
TestTitleToCamel |
title_to_camel() — stop words, acronyms, hyphens, punctuation, word count, case |
23 | Unit |
TestParseFirstAuthorLast |
parse_first_author_last() — name formats, separators, edge cases |
10 | Unit |
TestExtractYearFromText |
extract_year_from_text() — copyright, published/accepted, range boundaries |
9 | Unit |
TestExtractTitleFromText |
extract_title_from_text() — header skipping, continuation lines |
5 | Unit |
TestExtractAuthorFromText |
extract_author_from_text() — author line detection |
2 | Unit |
TestParseCrossref |
parse_crossref() — complete records, date fallbacks, corporate authors |
7 | Unit |
TestExtractDoi |
extract_doi() — doi: prefix, URLs, filenames, bare DOIs |
7 | Unit |
TestCrossrefEmailAccessors |
_get/_set_crossref_email() |
2 | Unit |
TestLookupCrossref |
lookup_crossref() — mocked HTTP success and failure |
4 | Unit |
TestHasPdftotext |
_has_pdftotext() — mocked shutil.which |
2 | Unit |
TestExtractText |
extract_text() — fallback chain (pdftotext → pypdf → pdfplumber) |
3 | Unit |
TestProcessPdfWithMock |
process_pdf() — CrossRef, pypdf, text, missing-metadata paths |
4 | Integration |
TestProcessDirectoryIntegration |
process_directory() — dry run, JSON, rename, dedup, skip, empty dir |
6 | Integration |
TestMainCLI |
main() — argparse flags, exit codes |
3 | Integration |
TestEdgeCases |
Boundary conditions, unicode, slashes, dictionary completeness | 8 | Edge case |
Each unit test class isolates a single function and feeds it controlled inputs.
External dependencies (network, file system, subprocess calls) are replaced with unittest.mock patches so that the tests run without network access, without PDF files on disk, and without poppler-utils installed.
Key areas of coverage:
- CamelCase conversion (
TestTitleToCamel): The largest test class, with 23 tests, because the naming logic is the heart of the script. Tests verify that stop words are skipped, that acronyms like RNA, DNA, 3D, LAMMPS, PyMOL, and SARS-CoV-2 retain canonical casing, that single-letter hyphen prefixes merge correctly (G-quadruplex → GQuadruplex), that multi-character hyphenated words split into separate word slots, that exactly six content words are selected, that punctuation (colons, em-dashes, parentheses) is stripped, and that empty orNoneinputs return empty strings. - Author parsing (
TestParseFirstAuthorLast): Covers "First Last", "Last, First", semicolon-separated lists, "and"-separated lists, affiliation marker stripping (†, ‡, superscript digits), compound last names (Garcia-Lopez), single corporate names (Formulatrix), and middle initials. - Year extraction (
TestExtractYearFromText): Exercises the priority order of regex patterns (copyright © first, then published/received/accepted keywords, then bare four-digit years) and verifies that years before 1980 or after 2026 are rejected. - DOI extraction (
TestExtractDoi): Tests DOIs found viadoi:prefix,https://doi.org/URLs,https://dx.doi.org/URLs, filename-encoded DOIs (e.g.,10.1515_bmc.2011.016.pdf), bare DOIs in body text, and trailing period stripping. - CrossRef parsing (
TestParseCrossref): Feeds mock CrossRef JSON records with different date field combinations (published-print,published-online,issued) and author field formats (familyvs.name) to verify the fallback chain. - Text extraction fallback (
TestExtractText): Mocks the three extraction backends to confirm thatpdftotextis preferred when available, thatpypdfis used whenpdftotextis absent or returns empty text, and thatpdfplumberis the last resort.
Integration tests create real (minimal) PDF files in a temporary directory using pypdf.PdfWriter, then call process_pdf() or process_directory() and verify observable outcomes:
- Dry-run preservation: The file listing before and after a
--dry-runinvocation is identical — no files are renamed, created, or deleted. - JSON output: The
--jsonflag produces valid JSON thatjson.loads()can parse, with one entry per PDF. - Actual rename: After a non-dry-run invocation, PDFs with extractable metadata are renamed to the new filenames and the original filenames no longer exist.
- Deduplication: When two PDFs would receive the same proposed filename, the second one gets a numeric suffix (e.g.,
Smith2023TestPaper2.pdf). - Target-exists skip: If the proposed target filename already exists on disk, the rename is skipped rather than overwriting.
- Empty directory: Processing a directory with no PDFs prints a message and returns an empty list without errors.
- CLI flags: The
--dry-run,--json,--verbose, and--emailflags are accepted by argparse. A nonexistent directory argument causessys.exit(1). - Metadata source priority: With mocked extraction functions, the tests verify that CrossRef results take priority over pypdf metadata, which in turn takes priority over heuristic text parsing.
Run only the CamelCase tests:
pytest test_relabeledPDFs.py -k "TestTitleToCamel" -vRun only the integration tests:
pytest test_relabeledPDFs.py -k "Integration or MainCLI or ProcessPdf" -vRun a single test by name:
pytest test_relabeledPDFs.py::TestTitleToCamel::test_sars_cov_preserved -vrelabeledPDFs/
├── relabeledPDFs.py # Main script
├── test_relabeledPDFs.py # Unit and integration tests (95 tests)
└── README.md # This file
- Text extraction quality varies: Some PDFs (scanned images, heavily formatted layouts, encrypted files) may yield poor text extraction, causing the heuristic fallbacks to produce incorrect metadata. Use
--dry-runto review before committing. - CrossRef API requires network access: The best metadata source depends on an internet connection and a DOI being present in the PDF.
- Heuristic author/title detection: The text-based fallback uses regex patterns tuned for common academic paper layouts. Unusual formatting (e.g., titles in images, authors in footers) may not be detected.
- Six-word limit: Extremely long titles may lose distinguishing information. Increase
nintitle_to_camel()if this is a concern.
Alpha. Does not handle all edge cases yet. The lead author's last name is sometimes misindentified replaced with the wrong word. This will be obvious. Some manual editing is still required. Nonetheless, a lot of work is saved.
- Fork the repository.
- Create a feature branch (
git checkout -b feature/improved-parsing). - Add tests for any new functionality.
- Run
pytest test_relabeledPDFs.py -vand confirm all tests pass. - Submit a pull request.
| Version | Changes | Date |
|---|---|---|
| Version 0.1 | Added badges, funding, and update table. Initial commit. | 2026 February 12 |
- NIH: R01 CA242845.
- NIH: R01 AI088011.
- NIH: P30 CA225520 (PI: R. Mannel).
- NIH: P20 GM103640 and P30 GM145423 (PI: A. West).