cc-getpage

cc-getpage is a lightweight Python utility for retrieving individual pages from the Common Crawl archive. It looks up a specific web page in Common Crawl's index and downloads only the corresponding WARC file segment.
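As a rough illustration of the index-then-segment idea (not the tool's actual code), the sketch below turns one index record into a ranged HTTP request. It assumes the usual Common Crawl index record fields `filename`, `offset`, and `length`, and the public data host `https://data.commoncrawl.org/`. The helper names and the sample record are hypothetical.

```python
import json

DATA_HOST = "https://data.commoncrawl.org/"  # public Common Crawl data host


def parse_index_lines(text):
    """Parse newline-delimited JSON, one index record per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def segment_request(record):
    """Return (url, headers) that fetch only this record's bytes."""
    offset = int(record["offset"])
    length = int(record["length"])
    # HTTP byte ranges are inclusive at both ends, hence the "- 1".
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return DATA_HOST + record["filename"], headers


# Made-up record for illustration; real records come from the index server.
sample = (
    '{"url": "https://example.com/", '
    '"filename": "crawl-data/EXAMPLE/segments/0/warc/example.warc.gz", '
    '"offset": "1000", "length": "250"}'
)
record = parse_index_lines(sample)[0]
url, headers = segment_request(record)
print(url)
print(headers)
```

Sending that `Range` header with an ordinary GET request should return just the one compressed record rather than the whole WARC file.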

For bulk downloads or entire snapshots, please use the official cc-downloader program.

Features

Fetches specific web pages from Common Crawl archives
Automatically probes crawls to find which ones contain your URL
Supports manual or automatic crawl selection
Displays archived versions of a URL for selection
Downloads only the necessary WARC segment
Includes automatic retries with backoff
Offers a --viewpage option that prints a Common Crawl viewer URL instead of downloading
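The retry behaviour listed above could look something like the following. This is a minimal sketch assuming exponential backoff. The function name, attempt count, and delays are illustrative assumptions, not values taken from cc-getpage.py.

```python
import time


def with_retries(func, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**n seconds and retry."""
    for attempt in range(attempts):
        try:
            return func()
        except OSError:  # urllib.error.URLError is a subclass of OSError
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting. A caller would wrap the network call, for example `with_retries(lambda: fetch(url))`, where `fetch` is whatever download function is in use.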

Usage

python cc-getpage.py [--viewpage] <URL> [CRAWL-ID]

Options

Option	Description
`--viewpage`	Print a Common Crawl viewer URL instead of downloading the WARC segment
`--version`	Show the program version
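For readers who want to see how a command line of this shape is typically parsed, here is a hedged sketch using the standard `argparse` module. It mirrors the usage line and the options above, but it is an assumption about structure rather than the script's real parser. The destination names and the version string are placeholders.

```python
import argparse


def build_parser():
    """Build a parser matching: [--viewpage] <URL> [CRAWL-ID]."""
    parser = argparse.ArgumentParser(prog="cc-getpage.py")
    parser.add_argument(
        "--viewpage",
        action="store_true",
        help="print a Common Crawl viewer URL instead of downloading",
    )
    # Placeholder version string; the real value lives in the script.
    parser.add_argument("--version", action="version", version="%(prog)s 0.0.0")
    parser.add_argument("url", metavar="URL")
    # nargs="?" makes the crawl ID optional, so it defaults to None.
    parser.add_argument("crawl_id", metavar="CRAWL-ID", nargs="?", default=None)
    return parser


args = build_parser().parse_args(["--viewpage", "https://example.com/"])
print(args.viewpage, args.url, args.crawl_id)
```

A `crawl_id` of `None` would then be the signal to fall back to probing the available crawls.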

If CRAWL-ID is omitted, the program probes all available crawls to find which ones contain the given URL. Probing is rate-limited to be polite to the index server, so it may take a while. Press Ctrl+C to stop early and work with whatever matches have been found so far.
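The probing behaviour described above can be sketched as a loop that pauses between requests and treats Ctrl+C as "stop and keep what we have". This is an illustration under assumptions: `probe_crawls`, the `contains_url` callback, and the one-second delay are hypothetical names and values, not the script's own.

```python
import time


def probe_crawls(crawl_ids, contains_url, delay=1.0, sleep=time.sleep):
    """Check each crawl in turn and return the IDs that matched so far."""
    matches = []
    try:
        for crawl_id in crawl_ids:
            if contains_url(crawl_id):
                matches.append(crawl_id)
            sleep(delay)  # pause between index requests to limit the rate
    except KeyboardInterrupt:
        pass  # Ctrl+C: stop early and keep the partial results
    return matches
```

In a real run, `contains_url` would query the index for one crawl. Catching `KeyboardInterrupt` inside the function is what lets the caller continue with a partial list instead of exiting.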

Contribute

Pull requests are welcome. Feel free to improve features or fix bugs.

License

This project is licensed under the MIT License.

Contact

For support or questions, visit Common Crawl or open an issue on GitHub. You're also welcome to join our Discord server or Google Group.
