cc-getpage is a lightweight Python utility for retrieving individual pages from the Common Crawl archive. It looks up a specific web page in Common Crawl's index and downloads only the corresponding WARC file segment.
For bulk downloads or entire snapshots, please use the official cc-downloader program.
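Under the hood, the flow is roughly: query a crawl's CDX index for the URL, read the WARC filename, byte offset, and length from the match, and fetch just that byte range. Here is a minimal sketch of that flow, not cc-getpage's actual code; the crawl ID, URL, and output filename are illustrative:

```python
# A minimal sketch of the index-lookup-plus-range-fetch approach
# (not cc-getpage's actual code).
import json
import requests

CRAWL = "CC-MAIN-2024-10"       # illustrative crawl ID
URL = "https://example.com/"    # illustrative target URL

# The CDX index returns one JSON record per line for each capture.
resp = requests.get(
    f"https://index.commoncrawl.org/{CRAWL}-index",
    params={"url": URL, "output": "json"},
    timeout=30,
)
resp.raise_for_status()
record = json.loads(resp.text.splitlines()[0])

# Each record names the WARC file plus the byte offset and length of
# the segment holding this capture; a Range request fetches just that.
offset, length = int(record["offset"]), int(record["length"])
segment = requests.get(
    f"https://data.commoncrawl.org/{record['filename']}",
    headers={"Range": f"bytes={offset}-{offset + length - 1}"},
    timeout=30,
)
segment.raise_for_status()

with open("page.warc.gz", "wb") as f:
    f.write(segment.content)  # a self-contained gzipped WARC record
```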
- Fetches specific web pages from Common Crawl archives
- Automatically probes crawls to find which ones contain your URL
- Supports manual or automatic crawl selection
- Displays archived versions of a URL for selection
- Downloads only the necessary WARC segment
- Includes automatic retries with backoff (see the sketch after this list)
- `--viewpage` option to get a Common Crawl viewer URL instead of downloading
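The retry behaviour could be built with the `Retry` helper that ships with urllib3/requests. This is a sketch of the general pattern, not cc-getpage's actual implementation; the attempt count, backoff factor, and status codes are illustrative:

```python
# A sketch of automatic retries with exponential backoff using
# requests + urllib3. The settings shown are illustrative, not
# cc-getpage's actual configuration.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session() -> requests.Session:
    retry = Retry(
        total=5,                                      # up to 5 retries per request
        backoff_factor=1.0,                           # exponential backoff between attempts
        status_forcelist=(429, 500, 502, 503, 504),   # retry on these HTTP statuses
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

session = make_session()
resp = session.get("https://index.commoncrawl.org/collinfo.json", timeout=30)
```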
```
python cc-getpage.py [--viewpage] <URL> [CRAWL-ID]
```

| Option | Description |
|---|---|
| `--viewpage` | Print a Common Crawl viewer URL instead of downloading the WARC segment |
| `--version` | Show the program version |
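For example, to fetch a page from one crawl, or to print a viewer link instead (the URL and crawl ID here are illustrative):

```
python cc-getpage.py https://example.com/ CC-MAIN-2024-10
python cc-getpage.py --viewpage https://example.com/
```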
If CRAWL-ID is omitted, the program will probe all available crawls to find which ones contain the given URL. This is rate-limited to be polite to the index server, so it may take a while. Press Ctrl+C to stop early and work with whatever matches have been found so far.
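The probing step might look roughly like the sketch below: list all crawls from Common Crawl's public `collinfo.json`, query each crawl's CDX index with a polite delay, and stop early on Ctrl+C, keeping whatever matches were found. This is an assumption-laden sketch, not cc-getpage's actual code; the URL and delay are illustrative.

```python
# A sketch of probing every crawl for a URL (not cc-getpage's actual
# code): enumerate crawls via collinfo.json, query each index with a
# delay, and stop early on Ctrl+C with the matches found so far.
import time
import requests

URL = "https://example.com/"   # illustrative target URL
DELAY_SECONDS = 2              # illustrative rate limit between queries

crawls = requests.get(
    "https://index.commoncrawl.org/collinfo.json", timeout=30
).json()

matches = []
try:
    for crawl in crawls:
        resp = requests.get(
            crawl["cdx-api"],
            params={"url": URL, "output": "json", "limit": "1"},
            timeout=30,
        )
        # The CDX server returns 404 when a crawl has no captures.
        if resp.status_code == 200 and resp.text.strip():
            matches.append(crawl["id"])
        time.sleep(DELAY_SECONDS)
except KeyboardInterrupt:
    pass  # work with whatever matches were collected so far

print("Crawls containing the URL:", matches)
```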
Pull requests are welcome. Feel free to improve features or fix bugs.
This project is licensed under the MIT Licence.
For support or questions, visit Common Crawl or open an issue on GitHub. You're also welcome to join our Discord server or Google Group.
