# web-scraper

Author: Yashwanth Aravind
GitHub: https://github.com/yash27007/web-scraper

A powerful yet minimal command-line web scraper written in Go, designed to recursively extract HTML content and follow links, with optional caching and export to JSON or Markdown. Built with concurrency and extensibility in mind, it lets you crawl and parse websites straight from your terminal.
## Features

- 🔄 Caching: Automatically caches responses to reduce redundant network calls.
- 🌐 Recursive crawling: Follow links with depth control.
- 🏷️ Tag & attribute extraction: Extract any HTML tag or attribute (see the sketch after this list).
- 📄 Multiple export formats: Output to `.json` or `.md`.
- ⚙️ Command-line flags: Full control over depth, tags, attributes, export format, and more.
- 🧹 Clearable cache: One-line command to wipe cached content.
- 💥 No third-party libraries used for scraping (pure Go standard library plus `golang.org/x/net/html`).
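For a rough idea of how this kind of tag and attribute extraction works with `golang.org/x/net/html`, here is a minimal, self-contained sketch. It is illustrative only; the project's real logic lives in `internal/scraper`, and the function name `extract` is an assumption:

```go
package main

import (
	"fmt"
	"net/http"

	"golang.org/x/net/html"
)

// extract walks the parsed HTML tree and collects either the text content
// of every node matching tag, or (when attr is non-empty) the value of
// that attribute on matching nodes.
func extract(n *html.Node, tag, attr string, out *[]string) {
	if n.Type == html.ElementNode && n.Data == tag {
		if attr == "" {
			if n.FirstChild != nil && n.FirstChild.Type == html.TextNode {
				*out = append(*out, n.FirstChild.Data)
			}
		} else {
			for _, a := range n.Attr {
				if a.Key == attr {
					*out = append(*out, a.Val)
				}
			}
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		extract(c, tag, attr, out)
	}
}

func main() {
	resp, err := http.Get("https://example.com")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	doc, err := html.Parse(resp.Body)
	if err != nil {
		panic(err)
	}

	var links []string
	extract(doc, "a", "href", &links) // roughly what --tag a --attr href does
	for _, link := range links {
		fmt.Println(link)
	}
}
```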
## Project Structure

```
web-scraper/
├── cmd/
│   └── scraper/        # CLI entry point
│       └── main.go
├── internal/
│   ├── cache/          # Caching logic
│   ├── export/         # Exporters (JSON, Markdown)
│   └── scraper/        # Core scraping logic
├── go.mod
├── Makefile            # Shortcut commands
└── README.md
```
## Prerequisites

- Go 1.20+ installed
- Git (to clone the repo)
## Installation

```bash
git clone https://github.com/yash27007/web-scraper.git
cd web-scraper
```

## Usage

```bash
go run cmd/scraper/main.go [flags] <urls>
```

Extract all links from a page:

```bash
go run cmd/scraper/main.go \
  --tag a --attr href \
  --depth 0 \
  https://en.wikipedia.org/wiki/Web_development
```

Extract paragraphs from multiple pages:

```bash
go run cmd/scraper/main.go \
  --tag p \
  --depth 0 \
  https://www.forrester.com/blogs/ \
  https://www.elevenwriting.com/blog/the-pros-cons-of-blogging-on-medium
```

Extract image sources:

```bash
go run cmd/scraper/main.go \
  --tag img --attr src \
  --depth 0 \
  https://en.wikipedia.org/wiki/Web_development
```

## Flags
| Flag | Description | Default |
|---|---|---|
| `--tag` | HTML tag to extract (`p`, `a`, `img`, `h1`, etc.) | `p` |
| `--attr` | Optional attribute to extract from the tag (`href`, `src`, etc.) | (empty) |
| `--follow-tag` | HTML tag to follow for links | `a` |
| `--follow-attr` | Attribute of `--follow-tag` to read the URL from | `href` |
| `--depth` | Max recursion depth for crawling | `2` |
| `--format` | Output format: `json` or `md` | `json` |
| `--out` | Output filename prefix (no extension) | `output` |
| `--refresh` | Skip cache and force refetch | `false` |
| `--clear-cache` | Clear cache and exit | `false` |
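To make the interplay of `--depth`, `--follow-tag`, and `--follow-attr` concrete, here is a hedged sketch of depth-limited crawling. All names are assumptions and link handling is simplified (real code would resolve relative URLs against the base page); the actual implementation lives in `internal/scraper`:

```go
package main

import (
	"fmt"
	"net/http"

	"golang.org/x/net/html"
)

// crawl fetches url, then recursively follows the followAttr value of every
// followTag node. With depth 0 only the base page is fetched; each level of
// recursion decrements depth. visited guards against link cycles.
func crawl(url, followTag, followAttr string, depth int, visited map[string]bool) {
	if depth < 0 || visited[url] {
		return
	}
	visited[url] = true
	fmt.Println("scraping:", url)

	resp, err := http.Get(url)
	if err != nil {
		return
	}
	defer resp.Body.Close()

	doc, err := html.Parse(resp.Body)
	if err != nil {
		return
	}

	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == followTag {
			for _, a := range n.Attr {
				if a.Key == followAttr {
					crawl(a.Val, followTag, followAttr, depth-1, visited)
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
}

func main() {
	// Roughly what --follow-tag a --follow-attr href --depth 2 implies.
	crawl("https://example.com", "a", "href", 2, map[string]bool{})
}
```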
## Cache Management

Clear the cache and exit:

```bash
go run cmd/scraper/main.go --clear-cache
```

Skip the cache and force a refetch:

```bash
go run cmd/scraper/main.go --refresh --tag p --depth 0 https://example.com
```
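The README doesn't spell out how the cache works internally, but a file-based cache keyed by a hash of the URL is one plausible shape. The sketch below is an assumption (the `.cache` directory, key scheme, and `fetch` are all hypothetical); the real implementation is in `internal/cache`:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
)

const cacheDir = ".cache" // assumed location

// cachePath derives a stable filename for a URL from its SHA-256 hash.
func cachePath(url string) string {
	sum := sha256.Sum256([]byte(url))
	return filepath.Join(cacheDir, hex.EncodeToString(sum[:])+".html")
}

// fetch returns the cached body for url unless refresh is set, in which
// case it refetches and rewrites the cache entry.
func fetch(url string, refresh bool) ([]byte, error) {
	path := cachePath(url)
	if !refresh {
		if body, err := os.ReadFile(path); err == nil {
			return body, nil // cache hit: no network call
		}
	}
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, errors.New("unexpected status: " + resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	if err := os.MkdirAll(cacheDir, 0o755); err != nil {
		return nil, err
	}
	return body, os.WriteFile(path, body, 0o644)
}

func main() {
	if body, err := fetch("https://example.com", false); err == nil {
		fmt.Println(len(body), "bytes fetched")
	}
}
```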
## Output

Depending on the `--format` flag, the scraper will save output to:

- `output.json` (structured JSON of the scraped data)
- `output.md` (human-readable Markdown summary)

You can customize the filename with `--out`.
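For a sense of what the two exporters can look like, here is an illustrative sketch. The `Result` shape, function names, and output layout are assumptions; the project's real exporters live in `internal/export` and their schema may differ:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// Result is a hypothetical record of one scraped page.
type Result struct {
	URL     string   `json:"url"`
	Tag     string   `json:"tag"`
	Matches []string `json:"matches"`
}

// exportJSON writes the results as pretty-printed JSON to <prefix>.json.
func exportJSON(prefix string, results []Result) error {
	data, err := json.MarshalIndent(results, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(prefix+".json", data, 0o644)
}

// exportMarkdown writes a simple per-URL summary to <prefix>.md.
func exportMarkdown(prefix string, results []Result) error {
	var b strings.Builder
	for _, r := range results {
		fmt.Fprintf(&b, "## %s\n\n", r.URL)
		for _, m := range r.Matches {
			fmt.Fprintf(&b, "- %s\n", m)
		}
		b.WriteString("\n")
	}
	return os.WriteFile(prefix+".md", []byte(b.String()), 0o644)
}

func main() {
	results := []Result{{URL: "https://example.com", Tag: "p", Matches: []string{"hello"}}}
	_ = exportJSON("output", results)
	_ = exportMarkdown("output", results)
}
```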
## Makefile Commands

To simplify development and usage, the project includes a Makefile with the following commands:

| Command | Description |
|---|---|
| `make run` | Run the scraper with an example config (Wikipedia, tag `p`, depth 0) |
| `make json` | Run the scraper and export output in JSON format (from Wikipedia) |
| `make md` | Run the scraper and export output in Markdown format (from Wikipedia) |
| `make clear-cache` | Wipe all cached HTML responses stored locally |
| `make build` | Compile the web scraper binary to `./scraper` |
| `make clean` | Remove the compiled binary and clear the cache directory |
| `make help` | Display all available Makefile commands with descriptions |
💡 You can easily modify the `Makefile` to plug in your own URLs, tags, or output formats.
## Tips

- Always pass `--depth 0` if you only want the base page and don't want to follow additional links.
- Use `--refresh` to avoid a stale cache.
- For automation and large-scale scraping, consider throttling requests or using a custom user agent; both are easy to add (see the sketch after this list).
- The scraper cannot handle JavaScript-heavy content, since pure HTML parsing does not execute scripts.
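As a sketch of how request throttling with a custom user agent could be added (this is not currently part of the scraper; every name below is hypothetical):

```go
package main

import (
	"io"
	"net/http"
	"time"
)

// throttledClient blocks on a ticker between requests, so at most one
// request is issued per interval.
type throttledClient struct {
	client *http.Client
	ticker *time.Ticker
	agent  string
}

func newThrottledClient(interval time.Duration, agent string) *throttledClient {
	return &throttledClient{
		client: &http.Client{Timeout: 10 * time.Second},
		ticker: time.NewTicker(interval),
		agent:  agent,
	}
}

// get waits for the next tick, then issues a GET with the custom User-Agent.
func (t *throttledClient) get(url string) ([]byte, error) {
	<-t.ticker.C // rate limit: block until the next tick
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", t.agent)
	resp, err := t.client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

func main() {
	c := newThrottledClient(2*time.Second, "my-scraper/1.0 (+contact@example.com)")
	if body, err := c.get("https://example.com"); err == nil {
		_ = body // parse the HTML here
	}
}
```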
Made with ❤️ by Yashwanth Aravind