🕸️ Parallel WebScraper

Author: Yashwanth Aravind
GitHub: https://github.com/yash27007/web-scraper


A powerful and minimal command-line web scraper written in Go, designed to recursively extract HTML content and follow links with optional caching and export support (JSON or Markdown).
Built with concurrency and extensibility in mind, the scraper lets you easily crawl and parse websites from your terminal.


🚀 Features

  • 🔄 Caching: Automatically caches responses to reduce redundant network calls.
  • 🌐 Recursive crawling: Follow links with depth control.
  • 🏷️ Tag & Attribute extraction: Extract any HTML tag or attribute.
  • 📄 Multiple export formats: Output to .json or .md.
  • ⚙️ Command-line flags: Full customization over depth, tags, attributes, export type, and more.
  • 🧹 Clearable cache: One-line command to wipe cached content.
  • 💥 No third-party scraping frameworks: only the Go standard library plus golang.org/x/net/html (a minimal extraction sketch follows this list).
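
As a rough illustration of that last point, extracting a tag/attribute pair with only net/http and golang.org/x/net/html can look like the sketch below. This is not the project's actual internal/scraper code, just an assumption-level example of the technique; the URL and function names are made up.

package main

import (
	"fmt"
	"net/http"

	"golang.org/x/net/html"
)

// extractAttr walks the parsed HTML tree and collects the value of attr
// from every element named tag (e.g. tag="a", attr="href").
func extractAttr(n *html.Node, tag, attr string, out *[]string) {
	if n.Type == html.ElementNode && n.Data == tag {
		for _, a := range n.Attr {
			if a.Key == attr {
				*out = append(*out, a.Val)
			}
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		extractAttr(c, tag, attr, out)
	}
}

func main() {
	resp, err := http.Get("https://example.com")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	doc, err := html.Parse(resp.Body)
	if err != nil {
		panic(err)
	}

	var links []string
	extractAttr(doc, "a", "href", &links)
	for _, link := range links {
		fmt.Println(link)
	}
}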

📦 Project Structure


web-scraper/
├── cmd/
│   └── scraper/          # CLI entry point
│       └── main.go
├── internal/
│   ├── cache/            # Caching logic
│   ├── export/           # Exporters (JSON, Markdown)
│   └── scraper/          # Core scraping logic
├── go.mod
├── Makefile              # Shortcut commands
└── README.md


🛠️ Installation

✅ Prerequisites

  • Go 1.20+ installed
  • Git (to clone the repo)

💻 Clone the Repository

git clone https://github.com/yash27007/web-scraper.git
cd web-scraper

🔧 Usage

📌 Basic Syntax

go run cmd/scraper/main.go [flags] <urls>

🧪 Example 1: Scrape all <a href> from a single page

go run cmd/scraper/main.go \
  --tag a --attr href \
  --depth 0 \
  https://en.wikipedia.org/wiki/Web_development

🧪 Example 2: Scrape <p> tags from multiple blogs

go run cmd/scraper/main.go \
  --tag p \
  --depth 0 \
  https://www.forrester.com/blogs/ \
  https://www.elevenwriting.com/blog/the-pros-cons-of-blogging-on-medium

🧪 Example 3: Extract images

go run cmd/scraper/main.go \
  --tag img --attr src \
  --depth 0 \
  https://en.wikipedia.org/wiki/Web_development

🔍 CLI Flags

Flag | Description | Default
--tag | HTML tag to extract (p, a, img, h1, etc.) | p
--attr | Optional attribute to extract from the tag (href, src, etc.) | (empty)
--follow-tag | HTML tag to follow for links | a
--follow-attr | Attribute to read the follow URL from | href
--depth | Maximum recursion depth for crawling | 2
--format | Output format: json or md | json
--out | Output filename prefix (no extension) | output
--refresh | Skip the cache and force a refetch | false
--clear-cache | Clear the cache and exit | false
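
Putting several of these flags together, a full invocation might look like this (the URL and output name are arbitrary examples):

go run cmd/scraper/main.go \
  --tag a --attr href \
  --follow-tag a --follow-attr href \
  --depth 1 \
  --format md --out wiki-links \
  https://en.wikipedia.org/wiki/Web_development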

🧹 Cache Management

Clear all cache files:

go run cmd/scraper/main.go --clear-cache

Force refresh (skip cache for fresh scraping):

go run cmd/scraper/main.go --refresh --tag p --depth 0 https://example.com
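
Conceptually, the cache behaves like a simple URL-keyed store on disk. The sketch below only illustrates that idea and is not the actual internal/cache implementation; the directory name, hashing scheme, and function names are assumptions.

package cache

import (
	"crypto/sha256"
	"encoding/hex"
	"os"
	"path/filepath"
)

const cacheDir = ".cache" // assumed location, for illustration only

// key derives a stable filename from a URL.
func key(url string) string {
	sum := sha256.Sum256([]byte(url))
	return hex.EncodeToString(sum[:])
}

// Get returns the cached body for url, if one exists.
func Get(url string) ([]byte, bool) {
	data, err := os.ReadFile(filepath.Join(cacheDir, key(url)))
	if err != nil {
		return nil, false
	}
	return data, true
}

// Put stores body under the hash of url.
func Put(url string, body []byte) error {
	if err := os.MkdirAll(cacheDir, 0o755); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(cacheDir, key(url)), body, 0o644)
}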

📄 Output

Depending on the --format flag, the scraper will save output to:

  • output.json (structured JSON of scraped data)
  • output.md (human-readable Markdown summary)

You can customize the filename with --out.
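
For example, the following run writes its Markdown output to wiki.md instead of output.md:

go run cmd/scraper/main.go \
  --tag p --depth 0 \
  --format md --out wiki \
  https://en.wikipedia.org/wiki/Web_development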

🧰 Makefile Commands

To simplify development and usage, the project includes a Makefile with the following commands:

Command | Description
make run | Run the scraper with an example config (Wikipedia, tag <p>, depth 0)
make json | Run the scraper and export the output as JSON (from Wikipedia)
make md | Run the scraper and export the output as Markdown (from Wikipedia)
make clear-cache | Wipe all locally cached HTML responses
make build | Compile the scraper binary to ./scraper
make clean | Remove the compiled binary and clear the cache directory
make help | List all available Makefile commands with descriptions

💡 You can modify the Makefile to plug in your own URLs, tags, or output formats easily.
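
Based on the flag table above, make json presumably wraps an invocation along these lines; this is an assumption for orientation only, so check the Makefile itself for the exact command:

go run cmd/scraper/main.go \
  --tag p --depth 0 \
  --format json --out output \
  https://en.wikipedia.org/wiki/Web_development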


📌 Notes & Recommendations

  • Pass --depth 0 if you only want the base page and do not want the scraper to follow any links.
  • Use --refresh when the cached copy of a page may be stale.
  • For automation and large-scale scraping, consider throttling requests and setting a custom user agent; neither is built in yet, but both can be added easily (see the sketch after this list).
  • JavaScript-heavy pages are not supported: the scraper parses raw HTML only and does not execute scripts.
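
As a sketch of the throttling and user-agent point above, one possible approach is a small rate-limited client like the one below. None of this exists in the project yet; the type and function names are made up for illustration.

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

type politeClient struct {
	client    *http.Client
	userAgent string
	ticker    *time.Ticker // enforces a minimum gap between requests
}

func newPoliteClient(gap time.Duration, ua string) *politeClient {
	return &politeClient{
		client:    &http.Client{Timeout: 15 * time.Second},
		userAgent: ua,
		ticker:    time.NewTicker(gap),
	}
}

func (p *politeClient) get(url string) (string, error) {
	<-p.ticker.C // wait for the next tick before each request
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("User-Agent", p.userAgent)
	resp, err := p.client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	c := newPoliteClient(2*time.Second, "web-scraper-example/0.1")
	body, err := c.get("https://example.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(body), "bytes fetched")
}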

🧑‍💻 Author

Made with ❤️ by Yashwanth Aravind
