🕸️ Parallel WebScraper

Author: Yashwanth Aravind
GitHub: https://github.com/yash27007/web-scraper


A powerful and minimal command-line web scraper written in Go, designed to recursively extract HTML content and follow links with optional caching and export support (JSON or Markdown).
Built with concurrency and extensibility in mind, the scraper lets you easily crawl and parse websites from your terminal.


🚀 Features

  • 🔄 Caching: Automatically caches responses to reduce redundant network calls.
  • 🌐 Recursive crawling: Follow links with depth control.
  • 🏷️ Tag & Attribute extraction: Extract any HTML tag or attribute.
  • 📄 Multiple export formats: Output to .json or .md.
  • ⚙️ Command-line flags: Full customization over depth, tags, attributes, export type, and more.
  • 🧹 Clearable cache: One-line command to wipe cached content.
  • 💥 No third-party scraping frameworks: only the Go standard library plus golang.org/x/net/html (a minimal extraction sketch follows this list).
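
As a rough illustration of that last point, extracting a tag/attribute pair with only net/http and golang.org/x/net/html can look like the sketch below. This is not the project's actual internal/scraper code, just an assumption-level example of the technique; the URL and function names are made up.

package main

import (
	"fmt"
	"net/http"

	"golang.org/x/net/html"
)

// extractAttr walks the parsed HTML tree and collects the value of attr
// from every element named tag (e.g. tag="a", attr="href").
func extractAttr(n *html.Node, tag, attr string, out *[]string) {
	if n.Type == html.ElementNode && n.Data == tag {
		for _, a := range n.Attr {
			if a.Key == attr {
				*out = append(*out, a.Val)
			}
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		extractAttr(c, tag, attr, out)
	}
}

func main() {
	resp, err := http.Get("https://example.com")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	doc, err := html.Parse(resp.Body)
	if err != nil {
		panic(err)
	}

	var links []string
	extractAttr(doc, "a", "href", &links)
	for _, link := range links {
		fmt.Println(link)
	}
}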

📦 Project Structure


web-scraper/
├── cmd/
│   └── scraper/          # CLI entry point
│       └── main.go
├── internal/
│   ├── cache/            # Caching logic
│   ├── export/           # Exporters (JSON, Markdown)
│   └── scraper/          # Core scraping logic
├── go.mod
├── Makefile              # Shortcut commands
└── README.md


🛠️ Installation

✅ Prerequisites

  • Go 1.20+ installed
  • Git (to clone the repo)

💻 Clone the Repository

git clone https://github.com/yash27007/web-scraper.git
cd web-scraper

🔧 Usage

📌 Basic Syntax

go run cmd/scraper/main.go [flags] <urls>

🧪 Example 1: Scrape all <a href> from a single page

go run cmd/scraper/main.go \
  --tag a --attr href \
  --depth 0 \
  https://en.wikipedia.org/wiki/Web_development

🧪 Example 2: Scrape <p> tags from multiple blogs

go run cmd/scraper/main.go \
  --tag p \
  --depth 0 \
  https://www.forrester.com/blogs/ \
  https://www.elevenwriting.com/blog/the-pros-cons-of-blogging-on-medium

🧪 Example 3: Extract images

go run cmd/scraper/main.go \
  --tag img --attr src \
  --depth 0 \
  https://en.wikipedia.org/wiki/Web_development

🔍 CLI Flags

Flag | Description | Default
--tag | HTML tag to extract (p, a, img, h1, etc.) | p
--attr | Optional attribute to extract from the tag (href, src, etc.) | (empty)
--follow-tag | HTML tag to follow for links | a
--follow-attr | Attribute to read the follow URL from | href
--depth | Maximum recursion depth for crawling | 2
--format | Output format: json or md | json
--out | Output filename prefix (no extension) | output
--refresh | Skip the cache and force a refetch | false
--clear-cache | Clear the cache and exit | false
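
Putting several of these flags together, a full invocation might look like this (the URL and output name are arbitrary examples):

go run cmd/scraper/main.go \
  --tag a --attr href \
  --follow-tag a --follow-attr href \
  --depth 1 \
  --format md --out wiki-links \
  https://en.wikipedia.org/wiki/Web_development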

🧹 Cache Management

Clear all cache files:

go run cmd/scraper/main.go --clear-cache

Force refresh (skip cache for fresh scraping):

go run cmd/scraper/main.go --refresh --tag p --depth 0 https://example.com
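
Conceptually, the cache behaves like a simple URL-keyed store on disk. The sketch below only illustrates that idea and is not the actual internal/cache implementation; the directory name, hashing scheme, and function names are assumptions.

package cache

import (
	"crypto/sha256"
	"encoding/hex"
	"os"
	"path/filepath"
)

const cacheDir = ".cache" // assumed location, for illustration only

// key derives a stable filename from a URL.
func key(url string) string {
	sum := sha256.Sum256([]byte(url))
	return hex.EncodeToString(sum[:])
}

// Get returns the cached body for url, if one exists.
func Get(url string) ([]byte, bool) {
	data, err := os.ReadFile(filepath.Join(cacheDir, key(url)))
	if err != nil {
		return nil, false
	}
	return data, true
}

// Put stores body under the hash of url.
func Put(url string, body []byte) error {
	if err := os.MkdirAll(cacheDir, 0o755); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(cacheDir, key(url)), body, 0o644)
}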

📄 Output

Depending on the --format flag, the scraper will save output to:

  • output.json (structured JSON of scraped data)
  • output.md (human-readable Markdown summary)

You can customize the filename with --out.
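
For example, the following run writes its Markdown output to wiki.md instead of output.md:

go run cmd/scraper/main.go \
  --tag p --depth 0 \
  --format md --out wiki \
  https://en.wikipedia.org/wiki/Web_development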

🧰 Makefile Commands

To simplify development and usage, the project includes a Makefile with the following commands:

Command | Description
make run | Run the scraper with an example config (Wikipedia, tag <p>, depth 0)
make json | Run the scraper and export the output as JSON (from Wikipedia)
make md | Run the scraper and export the output as Markdown (from Wikipedia)
make clear-cache | Wipe all locally cached HTML responses
make build | Compile the scraper binary to ./scraper
make clean | Remove the compiled binary and clear the cache directory
make help | List all available Makefile commands with descriptions

💡 You can modify the Makefile to plug in your own URLs, tags, or output formats easily.
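
Based on the flag table above, make json presumably wraps an invocation along these lines; this is an assumption for orientation only, so check the Makefile itself for the exact command:

go run cmd/scraper/main.go \
  --tag p --depth 0 \
  --format json --out output \
  https://en.wikipedia.org/wiki/Web_development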


📌 Notes & Recommendations

  • Pass --depth 0 if you only want the base page and do not want the scraper to follow any links.
  • Use --refresh when the cached copy of a page may be stale.
  • For automation and large-scale scraping, consider throttling requests and setting a custom user agent; neither is built in yet, but both can be added easily (see the sketch after this list).
  • JavaScript-heavy pages are not supported: the scraper parses raw HTML only and does not execute scripts.
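
As a sketch of the throttling and user-agent point above, one possible approach is a small rate-limited client like the one below. None of this exists in the project yet; the type and function names are made up for illustration.

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

type politeClient struct {
	client    *http.Client
	userAgent string
	ticker    *time.Ticker // enforces a minimum gap between requests
}

func newPoliteClient(gap time.Duration, ua string) *politeClient {
	return &politeClient{
		client:    &http.Client{Timeout: 15 * time.Second},
		userAgent: ua,
		ticker:    time.NewTicker(gap),
	}
}

func (p *politeClient) get(url string) (string, error) {
	<-p.ticker.C // wait for the next tick before each request
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("User-Agent", p.userAgent)
	resp, err := p.client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	c := newPoliteClient(2*time.Second, "web-scraper-example/0.1")
	body, err := c.get("https://example.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(body), "bytes fetched")
}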

🧑‍💻 Author

Made with ❤️ by Yashwanth Aravind
