# web-octopus

A concurrent, channel-pipeline web crawler in Go.

This release modernizes the project for current Go module workflows, testing expectations, and maintainability standards.
- Highlights
- Installation
- Quick start
- Architecture
- Configuration reference
- Output adapters
- Testing
- Versioning and release
- Compatibility notes
## Highlights

- Uses Go modules (`go.mod`) instead of the legacy `go get`-only workflow.
- Includes automated unit tests for crawler defaults, validation behavior, pipeline helpers, and adapter output.
- Improved adapter safety around file handling and error paths.
- Expanded docs with architecture details and operational guidance.
## Installation

```sh
go get github.com/rapidclock/web-octopus@v1.3.0
```

Import packages:

```go
import (
	"github.com/rapidclock/web-octopus/adapter"
	"github.com/rapidclock/web-octopus/octopus"
)
```

## Quick start

```go
package main

import (
	"github.com/rapidclock/web-octopus/adapter"
	"github.com/rapidclock/web-octopus/octopus"
)

func main() {
	opAdapter := &adapter.StdOpAdapter{}

	options := octopus.GetDefaultCrawlOptions()
	options.MaxCrawlDepth = 3
	options.TimeToQuit = 10
	options.CrawlRatePerSec = 5
	options.CrawlBurstLimitPerSec = 8
	options.OpAdapter = opAdapter

	crawler := octopus.New(options)
	crawler.SetupSystem()
	crawler.BeginCrawling("https://www.example.com")
}
```

## Architecture

web-octopus uses a staged channel pipeline. Nodes (URLs plus metadata) flow through filter and processing stages:
- Ingest
- Link absolution
- Protocol filter
- Duplicate filter
- URL validation (`HEAD`)
- Optional rate limiter
- Page requisition (`GET`)
- Distributor
- Output adapter stream
- Max delay watchdog stream
- Max crawled links limiter (optional)
- Crawl depth filter
- HTML parsing back into ingest
This design allows localized extension: adapters can be swapped and options adjusted without touching other stages, while preserving high concurrency.
## Configuration reference

`CrawlOptions` controls crawler behavior:

- `MaxCrawlDepth int64` - max depth for crawled nodes.
- `MaxCrawledUrls int64` - max total unique URLs; `-1` means unlimited.
- `CrawlRatePerSec int64` - request rate limit; negative to disable.
- `CrawlBurstLimitPerSec int64` - burst capacity for rate limiting.
- `IncludeBody bool` - include the body in the crawled node (currently internal pipeline behavior).
- `OpAdapter OutputAdapter` - required output sink.
- `ValidProtocols []string` - accepted URL schemes (e.g., `http`, `https`).
- `TimeToQuit int64` - max idle seconds before automatic quit.
Use:

```go
opts := octopus.GetDefaultCrawlOptions()
```

Default values are tuned for local experimentation:

- Depth: `2`
- Max links: `-1` (unbounded)
- Rate limit: disabled
- Protocols: `http`, `https`
- Timeout gap: `30s`
## Output adapters

The crawler emits processed nodes through the `OutputAdapter` interface:

```go
type OutputAdapter interface {
	Consume() *NodeChSet
}
```

- `adapter.StdOpAdapter` - prints `count - depth - URL` lines to stdout.
- `adapter.FileWriterAdapter` - writes `depth - URL` lines to a file.

To write a custom adapter: create channels, return a `*octopus.NodeChSet`, and consume nodes in a goroutine. Always handle quit signals to avoid goroutine leaks.
## Testing

Run the full test suite:

```sh
go test ./...
```

Recommended local checks before release:

```sh
go test ./... -race
go vet ./...
```

This repository uses GitHub Actions (not Travis CI):

- CI workflow (`.github/workflows/ci.yml`) runs automatically on PR open/sync/reopen and on pushes to the default branch. It validates module tidiness, formatting, vet/staticcheck, and test suites (including race detection).
- Publish workflow (`.github/workflows/publish.yml`) runs only when a GitHub Release is published (excluding prereleases), validates tag/version alignment, and triggers indexing on both the Go proxy and pkg.go.dev so new versions are discoverable quickly.
## Versioning and release

Release flow:

1. Update `VERSION` and `CHANGELOG.md`.
2. Merge to the default branch.
3. Create and push a tag `vX.Y.Z` matching `VERSION`.
4. Publish a GitHub Release for that tag.
5. The GitHub Actions publish workflow handles the Go portal refresh calls.
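The tagging step above can be sketched as follows, assuming the `VERSION` file holds the bare version number (e.g. `1.3.0`):

```sh
# Create and push a tag matching the contents of VERSION.
git tag "v$(cat VERSION)"
git push origin "v$(cat VERSION)"
```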
- Project follows semantic versioning.
- Current release in this repository: v1.3.0.
- See `CHANGELOG.md` for release notes.
## Compatibility notes

- Legacy examples using old `go get` package paths still map to the same module path.
- Existing adapters remain source-compatible.