Network Fetcher for Page Content #21

@charlieroth

Why

The source document for each saved item must be retrieved. This requires robust network behavior and content normalization.

Description of Done

  • Given an item identifier and a URL, the fetcher downloads the document with timeouts and redirects handled
  • Compressed responses are supported. Character encoding is detected and normalized to UTF-8
  • Robots directives (for example robots.txt) and common anti-bot headers are respected where feasible
  • Failures are mapped to retryable vs non-retryable categories
  • Unit tests stub network calls and cover timeouts, redirects, bad certificates, and content encodings
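
The retryable vs non-retryable mapping above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the names `FailureKind`, `RETRYABLE_STATUSES`, and `classify_failure` are hypothetical, and the choice of which status codes count as transient is an assumption that the issue does not specify.

```python
import socket
import ssl
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    RETRYABLE = "retryable"
    NON_RETRYABLE = "non_retryable"


# Assumption: these statuses are treated as transient. The issue does not list them.
RETRYABLE_STATUSES = {408, 425, 429, 500, 502, 503, 504}


def classify_failure(
    status: Optional[int] = None,
    error: Optional[BaseException] = None,
) -> FailureKind:
    """Map an HTTP status or a raised exception to a retry category."""
    if error is not None:
        # A bad certificate is unlikely to fix itself on the next attempt.
        if isinstance(error, ssl.SSLCertVerificationError):
            return FailureKind.NON_RETRYABLE
        # Timeouts and connection-level errors are usually transient.
        if isinstance(error, (TimeoutError, socket.timeout, ConnectionError)):
            return FailureKind.RETRYABLE
        # Unknown errors default to non-retryable to avoid retry loops.
        return FailureKind.NON_RETRYABLE
    if status is not None and status in RETRYABLE_STATUSES:
        return FailureKind.RETRYABLE
    # Everything else (for example 404 or 410) is treated as permanent.
    return FailureKind.NON_RETRYABLE
```

Defaulting unknown failures to non-retryable is a conservative design choice here; a job runner that prefers eventual success over fast failure might reasonably flip that default.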

Tasks

  • Add client with connect, request and total timeouts
  • Enable automatic redirect following with a safe maximum
  • Set User-Agent and Accept-Encoding headers
  • Implement content decoding and character set detection
  • Classify errors: network, server, client, permanent not found
  • Return a typed result used by extractor and by the job runner
  • Write unit tests using a local stub server
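
As one possible shape for the content decoding, character set handling, and typed result tasks above, here is a standard-library sketch. All names (`FetchResult`, `decompress`, `charset_from_content_type`, `to_text`) are hypothetical, and the fallback order (declared charset, then UTF-8, then Latin-1) is an assumption. A fuller detector would likely also inspect byte-order marks and `<meta charset>` tags, which this sketch omits.

```python
import gzip
import zlib
from dataclasses import dataclass
from email.message import Message
from typing import Optional, Tuple


@dataclass(frozen=True)
class FetchResult:
    """Hypothetical typed result handed to the extractor and the job runner."""
    item_id: str
    final_url: str
    status: int
    text: str      # decoded body; encode as UTF-8 when persisting
    charset: str   # charset the raw bytes were decoded from


def decompress(body: bytes, content_encoding: Optional[str]) -> bytes:
    """Undo gzip or deflate compression based on the Content-Encoding value."""
    enc = (content_encoding or "").strip().lower()
    if enc in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    if enc == "deflate":
        try:
            return zlib.decompress(body)  # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(body, -zlib.MAX_WBITS)  # raw deflate
    return body


def charset_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def to_text(body: bytes, declared: Optional[str]) -> Tuple[str, str]:
    """Decode bytes to str, returning the text and the charset that worked."""
    for name in (declared, "utf-8"):
        if not name:
            continue
        try:
            return body.decode(name), name
        except (LookupError, UnicodeDecodeError):
            continue  # unknown charset label or invalid bytes; try the next one
    # Latin-1 maps every byte to a code point, so this last resort cannot fail.
    return body.decode("latin-1"), "latin-1"
```

A caller might combine these as `text, charset = to_text(decompress(raw, encoding_header), charset_from_content_type(type_header))` and then build a `FetchResult` from the response metadata.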

Labels

enhancement: New feature or request