Persist Content in Database #23

@charlieroth

Description

Why

Storing both the original and the cleaned representations enables re-extraction and auditing.

Definition of Done

  • New columns exist for cleaned text, cleaned HTML, language, extracted_at timestamp, and checksum
  • Checksums prevent duplicate writes when content has not changed
  • Large bodies are stored efficiently and streamed to the database
  • Database constraints protect referential integrity
  • Migration is idempotent and reversible
  • Unit tests cover insert, update and no-op when checksum matches
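One possible shape for a migration that meets the idempotent and reversible criteria is sketched below. It is only an illustration: the issue does not name a database or migration tool, so SQLite (via Python's standard `sqlite3` module) is used here to keep the example runnable, and the table name `item_content`, the parent table `items(id)`, and the functions `migrate_up` / `migrate_down` are all hypothetical.

```python
import sqlite3

# Hypothetical schema: table, column, and index names are assumptions
# chosen for illustration; only the column list comes from the issue.
UP = """
CREATE TABLE IF NOT EXISTS item_content (
    -- PRIMARY KEY gives an implicit index on the item identifier;
    -- REFERENCES enforces referential integrity against the parent table.
    item_id      INTEGER PRIMARY KEY REFERENCES items(id) ON DELETE CASCADE,
    cleaned_text TEXT,
    cleaned_html TEXT,
    language     TEXT,
    extracted_at TEXT,
    checksum     TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_item_content_checksum
    ON item_content(checksum);
"""

DOWN = """
DROP INDEX IF EXISTS idx_item_content_checksum;
DROP TABLE IF EXISTS item_content;
"""


def migrate_up(conn: sqlite3.Connection) -> None:
    # IF NOT EXISTS makes repeated runs a no-op (idempotent).
    conn.executescript(UP)


def migrate_down(conn: sqlite3.Connection) -> None:
    # IF EXISTS makes the rollback safe to run more than once.
    conn.executescript(DOWN)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY)")
    migrate_up(conn)
    migrate_up(conn)    # second run changes nothing
    migrate_down(conn)  # reverses the migration
```

With a different database the same idea applies, but the exact DDL (column types, `IF NOT EXISTS` support, how foreign keys are enabled) would differ.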

Tasks

  • Database migration to add content table and needed columns
  • Define data access functions to upsert by item_id and checksum
  • Stream large payloads to avoid excessive memory usage
  • Compute checksum from normalized text
  • Add index on item identifier and on checksum
  • Write unit tests for persistence behavior
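The checksum and upsert tasks could fit together roughly as follows. This is a minimal sketch under stated assumptions, not the project's implementation: SQLite stands in for the real database, SHA-256 and the whitespace-collapsing `normalize` function are assumed choices the issue does not specify, and `item_content`, `checksum`, and `upsert_content` are hypothetical names.

```python
import hashlib
import sqlite3
import unicodedata
from datetime import datetime, timezone

# Hypothetical table; only the column list is taken from the issue.
SCHEMA = """
CREATE TABLE IF NOT EXISTS item_content (
    item_id INTEGER PRIMARY KEY,
    cleaned_text TEXT, cleaned_html TEXT, language TEXT,
    extracted_at TEXT, checksum TEXT NOT NULL
)
"""


def normalize(text: str) -> str:
    # Assumed normalization: Unicode NFC, then collapse runs of whitespace.
    return " ".join(unicodedata.normalize("NFC", text).split())


def checksum(text: str) -> str:
    # SHA-256 over the normalized text (the hash choice is an assumption).
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def upsert_content(conn, item_id, cleaned_text, cleaned_html, language) -> str:
    """Return 'inserted', 'updated', or 'noop' when the checksum matches."""
    digest = checksum(cleaned_text)
    row = conn.execute(
        "SELECT checksum FROM item_content WHERE item_id = ?", (item_id,)
    ).fetchone()
    if row is not None and row[0] == digest:
        return "noop"  # content unchanged, so skip the write entirely
    conn.execute(
        """
        INSERT INTO item_content
            (item_id, cleaned_text, cleaned_html, language, extracted_at, checksum)
        VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT(item_id) DO UPDATE SET
            cleaned_text = excluded.cleaned_text,
            cleaned_html = excluded.cleaned_html,
            language     = excluded.language,
            extracted_at = excluded.extracted_at,
            checksum     = excluded.checksum
        """,
        (item_id, cleaned_text, cleaned_html, language,
         datetime.now(timezone.utc).isoformat(), digest),
    )
    conn.commit()
    return "inserted" if row is None else "updated"


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    print(upsert_content(conn, 1, "Hello world", "<p>Hello world</p>", "en"))
    print(upsert_content(conn, 1, "Hello   world\n", "<p>Hello world</p>", "en"))
    print(upsert_content(conn, 1, "Hello there", "<p>Hello there</p>", "en"))
```

Because the checksum is computed from the normalized text only, a change that touches just the cleaned HTML would be treated as a no-op in this sketch; whether that is acceptable is a design decision for the issue. The three return values map directly onto the insert, update, and no-op cases the unit tests are meant to cover.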

Metadata

Labels

enhancement: New feature or request
