Skip to content

Content acquisition and ingestion service for YouTube, RSS, and podcasts. FastAPI + SQLite with background subscription monitoring, text chunking, and integration with Engram (semantic search) and transcription services.

Notifications You must be signed in to change notification settings

eddiedunn/curator

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Curator

Content acquisition and curation service for the PAI ecosystem.

Overview

Curator handles:

  • Content ingestion from multiple sources (YouTube, RSS, podcasts)
  • Subscription monitoring for new content
  • Content chunking for semantic search
  • Integration with Engram (embedding service) and Transcribe (transcription service)

Architecture

curator/
├── src/curator/
│   ├── api.py              # FastAPI REST API
│   ├── cli.py              # Click CLI
│   ├── config.py           # Pydantic Settings
│   ├── storage.py          # SQLite subscription database
│   ├── daemon.py           # APScheduler background daemon
│   ├── orchestrator.py     # Ingestion workflow orchestrator
│   ├── chunking.py         # Text chunking (from pai-corpus)
│   ├── models.py           # Pydantic models
│   └── plugins/
│       ├── base.py         # BasePlugin interface
│       ├── youtube.py      # YouTube plugin (from pai-ingest)
│       ├── rss.py          # RSS/Atom plugin (planned)
│       └── podcast.py      # Podcast plugin (planned)

Installation

cd ~/code/curator
pip install -e .

Usage

CLI

# Start API server
curator serve

# Ingest a single URL
curator ingest https://youtube.com/watch?v=VIDEO_ID

# Manage subscriptions
curator subscription list
curator subscription add "Channel Name" https://youtube.com/@channel
curator subscription remove 1

# List ingested items
curator items

# Run subscription daemon
curator daemon

API

Start the server:

curator serve

API endpoints:

  • GET /health - Health check
  • POST /subscriptions - Create subscription
  • GET /subscriptions - List subscriptions
  • GET /subscriptions/{id} - Get subscription
  • PATCH /subscriptions/{id} - Update subscription
  • DELETE /subscriptions/{id} - Delete subscription
  • POST /ingest - Ingest a URL
  • GET /ingest/{job_id} - Get ingestion job status
  • GET /items - List ingested items
  • GET /items/{id} - Get ingested item

Configuration

Environment variables (prefix with CURATOR_):

# Service
CURATOR_SERVICE_NAME=curator
CURATOR_ENVIRONMENT=development
CURATOR_DEBUG=false

# API
CURATOR_API_HOST=0.0.0.0
CURATOR_API_PORT=8003

# Database
CURATOR_DATABASE_URL=sqlite:///./curator.db

# Storage
CURATOR_DATA_DIR=./data

# External services
CURATOR_ENGRAM_URL=http://localhost:8001
CURATOR_TRANSCRIBE_URL=http://localhost:8002

# Plugin configuration
CURATOR_YOUTUBE_COOKIES_PATH=/path/to/cookies.txt

# Ingestion
CURATOR_DEFAULT_CHUNK_TOKENS=500
CURATOR_MAX_CONCURRENT_INGESTIONS=3

Development

Running tests

pytest tests/

Code structure

  • api.py: FastAPI REST endpoints
  • cli.py: Click command-line interface
  • storage.py: SQLite database operations
  • orchestrator.py: Ingestion workflow coordination
  • daemon.py: Background subscription monitoring
  • plugins/: Content source plugins (YouTube, RSS, etc.)

Deployment

See deploy/ directory for Docker and Podman configurations.

License

MIT

About

Content acquisition and ingestion service for YouTube, RSS, and podcasts. FastAPI + SQLite with background subscription monitoring, text chunking, and integration with Engram (semantic search) and transcription services.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors