Adaptive Knowledge Ingestion Pipeline #33

@Steake

Description

GödelOS Adaptive Ingestion, Emphasized Chunking, Mid-Range CPU Efficiency, Selectable Analysis Levels

Role: Senior systems agent.
Mission: Deliver an adaptive, CPU-only ingestion pipeline optimized for mid-range hardware (≈8 cores, 16 GB RAM). The pipeline ingests large PDFs and diverse text files, embeds them into the existing custom vector DB, and builds a categorized knowledge graph consumed by the frontend. Ship a redesigned, persistent Jobs UI with granular progress and predictive ETAs, and make the custom vector DB effective (no duplicated ANN/search logic in the app).


Priorities (in order)

  1. Frontend Jobs UX (highest): persistent jobs (beyond modal), highly granular progress, predictive ETAs before starting, full job management, responsive across all viewports, clean visual design.

  2. Emphasized chunking strategy: layout/sentence-aware, semantically stable chunks tuned for downstream retrieval; parameterized by user-selectable analysis levels.

  3. Mid-range CPU efficiency: autotune threads/batches/queues for 4–16 cores and ~16 GB RAM; memory-safe; sustained throughput.

  4. Custom vector DB effectiveness: embeddings live and are searched in the DB; tighten schema & APIs; avoid re-implementing search/ANN client-side.

  5. End-to-end semantic integrity: PDF in → sensible chunking/embeddings → vectors stored → graph built from vector neighbors → frontend renders categorized nodes with labeled edges.


System Overview

```mermaid
graph TD
  U[User] --> UI[Frontend: Jobs & Graph Views]
  UI <-->|REST + WS| API[Ingestion & Graph API]
  API --> SCH[Scheduler + Autotuner]
  SCH --> EX[Extractor + Chunker]
  EX --> EMB[Embedder CPU]
  EMB --> VDB[(Custom Vector DB)]
  VDB --> KGB[Graph Builder kNN → edges]
  KGB --> KG[(Knowledge Graph Store)]
  KG --> UI
```

Selectable Analysis Levels (user picks before start)

| Level | Chunk Tokens | Overlap | Model (CPU) | k (Top-K) | Dedup Threshold | Extra Processing | Typical Use |
|-------|--------------|---------|-------------|-----------|-----------------|------------------|-------------|
| Fast | 650–800 | 60–90 | all-MiniLM-L6-v2 (ONNX/Int8) | 10 | simhash ≥ 0.92 | basic metadata | quick loads |
| Balanced | 750–900 | 100–120 | all-MiniLM-L6-v2 (or MPNet if RAM > 12 GB) | 15 | simhash ≥ 0.88 | heading/keyword tags → lightweight concepts | most docs |
| Deep | 500–700 | 120–160 | all-mpnet-base-v2 (only if Autotuner OK) | 20 | simhash ≥ 0.85 | richer tagging; tighter neighbor threshold | high recall |
  • Frontend: level selector in preflight; show ETA p50/p90 per level so users can choose speed vs depth.

  • Backend: the selected level drives chunk sizing, embedding model, dedup, and kNN parameters (see the config sketch below).
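A minimal sketch of how the level table could drive backend parameters; the `LevelConfig` name and exact fields are illustrative, not an existing API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LevelConfig:
    """Per-level ingestion parameters (values from the table above)."""
    chunk_tokens: tuple[int, int]  # min/max tokens per chunk
    overlap: tuple[int, int]       # min/max overlapping tokens between chunks
    model: str                     # CPU embedding model id
    top_k: int                     # neighbors fetched per chunk
    dedup_threshold: float         # simhash similarity cutoff

LEVELS = {
    "fast":     LevelConfig((650, 800), (60, 90),   "all-MiniLM-L6-v2",  10, 0.92),
    "balanced": LevelConfig((750, 900), (100, 120), "all-MiniLM-L6-v2",  15, 0.88),
    "deep":     LevelConfig((500, 700), (120, 160), "all-mpnet-base-v2", 20, 0.85),
}
```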


Chunking Strategy (emphasis)

  1. Layout/sentence aware: for PDFs, prefer block/heading/paragraph segmentation; fall back to plain-text sentence windows.

  2. Token windows: apply the level-specific Chunk Tokens and Overlap values (table above); see the windowing sketch after this list.

  3. Stability under edits: avoid straddling headings across chunks; keep references with their paragraph.

  4. Deduplication: simhash/minhash before embedding; skip duplicates and upsert metadata instead (simhash sketched after the flowchart).

  5. Quality signals: per-chunk token count, punctuation ratio, heading proximity; store as metadata.

  6. Batching: dynamic batch sizing (16→64) via Autotuner.
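As a concrete reference for steps 1–2, here is a minimal token-windowing sketch; the regex sentence splitter and whitespace token counter are simplified stand-ins for layout parsing and the embedding model's own tokenizer:

```python
import re

def sentence_split(text: str) -> list[str]:
    # Naive splitter; the real pipeline would apply layout cues first.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def window_chunks(sentences, max_tokens=800, overlap=80, count=lambda s: len(s.split())):
    """Greedy token windows with sentence-aligned overlap.

    `max_tokens` and `overlap` come from the selected level; `count` is a
    stand-in for a real tokenizer.
    """
    chunks, cur, cur_tokens = [], [], 0
    for sent in sentences:
        n = count(sent)
        if cur and cur_tokens + n > max_tokens:
            chunks.append(" ".join(cur))
            # Carry trailing sentences forward until `overlap` tokens are kept.
            kept, kept_tokens = [], 0
            for s in reversed(cur):
                kept_tokens += count(s)
                kept.insert(0, s)
                if kept_tokens >= overlap:
                    break
            cur, cur_tokens = kept, kept_tokens
        cur.append(sent)
        cur_tokens += n
    if cur:
        chunks.append(" ".join(cur))
    return chunks
```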

```mermaid
flowchart TD
  A[File] --> B[Layout & Sentence Parse]
  B --> C{Chunk Windowing level params}
  C --> D[Dedup simhash/minhash]
  D -->|unique| E[Embed CPU]
  E --> F[(Vector DB Upsert)]
  F --> G[Top-K per Chunk]
  G --> H[Graph Builder nodes/edges]
```
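For the dedup step in the flow above, a compact 64-bit simhash sketch; word-level hashing is illustrative, and production code might add shingling or minhash as noted in the strategy list:

```python
import hashlib

def simhash64(text: str) -> int:
    """64-bit simhash over whitespace tokens."""
    weights = [0] * 64
    for token in text.lower().split():
        h = int.from_bytes(hashlib.blake2b(token.encode(), digest_size=8).digest(), "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)

def similarity(a: int, b: int) -> float:
    """Fraction of matching bits; compare against the level's dedup threshold."""
    return 1.0 - bin(a ^ b).count("1") / 64
```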

Mid-Range CPU Efficiency (autotune)

  • Inputs: logical cores, OS/cgroup limits, free RAM, I/O, stage latencies.

  • Controls: workers, batch size, queue depth, spill thresholds.

  • Policy:

    • Start num_workers = min(cores−2, 8); adjust ±1 every 5–10 s.

    • Keep the working set ≤ 12 GB; spill to disk when RSS reaches 85% of the memory budget.

    • Grow/shrink batch size dynamically to maintain throughput.

    • Apply backpressure to producer queues when memory pressure is high (see the sketch below).
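A rough sketch of one adjustment tick under this policy; the throughput inputs and the psutil-based RSS probe are assumptions about the surrounding plumbing:

```python
import os
import psutil  # assumed available for RSS probes

RAM_BUDGET = 12 * 1024**3  # ≤12 GB working set
SPILL_FRACTION = 0.85

def initial_workers() -> int:
    """Start at num_workers = min(cores−2, 8), at least 1."""
    cores = os.cpu_count() or 4
    return min(max(cores - 2, 1), 8)

def autotune_step(workers: int, batch: int, throughput_now: float, throughput_prev: float):
    """One ±1 adjustment, run every 5–10 s, with memory backpressure first."""
    rss = psutil.Process().memory_info().rss
    if rss > SPILL_FRACTION * RAM_BUDGET:
        return max(workers - 1, 1), max(batch // 2, 16)  # shed memory before anything else
    if throughput_now >= throughput_prev:
        return workers, min(batch + 8, 64)               # grow batch toward the 64 ceiling
    return max(workers - 1, 1), batch                    # throughput regressed: shrink workers
```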


Custom Vector DB — make it effective

  • Contract: fixed dimension & metric validated on connect; idempotent upsert keyed on content hash (hash_sha1); batch upsert; Top-K search with filters; stats endpoints.

  • Performance: memory-mapped vectors, contiguous arrays, adjustable search params, thread-pool aware.

  • No duplication: ANN/search logic stays inside the DB (contract sketched below).
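One way to pin the contract down as a typed interface; the method names and `Hit` shape are illustrative, not the DB's actual API:

```python
from dataclasses import dataclass
from typing import Any, Protocol, Sequence

@dataclass
class Hit:
    chunk_id: str
    score: float
    metadata: dict[str, Any]

class VectorDB(Protocol):
    def connect(self, dim: int, metric: str) -> None:
        """Fail fast if dim/metric don't match the stored index."""
    def upsert_batch(self, ids: Sequence[str], vectors: Sequence[Sequence[float]],
                     metadata: Sequence[dict[str, Any]]) -> None:
        """Idempotent on content hash (hash_sha1): re-ingest updates metadata only."""
    def top_k(self, vector: Sequence[float], k: int,
              filters: dict[str, Any] | None = None) -> list[Hit]: ...
    def stats(self) -> dict[str, Any]: ...
```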


Knowledge Graph from vector neighbors

  • Nodes: Document, Chunk, Concept.

  • Edges: CONTAINS, SIMILAR_TO, TAGGED_AS.

  • Threshold τ: adapt from similarity distribution.

  • Expose via: GET /api/graph/{docId} (edge construction sketched below).
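A sketch of SIMILAR_TO edge construction from DB neighbors, with τ taken from a high percentile of the observed similarity distribution; the percentile choice and the `db.top_k` interface (from the contract sketch above) are assumptions:

```python
import statistics

def build_similarity_edges(db, chunk_ids, chunk_vectors, k=15, tau_percentile=90):
    """Create SIMILAR_TO edges from Top-K neighbors above an adaptive threshold τ."""
    hits_per_chunk = [db.top_k(vec, k) for vec in chunk_vectors]
    scores = [h.score for hits in hits_per_chunk for h in hits]
    # Adaptive τ: keep only the strongest tail of observed similarities.
    tau = statistics.quantiles(scores, n=100)[tau_percentile - 1]
    edges = []
    for cid, hits in zip(chunk_ids, hits_per_chunk):
        for h in hits:
            if h.chunk_id != cid and h.score >= tau:
                edges.append((cid, "SIMILAR_TO", h.chunk_id, h.score))
    return edges
```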


Frontend — Jobs UI

  • Jobs page: status pills, overall & per-stage bars, ETA, actions.

  • Job detail: preflight predictions (per level), live telemetry, outputs.

  • Responsive: grid layout at widths ≥ 1024 px; stacked cards below 768 px.

Preflight prediction

  • Tokenize sample (1–2 MB) to estimate tokens/MB and chunk count.

  • Micro-benchmark embedding to estimate chunks/sec and ETA p50/p90.

  • Present ETAs for all levels before starting (see the sketch below).
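The preflight math can stay simple, as in this sketch; the 1.5× p90 safety factor and the sampling scheme are assumptions:

```python
import time

def preflight_eta(sample_text: str, sample_bytes: int, total_bytes: int,
                  chunker, embed_batch_fn):
    """Estimate chunk count and ETA p50/p90 from a 1–2 MB sample."""
    chunks = chunker(sample_text)
    chunks_per_byte = len(chunks) / sample_bytes
    est_chunks = int(chunks_per_byte * total_bytes)

    # Micro-benchmark: embed a small batch to measure chunks/sec on this host.
    bench = chunks[:16]
    t0 = time.perf_counter()
    embed_batch_fn(bench)
    chunks_per_sec = len(bench) / (time.perf_counter() - t0)

    p50 = est_chunks / chunks_per_sec
    return {"est_chunks": est_chunks, "eta_p50_s": p50, "eta_p90_s": p50 * 1.5}
```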


APIs

  • POST /api/import/preflight
  • POST /api/import/jobs (with pause|resume|cancel job actions)
  • GET /api/import/jobs
  • GET /api/graph/{docId}
  • GET /api/health
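A minimal FastAPI sketch of these routes; handler bodies, request models, and parameter shapes are placeholders, not the actual implementation:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PreflightRequest(BaseModel):
    path: str          # file to analyze
    levels: list[str]  # levels to predict ETAs for

@app.post("/api/import/preflight")
def preflight(req: PreflightRequest):
    # Sample, tokenize, micro-benchmark; return per-level ETA p50/p90.
    ...

@app.post("/api/import/jobs")
def create_job(path: str, level: str):
    # Enqueue an ingestion job; returns a job id for the Jobs UI to track.
    ...

@app.get("/api/import/jobs")
def list_jobs():
    # Persistent job records power the Jobs page across reloads.
    ...

@app.get("/api/graph/{docId}")
def get_graph(docId: str):
    # Nodes (Document/Chunk/Concept) and labeled edges for the frontend.
    ...
```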

Testing & Acceptance

  • Importing a ≥300 MB PDF on an 8-core/16 GB host completes without OOM.

  • Vectors are stored in the DB, the graph is built, and the frontend shows categorized nodes/edges.

  • Preflight ETA is accurate to within ±25%, measured 2 min into the job.

  • Jobs UI persists across reloads and stays responsive across devices.

  • The vector DB is the single source of truth for embeddings/search.

  • Semantic sanity: self-query MRR@10 ≥ 0.6; spot-check relevant results (see the sketch below).
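One reading of the self-query check, as a sketch reusing the `db.top_k` interface from the contract section: each chunk queries the DB with its own embedding and should find itself at or near rank 1:

```python
def self_query_mrr_at_10(db, chunk_ids, chunk_vectors) -> float:
    """Mean reciprocal rank of each chunk retrieving itself within its Top-10."""
    rr_sum = 0.0
    for cid, vec in zip(chunk_ids, chunk_vectors):
        hits = db.top_k(vec, 10)
        for rank, h in enumerate(hits, start=1):
            if h.chunk_id == cid:
                rr_sum += 1.0 / rank
                break
    return rr_sum / len(chunk_ids)
```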


Deliverables

  • Adaptive ingestion workers + Autotuner.

  • Level-driven chunking.

  • Tightened vector DB contract.

  • Graph builder using DB neighbors.

  • Persistent, responsive Jobs UI.

  • Markdown docs with mermaid diagrams.
