# GödelOS Adaptive Ingestion, Emphasized Chunking, Mid-Range CPU Efficiency, Selectable Analysis Levels
**Role:** Senior systems agent.

**Mission:** Deliver an adaptive, CPU-only ingestion pipeline optimized for mid-range hardware (≈8 cores, 16 GB RAM) that ingests large PDFs and diverse text files, embeds them into an existing custom vector DB, and builds a categorized knowledge graph used by the frontend. Ship a redesigned, persistent Jobs UI with granular progress and predictive ETAs. Make the custom vector DB effective (no duplicated ANN/search logic in the app).
## Priorities (in order)
1. **Frontend Jobs UX (highest):** persistent jobs (beyond the modal), highly granular progress, predictive ETAs before starting, full job management, responsive across all viewports, clean visual design.
2. **Emphasized chunking strategy:** layout/sentence-aware, semantically stable chunks tuned for downstream retrieval; parameterized by user-selectable analysis levels.
3. **Mid-range CPU efficiency:** autotune threads/batches/queues for 4–16 cores and ~16 GB RAM; memory-safe; sustained throughput.
4. **Custom vector DB effectiveness:** embeddings live in, and are searched from, the DB; tighten the schema and APIs; avoid re-implementing search/ANN client-side.
5. **End-to-end semantic integrity:** PDF in → sensible chunking/embeddings → vectors stored → graph built from vector neighbors → frontend renders categorized nodes with labeled edges.
## System Overview
```mermaid
graph TD
  U[User] --> UI[Frontend: Jobs & Graph Views]
  UI <-->|REST + WS| API[Ingestion & Graph API]
  API --> SCH[Scheduler + Autotuner]
  SCH --> EX[Extractor + Chunker]
  EX --> EMB[Embedder CPU]
  EMB --> VDB[(Custom Vector DB)]
  VDB --> KGB[Graph Builder kNN → edges]
  KGB --> KG[(Knowledge Graph Store)]
  KG --> UI
```
## Selectable Analysis Levels (user picks before start)
| Level | Chunk Tokens | Overlap | Model (CPU) | k (Top-K) | Dedup Threshold | Extra Processing | Typical Use |
|---|---|---|---|---|---|---|---|
| Fast | 650–800 | 60–90 | all-MiniLM-L6-v2 (ONNX/Int8) | 10 | simhash ≥ 0.92 | basic metadata | quick loads |
| Balanced | 750–900 | 100–120 | all-MiniLM-L6-v2 (or MPNet if RAM > 12 GB) | 15 | simhash ≥ 0.88 | heading/keyword tags → lightweight concepts | most docs |
| Deep | 500–700 | 120–160 | all-mpnet-base-v2 (only if Autotuner OK) | 20 | simhash ≥ 0.85 | richer tagging; tighter neighbor threshold | high recall |
- **Frontend:** level selector in preflight; show ETA p50/p90 per level so users can choose speed vs. depth.
- **Backend:** level drives chunk sizing, embedding model, dedup, and kNN parameters (a configuration sketch follows this list).
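
A minimal sketch of how these presets might be wired into the backend; the `AnalysisLevel` dataclass and `LEVELS` map are illustrative names, not existing GödelOS code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnalysisLevel:
    chunk_tokens: tuple[int, int]   # min/max tokens per chunk
    overlap: tuple[int, int]        # token overlap between windows
    model: str                      # CPU embedding model id
    top_k: int                      # neighbors per chunk for graph edges
    dedup_threshold: float          # simhash similarity treated as duplicate

LEVELS = {
    "fast":     AnalysisLevel((650, 800), (60, 90),   "all-MiniLM-L6-v2",  10, 0.92),
    "balanced": AnalysisLevel((750, 900), (100, 120), "all-MiniLM-L6-v2",  15, 0.88),
    "deep":     AnalysisLevel((500, 700), (120, 160), "all-mpnet-base-v2", 20, 0.85),
}
```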
## Chunking Strategy (emphasis)
- **Layout/sentence aware:** for PDFs, prefer block/heading/paragraph segmentation; fall back to plain-text sentence windows.
- **Token windows:** apply the level-specific Chunk Tokens and Overlap (table above).
- **Stability under edits:** avoid straddling headings across chunks; keep references with their paragraph.
- **Deduplication:** simhash/minhash before embedding; skip duplicates and upsert metadata instead (see the simhash sketch after the flowchart below).
- **Quality signals:** per-chunk token count, punctuation ratio, heading proximity; store as metadata.
- **Batching:** dynamic batch sizing (16→64) via the Autotuner.
```mermaid
flowchart TD
  A[File] --> B[Layout & Sentence Parse]
  B --> C{Chunk Windowing level params}
  C --> D[Dedup simhash/minhash]
  D -->|unique| E[Embed CPU]
  E --> F[(Vector DB Upsert)]
  F --> G[Top-K per Chunk]
  G --> H[Graph Builder nodes/edges]
```
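
One plausible reading of the simhash thresholds in the table above is bit-agreement on 64-bit fingerprints. A minimal sketch (the MD5-based token hashing is an assumption):

```python
import hashlib

def simhash(tokens: list[str], bits: int = 64) -> int:
    """Classic simhash: sum signed bit votes from per-token hashes."""
    votes = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(votes) if v > 0)

def similarity(a: int, b: int, bits: int = 64) -> float:
    """Fraction of matching fingerprint bits; 1.0 means identical."""
    return 1.0 - bin(a ^ b).count("1") / bits

# Skip embedding when a new chunk's fingerprint matches an already-seen
# one above the level's threshold (e.g. 0.92 for Fast); upsert metadata only.
```

A production dedup pass would bucket fingerprints (e.g., by band) rather than compare every pair.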
## Mid-Range CPU Efficiency (autotune)
- **Inputs:** logical cores, OS/cgroup limits, free RAM, I/O, stage latencies.
- **Controls:** workers, batch size, queue depth, spill thresholds.
- **Policy** (see the sketch after this list):
  - Start with `num_workers = min(cores - 2, 8)`; adjust ±1 every 5–10 s.
  - Keep the working set ≤ 12 GB; spill at 85% RSS.
  - Grow/shrink batch size dynamically to maintain throughput.
  - Apply backpressure to producer queues when memory pressure is high.
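
A sketch of one autotuner tick under the policy above, using `psutil` for memory telemetry; the worker pool and scheduler it would plug into are assumptions:

```python
import os
import psutil

MAX_MEM_PERCENT = 85       # backpressure/spill threshold from the policy
TARGET_WORKING_SET_GB = 12

def autotune_step(workers: int, batch_size: int,
                  throughput_delta: float) -> tuple[int, int]:
    """Return the next (workers, batch_size); call every 5-10 s."""
    cores = os.cpu_count() or 4
    rss_gb = psutil.Process().memory_info().rss / 2**30

    if rss_gb > TARGET_WORKING_SET_GB or psutil.virtual_memory().percent > MAX_MEM_PERCENT:
        # Memory pressure: shed load before touching anything else.
        return max(1, workers - 1), max(16, batch_size // 2)
    if throughput_delta > 0:
        # Last adjustment helped: keep growing toward the ceilings.
        return min(workers + 1, min(cores - 2, 8)), min(batch_size * 2, 64)
    return workers, batch_size  # hold steady
```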
## Custom Vector DB — make it effective
- **Contract:** fixed `dim` & `metric` validated on connect, idempotent upsert (`hash_sha1`), batch upsert, Top-K search with filters, stats endpoints (see the sketch after this list).
- **Performance:** memory-mapped vectors, contiguous arrays, adjustable search params, thread-pool aware.
- **No duplication:** ANN/search logic stays inside the DB.
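
A possible shape for the tightened client contract, expressed as a Python `Protocol`; all method names here are illustrative, not the DB's actual API:

```python
from typing import Any, Protocol, Sequence

class VectorDB(Protocol):
    """Hypothetical client contract for the custom vector DB."""

    def connect(self, dim: int, metric: str) -> None:
        """Fail fast if the store's dim/metric disagree with the caller's."""
        ...

    def upsert_batch(self, items: Sequence[dict[str, Any]]) -> int:
        """Idempotent on each item's `hash_sha1`; returns rows written."""
        ...

    def search(self, vector: Sequence[float], k: int,
               filters: dict[str, Any] | None = None) -> list[tuple[str, float]]:
        """Top-K (id, similarity) pairs; ANN logic lives inside the DB."""
        ...

    def stats(self) -> dict[str, Any]:
        """Vector count, index health, memory footprint, etc."""
        ...
```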
## Knowledge Graph from vector neighbors
- **Nodes:** `Document`, `Chunk`, `Concept`.
- **Edges:** `CONTAINS`, `SIMILAR_TO`, `TAGGED_AS`.
- **Threshold τ:** adapted from the similarity distribution (see the sketch after this list).
- **Expose via:** `GET /api/graph/{docId}`.
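
A sketch of the neighbor-to-edge step, reusing the hypothetical `VectorDB.search` above; the mean + 1σ rule for τ is one plausible way to adapt it from the observed similarity distribution:

```python
import statistics

def build_similar_to_edges(db, chunk_ids: list[str], vectors, k: int):
    """Collect Top-K neighbors per chunk, then keep edges above adaptive τ."""
    sims: list[float] = []
    candidates: list[tuple[str, str, float]] = []
    for cid, vec in zip(chunk_ids, vectors):
        for nid, score in db.search(vec, k=k + 1):
            if nid != cid:                       # skip the trivial self-match
                sims.append(score)
                candidates.append((cid, nid, score))
    tau = statistics.mean(sims) + statistics.stdev(sims)
    return [(a, "SIMILAR_TO", b, s) for a, b, s in candidates if s >= tau]
```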
## Frontend — Jobs UI
- **Jobs page:** status pills, overall & per-stage bars, ETA, actions.
- **Job detail:** preflight predictions (per level), live telemetry (event shape sketched below), outputs.
- **Responsive:** grid layout ≥ 1024 px, stacked cards < 768 px.
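
One possible shape for the per-tick progress event streamed over WS; every field name here is an assumption, not a fixed schema:

```python
from typing import TypedDict

class JobProgressEvent(TypedDict):
    """One WS message per tick, driving the granular progress bars and ETAs."""
    job_id: str
    stage: str              # e.g. "extract" | "chunk" | "embed" | "graph"
    stage_progress: float   # 0.0-1.0 within the current stage
    overall_progress: float # 0.0-1.0 across all stages
    eta_p50_s: float        # predictive ETA, median
    eta_p90_s: float        # predictive ETA, pessimistic
    throughput: float       # chunks/sec, for live telemetry
```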
## Preflight prediction
- Tokenize a sample (1–2 MB) to estimate tokens/MB and chunk count.
- Micro-benchmark embedding to estimate chunks/sec and ETA p50/p90.
- Present the ETA for all levels before starting (see the sketch after this list).
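
A sketch of the preflight estimate; the tokenizer and embedder are injected as callables because their real interfaces aren't specified here, and the fixed 1.5× p90 factor is an assumption:

```python
import time
from typing import Callable, Sequence

def preflight_eta(
    sample_text: str,
    file_mb: float,
    avg_chunk_tokens: float,
    tokenize: Callable[[str], Sequence[str]],
    embed_batch: Callable[[Sequence[str]], None],
) -> tuple[float, float]:
    """Return (p50, p90) ETA in seconds for one analysis level."""
    tokens = tokenize(sample_text)
    sample_mb = len(sample_text.encode()) / 2**20
    est_chunks = (len(tokens) / sample_mb) * file_mb / avg_chunk_tokens

    # Micro-benchmark: embed a small slice to measure chunks/sec on this host.
    step = 200
    bench = [" ".join(tokens[i:i + step])
             for i in range(0, min(len(tokens), 2000), step)]
    t0 = time.perf_counter()
    embed_batch(bench)
    chunks_per_sec = len(bench) / (time.perf_counter() - t0)

    p50 = est_chunks / chunks_per_sec
    return p50, p50 * 1.5
```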
## APIs
- `POST /api/import/preflight`
- `POST /api/import/jobs` (plus `pause|resume|cancel` job actions)
- `GET /api/import/jobs`
- `GET /api/graph/{docId}`
- `GET /api/health`
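
A route-level sketch of these endpoints in FastAPI style; the framework choice and the stub payloads are assumptions:

```python
from fastapi import FastAPI

app = FastAPI()

@app.post("/api/import/preflight")
async def preflight(payload: dict) -> dict:
    # Return per-level chunk-count and ETA p50/p90 predictions.
    return {"levels": {}}

@app.post("/api/import/jobs")
async def create_job(payload: dict) -> dict:
    # Enqueue an ingestion job; the id is used for WS progress subscription.
    return {"job_id": "placeholder"}

@app.get("/api/import/jobs")
async def list_jobs() -> list[dict]:
    return []

@app.get("/api/graph/{doc_id}")
async def graph(doc_id: str) -> dict:
    return {"nodes": [], "edges": []}

@app.get("/api/health")
async def health() -> dict:
    return {"ok": True}
```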
## Testing & Acceptance
- Importing a ≥300 MB PDF on an 8-core/16 GB host completes without OOM.
- Vectors stored in the DB, graph built, frontend shows categorized nodes/edges.
- Preflight ETA within ±25% after 2 min.
- Jobs UI persists across reloads and is responsive across devices.
- Vector DB is the single source of embeddings/search.
- Semantic sanity: self-query MRR@10 ≥ 0.6; spot-check relevant results (a scoring sketch follows this list).
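
One plausible scoring for the self-query check: embed each chunk, query the DB with it, and expect the chunk's own id near the top of the ranking. A minimal sketch:

```python
def mrr_at_10(rankings: list[list[str]], expected_ids: list[str]) -> float:
    """Mean reciprocal rank over self-queries, cut off at rank 10."""
    total = 0.0
    for ranking, expected in zip(rankings, expected_ids):
        for rank, cid in enumerate(ranking[:10], start=1):
            if cid == expected:
                total += 1.0 / rank
                break
    return total / len(rankings)
```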
## Deliverables
- Adaptive ingestion workers + Autotuner.
- Level-driven chunking.
- Tightened vector DB contract.
- Graph builder using DB neighbors.
- Persistent, responsive Jobs UI.
- Markdown docs with mermaid diagrams.