Adaptive Knowledge Ingestion Pipeline #33

@Steake

Description

GödelOS Adaptive Ingestion, Emphasized Chunking, Mid-Range CPU Efficiency, Selectable Analysis Levels

Role: Senior systems agent.
Mission: Deliver an adaptive, CPU-only ingestion pipeline optimized for mid-range hardware (≈8 cores, 16 GB RAM). The pipeline ingests large PDFs and diverse text files, embeds them into the existing custom vector DB, and builds a categorized knowledge graph consumed by the frontend. Ship a redesigned, persistent Jobs UI with granular progress and predictive ETAs, and make the custom vector DB effective (no duplicated ANN/search logic in the app).


Priorities (in order)

  1. Frontend Jobs UX (highest): persistent jobs (beyond modal), highly granular progress, predictive ETAs before starting, full job management, responsive across all viewports, clean visual design.

  2. Emphasized chunking strategy: layout/sentence-aware, semantically stable chunks tuned for downstream retrieval; parameterized by user-selectable analysis levels.

  3. Mid-range CPU efficiency: autotune threads/batches/queues for 4–16 cores and ~16 GB RAM; memory-safe; sustained throughput.

  4. Custom vector DB effectiveness: embeddings live and are searched in the DB; tighten schema & APIs; avoid re-implementing search/ANN client-side.

  5. End-to-end semantic integrity: PDF in → sensible chunking/embeddings → vectors stored → graph built from vector neighbors → frontend renders categorized nodes with labeled edges.


System Overview

```mermaid
graph TD
  U[User] --> UI[Frontend: Jobs & Graph Views]
  UI <-->|REST + WS| API[Ingestion & Graph API]
  API --> SCH[Scheduler + Autotuner]
  SCH --> EX[Extractor + Chunker]
  EX --> EMB[Embedder CPU]
  EMB --> VDB[(Custom Vector DB)]
  VDB --> KGB[Graph Builder kNN → edges]
  KGB --> KG[(Knowledge Graph Store)]
  KG --> UI
```

Selectable Analysis Levels (user picks before start)

| Level | Chunk Tokens | Overlap | Model (CPU) | k (Top-K) | Dedup Threshold | Extra Processing | Typical Use |
|-------|--------------|---------|-------------|-----------|-----------------|------------------|-------------|
| Fast | 650–800 | 60–90 | all-MiniLM-L6-v2 (ONNX/Int8) | 10 | simhash ≥ 0.92 | basic metadata | quick loads |
| Balanced | 750–900 | 100–120 | all-MiniLM-L6-v2 (or MPNet if RAM > 12 GB) | 15 | simhash ≥ 0.88 | heading/keyword tags → lightweight concepts | most docs |
| Deep | 500–700 | 120–160 | all-mpnet-base-v2 (only if Autotuner OK) | 20 | simhash ≥ 0.85 | richer tagging; tighter neighbor threshold | high recall |
  • Frontend: level selector in preflight; show ETA p50/p90 per level so users can choose speed vs depth.

  • Backend: the selected level drives chunk sizing, embedding model, dedup, and kNN parameters (see the config sketch below).
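A minimal sketch of how the level table could drive backend parameters; the `LevelConfig` name and exact fields are illustrative, not an existing API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LevelConfig:
    """Per-level ingestion parameters (values from the table above)."""
    chunk_tokens: tuple[int, int]  # min/max tokens per chunk
    overlap: tuple[int, int]       # min/max overlapping tokens between chunks
    model: str                     # CPU embedding model id
    top_k: int                     # neighbors fetched per chunk
    dedup_threshold: float         # simhash similarity cutoff

LEVELS = {
    "fast":     LevelConfig((650, 800), (60, 90),   "all-MiniLM-L6-v2",  10, 0.92),
    "balanced": LevelConfig((750, 900), (100, 120), "all-MiniLM-L6-v2",  15, 0.88),
    "deep":     LevelConfig((500, 700), (120, 160), "all-mpnet-base-v2", 20, 0.85),
}
```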


Chunking Strategy (emphasis)

  1. Layout/sentence aware: for PDFs, prefer block/heading/paragraph segmentation; fall back to plain-text sentence windows.

  2. Token windows: apply the level-specific Chunk Tokens and Overlap values (table above); see the windowing sketch after this list.

  3. Stability under edits: avoid straddling headings across chunks; keep references with their paragraph.

  4. Deduplication: simhash/minhash before embedding; skip duplicates and upsert metadata instead (simhash sketched after the flowchart).

  5. Quality signals: per-chunk token count, punctuation ratio, heading proximity; store as metadata.

  6. Batching: dynamic batch sizing (16→64) via Autotuner.
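As a concrete reference for steps 1–2, here is a minimal token-windowing sketch; the regex sentence splitter and whitespace token counter are simplified stand-ins for layout parsing and the embedding model's own tokenizer:

```python
import re

def sentence_split(text: str) -> list[str]:
    # Naive splitter; the real pipeline would apply layout cues first.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def window_chunks(sentences, max_tokens=800, overlap=80, count=lambda s: len(s.split())):
    """Greedy token windows with sentence-aligned overlap.

    `max_tokens` and `overlap` come from the selected level; `count` is a
    stand-in for a real tokenizer.
    """
    chunks, cur, cur_tokens = [], [], 0
    for sent in sentences:
        n = count(sent)
        if cur and cur_tokens + n > max_tokens:
            chunks.append(" ".join(cur))
            # Carry trailing sentences forward until `overlap` tokens are kept.
            kept, kept_tokens = [], 0
            for s in reversed(cur):
                kept_tokens += count(s)
                kept.insert(0, s)
                if kept_tokens >= overlap:
                    break
            cur, cur_tokens = kept, kept_tokens
        cur.append(sent)
        cur_tokens += n
    if cur:
        chunks.append(" ".join(cur))
    return chunks
```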

```mermaid
flowchart TD
  A[File] --> B[Layout & Sentence Parse]
  B --> C{Chunk Windowing level params}
  C --> D[Dedup simhash/minhash]
  D -->|unique| E[Embed CPU]
  E --> F[(Vector DB Upsert)]
  F --> G[Top-K per Chunk]
  G --> H[Graph Builder nodes/edges]
```
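For the dedup step in the flow above, a compact 64-bit simhash sketch; word-level hashing is illustrative, and production code might add shingling or minhash as noted in the strategy list:

```python
import hashlib

def simhash64(text: str) -> int:
    """64-bit simhash over whitespace tokens."""
    weights = [0] * 64
    for token in text.lower().split():
        h = int.from_bytes(hashlib.blake2b(token.encode(), digest_size=8).digest(), "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)

def similarity(a: int, b: int) -> float:
    """Fraction of matching bits; compare against the level's dedup threshold."""
    return 1.0 - bin(a ^ b).count("1") / 64
```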

Mid-Range CPU Efficiency (autotune)

  • Inputs: logical cores, OS/cgroup limits, free RAM, I/O, stage latencies.

  • Controls: workers, batch size, queue depth, spill thresholds.

  • Policy:

    • Start num_workers = min(cores−2, 8); adjust ±1 every 5–10 s.

    • Keep the working set ≤ 12 GB; spill to disk when RSS reaches 85% of the memory budget.

    • Grow/shrink batch size dynamically to maintain throughput.

    • Apply backpressure to producer queues when memory pressure is high (see the sketch below).
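A rough sketch of one adjustment tick under this policy; the throughput inputs and the psutil-based RSS probe are assumptions about the surrounding plumbing:

```python
import os
import psutil  # assumed available for RSS probes

RAM_BUDGET = 12 * 1024**3  # ≤12 GB working set
SPILL_FRACTION = 0.85

def initial_workers() -> int:
    """Start at num_workers = min(cores−2, 8), at least 1."""
    cores = os.cpu_count() or 4
    return min(max(cores - 2, 1), 8)

def autotune_step(workers: int, batch: int, throughput_now: float, throughput_prev: float):
    """One ±1 adjustment, run every 5–10 s, with memory backpressure first."""
    rss = psutil.Process().memory_info().rss
    if rss > SPILL_FRACTION * RAM_BUDGET:
        return max(workers - 1, 1), max(batch // 2, 16)  # shed memory before anything else
    if throughput_now >= throughput_prev:
        return workers, min(batch + 8, 64)               # grow batch toward the 64 ceiling
    return max(workers - 1, 1), batch                    # throughput regressed: shrink workers
```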


Custom Vector DB — make it effective

  • Contract: fixed dimension & metric validated on connect; idempotent upsert keyed on content hash (hash_sha1); batch upsert; Top-K search with filters; stats endpoints.

  • Performance: memory-mapped vectors, contiguous arrays, adjustable search params, thread-pool aware.

  • No duplication: ANN/search logic stays inside the DB (contract sketched below).
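One way to pin the contract down as a typed interface; the method names and `Hit` shape are illustrative, not the DB's actual API:

```python
from dataclasses import dataclass
from typing import Any, Protocol, Sequence

@dataclass
class Hit:
    chunk_id: str
    score: float
    metadata: dict[str, Any]

class VectorDB(Protocol):
    def connect(self, dim: int, metric: str) -> None:
        """Fail fast if dim/metric don't match the stored index."""
    def upsert_batch(self, ids: Sequence[str], vectors: Sequence[Sequence[float]],
                     metadata: Sequence[dict[str, Any]]) -> None:
        """Idempotent on content hash (hash_sha1): re-ingest updates metadata only."""
    def top_k(self, vector: Sequence[float], k: int,
              filters: dict[str, Any] | None = None) -> list[Hit]: ...
    def stats(self) -> dict[str, Any]: ...
```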


Knowledge Graph from vector neighbors

  • Nodes: Document, Chunk, Concept.

  • Edges: CONTAINS, SIMILAR_TO, TAGGED_AS.

  • Threshold τ: adapt from similarity distribution.

  • Expose via: GET /api/graph/{docId} (edge construction sketched below).
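A sketch of SIMILAR_TO edge construction from DB neighbors, with τ taken from a high percentile of the observed similarity distribution; the percentile choice and the `db.top_k` interface (from the contract sketch above) are assumptions:

```python
import statistics

def build_similarity_edges(db, chunk_ids, chunk_vectors, k=15, tau_percentile=90):
    """Create SIMILAR_TO edges from Top-K neighbors above an adaptive threshold τ."""
    hits_per_chunk = [db.top_k(vec, k) for vec in chunk_vectors]
    scores = [h.score for hits in hits_per_chunk for h in hits]
    # Adaptive τ: keep only the strongest tail of observed similarities.
    tau = statistics.quantiles(scores, n=100)[tau_percentile - 1]
    edges = []
    for cid, hits in zip(chunk_ids, hits_per_chunk):
        for h in hits:
            if h.chunk_id != cid and h.score >= tau:
                edges.append((cid, "SIMILAR_TO", h.chunk_id, h.score))
    return edges
```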


Frontend — Jobs UI

  • Jobs page: status pills, overall & per-stage bars, ETA, actions.

  • Job detail: preflight predictions (per level), live telemetry, outputs.

  • Responsive: grid layout at widths ≥ 1024 px; stacked cards below 768 px.

Preflight prediction

  • Tokenize sample (1–2 MB) to estimate tokens/MB and chunk count.

  • Micro-benchmark embedding to estimate chunks/sec and ETA p50/p90.

  • Present ETAs for all levels before starting (see the sketch below).
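The preflight math can stay simple, as in this sketch; the 1.5× p90 safety factor and the sampling scheme are assumptions:

```python
import time

def preflight_eta(sample_text: str, sample_bytes: int, total_bytes: int,
                  chunker, embed_batch_fn):
    """Estimate chunk count and ETA p50/p90 from a 1–2 MB sample."""
    chunks = chunker(sample_text)
    chunks_per_byte = len(chunks) / sample_bytes
    est_chunks = int(chunks_per_byte * total_bytes)

    # Micro-benchmark: embed a small batch to measure chunks/sec on this host.
    bench = chunks[:16]
    t0 = time.perf_counter()
    embed_batch_fn(bench)
    chunks_per_sec = len(bench) / (time.perf_counter() - t0)

    p50 = est_chunks / chunks_per_sec
    return {"est_chunks": est_chunks, "eta_p50_s": p50, "eta_p90_s": p50 * 1.5}
```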


APIs

  • POST /api/import/preflight
  • POST /api/import/jobs (with pause|resume|cancel job actions)
  • GET /api/import/jobs
  • GET /api/graph/{docId}
  • GET /api/health
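A minimal FastAPI sketch of these routes; handler bodies, request models, and parameter shapes are placeholders, not the actual implementation:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PreflightRequest(BaseModel):
    path: str          # file to analyze
    levels: list[str]  # levels to predict ETAs for

@app.post("/api/import/preflight")
def preflight(req: PreflightRequest):
    # Sample, tokenize, micro-benchmark; return per-level ETA p50/p90.
    ...

@app.post("/api/import/jobs")
def create_job(path: str, level: str):
    # Enqueue an ingestion job; returns a job id for the Jobs UI to track.
    ...

@app.get("/api/import/jobs")
def list_jobs():
    # Persistent job records power the Jobs page across reloads.
    ...

@app.get("/api/graph/{docId}")
def get_graph(docId: str):
    # Nodes (Document/Chunk/Concept) and labeled edges for the frontend.
    ...
```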

Testing & Acceptance

  • Importing a ≥300 MB PDF on an 8-core/16 GB host completes without OOM.

  • Vectors are stored in the DB, the graph is built, and the frontend shows categorized nodes/edges.

  • Preflight ETA is accurate to within ±25%, measured 2 min into the job.

  • Jobs UI persists across reloads and stays responsive across devices.

  • The vector DB is the single source of truth for embeddings/search.

  • Semantic sanity: self-query MRR@10 ≥ 0.6; spot-check relevant results (see the sketch below).
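One reading of the self-query check, as a sketch reusing the `db.top_k` interface from the contract section: each chunk queries the DB with its own embedding and should find itself at or near rank 1:

```python
def self_query_mrr_at_10(db, chunk_ids, chunk_vectors) -> float:
    """Mean reciprocal rank of each chunk retrieving itself within its Top-10."""
    rr_sum = 0.0
    for cid, vec in zip(chunk_ids, chunk_vectors):
        hits = db.top_k(vec, 10)
        for rank, h in enumerate(hits, start=1):
            if h.chunk_id == cid:
                rr_sum += 1.0 / rank
                break
    return rr_sum / len(chunk_ids)
```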


Deliverables

  • Adaptive ingestion workers + Autotuner.

  • Level-driven chunking.

  • Tightened vector DB contract.

  • Graph builder using DB neighbors.

  • Persistent, responsive Jobs UI.

  • Markdown docs with mermaid diagrams.
