Add local semantic search MVP for Smart Notes #35

base: main

Conversation
📝 Walkthrough

This pull request introduces a complete offline-first semantic search module to the codebase. It includes markdown file ingestion and chunking, local embedding generation via SentenceTransformer, FAISS-based vector storage, and a CLI tool for searching markdown content through semantic similarity.
Sequence Diagram

```mermaid
sequenceDiagram
actor User
participant CLI as search.py
participant FileSystem
participant Chunker as chunker.py
participant Embedder as embedder.py
participant VectorStore as vector_store.py
participant FAISS
User->>CLI: python -m semantic_search.search notes_folder "query"
CLI->>FileSystem: load_markdown_files(notes_folder)
FileSystem-->>CLI: list of .md file paths
loop For each markdown file
CLI->>Chunker: chunk_markdown_file(file_path)
Chunker->>FileSystem: read file
FileSystem-->>Chunker: markdown content
Chunker->>Chunker: _clean_markdown() → plain text
Chunker->>Chunker: split & accumulate into chunks
Chunker-->>CLI: List[Dict(text, source)]
end
CLI->>Embedder: embed_texts(all_chunks)
Embedder->>Embedder: SentenceTransformer.encode()
Embedder-->>CLI: List[List[float]] embeddings
CLI->>VectorStore: __init__(embedding_dim)
VectorStore->>FAISS: create IndexFlatIP
CLI->>VectorStore: add(embeddings, metadatas)
VectorStore->>FAISS: add vectors
CLI->>Embedder: embed_texts([query])
Embedder-->>CLI: query_embedding
CLI->>VectorStore: search(query_embedding, k=5)
VectorStore->>FAISS: search(query_vector)
FAISS-->>VectorStore: top-k indices & scores
VectorStore-->>CLI: List[Dict(text, source, score)]
CLI->>User: print results with sources & snippets
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
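The flow above can also be read as a short driver script. The following is a hypothetical sketch only: the chunker, embedder, and vector-store method signatures are assumptions taken from the diagram, and the file-gathering step is inlined with `pathlib` since the location of `load_markdown_files` is not shown in the PR.

```python
# Hypothetical driver mirroring the sequence diagram; APIs assumed from the
# diagram, not copied from the PR.
import sys
from pathlib import Path

from semantic_search.chunker import chunk_markdown_file
from semantic_search.embedder import Embedder
from semantic_search.vector_store import VectorStore


def run(notes_folder: str, query: str) -> None:
    # Gather chunks from every .md file under the folder.
    chunks = []
    for path in sorted(Path(notes_folder).glob("*.md")):
        chunks.extend(chunk_markdown_file(str(path)))  # -> [{"text": ..., "source": ...}]

    # Embed all chunk texts in one batch.
    embedder = Embedder()
    embeddings = embedder.embed_texts([c["text"] for c in chunks])

    # Index vectors plus metadata in FAISS, then embed and run the query.
    store = VectorStore(embedding_dim=len(embeddings[0]))
    store.add(embeddings, chunks)
    results = store.search(embedder.embed_texts([query])[0], k=5)

    for res in results:  # each hit carries text, source, and similarity score
        print(f"- {res['source']} (score: {res['score']:.3f})")


if __name__ == "__main__":
    run(sys.argv[1], sys.argv[2])
```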
🚥 Pre-merge checks: 2 passed, 1 failed (warning)
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@semantic_search/chunker.py`:
- Around line 43-45: The loop building chunks can exceed max_chars by one
because it always concatenates " " + para but checks len(current_chunk) +
len(para) <= max_chars; update the logic in the loop that iterates over
paragraphs to account for the extra space only when current_chunk is non-empty
(e.g., check len(current_chunk) + (1 if current_chunk else 0) + len(para) <=
max_chars), and change the concatenation to add the space conditionally (append
para directly if current_chunk is empty, otherwise append " " + para) so you
also avoid leading spaces.
- Around line 5-16: The _clean_markdown function only strips a few specific
tags, leaving other HTML from markdown.markdown() in the output; update
_clean_markdown(md_text: str) to fully remove all HTML tags and decode HTML
entities before returning plain text (e.g., run the markdown output through an
HTML-to-text step such as using an HTML parser or a robust regex to strip tags
and html.unescape to decode entities) so that headings, lists, links, emphasis,
code tags, etc. are removed and the returned text is clean for embedding; ensure
you modify the function containing markdown.markdown(md_text) and return the
fully cleaned plain text.
In `@semantic_search/README.md`:
- Around line 17-18: The README has an unclosed fenced code block around the
example command; close the triple-backtick fence after the shown command (the
code block that starts before the line containing "python -m
semantic_search.search semantic_search/test_notes \"your query here\"") so the
rest of the document renders correctly, and optionally add a short installation
note above or below the example such as a pip install instruction (e.g.,
reference to requirements.txt) to show how to install dependencies before
running the command.
In `@semantic_search/search.py`:
- Around line 46-48: The loop printing search results always appends "..." even
for short texts; update the block that iterates over results (the for res in
results loop, using res['text'], res['source'], res['score']) to compute a
snippet = res['text'] if len(res['text']) <= 200 else res['text'][:200] + "..."
and print snippet instead of always printing res['text'][:200] + "...", so the
ellipsis is only added when the original text was truncated.
- Around line 34-39: The current code assumes embeddings exist and does
len(embeddings[0]) which raises IndexError when all_chunks/texts are empty;
after calling Embedder().embed_texts(texts) check if embeddings is empty and
handle it: either return early (skip building the VectorStore) or instantiate
VectorStore using a safe default embedding dimension obtained from the embedder
(e.g. add/use Embedder.embedding_dim or Embedder.get_embedding_dim()) instead of
indexing embeddings[0]; ensure you never call len(embeddings[0]) without first
verifying embeddings is non-empty and update the VectorStore creation site to
use the safe dimension or the early-return path.
🧹 Nitpick comments (2)

semantic_search/requirements.txt (1)

1-3: Consider pinning dependency versions or adding upper bounds for reproducibility. Using only `>=` constraints risks breaking changes from future major releases. For a project meant to be a foundational building block, consider pinning to compatible ranges (e.g., `~=2.2.2` or `>=2.2.2,<3`); a sketch of a pinned file follows the next comment.

semantic_search/vector_store.py (1)
34-40: Consider `strict=True` on `zip()` for defensive coding. Per Ruff B905, adding `strict=True` to `zip(scores[0], indices[0])` would catch unexpected length mismatches, though in practice FAISS guarantees equal-length arrays.

Proposed fix

```diff
-        for score, idx in zip(scores[0], indices[0]):
+        for score, idx in zip(scores[0], indices[0], strict=True):
```
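To make the requirements.txt nitpick concrete, a pinned file might look like the sketch below. The package names are inferred from the module's described stack (SentenceTransformer, FAISS, markdown) and the version bounds are illustrative, not taken from the PR:

```text
# illustrative pins, not the PR's actual requirements.txt
sentence-transformers~=2.2.2
faiss-cpu>=1.7,<2
markdown>=3.4,<4
```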
```python
def _clean_markdown(md_text: str) -> str:
    """
    Convert markdown text to plain text.
    """
    html = markdown.markdown(md_text)
    # Very lightweight HTML tag removal
    text = (
        html.replace("<p>", "")
        .replace("</p>", "\n")
        .replace("<br />", "\n")
    )
    return text
```
Incomplete HTML tag stripping will pollute embeddings with HTML markup.

Only `<p>`, `</p>`, and `<br />` are stripped. All other tags produced by `markdown.markdown()` — headings (`<h1>`–`<h6>`), lists (`<ul>`, `<li>`), emphasis (`<strong>`, `<em>`), links (`<a href="...">`), code (`<pre>`, `<code>`), etc. — will remain in the text fed to the embedder, degrading search quality.
Proposed fix using regex-based tag stripping

```diff
 import os
+import re
 from typing import List, Dict
 import markdown

 def _clean_markdown(md_text: str) -> str:
     """
     Convert markdown text to plain text.
     """
     html = markdown.markdown(md_text)
-    # Very lightweight HTML tag removal
-    text = (
-        html.replace("<p>", "")
-        .replace("</p>", "\n")
-        .replace("<br />", "\n")
-    )
+    # Strip all HTML tags
+    text = re.sub(r"<[^>]+>", "", html)
     return text
```
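The prompt above also mentions decoding HTML entities. A minimal standalone sketch combining both steps might look like this; the function name here is illustrative, not the PR's exact code:

```python
import re
from html import unescape

import markdown


def clean_markdown(md_text: str) -> str:
    """Render markdown to HTML, strip every tag, then decode entities."""
    rendered = markdown.markdown(md_text)
    text = re.sub(r"<[^>]+>", "", rendered)  # remove all tags, not just <p>/<br />
    return unescape(text)                    # &amp; -> &, &lt; -> <, etc.


if __name__ == "__main__":
    sample = "# Title\n\nSome **bold** text & a [link](https://example.com)."
    print(clean_markdown(sample))
    # Prints "Title" and "Some bold text & a link." on separate lines.
```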
```python
    for para in paragraphs:
        if len(current_chunk) + len(para) <= max_chars:
            current_chunk += " " + para
```
Off-by-one: the joining space is not accounted for in the length check.

Line 44 checks `len(current_chunk) + len(para) <= max_chars`, but Line 45 concatenates with `" " + para`, adding an extra character. This means chunks can exceed `max_chars` by 1.
Proposed fix

```diff
     for para in paragraphs:
-        if len(current_chunk) + len(para) <= max_chars:
-            current_chunk += " " + para
+        separator = " " if current_chunk else ""
+        if len(current_chunk) + len(separator) + len(para) <= max_chars:
+            current_chunk += separator + para
         else:
```

This also avoids the leading space on the first paragraph of each chunk.
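For context, here is the loop with the fix applied as a self-contained function. The surrounding flush and wrap-up logic in chunker.py is not shown in the PR excerpt, so the `else` branch and the final append are assumptions:

```python
from typing import List


def chunk_paragraphs(paragraphs: List[str], max_chars: int = 500) -> List[str]:
    """Greedily pack paragraphs into chunks, counting the joining space."""
    chunks: List[str] = []
    current_chunk = ""
    for para in paragraphs:
        separator = " " if current_chunk else ""
        if len(current_chunk) + len(separator) + len(para) <= max_chars:
            current_chunk += separator + para
        else:
            if current_chunk:
                chunks.append(current_chunk)
            current_chunk = para  # an oversized paragraph still becomes its own chunk
    if current_chunk:
        chunks.append(current_chunk)
    return chunks


# Every produced chunk respects the limit (given no single paragraph exceeds it).
assert all(len(c) <= 12 for c in chunk_paragraphs(["aaaa", "bbbb", "cccc"], max_chars=12))
```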
````
```bash
python -m semantic_search.search semantic_search/test_notes "your query here"
````
Unclosed code fence will break README rendering.

The fenced code block opened on Line 18 is never closed with a matching `` ``` ``. This will cause the rest of the README (and any content added below) to render as a code block on GitHub. Also consider adding installation instructions (`pip install -r requirements.txt`).
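A corrected excerpt might look like the following; the heading names and install line are illustrative, since the full README isn't shown in this diff:

````markdown
## Installation

```bash
pip install -r semantic_search/requirements.txt
```

## Usage

```bash
python -m semantic_search.search semantic_search/test_notes "your query here"
```
````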
```python
    texts = [c["text"] for c in all_chunks]

    embedder = Embedder()
    embeddings = embedder.embed_texts(texts)

    vector_store = VectorStore(embedding_dim=len(embeddings[0]))
```
IndexError crash when markdown files exist but yield no text chunks.

If all `.md` files are empty or whitespace-only, `all_chunks` will be `[]`, `texts` will be `[]`, `embedder.embed_texts([])` returns `[]`, and `len(embeddings[0])` on Line 39 raises an `IndexError`.
Proposed fix

```diff
     texts = [c["text"] for c in all_chunks]
+    if not texts:
+        print("No text chunks extracted from markdown files.")
+        sys.exit(1)
+
     embedder = Embedder()
     embeddings = embedder.embed_texts(texts)
```
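For the alternative path the comment mentions, a safe dimension taken from the embedder itself, a sketch might look like this. The model name is an assumed common default (the PR's actual choice isn't shown); `get_sentence_embedding_dimension()` is SentenceTransformer's own accessor:

```python
from typing import List

from sentence_transformers import SentenceTransformer


class Embedder:
    """Hypothetical variant that exposes the model's native embedding dimension."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):  # model name assumed
        self.model = SentenceTransformer(model_name)

    @property
    def embedding_dim(self) -> int:
        # Ask the model directly instead of probing embeddings[0].
        return self.model.get_sentence_embedding_dimension()

    def embed_texts(self, texts: List[str]) -> List[List[float]]:
        if not texts:
            return []  # keep the empty-input case well-defined
        return self.model.encode(texts).tolist()
```

With this, the call site can become `VectorStore(embedding_dim=embedder.embedding_dim)` and never touch `embeddings[0]`.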
```python
    for res in results:
        print(f"- {res['source']} (score: {res['score']:.3f})")
        print(f"  {res['text'][:200]}...\n")
```
Trailing "..." is always appended, even for short texts.

Line 48 always appends `...` after truncation, even when the text is shorter than 200 characters.
Proposed fix

```diff
-        print(f"  {res['text'][:200]}...\n")
+        snippet = res['text'][:200]
+        ellipsis = "..." if len(res['text']) > 200 else ""
+        print(f"  {snippet}{ellipsis}\n")
```
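The same fix can be pulled into a small helper for reuse and testing; the name is illustrative:

```python
def make_snippet(text: str, limit: int = 200) -> str:
    """Return text unchanged if it fits, else truncate and append an ellipsis."""
    return text if len(text) <= limit else text[:limit] + "..."


assert make_snippet("short") == "short"               # no spurious "..."
assert make_snippet("x" * 300) == "x" * 200 + "..."   # truncated and marked
```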
This PR introduces a minimal, local-first semantic search pipeline for markdown notes.

It includes:

- markdown file ingestion and chunking
- local embedding generation via SentenceTransformer
- a FAISS-based vector store
- a CLI tool for searching notes by semantic similarity
The implementation runs fully offline after the initial model download and is intended as a foundational building block for the Smart Notes project’s semantic search and future RAG pipeline.
No existing code paths are modified.