Add local semantic search MVP for Smart Notes #35

base: main

Conversation
📝 Walkthrough

This pull request introduces a complete offline-first semantic search module to the codebase. It includes markdown file ingestion and chunking, local embedding generation via SentenceTransformer, FAISS-based vector storage, and a CLI tool for searching markdown content through semantic similarity.
Sequence Diagram

```mermaid
sequenceDiagram
actor User
participant CLI as search.py
participant FileSystem
participant Chunker as chunker.py
participant Embedder as embedder.py
participant VectorStore as vector_store.py
participant FAISS
User->>CLI: python -m semantic_search.search notes_folder "query"
CLI->>FileSystem: load_markdown_files(notes_folder)
FileSystem-->>CLI: list of .md file paths
loop For each markdown file
CLI->>Chunker: chunk_markdown_file(file_path)
Chunker->>FileSystem: read file
FileSystem-->>Chunker: markdown content
Chunker->>Chunker: _clean_markdown() → plain text
Chunker->>Chunker: split & accumulate into chunks
Chunker-->>CLI: List[Dict(text, source)]
end
CLI->>Embedder: embed_texts(all_chunks)
Embedder->>Embedder: SentenceTransformer.encode()
Embedder-->>CLI: List[List[float]] embeddings
CLI->>VectorStore: __init__(embedding_dim)
VectorStore->>FAISS: create IndexFlatIP
CLI->>VectorStore: add(embeddings, metadatas)
VectorStore->>FAISS: add vectors
CLI->>Embedder: embed_texts([query])
Embedder-->>CLI: query_embedding
CLI->>VectorStore: search(query_embedding, k=5)
VectorStore->>FAISS: search(query_vector)
FAISS-->>VectorStore: top-k indices & scores
VectorStore-->>CLI: List[Dict(text, source, score)]
CLI->>User: print results with sources & snippets
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
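The flow above can also be read as a short driver script. The following is a hypothetical sketch only: the chunker, embedder, and vector-store method signatures are assumptions taken from the diagram, and the file-gathering step is inlined with `pathlib` since the location of `load_markdown_files` is not shown in the PR.

```python
# Hypothetical driver mirroring the sequence diagram; APIs assumed from the
# diagram, not copied from the PR.
import sys
from pathlib import Path

from semantic_search.chunker import chunk_markdown_file
from semantic_search.embedder import Embedder
from semantic_search.vector_store import VectorStore


def run(notes_folder: str, query: str) -> None:
    # Gather chunks from every .md file under the folder.
    chunks = []
    for path in sorted(Path(notes_folder).glob("*.md")):
        chunks.extend(chunk_markdown_file(str(path)))  # -> [{"text": ..., "source": ...}]

    # Embed all chunk texts in one batch.
    embedder = Embedder()
    embeddings = embedder.embed_texts([c["text"] for c in chunks])

    # Index vectors plus metadata in FAISS, then embed and run the query.
    store = VectorStore(embedding_dim=len(embeddings[0]))
    store.add(embeddings, chunks)
    results = store.search(embedder.embed_texts([query])[0], k=5)

    for res in results:  # each hit carries text, source, and similarity score
        print(f"- {res['source']} (score: {res['score']:.3f})")


if __name__ == "__main__":
    run(sys.argv[1], sys.argv[2])
```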
🚥 Pre-merge checks: 2 passed, 1 failed (warning)
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@semantic_search/chunker.py`:
- Around line 43-45: The loop building chunks can exceed max_chars by one
because it always concatenates " " + para but checks len(current_chunk) +
len(para) <= max_chars; update the logic in the loop that iterates over
paragraphs to account for the extra space only when current_chunk is non-empty
(e.g., check len(current_chunk) + (1 if current_chunk else 0) + len(para) <=
max_chars), and change the concatenation to add the space conditionally (append
para directly if current_chunk is empty, otherwise append " " + para) so you
also avoid leading spaces.
- Around line 5-16: The _clean_markdown function only strips a few specific
tags, leaving other HTML from markdown.markdown() in the output; update
_clean_markdown(md_text: str) to fully remove all HTML tags and decode HTML
entities before returning plain text (e.g., run the markdown output through an
HTML-to-text step such as using an HTML parser or a robust regex to strip tags
and html.unescape to decode entities) so that headings, lists, links, emphasis,
code tags, etc. are removed and the returned text is clean for embedding; ensure
you modify the function containing markdown.markdown(md_text) and return the
fully cleaned plain text.
In `@semantic_search/README.md`:
- Around line 17-18: The README has an unclosed fenced code block around the
example command; close the triple-backtick fence after the shown command (the
code block that starts before the line containing "python -m
semantic_search.search semantic_search/test_notes \"your query here\"") so the
rest of the document renders correctly, and optionally add a short installation
note above or below the example such as a pip install instruction (e.g.,
reference to requirements.txt) to show how to install dependencies before
running the command.
In `@semantic_search/search.py`:
- Around line 46-48: The loop printing search results always appends "..." even
for short texts; update the block that iterates over results (the for res in
results loop, using res['text'], res['source'], res['score']) to compute a
snippet = res['text'] if len(res['text']) <= 200 else res['text'][:200] + "..."
and print snippet instead of always printing res['text'][:200] + "...", so the
ellipsis is only added when the original text was truncated.
- Around line 34-39: The current code assumes embeddings exist and does
len(embeddings[0]) which raises IndexError when all_chunks/texts are empty;
after calling Embedder().embed_texts(texts) check if embeddings is empty and
handle it: either return early (skip building the VectorStore) or instantiate
VectorStore using a safe default embedding dimension obtained from the embedder
(e.g. add/use Embedder.embedding_dim or Embedder.get_embedding_dim()) instead of
indexing embeddings[0]; ensure you never call len(embeddings[0]) without first
verifying embeddings is non-empty and update the VectorStore creation site to
use the safe dimension or the early-return path.
🧹 Nitpick comments (2)

semantic_search/requirements.txt (1)

1-3: Consider pinning dependency versions or adding upper bounds for reproducibility. Using only `>=` constraints risks breaking changes from future major releases. For a project meant to be a foundational building block, consider pinning to compatible ranges (e.g., `~=2.2.2` or `>=2.2.2,<3`); a sketch of a pinned file follows the next comment.

semantic_search/vector_store.py (1)
34-40: Consider `strict=True` on `zip()` for defensive coding. Per Ruff B905, adding `strict=True` to `zip(scores[0], indices[0])` would catch unexpected length mismatches, though in practice FAISS guarantees equal-length arrays.

Proposed fix

```diff
-        for score, idx in zip(scores[0], indices[0]):
+        for score, idx in zip(scores[0], indices[0], strict=True):
```
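To make the requirements.txt nitpick concrete, a pinned file might look like the sketch below. The package names are inferred from the module's described stack (SentenceTransformer, FAISS, markdown) and the version bounds are illustrative, not taken from the PR:

```text
# illustrative pins, not the PR's actual requirements.txt
sentence-transformers~=2.2.2
faiss-cpu>=1.7,<2
markdown>=3.4,<4
```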
```python
def _clean_markdown(md_text: str) -> str:
    """
    Convert markdown text to plain text.
    """
    html = markdown.markdown(md_text)
    # Very lightweight HTML tag removal
    text = (
        html.replace("<p>", "")
        .replace("</p>", "\n")
        .replace("<br />", "\n")
    )
    return text
```
Incomplete HTML tag stripping will pollute embeddings with HTML markup.

Only `<p>`, `</p>`, and `<br />` are stripped. All other tags produced by `markdown.markdown()` — headings (`<h1>`–`<h6>`), lists (`<ul>`, `<li>`), emphasis (`<strong>`, `<em>`), links (`<a href="...">`), code (`<pre>`, `<code>`), etc. — will remain in the text fed to the embedder, degrading search quality.
Proposed fix using regex-based tag stripping

```diff
 import os
+import re
 from typing import List, Dict
 import markdown

 def _clean_markdown(md_text: str) -> str:
     """
     Convert markdown text to plain text.
     """
     html = markdown.markdown(md_text)
-    # Very lightweight HTML tag removal
-    text = (
-        html.replace("<p>", "")
-        .replace("</p>", "\n")
-        .replace("<br />", "\n")
-    )
+    # Strip all HTML tags
+    text = re.sub(r"<[^>]+>", "", html)
     return text
```
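The prompt above also mentions decoding HTML entities. A minimal standalone sketch combining both steps might look like this; the function name here is illustrative, not the PR's exact code:

```python
import re
from html import unescape

import markdown


def clean_markdown(md_text: str) -> str:
    """Render markdown to HTML, strip every tag, then decode entities."""
    rendered = markdown.markdown(md_text)
    text = re.sub(r"<[^>]+>", "", rendered)  # remove all tags, not just <p>/<br />
    return unescape(text)                    # &amp; -> &, &lt; -> <, etc.


if __name__ == "__main__":
    sample = "# Title\n\nSome **bold** text & a [link](https://example.com)."
    print(clean_markdown(sample))
    # Prints "Title" and "Some bold text & a link." on separate lines.
```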
```python
    for para in paragraphs:
        if len(current_chunk) + len(para) <= max_chars:
            current_chunk += " " + para
```
Off-by-one: the joining space is not accounted for in the length check.

Line 44 checks `len(current_chunk) + len(para) <= max_chars`, but Line 45 concatenates with `" " + para`, adding an extra character. This means chunks can exceed `max_chars` by 1.
Proposed fix

```diff
     for para in paragraphs:
-        if len(current_chunk) + len(para) <= max_chars:
-            current_chunk += " " + para
+        separator = " " if current_chunk else ""
+        if len(current_chunk) + len(separator) + len(para) <= max_chars:
+            current_chunk += separator + para
         else:
```

This also avoids the leading space on the first paragraph of each chunk.
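For context, here is the loop with the fix applied as a self-contained function. The surrounding flush and wrap-up logic in chunker.py is not shown in the PR excerpt, so the `else` branch and the final append are assumptions:

```python
from typing import List


def chunk_paragraphs(paragraphs: List[str], max_chars: int = 500) -> List[str]:
    """Greedily pack paragraphs into chunks, counting the joining space."""
    chunks: List[str] = []
    current_chunk = ""
    for para in paragraphs:
        separator = " " if current_chunk else ""
        if len(current_chunk) + len(separator) + len(para) <= max_chars:
            current_chunk += separator + para
        else:
            if current_chunk:
                chunks.append(current_chunk)
            current_chunk = para  # an oversized paragraph still becomes its own chunk
    if current_chunk:
        chunks.append(current_chunk)
    return chunks


# Every produced chunk respects the limit (given no single paragraph exceeds it).
assert all(len(c) <= 12 for c in chunk_paragraphs(["aaaa", "bbbb", "cccc"], max_chars=12))
```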
````
```bash
python -m semantic_search.search semantic_search/test_notes "your query here"
````
Unclosed code fence will break README rendering.

The fenced code block opened on Line 18 is never closed with a matching `` ``` ``. This will cause the rest of the README (and any content added below) to render as a code block on GitHub. Also consider adding installation instructions (`pip install -r requirements.txt`).
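A corrected excerpt might look like the following; the heading names and install line are illustrative, since the full README isn't shown in this diff:

````markdown
## Installation

```bash
pip install -r semantic_search/requirements.txt
```

## Usage

```bash
python -m semantic_search.search semantic_search/test_notes "your query here"
```
````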
```python
    texts = [c["text"] for c in all_chunks]

    embedder = Embedder()
    embeddings = embedder.embed_texts(texts)

    vector_store = VectorStore(embedding_dim=len(embeddings[0]))
```
IndexError crash when markdown files exist but yield no text chunks.

If all `.md` files are empty or whitespace-only, `all_chunks` will be `[]`, `texts` will be `[]`, `embedder.embed_texts([])` returns `[]`, and `len(embeddings[0])` on Line 39 raises an `IndexError`.
Proposed fix

```diff
     texts = [c["text"] for c in all_chunks]
+    if not texts:
+        print("No text chunks extracted from markdown files.")
+        sys.exit(1)
+
     embedder = Embedder()
     embeddings = embedder.embed_texts(texts)
```
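For the alternative path the comment mentions, a safe dimension taken from the embedder itself, a sketch might look like this. The model name is an assumed common default (the PR's actual choice isn't shown); `get_sentence_embedding_dimension()` is SentenceTransformer's own accessor:

```python
from typing import List

from sentence_transformers import SentenceTransformer


class Embedder:
    """Hypothetical variant that exposes the model's native embedding dimension."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):  # model name assumed
        self.model = SentenceTransformer(model_name)

    @property
    def embedding_dim(self) -> int:
        # Ask the model directly instead of probing embeddings[0].
        return self.model.get_sentence_embedding_dimension()

    def embed_texts(self, texts: List[str]) -> List[List[float]]:
        if not texts:
            return []  # keep the empty-input case well-defined
        return self.model.encode(texts).tolist()
```

With this, the call site can become `VectorStore(embedding_dim=embedder.embedding_dim)` and never touch `embeddings[0]`.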
```python
    for res in results:
        print(f"- {res['source']} (score: {res['score']:.3f})")
        print(f"  {res['text'][:200]}...\n")
```
Trailing "..." is always appended, even for short texts.

Line 48 always appends `...` after truncation, even when the text is shorter than 200 characters.
Proposed fix

```diff
-        print(f"  {res['text'][:200]}...\n")
+        snippet = res['text'][:200]
+        ellipsis = "..." if len(res['text']) > 200 else ""
+        print(f"  {snippet}{ellipsis}\n")
```
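The same fix can be pulled into a small helper for reuse and testing; the name is illustrative:

```python
def make_snippet(text: str, limit: int = 200) -> str:
    """Return text unchanged if it fits, else truncate and append an ellipsis."""
    return text if len(text) <= limit else text[:limit] + "..."


assert make_snippet("short") == "short"               # no spurious "..."
assert make_snippet("x" * 300) == "x" * 200 + "..."   # truncated and marked
```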
This PR introduces a minimal, local-first semantic search pipeline for markdown notes.

It includes:

- markdown file ingestion and chunking
- local embedding generation via SentenceTransformer
- a FAISS-based vector store
- a CLI tool for searching notes by semantic similarity
The implementation runs fully offline after the initial model download and is intended as a foundational building block for the Smart Notes project’s semantic search and future RAG pipeline.
No existing code paths are modified.