
Created web search agent tools for more detailed questions asked #9

Merged
jreakin merged 1 commit into main from agent-search-tool-creation
Jan 30, 2026

Conversation


@jreakin jreakin commented Jan 29, 2026

TL;DR

Added semantic search capabilities to the chatbot with vector embeddings for scraped web pages.

What changed?

  • Added a new database migration (006_scraped_pages_index.sql) that creates tables for storing scraped pages and vector embeddings
  • Created a new embedding service to generate vector embeddings for text chunks
  • Enhanced the scraper to store per-page metadata and content in addition to the combined reference document
  • Added a search_pages tool to the agent that performs semantic search over the indexed pages
  • Updated the setup CLI to index pages with embeddings during the scraping process
  • Added reference_doc_id to AgentContext to enable semantic search within the correct document scope
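
The retrieval path the bullets above describe can be pictured with a short, self-contained sketch. Here `embed_query` and `search_page_chunks` are stand-ins for this PR's real helpers; their exact signatures and return shapes are assumptions for illustration only.

```python
import asyncio

async def embed_query(text: str) -> list[float]:
    # Stand-in: the real service calls an embedding model and returns
    # a fixed-dimension vector (1536 dims in this PR's schema).
    return [1.0, 0.0, 0.5]

def search_page_chunks(reference_doc_id: str, query_embedding: list[float],
                       limit: int) -> list[dict]:
    # Stand-in: the real function calls a pgvector similarity-search RPC
    # scoped to the given reference document.
    rows = [{"url": "https://example.com/faq", "content": "Opening hours: 9 to 5."}]
    return rows[:limit]

async def search_pages(reference_doc_id: str, query: str, limit: int = 5) -> str:
    # Embed the user question, then return the nearest chunks with sources.
    query_embedding = await embed_query(query)
    if not query_embedding:
        return "No results found."
    rows = search_page_chunks(reference_doc_id, query_embedding, limit)
    return "\n\n".join(f"[Source: {r['url']}]\n{r['content']}" for r in rows)

result = asyncio.run(search_pages("doc-1", "When are you open?"))
print(result)
```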

How to test?

  1. Run the database migration to create the new tables
  2. Use the setup CLI to scrape a website - it will now index pages with embeddings
  3. Ask the chatbot a question that requires detailed information not in the reference document
  4. The agent should use the search_pages tool to find relevant information from the indexed pages

Why make this change?

The previous implementation only used a single consolidated reference document, which limited the amount of information available to the chatbot. By indexing individual pages with vector embeddings, the chatbot can now perform semantic search to find specific information across all scraped pages, providing more accurate and detailed responses without exceeding context limits.
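
The chunking step that makes this possible can be sketched as follows. This is a hypothetical version of `chunk_text`, assuming it returns `(text, index)` tuples with a bounded chunk size; the PR's real splitting logic (overlap, sentence boundaries, token counts) may differ.

```python
def chunk_text(text: str, max_words: int = 200) -> list[tuple[str, int]]:
    """Split text into (chunk, index) tuples of at most max_words words.

    Hypothetical sketch: bounded chunks keep each embedded unit well
    under the model's context limit while preserving order via the index.
    """
    words = text.split()
    chunks: list[tuple[str, int]] = []
    for start in range(0, len(words), max_words):
        chunks.append((" ".join(words[start:start + max_words]), len(chunks)))
    return chunks

# 450 words with max_words=200 -> 3 chunks, indexed 0..2
pairs = chunk_text("word " * 450)
```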


coderabbitai bot commented Jan 29, 2026

Warning

Rate limit exceeded

@jreakin has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 14 minutes and 54 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.




jreakin commented Jan 29, 2026


sentry bot commented Jan 29, 2026

Codecov Report

❌ Patch coverage is 65.34091% with 61 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines          Patch %   Lines
src/db/repository.py              20.00%    24 Missing ⚠️
src/services/agent_service.py     25.00%    18 Missing ⚠️
src/cli/setup_cli.py              66.66%    13 Missing and 2 partials ⚠️
src/services/scraper.py           88.57%    1 Missing and 3 partials ⚠️


@jreakin jreakin marked this pull request as ready for review January 29, 2026 23:01
Copilot AI review requested due to automatic review settings January 29, 2026 23:01
Copilot AI left a comment

Pull request overview

Adds semantic search over individually scraped web pages by storing page metadata and vector embeddings, enabling the agent to retrieve more detailed information than a single consolidated reference document.

Changes:

  • Introduces new DB tables + RPC (scraped_pages, page_chunks, search_page_chunks) to store per-page content and perform vector similarity search.
  • Updates the scraper/CLI flow to return structured scrape results and to index scraped pages into the new tables with embeddings.
  • Adds an agent tool (search_pages) and extends AgentContext with reference_doc_id to scope searches to the correct document.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 7 comments.

Summary per file:

  • migrations/006_scraped_pages_index.sql: Adds pgvector-backed tables, indexes, and an RPC function for semantic search.
  • src/services/embedding_service.py: New embedding generation helpers (generate_embeddings, embed_query) using PydanticAI Gateway.
  • src/services/scraper.py: Returns structured scrape results (ScrapeResult), captures per-page metadata, and adds chunk_text.
  • src/models/scraper_models.py: New dataclasses for ScrapedPage and ScrapeResult.
  • src/cli/setup_cli.py: Indexes scraped pages/chunks with embeddings during setup and supports "resume" behavior.
  • src/db/repository.py: Adds persistence + search functions for scraped pages and embedded chunks.
  • src/services/agent_service.py: Adds search_pages tool and wires reference_doc_id into agent deps.
  • src/models/agent_models.py: Adds reference_doc_id to AgentContext.
  • src/api/webhook.py: Passes reference_doc_id into AgentContext at runtime.
  • src/config.py: Adds embedding/search-related settings defaults.
  • tests/unit/test_embedding_service.py: Adds unit tests for the embedding service.
  • tests/unit/test_scraper.py: Updates scraper tests for the new ScrapeResult return type and adds chunk_text tests.
  • tests/unit/test_setup_cli.py: Updates CLI tests for new scrape return type and indexing/resume paths.
  • tests/unit/test_models.py: Extends AgentContext property tests to include reference_doc_id.
  • tests/unit/test_logging.py: Adjusts logging tests for new context field and scraper return type.
  • tests/unit/test_agent_service.py: Updates agent service tests to include reference_doc_id.
  • tests/stateful/test_agent_conversation.py: Updates stateful conversation tests to include reference_doc_id.
  • tests/conftest.py: Updates shared AgentContext fixture to include reference_doc_id.


Comment on lines +387 to +428

# If no page index exists yet, scrape and index pages only (do not modify reference doc)
existing_pages = get_scraped_pages_by_reference_doc(reference_doc_id)
if not existing_pages:
    typer.echo(" No page index found. Scraping pages for search index only...")
    try:
        scrape_result = _run_async_with_cleanup(scrape_website(normalized_url))
        typer.echo(f" ✓ Scraped {len(scrape_result.pages)} pages")
        typer.echo(" Indexing pages and generating embeddings...")

        async def _index_pages_and_chunks():
            for page in scrape_result.pages:
                scraped_page_id = create_scraped_page(
                    reference_doc_id=reference_doc_id,
                    url=page.url,
                    normalized_url=page.normalized_url,
                    title=page.title,
                    raw_content=page.content,
                    word_count=page.word_count,
                    scraped_at=page.scraped_at,
                )
                page_chunk_tuples = chunk_text(page.content)
                if not page_chunk_tuples:
                    continue
                chunk_texts = [t[0] for t in page_chunk_tuples]
                embeddings = await generate_embeddings(chunk_texts)
                chunks_with_embeddings = [
                    (chunk_texts[i], embeddings[i], page_chunk_tuples[i][1])
                    for i in range(len(chunk_texts))
                ]
                create_page_chunks(scraped_page_id, chunks_with_embeddings)
            return len(scrape_result.pages)

        page_count = _run_async_with_cleanup(_index_pages_and_chunks())
        typer.echo(f" ✓ Indexed {page_count} pages with embeddings")
    except Exception as e:
        typer.echo(
            typer.style(
                f" ⚠ Page indexing failed (search_pages tool will be empty): {e}",
                fg=typer.colors.YELLOW,
            ),
            err=True,
        )
else:
    typer.echo(f" Page index already has {len(existing_pages)} pages.")
Copilot AI Jan 29, 2026

In the “existing reference doc” resume path, the decision to skip indexing is based only on get_scraped_pages_by_reference_doc(reference_doc_id). If indexing fails after inserting some scraped_pages rows (but before inserting page_chunks), subsequent runs will see existing_pages as non-empty and will skip indexing, leaving search_pages permanently empty for that doc unless the DB is manually cleaned up. Consider checking for existing chunks (not just pages), or making the indexing step idempotent/re-runnable (e.g., upsert pages + replace chunks, or delete partially-created pages on failure).

Suggested change (replacing the quoted block with):

# Always (re-)scrape and index pages for the search index to ensure idempotency.
typer.echo(" Scraping pages and building search index (this may take a moment)...")
try:
    scrape_result = _run_async_with_cleanup(scrape_website(normalized_url))
    typer.echo(f" ✓ Scraped {len(scrape_result.pages)} pages")
    typer.echo(" Indexing pages and generating embeddings...")

    async def _index_pages_and_chunks():
        for page in scrape_result.pages:
            scraped_page_id = create_scraped_page(
                reference_doc_id=reference_doc_id,
                url=page.url,
                normalized_url=page.normalized_url,
                title=page.title,
                raw_content=page.content,
                word_count=page.word_count,
                scraped_at=page.scraped_at,
            )
            page_chunk_tuples = chunk_text(page.content)
            if not page_chunk_tuples:
                continue
            chunk_texts = [t[0] for t in page_chunk_tuples]
            embeddings = await generate_embeddings(chunk_texts)
            chunks_with_embeddings = [
                (chunk_texts[i], embeddings[i], page_chunk_tuples[i][1])
                for i in range(len(chunk_texts))
            ]
            create_page_chunks(scraped_page_id, chunks_with_embeddings)
        return len(scrape_result.pages)

    page_count = _run_async_with_cleanup(_index_pages_and_chunks())
    typer.echo(f" ✓ Indexed {page_count} pages with embeddings")
except Exception as e:
    typer.echo(
        typer.style(
            f" ⚠ Page indexing failed (search_pages tool may be incomplete or empty): {e}",
            fg=typer.colors.YELLOW,
        ),
        err=True,
    )

Copilot uses AI. Check for mistakes.
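
One way to address the partial-index hazard flagged above is to make the resume check chunk-aware rather than page-aware. A minimal sketch, assuming a hypothetical `has_page_chunks` accessor that the PR's repository module would need to expose:

```python
def needs_indexing(pages: list[dict], has_page_chunks) -> bool:
    """Return True when the page index must be (re)built.

    has_page_chunks is a hypothetical callable taking a scraped_page id;
    this is an illustration, not the PR's actual repository API.
    """
    if not pages:
        return True  # nothing indexed yet
    # Pages exist, but a failure between the page insert and the chunk
    # insert leaves search_pages empty; re-index unless at least one
    # page already has chunks.
    return not any(has_page_chunks(page["id"]) for page in pages)
```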
Comment on lines +395 to +416

async def _index_pages_and_chunks():
    for page in scrape_result.pages:
        scraped_page_id = create_scraped_page(
            reference_doc_id=reference_doc_id,
            url=page.url,
            normalized_url=page.normalized_url,
            title=page.title,
            raw_content=page.content,
            word_count=page.word_count,
            scraped_at=page.scraped_at,
        )
        page_chunk_tuples = chunk_text(page.content)
        if not page_chunk_tuples:
            continue
        chunk_texts = [t[0] for t in page_chunk_tuples]
        embeddings = await generate_embeddings(chunk_texts)
        chunks_with_embeddings = [
            (chunk_texts[i], embeddings[i], page_chunk_tuples[i][1])
            for i in range(len(chunk_texts))
        ]
        create_page_chunks(scraped_page_id, chunks_with_embeddings)
    return len(scrape_result.pages)
Copilot AI Jan 29, 2026

The page indexing implementation (_index_pages_and_chunks) is duplicated in both the existing-doc resume path and the new-doc path. This duplication makes it easy for the two flows to drift (e.g., different chunking/embedding behavior, different error handling). Consider extracting a single helper (e.g., index_scrape_result(reference_doc_id, scrape_result)) and calling it from both branches.
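
The extraction this comment suggests could look roughly like the sketch below. The four repository/service calls are stubbed out so the sketch runs standalone; the PR's real signatures (keyword arguments, page object shape) may differ.

```python
import asyncio
from datetime import datetime, timezone

# Stubs standing in for the PR's real functions so this sketch runs.
def create_scraped_page(**kwargs) -> str:
    return "page-1"

def chunk_text(text: str) -> list[tuple[str, int]]:
    return [(text, 0)] if text else []

async def generate_embeddings(texts: list[str]) -> list[list[float]]:
    return [[0.0] * 3 for _ in texts]

def create_page_chunks(page_id: str, chunks) -> None:
    pass

async def index_scrape_result(reference_doc_id: str, pages: list[dict]) -> int:
    """Single helper callable from both the resume and new-doc paths."""
    for page in pages:
        page_id = create_scraped_page(
            reference_doc_id=reference_doc_id,
            url=page["url"],
            normalized_url=page["url"],
            title=page["title"],
            raw_content=page["content"],
            word_count=len(page["content"].split()),
            scraped_at=datetime.now(timezone.utc),
        )
        pairs = chunk_text(page["content"])
        if not pairs:
            continue
        texts = [t for t, _ in pairs]
        embeddings = await generate_embeddings(texts)
        # (text, embedding, chunk_index) tuples, matching the PR's shape.
        create_page_chunks(page_id, [(t, e, i) for (t, i), e in zip(pairs, embeddings)])
    return len(pages)

count = asyncio.run(
    index_scrape_result("doc-1", [{"url": "u", "title": "t", "content": "hello world"}])
)
```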

Comment on lines +58 to +61

    embedding_dimensions: int = Field(
        default=1536,
        description="Embedding vector dimension (matches text-embedding-3-small)",
    )
Copilot AI Jan 29, 2026

embedding_dimensions is introduced in settings, but the rest of the implementation hard-codes 1536 in the DB schema and tests, and the runtime code doesn’t validate that the embedder actually returns vectors of this size. This creates a sharp edge if embedding_model is changed (or if the gateway returns a different dimension): inserts/search RPC casts can start failing at runtime. Either (a) validate embedding lengths against settings.embedding_dimensions in generate_embeddings/embed_query and keep schema in sync, or (b) remove embedding_dimensions to avoid implying it’s configurable.

Suggested change (removes these lines):

    embedding_dimensions: int = Field(
        default=1536,
        description="Embedding vector dimension (matches text-embedding-3-small)",
    )

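
Option (a) above could be a small guard inside the embedding service. A sketch, with the expected dimension passed in explicitly rather than read from settings (the real code would use settings.embedding_dimensions):

```python
def validate_embedding_dims(vectors: list[list[float]],
                            expected_dim: int) -> list[list[float]]:
    """Raise early if any returned vector does not match the configured
    dimension, instead of failing later at the pgvector insert/cast."""
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(
                f"embedding {i} has dimension {len(vec)}, expected {expected_dim}"
            )
    return vectors
```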
Comment on lines +19 to +30
async def test_generate_embeddings_returns_vectors(self):
"""generate_embeddings returns list of vectors (list of floats)."""
mock_result = MagicMock()
mock_result.embeddings = [[0.1] * 1536, [0.2] * 1536]
with patch(
"src.services.embedding_service.Embedder"
) as mock_embedder_class:
mock_embedder = MagicMock()
mock_embedder.embed_documents = AsyncMock(return_value=mock_result)
mock_embedder_class.return_value = mock_embedder
result = await generate_embeddings(["text one", "text two"])
assert len(result) == 2
Copilot AI Jan 29, 2026

This test patches Embedder but not get_settings(). generate_embeddings() calls get_settings() for embedding_model, and Settings has required fields that won’t be present in unit test env by default. Patch src.services.embedding_service.get_settings (as you do in test_generate_embeddings_calls_embed_documents) or take a mock_settings fixture dependency so the test doesn’t raise a settings validation error.

Comment on lines +67 to +88
async def test_embed_query_returns_vector(self):
"""embed_query returns a single vector."""
mock_result = MagicMock()
mock_result.embeddings = [[0.5] * 1536]
with patch("src.services.embedding_service.Embedder") as mock_embedder_class:
mock_embedder = MagicMock()
mock_embedder.embed_query = AsyncMock(return_value=mock_result)
mock_embedder_class.return_value = mock_embedder
result = await embed_query("search query")
assert len(result) == 1536
assert result[0] == 0.5

@pytest.mark.asyncio
async def test_embed_query_no_embeddings_returns_empty(self):
"""When embed_query returns no embeddings, return empty list."""
mock_result = MagicMock()
mock_result.embeddings = []
with patch("src.services.embedding_service.Embedder") as mock_embedder_class:
mock_embedder = MagicMock()
mock_embedder.embed_query = AsyncMock(return_value=mock_result)
mock_embedder_class.return_value = mock_embedder
result = await embed_query("query")
Copilot AI Jan 29, 2026

These tests patch Embedder but not get_settings(). embed_query() calls get_settings() to choose the embedding model, and will fail in unit tests unless get_settings is patched (or a mock_settings fixture is used). Please patch src.services.embedding_service.get_settings in these test cases so they don’t depend on real environment variables.

Comment on lines +240 to +248

def create_scraped_page(
    reference_doc_id: str,
    url: str,
    normalized_url: str,
    title: str,
    raw_content: str,
    word_count: int,
    scraped_at: datetime,
) -> str:
Copilot AI Jan 29, 2026

New repository surface area was added (create_scraped_page, create_page_chunks, search_page_chunks, get_scraped_pages_by_reference_doc) but there are no corresponding unit tests, while other functions in this module are covered (see tests/unit/test_repository.py). Please add tests that validate the expected Supabase calls/parameters and basic behaviors (e.g., empty results handling in search_page_chunks/get_scraped_pages_by_reference_doc).

Comment on lines +136 to +145

@self.agent.tool
async def search_pages(
    ctx: RunContext[MessengerAgentDeps], query: str
) -> str:
    """Search scraped website pages for specific information.

    Use this when you need to find detailed information that may not be
    in the overview, such as specific policies, contact details, or
    facts about particular topics.
    """
Copilot AI Jan 29, 2026

The new search_pages tool enables using scraped-page content beyond the reference doc, but the system prompt (prompts/agent_system_instructions.md) currently says “Use ONLY the following reference document as your source of truth” and to escalate when something isn’t covered. As-is, the agent may avoid calling this tool, or if it does call it it may violate its own instructions. Please update the agent’s system prompt/guardrails to explicitly allow using search_pages as an additional (untrusted) source, and include guidance to ignore any instructions found in scraped content and to avoid exposing page URLs/citations directly to end users (since the tool currently returns [Source: ...]).


jreakin commented Jan 29, 2026

Merge activity

  • Jan 29, 11:55 PM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Jan 30, 12:12 AM UTC: Graphite rebased this pull request as part of a merge.
  • Jan 30, 12:12 AM UTC: @jreakin merged this pull request with Graphite.

@jreakin jreakin changed the base branch from facebook-cli-setup-enhancement to graphite-base/9 January 30, 2026 00:09
@jreakin jreakin changed the base branch from graphite-base/9 to main January 30, 2026 00:10
@jreakin jreakin force-pushed the agent-search-tool-creation branch from b297e9b to ad895ce Compare January 30, 2026 00:11
@jreakin jreakin merged commit 97aad2a into main Jan 30, 2026
6 checks passed
@jreakin jreakin deleted the agent-search-tool-creation branch January 30, 2026 00:19
