Created web search agent tools for more detailed questions asked#9
Created web search agent tools for more detailed questions asked#9
Conversation
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
Adds semantic search over individually scraped web pages by storing page metadata and vector embeddings, enabling the agent to retrieve more detailed information than a single consolidated reference document.
Changes:
- Introduces new DB tables + RPC (
scraped_pages,page_chunks,search_page_chunks) to store per-page content and perform vector similarity search. - Updates the scraper/CLI flow to return structured scrape results and to index scraped pages into the new tables with embeddings.
- Adds an agent tool (
search_pages) and extendsAgentContextwithreference_doc_idto scope searches to the correct document.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| migrations/006_scraped_pages_index.sql | Adds pgvector-backed tables, indexes, and an RPC function for semantic search. |
| src/services/embedding_service.py | New embedding generation helpers (generate_embeddings, embed_query) using PydanticAI Gateway. |
| src/services/scraper.py | Returns structured scrape results (ScrapeResult), captures per-page metadata, and adds chunk_text. |
| src/models/scraper_models.py | New dataclasses for ScrapedPage and ScrapeResult. |
| src/cli/setup_cli.py | Indexes scraped pages/chunks with embeddings during setup and supports “resume” behavior. |
| src/db/repository.py | Adds persistence + search functions for scraped pages and embedded chunks. |
| src/services/agent_service.py | Adds search_pages tool and wires reference_doc_id into agent deps. |
| src/models/agent_models.py | Adds reference_doc_id to AgentContext. |
| src/api/webhook.py | Passes reference_doc_id into AgentContext at runtime. |
| src/config.py | Adds embedding/search-related settings defaults. |
| tests/unit/test_embedding_service.py | Adds unit tests for the embedding service. |
| tests/unit/test_scraper.py | Updates scraper tests for the new ScrapeResult return type and adds chunk_text tests. |
| tests/unit/test_setup_cli.py | Updates CLI tests for new scrape return type and indexing/resume paths. |
| tests/unit/test_models.py | Extends AgentContext property tests to include reference_doc_id. |
| tests/unit/test_logging.py | Adjusts logging tests for new context field and scraper return type. |
| tests/unit/test_agent_service.py | Updates agent service tests to include reference_doc_id. |
| tests/stateful/test_agent_conversation.py | Updates stateful conversation tests to include reference_doc_id. |
| tests/conftest.py | Updates shared AgentContext fixture to include reference_doc_id. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # If no page index exists yet, scrape and index pages only (do not modify reference doc) | ||
| existing_pages = get_scraped_pages_by_reference_doc(reference_doc_id) | ||
| if not existing_pages: | ||
| typer.echo(" No page index found. Scraping pages for search index only...") | ||
| try: | ||
| scrape_result = _run_async_with_cleanup(scrape_website(normalized_url)) | ||
| typer.echo(f" ✓ Scraped {len(scrape_result.pages)} pages") | ||
| typer.echo(" Indexing pages and generating embeddings...") | ||
| async def _index_pages_and_chunks(): | ||
| for page in scrape_result.pages: | ||
| scraped_page_id = create_scraped_page( | ||
| reference_doc_id=reference_doc_id, | ||
| url=page.url, | ||
| normalized_url=page.normalized_url, | ||
| title=page.title, | ||
| raw_content=page.content, | ||
| word_count=page.word_count, | ||
| scraped_at=page.scraped_at, | ||
| ) | ||
| page_chunk_tuples = chunk_text(page.content) | ||
| if not page_chunk_tuples: | ||
| continue | ||
| chunk_texts = [t[0] for t in page_chunk_tuples] | ||
| embeddings = await generate_embeddings(chunk_texts) | ||
| chunks_with_embeddings = [ | ||
| (chunk_texts[i], embeddings[i], page_chunk_tuples[i][1]) | ||
| for i in range(len(chunk_texts)) | ||
| ] | ||
| create_page_chunks(scraped_page_id, chunks_with_embeddings) | ||
| return len(scrape_result.pages) | ||
| page_count = _run_async_with_cleanup(_index_pages_and_chunks()) | ||
| typer.echo(f" ✓ Indexed {page_count} pages with embeddings") | ||
| except Exception as e: | ||
| typer.echo( | ||
| typer.style( | ||
| f" ⚠ Page indexing failed (search_pages tool will be empty): {e}", | ||
| fg=typer.colors.YELLOW, | ||
| ), | ||
| err=True, | ||
| ) | ||
| else: | ||
| typer.echo(f" Page index already has {len(existing_pages)} pages.") |
There was a problem hiding this comment.
In the “existing reference doc” resume path, the decision to skip indexing is based only on get_scraped_pages_by_reference_doc(reference_doc_id). If indexing fails after inserting some scraped_pages rows (but before inserting page_chunks), subsequent runs will see existing_pages as non-empty and will skip indexing, leaving search_pages permanently empty for that doc unless the DB is manually cleaned up. Consider checking for existing chunks (not just pages), or making the indexing step idempotent/re-runnable (e.g., upsert pages + replace chunks, or delete partially-created pages on failure).
| # If no page index exists yet, scrape and index pages only (do not modify reference doc) | |
| existing_pages = get_scraped_pages_by_reference_doc(reference_doc_id) | |
| if not existing_pages: | |
| typer.echo(" No page index found. Scraping pages for search index only...") | |
| try: | |
| scrape_result = _run_async_with_cleanup(scrape_website(normalized_url)) | |
| typer.echo(f" ✓ Scraped {len(scrape_result.pages)} pages") | |
| typer.echo(" Indexing pages and generating embeddings...") | |
| async def _index_pages_and_chunks(): | |
| for page in scrape_result.pages: | |
| scraped_page_id = create_scraped_page( | |
| reference_doc_id=reference_doc_id, | |
| url=page.url, | |
| normalized_url=page.normalized_url, | |
| title=page.title, | |
| raw_content=page.content, | |
| word_count=page.word_count, | |
| scraped_at=page.scraped_at, | |
| ) | |
| page_chunk_tuples = chunk_text(page.content) | |
| if not page_chunk_tuples: | |
| continue | |
| chunk_texts = [t[0] for t in page_chunk_tuples] | |
| embeddings = await generate_embeddings(chunk_texts) | |
| chunks_with_embeddings = [ | |
| (chunk_texts[i], embeddings[i], page_chunk_tuples[i][1]) | |
| for i in range(len(chunk_texts)) | |
| ] | |
| create_page_chunks(scraped_page_id, chunks_with_embeddings) | |
| return len(scrape_result.pages) | |
| page_count = _run_async_with_cleanup(_index_pages_and_chunks()) | |
| typer.echo(f" ✓ Indexed {page_count} pages with embeddings") | |
| except Exception as e: | |
| typer.echo( | |
| typer.style( | |
| f" ⚠ Page indexing failed (search_pages tool will be empty): {e}", | |
| fg=typer.colors.YELLOW, | |
| ), | |
| err=True, | |
| ) | |
| else: | |
| typer.echo(f" Page index already has {len(existing_pages)} pages.") | |
| # Always (re-)scrape and index pages for the search index to ensure idempotency. | |
| typer.echo(" Scraping pages and building search index (this may take a moment)...") | |
| try: | |
| scrape_result = _run_async_with_cleanup(scrape_website(normalized_url)) | |
| typer.echo(f" ✓ Scraped {len(scrape_result.pages)} pages") | |
| typer.echo(" Indexing pages and generating embeddings...") | |
| async def _index_pages_and_chunks(): | |
| for page in scrape_result.pages: | |
| scraped_page_id = create_scraped_page( | |
| reference_doc_id=reference_doc_id, | |
| url=page.url, | |
| normalized_url=page.normalized_url, | |
| title=page.title, | |
| raw_content=page.content, | |
| word_count=page.word_count, | |
| scraped_at=page.scraped_at, | |
| ) | |
| page_chunk_tuples = chunk_text(page.content) | |
| if not page_chunk_tuples: | |
| continue | |
| chunk_texts = [t[0] for t in page_chunk_tuples] | |
| embeddings = await generate_embeddings(chunk_texts) | |
| chunks_with_embeddings = [ | |
| (chunk_texts[i], embeddings[i], page_chunk_tuples[i][1]) | |
| for i in range(len(chunk_texts)) | |
| ] | |
| create_page_chunks(scraped_page_id, chunks_with_embeddings) | |
| return len(scrape_result.pages) | |
| page_count = _run_async_with_cleanup(_index_pages_and_chunks()) | |
| typer.echo(f" ✓ Indexed {page_count} pages with embeddings") | |
| except Exception as e: | |
| typer.echo( | |
| typer.style( | |
| f" ⚠ Page indexing failed (search_pages tool may be incomplete or empty): {e}", | |
| fg=typer.colors.YELLOW, | |
| ), | |
| err=True, | |
| ) |
| async def _index_pages_and_chunks(): | ||
| for page in scrape_result.pages: | ||
| scraped_page_id = create_scraped_page( | ||
| reference_doc_id=reference_doc_id, | ||
| url=page.url, | ||
| normalized_url=page.normalized_url, | ||
| title=page.title, | ||
| raw_content=page.content, | ||
| word_count=page.word_count, | ||
| scraped_at=page.scraped_at, | ||
| ) | ||
| page_chunk_tuples = chunk_text(page.content) | ||
| if not page_chunk_tuples: | ||
| continue | ||
| chunk_texts = [t[0] for t in page_chunk_tuples] | ||
| embeddings = await generate_embeddings(chunk_texts) | ||
| chunks_with_embeddings = [ | ||
| (chunk_texts[i], embeddings[i], page_chunk_tuples[i][1]) | ||
| for i in range(len(chunk_texts)) | ||
| ] | ||
| create_page_chunks(scraped_page_id, chunks_with_embeddings) | ||
| return len(scrape_result.pages) |
There was a problem hiding this comment.
The page indexing implementation (_index_pages_and_chunks) is duplicated in both the existing-doc resume path and the new-doc path. This duplication makes it easy for the two flows to drift (e.g., different chunking/embedding behavior, different error handling). Consider extracting a single helper (e.g., index_scrape_result(reference_doc_id, scrape_result)) and calling it from both branches.
| embedding_dimensions: int = Field( | ||
| default=1536, | ||
| description="Embedding vector dimension (matches text-embedding-3-small)", | ||
| ) |
There was a problem hiding this comment.
embedding_dimensions is introduced in settings, but the rest of the implementation hard-codes 1536 in the DB schema and tests, and the runtime code doesn’t validate that the embedder actually returns vectors of this size. This creates a sharp edge if embedding_model is changed (or if the gateway returns a different dimension): inserts/search RPC casts can start failing at runtime. Either (a) validate embedding lengths against settings.embedding_dimensions in generate_embeddings/embed_query and keep schema in sync, or (b) remove embedding_dimensions to avoid implying it’s configurable.
| embedding_dimensions: int = Field( | |
| default=1536, | |
| description="Embedding vector dimension (matches text-embedding-3-small)", | |
| ) |
| async def test_generate_embeddings_returns_vectors(self): | ||
| """generate_embeddings returns list of vectors (list of floats).""" | ||
| mock_result = MagicMock() | ||
| mock_result.embeddings = [[0.1] * 1536, [0.2] * 1536] | ||
| with patch( | ||
| "src.services.embedding_service.Embedder" | ||
| ) as mock_embedder_class: | ||
| mock_embedder = MagicMock() | ||
| mock_embedder.embed_documents = AsyncMock(return_value=mock_result) | ||
| mock_embedder_class.return_value = mock_embedder | ||
| result = await generate_embeddings(["text one", "text two"]) | ||
| assert len(result) == 2 |
There was a problem hiding this comment.
This test patches Embedder but not get_settings(). generate_embeddings() calls get_settings() for embedding_model, and Settings has required fields that won’t be present in unit test env by default. Patch src.services.embedding_service.get_settings (as you do in test_generate_embeddings_calls_embed_documents) or take a mock_settings fixture dependency so the test doesn’t raise a settings validation error.
| async def test_embed_query_returns_vector(self): | ||
| """embed_query returns a single vector.""" | ||
| mock_result = MagicMock() | ||
| mock_result.embeddings = [[0.5] * 1536] | ||
| with patch("src.services.embedding_service.Embedder") as mock_embedder_class: | ||
| mock_embedder = MagicMock() | ||
| mock_embedder.embed_query = AsyncMock(return_value=mock_result) | ||
| mock_embedder_class.return_value = mock_embedder | ||
| result = await embed_query("search query") | ||
| assert len(result) == 1536 | ||
| assert result[0] == 0.5 | ||
|
|
||
| @pytest.mark.asyncio | ||
| async def test_embed_query_no_embeddings_returns_empty(self): | ||
| """When embed_query returns no embeddings, return empty list.""" | ||
| mock_result = MagicMock() | ||
| mock_result.embeddings = [] | ||
| with patch("src.services.embedding_service.Embedder") as mock_embedder_class: | ||
| mock_embedder = MagicMock() | ||
| mock_embedder.embed_query = AsyncMock(return_value=mock_result) | ||
| mock_embedder_class.return_value = mock_embedder | ||
| result = await embed_query("query") |
There was a problem hiding this comment.
These tests patch Embedder but not get_settings(). embed_query() calls get_settings() to choose the embedding model, and will fail in unit tests unless get_settings is patched (or a mock_settings fixture is used). Please patch src.services.embedding_service.get_settings in these test cases so they don’t depend on real environment variables.
| def create_scraped_page( | ||
| reference_doc_id: str, | ||
| url: str, | ||
| normalized_url: str, | ||
| title: str, | ||
| raw_content: str, | ||
| word_count: int, | ||
| scraped_at: datetime, | ||
| ) -> str: |
There was a problem hiding this comment.
New repository surface area was added (create_scraped_page, create_page_chunks, search_page_chunks, get_scraped_pages_by_reference_doc) but there are no corresponding unit tests, while other functions in this module are covered (see tests/unit/test_repository.py). Please add tests that validate the expected Supabase calls/parameters and basic behaviors (e.g., empty results handling in search_page_chunks/get_scraped_pages_by_reference_doc).
| @self.agent.tool | ||
| async def search_pages( | ||
| ctx: RunContext[MessengerAgentDeps], query: str | ||
| ) -> str: | ||
| """Search scraped website pages for specific information. | ||
|
|
||
| Use this when you need to find detailed information that may not be | ||
| in the overview, such as specific policies, contact details, or | ||
| facts about particular topics. | ||
| """ |
There was a problem hiding this comment.
The new search_pages tool enables using scraped-page content beyond the reference doc, but the system prompt (prompts/agent_system_instructions.md) currently says “Use ONLY the following reference document as your source of truth” and to escalate when something isn’t covered. As-is, the agent may avoid calling this tool, or if it does call it it may violate its own instructions. Please update the agent’s system prompt/guardrails to explicitly allow using search_pages as an additional (untrusted) source, and include guidance to ignore any instructions found in scraped content and to avoid exposing page URLs/citations directly to end users (since the tool currently returns [Source: ...]).
b297e9b to
ad895ce
Compare

TL;DR
Added semantic search capabilities to the chatbot with vector embeddings for scraped web pages.
What changed?
006_scraped_pages_index.sql) that creates tables for storing scraped pages and vector embeddingssearch_pagestool to the agent that performs semantic search over the indexed pagesHow to test?
search_pagestool to find relevant information from the indexed pagesWhy make this change?
The previous implementation only used a single consolidated reference document, which limited the amount of information available to the chatbot. By indexing individual pages with vector embeddings, the chatbot can now perform semantic search to find specific information across all scraped pages, providing more accurate and detailed responses without exceeding context limits.