Created web search agent tools for more detailed questions asked by jreakin · Pull Request #9 · Abstract-Data/go-crea-fb-msg-fast-api

jreakin · 2026-01-29T22:25:30Z

TL;DR

Added semantic search capabilities to the chatbot with vector embeddings for scraped web pages.

What changed?

Added a new database migration (006_scraped_pages_index.sql) that creates tables for storing scraped pages and vector embeddings
Created a new embedding service to generate vector embeddings for text chunks
Enhanced the scraper to store per-page metadata and content in addition to the combined reference document
Added a search_pages tool to the agent that performs semantic search over the indexed pages
Updated the setup CLI to index pages with embeddings during the scraping process
Added reference_doc_id to AgentContext to enable semantic search within the correct document scope

How to test?

Run the database migration to create the new tables
Use the setup CLI to scrape a website - it will now index pages with embeddings
Ask the chatbot a question that requires detailed information not in the reference document
The agent should use the search_pages tool to find relevant information from the indexed pages

Why make this change?

The previous implementation only used a single consolidated reference document, which limited the amount of information available to the chatbot. By indexing individual pages with vector embeddings, the chatbot can now perform semantic search to find specific information across all scraped pages, providing more accurate and detailed responses without exceeding context limits.

coderabbitai · 2026-01-29T22:25:36Z

Warning

Rate limit exceeded

@jreakin has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 14 minutes and 54 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

✨ Finishing touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch agent-search-tool-creation

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

jreakin · 2026-01-29T22:26:14Z

Refactored to reduce coupling and redundency to maintain DRY more robustly. #10
Created web search agent tools for more detailed questions asked #9 👈 (View in Graphite)
Updated CLI walkthrough for setting up with Facebook #8
Added additional facebook data extracted from webhook #7
Summary of what was implemented: 1. Migration – test_sessions: id, reference_doc_id, source_url, tone, created_at (FK → reference_documents). test_messages: id, test_session_id, user_message, response_text, confidence, requires_escalation, escalation_reas #6
Externalize prompts Added src/prompts/ with agent_prompts.py and copilot_prompts.py. build_agent_system_prompt(tone, reference_doc, max_chars) and build_synthesis_system_prompt() / build_synthesis_user_prompt() are used by agent_service and copilot_servic #5
Core implementation All 12 todos completed PydanticAI Gateway integration implemented CopilotService removed New agent service with structured outputs Multi-tenant database migration created #4
Completed tasks Dependency installation — Added logfire[fastapi,pydantic]>=1.0.0 to pyproject.toml Logging configuration — Created src/logging_config.py with environment-aware setup Settings model — Updated src/config.py with Logfire configuration options #3
Set up initial test suite #2
Initial project setup #1
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

sentry · 2026-01-29T22:26:33Z

Codecov Report

❌ Patch coverage is 65.34091% with 61 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
src/db/repository.py	20.00%	24 Missing ⚠️
src/services/agent_service.py	25.00%	18 Missing ⚠️
src/cli/setup_cli.py	66.66%	13 Missing and 2 partials ⚠️
src/services/scraper.py	88.57%	1 Missing and 3 partials ⚠️

📢 Thoughts on this report? Let us know!

Copilot

Pull request overview

Adds semantic search over individually scraped web pages by storing page metadata and vector embeddings, enabling the agent to retrieve more detailed information than a single consolidated reference document.

Changes:

Introduces new DB tables + RPC (scraped_pages, page_chunks, search_page_chunks) to store per-page content and perform vector similarity search.
Updates the scraper/CLI flow to return structured scrape results and to index scraped pages into the new tables with embeddings.
Adds an agent tool (search_pages) and extends AgentContext with reference_doc_id to scope searches to the correct document.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
migrations/006_scraped_pages_index.sql	Adds pgvector-backed tables, indexes, and an RPC function for semantic search.
src/services/embedding_service.py	New embedding generation helpers (`generate_embeddings`, `embed_query`) using PydanticAI Gateway.
src/services/scraper.py	Returns structured scrape results (`ScrapeResult`), captures per-page metadata, and adds `chunk_text`.
src/models/scraper_models.py	New dataclasses for `ScrapedPage` and `ScrapeResult`.
src/cli/setup_cli.py	Indexes scraped pages/chunks with embeddings during setup and supports “resume” behavior.
src/db/repository.py	Adds persistence + search functions for scraped pages and embedded chunks.
src/services/agent_service.py	Adds `search_pages` tool and wires `reference_doc_id` into agent deps.
src/models/agent_models.py	Adds `reference_doc_id` to `AgentContext`.
src/api/webhook.py	Passes `reference_doc_id` into `AgentContext` at runtime.
src/config.py	Adds embedding/search-related settings defaults.
tests/unit/test_embedding_service.py	Adds unit tests for the embedding service.
tests/unit/test_scraper.py	Updates scraper tests for the new `ScrapeResult` return type and adds `chunk_text` tests.
tests/unit/test_setup_cli.py	Updates CLI tests for new scrape return type and indexing/resume paths.
tests/unit/test_models.py	Extends AgentContext property tests to include `reference_doc_id`.
tests/unit/test_logging.py	Adjusts logging tests for new context field and scraper return type.
tests/unit/test_agent_service.py	Updates agent service tests to include `reference_doc_id`.
tests/stateful/test_agent_conversation.py	Updates stateful conversation tests to include `reference_doc_id`.
tests/conftest.py	Updates shared `AgentContext` fixture to include `reference_doc_id`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-29T23:12:16Z

src/cli/setup_cli.py

+        # If no page index exists yet, scrape and index pages only (do not modify reference doc)
+        existing_pages = get_scraped_pages_by_reference_doc(reference_doc_id)
+        if not existing_pages:
+            typer.echo("  No page index found. Scraping pages for search index only...")
+            try:
+                scrape_result = _run_async_with_cleanup(scrape_website(normalized_url))
+                typer.echo(f"  ✓ Scraped {len(scrape_result.pages)} pages")
+                typer.echo("  Indexing pages and generating embeddings...")
+                async def _index_pages_and_chunks():
+                    for page in scrape_result.pages:
+                        scraped_page_id = create_scraped_page(
+                            reference_doc_id=reference_doc_id,
+                            url=page.url,
+                            normalized_url=page.normalized_url,
+                            title=page.title,
+                            raw_content=page.content,
+                            word_count=page.word_count,
+                            scraped_at=page.scraped_at,
+                        )
+                        page_chunk_tuples = chunk_text(page.content)
+                        if not page_chunk_tuples:
+                            continue
+                        chunk_texts = [t[0] for t in page_chunk_tuples]
+                        embeddings = await generate_embeddings(chunk_texts)
+                        chunks_with_embeddings = [
+                            (chunk_texts[i], embeddings[i], page_chunk_tuples[i][1])
+                            for i in range(len(chunk_texts))
+                        ]
+                        create_page_chunks(scraped_page_id, chunks_with_embeddings)
+                    return len(scrape_result.pages)
+                page_count = _run_async_with_cleanup(_index_pages_and_chunks())
+                typer.echo(f"  ✓ Indexed {page_count} pages with embeddings")
+            except Exception as e:
+                typer.echo(
+                    typer.style(
+                        f"  ⚠ Page indexing failed (search_pages tool will be empty): {e}",
+                        fg=typer.colors.YELLOW,
+                    ),
+                    err=True,
+                )
+        else:
+            typer.echo(f"  Page index already has {len(existing_pages)} pages.")


In the “existing reference doc” resume path, the decision to skip indexing is based only on get_scraped_pages_by_reference_doc(reference_doc_id). If indexing fails after inserting some scraped_pages rows (but before inserting page_chunks), subsequent runs will see existing_pages as non-empty and will skip indexing, leaving search_pages permanently empty for that doc unless the DB is manually cleaned up. Consider checking for existing chunks (not just pages), or making the indexing step idempotent/re-runnable (e.g., upsert pages + replace chunks, or delete partially-created pages on failure).

Suggested change

# If no page index exists yet, scrape and index pages only (do not modify reference doc)

existing_pages = get_scraped_pages_by_reference_doc(reference_doc_id)

if not existing_pages:

typer.echo(" No page index found. Scraping pages for search index only...")

try:

scrape_result = _run_async_with_cleanup(scrape_website(normalized_url))

typer.echo(f" ✓ Scraped {len(scrape_result.pages)} pages")

typer.echo(" Indexing pages and generating embeddings...")

async def _index_pages_and_chunks():

for page in scrape_result.pages:

scraped_page_id = create_scraped_page(

reference_doc_id=reference_doc_id,

url=page.url,

normalized_url=page.normalized_url,

title=page.title,

raw_content=page.content,

word_count=page.word_count,

scraped_at=page.scraped_at,

)

page_chunk_tuples = chunk_text(page.content)

if not page_chunk_tuples:

continue

chunk_texts = [t[0] for t in page_chunk_tuples]

embeddings = await generate_embeddings(chunk_texts)

chunks_with_embeddings = [

(chunk_texts[i], embeddings[i], page_chunk_tuples[i][1])

for i in range(len(chunk_texts))

]

create_page_chunks(scraped_page_id, chunks_with_embeddings)

return len(scrape_result.pages)

page_count = _run_async_with_cleanup(_index_pages_and_chunks())

typer.echo(f" ✓ Indexed {page_count} pages with embeddings")

except Exception as e:

typer.echo(

typer.style(

f" ⚠ Page indexing failed (search_pages tool will be empty): {e}",

fg=typer.colors.YELLOW,

),

err=True,

)

else:

typer.echo(f" Page index already has {len(existing_pages)} pages.")

# Always (re-)scrape and index pages for the search index to ensure idempotency.

typer.echo(" Scraping pages and building search index (this may take a moment)...")

try:

scrape_result = _run_async_with_cleanup(scrape_website(normalized_url))

typer.echo(f" ✓ Scraped {len(scrape_result.pages)} pages")

typer.echo(" Indexing pages and generating embeddings...")

async def _index_pages_and_chunks():

for page in scrape_result.pages:

scraped_page_id = create_scraped_page(

reference_doc_id=reference_doc_id,

url=page.url,

normalized_url=page.normalized_url,

title=page.title,

raw_content=page.content,

word_count=page.word_count,

scraped_at=page.scraped_at,

)

page_chunk_tuples = chunk_text(page.content)

if not page_chunk_tuples:

continue

chunk_texts = [t[0] for t in page_chunk_tuples]

embeddings = await generate_embeddings(chunk_texts)

chunks_with_embeddings = [

(chunk_texts[i], embeddings[i], page_chunk_tuples[i][1])

for i in range(len(chunk_texts))

]

create_page_chunks(scraped_page_id, chunks_with_embeddings)

return len(scrape_result.pages)

page_count = _run_async_with_cleanup(_index_pages_and_chunks())

typer.echo(f" ✓ Indexed {page_count} pages with embeddings")

except Exception as e:

typer.echo(

typer.style(

f" ⚠ Page indexing failed (search_pages tool may be incomplete or empty): {e}",

fg=typer.colors.YELLOW,

),

err=True,

)

Copilot · 2026-01-29T23:12:16Z

src/cli/setup_cli.py

+                async def _index_pages_and_chunks():
+                    for page in scrape_result.pages:
+                        scraped_page_id = create_scraped_page(
+                            reference_doc_id=reference_doc_id,
+                            url=page.url,
+                            normalized_url=page.normalized_url,
+                            title=page.title,
+                            raw_content=page.content,
+                            word_count=page.word_count,
+                            scraped_at=page.scraped_at,
+                        )
+                        page_chunk_tuples = chunk_text(page.content)
+                        if not page_chunk_tuples:
+                            continue
+                        chunk_texts = [t[0] for t in page_chunk_tuples]
+                        embeddings = await generate_embeddings(chunk_texts)
+                        chunks_with_embeddings = [
+                            (chunk_texts[i], embeddings[i], page_chunk_tuples[i][1])
+                            for i in range(len(chunk_texts))
+                        ]
+                        create_page_chunks(scraped_page_id, chunks_with_embeddings)
+                    return len(scrape_result.pages)


The page indexing implementation (_index_pages_and_chunks) is duplicated in both the existing-doc resume path and the new-doc path. This duplication makes it easy for the two flows to drift (e.g., different chunking/embedding behavior, different error handling). Consider extracting a single helper (e.g., index_scrape_result(reference_doc_id, scrape_result)) and calling it from both branches.

Copilot · 2026-01-29T23:12:16Z

src/config.py

+    embedding_dimensions: int = Field(
+        default=1536,
+        description="Embedding vector dimension (matches text-embedding-3-small)",
+    )


embedding_dimensions is introduced in settings, but the rest of the implementation hard-codes 1536 in the DB schema and tests, and the runtime code doesn’t validate that the embedder actually returns vectors of this size. This creates a sharp edge if embedding_model is changed (or if the gateway returns a different dimension): inserts/search RPC casts can start failing at runtime. Either (a) validate embedding lengths against settings.embedding_dimensions in generate_embeddings/embed_query and keep schema in sync, or (b) remove embedding_dimensions to avoid implying it’s configurable.

Suggested change

embedding_dimensions: int = Field(

default=1536,

description="Embedding vector dimension (matches text-embedding-3-small)",

)

Copilot · 2026-01-29T23:12:17Z

tests/unit/test_embedding_service.py

+    async def test_generate_embeddings_returns_vectors(self):
+        """generate_embeddings returns list of vectors (list of floats)."""
+        mock_result = MagicMock()
+        mock_result.embeddings = [[0.1] * 1536, [0.2] * 1536]
+        with patch(
+            "src.services.embedding_service.Embedder"
+        ) as mock_embedder_class:
+            mock_embedder = MagicMock()
+            mock_embedder.embed_documents = AsyncMock(return_value=mock_result)
+            mock_embedder_class.return_value = mock_embedder
+            result = await generate_embeddings(["text one", "text two"])
+        assert len(result) == 2


This test patches Embedder but not get_settings(). generate_embeddings() calls get_settings() for embedding_model, and Settings has required fields that won’t be present in unit test env by default. Patch src.services.embedding_service.get_settings (as you do in test_generate_embeddings_calls_embed_documents) or take a mock_settings fixture dependency so the test doesn’t raise a settings validation error.

Copilot · 2026-01-29T23:12:17Z

tests/unit/test_embedding_service.py

+    async def test_embed_query_returns_vector(self):
+        """embed_query returns a single vector."""
+        mock_result = MagicMock()
+        mock_result.embeddings = [[0.5] * 1536]
+        with patch("src.services.embedding_service.Embedder") as mock_embedder_class:
+            mock_embedder = MagicMock()
+            mock_embedder.embed_query = AsyncMock(return_value=mock_result)
+            mock_embedder_class.return_value = mock_embedder
+            result = await embed_query("search query")
+        assert len(result) == 1536
+        assert result[0] == 0.5
+
+    @pytest.mark.asyncio
+    async def test_embed_query_no_embeddings_returns_empty(self):
+        """When embed_query returns no embeddings, return empty list."""
+        mock_result = MagicMock()
+        mock_result.embeddings = []
+        with patch("src.services.embedding_service.Embedder") as mock_embedder_class:
+            mock_embedder = MagicMock()
+            mock_embedder.embed_query = AsyncMock(return_value=mock_result)
+            mock_embedder_class.return_value = mock_embedder
+            result = await embed_query("query")


These tests patch Embedder but not get_settings(). embed_query() calls get_settings() to choose the embedding model, and will fail in unit tests unless get_settings is patched (or a mock_settings fixture is used). Please patch src.services.embedding_service.get_settings in these test cases so they don’t depend on real environment variables.

Copilot · 2026-01-29T23:12:17Z

src/db/repository.py

+def create_scraped_page(
+    reference_doc_id: str,
+    url: str,
+    normalized_url: str,
+    title: str,
+    raw_content: str,
+    word_count: int,
+    scraped_at: datetime,
+) -> str:


New repository surface area was added (create_scraped_page, create_page_chunks, search_page_chunks, get_scraped_pages_by_reference_doc) but there are no corresponding unit tests, while other functions in this module are covered (see tests/unit/test_repository.py). Please add tests that validate the expected Supabase calls/parameters and basic behaviors (e.g., empty results handling in search_page_chunks/get_scraped_pages_by_reference_doc).

Copilot · 2026-01-29T23:12:17Z

src/services/agent_service.py

+        @self.agent.tool
+        async def search_pages(
+            ctx: RunContext[MessengerAgentDeps], query: str
+        ) -> str:
+            """Search scraped website pages for specific information.
+
+            Use this when you need to find detailed information that may not be
+            in the overview, such as specific policies, contact details, or
+            facts about particular topics.
+            """


The new search_pages tool enables using scraped-page content beyond the reference doc, but the system prompt (prompts/agent_system_instructions.md) currently says “Use ONLY the following reference document as your source of truth” and to escalate when something isn’t covered. As-is, the agent may avoid calling this tool, or if it does call it it may violate its own instructions. Please update the agent’s system prompt/guardrails to explicitly allow using search_pages as an additional (untrusted) source, and include guidance to ignore any instructions found in scraped content and to avoid exposing page URLs/citations directly to end users (since the tool currently returns [Source: ...]).

jreakin · 2026-01-29T23:55:12Z

Merge activity

Jan 29, 11:55 PM UTC: A user started a stack merge that includes this pull request via Graphite.
Jan 30, 12:12 AM UTC: Graphite rebased this pull request as part of a merge.
Jan 30, 12:12 AM UTC: @jreakin merged this pull request with Graphite.

notion-workspace · 2026-03-15T04:20:05Z

Created web search agent tools for more detailed questions asked

jreakin mentioned this pull request Jan 29, 2026

Updated CLI walkthrough for setting up with Facebook #8

Merged

jreakin marked this pull request as ready for review January 29, 2026 23:01

Copilot AI review requested due to automatic review settings January 29, 2026 23:01

Copilot started reviewing on behalf of jreakin January 29, 2026 23:01 View session

Copilot AI reviewed Jan 29, 2026

View reviewed changes

jreakin mentioned this pull request Jan 29, 2026

Refactored to reduce coupling and redundency to maintain DRY more robustly. #10

Merged

jreakin changed the base branch from facebook-cli-setup-enhancement to graphite-base/9 January 30, 2026 00:09

jreakin changed the base branch from graphite-base/9 to main January 30, 2026 00:10

Created web search agent tools for more detailed questions asked

ad895ce

jreakin force-pushed the agent-search-tool-creation branch from b297e9b to ad895ce Compare January 30, 2026 00:11

jreakin merged commit 97aad2a into main Jan 30, 2026
6 checks passed

jreakin deleted the agent-search-tool-creation branch January 30, 2026 00:19

Conversation

jreakin commented Jan 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TL;DR

What changed?

How to test?

Why make this change?

Uh oh!

coderabbitai bot commented Jan 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Uh oh!

jreakin commented Jan 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sentry bot commented Jan 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

jreakin commented Jan 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merge activity

Uh oh!

Uh oh!

notion-workspace bot commented Mar 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jreakin commented Jan 29, 2026 •

edited

Loading

coderabbitai bot commented Jan 29, 2026 •

edited

Loading

jreakin commented Jan 29, 2026 •

edited

Loading

sentry bot commented Jan 29, 2026 •

edited

Loading

jreakin commented Jan 29, 2026 •

edited

Loading