
Externalize prompts: Added src/prompts/ with agent_prompts.py and copilot_prompts.py. build_agent_system_prompt(tone, reference_doc, max_chars) and build_synthesis_system_prompt() / build_synthesis_user_prompt() are used by agent_service and copilot_service. #5

Merged
jreakin merged 1 commit into main from class-refactor-added-supa on Jan 30, 2026

Conversation

@jreakin
Member

@jreakin jreakin commented Jan 28, 2026

Enhanced Scraper and Database Migrations

This PR adds several key improvements to the project:

  1. Improved Website Scraper:

    • Added browser fallback for sites with Cloudflare/bot protection
    • Multi-page crawling for richer reference documents
    • Chrome version configuration option for compatibility
  2. Database Migrations:

    • Added migration system with src/db/migrate.py
    • New migration to make bot_id nullable in reference_documents table
    • Added DATABASE_URL configuration for direct Postgres connections
  3. Setup CLI Improvements:

    • Resume capability: if you restart setup with the same URL, it continues from where you left off
    • URL normalization for consistent lookups
    • Better error handling and user feedback
  4. PydanticAI Gateway Integration:

    • Removed deprecated CopilotService in favor of PydanticAI Gateway
    • Updated agent and reference document builders to use PydanticAI
    • Enhanced reference document model with detailed_content field
  5. Test Fixes:

    • Added TEST_FIX_PLAN.md documenting test issues and solutions
    • Fixed Facebook service tests to use request.content instead of request.read()
    • Added proper mocking of settings with pydantic_ai_gateway_api_key
    • Improved logfire mocking to prevent test failures
  6. Dependency Updates:

    • Added undetected-chromedriver for browser automation
    • Added psycopg for database migrations
    • Added setuptools for compatibility

This PR completes the migration to PydanticAI Gateway and improves the robustness of the scraper and setup process.

@vercel

vercel bot commented Jan 28, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: go-crea-fb-msg-fast-api
Deployment: Error
Review: Error
Updated (UTC): Jan 29, 2026 0:21am


@coderabbitai

coderabbitai bot commented Jan 28, 2026

Warning

Rate limit exceeded

@jreakin has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 23 minutes and 56 seconds before requesting another review.


Member Author

jreakin commented Jan 28, 2026

@sentry

sentry bot commented Jan 28, 2026

Codecov Report

❌ Patch coverage is 59.27152% with 123 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/services/scraper.py 53.17% 52 Missing and 7 partials ⚠️
src/db/migrate.py 0.00% 34 Missing ⚠️
src/cli/setup_cli.py 76.14% 24 Missing and 2 partials ⚠️
src/services/agent_service.py 80.00% 1 Missing and 2 partials ⚠️
src/middleware/correlation_id.py 83.33% 0 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!


Copilot AI left a comment


Pull request overview

This PR implements a significant migration from a custom CopilotService to PydanticAI Gateway, along with substantial improvements to the website scraper and database migration system. The changes enable better AI agent functionality, multi-page web crawling with Cloudflare bypass, and database schema evolution.

Changes:

  • Migrated from CopilotService to PydanticAI Gateway for LLM interactions, updating all agent and reference document builders
  • Enhanced website scraper with multi-page crawling (up to 20 pages), browser fallback for bot-protected sites, and improved content extraction
  • Added database migration system with psycopg for schema evolution and setup CLI resume capability

Reviewed changes

Copilot reviewed 29 out of 30 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
uv.lock Added dependencies: psycopg, undetected-chromedriver, selenium, setuptools, and related packages
src/services/scraper.py Implemented multi-page crawling with browser fallback for Cloudflare/bot protection
src/services/reference_doc.py Migrated to PydanticAI Agent API, added detailed_content field
src/services/agent_service.py Updated for PydanticAI output_type and dynamic system prompt registration
src/db/migrate.py New migration runner script with helpful error messages
src/db/repository.py Added get_reference_document_by_source_url for resume capability
src/cli/setup_cli.py Implemented resume capability and URL normalization
src/config.py Added database_url field and .env.local support
src/middleware/correlation_id.py Updated logfire integration to use span method
migrations/003_reference_doc_nullable_bot_id.sql Makes bot_id nullable for setup flow
pyproject.toml Added new dependencies with version constraints
tests/* Comprehensive test updates for PydanticAI migration and mock improvements


Comment on lines +179 to +218
    while to_visit and len(visited) < max_pages:
        current = to_visit.pop(0)
        current_normalized = _normalize_url_for_crawl(current)
        if current_normalized in visited:
            continue
        visited.add(current_normalized)

        try:
            html = await _fetch_one_page(current, headers)
        except ValueError:
            if not page_texts:
                raise
            logfire.warning("Skipping page after fetch error", url=current)
            continue

        text, new_links = _parse_page_text_and_links(html, current)
        # If first page has very little text (likely JS-rendered SPA), refetch with browser
        if len(visited) == 1 and len(text.split()) < 400:
            logfire.info(
                "First page has little text, refetching with browser (likely JS-rendered)",
                url=current,
                word_count=len(text.split()),
            )
            try:
                html = await asyncio.to_thread(_fetch_with_browser_sync, current, 45.0)
                text, new_links = _parse_page_text_and_links(html, current)
            except Exception as e:
                logfire.warning("Browser refetch failed, using initial content", url=current, error=str(e))
        if text:
            page_texts.append(text)

        for link in new_links:
            link_norm = _normalize_url_for_crawl(link)
            if link_norm not in visited and link_norm not in in_queue:
                in_queue.add(link_norm)
                to_visit.append(link)

        # Polite delay between requests
        if to_visit and len(visited) < max_pages:
            await asyncio.sleep(0.5)

Copilot AI Jan 28, 2026


The scraper could get stuck in an infinite loop if a website has circular redirects or if link extraction consistently produces URLs that normalize to visited URLs after the check on line 212. While the max_pages limit provides some protection, consider adding a timeout mechanism or a maximum runtime limit to prevent the scraper from running indefinitely in edge cases.

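As a sketch of the suggested safeguard, the crawl loop could check a wall-clock deadline in addition to the page cap. The function name and the 120-second budget below are illustrative, not from the PR:

```python
import time

MAX_CRAWL_SECONDS = 120.0  # assumed budget, not the project's actual value

def crawl_with_deadline(to_visit, fetch, max_pages=20, budget=MAX_CRAWL_SECONDS):
    """Visit queued URLs until the page cap or the time budget is exhausted."""
    deadline = time.monotonic() + budget
    visited = set()
    pages = []
    while to_visit and len(visited) < max_pages:
        if time.monotonic() > deadline:
            break  # give up gracefully instead of looping indefinitely
        url = to_visit.pop(0)
        if url in visited:
            continue
        visited.add(url)
        pages.append(fetch(url))
    return pages
```

The break on the deadline means circular redirects or re-queued URLs can at worst waste the time budget, never hang the setup CLI.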
Comment on lines +43 to +47
with conn.cursor() as cur:
    for path in sql_files:
        sql = path.read_text()
        print(f"Applying {path.name}...")
        cur.execute(sql)

Copilot AI Jan 28, 2026


The migration system lacks idempotency tracking. Running this script multiple times will re-execute all migrations, which could fail or cause data inconsistencies. Consider adding a schema_migrations table to track which migrations have been applied, and check it before executing each migration. This is a standard pattern in database migration systems (e.g., Django, Alembic, Flyway).

Suggested change
with conn.cursor() as cur:
    for path in sql_files:
        sql = path.read_text()
        print(f"Applying {path.name}...")
        cur.execute(sql)
with conn.cursor() as cur:
    # Ensure schema_migrations table exists to track applied migrations
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS schema_migrations (
            filename text PRIMARY KEY,
            applied_at timestamptz NOT NULL DEFAULT now()
        )
        """
    )
    # Load already applied migrations into a set for quick lookup
    cur.execute("SELECT filename FROM schema_migrations")
    applied_migrations = {row[0] for row in cur.fetchall()}
    for path in sql_files:
        if path.name in applied_migrations:
            print(f"Skipping {path.name}, already applied.")
            continue
        sql = path.read_text()
        print(f"Applying {path.name}...")
        cur.execute(sql)
        # Record successful application of this migration
        cur.execute(
            """
            INSERT INTO schema_migrations (filename)
            VALUES (%s)
            ON CONFLICT (filename) DO NOTHING
            """,
            (path.name,),
        )

Comment on lines +3 to +10
import os
os.environ.setdefault("LOGFIRE_IGNORE_NO_CONFIG", "1")

from pathlib import Path
_project_root = Path(__file__).resolve().parent.parent.parent
from dotenv import load_dotenv
load_dotenv(_project_root / ".env")
load_dotenv(_project_root / ".env.local")

Copilot AI Jan 28, 2026


Setting environment variables and loading .env files at module level (lines 3-10) can cause issues with test isolation and makes the module have side effects when imported. This is particularly problematic because these side effects happen before the imports below. Consider moving this initialization into the CLI command function or using a dedicated initialization function that's called explicitly from main.

Fetch one URL. Tries httpx first; on 403/503 falls back to undetected Chrome.
Raises ValueError on failure.
"""
html: str | None = None

Copilot AI Jan 28, 2026


The type hint str | None (PEP 604 union syntax) requires Python 3.10+. Same for tuple[str, List[str]] on line 130. If your project supports Python 3.9, you'll need to use Optional[str] and Tuple[str, List[str]] from typing instead. Verify the minimum Python version in pyproject.toml.

Comment on lines +54 to 66
        # Add to Logfire span for automatic correlation (if available)
        # Use span with correlation_id in attributes for tracing
        span_method = getattr(logfire, "span", None)
        if span_method:
            with span_method("request", correlation_id=correlation_id):
                response = await call_next(request)
                response.headers[self.header_name] = correlation_id
                return response
        else:
            # Fallback if logfire.span is not available (e.g., in tests)
            response = await call_next(request)

        # Add correlation ID to response headers
        response.headers[self.header_name] = correlation_id

        return response

Copilot AI Jan 28, 2026


Code duplication between the span and fallback branches. Both branches execute identical code after the context manager. Consider refactoring to avoid duplication, for example by using a no-op context manager when span is not available, or by extracting the common logic into a separate function.

common_questions: list[str] = Field(..., description="Anticipated FAQs")
important_details: str = Field(..., description="Critical information to remember")
detailed_content: str = Field(
    "",

Copilot AI Jan 28, 2026


The description mentions that 'detailed_content' is added to the reference document model, but this field has a default value of empty string. Consider whether this default is appropriate - if this field is meant to capture important detailed content, an empty default might result in less comprehensive reference documents if the LLM doesn't populate it. Consider making it required or adjusting the prompt to ensure this field is always populated with meaningful content.

Suggested change
"",
...,

Comment on lines +84 to +90
driver = uc.Chrome(**kwargs)
try:
    driver.set_page_load_timeout(timeout_seconds)
    driver.get(url)
    return driver.page_source
finally:
    driver.quit()

Copilot AI Jan 28, 2026


The driver cleanup in the finally block may not execute if the process is forcefully terminated (SIGKILL). Additionally, undetected_chromedriver can leave zombie Chrome processes if cleanup fails. Consider adding a context manager wrapper or using a process cleanup mechanism to ensure Chrome processes are terminated even if the Python process exits unexpectedly. Also consider setting a timeout on the Chrome process itself.

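One possible shape for the context manager wrapper suggested here, sketched with a factory argument so it is testable without a real browser; this is a hedged sketch, not the merged code:

```python
from contextlib import contextmanager

@contextmanager
def chrome_driver(factory, timeout_seconds: float = 45.0):
    """Yield a driver from factory() and guarantee cleanup on exit."""
    driver = factory()
    try:
        driver.set_page_load_timeout(timeout_seconds)
        yield driver
    finally:
        try:
            driver.quit()
        except Exception:
            pass  # quit() can itself fail if Chrome already died

# Usage (hypothetical): with chrome_driver(lambda: uc.Chrome(**kwargs)) as d: ...
```

This does not survive SIGKILL (nothing in-process can), but it ensures quit() runs on every normal exit path and swallows cleanup failures instead of masking the original exception.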

def _extract_same_domain_links(soup: BeautifulSoup, current_url: str) -> List[str]:
    """Return absolute same-domain URLs from <a href>, normalized and deduped. Skips non-HTML resources."""
    seen: set[str] = set()

Copilot AI Jan 28, 2026


The type hint set[str] (lowercase set) requires Python 3.9+. While this may be acceptable for your project, consider verifying the minimum Python version is documented and enforced in pyproject.toml. If supporting Python 3.8, use Set[str] from typing instead.

if not url or not url.strip():
    return url
u = url.strip()
return u.rstrip("/") if u != "https://" and u != "http://" else u

Copilot AI Jan 28, 2026


The URL normalization function has an edge case issue. When the input is exactly "https://" or "http://", the function returns it unchanged. However, these are invalid URLs that will likely cause failures downstream in the scraping logic. Consider either raising a ValueError for invalid URLs or handling them more explicitly before they reach the normalization function.

Suggested change
return u.rstrip("/") if u != "https://" and u != "http://" else u
if u in ("https://", "http://"):
    raise ValueError("Invalid URL: missing hostname in 'http://' or 'https://'.")
return u.rstrip("/")

if version_main is not None:
    try:
        kwargs["version_main"] = int(version_main)
    except ValueError:

Copilot AI Jan 28, 2026


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
except ValueError:
except ValueError:
    # Ignore invalid CHROME_VERSION_MAIN; allow undetected_chromedriver to auto-detect version.
    pass

Member Author

jreakin commented Jan 29, 2026

Merge activity

  • Jan 29, 11:55 PM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Jan 30, 12:02 AM UTC: Graphite rebased this pull request as part of a merge.
  • Jan 30, 12:03 AM UTC: @jreakin merged this pull request with Graphite.

@jreakin jreakin changed the base branch from integrate-pydanticai-gateway to graphite-base/5 January 30, 2026 00:00
@jreakin jreakin changed the base branch from graphite-base/5 to main January 30, 2026 00:01
Externalize prompts
  • Added src/prompts/ with agent_prompts.py and copilot_prompts.py.
  • build_agent_system_prompt(tone, reference_doc, max_chars) and build_synthesis_system_prompt() / build_synthesis_user_prompt() are used by agent_service and copilot_service.
  • Added tests/unit/test_prompts.py.
Configurable settings
  • In Settings: chunk_target_words, confidence_threshold, max_response_chars, llm_timeout_seconds.
  • Wired into scraper, agent_service, and copilot_service.
  • Documented in .env.example.
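The configurable settings listed above could be sketched as a small stdlib-only settings object. The defaults and environment variable names below are illustrative assumptions, not the project's actual values:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class ScraperSettings:
    chunk_target_words: int = 650       # target size for scraper text chunks
    confidence_threshold: float = 0.7   # below this, the agent hedges or hands off
    max_response_chars: int = 2000      # cap on outgoing reply length
    llm_timeout_seconds: float = 30.0   # per-request LLM timeout

    @classmethod
    def from_env(cls) -> "ScraperSettings":
        """Read overrides from the environment, falling back to the defaults."""
        d = cls()
        return cls(
            chunk_target_words=int(os.getenv("CHUNK_TARGET_WORDS", d.chunk_target_words)),
            confidence_threshold=float(os.getenv("CONFIDENCE_THRESHOLD", d.confidence_threshold)),
            max_response_chars=int(os.getenv("MAX_RESPONSE_CHARS", d.max_response_chars)),
            llm_timeout_seconds=float(os.getenv("LLM_TIMEOUT_SECONDS", d.llm_timeout_seconds)),
        )
```

The real project wires these through its pydantic Settings class; the sketch just shows the env-override-with-default pattern the .env.example documentation implies.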
Replace print with logfire
  • In src/main.py, replaced the print(...) with logfire.warn(...) and structured fields (enabled, base_url).
OpenAI fallback
  • Skipped per plan; fallbacks left to PydanticAI Gateway.
Repository → class-based
  • BotRepository in src/db/repository.py with __init__(client) and methods: create_reference_document, create_bot_configuration, get_bot_configuration_by_page_id, get_reference_document, save_message_history, plus _link_reference_document_to_bot.
  • get_bot_repository() factory that uses get_supabase_client().
Call sites updated
  • setup_cli: uses get_bot_repository() and repo.create_reference_document / repo.create_bot_configuration.
  • main: sets app.state.repository = get_bot_repository() in lifespan.
Supabase client caching
  • get_supabase_client() is cached with @lru_cache(maxsize=1).
Tests updated
  • tests/unit/test_repository.py: uses BotRepository(mock_client) and instance methods.
  • tests/unit/test_setup_cli.py: patches get_bot_repository, mock repo with create_reference_document / create_bot_configuration.
  • tests/unit/test_logging.py: BotRepository(mock_supabase_client) and repo methods for logging tests.
  • tests/integration/test_repository_db.py: BotRepository(mock_client) and repo methods.
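The factory-plus-cache pattern described above can be sketched as follows; the client here is a stand-in object, not the real Supabase client, and the repository method body is illustrative:

```python
from functools import lru_cache

class BotRepository:
    """Class-based repository holding an injected client."""
    def __init__(self, client):
        self._client = client

    def get_reference_document(self, doc_id: str):
        # In the real repository this queries the reference_documents table.
        return self._client.fetch("reference_documents", doc_id)

@lru_cache(maxsize=1)
def get_supabase_client():
    """Cached client: every caller shares one instance."""
    return object()  # stand-in for the real Supabase client construction

def get_bot_repository() -> BotRepository:
    return BotRepository(get_supabase_client())
```

Because get_supabase_client() is cached with maxsize=1, repeated get_bot_repository() calls produce repositories that share one client, while tests can still construct BotRepository(mock_client) directly.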
@jreakin jreakin force-pushed the class-refactor-added-supa branch from 212244f to 6a5a270 Compare January 30, 2026 00:02
@jreakin jreakin merged commit 13e39e9 into main Jan 30, 2026
6 checks passed
Comment on lines +57 to +63
def _load_system_prompt_template(self) -> str:
    """Load system prompt from prompts/agent_system_instructions.md."""
    if not _AGENT_SYSTEM_PROMPT_PATH.exists():
        raise FileNotFoundError(
            f"Agent system prompt not found: {_AGENT_SYSTEM_PROMPT_PATH}"
        )
    return _AGENT_SYSTEM_PROMPT_PATH.read_text(encoding="utf-8")


Bug: The agent's system prompt file is located outside the src directory and is not included in the package, which will cause a FileNotFoundError in production.
Severity: CRITICAL

Suggested Fix

Move the prompts directory inside the src directory (e.g., to src/prompts) so it is included in the package distribution. Alternatively, update the packaging configuration in pyproject.toml or MANIFEST.in to explicitly include the top-level prompts directory in the built wheel.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: src/services/agent_service.py#L57-L63

Potential issue: The `_load_system_prompt_template` method loads the agent's system
prompt from `prompts/agent_system_instructions.md`. The path is calculated relative to
the `src` directory, assuming the `prompts` directory is at the project root. However,
the packaging configuration only includes the `src` directory in the final package. As a
result, the prompt file will be missing in a deployed environment. This will raise a
`FileNotFoundError` and crash the agent when it processes its first message.


Comment on lines +66 to +73
def _fetch_with_browser_sync(url: str, timeout_seconds: float = 30.0) -> str:
    """
    Fetch page HTML using undetected headless Chrome (bypasses Cloudflare/bot blocks).
    Runs synchronously; call via asyncio.to_thread() from async code.
    Set CHROME_VERSION_MAIN to your Chrome major version (e.g. 143) if you see
    "This version of ChromeDriver only supports Chrome version X" and your Chrome is different.
    """
    import undetected_chromedriver as uc


Bug: The scraper's browser fallback will crash if Chrome is not installed in the deployment environment because the call is not wrapped in a try-except block.
Severity: HIGH

Suggested Fix

Wrap the call to the browser fallback mechanism within a try-except block to gracefully handle exceptions that occur when undetected_chromedriver fails to initialize, for instance, due to a missing Chrome installation. Log the error and allow the process to continue with the original failed content instead of crashing.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: src/services/scraper.py#L66-L73

Potential issue: The scraper service uses `undetected_chromedriver` as a fallback
mechanism when a request fails with a 403/503 status code. This library requires Google
Chrome to be installed on the system. The deployment environment likely lacks a Chrome
installation. The call to the browser fallback in `_fetch_one_page` is not wrapped in a
`try-except` block, so if Chrome is missing, an unhandled exception will be raised,
crashing the `setup` CLI command when it encounters a bot-protected site.
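The suggested guard could take roughly this shape; fetch_with_fallback and its parameters are hypothetical names for illustration, not the scraper's real signatures:

```python
import asyncio

async def fetch_with_fallback(url: str, fetch_http, fetch_browser_sync) -> str:
    """Try plain HTTP first; fall back to headless Chrome, tolerating its absence."""
    try:
        return await fetch_http(url)
    except ValueError as http_err:  # e.g. 403/503 from bot protection
        try:
            # Run the synchronous browser fetch off the event loop
            return await asyncio.to_thread(fetch_browser_sync, url)
        except Exception as browser_err:
            # Chrome may simply not be installed in this environment
            raise ValueError(
                f"HTTP fetch failed ({http_err}) and browser fallback "
                f"unavailable ({browser_err})"
            ) from browser_err
```

The point is that a missing Chrome installation surfaces as the scraper's normal ValueError (which the crawl loop already skips past for non-first pages) instead of an unhandled crash.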


    tone=tone,
    recent_messages=[],
)
agent = MessengerAgentService()


Bug: The setup_cli test command crashes if the PYDANTIC_AI_GATEWAY_API_KEY environment variable is not set, due to an unhandled ValidationError during agent initialization.
Severity: HIGH

Suggested Fix

Wrap the MessengerAgentService() instantiation within the _run_test_repl function in a try-except block. Catch the pydantic.ValidationError and print a user-friendly error message instructing the user to set the required PYDANTIC_AI_GATEWAY_API_KEY environment variable in their .env file, then exit gracefully.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: src/cli/setup_cli.py#L134

Potential issue: The `setup` CLI's test REPL mode, invoked via the `test` command,
instantiates `MessengerAgentService` directly. This service's initialization depends on
`get_settings()`, which validates that the `pydantic_ai_gateway_api_key` environment
variable is set. If a user runs the test command without this variable configured in
their `.env` file, the application will crash with a Pydantic `ValidationError`. The
agent instantiation in `_run_test_repl` is not wrapped in a `try-catch` block to handle
this configuration error gracefully.
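A sketch of the proposed guard; run_test_repl and make_agent are illustrative stand-ins for _run_test_repl and the MessengerAgentService constructor:

```python
import sys

def run_test_repl(make_agent, err=sys.stderr) -> int:
    """Build the agent, printing a friendly hint instead of a traceback on config errors."""
    try:
        agent = make_agent()
    except Exception as exc:  # e.g. pydantic ValidationError from get_settings()
        print(
            "Could not start the agent. Check that PYDANTIC_AI_GATEWAY_API_KEY "
            f"is set in your .env file. Details: {exc}",
            file=err,
        )
        return 1
    # ... run the REPL loop with `agent` ...
    return 0
```

Returning a nonzero exit code keeps the CLI scriptable while the printed hint points the user at the missing environment variable.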


@jreakin jreakin deleted the class-refactor-added-supa branch January 30, 2026 00:19