Externalize prompts: Added src/prompts/ with agent_prompts.py and copilot_prompts.py. build_agent_system_prompt(tone, reference_doc, max_chars) and build_synthesis_system_prompt() / build_synthesis_user_prompt() are used by agent_service and copilot_service.
Pull request overview
This PR implements a significant migration from a custom CopilotService to PydanticAI Gateway, along with substantial improvements to the website scraper and database migration system. The changes enable better AI agent functionality, multi-page web crawling with Cloudflare bypass, and database schema evolution.
Changes:
- Migrated from CopilotService to PydanticAI Gateway for LLM interactions, updating all agent and reference document builders
- Enhanced website scraper with multi-page crawling (up to 20 pages), browser fallback for bot-protected sites, and improved content extraction
- Added database migration system with psycopg for schema evolution and setup CLI resume capability
Reviewed changes
Copilot reviewed 29 out of 30 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| uv.lock | Added dependencies: psycopg, undetected-chromedriver, selenium, setuptools, and related packages |
| src/services/scraper.py | Implemented multi-page crawling with browser fallback for Cloudflare/bot protection |
| src/services/reference_doc.py | Migrated to PydanticAI Agent API, added detailed_content field |
| src/services/agent_service.py | Updated for PydanticAI output_type and dynamic system prompt registration |
| src/db/migrate.py | New migration runner script with helpful error messages |
| src/db/repository.py | Added get_reference_document_by_source_url for resume capability |
| src/cli/setup_cli.py | Implemented resume capability and URL normalization |
| src/config.py | Added database_url field and .env.local support |
| src/middleware/correlation_id.py | Updated logfire integration to use span method |
| migrations/003_reference_doc_nullable_bot_id.sql | Makes bot_id nullable for setup flow |
| pyproject.toml | Added new dependencies with version constraints |
| tests/* | Comprehensive test updates for PydanticAI migration and mock improvements |
```python
while to_visit and len(visited) < max_pages:
    current = to_visit.pop(0)
    current_normalized = _normalize_url_for_crawl(current)
    if current_normalized in visited:
        continue
    visited.add(current_normalized)

    try:
        html = await _fetch_one_page(current, headers)
    except ValueError:
        if not page_texts:
            raise
        logfire.warning("Skipping page after fetch error", url=current)
        continue

    text, new_links = _parse_page_text_and_links(html, current)
    # If first page has very little text (likely JS-rendered SPA), refetch with browser
    if len(visited) == 1 and len(text.split()) < 400:
        logfire.info(
            "First page has little text, refetching with browser (likely JS-rendered)",
            url=current,
            word_count=len(text.split()),
        )
        try:
            html = await asyncio.to_thread(_fetch_with_browser_sync, current, 45.0)
            text, new_links = _parse_page_text_and_links(html, current)
        except Exception as e:
            logfire.warning("Browser refetch failed, using initial content", url=current, error=str(e))
    if text:
        page_texts.append(text)

    for link in new_links:
        link_norm = _normalize_url_for_crawl(link)
        if link_norm not in visited and link_norm not in in_queue:
            in_queue.add(link_norm)
            to_visit.append(link)

    # Polite delay between requests
    if to_visit and len(visited) < max_pages:
        await asyncio.sleep(0.5)
```
The scraper could get stuck in an infinite loop if a website has circular redirects or if link extraction consistently produces URLs that normalize to visited URLs after the check on line 212. While the max_pages limit provides some protection, consider adding a timeout mechanism or a maximum runtime limit to prevent the scraper from running indefinitely in edge cases.
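A wall-clock budget alongside `max_pages` could look like the following sketch. `crawl_with_deadline` and `MAX_CRAWL_SECONDS` are hypothetical names, and `fetch` stands in for the real page-fetch helper; this is an illustration of the suggested safeguard, not the project's actual code.

```python
import time

# Assumed new constant: total time the crawl is allowed to run.
MAX_CRAWL_SECONDS = 120.0


def crawl_with_deadline(to_visit, visited, max_pages, fetch):
    """Drain the queue, but stop once the wall-clock budget is spent."""
    deadline = time.monotonic() + MAX_CRAWL_SECONDS
    pages = []
    while to_visit and len(visited) < max_pages:
        if time.monotonic() > deadline:
            break  # stop gracefully instead of crawling forever
        url = to_visit.pop(0)
        if url in visited:
            continue
        visited.add(url)
        pages.append(fetch(url))
    return pages
```

The deadline check runs once per iteration, so even a queue that keeps refilling itself (circular redirects, re-discovered links) terminates within the budget.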
```python
with conn.cursor() as cur:
    for path in sql_files:
        sql = path.read_text()
        print(f"Applying {path.name}...")
        cur.execute(sql)
```
The migration system lacks idempotency tracking. Running this script multiple times will re-execute all migrations, which could fail or cause data inconsistencies. Consider adding a schema_migrations table to track which migrations have been applied, and check it before executing each migration. This is a standard pattern in database migration systems (e.g., Django, Alembic, Flyway).
Suggested change:

```python
with conn.cursor() as cur:
    # Ensure schema_migrations table exists to track applied migrations
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS schema_migrations (
            filename text PRIMARY KEY,
            applied_at timestamptz NOT NULL DEFAULT now()
        )
        """
    )
    # Load already applied migrations into a set for quick lookup
    cur.execute("SELECT filename FROM schema_migrations")
    applied_migrations = {row[0] for row in cur.fetchall()}
    for path in sql_files:
        if path.name in applied_migrations:
            print(f"Skipping {path.name}, already applied.")
            continue
        sql = path.read_text()
        print(f"Applying {path.name}...")
        cur.execute(sql)
        # Record successful application of this migration
        cur.execute(
            """
            INSERT INTO schema_migrations (filename)
            VALUES (%s)
            ON CONFLICT (filename) DO NOTHING
            """,
            (path.name,),
        )
```
```python
import os
os.environ.setdefault("LOGFIRE_IGNORE_NO_CONFIG", "1")

from pathlib import Path
_project_root = Path(__file__).resolve().parent.parent.parent
from dotenv import load_dotenv
load_dotenv(_project_root / ".env")
load_dotenv(_project_root / ".env.local")
```
Setting environment variables and loading .env files at module level (lines 3-10) can cause issues with test isolation and makes the module have side effects when imported. This is particularly problematic because these side effects happen before the imports below. Consider moving this initialization into the CLI command function or using a dedicated initialization function that's called explicitly from main.
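One way to apply this suggestion is to move the initialization into an explicit function called from `main`. The sketch below assumes the hypothetical names `init_environment` and `main`; the `ImportError` fallback only exists so the snippet runs without python-dotenv installed.

```python
import os
from pathlib import Path
from typing import Optional

try:
    from dotenv import load_dotenv  # project dependency (python-dotenv)
except ImportError:
    def load_dotenv(path):  # no-op fallback so this sketch is self-contained
        return False


def init_environment(project_root: Optional[Path] = None) -> None:
    """Explicit initialization: called from main(), not at import time."""
    os.environ.setdefault("LOGFIRE_IGNORE_NO_CONFIG", "1")
    root = project_root or Path(__file__).resolve().parent
    load_dotenv(root / ".env")
    load_dotenv(root / ".env.local")


def main() -> None:
    init_environment()  # side effects happen here, not on import
```

With this shape, tests can import the module freely and call `init_environment` (or not) under their own control.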
```python
    Fetch one URL. Tries httpx first; on 403/503 falls back to undetected Chrome.
    Raises ValueError on failure.
    """
    html: str | None = None
```
The type hint str | None (PEP 604 union syntax) requires Python 3.10+. Same for tuple[str, List[str]] on line 130. If your project supports Python 3.9, you'll need to use Optional[str] and Tuple[str, List[str]] from typing instead. Verify the minimum Python version in pyproject.toml.
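For reference, the Python 3.9-compatible equivalents would use the `typing` aliases. The names below mirror the flagged annotations but are illustrative only:

```python
from typing import List, Optional, Tuple

# Instead of `html: str | None = None` (PEP 604, Python 3.10+):
html: Optional[str] = None


# Instead of `-> tuple[str, List[str]]`:
def parse(document: str) -> Tuple[str, List[str]]:
    return document, []
```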
```python
        # Add to Logfire span for automatic correlation (if available)
        # Use span with correlation_id in attributes for tracing
        span_method = getattr(logfire, "span", None)
        if span_method:
            with span_method("request", correlation_id=correlation_id):
                response = await call_next(request)
                response.headers[self.header_name] = correlation_id
                return response
        else:
            # Fallback if logfire.span is not available (e.g., in tests)
            response = await call_next(request)

            # Add correlation ID to response headers
            response.headers[self.header_name] = correlation_id

            return response
```
Code duplication between the span and fallback branches. Both branches execute identical code after the context manager. Consider refactoring to avoid duplication, for example by using a no-op context manager when span is not available, or by extracting the common logic into a separate function.
```python
    common_questions: list[str] = Field(..., description="Anticipated FAQs")
    important_details: str = Field(..., description="Critical information to remember")
    detailed_content: str = Field(
        "",
```
The description mentions that 'detailed_content' is added to the reference document model, but this field has a default value of empty string. Consider whether this default is appropriate - if this field is meant to capture important detailed content, an empty default might result in less comprehensive reference documents if the LLM doesn't populate it. Consider making it required or adjusting the prompt to ensure this field is always populated with meaningful content.
Suggested change:

```diff
-        "",
+        ...,
```
```python
    driver = uc.Chrome(**kwargs)
    try:
        driver.set_page_load_timeout(timeout_seconds)
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()
```
The driver cleanup in the finally block may not execute if the process is forcefully terminated (SIGKILL). Additionally, undetected_chromedriver can leave zombie Chrome processes if cleanup fails. Consider adding a context manager wrapper or using a process cleanup mechanism to ensure Chrome processes are terminated even if the Python process exits unexpectedly. Also consider setting a timeout on the Chrome process itself.
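A context-manager wrapper along these lines could cover both normal cleanup and interpreter shutdown. `managed_driver` is a hypothetical helper; `make_driver` stands in for `lambda: uc.Chrome(**kwargs)`:

```python
import atexit
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Yield a driver, always quitting it afterwards; also register an
    atexit hook as a last resort for abnormal interpreter shutdown."""
    driver = make_driver()
    atexit.register(driver.quit)
    try:
        yield driver
    finally:
        driver.quit()
        atexit.unregister(driver.quit)
```

Note this still cannot survive SIGKILL; for that, an external supervisor or a periodic sweep of orphaned Chrome processes would be needed.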
```python
def _extract_same_domain_links(soup: BeautifulSoup, current_url: str) -> List[str]:
    """Return absolute same-domain URLs from <a href>, normalized and deduped. Skips non-HTML resources."""
    seen: set[str] = set()
```
The type hint set[str] (lowercase set) requires Python 3.9+. While this may be acceptable for your project, consider verifying the minimum Python version is documented and enforced in pyproject.toml. If supporting Python 3.8, use Set[str] from typing instead.
```python
    if not url or not url.strip():
        return url
    u = url.strip()
    return u.rstrip("/") if u != "https://" and u != "http://" else u
```
The URL normalization function has an edge case issue. When the input is exactly "https://" or "http://", the function returns it unchanged. However, these are invalid URLs that will likely cause failures downstream in the scraping logic. Consider either raising a ValueError for invalid URLs or handling them more explicitly before they reach the normalization function.
Suggested change:

```diff
-    return u.rstrip("/") if u != "https://" and u != "http://" else u
+    if u in ("https://", "http://"):
+        raise ValueError("Invalid URL: missing hostname in 'http://' or 'https://'.")
+    return u.rstrip("/")
```
```python
    if version_main is not None:
        try:
            kwargs["version_main"] = int(version_main)
        except ValueError:
```
The `except` clause does nothing but `pass`, and there is no explanatory comment.
Suggested change:

```diff
         except ValueError:
+            # Ignore invalid CHROME_VERSION_MAIN; allow undetected_chromedriver to auto-detect version.
```
Force-pushed: cbcf37f → 212244f (Compare)
- **Externalize prompts**: Added src/prompts/ with agent_prompts.py and copilot_prompts.py. build_agent_system_prompt(tone, reference_doc, max_chars) and build_synthesis_system_prompt() / build_synthesis_user_prompt() are used by agent_service and copilot_service. Added tests/unit/test_prompts.py.
- **Configurable settings**: In Settings: chunk_target_words, confidence_threshold, max_response_chars, llm_timeout_seconds. Wired into scraper, agent_service, and copilot_service. Documented in .env.example.
- **Replace print with logfire**: In src/main.py, replaced the print(...) with logfire.warn(...) and structured fields (enabled, base_url).
- **OpenAI fallback**: Skipped per plan; fallbacks are left to PydanticAI Gateway.
- **Repository → class-based**: BotRepository in src/db/repository.py with __init__(client) and methods create_reference_document, create_bot_configuration, get_bot_configuration_by_page_id, get_reference_document, save_message_history, plus _link_reference_document_to_bot. A get_bot_repository() factory uses get_supabase_client().
- **Call sites updated**: setup_cli uses get_bot_repository() and repo.create_reference_document / repo.create_bot_configuration; main sets app.state.repository = get_bot_repository() in lifespan.
- **Supabase client caching**: get_supabase_client() is cached with @lru_cache(maxsize=1).
- **Tests updated**: tests/unit/test_repository.py uses BotRepository(mock_client) and instance methods; tests/unit/test_setup_cli.py patches get_bot_repository with a mock repo exposing create_reference_document / create_bot_configuration; tests/unit/test_logging.py uses BotRepository(mock_supabase_client) and repo methods for logging tests; tests/integration/test_repository_db.py uses BotRepository(mock_client) and repo methods.
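The Supabase client caching pattern mentioned above can be sketched as follows. `SupabaseClient` and the URL are stand-ins for the project's real client and settings:

```python
from functools import lru_cache


class SupabaseClient:
    """Stand-in for the real Supabase client."""
    def __init__(self, url: str):
        self.url = url


@lru_cache(maxsize=1)
def get_supabase_client() -> SupabaseClient:
    # Constructed once; every subsequent call returns the same instance.
    return SupabaseClient("https://example.supabase.co")
```

`maxsize=1` is enough because the function takes no arguments, so there is exactly one cache entry, and repeated calls never rebuild the client.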
Force-pushed: 212244f → 6a5a270 (Compare)
```python
    def _load_system_prompt_template(self) -> str:
        """Load system prompt from prompts/agent_system_instructions.md."""
        if not _AGENT_SYSTEM_PROMPT_PATH.exists():
            raise FileNotFoundError(
                f"Agent system prompt not found: {_AGENT_SYSTEM_PROMPT_PATH}"
            )
        return _AGENT_SYSTEM_PROMPT_PATH.read_text(encoding="utf-8")
```
Bug: The agent's system prompt file is located outside the src directory and is not included in the package, which will cause a FileNotFoundError in production.
Severity: CRITICAL
Suggested Fix
Move the prompts directory inside the src directory (e.g., to src/prompts) so it is included in the package distribution. Alternatively, update the packaging configuration in pyproject.toml or MANIFEST.in to explicitly include the top-level prompts directory in the built wheel.
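If the prompts are moved inside the package, `importlib.resources` makes the lookup work both from a checkout and from an installed wheel. `load_text_resource` is a hypothetical helper; in the real code the package would be something like `prompts` and the resource `agent_system_instructions.md`:

```python
from importlib import resources


def load_text_resource(package: str, name: str) -> str:
    """Read a text resource shipped inside an importable package
    (works from source checkouts and installed wheels alike)."""
    return resources.files(package).joinpath(name).read_text(encoding="utf-8")
```

Note `resources.files` requires Python 3.9+, consistent with the version discussion elsewhere in this review.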
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: src/services/agent_service.py#L57-L63
Potential issue: The `_load_system_prompt_template` method loads the agent's system
prompt from `prompts/agent_system_instructions.md`. The path is calculated relative to
the `src` directory, assuming the `prompts` directory is at the project root. However,
the packaging configuration only includes the `src` directory in the final package. As a
result, the prompt file will be missing in a deployed environment. This will raise a
`FileNotFoundError` and crash the agent when it processes its first message.
```diff
 def _fetch_with_browser_sync(url: str, timeout_seconds: float = 30.0) -> str:
     """
-    Scrape website and return text chunks.
-
-    Args:
-        url: Root URL to scrape
-        max_pages: Maximum number of pages to scrape
-
-    Returns:
-        List of text chunks (500-800 words each)
+    Fetch page HTML using undetected headless Chrome (bypasses Cloudflare/bot blocks).
+    Runs synchronously; call via asyncio.to_thread() from async code.
+    Set CHROME_VERSION_MAIN to your Chrome major version (e.g. 143) if you see
+    "This version of ChromeDriver only supports Chrome version X" and your Chrome is different.
     """
-    start_time = time.time()
-
-    logfire.info(
-        "Starting website scrape",
-        url=url,
-        max_pages=max_pages,
-    )
-
-    async with httpx.AsyncClient(timeout=30.0, follow_redirects=True) as client:
-        fetch_start = time.time()
+    import undetected_chromedriver as uc
```
Bug: The scraper's browser fallback will crash if Chrome is not installed in the deployment environment because the call is not wrapped in a try-except block.
Severity: HIGH
Suggested Fix
Wrap the call to the browser fallback mechanism within a try-except block to gracefully handle exceptions that occur when undetected_chromedriver fails to initialize, for instance, due to a missing Chrome installation. Log the error and allow the process to continue with the original failed content instead of crashing.
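A hedged sketch of that fix: `fetch_with_fallback` is a hypothetical helper, and `fetch_http` / `fetch_browser` stand in for the real `_fetch_one_page` HTTP path and `_fetch_with_browser_sync`:

```python
import logging

logger = logging.getLogger(__name__)


def fetch_with_fallback(url: str, fetch_http, fetch_browser) -> str:
    """Try the HTTP fetch; on failure try the browser, and if the browser
    itself fails (e.g. Chrome not installed), re-raise the original error
    so the crawl loop can skip the page instead of crashing the CLI."""
    try:
        return fetch_http(url)
    except ValueError as http_err:
        try:
            return fetch_browser(url)
        except Exception as browser_err:
            logger.warning("Browser fallback failed for %s: %s", url, browser_err)
            raise http_err
```

Re-raising the original `ValueError` matters: the crawl loop shown earlier already catches `ValueError` and skips the page, so a missing Chrome degrades to a skipped page rather than an unhandled crash.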
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: src/services/scraper.py#L66-L73
Potential issue: The scraper service uses `undetected_chromedriver` as a fallback
mechanism when a request fails with a 403/503 status code. This library requires Google
Chrome to be installed on the system. The deployment environment likely lacks a Chrome
installation. The call to the browser fallback in `_fetch_one_page` is not wrapped in a
`try-except` block, so if Chrome is missing, an unhandled exception will be raised,
crashing the `setup` CLI command when it encounters a bot-protected site.
```python
        tone=tone,
        recent_messages=[],
    )
    agent = MessengerAgentService()
```
Bug: The setup_cli test command crashes if the PYDANTIC_AI_GATEWAY_API_KEY environment variable is not set, due to an unhandled ValidationError during agent initialization.
Severity: HIGH
Suggested Fix
Wrap the MessengerAgentService() instantiation within the _run_test_repl function in a try-except block. Catch the pydantic.ValidationError and print a user-friendly error message instructing the user to set the required PYDANTIC_AI_GATEWAY_API_KEY environment variable in their .env file, then exit gracefully.
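The shape of that fix, sketched with a hypothetical `build_or_exit` helper. In the real CLI, `factory` would be `MessengerAgentService` and `config_error` would be `pydantic.ValidationError`; the parameter keeps this snippet dependency-free:

```python
import sys


def build_or_exit(factory, config_error=Exception):
    """Construct the agent; on a configuration error, print a friendly
    hint and exit with status 1 instead of dumping a traceback."""
    try:
        return factory()
    except config_error:
        print(
            "Missing configuration: set PYDANTIC_AI_GATEWAY_API_KEY in your .env file.",
            file=sys.stderr,
        )
        raise SystemExit(1)
```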
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: src/cli/setup_cli.py#L134
Potential issue: The `setup` CLI's test REPL mode, invoked via the `test` command,
instantiates `MessengerAgentService` directly. This service's initialization depends on
`get_settings()`, which validates that the `pydantic_ai_gateway_api_key` environment
variable is set. If a user runs the test command without this variable configured in
their `.env` file, the application will crash with a Pydantic `ValidationError`. The
agent instantiation in `_run_test_repl` is not wrapped in a `try-catch` block to handle
this configuration error gracefully.

Enhanced Scraper and Database Migrations

This PR adds several key improvements to the project:

- **Improved Website Scraper**
- **Database Migrations**: new migration runner in src/db/migrate.py; bot_id made nullable in the reference_documents table
- **Setup CLI Improvements**
- **PydanticAI Gateway Integration**
- **Test Fixes**
- **Dependency Updates**

This PR completes the migration to PydanticAI Gateway and improves the robustness of the scraper and setup process.