refactor: Enhance sitemap extraction and summarization features #185
- Added support for configurable sitemap parsing via environment variable `SITEMAP_PARSER` with options: `docusaurus`, `astro`, and `generic`.
- Introduced a new config map for extractor sitemap settings.
- Updated `PageSummaryEnhancer` to group documents by URL for non-numeric pages and maintain separation for paged documents.
- Enhanced `LangchainSummarizer` to respect max concurrency settings during summarization.
- Improved error logging for source uploads in `DefaultSourceUploader`.
- Added comprehensive tests for new sitemap parsing functions and summarization logic.
- Updated README and documentation to reflect changes and provide guidance on memory management for backend pods.
Pull request overview
This PR enhances sitemap extraction and document summarization features by introducing configurable HTML parsing strategies, improving concurrency control, and adding comprehensive test coverage.
Key changes:
- Added three sitemap parser options (`docusaurus`, `astro`, `generic`) configurable via the `SITEMAP_PARSER` environment variable
- Enhanced `PageSummaryEnhancer` to intelligently group documents by URL for sitemap sources while maintaining page-level separation for paged documents
- Improved concurrency control in `LangchainSummarizer` and `PageSummaryEnhancer` to respect `max_concurrency` settings
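The parser selection described above can be sketched as a simple lookup keyed on the environment variable. The parser function names follow this PR, but their bodies and the mapping itself are illustrative, not the actual implementation:

```python
# Hypothetical stand-ins for the parser functions named in this PR.
def docusaurus_sitemap_parser_function(html: str) -> str:
    return f"docusaurus:{html}"


def astro_sitemap_parser_function(html: str) -> str:
    return f"astro:{html}"


def generic_sitemap_parser_function(html: str) -> str:
    return f"generic:{html}"


# Selector-style wiring: SITEMAP_PARSER picks the parsing strategy,
# falling back to the PR's documented default of "docusaurus".
PARSERS = {
    "docusaurus": docusaurus_sitemap_parser_function,
    "astro": astro_sitemap_parser_function,
    "generic": generic_sitemap_parser_function,
}


def select_parser(env: dict):
    """Return the parser function chosen by the SITEMAP_PARSER variable."""
    return PARSERS[env.get("SITEMAP_PARSER", "docusaurus")]
```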
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| `libs/rag-core-api/src/rag_core_api/prompt_templates/answer_generation_prompt.py` | Refined system prompt with clearer instructions and improved output formatting requirements |
| `libs/extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py` | Refactored sitemap parsing into specialized functions for Docusaurus, Astro, and generic HTML with improved title extraction logic |
| `libs/extractor-api-lib/src/extractor_api_lib/impl/settings/sitemap_settings.py` | Added new settings class to configure sitemap parser selection via environment variables |
| `libs/extractor-api-lib/src/extractor_api_lib/impl/extractors/sitemap_extractor.py` | Updated to support runtime parser override and integrate with new parser selection logic |
| `libs/extractor-api-lib/src/extractor_api_lib/dependency_container.py` | Configured dependency injection to wire sitemap parsers based on settings using the Selector pattern |
| `libs/extractor-api-lib/tests/sitemap_extractor_utils_test.py` | Added comprehensive tests for new parsing functions covering content extraction and metadata parsing |
| `libs/admin-api-lib/src/admin_api_lib/impl/information_enhancer/page_summary_enhancer.py` | Refactored grouping logic to handle sitemap vs. paged documents and added concurrency limiting |
| `libs/admin-api-lib/src/admin_api_lib/impl/summarizer/langchain_summarizer.py` | Added `max_concurrency` support to control parallel summarization tasks |
| `libs/admin-api-lib/tests/page_summary_enhancer_test.py` | Added tests for new grouping logic and concurrency control |
| `libs/admin-api-lib/tests/langchain_summarizer_test.py` | Added tests verifying concurrency limits are respected |
| `libs/admin-api-lib/src/admin_api_lib/impl/api_endpoints/default_source_uploader.py` | Improved error message for timeout errors with actionable guidance |
| `libs/README.md` | Documented new `SITEMAP_PARSER` configuration option and parser override capability |
| `infrastructure/rag/values.yaml` | Added sitemap parser configuration, memory management guidance, and `RERANKER_MODEL` setting |
| `infrastructure/rag/templates/extractor/deployment.yaml` | Wired new sitemap configmap into extractor deployment |
| `infrastructure/rag/templates/extractor/configmap.yaml` | Created new configmap for sitemap-specific environment variables |
| `infrastructure/rag/templates/_admin_backend_and_extractor_helpers.tpl` | Added helper template for sitemap configmap name |
| `infrastructure/README.md` | Added guidance for troubleshooting OOMKilled issues with backend pods |
```python
from typing import Optional
from typing import Any
```

**Copilot AI** · Dec 16, 2025

Duplicate import statement. The `from typing import Any` appears on both line 5 and line 6. Remove the duplicate import.

Suggested change:

```diff
- from typing import Optional
- from typing import Any
+ from typing import Optional, Any
```
```python
@staticmethod
def _parse_max_concurrency(config: RunnableConfig) -> Optional[int]:
    """Parse max concurrency from a RunnableConfig.

    Returns
    -------
    Optional[int]
        An integer >= 1 if configured and valid, otherwise None.
    """
    max_concurrency = config.get("max_concurrency")
    if max_concurrency is None:
        return None

    try:
        return max(1, int(max_concurrency))
    except (TypeError, ValueError):
        return None
```

**Copilot AI** · Dec 16, 2025
The _parse_max_concurrency method is missing the Parameters section in its docstring. Add a Parameters section to document the config parameter following NumPy style.
```python
class SitemapSettings(BaseSettings):
    """Controls sitemap HTML parsing defaults."""

    class Config:
        """Config class for reading Fields from env."""

        env_prefix = "SITEMAP_"
        case_sensitive = False

    parser: Literal["docusaurus", "astro", "generic"] = Field(default="docusaurus")
```

**Copilot AI** · Dec 16, 2025
The SitemapSettings class is missing a comprehensive docstring. Following NumPy style, it should document the Attributes section listing the parser field with its type, description, and default value.
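One way to satisfy that might look like the following sketch. The `Attributes` wording is illustrative, and the pydantic `BaseSettings`/`Field` machinery is omitted so the example stays self-contained:

```python
from typing import Literal


class SitemapSettings:
    """Controls sitemap HTML parsing defaults.

    Attributes
    ----------
    parser : Literal["docusaurus", "astro", "generic"]
        Which HTML parsing strategy the sitemap extractor uses.
        Read from the ``SITEMAP_PARSER`` environment variable;
        defaults to ``"docusaurus"``.
    """

    parser: Literal["docusaurus", "astro", "generic"] = "docusaurus"
```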
```python
from extractor_api_lib.impl.utils.sitemap_extractor_utils import (
    astro_sitemap_metadata_parser_function,
    astro_sitemap_parser_function,
    docusaurus_sitemap_metadata_parser_function,
    docusaurus_sitemap_parser_function,
    generic_sitemap_metadata_parser_function,
)
```

**Copilot AI** · Dec 16, 2025
Missing test coverage for generic_sitemap_parser_function. While generic_sitemap_metadata_parser_function is tested in test_meta_parser_falls_back_to_title_tag, there is no test for the corresponding parser function. Add a test to verify that generic_sitemap_parser_function correctly extracts content from generic HTML pages.
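Such a test might look like the sketch below. The stand-in implementation uses only the stdlib `html.parser` (the real code uses BeautifulSoup) and mirrors the article → main → whole-document fallback order described later in this review; it is illustrative, not the PR's actual code:

```python
from html.parser import HTMLParser
from typing import Optional


class _TextGrabber(HTMLParser):
    """Collect text, optionally only inside a given tag."""

    def __init__(self, target: Optional[str]):
        super().__init__()
        self.target = target
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.target and tag == self.target:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.target and tag == self.target and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.target is None or self.depth:
            self.chunks.append(data.strip())


def generic_sitemap_parser_function(content: str) -> str:
    """Stand-in: try <article>, then <main>, then the full document."""
    for target in ("article", "main", None):
        grabber = _TextGrabber(target)
        grabber.feed(content)
        text = " ".join(c for c in grabber.chunks if c)
        if text:
            return text
    return ""


# The missing tests, roughly in the shape the review asks for:
def test_generic_parser_extracts_article_content():
    html = "<html><body><nav>skip</nav><article>Hello docs</article></body></html>"
    assert generic_sitemap_parser_function(html) == "Hello docs"


def test_generic_parser_falls_back_to_main():
    html = "<html><body><main>Main text</main></body></html>"
    assert generic_sitemap_parser_function(html) == "Main text"
```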
```python
if isinstance(page, int) or (isinstance(page, str) and page != "Unknown Title"):
    return ("page_number", document_url, page)
```

**Copilot AI** · Dec 16, 2025
Missing test coverage for the edge case where page is a non-"Unknown Title" string value without a document_url. The _group_key method on line 49 will return ("page_number", None, page_string), which could incorrectly group documents from different sources that happen to have the same page name. Add a test to verify this behavior or adjust the logic to handle this case.
```python
def docusaurus_sitemap_parser_function(content: Union[str, BeautifulSoup]) -> str:
    """
    Parse Docusaurus pages from a sitemap.

    Given HTML content (as a string or BeautifulSoup object), return the extracted text from the main content area.
    """
    soup = _as_soup(content)
    root = _select_docusaurus_root(soup)
    return _extract_text(root)


def astro_sitemap_parser_function(content: Union[str, BeautifulSoup]) -> str:
    """
    Parse Astro pages from a sitemap.

    Given HTML content (as a string or BeautifulSoup object), return the extracted text from the main content area.
    """
    soup = _as_soup(content)
    root = _select_astro_root(soup)
    return _extract_text(root)


def generic_sitemap_parser_function(content: Union[str, BeautifulSoup]) -> str:
    """
    Parse generic HTML pages from a sitemap.

    This is a safe fallback that tries <article> first, then <main>, and finally the full document body.
    """
    soup = _as_soup(content)
    root = _select_generic_root(soup)
    return _extract_text(root)
```

**Copilot AI** · Dec 16, 2025
The docstrings for these functions do not follow the NumPy style required by the repository's coding guidelines. They should include a Parameters section describing the content parameter and a Returns section describing the return type and value.
```python
def docusaurus_sitemap_metadata_parser_function(meta: dict, _content: Any) -> dict:
    """Extract metadata for Docusaurus pages."""
    soup = _as_soup(_content) if isinstance(_content, (str, BeautifulSoup)) else _content
    root = _select_docusaurus_root(soup)
    source_url = meta.get("loc") or meta.get("source")
    title = _extract_title(soup, root)
    if title == "Unknown Title" and source_url:
        title = _title_from_url(str(source_url))
    meta["title"] = title
    return {"source": source_url, **meta} if source_url else meta


def astro_sitemap_metadata_parser_function(meta: dict, _content: Any) -> dict:
    """Extract metadata for Astro pages."""
    soup = _as_soup(_content) if isinstance(_content, (str, BeautifulSoup)) else _content
    root = _select_astro_root(soup)
    source_url = meta.get("loc") or meta.get("source")
    title = _extract_title(soup, root)
    if title == "Unknown Title" and source_url:
        title = _title_from_url(str(source_url))
    meta["title"] = title
    return {"source": source_url, **meta} if source_url else meta


def generic_sitemap_metadata_parser_function(meta: dict, _content: Any) -> dict:
    """Extract metadata for generic HTML pages."""
    soup = _as_soup(_content) if isinstance(_content, (str, BeautifulSoup)) else _content
    root = _select_generic_root(soup)
    source_url = meta.get("loc") or meta.get("source")
    title = _extract_title(soup, root)
    if title == "Unknown Title" and source_url:
        title = _title_from_url(str(source_url))
    meta["title"] = title
    return {"source": source_url, **meta} if source_url else meta
```

**Copilot AI** · Dec 16, 2025
The docstrings for these metadata parser functions do not follow the NumPy style required by the repository's coding guidelines. They should include a Parameters section (meta and _content parameters) and a Returns section describing the return type and value.
```python
@staticmethod
def _select_parser_functions(
    parser_override: Optional[str],
) -> tuple[Optional[callable], Optional[callable]]:
```

**Copilot AI** · Dec 16, 2025
The return type annotation for this method should use tuple[Callable, Callable] | tuple[None, None] (or Optional[tuple[Callable, Callable]]) instead of tuple[Optional[callable], Optional[callable]]. The current signature suggests either element can be None independently, but the implementation always returns both as None or both as callables. Also, use Callable (capital C) from typing instead of the lowercase built-in callable.
```python
except (json.JSONDecodeError, TypeError):
    sitemap_loader_parameters[x.key] = x.value
else:
    sitemap_loader_parameters[x.key] = int(x.value) if x.value.isdigit() else x.value
```

**Copilot AI** · Dec 16, 2025

Potential `AttributeError` if `x.value` is not a string. The code calls `x.value.isdigit()` on line 166, but if `x.value` is `None` or another type that doesn't have an `isdigit()` method, this will raise an `AttributeError`. Add a type check before calling `isdigit()`.

Suggested change:

```diff
- sitemap_loader_parameters[x.key] = int(x.value) if x.value.isdigit() else x.value
+ sitemap_loader_parameters[x.key] = int(x.value) if isinstance(x.value, str) and x.value.isdigit() else x.value
```
```python
async def _summarize_documents(
    self,
    documents: list[Document],
    config: RunnableConfig,
    *,
    max_concurrency: Optional[int],
) -> list[SummarizerOutput]:
    """Summarize a set of already-chunked documents.

    Notes
    -----
    This optionally limits task fan-out using a per-call semaphore (max_concurrency).
    The actual LLM call concurrency is always bounded by the instance semaphore held
    inside `_summarize_chunk`.
    """
```

**Copilot AI** · Dec 16, 2025
The _summarize_documents method is missing the Parameters and Returns sections in its docstring. Add these sections following NumPy style to document the documents, config, and max_concurrency parameters, as well as the return type.
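The per-call semaphore pattern the Notes section describes can be sketched with asyncio. Names and bodies here are illustrative, not the PR's actual implementation:

```python
import asyncio
from typing import List, Optional


async def _summarize_chunk(doc: str) -> str:
    # Placeholder for the real LLM call.
    await asyncio.sleep(0)
    return f"summary of {doc}"


async def summarize_documents(
    documents: List[str], *, max_concurrency: Optional[int]
) -> List[str]:
    # A per-call semaphore bounds task fan-out when max_concurrency is set;
    # otherwise all chunks are summarized fully in parallel.
    semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency else None

    async def run(doc: str) -> str:
        if semaphore is None:
            return await _summarize_chunk(doc)
        async with semaphore:
            return await _summarize_chunk(doc)

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(run(d) for d in documents))
```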