Conversation

@robodev-r2d2 (Collaborator)

- Added support for configurable sitemap parsing via environment variable `SITEMAP_PARSER` with options: `docusaurus`, `astro`, and `generic` (a selection sketch follows this list).
- Introduced new config map for extractor sitemap settings.
- Updated `PageSummaryEnhancer` to group documents by URL for non-numeric pages and maintain separation for paged documents.
- Enhanced `LangchainSummarizer` to respect max concurrency settings during summarization.
- Improved error logging for source uploads in `DefaultSourceUploader`.
- Added comprehensive tests for new sitemap parsing functions and summarization logic.
- Updated README and documentation to reflect changes and provide guidance on memory management for backend pods.
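
For orientation, here is a minimal sketch of how the parser choice could be resolved from `SITEMAP_PARSER`. This is an illustration only; the actual PR wires the parsers through the dependency container's Selector (see the file summary below), and the lookup-table approach here is an assumption. The function names, the option values, and the `docusaurus` default are taken from this PR.

```python
# Illustrative sketch, not the merged wiring (the PR uses a dependency-injection Selector).
import os

from extractor_api_lib.impl.utils.sitemap_extractor_utils import (
    astro_sitemap_parser_function,
    docusaurus_sitemap_parser_function,
    generic_sitemap_parser_function,
)

_PARSERS = {
    "docusaurus": docusaurus_sitemap_parser_function,
    "astro": astro_sitemap_parser_function,
    "generic": generic_sitemap_parser_function,
}

# SITEMAP_PARSER defaults to "docusaurus", matching the SitemapSettings default.
parser_function = _PARSERS[os.environ.get("SITEMAP_PARSER", "docusaurus")]
```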
@a-klos merged commit 79b8254 into stackitcloud:main on Dec 16, 2025
16 checks passed
Copilot AI (Contributor) left a comment:

Pull request overview

This PR enhances sitemap extraction and document summarization features by introducing configurable HTML parsing strategies, improving concurrency control, and adding comprehensive test coverage.

Key Changes:

  • Added three sitemap parser options (docusaurus, astro, generic) configurable via SITEMAP_PARSER environment variable
  • Enhanced PageSummaryEnhancer to intelligently group documents by URL for sitemap sources while maintaining page-level separation for paged documents
  • Improved concurrency control in LangchainSummarizer and PageSummaryEnhancer to respect max_concurrency settings

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| libs/rag-core-api/src/rag_core_api/prompt_templates/answer_generation_prompt.py | Refined system prompt with clearer instructions and improved output formatting requirements |
| libs/extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py | Refactored sitemap parsing into specialized functions for Docusaurus, Astro, and generic HTML with improved title extraction logic |
| libs/extractor-api-lib/src/extractor_api_lib/impl/settings/sitemap_settings.py | Added new settings class to configure sitemap parser selection via environment variables |
| libs/extractor-api-lib/src/extractor_api_lib/impl/extractors/sitemap_extractor.py | Updated to support runtime parser override and integrate with new parser selection logic |
| libs/extractor-api-lib/src/extractor_api_lib/dependency_container.py | Configured dependency injection to wire sitemap parsers based on settings using Selector pattern |
| libs/extractor-api-lib/tests/sitemap_extractor_utils_test.py | Added comprehensive tests for new parsing functions covering content extraction and metadata parsing |
| libs/admin-api-lib/src/admin_api_lib/impl/information_enhancer/page_summary_enhancer.py | Refactored grouping logic to handle sitemap vs paged documents and added concurrency limiting |
| libs/admin-api-lib/src/admin_api_lib/impl/summarizer/langchain_summarizer.py | Added max_concurrency support to control parallel summarization tasks |
| libs/admin-api-lib/tests/page_summary_enhancer_test.py | Added tests for new grouping logic and concurrency control |
| libs/admin-api-lib/tests/langchain_summarizer_test.py | Added tests verifying concurrency limits are respected |
| libs/admin-api-lib/src/admin_api_lib/impl/api_endpoints/default_source_uploader.py | Improved error message for timeout errors with actionable guidance |
| libs/README.md | Documented new SITEMAP_PARSER configuration option and parser override capability |
| infrastructure/rag/values.yaml | Added sitemap parser configuration, memory management guidance, and RERANKER_MODEL setting |
| infrastructure/rag/templates/extractor/deployment.yaml | Wired new sitemap configmap into extractor deployment |
| infrastructure/rag/templates/extractor/configmap.yaml | Created new configmap for sitemap-specific environment variables |
| infrastructure/rag/templates/_admin_backend_and_extractor_helpers.tpl | Added helper template for sitemap configmap name |
| infrastructure/README.md | Added guidance for troubleshooting OOMKilled issues with backend pods |

Comment on lines 5 to 7
```python
from typing import Optional
from typing import Any
```
Copilot AI, Dec 16, 2025:

Duplicate `typing` import statement: `from typing import ...` appears on both line 5 and line 6. Combine them into a single import.

Suggested change:

```diff
-from typing import Optional
-from typing import Any
+from typing import Optional, Any
```
Comment on lines +47 to +63
```python
@staticmethod
def _parse_max_concurrency(config: RunnableConfig) -> Optional[int]:
    """Parse max concurrency from a RunnableConfig.

    Returns
    -------
    Optional[int]
        An integer >= 1 if configured and valid, otherwise None.
    """
    max_concurrency = config.get("max_concurrency")
    if max_concurrency is None:
        return None

    try:
        return max(1, int(max_concurrency))
    except (TypeError, ValueError):
        return None
```
Copilot AI, Dec 16, 2025:

The `_parse_max_concurrency` method is missing the Parameters section in its docstring. Add a Parameters section to document the `config` parameter following NumPy style.
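
A sketch of what the requested NumPy-style docstring could look like (wording is a suggestion, not the merged text):

```python
@staticmethod
def _parse_max_concurrency(config: RunnableConfig) -> Optional[int]:
    """Parse max concurrency from a RunnableConfig.

    Parameters
    ----------
    config : RunnableConfig
        The runnable configuration; the optional "max_concurrency" entry is read from it.

    Returns
    -------
    Optional[int]
        An integer >= 1 if configured and valid, otherwise None.
    """
```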

Comment on lines +9 to +18
```python
class SitemapSettings(BaseSettings):
    """Controls sitemap HTML parsing defaults."""

    class Config:
        """Config class for reading Fields from env."""

        env_prefix = "SITEMAP_"
        case_sensitive = False

    parser: Literal["docusaurus", "astro", "generic"] = Field(default="docusaurus")
```
Copilot AI, Dec 16, 2025:

The `SitemapSettings` class is missing a comprehensive docstring. Following NumPy style, it should document the Attributes section listing the `parser` field with its type, description, and default value.
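
One possible shape for the requested class docstring (the attribute description is a suggestion):

```python
class SitemapSettings(BaseSettings):
    """Controls sitemap HTML parsing defaults.

    Attributes
    ----------
    parser : Literal["docusaurus", "astro", "generic"]
        Which HTML parsing strategy the sitemap extractor uses.
        Read from the SITEMAP_PARSER environment variable; defaults to "docusaurus".
    """
```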

Comment on lines +16 to +22
```python
from extractor_api_lib.impl.utils.sitemap_extractor_utils import (
    astro_sitemap_metadata_parser_function,
    astro_sitemap_parser_function,
    docusaurus_sitemap_metadata_parser_function,
    docusaurus_sitemap_parser_function,
    generic_sitemap_metadata_parser_function,
)
```
Copilot AI, Dec 16, 2025:

Missing test coverage for `generic_sitemap_parser_function`. While `generic_sitemap_metadata_parser_function` is tested in `test_meta_parser_falls_back_to_title_tag`, there is no test for the corresponding parser function. Add a test to verify that `generic_sitemap_parser_function` correctly extracts content from generic HTML pages.
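
A minimal test in that direction might look like the following; the assertions are kept loose because the exact extraction output is an assumption based on the generic parser's `<article>`/`<main>`/body fallback shown later in this review:

```python
from extractor_api_lib.impl.utils.sitemap_extractor_utils import generic_sitemap_parser_function


def test_generic_parser_extracts_article_content():
    # Content inside <article> should be picked up by the generic parser.
    html = "<html><body><nav>Menu</nav><article><p>Hello world</p></article></body></html>"
    assert "Hello world" in generic_sitemap_parser_function(html)


def test_generic_parser_falls_back_to_body():
    # Without <article> or <main>, the full document body is the fallback.
    html = "<html><body><p>Plain page content</p></body></html>"
    assert "Plain page content" in generic_sitemap_parser_function(html)
```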

Comment on lines +49 to +50
```python
if isinstance(page, int) or (isinstance(page, str) and page != "Unknown Title"):
    return ("page_number", document_url, page)
```
Copilot AI, Dec 16, 2025:

Missing test coverage for the edge case where `page` is a non-"Unknown Title" string value without a `document_url`. The `_group_key` method on line 49 will return `("page_number", None, page_string)`, which could incorrectly group documents from different sources that happen to have the same page name. Add a test to verify this behavior or adjust the logic to handle this case.
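
If such a regression test is added, it could look roughly like this; the `_group_key` call signature, the `enhancer` fixture, and the metadata keys are assumptions, since only the method body is visible in this review:

```python
from langchain_core.documents import Document


def test_group_key_with_named_page_and_no_url(enhancer):
    # Hypothetical case: a named page ("Appendix") but no document URL in the metadata.
    doc = Document(page_content="text", metadata={"page": "Appendix"})
    key = enhancer._group_key(doc)  # signature assumed for illustration
    # Per the snippet above, grouping would fall back to the page name alone.
    assert key == ("page_number", None, "Appendix")
```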

Comment on lines +71 to +101
```diff
 def docusaurus_sitemap_parser_function(content: Union[str, BeautifulSoup]) -> str:
     """
-    Given HTML content (as a string or BeautifulSoup object), return the concatenated text from all <article> elements.
-
-    Parameters
-    ----------
-    content : Union[str, BeautifulSoup]
-        The HTML content to parse, either as a string or a BeautifulSoup object.
+    Parse Docusaurus pages from a sitemap.
+
+    Given HTML content (as a string or BeautifulSoup object), return the extracted text from the main content area.
     """
-    if isinstance(content, str):
-        soup = BeautifulSoup(content, "html.parser")
-    else:
-        soup = content
-
-    article_elements = soup.find_all("article")
-    if not article_elements:
-        return str(content.get_text())
-
-    texts = [element.get_text(separator=" ", strip=True) for element in article_elements]
-    return "\n".join(texts)
+    soup = _as_soup(content)
+    root = _select_docusaurus_root(soup)
+    return _extract_text(root)
 
 
-def custom_sitemap_metadata_parser_function(meta: dict, _content: Any) -> dict:
+def astro_sitemap_parser_function(content: Union[str, BeautifulSoup]) -> str:
+    """
+    Parse Astro pages from a sitemap.
+
+    Given HTML content (as a string or BeautifulSoup object), return the extracted text from the main content area.
+    """
+    soup = _as_soup(content)
+    root = _select_astro_root(soup)
+    return _extract_text(root)
+
+
+def generic_sitemap_parser_function(content: Union[str, BeautifulSoup]) -> str:
+    """
+    Parse generic HTML pages from a sitemap.
+
+    This is a safe fallback that tries <article> first, then <main>, and finally the full document body.
+    """
+    soup = _as_soup(content)
+    root = _select_generic_root(soup)
+    return _extract_text(root)
```
Copilot AI, Dec 16, 2025:

The docstrings for these functions do not follow the NumPy style required by the repository's coding guidelines. They should include a Parameters section describing the `content` parameter and a Returns section describing the return type and value.
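
For example, the Docusaurus parser's docstring could be expanded roughly like this (a suggestion, not the merged text):

```python
def docusaurus_sitemap_parser_function(content: Union[str, BeautifulSoup]) -> str:
    """Parse Docusaurus pages from a sitemap.

    Parameters
    ----------
    content : Union[str, BeautifulSoup]
        The HTML content to parse, either as a string or a BeautifulSoup object.

    Returns
    -------
    str
        The text extracted from the page's main content area.
    """
    soup = _as_soup(content)
    root = _select_docusaurus_root(soup)
    return _extract_text(root)
```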

Comment on lines +146 to +179
```python
def docusaurus_sitemap_metadata_parser_function(meta: dict, _content: Any) -> dict:
    """Extract metadata for Docusaurus pages."""
    soup = _as_soup(_content) if isinstance(_content, (str, BeautifulSoup)) else _content
    root = _select_docusaurus_root(soup)
    source_url = meta.get("loc") or meta.get("source")
    title = _extract_title(soup, root)
    if title == "Unknown Title" and source_url:
        title = _title_from_url(str(source_url))
    meta["title"] = title
    return {"source": source_url, **meta} if source_url else meta


def astro_sitemap_metadata_parser_function(meta: dict, _content: Any) -> dict:
    """Extract metadata for Astro pages."""
    soup = _as_soup(_content) if isinstance(_content, (str, BeautifulSoup)) else _content
    root = _select_astro_root(soup)
    source_url = meta.get("loc") or meta.get("source")
    title = _extract_title(soup, root)
    if title == "Unknown Title" and source_url:
        title = _title_from_url(str(source_url))
    meta["title"] = title
    return {"source": source_url, **meta} if source_url else meta


def generic_sitemap_metadata_parser_function(meta: dict, _content: Any) -> dict:
    """Extract metadata for generic HTML pages."""
    soup = _as_soup(_content) if isinstance(_content, (str, BeautifulSoup)) else _content
    root = _select_generic_root(soup)
    source_url = meta.get("loc") or meta.get("source")
    title = _extract_title(soup, root)
    if title == "Unknown Title" and source_url:
        title = _title_from_url(str(source_url))
    meta["title"] = title
    return {"source": source_url, **meta} if source_url else meta
```
Copilot AI, Dec 16, 2025:

The docstrings for these metadata parser functions do not follow the NumPy style required by the repository's coding guidelines. They should include a Parameters section (the `meta` and `_content` parameters) and a Returns section describing the return type and value.

Comment on lines +64 to +67
```python
@staticmethod
def _select_parser_functions(
    parser_override: Optional[str],
) -> tuple[Optional[callable], Optional[callable]]:
```
Copilot AI, Dec 16, 2025:

The return type annotation for this method should use `tuple[Callable, Callable] | tuple[None, None]` (or `Optional[tuple[Callable, Callable]]`) instead of `tuple[Optional[callable], Optional[callable]]`. The current signature suggests either element can be None independently, but the implementation always returns both as None or both as callables. Also, use `Callable` (capital C) from `typing` instead of the lowercase built-in `callable`.
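
Under that suggestion, the signature might read as follows (illustrative only; the `|` union syntax assumes Python 3.10+, and the class name is assumed):

```python
from typing import Callable, Optional


class SitemapExtractor:  # class name assumed for illustration
    @staticmethod
    def _select_parser_functions(
        parser_override: Optional[str],
    ) -> tuple[Callable, Callable] | tuple[None, None]:
        # Both elements are None together, or both are callables together.
        ...
```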

```python
    except (json.JSONDecodeError, TypeError):
        sitemap_loader_parameters[x.key] = x.value
else:
    sitemap_loader_parameters[x.key] = int(x.value) if x.value.isdigit() else x.value
```
Copilot AI, Dec 16, 2025:

Potential `AttributeError` if `x.value` is not a string. The code calls `x.value.isdigit()` on line 166, but if `x.value` is `None` or another type that doesn't have an `isdigit()` method, this will raise an `AttributeError`. Add a type check before calling `isdigit()`, such as: `int(x.value) if isinstance(x.value, str) and x.value.isdigit() else x.value`.

Suggested change:

```diff
-    sitemap_loader_parameters[x.key] = int(x.value) if x.value.isdigit() else x.value
+    sitemap_loader_parameters[x.key] = int(x.value) if isinstance(x.value, str) and x.value.isdigit() else x.value
```

Comment on lines +113 to +127
```python
async def _summarize_documents(
    self,
    documents: list[Document],
    config: RunnableConfig,
    *,
    max_concurrency: Optional[int],
) -> list[SummarizerOutput]:
    """Summarize a set of already-chunked documents.

    Notes
    -----
    This optionally limits task fan-out using a per-call semaphore (max_concurrency).
    The actual LLM call concurrency is always bounded by the instance semaphore held
    inside `_summarize_chunk`.
    """
```
Copilot AI, Dec 16, 2025:

The `_summarize_documents` method is missing the Parameters and Returns sections in its docstring. Add these sections following NumPy style to document the `documents`, `config`, and `max_concurrency` parameters, as well as the return type.
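
A sketch that combines the requested docstring sections with the per-call fan-out limit described in the Notes above; the semaphore wiring and the `_summarize_chunk` call signature are assumptions illustrating the described behavior, not the merged implementation:

```python
import asyncio
from typing import Optional


async def _summarize_documents(
    self,
    documents: list[Document],
    config: RunnableConfig,
    *,
    max_concurrency: Optional[int],
) -> list[SummarizerOutput]:
    """Summarize a set of already-chunked documents.

    Parameters
    ----------
    documents : list[Document]
        The chunked documents to summarize.
    config : RunnableConfig
        The runnable configuration forwarded to each summarization call.
    max_concurrency : Optional[int]
        Optional per-call cap on how many summarization tasks run at once.

    Returns
    -------
    list[SummarizerOutput]
        One summary per input document, in input order.
    """
    # Assumed illustration: gate task fan-out with a per-call semaphore when a cap is set.
    semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency else None

    async def _run(document: Document) -> SummarizerOutput:
        if semaphore is None:
            return await self._summarize_chunk(document, config)
        async with semaphore:
            return await self._summarize_chunk(document, config)

    return list(await asyncio.gather(*(_run(doc) for doc in documents)))
```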
