refactor: Enhance sitemap extraction and summarization features #185
- Added support for configurable sitemap parsing via environment variable `SITEMAP_PARSER` with options: `docusaurus`, `astro`, and `generic`.
- Introduced a new config map for extractor sitemap settings.
- Updated `PageSummaryEnhancer` to group documents by URL for non-numeric pages and maintain separation for paged documents.
- Enhanced `LangchainSummarizer` to respect max concurrency settings during summarization.
- Improved error logging for source uploads in `DefaultSourceUploader`.
- Added comprehensive tests for new sitemap parsing functions and summarization logic.
- Updated README and documentation to reflect changes and provide guidance on memory management for backend pods.
Pull request overview
This PR enhances sitemap extraction and document summarization features by introducing configurable HTML parsing strategies, improving concurrency control, and adding comprehensive test coverage.
Key changes:
- Added three sitemap parser options (`docusaurus`, `astro`, `generic`) configurable via the `SITEMAP_PARSER` environment variable
- Enhanced `PageSummaryEnhancer` to intelligently group documents by URL for sitemap sources while maintaining page-level separation for paged documents
- Improved concurrency control in `LangchainSummarizer` and `PageSummaryEnhancer` to respect `max_concurrency` settings
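The parser selection described above can be sketched as a simple lookup keyed on the environment variable. The parser function names follow this PR, but their bodies and the mapping itself are illustrative, not the actual implementation:

```python
# Hypothetical stand-ins for the parser functions named in this PR.
def docusaurus_sitemap_parser_function(html: str) -> str:
    return f"docusaurus:{html}"


def astro_sitemap_parser_function(html: str) -> str:
    return f"astro:{html}"


def generic_sitemap_parser_function(html: str) -> str:
    return f"generic:{html}"


# Selector-style wiring: SITEMAP_PARSER picks the parsing strategy,
# falling back to the PR's documented default of "docusaurus".
PARSERS = {
    "docusaurus": docusaurus_sitemap_parser_function,
    "astro": astro_sitemap_parser_function,
    "generic": generic_sitemap_parser_function,
}


def select_parser(env: dict):
    """Return the parser function chosen by the SITEMAP_PARSER variable."""
    return PARSERS[env.get("SITEMAP_PARSER", "docusaurus")]
```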
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| `libs/rag-core-api/src/rag_core_api/prompt_templates/answer_generation_prompt.py` | Refined system prompt with clearer instructions and improved output formatting requirements |
| `libs/extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py` | Refactored sitemap parsing into specialized functions for Docusaurus, Astro, and generic HTML with improved title extraction logic |
| `libs/extractor-api-lib/src/extractor_api_lib/impl/settings/sitemap_settings.py` | Added new settings class to configure sitemap parser selection via environment variables |
| `libs/extractor-api-lib/src/extractor_api_lib/impl/extractors/sitemap_extractor.py` | Updated to support runtime parser override and integrate with new parser selection logic |
| `libs/extractor-api-lib/src/extractor_api_lib/dependency_container.py` | Configured dependency injection to wire sitemap parsers based on settings using the Selector pattern |
| `libs/extractor-api-lib/tests/sitemap_extractor_utils_test.py` | Added comprehensive tests for new parsing functions covering content extraction and metadata parsing |
| `libs/admin-api-lib/src/admin_api_lib/impl/information_enhancer/page_summary_enhancer.py` | Refactored grouping logic to handle sitemap vs. paged documents and added concurrency limiting |
| `libs/admin-api-lib/src/admin_api_lib/impl/summarizer/langchain_summarizer.py` | Added `max_concurrency` support to control parallel summarization tasks |
| `libs/admin-api-lib/tests/page_summary_enhancer_test.py` | Added tests for new grouping logic and concurrency control |
| `libs/admin-api-lib/tests/langchain_summarizer_test.py` | Added tests verifying concurrency limits are respected |
| `libs/admin-api-lib/src/admin_api_lib/impl/api_endpoints/default_source_uploader.py` | Improved error message for timeout errors with actionable guidance |
| `libs/README.md` | Documented new `SITEMAP_PARSER` configuration option and parser override capability |
| `infrastructure/rag/values.yaml` | Added sitemap parser configuration, memory management guidance, and `RERANKER_MODEL` setting |
| `infrastructure/rag/templates/extractor/deployment.yaml` | Wired new sitemap configmap into extractor deployment |
| `infrastructure/rag/templates/extractor/configmap.yaml` | Created new configmap for sitemap-specific environment variables |
| `infrastructure/rag/templates/_admin_backend_and_extractor_helpers.tpl` | Added helper template for sitemap configmap name |
| `infrastructure/README.md` | Added guidance for troubleshooting OOMKilled issues with backend pods |
```python
from typing import Optional
from typing import Any
```

**Copilot AI** · Dec 16, 2025

Duplicate import statement. The `from typing import Any` appears on both line 5 and line 6. Remove the duplicate import.

Suggested change:

```diff
- from typing import Optional
- from typing import Any
+ from typing import Optional, Any
```
```python
@staticmethod
def _parse_max_concurrency(config: RunnableConfig) -> Optional[int]:
    """Parse max concurrency from a RunnableConfig.

    Returns
    -------
    Optional[int]
        An integer >= 1 if configured and valid, otherwise None.
    """
    max_concurrency = config.get("max_concurrency")
    if max_concurrency is None:
        return None

    try:
        return max(1, int(max_concurrency))
    except (TypeError, ValueError):
        return None
```

**Copilot AI** · Dec 16, 2025
The _parse_max_concurrency method is missing the Parameters section in its docstring. Add a Parameters section to document the config parameter following NumPy style.
```python
class SitemapSettings(BaseSettings):
    """Controls sitemap HTML parsing defaults."""

    class Config:
        """Config class for reading Fields from env."""

        env_prefix = "SITEMAP_"
        case_sensitive = False

    parser: Literal["docusaurus", "astro", "generic"] = Field(default="docusaurus")
```

**Copilot AI** · Dec 16, 2025
The SitemapSettings class is missing a comprehensive docstring. Following NumPy style, it should document the Attributes section listing the parser field with its type, description, and default value.
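One way to satisfy that might look like the following sketch. The `Attributes` wording is illustrative, and the pydantic `BaseSettings`/`Field` machinery is omitted so the example stays self-contained:

```python
from typing import Literal


class SitemapSettings:
    """Controls sitemap HTML parsing defaults.

    Attributes
    ----------
    parser : Literal["docusaurus", "astro", "generic"]
        Which HTML parsing strategy the sitemap extractor uses.
        Read from the ``SITEMAP_PARSER`` environment variable;
        defaults to ``"docusaurus"``.
    """

    parser: Literal["docusaurus", "astro", "generic"] = "docusaurus"
```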
```python
from extractor_api_lib.impl.utils.sitemap_extractor_utils import (
    astro_sitemap_metadata_parser_function,
    astro_sitemap_parser_function,
    docusaurus_sitemap_metadata_parser_function,
    docusaurus_sitemap_parser_function,
    generic_sitemap_metadata_parser_function,
)
```

**Copilot AI** · Dec 16, 2025
Missing test coverage for generic_sitemap_parser_function. While generic_sitemap_metadata_parser_function is tested in test_meta_parser_falls_back_to_title_tag, there is no test for the corresponding parser function. Add a test to verify that generic_sitemap_parser_function correctly extracts content from generic HTML pages.
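Such a test might look like the sketch below. The stand-in implementation uses only the stdlib `html.parser` (the real code uses BeautifulSoup) and mirrors the article → main → whole-document fallback order described later in this review; it is illustrative, not the PR's actual code:

```python
from html.parser import HTMLParser
from typing import Optional


class _TextGrabber(HTMLParser):
    """Collect text, optionally only inside a given tag."""

    def __init__(self, target: Optional[str]):
        super().__init__()
        self.target = target
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.target and tag == self.target:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.target and tag == self.target and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.target is None or self.depth:
            self.chunks.append(data.strip())


def generic_sitemap_parser_function(content: str) -> str:
    """Stand-in: try <article>, then <main>, then the full document."""
    for target in ("article", "main", None):
        grabber = _TextGrabber(target)
        grabber.feed(content)
        text = " ".join(c for c in grabber.chunks if c)
        if text:
            return text
    return ""


# The missing tests, roughly in the shape the review asks for:
def test_generic_parser_extracts_article_content():
    html = "<html><body><nav>skip</nav><article>Hello docs</article></body></html>"
    assert generic_sitemap_parser_function(html) == "Hello docs"


def test_generic_parser_falls_back_to_main():
    html = "<html><body><main>Main text</main></body></html>"
    assert generic_sitemap_parser_function(html) == "Main text"
```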
```python
if isinstance(page, int) or (isinstance(page, str) and page != "Unknown Title"):
    return ("page_number", document_url, page)
```

**Copilot AI** · Dec 16, 2025
Missing test coverage for the edge case where page is a non-"Unknown Title" string value without a document_url. The _group_key method on line 49 will return ("page_number", None, page_string), which could incorrectly group documents from different sources that happen to have the same page name. Add a test to verify this behavior or adjust the logic to handle this case.
```python
def docusaurus_sitemap_parser_function(content: Union[str, BeautifulSoup]) -> str:
    """
    Parse Docusaurus pages from a sitemap.

    Given HTML content (as a string or BeautifulSoup object), return the extracted text from the main content area.
    """
    soup = _as_soup(content)
    root = _select_docusaurus_root(soup)
    return _extract_text(root)


def astro_sitemap_parser_function(content: Union[str, BeautifulSoup]) -> str:
    """
    Parse Astro pages from a sitemap.

    Given HTML content (as a string or BeautifulSoup object), return the extracted text from the main content area.
    """
    soup = _as_soup(content)
    root = _select_astro_root(soup)
    return _extract_text(root)


def generic_sitemap_parser_function(content: Union[str, BeautifulSoup]) -> str:
    """
    Parse generic HTML pages from a sitemap.

    This is a safe fallback that tries <article> first, then <main>, and finally the full document body.
    """
    soup = _as_soup(content)
    root = _select_generic_root(soup)
    return _extract_text(root)
```

**Copilot AI** · Dec 16, 2025
The docstrings for these functions do not follow the NumPy style required by the repository's coding guidelines. They should include a Parameters section describing the content parameter and a Returns section describing the return type and value.
```python
def docusaurus_sitemap_metadata_parser_function(meta: dict, _content: Any) -> dict:
    """Extract metadata for Docusaurus pages."""
    soup = _as_soup(_content) if isinstance(_content, (str, BeautifulSoup)) else _content
    root = _select_docusaurus_root(soup)
    source_url = meta.get("loc") or meta.get("source")
    title = _extract_title(soup, root)
    if title == "Unknown Title" and source_url:
        title = _title_from_url(str(source_url))
    meta["title"] = title
    return {"source": source_url, **meta} if source_url else meta


def astro_sitemap_metadata_parser_function(meta: dict, _content: Any) -> dict:
    """Extract metadata for Astro pages."""
    soup = _as_soup(_content) if isinstance(_content, (str, BeautifulSoup)) else _content
    root = _select_astro_root(soup)
    source_url = meta.get("loc") or meta.get("source")
    title = _extract_title(soup, root)
    if title == "Unknown Title" and source_url:
        title = _title_from_url(str(source_url))
    meta["title"] = title
    return {"source": source_url, **meta} if source_url else meta


def generic_sitemap_metadata_parser_function(meta: dict, _content: Any) -> dict:
    """Extract metadata for generic HTML pages."""
    soup = _as_soup(_content) if isinstance(_content, (str, BeautifulSoup)) else _content
    root = _select_generic_root(soup)
    source_url = meta.get("loc") or meta.get("source")
    title = _extract_title(soup, root)
    if title == "Unknown Title" and source_url:
        title = _title_from_url(str(source_url))
    meta["title"] = title
    return {"source": source_url, **meta} if source_url else meta
```

**Copilot AI** · Dec 16, 2025
The docstrings for these metadata parser functions do not follow the NumPy style required by the repository's coding guidelines. They should include a Parameters section (meta and _content parameters) and a Returns section describing the return type and value.
```python
@staticmethod
def _select_parser_functions(
    parser_override: Optional[str],
) -> tuple[Optional[callable], Optional[callable]]:
```

**Copilot AI** · Dec 16, 2025
The return type annotation for this method should use tuple[Callable, Callable] | tuple[None, None] (or Optional[tuple[Callable, Callable]]) instead of tuple[Optional[callable], Optional[callable]]. The current signature suggests either element can be None independently, but the implementation always returns both as None or both as callables. Also, use Callable (capital C) from typing instead of the lowercase built-in callable.
```python
except (json.JSONDecodeError, TypeError):
    sitemap_loader_parameters[x.key] = x.value
else:
    sitemap_loader_parameters[x.key] = int(x.value) if x.value.isdigit() else x.value
```

**Copilot AI** · Dec 16, 2025

Potential `AttributeError` if `x.value` is not a string. The code calls `x.value.isdigit()` on line 166, but if `x.value` is `None` or another type that doesn't have an `isdigit()` method, this will raise an `AttributeError`. Add a type check before calling `isdigit()`.

Suggested change:

```diff
- sitemap_loader_parameters[x.key] = int(x.value) if x.value.isdigit() else x.value
+ sitemap_loader_parameters[x.key] = int(x.value) if isinstance(x.value, str) and x.value.isdigit() else x.value
```
```python
async def _summarize_documents(
    self,
    documents: list[Document],
    config: RunnableConfig,
    *,
    max_concurrency: Optional[int],
) -> list[SummarizerOutput]:
    """Summarize a set of already-chunked documents.

    Notes
    -----
    This optionally limits task fan-out using a per-call semaphore (max_concurrency).
    The actual LLM call concurrency is always bounded by the instance semaphore held
    inside `_summarize_chunk`.
    """
```

**Copilot AI** · Dec 16, 2025
The _summarize_documents method is missing the Parameters and Returns sections in its docstring. Add these sections following NumPy style to document the documents, config, and max_concurrency parameters, as well as the return type.
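The per-call semaphore pattern the Notes section describes can be sketched with asyncio. Names and bodies here are illustrative, not the PR's actual implementation:

```python
import asyncio
from typing import List, Optional


async def _summarize_chunk(doc: str) -> str:
    # Placeholder for the real LLM call.
    await asyncio.sleep(0)
    return f"summary of {doc}"


async def summarize_documents(
    documents: List[str], *, max_concurrency: Optional[int]
) -> List[str]:
    # A per-call semaphore bounds task fan-out when max_concurrency is set;
    # otherwise all chunks are summarized fully in parallel.
    semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency else None

    async def run(doc: str) -> str:
        if semaphore is None:
            return await _summarize_chunk(doc)
        async with semaphore:
            return await _summarize_chunk(doc)

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(run(d) for d in documents))
```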