Skip to content

Conversation

@jonasscheid
Copy link
Contributor

@jonasscheid jonasscheid commented Feb 6, 2026

Summary

  • The -c/--category option in download-all-public-category-files now accepts comma-separated values (e.g. -c RAW,SEARCH) to download files from multiple categories in a single command
  • get_all_category_file_list() accepts either a single category string or a list of categories
  • Backward compatible: single category values (e.g. -c RAW) continue to work as before

Usage

# Single category (unchanged)
pridepy download-all-public-category-files -a PXD008644 -o /out -c RAW

# Multiple categories (new)
pridepy download-all-public-category-files -a PXD008644 -o /out -c RAW,SEARCH

Test plan

  • Existing test_get_all_category_file_list passes (single category still works)
  • New test_get_all_category_file_list_multiple verifies multi-category filtering returns correct files from both categories
  • All 6 active tests pass (pytest pridepy/tests/ -v)
  • Verify CLI validation rejects invalid category names

Summary by CodeRabbit

  • New Features

    • Added support for downloading files from multiple categories simultaneously.
    • Command-line interface now accepts comma-separated categories for flexible filtering.
    • Added validation with detailed error messages for invalid category inputs.
  • Tests

    • Added test coverage for multi-category file filtering functionality.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 6, 2026

📝 Walkthrough

Walkthrough

The pull request extends file download functionality to support multiple categories simultaneously. The core method get_all_category_file_list now accepts either a single string or a list of strings for categories, with filtering applied across all provided categories. The CLI command download-all-public-category-files is updated to accept comma-separated category values with runtime validation against valid options. A new test verifies multi-category filtering behavior.

Changes

Cohort / File(s) Summary
Core file service logic
pridepy/files/files.py
Updated download_all_category_files and get_all_category_file_list to accept multiple categories as a list instead of a single string. Added category parameter conversion logic and filtering across category set. Includes fallback to default ["RAW"] when categories are not provided.
CLI command layer
pridepy/pridepy.py
Modified download-all-public-category-files to accept comma-separated categories instead of fixed choice list. Added runtime validation that parses, normalizes, and validates provided categories against valid options, raising click.BadParameter for invalid inputs.
Test coverage
pridepy/tests/test_raw_files.py
Added test_get_all_category_file_list_multiple test method to verify filtering behavior when multiple categories ("RAW" and "SEARCH") are passed to get_all_category_file_list.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • move to poetry #61: Related through overlapping CLI/file-download functionality and category-handling parameter changes.

Poem

🐰 Multiple categories now align,
No more choosing just a single line!
Files filter swift through validation's care,
More data down, with easier flair! 🎯

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: adding support for comma-separated multiple categories in the download-all-public-category-files command.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@qodo-code-review
Copy link
Contributor

Review Summary by Qodo

Support comma-separated multiple categories in category file downloads

✨ Enhancement

Grey Divider

Walkthroughs

Description
• Support comma-separated multiple categories in download CLI command
• Accept both single category string and list in get_all_category_file_list()
• Add CLI validation for category names with helpful error messages
• Maintain backward compatibility with single category parameter
Diagram
flowchart LR
  CLI["CLI Input: -c RAW,SEARCH"] -- "Parse & Validate" --> Validation["Category Validation"]
  Validation -- "Split & Normalize" --> Categories["List of Categories"]
  Categories -- "Pass to Method" --> GetFiles["get_all_category_file_list"]
  GetFiles -- "Filter by Multiple" --> Results["Combined File Results"]
Loading

Grey Divider

File Changes

1. pridepy/files/files.py ✨ Enhancement +18/-9

Support multiple categories in file filtering methods

• Modified download_all_category_files() to accept categories list parameter while maintaining
 backward compatibility with category string parameter
• Updated get_all_category_file_list() to accept either single category string or list of
 categories using union type hint
• Changed file filtering logic to use set membership check instead of equality comparison for
 multiple category support
• Added type hints for return value List[Dict]

pridepy/files/files.py


2. pridepy/pridepy.py ✨ Enhancement +13/-4

Add CLI validation and parsing for comma-separated categories

• Removed click.Choice constraint from -c/--category option to allow comma-separated input
• Updated help text to document comma-separated category syntax
• Added CLI-level validation function to check category names against valid set and provide detailed
 error messages
• Normalized category input by splitting on commas, stripping whitespace, and converting to
 uppercase
• Pass parsed categories list to download_all_category_files() method

pridepy/pridepy.py


3. pridepy/tests/test_raw_files.py 🧪 Tests +13/-0

Add test for multiple category filtering

• Added new test test_get_all_category_file_list_multiple() to verify multi-category filtering
 functionality
• Test validates that filtering by multiple categories returns combined results from all specified
 categories
• Verifies both categories are present in the returned file list using set comparison
• Existing single-category test remains unchanged for backward compatibility verification

pridepy/tests/test_raw_files.py


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Contributor

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (2) 📎 Requirement gaps (0)

Grey Divider


Action required

1. Py3.9 union annotation 🐞 Bug ⛯ Reliability
Description
get_all_category_file_list uses a quoted PEP604 union ("str | List[str]"). While it won’t
  crash at import time, any tooling that evaluates annotations (e.g., typing.get_type_hints, some
  doc generators) can raise SyntaxError on Python 3.9.
• The project declares Python ^3.9, so this compatibility risk is in-scope for supported runtimes.
• This can break downstream users in otherwise valid environments, especially in type-aware
  frameworks and documentation builds.
Code

pridepy/files/files.py[R756-758]

+    def get_all_category_file_list(
+        self, accession: str, categories: "str | List[str]"
+    ) -> List[Dict]:
Evidence
The function annotation contains str | List[str] (PEP604 union syntax) embedded in a string. The
package supports Python 3.9, which cannot evaluate that syntax if any runtime attempts to resolve
annotations from strings.

pridepy/files/files.py[756-758]
pyproject.toml[46-48]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`get_all_category_file_list` uses a quoted PEP604 union annotation (`"str | List[str]"`). On Python 3.9, this can break any runtime/tooling that evaluates annotations (e.g., `typing.get_type_hints`), causing a `SyntaxError`.

### Issue Context
The project declares `python = "^3.9"`, so Python 3.9 is a supported runtime.

### Fix Focus Areas
- pridepy/files/files.py[756-758]

### Suggested change
- Import `Union` from `typing`.
- Change annotation to `Union[str, List[str]]` (or `Sequence[str]` if you want to accept more iterables).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. get_all_category_file_list() unvalidated input 📘 Rule violation ⛯ Reliability
Description
get_all_category_file_list() assumes categories is always a str or iterable of strings and
  immediately calls set(categories) after only a str-type check.
• If a caller passes None or a non-iterable value, this will raise a runtime exception, and if
  callers pass unnormalized values (e.g., raw, whitespace, empty entries), it may silently return an
  empty list.
• This is missing explicit edge-case handling and input validation for an externally callable
  method.
Code

pridepy/files/files.py[R756-772]

+    def get_all_category_file_list(
+        self, accession: str, categories: "str | List[str]"
+    ) -> List[Dict]:
        """
-        Retrieve a list of files from a specific project that belong to a given category.
+        Retrieve a list of files from a specific project that belong to given categories.

        :param accession: The PRIDE project accession identifier.
-        :param category: The category of files to filter by.
-        :return: A list of files in the specified category.
+        :param categories: A single category string or list of categories to filter by.
+        :return: A list of files matching the specified categories.
        """
        record_files = self.stream_all_files_by_project(accession)
+        if isinstance(categories, str):
+            categories = [categories]
+        category_set = set(categories)
        category_files = [
-            file for file in record_files if file["fileCategory"]["value"] == category
+            file for file in record_files if file["fileCategory"]["value"] in category_set
        ]
Evidence
The compliance checklist requires explicit handling of null/empty/boundary cases and validation of
external inputs. The updated implementation only normalizes when categories is a str and then
calls set(categories) without guarding against None/invalid types or empty/whitespace values,
which can cause errors or silent misbehavior.

Rule 3: Generic: Robust Error Handling and Edge Case Management
Rule 6: Generic: Security-First Input Validation and Data Handling
pridepy/files/files.py[756-772]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`get_all_category_file_list()` does not validate/normalize the `categories` argument beyond checking for `str`, which can lead to runtime exceptions (e.g., `categories=None`) or silent empty results (e.g., whitespace/empty entries, unexpected casing).

## Issue Context
This method is externally callable from library consumers/tests, so it should defensively handle null/invalid inputs and provide actionable errors per the robustness and input-validation compliance requirements.

## Fix Focus Areas
- pridepy/files/files.py[756-772]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. Implicit RAW default 🐞 Bug ✓ Correctness
Description
download_all_category_files now defaults to ['RAW'] when both categories and the deprecated
  category are omitted.
• This changes the API contract from “explicit category required” to “silent default,” which can
  mask caller bugs (e.g., forgetting to pass a category) and can initiate downloads the caller didn’t
  intend.
• Because downloads can be large/slow, silent fallback increases the risk of unexpected runtime
  behavior without any warning.
Code

pridepy/files/files.py[R728-745]

+        categories: List[str] = None,
+        category: str = None,
    ):
        """
-        Download all files of a specified category from a PRIDE project.
+        Download all files of specified categories from a PRIDE project.

        :param accession: The PRIDE project accession identifier.
        :param output_folder: The directory where the files will be downloaded.
        :param skip_if_downloaded_already: If True, skips downloading files that already exist.
        :param protocol: The transfer protocol to use (e.g., ftp, aspera, globus, s3).
        :param aspera_maximum_bandwidth: Maximum bandwidth for Aspera transfers.
        :param checksum_check: If True, downloads the checksum file for the project.
-        :param category: The category of files to download.
+        :param categories: List of file categories to download.
+        :param category: Single file category (deprecated, use categories instead).
        """
-        raw_files = self.get_all_category_file_list(accession, category)
+        if categories is None:
+            categories = [category] if category else ["RAW"]
+        raw_files = self.get_all_category_file_list(accession, categories)
Evidence
Both categories and category default to None, and the code silently substitutes ['RAW'] when
neither is provided, instead of rejecting invalid input.

pridepy/files/files.py[720-745]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`download_all_category_files` silently defaults to `['RAW']` when both `categories` and `category` are omitted. This can hide caller mistakes and trigger unintended downloads.

### Issue Context
The method now supports both `categories` (new) and `category` (deprecated). The safest behavior is to require one of them to be provided.

### Fix Focus Areas
- pridepy/files/files.py[720-745]

### Suggested change
- Replace the implicit default logic with validation:
 - If `categories is None` and `category is None`: raise `ValueError` (or a library-specific exception).
 - If `categories` is an empty list: raise `ValueError`.
 - If `category` is provided: set `categories = [category]` and consider logging a deprecation warning.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


4. Flaky live API test 🐞 Bug ⛯ Reliability
Description
• New test asserts a fixed file count for a real PRIDE accession via a live HTTP call.
• This makes CI/test runs sensitive to network availability, PRIDE API uptime, and dataset drift
  (file counts can change over time).
• Failures will be noisy/non-actionable (test fails even if code is correct), reducing confidence in
  the test suite.
Code

pridepy/tests/test_raw_files.py[R41-52]

+    def test_get_all_category_file_list_multiple(self):
+        """
+        Test filtering by multiple categories at once.
+        PXD008644 has 2 RAW + 2 SEARCH = 4 files combined.
+        """
+        raw = Files()
+        result = raw.get_all_category_file_list("PXD008644", ["RAW", "SEARCH"])
+        assert len(result) == 4
+
+        # Verify both categories are present
+        categories = {file["fileCategory"]["value"] for file in result}
+        assert categories == {"RAW", "SEARCH"}
Evidence
The test calls Files.get_all_category_file_list, which calls stream_all_files_by_project and
ultimately Util.read_json_stream that performs an HTTP GET. The test then asserts a hard-coded
count.

pridepy/tests/test_raw_files.py[41-52]
pridepy/files/files.py[85-92]
pridepy/util/api_handling.py[72-107]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The new multi-category test depends on the live PRIDE API and asserts a fixed count. This is prone to flakiness due to network/API/data changes.

### Issue Context
`Files.get_all_category_file_list` calls `stream_all_files_by_project`, which uses `Util.read_json_stream` to perform HTTP requests.

### Fix Focus Areas
- pridepy/tests/test_raw_files.py[41-52]
- pridepy/files/files.py[85-92]

### Suggested change
- Use `unittest.mock.patch` to patch `Files.stream_all_files_by_project` (preferred) and return a small in-memory list of file dicts with categories RAW/SEARCH.
- Assert filtering behavior based on that fixture, not on remote counts.
- If you want to keep a live smoke test, separate/mark it as integration and skip by default.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Advisory comments

5. Misleading raw_files naming 📘 Rule violation ✓ Correctness
Description
raw_files now stores files from multiple categories (and may include non-RAW categories), but
  the variable name still implies only RAW files.
• This reduces readability and can lead to misunderstandings when maintaining or extending the
  download flow.
Code

pridepy/files/files.py[R743-748]

+        if categories is None:
+            categories = [category] if category else ["RAW"]
+        raw_files = self.get_all_category_file_list(accession, categories)
        self.download_files(
            raw_files,
            accession,
Evidence
The checklist requires self-documenting, meaningful naming. After changing the function to support
multiple categories, the variable raw_files no longer accurately describes its contents.

Rule 2: Generic: Meaningful Naming and Self-Documenting Code
pridepy/files/files.py[743-748]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The variable `raw_files` is misleading because it can now include files from multiple/non-RAW categories.

## Issue Context
This method was updated to accept multiple categories; naming should reflect the broader meaning to keep the code self-documenting.

## Fix Focus Areas
- pridepy/files/files.py[743-748]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

Comment on lines +756 to +758
def get_all_category_file_list(
self, accession: str, categories: "str | List[str]"
) -> List[Dict]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Py3.9 union annotation 🐞 Bug ⛯ Reliability

get_all_category_file_list uses a quoted PEP604 union ("str | List[str]"). While it won’t
  crash at import time, any tooling that evaluates annotations (e.g., typing.get_type_hints, some
  doc generators) can raise SyntaxError on Python 3.9.
• The project declares Python ^3.9, so this compatibility risk is in-scope for supported runtimes.
• This can break downstream users in otherwise valid environments, especially in type-aware
  frameworks and documentation builds.
Agent Prompt
### Issue description
`get_all_category_file_list` uses a quoted PEP604 union annotation (`"str | List[str]"`). On Python 3.9, this can break any runtime/tooling that evaluates annotations (e.g., `typing.get_type_hints`), causing a `SyntaxError`.

### Issue Context
The project declares `python = "^3.9"`, so Python 3.9 is a supported runtime.

### Fix Focus Areas
- pridepy/files/files.py[756-758]

### Suggested change
- Import `Union` from `typing`.
- Change annotation to `Union[str, List[str]]` (or `Sequence[str]` if you want to accept more iterables).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

@ypriverol ypriverol self-requested a review February 6, 2026 15:22
@ypriverol ypriverol merged commit b046bd0 into PRIDE-Archive:master Feb 6, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants