Fix KeyError in statistical analysis helper for large datasets#227

Open
Efdix wants to merge 1 commit into ventolab:master from Efdix:fix/clean-submit


Efdix commented Dec 29, 2025

Fix KeyError in cpdb_statistical_analysis_helper with large datasets

Description

This PR fixes a KeyError encountered when running cpdb_statistical_analysis_method.call on large datasets (e.g., ~500k cells).

Issue

When processing wide DataFrames (many columns, one per cell), the call counts.set_index('id_multidata', inplace=True, drop=True) in cpdb_statistical_analysis_helper.py left the DataFrame without its cell-barcode columns. The next step, counts = counts[cells_names], then raised a KeyError because none of the requested columns were present.

Traceback:

KeyError: "None of [Index(['...'], dtype='object', length=493019)] are in the [columns]"
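
To make the traceback easier to read, here is a minimal sketch of the failure mode on a toy frame: selecting a list of columns that are all absent raises this kind of KeyError. It does not reproduce the large-dataset behaviour itself, and the barcode names are hypothetical.

```python
import pandas as pd

# Toy frame that has lost its cell columns; only id_multidata remains.
counts = pd.DataFrame({"id_multidata": [10, 11]})
cells_names = ["cell_a", "cell_b"]  # hypothetical barcodes, not from the real dataset

try:
    counts[cells_names]  # every requested column is missing
    message = None
except KeyError as err:
    message = str(err)

print(message)
```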

Fix

Replaced the set_index(inplace=True) operation with a manual index assignment:

counts.index = counts['id_multidata']
counts.index.name = 'id_multidata'

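This approach sidesteps the set_index(inplace=True) call that lost the columns on very large DataFrames, and preserves the cell-barcode columns. The root cause inside Pandas has not been identified. Unlike set_index(drop=True), the manual assignment leaves id_multidata in place as a regular column; the subsequent counts = counts[cells_names] selection discards it, provided cells_names holds only cell barcodes.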
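
As a rough check that the two approaches agree, the sketch below compares them on a small toy frame. The data and the barcode names are assumptions made for illustration; only the set_index call and the two assignment lines come from this PR.

```python
import pandas as pd

# Toy stand-in for the counts table: one row per id_multidata, one column per cell.
counts = pd.DataFrame({
    "id_multidata": [10, 11, 12],
    "cell_a": [0.0, 1.5, 2.0],
    "cell_b": [3.0, 0.0, 0.5],
})
cells_names = ["cell_a", "cell_b"]  # hypothetical barcodes

# Original approach: set_index with drop=True removes id_multidata from the columns.
via_set_index = counts.set_index("id_multidata", drop=True)[cells_names]

# Approach from this PR: assign the index directly.
manual = counts.copy()
manual.index = manual["id_multidata"]
manual.index.name = "id_multidata"
# id_multidata is still a regular column here; the column selection drops it.
via_manual = manual[cells_names]

# Raises AssertionError if the two results differ in values, index, or columns.
pd.testing.assert_frame_equal(via_set_index, via_manual)
```

On a small frame both paths give the same result, which matches the equivalence claimed for smaller datasets; it says nothing about the behaviour at ~500k columns.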

Verification

  • Verified that the KeyError is resolved and the analysis proceeds correctly with the large dataset.
  • Verified that the logic remains equivalent for smaller datasets.

Environment

  • CellphoneDB version: 5.0.1
  • Pandas version: 2.3.3
  • Python version: 3.14.2
  • Dataset size: ~500k cells
