-
Notifications
You must be signed in to change notification settings - Fork 8
vep attributions #49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vep attributions #49
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds variant effect attribution functionality to DECIMA, enabling computation of feature attributions for genetic variants to interpret their effects on gene expression through transcription factor motifs. The changes refactor the VEP module into separate files for better organization and add comprehensive attribution analysis capabilities.
Key changes:
- Refactored VEP code from a single
__init__.pyinto separatevep.pyandattributions.pymodules for better code organization - Added
variant_effect_attributionfunction to compute attributions for variants with support for different attribution methods and transformations - Introduced
VariantAttributionResultclass extendingAttributionResultto handle variant-specific attribution data and analysis - Added CLI command
vep-attributionfor variant attribution analysis - Enhanced
Attributionclass with subtraction operator (__sub__) for differential attribution analysis
Reviewed changes
Copilot reviewed 12 out of 14 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| src/decima/vep/init.py | Refactored to import from new submodules instead of containing all VEP code |
| src/decima/vep/vep.py | New file containing variant effect prediction functions moved from __init__.py with extracted helper functions |
| src/decima/vep/attributions.py | New file with variant effect attribution functionality including the main variant_effect_attribution function |
| src/decima/utils/io.py | Added VariantAttributionWriter class for writing variant attribution results to HDF5 files |
| src/decima/core/attribution.py | Added __sub__ method to Attribution class; added VariantAttributionResult class; refactored AttributionResult methods to support variant-gene pairs |
| src/decima/cli/vep_attribution.py | New CLI command for variant attribution analysis with comprehensive options |
| src/decima/cli/init.py | Registered new vep-attribution CLI command |
| tests/test_vep_attributions.py | New test file for variant attribution functionality |
| tests/test_vep.py | Updated import path for VEP functions |
| tests/test_interpret_attribution.py | Added tests for Attribution.__sub__ method |
| tests/test_attribution.py | Changed assertions to use all() for numpy array comparisons; removed duplicate assertion |
| tests/test_cli.py | Added test for new vep-attribution CLI command |
| .gitignore | Added logs/ directory to gitignore |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| output_prefix=(str(output_prefix) + "_{model}").format(model=idx), | ||
| tasks=tasks, | ||
| off_tasks=off_tasks, | ||
| model=model, |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the recursive call at line 125, the parameter model is being passed, but it should be the individual model from the loop variable. The loop iterates over models with for idx, model in enumerate(models), so this passes the original model parameter instead of the current iteration's model, which will cause infinite recursion or incorrect behavior.
| df_variant (Union[pd.DataFrame, str]): Input variants. Can be a pandas DataFrame or a path | ||
| to a file (.tsv, .csv, or .vcf). If a file path is provided, it will be loaded. | ||
| Required columns/fields depend on the input format but generally include chromosome, | ||
| position, reference allele, alternate allele. | ||
| output_prefix (str): Path to the output HDF5 file where attributions will be saved. |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring parameter name is df_variant but the actual parameter is named variants. This inconsistency could confuse users reading the documentation.
| "--metadata", | ||
| default=None, | ||
| callback=parse_metadata, | ||
| help=f"Path to the metadata anndata file or name of the model. If not provided, the compabilite metadata for the model will be used. Default: {DEFAULT_ENSEMBLE}.", |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The word 'compabilite' should be 'compatible'.
| help=f"Path to the metadata anndata file or name of the model. If not provided, the compabilite metadata for the model will be used. Default: {DEFAULT_ENSEMBLE}.", | |
| help=f"Path to the metadata anndata file or name of the model. If not provided, the compatible metadata for the model will be used. Default: {DEFAULT_ENSEMBLE}.", |
src/decima/core/attribution.py
Outdated
| self.attribution_h5, | ||
| self._idx[gene], | ||
| gene, | ||
| "_".join(gene), |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code at line 996 uses "_".join(gene) which works when gene is a tuple of (variant, gene_name) from VariantAttributionResult, but would fail when gene is a string in the base AttributionResult case. The genes parameter type hint should be Union[List[str], List[Tuple[str, str]]] to accurately reflect this polymorphic behavior, or the API should be refactored to make it clearer and type-safe.
| "_".join(gene), | |
| gene if isinstance(gene, str) else "_".join(gene), |
| ) | ||
| return attribution_alt - attribution | ||
|
|
||
| def _get_metadata(self, idx: List[str], metadata_anndata: Optional[str] = None, custom_genome: bool = False): |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type hint for the idx parameter is incorrect. It's declared as List[str] but should be Union[int, List[int]] since it's used to index into self.genes and can be either a single index (as called from load_attribution at line 895) or a list of indices (as called from recursive_seqlet_calling at line 989).
| def _get_metadata(self, idx: List[str], metadata_anndata: Optional[str] = None, custom_genome: bool = False): | |
| def _get_metadata(self, idx: Union[int, List[int]], metadata_anndata: Optional[str] = None, custom_genome: bool = False): |
| def add( | ||
| self, | ||
| variant: str, | ||
| gene: str, | ||
| rel_pos: int, | ||
| seqs_ref: np.ndarray, | ||
| attrs_ref: np.ndarray, | ||
| seqs_alt: np.ndarray, | ||
| attrs_alt: np.ndarray, | ||
| gene_mask_start: int, | ||
| gene_mask_end: int, | ||
| ): | ||
| """Add attribution data for a variant gene pair. | ||
| Args: | ||
| variant: Variant name from the variants list. | ||
| gene: Gene name from the genes list. | ||
| attrs: Attribution scores, shape (4, DECIMA_CONTEXT_SIZE). | ||
| seqs: One-hot DNA sequence, shape (4, DECIMA_CONTEXT_SIZE). | ||
| """ | ||
| super().add((variant, gene), seqs_ref, attrs_ref, gene_mask_start, gene_mask_end) | ||
| idx = self.idx[(variant, gene)] | ||
| self.h5_writer["variants"][idx] = np.array(variant, dtype="S100") | ||
| self.h5_writer["rel_pos"][idx] = int(rel_pos) | ||
| self.h5_writer["sequence_alt"][idx, :] = convert_input_type( | ||
| torch.from_numpy(seqs_alt), # convert_input_type only support Tensor |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method requires 10 positional arguments, whereas overridden AttributionWriter.add requires at most 6.
| def add( | |
| self, | |
| variant: str, | |
| gene: str, | |
| rel_pos: int, | |
| seqs_ref: np.ndarray, | |
| attrs_ref: np.ndarray, | |
| seqs_alt: np.ndarray, | |
| attrs_alt: np.ndarray, | |
| gene_mask_start: int, | |
| gene_mask_end: int, | |
| ): | |
| """Add attribution data for a variant gene pair. | |
| Args: | |
| variant: Variant name from the variants list. | |
| gene: Gene name from the genes list. | |
| attrs: Attribution scores, shape (4, DECIMA_CONTEXT_SIZE). | |
| seqs: One-hot DNA sequence, shape (4, DECIMA_CONTEXT_SIZE). | |
| """ | |
| super().add((variant, gene), seqs_ref, attrs_ref, gene_mask_start, gene_mask_end) | |
| idx = self.idx[(variant, gene)] | |
| self.h5_writer["variants"][idx] = np.array(variant, dtype="S100") | |
| self.h5_writer["rel_pos"][idx] = int(rel_pos) | |
| self.h5_writer["sequence_alt"][idx, :] = convert_input_type( | |
| torch.from_numpy(seqs_alt), # convert_input_type only support Tensor | |
| def add(self, *args, **kwargs): | |
| """Add attribution data for a variant-gene pair. | |
| This method supports the original extended calling convention: | |
| add( | |
| variant, gene, rel_pos, | |
| seqs_ref, attrs_ref, | |
| seqs_alt, attrs_alt, | |
| gene_mask_start, gene_mask_end, | |
| ) | |
| and can also be called in a base-class style where the first argument | |
| is a (variant, gene) key: | |
| add( | |
| (variant, gene), | |
| seqs_ref, attrs_ref, | |
| gene_mask_start, gene_mask_end, | |
| rel_pos=..., seqs_alt=..., attrs_alt=..., | |
| ) | |
| """ | |
| # Case 1: extended subclass-specific signature with 9 positional args. | |
| if len(args) == 9: | |
| ( | |
| variant, | |
| gene, | |
| rel_pos, | |
| seqs_ref, | |
| attrs_ref, | |
| seqs_alt, | |
| attrs_alt, | |
| gene_mask_start, | |
| gene_mask_end, | |
| ) = args | |
| # Case 2: base-class-like positional prefix, with extras supplied via | |
| # remaining positional args or keyword arguments. | |
| elif len(args) >= 5: | |
| key, seqs_ref, attrs_ref, gene_mask_start, gene_mask_end = args[:5] | |
| try: | |
| variant, gene = key | |
| except (TypeError, ValueError): # not iterable of length 2 | |
| raise TypeError( | |
| "First argument must be (variant, gene) when calling " | |
| "VariantAttributionWriter.add with base-class-style arguments." | |
| ) | |
| remaining = args[5:] | |
| if len(remaining) == 3: | |
| rel_pos, seqs_alt, attrs_alt = remaining | |
| else: | |
| try: | |
| rel_pos = kwargs.pop("rel_pos") | |
| seqs_alt = kwargs.pop("seqs_alt") | |
| attrs_alt = kwargs.pop("attrs_alt") | |
| except KeyError as e: | |
| raise TypeError( | |
| f"Missing required keyword argument {e.args[0]!r} for " | |
| "VariantAttributionWriter.add." | |
| ) | |
| else: | |
| raise TypeError( | |
| "VariantAttributionWriter.add expected either 9 positional " | |
| "arguments or at least 5 positional arguments matching the " | |
| "base-class signature." | |
| ) | |
| super().add((variant, gene), seqs_ref, attrs_ref, gene_mask_start, gene_mask_end) | |
| idx = self.idx[(variant, gene)] | |
| self.h5_writer["variants"][idx] = np.array(variant, dtype="S100") | |
| self.h5_writer["rel_pos"][idx] = int(rel_pos) | |
| self.h5_writer["sequence_alt"][idx, :] = convert_input_type( | |
| torch.from_numpy(seqs_alt), # convert_input_type only supports Tensor |
No description provided.