Skip to content

Conversation

@MuhammedHasan
Copy link
Collaborator

No description provided.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds variant effect attribution functionality to DECIMA, enabling computation of feature attributions for genetic variants to interpret their effects on gene expression through transcription factor motifs. The changes refactor the VEP module into separate files for better organization and add comprehensive attribution analysis capabilities.

Key changes:

  • Refactored VEP code from a single __init__.py into separate vep.py and attributions.py modules for better code organization
  • Added variant_effect_attribution function to compute attributions for variants with support for different attribution methods and transformations
  • Introduced VariantAttributionResult class extending AttributionResult to handle variant-specific attribution data and analysis
  • Added CLI command vep-attribution for variant attribution analysis
  • Enhanced Attribution class with subtraction operator (__sub__) for differential attribution analysis

Reviewed changes

Copilot reviewed 12 out of 14 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
src/decima/vep/init.py Refactored to import from new submodules instead of containing all VEP code
src/decima/vep/vep.py New file containing variant effect prediction functions moved from __init__.py with extracted helper functions
src/decima/vep/attributions.py New file with variant effect attribution functionality including the main variant_effect_attribution function
src/decima/utils/io.py Added VariantAttributionWriter class for writing variant attribution results to HDF5 files
src/decima/core/attribution.py Added __sub__ method to Attribution class; added VariantAttributionResult class; refactored AttributionResult methods to support variant-gene pairs
src/decima/cli/vep_attribution.py New CLI command for variant attribution analysis with comprehensive options
src/decima/cli/init.py Registered new vep-attribution CLI command
tests/test_vep_attributions.py New test file for variant attribution functionality
tests/test_vep.py Updated import path for VEP functions
tests/test_interpret_attribution.py Added tests for Attribution.__sub__ method
tests/test_attribution.py Changed assertions to use all() for numpy array comparisons; removed duplicate assertion
tests/test_cli.py Added test for new vep-attribution CLI command
.gitignore Added logs/ directory to gitignore

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

output_prefix=(str(output_prefix) + "_{model}").format(model=idx),
tasks=tasks,
off_tasks=off_tasks,
model=model,
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the recursive call at line 125, the parameter model is being passed, but it should be the individual model from the loop variable. The loop iterates over models with for idx, model in enumerate(models), so this passes the original model parameter instead of the current iteration's model, which will cause infinite recursion or incorrect behavior.

Copilot uses AI. Check for mistakes.
Comment on lines +66 to +70
df_variant (Union[pd.DataFrame, str]): Input variants. Can be a pandas DataFrame or a path
to a file (.tsv, .csv, or .vcf). If a file path is provided, it will be loaded.
Required columns/fields depend on the input format but generally include chromosome,
position, reference allele, alternate allele.
output_prefix (str): Path to the output HDF5 file where attributions will be saved.
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring parameter name is df_variant but the actual parameter is named variants. This inconsistency could confuse users reading the documentation.

Copilot uses AI. Check for mistakes.
"--metadata",
default=None,
callback=parse_metadata,
help=f"Path to the metadata anndata file or name of the model. If not provided, the compabilite metadata for the model will be used. Default: {DEFAULT_ENSEMBLE}.",
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The word 'compabilite' should be 'compatible'.

Suggested change
help=f"Path to the metadata anndata file or name of the model. If not provided, the compabilite metadata for the model will be used. Default: {DEFAULT_ENSEMBLE}.",
help=f"Path to the metadata anndata file or name of the model. If not provided, the compatible metadata for the model will be used. Default: {DEFAULT_ENSEMBLE}.",

Copilot uses AI. Check for mistakes.
self.attribution_h5,
self._idx[gene],
gene,
"_".join(gene),
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code at line 996 uses "_".join(gene) which works when gene is a tuple of (variant, gene_name) from VariantAttributionResult, but would fail when gene is a string in the base AttributionResult case. The genes parameter type hint should be Union[List[str], List[Tuple[str, str]]] to accurately reflect this polymorphic behavior, or the API should be refactored to make it clearer and type-safe.

Suggested change
"_".join(gene),
gene if isinstance(gene, str) else "_".join(gene),

Copilot uses AI. Check for mistakes.
)
return attribution_alt - attribution

def _get_metadata(self, idx: List[str], metadata_anndata: Optional[str] = None, custom_genome: bool = False):
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type hint for the idx parameter is incorrect. It's declared as List[str] but should be Union[int, List[int]] since it's used to index into self.genes and can be either a single index (as called from load_attribution at line 895) or a list of indices (as called from recursive_seqlet_calling at line 989).

Suggested change
def _get_metadata(self, idx: List[str], metadata_anndata: Optional[str] = None, custom_genome: bool = False):
def _get_metadata(self, idx: Union[int, List[int]], metadata_anndata: Optional[str] = None, custom_genome: bool = False):

Copilot uses AI. Check for mistakes.
Comment on lines +412 to +437
def add(
self,
variant: str,
gene: str,
rel_pos: int,
seqs_ref: np.ndarray,
attrs_ref: np.ndarray,
seqs_alt: np.ndarray,
attrs_alt: np.ndarray,
gene_mask_start: int,
gene_mask_end: int,
):
"""Add attribution data for a variant gene pair.
Args:
variant: Variant name from the variants list.
gene: Gene name from the genes list.
attrs: Attribution scores, shape (4, DECIMA_CONTEXT_SIZE).
seqs: One-hot DNA sequence, shape (4, DECIMA_CONTEXT_SIZE).
"""
super().add((variant, gene), seqs_ref, attrs_ref, gene_mask_start, gene_mask_end)
idx = self.idx[(variant, gene)]
self.h5_writer["variants"][idx] = np.array(variant, dtype="S100")
self.h5_writer["rel_pos"][idx] = int(rel_pos)
self.h5_writer["sequence_alt"][idx, :] = convert_input_type(
torch.from_numpy(seqs_alt), # convert_input_type only support Tensor
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method requires 10 positional arguments, whereas overridden AttributionWriter.add requires at most 6.

Suggested change
def add(
self,
variant: str,
gene: str,
rel_pos: int,
seqs_ref: np.ndarray,
attrs_ref: np.ndarray,
seqs_alt: np.ndarray,
attrs_alt: np.ndarray,
gene_mask_start: int,
gene_mask_end: int,
):
"""Add attribution data for a variant gene pair.
Args:
variant: Variant name from the variants list.
gene: Gene name from the genes list.
attrs: Attribution scores, shape (4, DECIMA_CONTEXT_SIZE).
seqs: One-hot DNA sequence, shape (4, DECIMA_CONTEXT_SIZE).
"""
super().add((variant, gene), seqs_ref, attrs_ref, gene_mask_start, gene_mask_end)
idx = self.idx[(variant, gene)]
self.h5_writer["variants"][idx] = np.array(variant, dtype="S100")
self.h5_writer["rel_pos"][idx] = int(rel_pos)
self.h5_writer["sequence_alt"][idx, :] = convert_input_type(
torch.from_numpy(seqs_alt), # convert_input_type only support Tensor
def add(self, *args, **kwargs):
"""Add attribution data for a variant-gene pair.
This method supports the original extended calling convention:
add(
variant, gene, rel_pos,
seqs_ref, attrs_ref,
seqs_alt, attrs_alt,
gene_mask_start, gene_mask_end,
)
and can also be called in a base-class style where the first argument
is a (variant, gene) key:
add(
(variant, gene),
seqs_ref, attrs_ref,
gene_mask_start, gene_mask_end,
rel_pos=..., seqs_alt=..., attrs_alt=...,
)
"""
# Case 1: extended subclass-specific signature with 9 positional args.
if len(args) == 9:
(
variant,
gene,
rel_pos,
seqs_ref,
attrs_ref,
seqs_alt,
attrs_alt,
gene_mask_start,
gene_mask_end,
) = args
# Case 2: base-class-like positional prefix, with extras supplied via
# remaining positional args or keyword arguments.
elif len(args) >= 5:
key, seqs_ref, attrs_ref, gene_mask_start, gene_mask_end = args[:5]
try:
variant, gene = key
except (TypeError, ValueError): # not iterable of length 2
raise TypeError(
"First argument must be (variant, gene) when calling "
"VariantAttributionWriter.add with base-class-style arguments."
)
remaining = args[5:]
if len(remaining) == 3:
rel_pos, seqs_alt, attrs_alt = remaining
else:
try:
rel_pos = kwargs.pop("rel_pos")
seqs_alt = kwargs.pop("seqs_alt")
attrs_alt = kwargs.pop("attrs_alt")
except KeyError as e:
raise TypeError(
f"Missing required keyword argument {e.args[0]!r} for "
"VariantAttributionWriter.add."
)
else:
raise TypeError(
"VariantAttributionWriter.add expected either 9 positional "
"arguments or at least 5 positional arguments matching the "
"base-class signature."
)
super().add((variant, gene), seqs_ref, attrs_ref, gene_mask_start, gene_mask_end)
idx = self.idx[(variant, gene)]
self.h5_writer["variants"][idx] = np.array(variant, dtype="S100")
self.h5_writer["rel_pos"][idx] = int(rel_pos)
self.h5_writer["sequence_alt"][idx, :] = convert_input_type(
torch.from_numpy(seqs_alt), # convert_input_type only supports Tensor

Copilot uses AI. Check for mistakes.
@MuhammedHasan MuhammedHasan merged commit fff776d into main Dec 25, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants