Repository to develop the Snakemake pipeline that produces a long-read-guided structural annotation

ConesaLab/SQANTI_evidence

SQANTI-evidence

Long-Read Evidence-Driven Structural Annotation Pipeline
A Snakemake workflow to produce structural genome annotations leveraging long-read sequencing data.


Table of Contents

  1. Overview
  2. Features
  3. Requirements
  4. Installation
  5. Usage
  6. Configuration
  7. Pipeline Workflow
  8. Scripts & Rules
  9. Output
  10. Examples
  11. License & Citations
  12. Contact / Support

Overview

This repository implements a Snakemake pipeline (with auxiliary scripts) to generate structural genome annotations guided by long-read sequencing data (e.g. PacBio, Oxford Nanopore).
It aims to produce high-quality annotations by combining transcript evidence from long reads with conventional annotation strategies. The main structure of the pipeline and its use of long-read transcriptomics are derived from this paper.


Features

  • Modular pipeline built with Snakemake
  • Integration of long-read data to inform exon/intron boundaries
  • Flexible configuration for different organisms & datasets
  • Support for cluster execution (e.g. SLURM)
  • Scripts to assist in annotation processing and QC

Requirements

  • Snakemake (version >= X.X)
  • Python (>= 3.8) + dependencies
  • Linux / Unix environment
  • Aligned long-read RNA (or cDNA) sequencing data (BAM or SAM)
  • Reference genome (FASTA)
  • (Optional) Annotation hints / protein / transcript evidence

You’ll find an envs/ folder for environment / dependency configurations.
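
Before creating an environment, it can help to confirm that the basic requirements above are met. The following is a minimal sketch, not part of the repository: the function name `check_requirements` is hypothetical, and it only checks the Python version and whether a `snakemake` executable is on the PATH, since the minimum Snakemake version is not stated above.

```python
import shutil
import sys
from typing import List


def check_requirements(min_python=(3, 8)) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # The requirements list asks for Python >= 3.8.
    if sys.version_info < min_python:
        problems.append(
            "Python %d.%d or newer is required, found %d.%d"
            % (min_python + tuple(sys.version_info[:2]))
        )
    # Snakemake must be callable from the shell to run the workflow.
    if shutil.which("snakemake") is None:
        problems.append("snakemake executable not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print("WARNING:", problem)
```

Running the script prints one warning per unmet requirement and nothing if both checks pass.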


Installation

  1. Clone the repository:

    git clone https://github.com/pabloati/LR_annotation.git
    cd LR_annotation
    
    
  2. Create and activate a conda / mamba environment (if using):

    conda env create -f envs/env.yaml
    conda activate <env_name>

    (Several environments may be defined under envs/; inspect them and choose the appropriate one.)

  3. Install any extra Python packages not handled by the environment file:

    pip install -r requirements.txt

    (If requirements.txt does not exist, you can generate one from the environment.)


Usage

The main script to run the pipeline is SQANTI_evidence.


Configuration

The behaviour of the pipeline is controlled via:

  • config.yaml — main configuration file (genome paths, sample IDs, parameters)
  • profile_slurm.yaml — parameters and settings for SLURM (if using cluster)

Edit config.yaml to point to your reference genome, aligned reads, and other evidence files.
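
Because most failures come from wrong paths in the configuration, a small validation step can catch them early. The sketch below assumes the configuration has already been parsed into a dictionary (inside a Snakefile, Snakemake exposes it as `config`). The key names `genome` and `alignments` are hypothetical placeholders; check config.yaml in the repository for the real ones.

```python
import os
from typing import Dict, List

# Hypothetical key names; the real ones are defined in the repository's config.yaml.
REQUIRED_PATH_KEYS = ("genome", "alignments")


def validate_config(config: Dict[str, object]) -> List[str]:
    """Return a list of problems found in a parsed configuration mapping."""
    problems = []
    for key in REQUIRED_PATH_KEYS:
        value = config.get(key)
        if not value:
            # The entry is absent or empty.
            problems.append("missing required entry: %s" % key)
        elif not os.path.isfile(str(value)):
            # The entry is set but does not point to an existing file.
            problems.append("file for '%s' not found: %s" % (key, value))
    return problems
```

An empty list means every required entry is present and points to an existing file; otherwise each string describes one problem to fix before launching the workflow.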


Pipeline Workflow

Rough outline of the major steps / rules (in rules/):

  1. Preprocessing of reads / alignments
  2. Transcript feature extraction
  3. Long-read informed exon/intron boundary refinement
  4. Evidence merging with other annotation sources
  5. Final structural annotation (e.g. GFF3 output)
  6. QC and filtering steps

Refer to the individual rule files in rules/ for detailed logic.
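
To make the notion of exon/intron boundaries in step 3 concrete, the sketch below derives the introns implied by one transcript's exons. It is an illustration of the general idea only and does not reproduce the logic of the rule files; the function name `introns_from_exons` is hypothetical, and coordinates are assumed to be 1-based and inclusive, as in GFF3/GTF.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end), as in GFF3/GTF


def introns_from_exons(exons: List[Interval]) -> List[Interval]:
    """Return the introns implied by one transcript's exons."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # An intron exists only if there is a gap between consecutive exons.
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


print(introns_from_exons([(100, 200), (301, 400), (501, 650)]))
# prints [(201, 300), (401, 500)]
```

Each intron start and end corresponds to a splice junction that long-read alignments can support or contradict.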


Scripts & Rules

  • scripts/ — utility scripts used by the workflow (e.g. parsing, filtering)
  • snakefile — main workflow entry
  • rules/ — subrules modularizing steps
  • lr_annot.py — core Python module / driver (if used in pipeline)

You can read through them to see custom parameters, function calls, and expected behavior.


Output

Typical outputs include:

  • GFF3 / GTF annotated structural models
  • Transcript / exon / intron files
  • QC reports
  • Intermediate alignment / feature files

Output paths and filenames are configurable via config.yaml.
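
A quick way to inspect a GFF3 output is to count how many features of each type it contains. The sketch below relies only on the GFF3 layout (nine tab-separated columns, with the feature type in the third column); the function name and the file path in the usage comment are hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_feature_types(lines: Iterable[str]) -> Counter:
    """Count the values of the GFF3 'type' column (third column)."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines, directives and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GFF3 feature line
        counts[fields[2]] += 1
    return counts


# Usage with a hypothetical output path:
# with open("results/annotation.gff3") as handle:
#     print(count_feature_types(handle))
```

Unexpected counts, such as genes without any exons, are a cheap first signal that a run needs a closer look.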


Examples

(A small example or test dataset helps demonstrate pipeline execution. If one is available, it would typically be laid out as follows:)

  • example/ — folder with toy genome + reads, config, and expected outputs

  • Usage:

    cd example
    snakemake -j 4
  • Compare output GFF3 with expected reference.

If you don’t have an example, you could add one in future to help users.
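
Comparing an output GFF3 with an expected reference can be sketched as a set comparison over feature coordinates. This is an assumption about how such a check might be done, not a tool shipped with the repository; the function names are hypothetical. Source and attribute columns are deliberately ignored, so features that differ only in their IDs still count as equal.

```python
from typing import Iterable, Set, Tuple

Feature = Tuple[str, str, int, int, str]  # seqid, type, start, end, strand


def feature_set(lines: Iterable[str]) -> Set[Feature]:
    """Collect the coordinates of every feature line in a GFF3 file."""
    features = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip comments, directives and anything without nine columns.
        if line.startswith("#") or len(fields) != 9:
            continue
        seqid, _source, ftype, start, end, _score, strand, _phase, _attrs = fields
        features.add((seqid, ftype, int(start), int(end), strand))
    return features


def compare_gff3(observed: Iterable[str], expected: Iterable[str]):
    """Return (missing, unexpected) feature sets relative to the expected file."""
    obs, exp = feature_set(observed), feature_set(expected)
    return exp - obs, obs - exp
```

Both returned sets are empty when the observed annotation matches the expected one at the coordinate level.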


License & Citations

State your license (e.g. MIT, GPL, etc.) here. Also include citations to relevant tools or papers used in this pipeline.

MIT License
(c) 2025 Pablo A. Oti (or your name)

Please cite this repository as:

Oti, P. (2025). LR_annotation: Long-Read Guided Structural Annotation Pipeline. GitHub. https://github.com/pabloati/LR_annotation


Contact / Support

For questions or issues, open an Issue on GitHub. You can also reach me at: your_email@example.com.


Future Improvements

You might want to add:

  • A Docker or Singularity container for reproducibility
  • Automated tests / CI
  • Support for additional evidence types
  • Visualization modules
  • More extensive examples & documentation
