# Long-Read Evidence-Driven Structural Annotation Pipeline
A Snakemake workflow to produce structural genome annotations leveraging long-read sequencing data.
## Contents

- Overview
- Features
- Requirements
- Installation
- Usage
- Configuration
- Pipeline Workflow
- Scripts & Rules
- Output
- Examples
- License & Citations
- Contact / Support
## Overview

This repository implements a Snakemake pipeline (with auxiliary scripts) to generate structural genome annotations guided by long-read sequencing data (e.g. PacBio, Oxford Nanopore). It aims to produce high-quality annotations by combining transcript evidence from long reads with conventional annotation strategies. The main structure of the pipeline and the use of long-read transcriptomics are derived from this paper.
## Features

- Modular pipeline built with Snakemake
- Integration of long-read data to inform exon/intron boundaries
- Flexible configuration for different organisms & datasets
- Support for cluster execution (e.g. SLURM)
- Scripts to assist in annotation processing and QC
## Requirements

- Snakemake (version >= X.X)
- Python (>= 3.8) + dependencies
- Linux / Unix environment
- Aligned long-read RNA (or cDNA) sequencing data (BAM or SAM)
- Reference genome (FASTA)
- (Optional) Annotation hints / protein / transcript evidence

You'll find an `envs/` folder with environment / dependency configurations.
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/pabloati/LR_annotation.git
   cd LR_annotation
   ```

2. Create and activate a conda / mamba environment (if using):

   ```bash
   conda env create -f envs/env.yaml
   conda activate <env_name>
   ```

   (Multiple environments may be defined under `envs/`; inspect the folder and choose the appropriate one.)

3. Install any extra Python packages not handled by the environment file:

   ```bash
   pip install -r requirements.txt
   ```

   (If `requirements.txt` does not exist, you can generate one from the environment.)
## Usage

The main script to run the pipeline is `SQANTI_evidence`.
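How exactly the workflow is launched depends on your environment; the sketch below assumes the standard Snakemake entry points named in this README (the profile directory name and core counts are placeholders to adapt to your setup):

```bash
# Sketch; adjust paths, profile, and core count to your setup.
# Dry-run first to check the execution plan:
snakemake --configfile config.yaml -n

# Local execution with 8 cores:
snakemake --configfile config.yaml -j 8

# Cluster execution, assuming the SLURM settings are wrapped in a
# Snakemake profile (see profile_slurm.yaml):
snakemake --configfile config.yaml --profile profile_slurm
```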
## Configuration

The behaviour of the pipeline is controlled via:

- `config.yaml`: main configuration file (genome paths, sample IDs, parameters)
- `profile_slurm.yaml`: parameters and settings for SLURM (if using a cluster)

Edit `config.yaml` to point to your reference genome, aligned reads, and other evidence files.
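The exact schema is defined by the pipeline, so treat the following as a hypothetical sketch of the kinds of keys `config.yaml` typically holds in Snakemake workflows (every key name here is illustrative; check the shipped `config.yaml` for the real schema):

```yaml
# Hypothetical config.yaml sketch -- key names are illustrative only.
genome: /path/to/reference.fasta
samples:
  sample1: /path/to/sample1.aligned.bam
  sample2: /path/to/sample2.aligned.bam
evidence:
  proteins: /path/to/proteins.fasta   # optional
outdir: results/
threads: 8
```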
## Pipeline Workflow

Rough outline of the major steps / rules (in `rules/`):
- Preprocessing of reads / alignments
- Transcript feature extraction
- Long-read informed exon/intron boundary refinement
- Evidence merging with other annotation sources
- Final structural annotation (e.g. GFF3 output)
- QC and filtering steps
Refer to the individual rule files in rules/ for detailed logic.
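Snakemake's built-in introspection flags can help when exploring the workflow; a sketch of standard Snakemake usage (rule names are placeholders, and the graph command requires Graphviz):

```bash
# Dry-run: list the jobs that would execute, without running anything.
snakemake -n

# Render the rule dependency graph (requires Graphviz's `dot`).
snakemake --rulegraph | dot -Tpng > rulegraph.png

# Run the workflow only up to a given rule (see rules/ for names).
snakemake -j 4 --until <rule_name>
```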
## Scripts & Rules

- `scripts/`: utility scripts used by the workflow (e.g. parsing, filtering)
- `snakefile`: main workflow entry point
- `rules/`: sub-rules modularizing the steps
- `lr_annot.py`: core Python module / driver (if used in the pipeline)

You can read through them to see custom parameters, function calls, and expected behavior.
## Output

Typical outputs include:
- GFF3 / GTF annotated structural models
- Transcript / exon / intron files
- QC reports
- Intermediate alignment / feature files
Output paths and filenames are configurable via `config.yaml`.
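A quick sanity check on any GFF3 output is to tally feature types (column 3). The snippet below builds a three-line toy GFF3 so it runs stand-alone; point the same command at your real output file instead.

```shell
# Build a tiny toy GFF3 (tab-separated) so the example is self-contained.
{
  echo '##gff-version 3'
  printf 'chr1\tLR_annotation\tgene\t100\t900\t.\t+\t.\tID=gene1\n'
  printf 'chr1\tLR_annotation\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n'
  printf 'chr1\tLR_annotation\texon\t100\t300\t.\t+\t.\tParent=mrna1\n'
} > toy.gff3

# Tally feature types: skip header lines, take column 3, count occurrences.
grep -v '^#' toy.gff3 | cut -f3 | sort | uniq -c
```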
## Examples

(You may want to include a small example or test dataset to demonstrate pipeline execution. If you have one, mention it here, e.g.:)

- `example/`: folder with a toy genome + reads, config, and expected outputs
- Usage:

  ```bash
  cd example
  snakemake -j 4
  ```

- Compare the output GFF3 with the expected reference.

If you don't have an example yet, adding one in the future will help users get started.
## License & Citations

This project is released under the MIT License.

(c) 2025 Pablo A. Oti (or your name)

Please also cite the relevant tools and papers used in this pipeline, and cite this repository as:

> Oti, P. (2025). LR_annotation: Long-Read Guided Structural Annotation Pipeline. GitHub. https://github.com/pabloati/LR_annotation
## Contact / Support

For questions or issues, open an Issue on GitHub. You can also reach me at: your_email@example.com.
You might want to add:
- A Docker or Singularity container for reproducibility
- Automated tests / CI
- Support for additional evidence types
- Visualization modules
- More extensive examples & documentation