A simple Python and BWA MEM-based aligner with improved read pairing.
polyalign aims to generate sam-format alignments which can be used for high-accuracy polishing of repetitive sequences.
It is heavily inspired by polypolish align from Polypolish, but aims to reduce memory usage. This allows its use on eukaryotic genomes.
PolyAlign requires python, and is easiest to install using pip and git. It also requires bwa to be installed and available to the system command line.
On Linux, install dependencies using your normal package manager, for example:
sudo apt update
sudo apt install bwa python3 python3-pip git
Install PolyAlign using pip and git:
pip install git+https://github.com/zephyris/polyalign
To reinstall and upgrade, use pip and git:
pip install --upgrade --force-reinstall git+https://github.com/zephyris/polyalign
To uninstall, use pip:
pip uninstall polyalign
You may want to set up PolyAlign in a Python virtual environment, like conda. Please see the Conda documentation on how to install conda.
Dependencies can be installed in a conda environment by running:
conda install git pip
conda install -c bioconda bwa
PolyAlign can be installed using pip and git as above:
pip install git+https://github.com/zephyris/polyalign
Usage:
python3 -m polyalign [filtered|filteredsplit|paired] <reference.fasta> <reads_1.fastq> <reads_2.fastq> <output_basename>
Reads have to be in fastq format; fasta is not supported. In filtered mode, the output is two sam files, <output_basename>_1.sam and <output_basename>_2.sam.
In filteredsplit mode, the output is two sam files per sequence in the input <reference.fasta> file, output to the directories <output_basename>_1 and <output_basename>_2.
In paired mode, the output is one sam file, <output_basename>.sam. In paired mode, <output_basename> of - will give sam output on stdout.
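The output naming described above can be summarised in a short Python sketch. This is an illustration only, not part of polyalign: the function name `expected_outputs` and the `sequence_names` argument are hypothetical, and the per-sequence file names in filteredsplit mode (`<sequence>_1.sam`, `<sequence>_2.sam`) are an assumption taken from the polishing example at the end of this README.

```python
def expected_outputs(mode, output_basename, sequence_names=()):
    """Return the sam paths this README describes for each mode (hypothetical helper)."""
    if mode == "filtered":
        # Two sam files, one per read file.
        return [f"{output_basename}_1.sam", f"{output_basename}_2.sam"]
    if mode == "filteredsplit":
        # Two sam files per reference sequence, in two directories.
        # ASSUMPTION: files are named <sequence>_1.sam and <sequence>_2.sam.
        return [
            f"{output_basename}_{n}/{name}_{n}.sam"
            for name in sequence_names
            for n in (1, 2)
        ]
    if mode == "paired":
        # An <output_basename> of "-" means sam output goes to stdout, so no file.
        return [] if output_basename == "-" else [f"{output_basename}.sam"]
    raise ValueError(f"unknown mode: {mode}")
```

For example, `expected_outputs("filteredsplit", "out", ["chr1"])` lists one file in `out_1` and one in `out_2`.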
polyalign first samples a subset of reads from the start of both fastq files. Read pairs in this subset that align as a unique pair are used to identify the orientation of read pairs and the typical insert size.
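As a rough illustration of this sampling step, the sketch below takes the most common orientation among uniquely aligned pairs and an insert size range from percentiles. It is not polyalign's actual code: the function names, the `(orientation, insert_size)` tuple layout, and the percentile cut-offs are all assumptions, since this README does not state how "typical" is defined.

```python
from collections import Counter


def percentile(sorted_values, fraction):
    """Nearest-rank style percentile of an already sorted list (fraction in 0..1)."""
    index = round(fraction * (len(sorted_values) - 1))
    return sorted_values[index]


def summarise_unique_pairs(unique_pairs, low=0.01, high=0.99):
    """Estimate orientation and insert size range from uniquely aligned pairs.

    unique_pairs: list of (orientation, insert_size) tuples, e.g. ("fr", 350).
    ASSUMPTION: the low/high cut-offs are illustrative, not polyalign's values.
    """
    if not unique_pairs:
        raise ValueError("no uniquely aligned pairs in the sample")
    # The most frequent orientation is taken as the library orientation.
    orientation, _ = Counter(o for o, _ in unique_pairs).most_common(1)[0]
    # Only pairs with that orientation contribute to the insert size range.
    sizes = sorted(size for o, size in unique_pairs if o == orientation)
    return orientation, percentile(sizes, low), percentile(sizes, high)
```

Using percentiles rather than the minimum and maximum keeps a few chimeric or mis-aligned pairs from stretching the accepted range.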
Next, it aligns the entire set of reads. Read pairs where one is not aligned and one is aligned to a single place are retained. Read pairs where both are aligned to a single place are retained.
Read alignments to multiple places are retained if a pair can be formed which gives a typical insert size and correct orientation. For filtered and filteredsplit outputs, all such alignments are retained. For paired, one good pairing is randomly selected as the output.
In filtered mode, the filtering behaviour broadly matches Polypolish polypolish filter. The resulting output can be used for subsequent polishing using polypolish polish.
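The retention rules above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not polyalign's implementation: the `Alignment` tuple, the function names, the inward-facing ("fr") orientation check, and the insert size definition are all hypothetical. Cases this README does not describe (for example, one read unaligned and its mate aligned to several places) simply yield nothing in this sketch.

```python
import random
from collections import namedtuple
from itertools import product

# Hypothetical minimal alignment record: contig name, start, end, strand.
Alignment = namedtuple("Alignment", "contig start end reverse")


def is_good_pair(a, b, min_insert, max_insert):
    """True if a and b face inwards ("fr") with an insert size in range."""
    if a.contig != b.contig or a.reverse == b.reverse:
        return False
    forward, backward = (b, a) if a.reverse else (a, b)
    if forward.start > backward.start:
        return False
    insert = backward.end - forward.start  # ASSUMPTION: outer distance
    return min_insert <= insert <= max_insert


def good_pairs(aligns_1, aligns_2, min_insert, max_insert):
    """All pairings of the two reads' alignments that pass is_good_pair."""
    return [(a, b) for a, b in product(aligns_1, aligns_2)
            if is_good_pair(a, b, min_insert, max_insert)]


def retained(aligns_1, aligns_2, min_insert, max_insert):
    """Alignments kept for each read, as in filtered and filteredsplit modes."""
    # Zero or one alignment per read: keep whatever is there.
    if len(aligns_1) <= 1 and len(aligns_2) <= 1:
        return list(aligns_1), list(aligns_2)
    # Otherwise keep every alignment that takes part in a good pairing.
    good = good_pairs(aligns_1, aligns_2, min_insert, max_insert)
    good_1 = {a for a, _ in good}
    good_2 = {b for _, b in good}
    return ([a for a in aligns_1 if a in good_1],
            [b for b in aligns_2 if b in good_2])


def pick_pair(aligns_1, aligns_2, min_insert, max_insert, rng=random):
    """One randomly chosen good pairing, as in paired mode (None if none)."""
    good = good_pairs(aligns_1, aligns_2, min_insert, max_insert)
    return rng.choice(good) if good else None
```

Keeping every alignment that takes part in a good pairing, rather than only one, lets a polisher see all plausible placements of a read in a repeat.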
In filteredsplit mode, each individual sam file can be used for polypolish polish against the appropriate reference sequence.
In paired mode, this outputs a sam file similar to normal bwa mem paired alignments, and can be used for general subsequent analyses.
You can use polyalign in your Python scripts, however the API is subject to change. The Polyalign class carries out high-level operations, outputting using the Output class.
BwaMem runs bwa mem alignments. Alignment and AlignmentClass are used to parse alignments and alignment pairs.
This module was written to optimise polishing of small eukaryotic genomes assembled from noisy Nanopore data. The Polypolish strategy for polishing repetitive sequences is very promising, but designed for small (bacterial) genomes. Polyalign allows application of the same method to larger, e.g. eukaryotic, genomes without requiring enormous computational resources.
For example:
python3 -m polyalign filteredsplit genome.fasta illumina1.fq illumina2.fq outputname
python3 -m polyalign splitfasta genome.fasta outputname
for fasta in outputname/*.fasta; do
  filename=$(basename "$fasta")
  filename="${filename%.*}"
  polypolish polish "$fasta" "outputname_1/${filename}_1.sam" "outputname_2/${filename}_2.sam" >> polishedgenome.fasta
done
I haven't ultimately used it for a published genome assembly, but I've made it available in case it is useful. Please send me a message and cite this GitHub repository if you are publishing anything using this as a tool. Please also cite Polypolish, as this is very closely modeled on that work.