Fast, accurate construction of multiple sequence alignments from protein language embeddings
Multiple sequence alignment (MSA) is a foundational task in computational biology, underpinning protein structure prediction, evolutionary analysis, and domain annotation. Traditional MSA algorithms rely on pairwise amino acid substitution matrices derived from conserved protein families. While effective for aligning closely related sequences, these scoring schemes struggle in the low-identity "twilight zone."
Here, we present a new approach for constructing MSAs that leverages amino acid embeddings generated by protein language models (PLMs), which capture rich evolutionary and contextual information from massive and diverse sequence datasets. We introduce a windowed, reciprocal-weighted embedding similarity metric that is surprisingly effective at identifying corresponding amino acids across sequences.
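To make the metric concrete, the sketch below shows one plausible form of a windowed, reciprocal-weighted similarity between two per-residue embedding matrices. The diagonal window averaging, the softmax-based reciprocal weighting, and the parameter values (`window=5`, `beta=200.0`, echoing the CLI defaults) are illustrative assumptions; the exact formulation used by ARIES may differ.

```python
import numpy as np

def windowed_similarity(E1: np.ndarray, E2: np.ndarray, window: int = 5) -> np.ndarray:
    """Pairwise similarity between per-residue embeddings, averaged along the
    diagonal over a local window, so a residue pair scores well only if its
    neighbors also match. E1: (L1, d), E2: (L2, d). Returns (L1, L2)."""
    # Negative pairwise L2 distance: larger = more similar.
    sim = -np.linalg.norm(E1[:, None, :] - E2[None, :, :], axis=-1)
    half = window // 2
    L1, L2 = sim.shape
    out = np.zeros_like(sim)
    for i in range(L1):
        for j in range(L2):
            vals = [sim[i + k, j + k]
                    for k in range(-half, half + 1)
                    if 0 <= i + k < L1 and 0 <= j + k < L2]
            out[i, j] = float(np.mean(vals))
    return out

def reciprocal_weight(sim: np.ndarray, beta: float = 200.0) -> np.ndarray:
    """One natural reading of 'reciprocal' weighting: the product of a row-wise
    and a column-wise softmax, which peaks where two residues pick each other
    as (soft) best matches."""
    def softmax(x, axis):
        z = np.exp(beta * (x - x.max(axis=axis, keepdims=True)))
        return z / z.sum(axis=axis, keepdims=True)
    return softmax(sim, axis=1) * softmax(sim, axis=0)
```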
Building on this metric, we develop ARIES (Alignment via RecIprocal Embedding Similarity), an algorithm that constructs a PLM-generated template embedding and aligns each sequence to this template via dynamic time warping (DTW) to build a global MSA. Across diverse benchmark datasets, ARIES achieves significantly higher accuracy than existing state-of-the-art approaches, especially in low-identity regimes where traditional methods degrade, while scaling almost linearly with the number of sequences to be aligned.
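As a rough illustration of the alignment step, here is a minimal textbook DTW routine over a precomputed sequence-vs-template cost matrix (for example, embedding distances, or a negated similarity from the sketch above). It is a simplified stand-in, not ARIES's actual implementation, which may differ in gap handling or band constraints.

```python
import numpy as np

def dtw_path(cost: np.ndarray) -> list[tuple[int, int]]:
    """Classic dynamic time warping over an (L_seq, L_template) cost matrix.
    Returns the minimum-cost warping path as (seq_pos, template_pos) pairs."""
    L1, L2 = cost.shape
    dp = np.full((L1, L2), np.inf)
    dp[0, 0] = cost[0, 0]
    for i in range(L1):
        for j in range(L2):
            if i == 0 and j == 0:
                continue
            prev = min(
                dp[i - 1, j - 1] if i > 0 and j > 0 else np.inf,  # match step
                dp[i - 1, j] if i > 0 else np.inf,                # advance sequence only
                dp[i, j - 1] if j > 0 else np.inf,                # advance template only
            )
            dp[i, j] = cost[i, j] + prev
    # Trace back the optimal path from the bottom-right corner.
    path, i, j = [(L1 - 1, L2 - 1)], L1 - 1, L2 - 1
    while i > 0 or j > 0:
        candidates = []
        if i > 0 and j > 0:
            candidates.append((dp[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((dp[i - 1, j], i - 1, j))
        if j > 0:
            candidates.append((dp[i, j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    return path[::-1]
```

Aligning every sequence against a single synthesized template, rather than against all O(n^2) sequence pairs, is presumably what underlies the near-linear scaling reported above.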
Together, these results provide the first large-scale demonstration of the power of PLMs for accurate and scalable MSA construction across protein families of varying sizes and levels of similarity, highlighting the potential of PLMs to transform comparative sequence analysis.
```bash
git clone https://github.com/Singh-Lab/ARIES.git
cd ARIES
conda env create -f environment.yml
conda activate ARIES
pip install .
```
```bash
# Run on a built-in benchmark dataset
aries --input BAliBASE --output-dir ./outputs/BAliBASE
aries --input HOMSTRAD --output-dir ./outputs/HOMSTRAD
aries --input QuanTest2 --output-dir ./outputs/QuanTest2

# Run on a custom folder of FASTA files
aries --input /path/to/fastas --output-dir ./path/to/outputs

# Provide reference alignments to enable scoring
aries --input /path/to/fastas --ref-dir /path/to/refs --output-dir ./tmp/out
```
Notes:
- Input files must be FASTA (`.fasta`).
- Reference files must be FASTA (`.aln`, `.fasta`, or `.fa`).
- Reference filenames must match the corresponding input filename stem to enable scoring (see the example layout below).
- Required arguments: `--input` (or `-i`) and `--output-dir` (or `-o`).
- Run `aries -h` to see all options and defaults.
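For instance, a layout like the following would let ARIES pair each input with its reference (the family names here are hypothetical):

```text
fastas/
  familyA.fasta
  familyB.fasta
refs/
  familyA.aln      # stem "familyA" matches fastas/familyA.fasta
  familyB.fasta    # references may also use .fasta or .fa
```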
```text
usage: aries -i INPUT -o OUTPUT_DIR [--ref-dir REF_DIR]
             [--compare {clustalo,clustalw}]
             [--plm PLM] [--num-hidden-states NUM_HIDDEN_STATES]
             [-w WINDOW] [-r RECIPROCAL] [--batch BATCH]
             [--blur BLUR] [--pad-char PAD_CHAR]
             [--medoid-topk MEDOID_TOPK] [--sim-metric SIM_METRIC]
             [--maxlen MAXLEN] [--device DEVICE] [--seed SEED]

  --input, -i          Dataset name (BAliBASE, HOMSTRAD, QuanTest2) or an input FASTA folder.
  --output-dir, -o     Directory to write ARIES alignments (FASTA). Created if missing.
  --ref-dir            Optional reference alignment directory (enables scoring).
  --compare            Run comparison aligners in addition to ARIES: clustalo and/or clustalw.
  --plm                PLM name (esm2-35M, esm2-150M, esm2-650M, protbert, prottrans,
                       prottrans-half, or a Hugging Face model name).
  --num-hidden-states  Number of hidden states to concatenate. Default: 9
  --window, -w         Context window size for similarity. Default: 5
  --reciprocal, -r     Reciprocal weighting for similarity. Default: 200.0
  --batch              PLM batch size. Default: 32
  --blur               Gaussian blur sigma for similarity. Default: 3.0
  --pad-char           Padding character (default: X). Pass '!' to use the tokenizer's
                       native pad token.
  --medoid-topk        Medoid top-k selection for template synthesis: 'log' (ceil(log2(n))),
                       'logn' (ceil(log(n))), or a positive integer k. Default: logn
  --sim-metric         Similarity metric (l2-gm, l2, cosine, etc.). Default: l2-gm
  --maxlen             Max sequence length to include from dataset. Default: 1022
  --device             Device for PLM/ARIES (e.g., cuda or cpu). Default: cuda
  --seed               Random seed. Default: 123
```
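For clarity, here is how the `--medoid-topk` modes resolve to a concrete k. This hypothetical helper simply mirrors the formulas in the help text above; it is not ARIES's internal code.

```python
import math

def resolve_medoid_topk(n_sequences: int, mode: str = "logn") -> int:
    """Map a --medoid-topk setting to a concrete k (illustrative only).
    'log' -> ceil(log2(n)), 'logn' -> ceil(ln(n)), otherwise a positive integer."""
    if mode == "log":
        return max(1, math.ceil(math.log2(n_sequences)))
    if mode == "logn":
        return max(1, math.ceil(math.log(n_sequences)))
    k = int(mode)
    if k <= 0:
        raise ValueError("--medoid-topk must be 'log', 'logn', or a positive integer")
    return k

# e.g., resolve_medoid_topk(1000, "logn") == 7, resolve_medoid_topk(1000, "log") == 10
```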
