Skip to content

Unexpected liftover behavior when lifting PacBio long genomic reads between two haplotypes #87

@minkinaa

Description

@minkinaa

Hi Nae-Chyun,

I have been using leviosam2 to perform liftover between two haplotypes in the GM12878 cell line using PacBio long read data, and I have run into a few cases of unexpected liftover behavior I'll describe below. Though by and large liftover appears to happen correctly, some liftover events are not entirely consistent with the underlying chain file with various outcomes, ranging from small local misalignment to those impacting an entire read or set of reads at a particular location in the genome. I am unsure whether these are due to the same underlying issue. The optional realignment module unfortunately is not a great solution to these problems and I'll explain why below as well.

For the purpose of these examples, we are lifting reads which natively map to haplotype 2 (I'll call this the native haplotype) to haplotype 1's coordinates (I'll call this "target" haplotype). No realignment was performed. I did not run the full workflow -- just 'leviosam index' followed by 'leviosam lift'.

I have included the parameters & test files which can be used to recreate each issue (files linked at the bottom of post).

(1) Indels are not lining up after liftover

Image

In the top panel in the example above, the top track contains lifted over reads (hap2 to hap1) & the bottom track shows the genomic alignment of hap2 onto hap1 from which the chain file was made. We can see that the chain contains a 3 & 291 base insertion within this region in hap2 relative to hap1. We see these insertions in the reads as well, but their position does not line up with the chain. This behavior seems unexpected -- shouldn't the liftover follow the chain exactly? (As a sanity check, the bottom panel shows these same reads aligned to their native genome, with the inverted chain at the bottom showing where the 291bp indel lies. There is no obvious variation in the native alignment in these reads -- it seems to be introduced in the liftover step).

Read issue 1 files & params for testing:
chain: hap2-to-hap1coords.inverted.chain
fasta: hap1.fa
fasta index: hap1.fa.fai
input_bam: reads_issue_1.bam
-G 100000
-f $fasta_path
-m
-S mapq:0
-T 1000
-S aln_score:1000

(2) Liftover appears unstable at homopolymer regions

Image

Homopolymers consistently seem to cause liftover issues as shown in the top panel. Bases adjacent to the homopolymer are shifted in a subset of reads, but the misalignment often resolves with an indel at seemingly a random position. Interestingly, this behavior at homopolymers seems read-dependent, so if a longer misalignment happens at homopolymer X in read 1, we observe longer misalignments at other homopolymers in that read. See the boxed region of the bottom panel for examples -- for an individual read, the misalignments at the four homopolymer regions shown are all of consistent length.

Read issue 2 files & params for testing:
chain: hap2-to-hap1coords.inverted.chain
fasta: hap1.fa
fasta index: hap1.fa.fai
input_bam: reads_issue_2.bam
-G 100000
-f $fasta_path -m
-S mapq:0 -T 1000
-S aln_score:1000

(3) Entire reads get mis-lifted when their start position falls within a chain indel

Image

There is an issue that arises when a read begins in a location in hap2 (native haplotype) which does not exist in hap1, either appearing as an insertion in the chain file (bottom panel), or a break between two parts of the chain (top panel). Unsurprisingly, if we look at the distribution of these reads across a single chromosome (bottom right), we observe peaks corresponding to these indel/chain break events. These reads are lifted over to the generally correct location (see top right inset), but don't align perfectly to the reference.

Depending on the chain used, thus far I've seen this to impact .1-.3% of reads, except in cases like the one below.

Read issue 3 files & params for testing:
chain: hap2-to-hap1coords.inverted.chain
fasta: hap1.fa
fasta index: hap1.fa.fai
input_bam: reads_issue_3.bam
-G 100000
-f $fasta_path -m
-S mapq:0 -T 1000
-S aln_score:1000

(4) Gross mis-lifting impacting all reads on one chromosome arm

Image

Shown above is an example where almost all the reads on the long arm of chromosome 6 were lifted over improperly. I suspect that something about the complex chain structure at the centromere is causing every read that lifts over to a location beyond it to be mis-lifted. It does not appear to be a simple off by X error, as the reads are not all shifted the same number of bases from their proper alignment (though they tend to be shifted just a small number of bases from their correct location for the most part). This does seem to be an artifact specific to this chain. For example, we can lift hap2 reads to the hap1 coordinate system using a chain which was constructed by aligning the hap2 assembly to the hap1 assembly (and then inverting, chain 2 above). Or, we can instead align the hap1 assembly to hap2 (chain 1 above). In this case, chain 2 produced good results across chromosome 6, while only the short arm of chr6 has accurate liftover events with chain 1. I have verified that the chain files themselves are accurate.

Read issue 4 files & params for testing:
chain: hap1-to-hap2coords.chain (the liftover works ok for me when run with the other chain I provided here)
fasta: hap1.fa
fasta index: hap1.fa.fai
input_bam: reads_issue_4_good_read.bam (reads on the short arm of chr6 which lift over ok); reads_issue_4_bad_read.bam (reads on the long arm of chr6 which are being mislifted)
-G 100000
-f $fasta_path -m
-S mapq:0 -T 1000
-S aln_score:1000

Why adding the optional realignment module does not work

In order for reads to get realigned via the -x parameter, we have to set a parameter which would funnel these problematic reads into the deferred category. I did this by setting the max NM tag value to 200. While a subset of the reads were realigned as expected, this introduced a new issue when lifting over from one haplotype to another, where we expect some variation between the haplotypes. For example, if one haplotype contains a 300bp insertion relative to the other and a read overlaps this insertion, this read will be funneled into the deferred category. Thus, using this parameter with long read data, we lose a lot of perfectly fine reads.

Parameters used for leviosam2 lift with realignment:
-f ${fasta}
-G 100000
-m
-T 1000
-S mapq:0
-S aln_score:1000
-S hdist:200
-x config/pacbio_all.yaml

Referenced input files:
Remove the '.txt' extension from these to run. These were added to get around github format constraints.

  • the .fa file is too large to upload here but I can provide via upon request. It is not actually necessary for reproducing the errors above

reads_issue_4_good_read.bam.txt
reads_issue_4_good_read.bam.bai.txt
reads_issue_4_bad_read.bam.txt
reads_issue_4_bad_read.bam.bai.txt
reads_issue_3.bam.txt
reads_issue_3.bam.bai.txt
reads_issue_2_homopolymer.bam.txt
reads_issue_2_homopolymer.bam.bai.txt
reads_issue_1.bam.txt
reads_issue_1.bam.bai.txt
hap2-to-hap1coords.inverted.chain.txt
hap1.fa.fai.txt
hap1-to-hap2coords.chain.txt

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions