Skip to content

vamos output format and length errors #26

@hdashnow

Description

@hdashnow

Hi @mchaisso,

Sorry, this is a long one.
tldr;

  • Can you please release a version with this off-by-1 error fixed?
  • Can you please generate VCF format with REF and ALT sequences like below to allow us to use the sequence in downstream analyses?

As you know, we've been benchmarking vamos alongside other TR callers. We were getting startlingly low exact length concordance between vamos on reads and the assembly allele (20%). However, we did a little digging and were able to tweak the vamos output to produce ~90 % length concordance. To do so, I had to make some assumptions and account for an off-by-one error. Obviously, I would much prefer not to be messing with the output like this.

Here is what we did:
vamos --read -S -b ${aln} -r ${vamos_tr_regions} -s ${sample} -o ${sample}.vamos.vcf -t 16
vamos_tr_regions is a random subset of ~40k loci using the efficient motif set.
The -S flag to add alleles to the output VCF was critical to allow us to identify alleles that are identical to the reference. However, it produces a non-standard VCF format where the REF sequence is N and the ALTs are in the sample field.

I wrote a script to fix the VCF:

  • Add REF alleles by extracting them from the ref fasta using the VCF POS - 1 and END (actually END - 1, see below)
  • Move ALT alleles from the sample column into the ALT column
  • Compare REF and ALT, and if they match, change the corresponding GT to 0 and remove from ALT.
  • Adjust the H1/H2 fields in info to match
    Fix vamos VCF dashnowlab/TR-Benchmarking#12

As soon as I did this, I realized that most loci differed by 1 bp at the end position between the REF and ALT. From other genotyping methods, we expect most loci to match the reference (i.e., be 0/0). Once I removed the last bp of every REF allele, most loci were 0/0.

So, since the original END coordinates match the input bed file I'm guessing there is 1 bp dropped from the ALT allele at the end coordinate. Given you mentioned an off-by-1 error previously, you may have already fixed this in the code base. But the fix doesn't appear to have been released yet.

Before:

grep -v "#" HG007.30x.haplotagged.vamos.vcf | cut -f 10 | cut -d ":" -f1 | sort | uniq -c

39846 1/1
2880 1/2

After:

grep -v "#" HG007.30x.haplotagged.vamos.fixed.vcf | cut -f 10 | sort | uniq -c

32753 0/0
1814 0/1
7093 1/1
1066 1/2

This now looks much more like I'd expect.

Here is an example locus from HG007

chr1	5944963	.	N	<VNTR>	.	PASS	END=5944980;RU=AAGAG;SVTYPE=STR;ALTANNO_H1=0-0-0-0;LEN_H1=4;ALTANNO_H2=0-0-0;LEN_H2=3;	GT	1/2:AAAGAGAAGAGAAGAGAAGAGA:AAAGAGAAGAGAAGAGA

chr1	5944963	.	AAAGAGAAGAGAAGAGA	AAAGAGAAGAGAAGAGAAGAGA	.	PASS	END=5944980;RU=AAGAG;SVTYPE=STR;ALTANNO_H1=0-0-0-0;LEN_H1=4;ALTANNO_H0=0-0-0;LEN_H0=3;	GT	0/1

You can see that 1/2 becomes 0/1 as after checking the sequences, it is an exact match for the reference.

cc'ing @avvaruakshay in case I missed anything

Thanks!

Warm regards,
Harriet

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions