-
Notifications
You must be signed in to change notification settings - Fork 5
Description
Hi @mchaisso,
Sorry, this is a long one.
tldr;
- Can you please release a version with this off-by-1 error fixed?
- Can you please generate VCF format with REF and ALT sequences like below to allow us to use the sequence in downstream analyses?
As you know, we've been benchmarking vamos alongside other TR callers. We were getting startlingly low exact length concordance between vamos on reads and the assembly allele (20%). However, we did a little digging and were able to tweak the vamos output to produce ~90 % length concordance. To do so, I had to make some assumptions and account for an off-by-one error. Obviously, I would much prefer not to be messing with the output like this.
Here is what we did:
vamos --read -S -b ${aln} -r ${vamos_tr_regions} -s ${sample} -o ${sample}.vamos.vcf -t 16
vamos_tr_regions is a random subset of ~40k loci using the efficient motif set.
The -S flag to add alleles to the output VCF was critical to allow us to identify alleles that are identical to the reference. However, it produces a non-standard VCF format where the REF sequence is N and the ALTs are in the sample field.
I wrote a script to fix the VCF:
- Add REF alleles by extracting them from the ref fasta using the VCF POS - 1 and END (actually END - 1, see below)
- Move ALT alleles from the sample column into the ALT column
- Compare REF and ALT, and if they match, change the corresponding GT to 0 and remove from ALT.
- Adjust the H1/H2 fields in info to match
Fix vamos VCF dashnowlab/TR-Benchmarking#12
As soon as I did this, I realized that most loci differed by 1 bp at the end position between the REF and ALT. From other genotyping methods, we expect most loci to match the reference (i.e., be 0/0). Once I removed the last bp of every REF allele, most loci were 0/0.
So, since the original END coordinates match the input bed file I'm guessing there is 1 bp dropped from the ALT allele at the end coordinate. Given you mentioned an off-by-1 error previously, you may have already fixed this in the code base. But the fix doesn't appear to have been released yet.
Before:
grep -v "#" HG007.30x.haplotagged.vamos.vcf | cut -f 10 | cut -d ":" -f1 | sort | uniq -c
39846 1/1
2880 1/2
After:
grep -v "#" HG007.30x.haplotagged.vamos.fixed.vcf | cut -f 10 | sort | uniq -c
32753 0/0
1814 0/1
7093 1/1
1066 1/2
This now looks much more like I'd expect.
Here is an example locus from HG007
chr1 5944963 . N <VNTR> . PASS END=5944980;RU=AAGAG;SVTYPE=STR;ALTANNO_H1=0-0-0-0;LEN_H1=4;ALTANNO_H2=0-0-0;LEN_H2=3; GT 1/2:AAAGAGAAGAGAAGAGAAGAGA:AAAGAGAAGAGAAGAGA
chr1 5944963 . AAAGAGAAGAGAAGAGA AAAGAGAAGAGAAGAGAAGAGA . PASS END=5944980;RU=AAGAG;SVTYPE=STR;ALTANNO_H1=0-0-0-0;LEN_H1=4;ALTANNO_H0=0-0-0;LEN_H0=3; GT 0/1
You can see that 1/2 becomes 0/1 as after checking the sequences, it is an exact match for the reference.
cc'ing @avvaruakshay in case I missed anything
Thanks!
Warm regards,
Harriet