-
Notifications
You must be signed in to change notification settings - Fork 0
Vcf files
Back to Update Mouse Genome Details
For instructions, see Install vcftools
If at any step you receive an error like this:
the index file either does not exist or is older than the vcf file. Please reindex.
If so, re-index by typing
tabix -p vcf my_file.vcf.gz
/Users/clemensl/vcfTools/vcftools_0.1.9/bin/vcf-subset -c C57BL,DBA 20111102-snps-all.vcf.gz > snps-two.vcf
The perl script is called evaluateSanger.pl. The script writes out a file with SNPs for which ATG (Above Threshold Genotype) is either 0 or 1 for both C57BL and DBA and satisfies the conditions in the following table.
| C57BL | DBA | Keep? |
| 1 | 0 | yes |
| 1 | 1 | yes |
| 0 | 1 | yes |
| 0 | 0 | no |
| Abbreviation | Definition |
| CH | Chromosome |
| POS | Position |
| ID | Identifier |
| REF | Reference |
| ALT | Alternate |
| Q | Quality |
| F | Filter |
| INFO | Information |
| AC | Allele count in genotypes |
| AN | Total number of alleles in called genotypes |
| GT | Genotype |
| ATG | Above Threshold Genotype |
| MQ | Mapping Quality |
| HCG | High Confidence Genotype |
| GQ | Genotype Quality |
| DP | Read Depth |
Here is what a couple of lines in the vcf file look like:
| CH | POS | ID | REF | ALT | Q | F | INFO | FORMAT | C57BL | DBA |
| 1 | 3000054 | . | C | T | -1 | . | AC=0;AN=4 | GT:ATG:MQ:HCG:GQ:DP | 0/0:0:60:-1:61:8 | 0/0:0:59:-1:88:17 |
| 1 | 3000093 | . | T | C | -1 | . | AC=0;AN=4 | GT:ATG:MQ:HCG:GQ:DP | 0/0:0:57:-1:79:14 | 0/0:0:59:-1:103:22 |
Possible values for ATG:
| Value | Meaning |
| 1 | SNP for genotypes above threshold |
| 0 | Wildtype Reference for genotypes above threshold |
| -8 | missing |
| -7 | heterozygous |
| -6 | SNP with mapping or genotype quality less than 40 |
| -5 | SNP with read depth greater than 150 |
| -4 | SNP HCG but within 5 bp of a short period [7] simple repeat |
| -3 | SNP HCG but within 15 bp of an indel |
| -2 | SNP not HCG |
| -1 | Wildtype with mapping / genotype quality less than 40 or read Depth greater than 150 |
Possible values for HCG:
| Value | Meaning |
| 1 | yes for nonref alleles |
| 0 | no for nonref alleles |
| -1 | for ref homs |
Some of this information is not required. I wrote notes as I was experimenting. I left the notes here in case they will prove of use in the future.
To get help on tabix, type
tabix
You will see:
Program: tabix (TAB-delimited file InderXer)
Version: 0.2.5 (r1005)
Usage: tabix <in.tab.bgz> [region1 [region2 [...]]]
Options: -p STR preset: gff, bed, sam, vcf, psltbl [gff]
-s INT sequence name column [1]
-b INT start column [4]
-e INT end column; can be identical to '-b' [5]
-S INT skip first INT lines [0]
-c CHAR symbol for comment/meta lines [#]
-r FILE replace the header with the content of FILE [null]
-B region1 is a BED file (entire file will be read)
-0 zero-based coordinate
-h print also the header lines
-H print only the header lines
-l list chromosome names
-f force to overwrite the index
To see the vcf file header, use the -H option.
tabix -H /Users/clemensl/SangerDownloads/snps/20111102-snps-all.vcf.gz ##fileformat=VCFv3.3 ##FORMAT=DP,1,Integer,"Read Depth" ##FORMAT=GQ,1,Integer,"Genotype Quality" ##FORMAT=GT,1,String,"Genotype" ##INFO=DP,1,Integer,"Total Depth" ##FORMAT=MQ,1,Integer,"Mapping Quality" ##FORMAT=HCG,1,Integer,"High Confidence Genotype. Can be 1(yes) or 0(no) for nonref alleles. It is always -1 for ref homs." ##FORMAT=ATG,1,Integer,"Above Threshold Genotype. Can be 1 (SNP) or 0 (Wildtype Reference) for genotypes above threshold, or -8 (missing), -7 (heterozygous), -6 (SNP with mapping or genotype quality < 40), -5 (SNP with read depth > 150), -4 (SNP HCG but within 5 bp of a short period [7] simple repeat), -3 (SNP HCG but within 15 bp of an indel), -2 (SNP not HCG), -1 (Wildtype with mapping / genotype quality < 40 or read Depth > 150)" ##INFO=AC,-1,Integer,"Allele count in genotypes" ##INFO=AN,1,Integer,"Total number of alleles in called genotypes" #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 129P2 129S1 129S5 AKR A_J BALB C3H C57BL CAST CBA DBA LP_J NOD NZO PWK SPRET WSB
You can see by using tabix that the files downloaded from Sanger are version 3.3.
Then I tried using tabix to write out some information from the file, but I got an error about the index file. (The index files end in vcf.gz.tbi)
tabix -h /Users/clemensl/SangerDownloads/snps/20111102-snps-all.vcf.gz 1:5000000-5000100 [tabix] the index file either does not exist or is older than the vcf file. Please reindex.
Then I re-indexed using this command:
tabix -p vcf 20111102-snps-all.vcf.gz
After this, the tabix -h command was successful:
tabix -h /Users/clemensl/SangerDownloads/snps/20111102-snps-all.vcf.gz 1:5000000-5000100 ##fileformat=VCFv3.3 ##FORMAT=DP,1,Integer,"Read Depth" ##FORMAT=GQ,1,Integer,"Genotype Quality" ##FORMAT=GT,1,String,"Genotype" ##INFO=DP,1,Integer,"Total Depth" ##FORMAT=MQ,1,Integer,"Mapping Quality" ##FORMAT=HCG,1,Integer,"High Confidence Genotype. Can be 1(yes) or 0(no) for nonref alleles. It is always -1 for ref homs." ##FORMAT=ATG,1,Integer,"Above Threshold Genotype. Can be 1 (SNP) or 0 (Wildtype Reference) for genotypes above threshold, or -8 (missing), -7 (heterozygous), -6 (SNP with mapping or genotype quality < 40), -5 (SNP with read depth > 150), -4 (SNP HCG but within 5 bp of a short period [7] simple repeat), -3 (SNP HCG but within 15 bp of an indel), -2 (SNP not HCG), -1 (Wildtype with mapping / genotype quality < 40 or read Depth > 150)" ##INFO=AC,-1,Integer,"Allele count in genotypes" ##INFO=AN,1,Integer,"Total number of alleles in called genotypes" #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 129P2 129S1 129S5 AKR A_J BALB C3H C57BL CAST CBA DBA LP_J NOD NZO PWK SPRET WSB 1 5000060 . C T -1 . AC=4;AN=34 GT:ATG:MQ:HCG:GQ:DP 0/0:0:60:-1:209:57 0/0:0:60:-1:97:20 0/0:0:59:-1:97:20 0/0:0:60:-1:185:49 0/0:0:60:-1:124:29 0/0:0:60:-1:91:18 0/0:0:60:-1:106:23 0/0:0:59:-1:127:30 0/0:0:59:-1:91:25 0/0:0:60:-1:71:21 1/1:1:59:1:133:32 0/0:0:59:-1:115:26 0/0:0:57:-1:130:31 0/0:0:60:-1:88:17 1/1:1:59:1:93:28 0/0:0:58:-1:121:28 0/0:0:60:-1:97:20 1 5000087 . A C -1 . AC=2;AN=34 GT:ATG:MQ:HCG:GQ:DP 0/0:0:60:-1:197:53 0/0:0:60:-1:85:16 0/0:0:60:-1:88:17 0/0:0:60:-1:200:54 0/0:0:60:-1:133:32 0/0:0:60:-1:91:18 0/0:0:60:-1:130:31 0/0:0:60:-1:90:29 0/0:0:60:-1:100:21 0/0:0:60:-1:115:26 0/0:0:60:-1:127:30 0/0:0:59:-1:130:32 0/0:0:59:-1:106:23 0/0:0:60:-1:91:18 0/0:0:59:-1:118:27 0/0:0:60:-1:133:32 1/1:1:60:1:82:15 1 5000100 . A G -1 . AC=2;AN=34 GT:ATG:MQ:HCG:GQ:DP 0/0:0:60:-1:203:55 0/0:0:60:-1:94:19 0/0:0:60:-1:97:20 0/0:0:60:-1:186:62 0/0:0:60:-1:130:31 0/0:0:60:-1:97:20 0/0:0:60:-1:97:28 0/0:0:60:-1:130:31 0/0:0:60:-1:100:21 0/0:0:60:-1:90:29 0/0:0:60:-1:139:34 0/0:0:60:-1:104:34 0/0:0:59:-1:103:22 0/0:0:59:-1:66:19 0/0:0:58:-1:112:25 1/1:1:60:1:148:37 0/0:0:60:-1:85:16
I first tried running vcftools using the following command which should write out everything on chromosome 19 to .ped and .map files in the plink format.
./vcftools --gzvcf /Users/clemensl/SangerDownloads/20111102-snps-all.vcf.gz --chr 19 --plink
I got this error:
Error:VCF version must be v4.0 or v4.1: You are using version VCFv3.3
So it seems the perl scripts run on earlier versions but the vcftools do not.
So I unarchived one of the vcf.gz files to create a vcf file, just by double clicking. Then I ran this:
cat /Users/clemensl/SangerDownloads/experiment/20111102-snps-all.vcf |./vcf-convert > /Users/clemensl/SangerDownloads/experiment/snps-all-new.vcf
This took hours to run, but it did finish.
Then zip and index again:
bgzip /Users/clemensl/SangerDownloads/experiment/snps-all-new.vcf tabix -p vcf /Users/clemensl/SangerDownloads/experiment/snps-all-new.vcf.gz
Now the version is 4.1:
tabix -H snps-all-new.vcf.gz ##fileformat=VCFv4.1 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ##FORMAT=<ID=MQ,Number=1,Type=Integer,Description="Mapping Quality"> ##FORMAT=<ID=HCG,Number=1,Type=Integer,Description="High Confidence Genotype. Can be 1(yes) or 0(no) for nonref alleles. It is always -1 for ref homs."> ##FORMAT=<ID=ATG,Number=1,Type=Integer,Description="Above Threshold Genotype. Can be 1 (SNP) or 0 (Wildtype Reference) for genotypes above threshold, or -8 (missing), -7 (heterozygous), -6 (SNP with mapping or genotype quality < 40), -5 (SNP with read depth > 150), -4 (SNP HCG but within 5 bp of a short period [7] simple repeat), -3 (SNP HCG but within 15 bp of an indel), -2 (SNP not HCG), -1 (Wildtype with mapping / genotype quality < 40 or read Depth > 150)"> ##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes"> ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 129P2 129S1 129S5 AKR A_J BALB C3H C57BL CAST CBA DBA LP_J NOD NZO PWK SPRET WSB
Question: we are supposed to look for C57BL/6J and DBA/2J. I see DBA and C57BL. Are those the ones?
/Users/clemensl/vcfTools/vcftools_0.1.9/bin/vcftools --gzvcf /Users/clemensl/SangerDownloads/experiment/snps-all-new.vcf.gz --chr 19 --plink --out chr19 VCFtools - v0.1.9.0 (C) Adam Auton 2009 Parameters as interpreted: --gzvcf /Users/clemensl/SangerDownloads/experiment/snps-all-new.vcf.gz --chr 19 --out chr19 --plink Using zlib version: 1.2.3 Versions of zlib >= 1.2.4 will be *much* faster when reading zipped VCF files. Reading Index file. Building new index file. Scanning Chromosome: 1 Scanning Chromosome: 2 Scanning Chromosome: 3 Scanning Chromosome: 4 Scanning Chromosome: 5 Scanning Chromosome: 6 Scanning Chromosome: 7 Scanning Chromosome: 8 Scanning Chromosome: 9 Scanning Chromosome: 10 Scanning Chromosome: 11 Scanning Chromosome: 12 Scanning Chromosome: 13 Scanning Chromosome: 14 Scanning Chromosome: 15 Scanning Chromosome: 16 Scanning Chromosome: 17 Scanning Chromosome: 18 Scanning Chromosome: 19 Scanning Chromosome: X Writing Index file. File contains 65243635 entries and 17 individuals. Filtering by chromosome. Chromosome: 1 Chromosome: 2 Chromosome: 3 Chromosome: 4 Chromosome: 5 Chromosome: 6 Chromosome: 7 Chromosome: 8 Chromosome: 9 Chromosome: 10 Chromosome: 11 Chromosome: 12 Chromosome: 13 Chromosome: 14 Chromosome: 15 Chromosome: 16 Chromosome: 17 Chromosome: 18 Chromosome: 19 Chromosome: X Skipping Remainder. Keeping 1566622 entries on specified chromosomes. Applying Required Filters. After filtering, kept 17 out of 17 Individuals After filtering, kept 1566622 out of a possible 1566622 Sites Writing PLINK PED file ... PLINK: Only outputting biallelic loci. Writing PLINK MAP file ... Done. Run Time = 8132.00 seconds
Because of the line:
Versions of zlib >= 1.2.4 will be *much* faster when reading zipped VCF files.
I decided to update zlib. Here are the instructions.
Go to Download zlib The file you get will be named something like this:
zlib-1.2.7.tar.gz
Unarchive by double clicking. Then look at the top of the file Makefile.in which explains how to install.
./configure; make test
You should get two messages saying tests are successful.
*** zlib test OK *** *** zlib shared test OK ***
Then run
make install