Skip to content
walkerhound edited this page Nov 8, 2012 · 59 revisions

Back to Update Mouse Genome Details

Table of Contents

Outline of one method of extracting data from vcf.gz files

Download and install vcftools

For instructions, see Install vcftools

Recreate indexes if necessary

If at any step you receive an error like this:

the index file either does not exist or is older than the vcf file. Please reindex.

If so, re-index by typing

tabix -p vcf my_file.vcf.gz

Extract data about C57BL and DBA

/Users/clemensl/vcfTools/vcftools_0.1.9/bin/vcf-subset -c C57BL,DBA  20111102-snps-all.vcf.gz > snps-two.vcf

Run perl script to extract required data

The perl script is called evaluateSanger.pl. The script writes out a file with SNPs for which ATG (Above Threshold Genotype) is either 0 or 1 for both C57BL and DBA and satisfies the conditions in the following table.

C57BL DBA Keep?
1 0 yes
1 1 yes
0 1 yes
0 0 no

Format of vcf files

Abbreviation Definition
CH Chromosome
POS Position
ID Identifier
REF Reference
ALT Alternate
Q Quality
F Filter
INFO Information
AC Allele count in genotypes
AN Total number of alleles in called genotypes
GT Genotype
ATG Above Threshold Genotype
MQ Mapping Quality
HCG High Confidence Genotype
GQ Genotype Quality
DP Read Depth

Here is what a couple of lines in the vcf file look like:

CH POS ID REF ALT Q F INFO FORMAT C57BL DBA
1 3000054 . C T -1 . AC=0;AN=4 GT:ATG:MQ:HCG:GQ:DP 0/0:0:60:-1:61:8 0/0:0:59:-1:88:17
1 3000093 . T C -1 . AC=0;AN=4 GT:ATG:MQ:HCG:GQ:DP 0/0:0:57:-1:79:14 0/0:0:59:-1:103:22

Possible values for ATG:

Value Meaning
1 SNP for genotypes above threshold
0 Wildtype Reference for genotypes above threshold
-8 missing
-7 heterozygous
-6 SNP with mapping or genotype quality less than 40
-5 SNP with read depth greater than 150
-4 SNP HCG but within 5 bp of a short period [7] simple repeat
-3 SNP HCG but within 15 bp of an indel
-2 SNP not HCG
-1 Wildtype with mapping / genotype quality less than 40 or read Depth greater than 150

Possible values for HCG:

Value Meaning
1 yes for nonref alleles
0 no for nonref alleles
-1 for ref homs

More information about vcf files and tools

Some of this information is not required. I wrote notes as I was experimenting. I left the notes here in case they will prove of use in the future.

Using Tabix

To get help on tabix, type

tabix

You will see:

Program: tabix (TAB-delimited file InderXer)
Version: 0.2.5 (r1005)

Usage:   tabix <in.tab.bgz> [region1 [region2 [...]]]

Options: -p STR     preset: gff, bed, sam, vcf, psltbl [gff]
         -s INT     sequence name column [1]
         -b INT     start column [4]
         -e INT     end column; can be identical to '-b' [5]
         -S INT     skip first INT lines [0]
         -c CHAR    symbol for comment/meta lines [#]
         -r FILE    replace the header with the content of FILE [null]
         -B         region1 is a BED file (entire file will be read)
         -0         zero-based coordinate
         -h         print also the header lines
         -H         print only the header lines
         -l         list chromosome names
         -f         force to overwrite the index

To see the vcf file header, use the -H option.

tabix -H /Users/clemensl/SangerDownloads/snps/20111102-snps-all.vcf.gz
##fileformat=VCFv3.3
##FORMAT=DP,1,Integer,"Read Depth"
##FORMAT=GQ,1,Integer,"Genotype Quality"
##FORMAT=GT,1,String,"Genotype"
##INFO=DP,1,Integer,"Total Depth"
##FORMAT=MQ,1,Integer,"Mapping Quality"
##FORMAT=HCG,1,Integer,"High Confidence Genotype. Can be 1(yes) or 0(no) for nonref alleles. It is always -1 for ref homs."
##FORMAT=ATG,1,Integer,"Above Threshold Genotype. Can be 1 (SNP) or 0 (Wildtype Reference) for genotypes above threshold, or -8 (missing), -7 (heterozygous), -6 (SNP with mapping or genotype quality < 40), -5 (SNP with read depth > 150), -4 (SNP HCG but within 5 bp of a short period [7] simple repeat), -3 (SNP HCG but within 15 bp of an indel), -2 (SNP not HCG), -1 (Wildtype with mapping / genotype quality < 40 or read Depth > 150)"
##INFO=AC,-1,Integer,"Allele count in genotypes"
##INFO=AN,1,Integer,"Total number of alleles in called genotypes"
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	129P2	129S1	129S5	AKR	A_J	BALB	C3H	C57BL	CAST	CBA	DBA	LP_J	NOD	NZO	PWK	SPRET	WSB

You can see by using tabix that the files downloaded from Sanger are version 3.3.

Then I tried using tabix to write out some information from the file, but I got an error about the index file. (The index files end in vcf.gz.tbi)

tabix -h /Users/clemensl/SangerDownloads/snps/20111102-snps-all.vcf.gz 1:5000000-5000100
[tabix] the index file either does not exist or is older than the vcf file. Please reindex.

Then I re-indexed using this command:

tabix -p vcf 20111102-snps-all.vcf.gz

After this, the tabix -h command was successful:

tabix -h /Users/clemensl/SangerDownloads/snps/20111102-snps-all.vcf.gz 1:5000000-5000100
##fileformat=VCFv3.3
##FORMAT=DP,1,Integer,"Read Depth"
##FORMAT=GQ,1,Integer,"Genotype Quality"
##FORMAT=GT,1,String,"Genotype"
##INFO=DP,1,Integer,"Total Depth"
##FORMAT=MQ,1,Integer,"Mapping Quality"
##FORMAT=HCG,1,Integer,"High Confidence Genotype. Can be 1(yes) or 0(no) for nonref alleles. It is always -1 for ref homs."
##FORMAT=ATG,1,Integer,"Above Threshold Genotype. Can be 1 (SNP) or 0 (Wildtype Reference) for genotypes above threshold, or -8 (missing), -7 (heterozygous), -6 (SNP with mapping or genotype quality < 40), -5 (SNP with read depth > 150), -4 (SNP HCG but within 5 bp of a short period [7] simple repeat), -3 (SNP HCG but within 15 bp of an indel), -2 (SNP not HCG), -1 (Wildtype with mapping / genotype quality < 40 or read Depth > 150)"
##INFO=AC,-1,Integer,"Allele count in genotypes"
##INFO=AN,1,Integer,"Total number of alleles in called genotypes"
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	129P2	129S1	129S5	AKR	A_J	BALB	C3H	C57BL	CAST	CBA	DBA	LP_J	NOD	NZO	PWK	SPRET	WSB
1	5000060	.	C	T	-1	.	AC=4;AN=34	GT:ATG:MQ:HCG:GQ:DP	0/0:0:60:-1:209:57	0/0:0:60:-1:97:20	0/0:0:59:-1:97:20	0/0:0:60:-1:185:49	0/0:0:60:-1:124:29	0/0:0:60:-1:91:18	0/0:0:60:-1:106:23	0/0:0:59:-1:127:30	0/0:0:59:-1:91:25	0/0:0:60:-1:71:21	1/1:1:59:1:133:32	0/0:0:59:-1:115:26	0/0:0:57:-1:130:31	0/0:0:60:-1:88:17	1/1:1:59:1:93:28	0/0:0:58:-1:121:28	0/0:0:60:-1:97:20
1	5000087	.	A	C	-1	.	AC=2;AN=34	GT:ATG:MQ:HCG:GQ:DP	0/0:0:60:-1:197:53	0/0:0:60:-1:85:16	0/0:0:60:-1:88:17	0/0:0:60:-1:200:54	0/0:0:60:-1:133:32	0/0:0:60:-1:91:18	0/0:0:60:-1:130:31	0/0:0:60:-1:90:29	0/0:0:60:-1:100:21	0/0:0:60:-1:115:26	0/0:0:60:-1:127:30	0/0:0:59:-1:130:32	0/0:0:59:-1:106:23	0/0:0:60:-1:91:18	0/0:0:59:-1:118:27	0/0:0:60:-1:133:32	1/1:1:60:1:82:15
1	5000100	.	A	G	-1	.	AC=2;AN=34	GT:ATG:MQ:HCG:GQ:DP	0/0:0:60:-1:203:55	0/0:0:60:-1:94:19	0/0:0:60:-1:97:20	0/0:0:60:-1:186:62	0/0:0:60:-1:130:31	0/0:0:60:-1:97:20	0/0:0:60:-1:97:28	0/0:0:60:-1:130:31	0/0:0:60:-1:100:21	0/0:0:60:-1:90:29	0/0:0:60:-1:139:34	0/0:0:60:-1:104:34	0/0:0:59:-1:103:22	0/0:0:59:-1:66:19	0/0:0:58:-1:112:25	1/1:1:60:1:148:37	0/0:0:60:-1:85:16

Running vcftools

I first tried running vcftools using the following command which should write out everything on chromosome 19 to .ped and .map files in the plink format.

./vcftools --gzvcf /Users/clemensl/SangerDownloads/20111102-snps-all.vcf.gz --chr 19 --plink

I got this error:

Error:VCF version must be v4.0 or v4.1:
You are using version VCFv3.3

So it seems the perl scripts run on earlier versions but the vcftools do not.

So I unarchived one of the vcf.gz files to create a vcf file, just by double clicking. Then I ran this:

cat /Users/clemensl/SangerDownloads/experiment/20111102-snps-all.vcf |./vcf-convert > /Users/clemensl/SangerDownloads/experiment/snps-all-new.vcf

This took hours to run, but it did finish.

Then zip and index again:

bgzip /Users/clemensl/SangerDownloads/experiment/snps-all-new.vcf
tabix -p vcf /Users/clemensl/SangerDownloads/experiment/snps-all-new.vcf.gz

Now the version is 4.1:

tabix -H snps-all-new.vcf.gz
##fileformat=VCFv4.1
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=MQ,Number=1,Type=Integer,Description="Mapping Quality">
##FORMAT=<ID=HCG,Number=1,Type=Integer,Description="High Confidence Genotype. Can be 1(yes) or 0(no) for nonref alleles. It is always -1 for ref homs.">
##FORMAT=<ID=ATG,Number=1,Type=Integer,Description="Above Threshold Genotype. Can be 1 (SNP) or 0 (Wildtype Reference) for genotypes above threshold, or -8 (missing), -7 (heterozygous), -6 (SNP with mapping or genotype quality < 40), -5 (SNP with read depth > 150), -4 (SNP HCG but within 5 bp of a short period [7] simple repeat), -3 (SNP HCG but within 15 bp of an indel), -2 (SNP not HCG), -1 (Wildtype with mapping / genotype quality < 40 or read Depth > 150)">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	129P2	129S1	129S5	AKR	A_J	BALB	C3H	C57BL	CAST	CBA	DBA	LP_J	NOD	NZO	PWK	SPRET	WSB

Question: we are supposed to look for C57BL/6J and DBA/2J. I see DBA and C57BL. Are those the ones?

/Users/clemensl/vcfTools/vcftools_0.1.9/bin/vcftools --gzvcf /Users/clemensl/SangerDownloads/experiment/snps-all-new.vcf.gz --chr 19 --plink --out chr19

VCFtools - v0.1.9.0
(C) Adam Auton 2009

Parameters as interpreted:
	--gzvcf /Users/clemensl/SangerDownloads/experiment/snps-all-new.vcf.gz
	--chr 19
	--out chr19
	--plink

Using zlib version: 1.2.3
Versions of zlib >= 1.2.4 will be *much* faster when reading zipped VCF files.
Reading Index file.
Building new index file.
	Scanning Chromosome: 1
	Scanning Chromosome: 2
	Scanning Chromosome: 3
	Scanning Chromosome: 4
	Scanning Chromosome: 5
	Scanning Chromosome: 6
	Scanning Chromosome: 7
	Scanning Chromosome: 8
	Scanning Chromosome: 9
	Scanning Chromosome: 10
	Scanning Chromosome: 11
	Scanning Chromosome: 12
	Scanning Chromosome: 13
	Scanning Chromosome: 14
	Scanning Chromosome: 15
	Scanning Chromosome: 16
	Scanning Chromosome: 17
	Scanning Chromosome: 18
	Scanning Chromosome: 19
	Scanning Chromosome: X
Writing Index file.
File contains 65243635 entries and 17 individuals.
Filtering by chromosome.
	Chromosome: 1
	Chromosome: 2
	Chromosome: 3
	Chromosome: 4
	Chromosome: 5
	Chromosome: 6
	Chromosome: 7
	Chromosome: 8
	Chromosome: 9
	Chromosome: 10
	Chromosome: 11
	Chromosome: 12
	Chromosome: 13
	Chromosome: 14
	Chromosome: 15
	Chromosome: 16
	Chromosome: 17
	Chromosome: 18
	Chromosome: 19
	Chromosome: X
Skipping Remainder.
Keeping 1566622 entries on specified chromosomes.
Applying Required Filters.
After filtering, kept 17 out of 17 Individuals
After filtering, kept 1566622 out of a possible 1566622 Sites
Writing PLINK PED file ... 
	PLINK: Only outputting biallelic loci.
Writing PLINK MAP file ... Done.
Run Time = 8132.00 seconds

Update zlib

Because of the line:

Versions of zlib >= 1.2.4 will be *much* faster when reading zipped VCF files.

I decided to update zlib. Here are the instructions.

Go to Download zlib The file you get will be named something like this:

zlib-1.2.7.tar.gz

Unarchive by double clicking. Then look at the top of the file Makefile.in which explains how to install.

./configure; make test

You should get two messages saying tests are successful.

		*** zlib test OK ***
		*** zlib shared test OK ***

Then run

make install

Clone this wiki locally