-
Notifications
You must be signed in to change notification settings - Fork 62
Tutorial 1: The Basics
This introductory tutorial is meant to give some practice in using the mtag from the command line, but it is not intended to be comprehensive. It would be helpful take a look at the mtag's available options by navigating to its directory and entering:
python mtag.py -h
in the command line before beginning.
You should have already downloaded the mtag repository. For simplicity, all commands are executed from the directory including both the mtag folder and the input summary statistics, but this is not at all necessary. You can clone the repository with the command:
git clone https://github.com/omeed-maghzian/mtag.git
Performing MTAG is useful when applied to the summary statistics of multiple GWAS. The GWAS summary statistics must be formatted in a way that mtag can read. This preprocessing has been done in the sample data used below; all that you need to do is unzip the files in the folder containing the mtag directory and you will be able to directly execute the mtag commands below.
We demonstrate the use of mtag with GWAS results on neuroticism and subjective well-being from Okbay et. al. (2016). We use a subset of the original GWAS results from a random subsample of SNPs found in Hapmap3. The neuroticism and subjective well-being data summary statistics can be found here (Note: requires registering for an account).
These sample files are already formatted in the default output accepted by the command line tool. That is, they are whitespace-delimited .txt files with the following column names (the specific order is not necessary):
snpid chr bpos a1 a2 freq z pval n
The following columns are necessary for MTAG to run:
-
snpidThe unique snp identifier, typically an rsid number. The multiple summary statistics files are matched together using this column. This is the also the SNP identifier that is passed toldscwhen munging the summary statistics and estimating the residual covariance matrix. Other column names may be passed via the--snp_nameoption. -
a1/a2: Alleles observed at the particular locus.a1is considered to be the effect allele, which should also be reflected in the signs of the Z-scores. These columns are also passed to theldscroutine.mtagchecks and flips thea1anda2alleles so that they are identical across input files. Other column names may be passed via the--a1_nameand--a2_nameoptions. -
freq: The effect allele frequency (that is, the frequency ofa1), used to filter rare variants and weight convert MTAG results to unstandardized effect sizes. Other column names may be passed via the--eaf_nameoption. -
z: The Z-scores associated with the SNP effect sizes for the GWAS. Along with the sample size column, they are used in virtually every essential step in themtagprogram. Other column names may be passed with the--z_nameoption. Note: If both the effect sizes and the standard errors are present in the input GWAS summary statistics,zis not required by MTAG. To use this option, check that the repository is up-to-date and specify the flag--use_beta_se. The default column names arebetaandse, which can be specified via the flags--beta_nameand--se_name. -
n: The SNP sample sizes. Likez, they form an essential component of the analysis. Other column names may be passed with the--n_nameoption.
The other columns are not directly used by mtag.py but are part of the munging procedure implemented via ldsc. Therefore, at minimum, they should be in a format recognized by the ldsc see here for details. The chromosome number chr and base pair position bpos of SNPs are also used as reference columns printed in the results files. Other names can be passed with the --chr_name and --bpos_name options, respectively.
Using mtag with the default options implements the following steps:
1. Read in the input GWAS summary statistics and filter the SNPs by minor allele frequency (MAF) >= 0.01 and sample size N >= (2/3) * 90th percentile.
2. Merge the filtered GWAS summary statistics results together, taking the intersection of available SNPs.
3. Estimate the residual covariance matrix via LD Score regression.
4. Estimate the genetic covariance matrix Omega.
5. Perform MTAG and output results.
We apply MTAG by simply calling the mtag.py script via python, with minimal options:
python /[path]/mtag.py \
--sumstats 1_OA2016_hm3samp_NEUR.txt,1_OA2016_hm3samp_SWB.txt \
--out ./tutorial_results_1.1NS \
--n_min 0.0 \
--stream_stdout &
Setting --sumstats is the only necessary option when calling mtag.py, but it is highly recommended to also set --out. The lower sample size bound is set to 0 via --n_min to avoid dropping more SNPs in this example. --stream_stdout allows the log of the analysis to be streamed to the console during runtime (off by default).
-
--sumstatsshould be given a whitespace-delimited list of GWAS summary statistics. The order of the files does not affect the analysis, but it does correspond to the index used for the results files. -
--outshould be given a string in the form of[DIR]/[TAG], where[DIR]is the output directory and[TAG]is the prefix attached to all files written bymtag. The default is./mtag_results.
Letting [TAG] and [DIR] denote the components of the --out option above, running mtag should have produced five files in your current directory:
-
[TAG].logtimestamps the different steps taken bymtag.py. -
[TAG]_sigma_hat.txtstores the estimated residual covariance matrix. -
[TAG]_omega_hat.txtstores the estimated genetic covariance matrix. -
[TAG]_trait_1.txtand[TAG]_trait_2.txtare tab-delimited results files corresponding to the MTAG-adjusted effect sizes and standard errors for the neuroticism and subjective well-being summary statistics, respectively. Note that the files are numbered in the order they were presented in the list provided to the--sumstatsflag.
Opening up tutorial_results_1.log, we see that the header lists all the arguments passed to mtag.py.
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
<>
<> MTAG: Multi-trait Analysis of GWAS
<> Version: 1.0.7
<> (C) 2017 Omeed Maghzian, Raymond Walters, and Patrick Turley
<> Harvard University Department of Economics / Broad Institute of MIT and Harvard
<> GNU General Public License v3
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
Calling ./mtag.py \
--stream-stdout \
--n-min 0.0 \
--sumstats 1_OA2016_hm3samp_NEUR.txt,1_OA2016_hm3samp_SWB.txt \
--out ./tutorial_results_1.1NS
Most of the generic filters based on sample size, allele frequency, etc., are implemented through calling ldsc's munge_sumstats.py in the local version of ldsc contained with the mtag repository; users should consult the ldsc documentation to understand what is going on under the hood. The output of munge_sumstats is redirected into the mtag log file.
Apart from providing time stamps for the estimating the different matrices needed for MTAG, the log file also displays the calculated values of Omega and Sigma along with a summary output of the results:
Summary of MTAG results:
------------------------
Trait N (max) N (mean) # SNPs used GWAS mean chi^2 MTAG mean chi^2 GWAS equiv. (max) N
1 ...hm3samp_NEUR.txt 111111 55913 167467 1.243 1.278 127214
2 ..._hm3samp_SWB.txt 111111 71824 167467 1.135 1.221 180879
Estimated Omega:
[[ 4.677e-06 -2.302e-06]
[ -2.302e-06 1.959e-06]]
Estimated Sigma:
[[ 1.052 -0.167]
[-0.167 1.016]]
MTAG results saved to file.
MTAG complete. Time elapsed: 44.8766591549s
All values for each column are computed from the set of SNPs that passed all mtag filters. The GWAS mean chi^2 column is adjusted by the diagonal terms of Sigma. MTAG mean chi^2 is calculated using the effect sizes and standard errors produced by mtag. A clear way to assess the increases in power from MTAG is to approximate the sample size needed to achieve the same mean chi2 value in a standard GWAS. GWAS equiv. (max) N provides this measure by multiplying the maximum sample size (N (max)) by the ratio (chi2MTAG, mean-1)/(chi2GWAS, mean-1).
tutorial_results_1.1NS_trait_1.txt provides the MTAG-adjusted effect sizes for the neuroticism GWAS:
snpid chr bpos a1 a2 z n freq mtag_beta mtag_se mtag_z mtag_pval
rs2736372 8 11106041 T C -7.71614161262 111111.111111 0.4179 -0.0324880486907 0.00419105765062 -7.7517541869 9.06317063823e-15
rs2060465 8 11162609 T C 7.69444599845 62500.0 0.6194 0.038971244976 0.00536428475564 7.26494709944 3.73184288437e-13
rs10096421 8 10831868 T G -7.561098219 111111.111111 0.4646 -0.0306226260845 0.00414457338635 -7.38860751877 1.48374420103e-13
rs2409722 8 11039816 T G -7.38261601808 111111.111111 0.4627 -0.0321870047337 0.00414572461396 -7.76390323306 8.23547416667e-15
rs11991118 8 10939273 T G 7.32202915636 111111.111111 0.5056 0.0306039239803 0.0041344320288 7.40220755042 1.33938936247e-13
The first eight columns are copied from the corresponding input file. mtag_beta and mtag_se provide the unstandardized weights and standard errors calculated by mtag, yielding the corresponding z-scores mtag_z and p-values mtag_pval. For standardized results, simply include the --std_betas flag (the p-values are unaffected).
The input and output files both treat the effect size as increasing in a1. The program uses the a1 and a2 columns of the first file as the reference for all other traits: the signs of the Z-scores for an input file are reversed if its a2 column is the first trait's a1 column. The allele columns in the results files are printed after this reversal, so you may need to be careful with the signs of mtag_beta when comparing them to your original GWAS results.
A word of caution about using MTAG with SNP sample sizes that are small: Generally, applications of mtag with a small final list of SNPs may not generate positive semidefinite estimates of the covariance matrices and may even lead to runtime errors in mtag.
If for some reason you would like to rerun MTAG but already have an estimate of Omega in hand, you can use the --gencov_path flag to directly pass an estimate of Omega to mtag. Likewise, estimates of Sigma may be passed with --residcov_path. In both cases, the typical estimation routine is skipped. It is best to manually pass these matrices only when you are confident that they are the likelihood-optimal (or close to optimal) values for the given set of SNPs used in MTAG. As a result, it is strongly recommended to only pass file paths to --gencov_path and --residcov_path that have been produced from an earlier application of mtag.
Without the need to estimate these covariance matrices, mtag's runtime is primarily limited by the reading andwriting of files (which is unusually small for our example set of summary statistics).
python /[path]/mtag.py \
--sumstats 1_OA2016_hm3samp_NEUR.txt,1_OA2016_hm3samp_SWB.txt \
--out ./tutorial_results_1.2SN \
--n_min 0 \
--gencov_path tutorial_results_1.1NS_omega_hat.txt \
--residcov_path tutorial_results_1.1NS_sigma_hat.txt &
Summary of MTAG results:
------------------------
Trait N (max) N (mean) # SNPs used GWAS mean chi^2 MTAG mean chi^2 GWAS equiv. (max) N
1 ...hm3samp_NEUR.txt 111111 55913 167467 1.243 1.278 127214
2 ..._hm3samp_SWB.txt 111111 71824 167467 1.135 1.221 180879
Estimated Omega:
[[ 4.677e-06 -2.302e-06]
[ -2.302e-06 1.959e-06]]
Estimated Sigma:
[[ 1.052 -0.167]
[-0.167 1.016]]
MTAG results saved to file.
MTAG complete. Time elapsed: 23.8759078979s