Impute2 pipeline
This page describes the Impute2 pipeline developed by the GoNL - Impute team.
Overview
The pipeline performs the following steps:
- Pre-processing of study data
  - Liftover PED/MAP files from build 36 to build 37
  - Quality control of study data
- Analysis
  - Phasing the study data
  - Imputing the study data
A detailed description and a list of the software and versions used can be found at the bottom of this page.
All analysis jobs are generated using MOLGENIS Compute. More information about MOLGENIS Compute and the other analysis pipelines can be found here: https://github.com/molgenis/molgenis-pipelines
Paper ready summary
UNDER CONSTRUCTION
You can use the following text in your paper:
Summary
The study data were lifted over from human genome build 36 to build 37 using Plink(1) and UCSC liftOver, then aligned to the reference data and filtered to retain SNPs with a MAF above 1%, a Hardy-Weinberg equilibrium p-value above 1e-4 and a call rate above 0.95. The study data were then pre-phased per chromosome using SHAPEIT2 v2.r644(2). Finally, imputation was performed in genome chunks of 5 Mb using IMPUTE2 v2.3.0(3). The reference panels were Genome of the Netherlands release 4 (499 unrelated individuals)(4) and 1000 Genomes phase 1 integrated version 3 (1,092 individuals)(5).
We used MOLGENIS Compute(6) to implement the imputation pipeline, keep track of all analysis jobs and distribute all imputation chunks in parallel across our PBS compute cluster and the national life science grid. All pipelines are available as open source via http://www.molgenis.org/wiki/ComputeStart.
Acknowledgements
We would like to thank the Target project (http://www.rug.nl/target) for providing the compute infrastructure used for imputation and the BigGrid/eBioGrid project (http://www.ebiogrid.nl) for sponsoring the imputation pipeline implementation.
References
1. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC (2007) PLINK: a toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics 81. Available: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1950838/
2. Delaneau O, Zagury J-F, Marchini J (2013) Improved whole-chromosome phasing for disease and population genetic studies. Nature Methods 10: 5–6. Available: http://dx.doi.org/10.1038/nmeth.2307
3. Howie BN, Donnelly P, Marchini J (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5: e1000529. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2689936&tool=pmcentrez&rendertype=abstract
4. Placeholder for GoNL paper
5. The 1000 Genomes Project Consortium (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491. Available: http://www.nature.com/nature/journal/v491/n7422/full/nature11632.html
6. Byelas H, Dijkstra M, Neerincx P, Van Dijk F, Kanterakis A, Deelen P, Swertz M (2013) Scaling bio-analyses from computational clusters to grids. IWSG 2013.
Reference data
- ALL_1000G_phase1integrated_v3_impute.tgz from the IMPUTE website. This dataset contains 1,092 individuals.
- GoNL release 4, 499 unrelated individuals, 998 haplotypes
Detailed description
Step 1: Liftover PED/MAP files from build 36 to build 37
- Location: https://github.com/molgenis/molgenis-pipelines/blob/master/compute4/Liftover_genome_build_PEDMAP/protocols/liftoverPEDMAP.ftl
This protocol applies the following steps:
- Create a BED file based on the MAP file.
- Map the BED file to build 37 using the UCSC liftOver tool.
- Create list of unmapped SNPs.
- Create plink mappings file.
- Create new plink data without unmapped SNPs using Plink.
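The first and third steps above can be sketched in a few lines of Python. This is a minimal illustration and not the code in liftoverPEDMAP.ftl: the function names are hypothetical, and it assumes a four-column MAP file (chromosome, SNP id, genetic distance, base-pair position), the standard Plink numeric codes for the sex chromosomes and mitochondrial DNA, and a BED-formatted unmapped file in which comment lines start with `#`.

```python
# Hypothetical helpers; names and conventions are assumptions for illustration.

# Plink numeric chromosome codes translated to UCSC-style names.
PLINK_TO_UCSC = {"23": "X", "24": "Y", "25": "X", "26": "M"}


def map_to_bed(map_lines):
    """Convert Plink MAP lines into BED lines (chrom, start, end, name)."""
    bed = []
    for line in map_lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        chrom, snp_id, _genetic_distance, pos = fields
        chrom = PLINK_TO_UCSC.get(chrom, chrom)
        pos = int(pos)
        # MAP positions are 1-based; BED intervals are 0-based and half-open.
        bed.append(f"chr{chrom}\t{pos - 1}\t{pos}\t{snp_id}")
    return bed


def unmapped_snp_ids(unmapped_bed_lines):
    """Collect SNP ids (4th column) from the unmapped file, skipping comments."""
    ids = []
    for line in unmapped_bed_lines:
        if line.startswith("#") or not line.strip():
            continue
        ids.append(line.split()[3])
    return ids
```

The list returned by `unmapped_snp_ids` could then be written to a file, one id per line, and passed to Plink as an exclusion list when creating the new data set.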
Step 2: Quality control of study data
- Location: https://github.com/molgenis/molgenis-pipelines/blob/master/compute4/Imputation_imputationtool_studyQC/protocols/studyQC.ftl
This protocol applies QC to the study data using ImputationTool. This tool uses a binary format called TriTyper for rapid loading of large genotype data. ImputationTool performs the following checks:
- Assesses strand alignment of alleles and swaps the alleles of a SNP if needed. For example, if a SNP that is in LD with multiple SNPs has a negative score, its alleles are swapped and LD is calculated again. If the score of the SNP is still negative, the SNP is removed from the study data.
- Regular quality checks done during routine processing of GWAS data: the Hardy-Weinberg equilibrium p-value should be higher than 1e-4, the minor allele frequency should be higher than 0.01 and the call rate should be higher than 0.95. If a SNP fails any of these criteria it is removed from the study panel, because of the high likelihood that it contains erroneous genotypes.
- Simple sanity checks, such as whether the SNP is missing from the reference panel or has null alleles. In both cases the SNP is removed from the study panel.
- Checks whether the allele frequencies differ significantly between the two panels. A difference higher than 25% indicates an important qualitative difference between the panels, so imputing this SNP is prone to introduce invalid information. For this reason these SNPs are removed from the study panel.
- Assesses if the haplotype structure is comparable between reference and GWAS data. This is performed by pairwise comparison of r-squared between SNPs in both reference and GWAS. For SNPs in LD (r-squared > 0.1), the allele frequencies are compared. SNPs are removed from the GWAS data when the major allele differs more often than it is identical.
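The threshold-based checks above can be summarised in a short Python sketch. It is not the ImputationTool implementation: the class and function names are hypothetical, the per-SNP statistics are assumed to be computed elsewhere, and the 25% allele frequency criterion is assumed to be an absolute difference. The strand alignment and haplotype structure checks are not covered.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds taken from the description above.
MIN_MAF = 0.01
MIN_CALL_RATE = 0.95
MIN_HWE_P = 1e-4
MAX_FREQ_DIFF = 0.25  # assumed to be an absolute frequency difference


@dataclass
class SnpStats:
    """Pre-computed statistics for one study SNP (hypothetical structure)."""
    snp_id: str
    maf: float
    call_rate: float
    hwe_p: float
    study_allele_freq: float
    reference_allele_freq: Optional[float]  # None if absent from the reference


def failed_checks(snp: SnpStats) -> List[str]:
    """Return the names of the checks this SNP fails; empty means keep it."""
    failures = []
    if snp.maf <= MIN_MAF:
        failures.append("maf")
    if snp.call_rate <= MIN_CALL_RATE:
        failures.append("call_rate")
    if snp.hwe_p <= MIN_HWE_P:
        failures.append("hwe")
    if snp.reference_allele_freq is None:
        failures.append("not_in_reference")
    elif abs(snp.study_allele_freq - snp.reference_allele_freq) > MAX_FREQ_DIFF:
        failures.append("allele_freq_diff")
    return failures
```

A SNP is removed from the study panel as soon as `failed_checks` returns a non-empty list. Returning the reasons rather than a boolean makes it easy to report how many SNPs were dropped by each criterion.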
Step 3: Phasing the study data
- Location: https://github.com/molgenis/molgenis-pipelines/blob/master/compute4/Imputation_shapeit_phasing/protocols/shapeitPhasing.ftl
This protocol phases the study data using Shapeit. Study data can be in the following formats: Ped/Map, Bed/Bim, Gen/Haps.
This protocol applies the following steps:
- Gather the map files as provided in the ALL_1000G_phase1integrated_v3_impute.tgz from the IMPUTE website.
- Phase the study data using the map files.
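As an illustration of the phasing step, the sketch below builds a Shapeit command line for one chromosome. It is an assumption about how such a call could look, not a copy of shapeitPhasing.ftl: the function name, file prefixes and output names are hypothetical, and only the documented Shapeit options `--input-ped`, `--input-bed`, `--input-gen`, `--input-map` and `--output-max` are used.

```python
def shapeit_command(fmt, study_prefix, genetic_map, out_prefix):
    """Build an argument list for phasing one chromosome with Shapeit.

    fmt is "ped", "bed" or "gen"; all file names are derived from
    hypothetical prefixes and are assumptions for illustration.
    """
    inputs = {
        "ped": ["--input-ped", f"{study_prefix}.ped", f"{study_prefix}.map"],
        "bed": ["--input-bed", f"{study_prefix}.bed",
                f"{study_prefix}.bim", f"{study_prefix}.fam"],
        "gen": ["--input-gen", f"{study_prefix}.gen", f"{study_prefix}.sample"],
    }
    if fmt not in inputs:
        raise ValueError(f"unsupported input format: {fmt}")
    return [
        "shapeit",
        *inputs[fmt],
        "--input-map", genetic_map,  # genetic map from the reference bundle
        "--output-max", f"{out_prefix}.haps", f"{out_prefix}.sample",
    ]
```

The returned list can be passed to `subprocess.run` or joined into a line of a generated job script, one job per chromosome.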
Step 4: Imputing study data
- Location: https://github.com/molgenis/molgenis-pipelines/blob/master/compute4/Imputation_impute2/protocols/impute2Imputation.ftl
This protocol imputes the study data in chromosome bins of 5 million bases using Impute2. Afterwards the results per chromosome bin are merged into full chromosomes.
This protocol applies the following steps:
- Imputing the phased study data with Impute2 in bins of 5 million bases per chromosome, using the 1000G map files and the following two parameters:
  - k_hap: 1500, this parameter sets the number of reference haplotypes to use (default: 500).
  - Ne: 20000, this parameter controls the effective population size in the population-genetic model (default: 20000).
- Concatenating all chromosome bins into a full chromosome.
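The binning can be sketched as follows. This is a minimal illustration, not the code in impute2Imputation.ftl: the function names are hypothetical, bins are assumed to be 1-based and inclusive, and only the interval and parameter options (`-int`, `-k_hap`, `-Ne`) of an Impute2 call are shown; input and output file options are left out.

```python
def chunk_intervals(chrom_length, chunk_size=5_000_000):
    """Split a chromosome into consecutive 1-based inclusive bins."""
    intervals = []
    start = 1
    while start <= chrom_length:
        end = min(start + chunk_size - 1, chrom_length)
        intervals.append((start, end))
        start = end + 1
    return intervals


def impute2_chunk_args(start, end, k_hap=1500, ne=20000):
    """Interval and parameter options for one bin (file options omitted)."""
    return ["-int", str(start), str(end), "-k_hap", str(k_hap), "-Ne", str(ne)]


# Example: a hypothetical 12 Mb chromosome yields three bins.
for start, end in chunk_intervals(12_000_000):
    print(" ".join(impute2_chunk_args(start, end)))
```

Because the bins are generated in order, concatenating the per-bin output files in the same order gives the result for the full chromosome.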
Information on the Impute2 output format can be found here: http://mathgen.stats.ox.ac.uk/impute/output_file_options.html
Used software
Below is a list of used software and versions.
Software | Version |
UCSC liftOver tool | 20120905 |
Plink | v1.07 |
ImputationTool | 20120912 |
Shapeit | v2.r644 |
Impute | v2.3.0 |