Imputation pipeline
Table of Contents
There are at the moment two collaborating initiatives:
- VU: Mathijs Kattenberg and Joukejan Hottenga using IMPUTE
- UMCG: Lude Franke, Harm-Jan Westra, George Byelas, Morris Swertz using Beagle
The objective is to bring these pipelines into the same space so they can be properly compared and optimized.
TODO: describe the protocols here;
Description from Harm-Jan
Also: ImputationTool
The imputation pipeline has changed, in such a way that it was reduced to only a few steps. To facilitate QC and conversion steps, I've bundled our conversion tools in one single program called ImputationTool.jar.
Here, I shortly describe the steps that need to be in the new pipeline, in placeholders I also describe what the commands could look like, if you would implement this in a shellscript (or java program). These examples can be the complete execution steps of the pipeline.
Commands to run locally:
Step 1
- According to Harm-Jan: if the dataset is in binary plink format, use plink —recode to convert back to ped+map
- Transform .bed , .bim and .fam files into ASCII format
/Users/alexandroskanterakis/Tools/plink/plink-1.07-mac-intel/plink --bfile /Users/alexandroskanterakis/Data/Finnish_cohort/CD_Finnuncorr.maf05 --ped /Users/alexandroskanterakis/Data/Finnish_cohort/CD_Finnuncorr.maf05.ped --map CD_Finnuncorr.maf05.map --recode
- Produces the files:
-rw-r--r-- 1 alexandroskanterakis staff 11761120 Oct 5 12:11 plink.map -rw-r--r-- 1 alexandroskanterakis staff 4898208052 Oct 5 12:11 plink.ped
- real Execution time: 26m20.723s
- creates rather big files 4.6G plink.ped
Step 2
- Convert dataset to trityper format, if it is in ped+map format.
java -Xmx4g -jar /Users/alexandroskanterakis/Tools/imputation/ImputationTool/dist/ImputationTool.jar --mode pmtt --in /Users/alexandroskanterakis/Data/Finnish_cohort/ --out /Users/alexandroskanterakis/Data/Finnish_cohort/ real 74m52.547s user 38m14.646s sys 5m25.472s
- Files Created:
-rw-r--r-- 1 alexandroskanterakis staff 2449071024 Oct 5 15:05 GenotypeMatrix.dat -rw-r--r-- 1 alexandroskanterakis staff 46196 Oct 5 13:54 Individuals.txt -rw-r--r-- 1 alexandroskanterakis staff 98791 Oct 5 13:54 PhenotypeInformation.txt -rw-r--r-- 1 alexandroskanterakis staff 10771996 Oct 5 13:54 SNPMappings.txt -rw-r--r-- 1 alexandroskanterakis staff 5006368 Oct 5 13:54 SNPs.txt
Step 3
- compare the dataset to be imputed to the reference dataset (for example HapMap??2 release 24, also in TriTyper?? format), and remove any snps for which the haplotypes are different, or do not correlate to the reference dataset. Also remove any SNP that is not in the reference. Save the output as Ped+Map
java -Xmx4g -jar /Users/alexandroskanterakis/Tools/imputation/ImputationTool/dist/ImputationTool.jar --mode ttpmh --in /Users/alexandroskanterakis/Data/Finnish_cohort/ --hap /Users/alexandroskanterakis/Data/HapMap2-r24-CEU/ --out /Users/alexandroskanterakis/Data/Finnish_cohort/referenceOutput/ real 60m53.623s user 30m35.172s sys 2m34.325s
- Created Files (for each chromosome):
-rw-r--r-- 1 alexandroskanterakis staff 221184 Oct 5 16:07 chr1.dat -rw-r--r-- 1 alexandroskanterakis staff 1004358 Oct 5 16:06 chr1.excludedsnps.txt -rw-r--r-- 1 alexandroskanterakis staff 442368 Oct 5 16:07 chr1.map -rw-r--r-- 1 alexandroskanterakis staff 441350 Oct 5 16:07 chr1.markersBeagleFormat -rw-r--r-- 1 alexandroskanterakis staff 5802372 Oct 5 16:07 chr1.ped -rw-r--r-- 1 alexandroskanterakis staff 117708 Oct 5 16:06 chr1.warningsnps.txt
Steps 4-9
- beagle: imputes in batches of samples (imputes the whole genome in subsets of samples)
- impute: imputes all samples at once for subset of a genome. Select a window
- split the ped files in batches of 300 samples
* mkdir -p ".$datasetLocation."/batches/ * split -a2 -l$batchSize $pedAndMapOutputLocation $batchOutputLocation
- run linkage2beagle to convert the ped and map files to beagle format
for each batch do java -Xmx7g -jar linkage2beagle.jar data=$batchOutputLocation/chr$chromosome.dat pedigree=$batchOutputLocation/chr$chromosome.ped.$batch beagle=$beagleLocation/chr$chromosome.bgl.$batch done
- split the ped files in batches of 300 samples
Commands to run in server:
- run the actual imputation on the batches on the cluster (needs hapmap to be recoded to beagle format as well, but I have these files for you)
for each batch do java -Xmx11g -Djava.io.tmpdir=\$TMPDIR -jar beagle.jar unphased=$beagleLocation/chr$chromosome.bgl.$batch phased=$referenceLocation/HM2_Chr$chromosome-BEAGLE markers=$referenceLocation/markers_Chr$chromosome.txt missing=0 out=$outputLocation/Chr$chromosome/chr$chromosome-$batch done
Commands to run locally:
- convert the beagle imputed files into trityper format
java -Xmx4g -jar ImputationTool.jar bttb $outputLocation Chr/ChrCHROMOSOME-BATCH $imputedTriTyperLocation $numSamples
- correlate the imputed snps to the snps in the original dataset
java -Xmx4g -jar ImputationTool.jar corr $trityperOutputLocation $datasetName $imputedTriTyperLocation $imputedDatasetName
- (if needed) convert to other formats (plink dosage / ped+map))
That's basically it. A lot simpler than the previous version, don't you think? The required tool is attached to this e-mail, but might still be a bit buggish. Any recommendations are therefore more than welcome.
IMPUTE2 pipeline
impute2 accepts only gen and sample files as input. So we may have to perform some format conversions before running impute2. If our initial datasets are Ped and Map files then we can use the method: ConvertManyPedMapToGenSample? to convert it to gen and sample files. If our initial datasets are in Bed/Bim/Fam? format then we can use ConvertBedBimFamToPedMap? to convert to Ped and Map files and then use ConvertManyPedMapToGenSample? to convert to gen and sample files. As soon as you are done with these conversions steps you can use the UseImpute2WithOnePhasedReferencePanelForCompleteChromosomeInBatches to perform the imputation. This method has to be run once for each chromosome.
ConvertBedBimFamToPedMap?
This script converts BED, BIM, FAM files to PED and MAP by using plink. The "Path to BED, BIM, FAM file" parameter should contain the path and the suffix of these files. For example if the files are:
- /path/to/FILE1.maf05.bed
- /path/to/FILE1.maf05.bim
- /path/to/FILE1.maf05.fam
Then the parameter value should be: /path/to/FILE1.maf05
'Note: creates rather big plink.ped files
Parameters
- Path to plink executable (example: /Users/alexandroskanterakis/Tools/plink/plink-1.07-mac-intel/plink)
- Path to BED, BIM, FAM file (example: /Users/alexandroskanterakis/Data/Finnish_cohort/CD_Finnuncorr.maf05)
- Filename of exported PED file (example: /Users/alexandroskanterakis/Data/Finnish_cohort/CD_Finnuncorr.maf05.ped)
- Filename of exported MAP file (example: /Users/alexandroskanterakis/Data/Finnish_cohort/CD_Finnuncorr.maf05.map)
Example of usage:
Code highlighting:
ConvertBedBimFamToPedMap( plink_path="/Users/alexandroskanterakis/Tools/plink/plink-1.07-mac-intel/plink", bbf_path="/Users/alexandroskanterakis/Data/Finnish_cohort/CD_Finnuncorr.maf05", ped_path="/Users/alexandroskanterakis/Data/Finnish_cohort/CD_Finnuncorr.maf05.ped", map_path="/Users/alexandroskanterakis/Data/Finnish_cohort/CD_Finnuncorr.maf05.map")
Source code
http://www.bbmriwiki.nl/svn/Imputation/impute2/ConvertBedBimFamToPedMap.py
DividePedMapToChromosomes?
This script divides a pair of PED and MAP files to chromosomes by using plink (http://pngu.mgh.harvard.edu/~purcell/plink/).
Parameters
- path to plink (example: /Users/alexandroskanterakis/Tools/plink/plink-1.07-mac-intel/plink)
- path to map file (example: /Users/alexandroskanterakis/Data/Finnish_cohort/CD_Finnuncorr.maf05.map)
- path to ped file (example: /Users/alexandroskanterakis/Data/Finnish_cohort/CD_Finnuncorr.maf05.ped)
- Directory where files will be exported (example: /Users/alexandroskanterakis/Data/Finnish_cohort/DividedChromosomes)
- Suffix of the exported files (example: output_)
Example
Code highlighting:
DividePedMapToChromosomes( plink_path= "/Users/alexandroskanterakis/Tools/plink/plink-1.07-mac-intel/plink", map_path="/Users/alexandroskanterakis/Data/Finnish_cohort/CD_Finnuncorr.maf05.map", ped_path="/Users/alexandroskanterakis/Data/Finnish_cohort/CD_Finnuncorr.maf05.ped", output_path="/Users/alexandroskanterakis/Data/Finnish_cohort/DividedChromosomes", suffix="output_")
Source code
http://www.bbmriwiki.nl/svn/Imputation/impute2/DividePedMapToChromosomes.py
ConvertListsOfPedAndMapFilesToGenAndSample?
Converts many Ped and Map genotype files usually used by plink to gen and sample files usually used by impute, chiamo, hapgen and snptest. The conversion tool that is used is gtool. All Python lists have to have the same size.
Parameters
- gtoolPath : Path to gtool (example: /Users/alexandroskanterakis/Tools/gtool/gtool)
- pedInputListOfFiles : Python list of input ped files
- mapInputListOfFiles : Python list of input map files
- sampleOutputListOfFiles : Python list of output sample files
- genOutputListOfFiles : Python list of output gen files
Example
Code highlighting:
output_divided_path = "/path/to/output/files" suffix = "_chr" ConvertListsOfPedAndMapFilesToGenAndSample( gtoolPath = "/Users/alexandroskanterakis/Tools/gtool/gtool", pedInputListOfFiles= [output_divided_path+ "/" + suffix + "_" + str(x) + ".ped" for x in range(1,22)+['X', 'Y']], mapInputListOfFiles=[output_divided_path+ "/" + suffix + "_" + str(x) + ".map" for x in range(1,22)+['X', 'Y']], sampleOutputListOfFiles=[output_divided_path+ "/" + suffix + "_" + str(x) + ".sample" for x in range(1,22)+['X', 'Y']], genOutputListOfFiles=[output_divided_path+ "/" + suffix + "_" + str(x) + ".gen" for x in range(1,22)+['X', 'Y']] )
Source code
http://www.bbmriwiki.nl/svn/Imputation/impute2/ConvertListsOfPedAndMapFilesToGenAndSample.py
UseImpute2WithOnePhasedReferencePanel
Run impute2 (https://mathgen.stats.ox.ac.uk/impute/impute_v2.html) with one phased reference panel.
Parameters
- impute2Path: Path of the impute2 tool (example: /Users/alexandroskanterakis/Tools/imputation/impute_v2.1.0_MacOSX_Intel/impute2)
- mParameter: Fine-scale recombination map for the region to be analyzed. (example: Users/alexandroskanterakis/Data/HAPMAP_1000GP/hapmap3_r2_plus_1000g_jun2010_b36_ceu/genetic_map_chr1_combined_b36.txt )
- hParameter: File of known haplotypes, with one row per SNP and one column per haplotype (example: /Users/alexandroskanterakis/Data/HAPMAP_1000GP/hapmap3_r2_plus_1000g_jun2010_b36_ceu/hapmap3.r2.b36.allMinusPilot1CEU.chr1.snpfilt.haps)
- lParameter: Legend file(s) with information about the SNPs in the -h file(s) (example: /Users/alexandroskanterakis/Data/HAPMAP_1000GP/hapmap3_r2_plus_1000g_jun2010_b36_ceu/hapmap3.r2.b36.allMinusPilot1CEU.chr1.snpfilt.legend)
- gParameter: File containing genotypes for a study cohort that we want to impute: (example: /Users/alexandroskanterakis/Data/Finnish_cohort/DividedChromosomes/Divided_1.gen)
- startPos: Start position of the genomic interval to use for inference (example: 1)
- endPos: End position of the genomic interval to use for inference (example: 5000000)
- Ne: Effective size' of the population (commonly denoted as Ne in the population genetics literature) from which your dataset was sampled (example: 11418)
- oParameter: Name of main output file. (example: /Users/alexandroskanterakis/Data/Finnish_cohort/DividedChromosomes/Divided_1.impute2)
Example
Code highlighting:
seImpute2WithOnePhasedReferencePanel( impute2Path="/Users/alexandroskanterakis/Tools/imputation/impute_v2.1.0_MacOSX_Intel/impute2", mParameter="/Users/alexandroskanterakis/Data/HAPMAP_1000GP/hapmap3_r2_plus_1000g_jun2010_b36_ceu/genetic_map_chr1_combined_b36.txt", hParameter="/Users/alexandroskanterakis/Data/HAPMAP_1000GP/hapmap3_r2_plus_1000g_jun2010_b36_ceu/hapmap3.r2.b36.allMinusPilot1CEU.chr1.snpfilt.haps", lParameter="/Users/alexandroskanterakis/Data/HAPMAP_1000GP/hapmap3_r2_plus_1000g_jun2010_b36_ceu/hapmap3.r2.b36.allMinusPilot1CEU.chr1.snpfilt.legend", gParameter="/Users/alexandroskanterakis/Data/Finnish_cohort/DividedChromosomes/Divided_1.gen", startPos=1, endPos=5000000, Ne=11418, oParameter="/Users/alexandroskanterakis/Data/Finnish_cohort/DividedChromosomes/Divided_1.impute2")
Source code
http://www.bbmriwiki.nl/svn/Imputation/impute2/UseImpute2WithOnePhasedReferencePanel.py
UseImpute2WithOnePhasedReferencePanelForCompleteChromosomeInBatches
Run impute2 (https://mathgen.stats.ox.ac.uk/impute/impute_v2.html) with one phased reference panel. Divide the chromosome in batches.
Parameters
- impute2Path: Path of impute2 (example: /Users/alex/Tools/impute2/impute_v2.1.2_MacOSX_Intel/impute2)
- mParameter: Fine-scale recombination map for the region to be analyzed (-m parameter) (example: /Volumes/Data2/Impute/alex/hapmap3_r2_plus_1000g_jun2010_b36_ceu/genetic_map_chr1_combined_b36.txt)
- hParameter: File of known haplotypes, with one row per SNP and one column per haplotype (-h parameter) (example: /Volumes/Data2/Impute/alex/hapmap3_r2_plus_1000g_jun2010_b36_ceu/hapmap3.r2.b36.allMinusPilot1CEU.chr1.snpfilt.haps)
- lParameter: Legend file(s) with information about the SNPs in the -h file(s) (-l parameter) (example: /Volumes/Data2/Impute/alex/hapmap3_r2_plus_1000g_jun2010_b36_ceu/hapmap3.r2.b36.allMinusPilot1CEU.chr1.snpfilt.legend)
- gParameter: File containing genotypes for a study cohort that we want to impute (-g parameter) (example: /Volumes/Data2/Impute/alex/gen_sample/chr1.gen)
- sizeOfBatch: Size of each batch (example: 5000000)
- Ne: Effective size of the population (commonly denoted as Ne in the population genetics literature) from which your dataset was sampled (example: 11418)
- oParameter: Name of main output file (-o parameter) (example: /Volumes/Data2/Impute/alex/gen_sample/)
- suffix: suffix for output fil (example: chr_1_)
- indexOfChromosome: Chromosome (1-22, X, Y, M) (example: 1)
Example
Code highlighting:
UseImpute2WithOnePhasedReferencePanelForCompleteChromosomeInBatches( impute2Path = "/Users/alex/Tools/impute2/impute_v2.1.2_MacOSX_Intel/impute2" , mParameter = "/Volumes/Data2/Impute/alex/hapmap3_r2_plus_1000g_jun2010_b36_ceu/genetic_map_chr1_combined_b36.txt" , hParameter = "/Volumes/Data2/Impute/alex/hapmap3_r2_plus_1000g_jun2010_b36_ceu/hapmap3.r2.b36.allMinusPilot1CEU.chr1.snpfilt.haps" , lParameter = "/Volumes/Data2/Impute/alex/hapmap3_r2_plus_1000g_jun2010_b36_ceu/hapmap3.r2.b36.allMinusPilot1CEU.chr1.snpfilt.legend" , gParameter = "/Volumes/Data2/Impute/alex/gen_sample/chr1.gen" , sizeOfBatch = 5000000 , Ne = 11418 , oParameter = "/Volumes/Data2/Impute/alex/gen_sample/" , suffix = "chr_1_" , indexOfChromosome = "1")
Source Code
BEABLE pipeline
TODO: paste shell script descriptions of each step
Discussion
Mixing platforms may influence imputation results
We into some troubles, resulting in our test statistic being highly inflated (which is indicative of false positive results). We thought of some possible causes which might explain this effect, although we should still test them:
- SNPs with bad imputation quality: we should remove SNPs with an R2 value < 0.90 prior to GWAS. These values are stored alongside the beagle imputation output. Taking a more stringent cutoff seemed to decrease the inflation, although you lose half of the SNPs.
- batch effects caused by overrepresentation of a certain haplotype within an imputation batch: for each batch of samples, beagle estimates a best fitting model to predict the genotypes of the missing SNPs, which is dependent upon both the input data as the reference dataset. Cases and controls should be therefore randomly distributed across the batches. Another option is to use impute, rather than beagle, since its batches are across parts of the genome, instead of samples.
- difference in source platform: different platforms have different SNP content. When you impute datasets coming from different platforms, the resulting model which is based on the input data is also different. When associating traits in a GWAS meta-analysis, these differences may account for a platform specific effect. We should therefore remove the SNPs which are non-overlapping between such platforms, prior to imputation, and impute the samples after combining the datasets. This would remove such a platform-bias, although would also cause a huge loss of available SNPs, when the overlap between platforms is small. However, in my opinion, this problem is similar to the batch effect problem, and can possibly be resolved by randomizing the sample content of the batches: the model will then possibly be fitted to the data that is available. In any case the datasets that are used in a meta-analysis should be imputed together.