Introduction
ImputationTool is a collection of methods for pre- and post-processing in imputation-related tasks.
Implementation
ImputationTool developers:
- Dr. Lude Franke (Lude@…): Format design, Initial methods.
- Harm-Jan Westra (harm-jan@…): Extensions, Format converters, SNPs checks.
It is written in Java, using the NetBeans IDE.
Availability
Documentation
From the ImputationTool help screen:
    ImputationTool v0.2

    ------------------------ PreProcessing ------------------------
    # Create random batches of cases and controls from a TriTyper dataset. Creates a file called batches.txt in outdir.
    --mode batch --in TriTyperdir --out outdir --size batchsize

    ------------------------ Imputation ------------------------
    # Convert Impute Imputed data into TriTyper
    --mode itt --in ImputeDir --out TriTyperDir

    ------------------------ Beagle ------------------------
    # Convert beagle files (one file/chromosome) to TriTyper. Filetemplate is a template for the batch filenames, The text CHROMOSOME will be replaced by the chromosome number.
    --mode btt --in BeagleDir --tpl template --ext ext --out TriTyperDir [--fam famfile]

    # Convert batches of beagle files (multiple files / chromosome) to trityper files. Filetemplate is a template for the batch filenames, The text CHROMOSOME will be replaced by the chromosome number, BATCH by the batchname.
    --mode bttb --in BeagleDirdir --tpl template --out TriTyperDir --size numbatches

    ------------------------ Ped+Map (Plink files) ------------------------
    # Converts Ped and Map files created by ttpmh to Beagle format
    --mode pmbg --in indir --batch-file batches.txt

    # Converts TriTyper file to Plink Dosage format. Filetemplate is a template for the batch filenames, The text CHROMOSOME will be replaced by the chromosome number, BATCH by the batchname.
    --mode ttpd --in indir --beagle beagledir --tpl template --batchdesc batchdescriptor --out outdir --fam famfile

    # Converts PED and MAP files to TriTyper.
    --mode pmtt --in Ped+MapDir --out TriTyperDir

    # Converts TriTyper file to PED and MAP files. The FAM file is optional. --split splits the ped and map files per chromosome
    --mode ttpm --in indir --out outdir [--fam famfile] [--split]

    # Converts TriTyper dataset to Ped+Map concordant to reference (hap) dataset. Supply a batchfile if you want to export in batches. Supply a chromosome if you want to export a certain chromosome.
    --mode ttpmh --in TriTyperDir --hap TriTyperReferenceDir --out outdir [--fam famfile] [--batch-file batchfile] [--chr chromosome] [--exclude fileName]

    --------------------- PostProcessing ---------------------
    # Correlates genotypes of imputed vs non-imputed datasets. Saves a file called correlationOutput.txt in outdir, containing correlation per chromosome as well as correlation distribution.
    --mode corr --in TriTyperDir --name datasetname --in2 TriTyperDir2 --name2 datasetname2 --out outdir [--snps snplist]

    # Correlates genotypes of imputed vs non-imputed datasets. Also take Beagle imputation score (R2) into account. Saves a file called correlationOutput.txt in outdir, containing correlation per chromosome as well as correlation distribution.
    --mode corrb --in TriTyperDir --name datasetname --in2 TriTyperDir2 --name2 datasetname2 --out outdir --beagle beagleDir --tpl template --size numBatches

    # Gets all the excluded snps from chrx.excludedsnps.txt with a certain call-rate threshold (0 < threshold < 1.0)
    --mode ecra --in TriTyperDir --threshold threshold

    # Generates R2 distribution (beagle quality score) for each batch and chromosome, and tests each batch against chromosome R2 distribution, using WilcoxonMannWhitney test
    --mode r2dist --in BeagleDir --template template --out outdir --size numbatches

    # Merge two TriTyper datasets
    --mode merge --in TriTyper1Dir --in2 TryTyper2Dir --out outdir
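The help text says that the text CHROMOSOME in a file template is replaced by the chromosome number and BATCH by the batch name. A minimal sketch of that substitution, assuming it is a plain literal string replacement; the function name `expand_template` and the template string below are hypothetical and not taken from ImputationTool:

```python
def expand_template(template, chromosome, batch=None):
    """Fill in a file-name template the way the help text describes.

    Assumption: CHROMOSOME and BATCH are replaced literally, wherever they occur.
    """
    name = template.replace("CHROMOSOME", str(chromosome))
    if batch is not None:
        name = name.replace("BATCH", str(batch))
    return name


# Hypothetical template, for illustration only.
print(expand_template("study.chrCHROMOSOME.BATCH", 1, "batch0"))
```

Whether the tool expects a file extension inside the template or through a separate option (such as --ext in btt mode) is not covered by this sketch.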
Example
This is a common scenario using ImputationTool: Suppose that we have the following directory structure:
- study
  - study.ped
  - study.map
- reference
  - reference.ped
  - reference.map
To impute the study vs. the reference with beagle:
- mkdir study_TriTyper
- Convert study ped/map data to TriTyper:
- java -jar ImputationTool.jar --mode pmtt --in study/ --out study_TriTyper
- mkdir reference_TriTyper
- Convert reference ped/map data to TriTyper:
- java -jar ImputationTool.jar --mode pmtt --in reference/ --out reference_TriTyper
- mkdir batches
- Create batches of 300 samples of the study:
- java -jar ImputationTool.jar --mode batch --in study_TriTyper --out batches/ --size 300
- mkdir reference_analyzed
- Export the reference to Ped+Map with ttpmh (this also performs some quality checks):
- java -jar ImputationTool.jar --mode ttpmh --in reference_TriTyper --hap reference_TriTyper --out reference_analyzed
- Convert the exported reference to Beagle format (repeat for the remaining chromosomes):
- java -jar linkage2beagle.jar reference_analyzed/chr1.dat reference_analyzed/chr1.ped > reference_analyzed/chr1.bgl
- mkdir study_reference_compare
- Perform a comparison between reference and study.
- java -jar ImputationTool.jar --mode ttpmh --in study_TriTyper --hap reference_TriTyper --batch-file batches/batches.txt --out study_reference_compare
- Convert analyzed study to beagle:
- java -jar linkage2beagle.jar study_reference_compare/chr1.dat study_reference_compare/chr1.ped > study_reference_compare/chr1.bgl
- mkdir RESULTS
- Run the imputation step (requires Beagle):
- java -jar beagle.jar phased=reference_analyzed/chr1.bgl unphased=study_reference_compare/chr1.bgl markers=reference_analyzed/chr1.markersBeagleFormat missing=0 out=RESULTS/output
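The linkage2beagle and Beagle commands above are shown for chromosome 1 only and have to be repeated for the other chromosomes. A minimal sketch of a wrapper that builds the same command lines per chromosome, assuming chromosomes 1 to 22 and a per-chromosome output prefix (`RESULTS/chr<N>`); both are assumptions of this sketch, and the function names are hypothetical. By default it only prints the commands:

```python
import subprocess


def chromosome_commands(chrom):
    """Return (argv, stdout_path) pairs for one chromosome.

    stdout_path is the file that receives the command's standard output,
    or None when the command writes its own output files.
    """
    ref = f"reference_analyzed/chr{chrom}"
    study = f"study_reference_compare/chr{chrom}"
    return [
        (["java", "-jar", "linkage2beagle.jar", f"{ref}.dat", f"{ref}.ped"],
         f"{ref}.bgl"),
        (["java", "-jar", "linkage2beagle.jar", f"{study}.dat", f"{study}.ped"],
         f"{study}.bgl"),
        (["java", "-jar", "beagle.jar",
          f"phased={ref}.bgl", f"unphased={study}.bgl",
          f"markers={ref}.markersBeagleFormat", "missing=0",
          # Assumption: one output prefix per chromosome instead of RESULTS/output.
          f"out=RESULTS/chr{chrom}"],
         None),
    ]


def run_all(chromosomes, dry_run=True):
    """Print the commands (dry run) or execute them in order."""
    for chrom in chromosomes:
        for argv, stdout_path in chromosome_commands(chrom):
            if dry_run:
                redirect = f" > {stdout_path}" if stdout_path else ""
                print(" ".join(argv) + redirect)
            elif stdout_path:
                with open(stdout_path, "w") as out:
                    subprocess.run(argv, stdout=out, check=True)
            else:
                subprocess.run(argv, check=True)


# Assumption: autosomes 1-22; adjust to the chromosomes actually exported.
run_all(range(1, 23))
```

Calling `run_all(range(1, 23), dry_run=False)` would execute the commands; this requires the jar files in the working directory and the directories created in the earlier steps.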
The TriTyper Format
TriTyper is a binary format to store genotype information, including insertion, deletion and expression data, providing very efficient read/write/seek methods.
Filtering
In ttpmh mode, ImputationTool applies the following filtering steps when comparing a study (GWAS) dataset to a reference dataset:
- Assesses the alleles and swaps the SNP if needed.
  Example: reference C/T, GWAS A/G: the GWAS alleles need to be swapped and inverted to become C/T.
- Checks quality thresholds: the SNP is removed if the Hardy-Weinberg equilibrium p-value <= 0.0001, MAF < 0.01 or call rate < 0.95.
- Checks if the SNP is present in the reference data. If not, the SNP is removed from the GWAS data.
- Checks if the SNP has null alleles. If so, the SNP is removed from the GWAS data.
- Checks if the allele frequency is comparable to the reference. If not (>25% difference), the SNP is removed from the GWAS data.
- Assesses if the haplotype structure is comparable between reference and GWAS data. This is performed by pairwise comparison of r-squared between SNPs in both reference and GWAS. For SNPs in LD (r-squared > 0.1), the allele frequencies are compared. SNPs are removed from the GWAS data when the major allele differs more often than it is identical.
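Some of the per-SNP checks listed above can be illustrated in a few lines. This is only a sketch of the described rules, not ImputationTool's actual code: the function names are hypothetical, the Hardy-Weinberg value is assumed to be a test p-value, and the 25% allele frequency difference is assumed to be an absolute difference in frequency. The haplotype structure (LD) check is not covered.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def align_to_reference(gwas, ref):
    """Describe how a GWAS allele pair maps onto the reference allele pair.

    Returns "identical", "swap", "complement", "complement+swap",
    or None when the alleles cannot be reconciled.
    Strand-ambiguous pairs (A/T, C/G) get no special treatment in this sketch.
    """
    gwas, ref = tuple(gwas), tuple(ref)
    if gwas == ref:
        return "identical"
    if gwas[::-1] == ref:
        return "swap"
    complemented = tuple(COMPLEMENT[a] for a in gwas)
    if complemented == ref:
        return "complement"
    if complemented[::-1] == ref:
        return "complement+swap"
    return None


def passes_qc(hwe_p, maf, call_rate):
    """Keep a SNP only if it clears all three thresholds from the list above."""
    return hwe_p > 0.0001 and maf >= 0.01 and call_rate >= 0.95


def frequency_comparable(gwas_freq, ref_freq, max_diff=0.25):
    """Assumption: 'difference' means absolute difference in allele frequency."""
    return abs(gwas_freq - ref_freq) <= max_diff


# The example from the list: reference C/T, GWAS A/G.
print(align_to_reference(("A", "G"), ("C", "T")))
print(passes_qc(hwe_p=0.5, maf=0.2, call_rate=0.99))
print(frequency_comparable(0.30, 0.60))
```

For the C/T versus A/G example, the sketch reports that the GWAS alleles must be complemented (A/G becomes T/C) and then swapped to match C/T, in line with the "swapped and inverted" description.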