== Introduction ==
!ImputationTool is a collection of methods for pre- and post-analysis in imputation-related tasks.

=== Implementation ===
!ImputationTool developers:
 * Dr. Lude Franke (Lude@ludesign.nl): Format design, initial methods.
 * Harm-Jan Westra (harm-jan@attix.nl): Extensions, format converters, SNP checks.

It is written in Java, using !NetBeans.

=== Availability ===
 * http://www.bbmriwiki.nl/svn/Imputation/alex/ImputationTool/

== Documentation ==
From the !ImputationTool help screen:
{{{
ImputationTool v0.2

------------------------
PreProcessing
------------------------
# Create random batches of cases and controls from a TriTyper dataset. Creates a file called batches.txt in outdir.
--mode batch --in TriTyperdir --out outdir --size batchsize

------------------------
Imputation
------------------------
# Convert Impute Imputed data into TriTyper
--mode itt --in ImputeDir --out TriTyperDir

------------------------
Beagle
------------------------
# Convert beagle files (one file/chromosome) to TriTyper. Filetemplate is a template for the batch filenames, The text CHROMOSOME will be replaced by the chromosome number.
--mode btt --in BeagleDir --tpl template --ext ext --out TriTyperDir [--fam famfile]

# Convert batches of beagle files (multiple files / chromosome) to trityper files. Filetemplate is a template for the batch filenames, The text CHROMOSOME will be replaced by the chromosome number, BATCH by the batchname.
--mode bttb --in BeagleDirdir --tpl template --out TriTyperDir --size numbatches

------------------------
Ped+Map (Plink files)
------------------------
# Converts Ped and Map files created by ttpmh to Beagle format
--mode pmbg --in indir --batch-file batches.txt

# Converts TriTyper file to Plink Dosage format. Filetemplate is a template for the batch filenames, The text CHROMOSOME will be replaced by the chromosome number, BATCH by the batchname.
--mode ttpd --in indir --beagle beagledir --tpl template --batchdesc batchdescriptor --out outdir --fam famfile

# Converts PED and MAP files to TriTyper.
--mode pmtt --in Ped+MapDir --out TriTyperDir

# Converts TriTyper file to PED and MAP files. The FAM file is optional. --split splits the ped and map files per chromosome
--mode ttpm --in indir --out outdir [--fam famfile] [--split]

# Converts TriTyper dataset to Ped+Map concordant to reference (hap) dataset. Supply a batchfile if you want to export in batches. Supply a chromosome if you want to export a certain chromosome.
--mode ttpmh --in TriTyperDir --hap TriTyperReferenceDir --out outdir [--fam famfile] [--batch-file batchfile] [--chr chromosome] [--exclude fileName]

---------------------
PostProcessing
---------------------
# Correlates genotypes of imputed vs non-imputed datasets. Saves a file called correlationOutput.txt in outdir, containing correlation per chromosome as well as correlation distribution.
--mode corr --in TriTyperDir --name datasetname --in2 TriTyperDir2 --name2 datasetname2 --out outdir [--snps snplist]

# Correlates genotypes of imputed vs non-imputed datasets. Also take Beagle imputation score (R2) into account. Saves a file called correlationOutput.txt in outdir, containing correlation per chromosome as well as correlation distribution.
--mode corrb --in TriTyperDir --name datasetname --in2 TriTyperDir2 --name2 datasetname2 --out outdir --beagle beagleDir --tpl template --size numBatches

# Gets all the excluded snps from chrx.excludedsnps.txt with a certain call-rate threshold (0 < threshold < 1.0)
--mode ecra --in TriTyperDir --threshold threshold

# Generates R2 distribution (beagle quality score) for each batch and chromosome, and tests each batch against chromosome R2 distribution, using WilcoxonMannWhitney test
--mode r2dist --in BeagleDir --template template --out outdir --size numbatches

# Merge two TriTyper datasets
--mode merge --in TriTyper1Dir --in2 TryTyper2Dir --out outdir
}}}

=== Example ===
The following is a common scenario using !ImputationTool. Suppose that we have the following directory structure:
 * study
   * study.ped
   * study.map
 * reference
   * reference.ped
   * reference.map

To impute the study against the reference with Beagle:
 * mkdir study_TriTyper
 * Convert the study ped/map data to !TriTyper:
   * java -jar !ImputationTool.jar --mode pmtt --in study/ --out study_TriTyper
 * mkdir reference_TriTyper
 * Convert the reference ped/map data to !TriTyper:
   * java -jar !ImputationTool.jar --mode pmtt --in reference/ --out reference_TriTyper
 * mkdir batches
 * Create batches of 300 samples of the study:
   * java -jar !ImputationTool.jar --mode batch --in study_TriTyper --out batches/ --size 300
 * mkdir reference_analyzed
 * Export the reference to Ped+Map as a first step toward Beagle format (this also performs some quality checks):
   * java -jar !ImputationTool.jar --mode ttpmh --in reference_TriTyper --hap reference_TriTyper --out reference_analyzed
 * Convert the exported reference to Beagle format (repeat for the remaining chromosomes):
   * java -jar linkage2beagle.jar reference_analyzed/chr1.dat reference_analyzed/chr1.ped > reference_analyzed/chr1.bgl
 * mkdir study_reference_compare
 * Perform a comparison between reference and study:
   * java -jar !ImputationTool.jar --mode ttpmh --in study_TriTyper --hap reference_TriTyper --batch-file batches/batches.txt --out study_reference_compare
 * Convert the analyzed study to Beagle format:
   * java -jar linkage2beagle.jar study_reference_compare/chr1.dat study_reference_compare/chr1.ped > study_reference_compare/chr1.bgl
 * mkdir RESULTS
 * Run the imputation step (requires Beagle):
   * java -jar beagle.jar phased=reference_analyzed/chr1.bgl unphased=study_reference_compare/chr1.bgl markers=reference_analyzed/chr1.markersBeagleFormat missing=0 out=RESULTS/output

=== The !TriTyper Format ===
!TriTyper is a binary format for storing genotype information, including insertion, deletion and expression data. It provides very efficient read/write/seek methods.

=== Filtering ===
In the '''ttpmh''' mode, !ImputationTool applies the following filtering steps when comparing a study (GWAS) dataset to a reference dataset:
 * Assesses the alleles and swaps the SNP if needed. For example, with reference alleles C/T and GWAS alleles A/G, the GWAS SNP needs to be swapped and inverted to become C/T (A/G reads T/C on the opposite strand).
 * Checks the Hardy-Weinberg equilibrium p-value, the minor allele frequency (MAF) and the call rate. A SNP is removed if its Hardy-Weinberg p-value is <= 0.0001, its MAF is < 0.01, or its call rate is < 0.95.
 * Checks whether the SNP is present in the reference data. If not, the SNP is removed from the GWAS data.
 * Checks whether the SNP has null alleles. If so, the SNP is removed from the GWAS data.
 * Checks whether the allele frequency is comparable to the reference. If not (>25% difference), the SNP is removed from the GWAS data.
 * Assesses whether the haplotype structure is comparable between the reference and the GWAS data. This is performed by pairwise comparison of r-squared between SNPs in both reference and GWAS. For SNPs in LD (r-squared > 0.1), the allele frequencies are compared. SNPs are removed from the GWAS data when the major allele differs more often than it is identical.
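
The allele check and the threshold checks listed above can be illustrated with a short Python sketch. This is not the Java code of !ImputationTool; it is a minimal illustration under stated assumptions. The function names (`harmonize_alleles`, `passes_qc`, `frequency_comparable`) are hypothetical, the thresholds are the ones listed above, and the ">25% difference" rule is read here as an absolute allele frequency difference of 0.25, which is an assumption. Strand-ambiguous SNPs (A/T, C/G) and the haplotype structure check are not handled.

```python
# Hypothetical sketch of some ttpmh-style filters; NOT ImputationTool's actual code.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Thresholds taken from the filtering list above.
HWE_P_MIN = 0.0001     # SNP removed if HWE p-value <= this
MAF_MIN = 0.01         # SNP removed if MAF < this
CALL_RATE_MIN = 0.95   # SNP removed if call rate < this
MAX_FREQ_DIFF = 0.25   # assumption: "25% difference" read as absolute difference


def harmonize_alleles(ref, gwas):
    """Find how to turn the GWAS allele pair into the reference pair.

    Returns (strand_flipped, order_swapped), or None when the alleles are
    incompatible or contain a null/unknown allele (SNP would be removed).
    """
    ref, gwas = tuple(ref), tuple(gwas)
    if any(a not in COMPLEMENT for a in ref + gwas):
        return None  # null or non-ACGT allele
    for flipped in (False, True):
        pair = tuple(COMPLEMENT[a] for a in gwas) if flipped else gwas
        if pair == ref:
            return (flipped, False)
        if pair[::-1] == ref:
            return (flipped, True)
    return None


def passes_qc(hwe_p, maf, call_rate):
    """True if the SNP survives the HWE, MAF and call-rate thresholds."""
    return hwe_p > HWE_P_MIN and maf >= MAF_MIN and call_rate >= CALL_RATE_MIN


def frequency_comparable(ref_freq, gwas_freq):
    """True if the allele frequencies differ by at most MAX_FREQ_DIFF."""
    return abs(ref_freq - gwas_freq) <= MAX_FREQ_DIFF


if __name__ == "__main__":
    # The example from the list: reference C/T, GWAS A/G.
    print(harmonize_alleles(("C", "T"), ("A", "G")))  # (True, True)
    print(passes_qc(hwe_p=0.5, maf=0.2, call_rate=0.99))
    print(frequency_comparable(0.30, 0.40))
```

In this sketch a SNP would be kept only if `harmonize_alleles` returns a result and both `passes_qc` and `frequency_comparable` return `True`.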