= Imputation pipeline =
[[TOC()]]

There are currently two collaborating initiatives:

 * VU: Mathijs Kattenberg and Joukejan Hottenga using IMPUTE
 * UMCG: Lude Franke, Harm-Jan Westra, George Byelas, Morris Swertz using Beagle

The objective is to bring these pipelines into the same space so they can be properly compared and optimized.

TODO: describe the protocols here.

== Description from Harm-Jan ==

The imputation pipeline has been reduced to only a few steps. To facilitate the QC and conversion steps, I've bundled our conversion tools into a single program called ImputationTool.jar.

Below I briefly describe the steps that need to be in the new pipeline. The code blocks use placeholder variables to show what the commands could look like if you implemented this as a shell script (or Java program). Together, these examples can serve as the complete execution steps of the pipeline.

'' Commands to run locally: '' 
 1. if the dataset is in binary plink format, use plink --recode to convert it back to ped+map
 2. convert the dataset to TriTyper format, if it is in ped+map format
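Step 1 could look like the sketch below. `--bfile`, `--recode` and `--out` are standard plink options; the helper name `recode_if_binary` and the idea of passing the dataset prefix (the path of the .bed/.bim/.fam files without extension) are assumptions made for illustration, not part of the original pipeline.

```shell
# Hypothetical helper: recode a binary plink dataset to ped+map, if one exists.
# $1 = dataset prefix (assumed), i.e. the path to .bed/.bim/.fam without extension
recode_if_binary() {
    prefix=$1
    if [ -f "$prefix.bed" ]; then
        # writes $prefix.ped and $prefix.map next to the binary files
        plink --bfile "$prefix" --recode --out "$prefix"
    else
        echo "no binary plink files at $prefix, skipping recode"
    fi
}
```

Calling `recode_if_binary` with a prefix that has no .bed file only prints the message, so it is safe to run on datasets that are already in ped+map format.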
{{{  
java -Xmx4g -jar ImputationTool.jar pmtt $plinkLocation $trityperOutputLocation
}}}
 3. compare the dataset to be imputed with the reference dataset (for example HapMap2 release 24, also in TriTyper format), and remove any SNPs whose haplotypes differ from, or do not correlate with, the reference dataset. Also remove any SNP that is not in the reference. Save the output as ped+map.
{{{
java -Xmx4g -jar ImputationTool.jar ttpmh $trityperOutputLocation $referenceLocation $pedAndMapOutputLocation [$famFile] # the fam file is optional; supply it if you have one
}}}
 4. split the ped files into batches of 300 samples
{{{
# a ped file has one sample per line, so $batchSize (300) lines = 300 samples
mkdir -p "$datasetLocation/batches/"
split -a 2 -l "$batchSize" "$pedAndMapOutputLocation" "$batchOutputLocation"
}}}
 5. run linkage2beagle to convert the ped and map files to Beagle format
{{{
# $batches is assumed to hold the batch suffixes produced by split (aa, ab, ...)
for batch in $batches
do
    java -Xmx7g -jar linkage2beagle.jar data=$batchOutputLocation/chr$chromosome.dat pedigree=$batchOutputLocation/chr$chromosome.ped.$batch beagle=$beagleLocation/chr$chromosome.bgl.$batch
done
}}}
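
To illustrate how steps 4 and 5 fit together, here is a small self-contained sketch using a toy ped file. `split -a 2` names the pieces with two-letter suffixes (aa, ab, ...), which can then serve as the `$batch` value in the loop of step 5. The file names, the toy genotypes and the batch size of 3 are assumptions chosen only for this illustration.

```shell
# Toy illustration only: 7 fake samples, batches of 3 (the real pipeline uses 300).
workDir=$(mktemp -d)
pedFile="$workDir/chr1.ped"
batchSize=3

# a ped file has one sample per line, so splitting by lines splits by samples
for i in 1 2 3 4 5 6 7
do
    echo "FAM$i IND$i 0 0 1 0 A A G G" >> "$pedFile"
done

mkdir -p "$workDir/batches"
split -a 2 -l "$batchSize" "$pedFile" "$workDir/batches/chr1.ped."

# recover the batch suffixes from the file names written by split
batches=""
for f in "$workDir"/batches/chr1.ped.*
do
    batches="$batches ${f##*.}"
done
echo "batches:$batches"
```

With 7 samples and a batch size of 3 this yields three batches; the last one simply contains the remaining sample.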

'' Commands to run on the server: ''
 6. run the actual imputation on the batches on the cluster (HapMap needs to be recoded to Beagle format as well, but I have these files for you)
{{{
# $batches is assumed to hold the batch suffixes produced by split (aa, ab, ...)
for batch in $batches
do
    java -Xmx11g -Djava.io.tmpdir=$TMPDIR -jar beagle.jar unphased=$beagleLocation/chr$chromosome.bgl.$batch phased=$referenceLocation/HM2_Chr$chromosome-BEAGLE markers=$referenceLocation/markers_Chr$chromosome.txt missing=0 out=$outputLocation/Chr$chromosome/chr$chromosome-$batch
done
}}}
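
Because the imputation step runs once per chromosome and per batch, it can help to print the command lines first and inspect them before submitting anything to the cluster. The sketch below only prints the commands from step 6 and runs nothing. The chromosome range 1 to 22, the two example batch suffixes, the example paths and the function name `print_beagle_commands` are all assumptions; how the printed commands are submitted depends on your cluster.

```shell
# Dry run: print one beagle command per chromosome and batch, run nothing.
# All values below are placeholders (assumptions) for illustration.
beagleLocation=/data/beagle
referenceLocation=/data/reference
outputLocation=/data/imputed
batches="aa ab"

print_beagle_commands() {
    for chromosome in $(seq 1 22)
    do
        for batch in $batches
        do
            # \$TMPDIR stays unexpanded here so that it is resolved on the
            # machine that eventually runs the command
            echo "java -Xmx11g -Djava.io.tmpdir=\$TMPDIR -jar beagle.jar" \
                 "unphased=$beagleLocation/chr$chromosome.bgl.$batch" \
                 "phased=$referenceLocation/HM2_Chr$chromosome-BEAGLE" \
                 "markers=$referenceLocation/markers_Chr$chromosome.txt" \
                 "missing=0" \
                 "out=$outputLocation/Chr$chromosome/chr$chromosome-$batch"
        done
    done
}
```

Redirecting the output, for example `print_beagle_commands > beagle_commands.txt`, gives one line per job that can be reviewed or fed to a job scheduler.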

'' Commands to run locally: ''
 7. convert the Beagle-imputed files into TriTyper format
{{{
java -Xmx4g -jar ImputationTool.jar bttb $outputLocation Chr/ChrCHROMOSOME-BATCH $imputedTriTyperLocation $numSamples	
}}}
 8. correlate the imputed SNPs to the SNPs in the original dataset
{{{
java -Xmx4g -jar ImputationTool.jar corr $trityperOutputLocation $datasetName $imputedTriTyperLocation $imputedDatasetName 
}}}
 9. (if needed) convert to other formats (plink dosage / ped+map)

That's basically it. A lot simpler than the previous version, don't you think? The required tool is attached to this e-mail, but it might still be a bit buggy. Any recommendations are therefore more than welcome.

== IMPUTE pipeline ==

TODO: paste shell script descriptions of each step.

== Beagle pipeline ==

TODO: paste shell script descriptions of each step.