= SNP calling pipeline =

Status: Alpha

Authors: Freerk van Dijk, Morris Swertz

This is the documentation of the BBMRI-NL SNP calling pipeline, which is based on the [http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit Broad GATK]. It consists of the following three workflows:

 * Workflow 1: SnpCallingPipeline/ReferencePreparation
 * Workflow 2: SnpCallingPipeline/AlignmentAndCleaning
 * Workflow 3: SnpCallingPipeline/VariantCalling

== Schematic Overview ==

This simplified schema hides intermediate sort and indexing steps and shows each data input/output only the first time it occurs.

{{{#!graphviz
digraph g {
  size="10,10"
  node [shape=box,style=filled,color=white]
  "dbsnp" "reference.fasta" "realign.intervals" "indelcalls.vcf" "chr[1-24].fasta"
  "flowcell_lane.1.fq.gz" "flowcell_lane.2.fq.gz"
  "flowcell_lane.aligned.bam" "flowcell_lane2.aligned.bam" "flowcell_lane3.aligned.bam"
  "sample.aligned.bam" "sample QC reports" "sample_chr[1-24].vcf"
  node [shape=ellipse,color=yellow]

  subgraph cluster_0 {
    style=filled;
    color=lightgrey;
    "reference.fasta" -> RealignerTargetCreator -> "realign.intervals"
    "indelcalls.vcf" -> RealignerTargetCreator
    "reference.fasta" -> Split -> "chr[1-24].fasta"
    dbsnp -> RealignerTargetCreator
    label = "Per genome (1)";
  }

  subgraph cluster_1 {
    style=filled;
    color=lightgrey;
    "flowcell_lane.1.fq.gz" -> align1 -> alignPE
    "chr[1-24].fasta" -> align1
    "chr[1-24].fasta" -> align2
    "chr[1-24].fasta" -> alignPE
    "flowcell_lane.2.fq.gz" -> align2 -> alignPE -> MarkDuplicates -> "IndelRealigner & \n FixMateInformation (knownsOnly)" -> "Quality Recalibration" -> "flowcell_lane.aligned.bam"
    "realign.intervals" -> "IndelRealigner & \n FixMateInformation (knownsOnly)"
    label = "Per Lane (750*3=2250)";
  }

  subgraph cluster_2 {
    style=filled;
    color=lightgrey;
    "flowcell_lane.aligned.bam" -> Merge -> "sample.aligned.bam" -> "IndelRealigner & FixMateInformation"
    "flowcell_lane2.aligned.bam" -> Merge
    "flowcell_lane3.aligned.bam" -> Merge
    "IndelRealigner & FixMateInformation" -> IndelGenotyperV2 -> FilterSingleCalls -> UnifiedGenotyper -> Filtration -> VariantEval -> "sample QC reports"
    Filtration -> "sample_chr[1-24].vcf"
    label = "Per Sample or Trio*Chromosome (750*24=18k)";
  }

  subgraph cluster_3 {
    style=filled;
    color=lightgrey;
    "sample.aligned.bam" -> "TODO compare against arrays and BGI output"
    label = "QC per sample";
  }
}
}}}

Discussion
 * How long does alignment take per genome?
   * If it takes very long, we can split the read files
 * How long does realignment with knownsOnly take (per genome)?
   * If it takes very long, we need to rewrite workflow 2 to split before realignment
 * For realignment: if we split per chromosome, can we also split the BAM file?
 * How can we easily lift over from b36 to b37?
   * Ask BGI whether they can use b37

Todo:
 * Recode workflow 2 to work per genome instead of per chromosome
 * QC pipeline
 * Simple SNP caller
 * GATK variant eval to make Venn diagrams
   * Contact Yurii for this; let Jeroen take charge?

== List of steps ==

[[TOC(SnpCallingPipeline/ReferencePreparation,SnpCallingPipeline/AlignmentAndCleaning,SnpCallingPipeline/VariantCalling,inline,noheading)]]
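
The cluster labels in the schematic overview give rough job counts per stage (750*3=2250 per lane, 750*24=18k per sample and chromosome). The sketch below recomputes those figures so they can be adjusted if the cohort changes. Reading the labels as 750 samples, 3 lanes per sample and 24 chromosome files is an assumption, and the function and constant names are hypothetical, not part of the pipeline.

```python
# Job-count estimate per pipeline stage, based on the cluster labels in the
# schematic. ASSUMPTION: 750 = number of samples, 3 = lanes per sample,
# 24 = number of chromosome files (chr[1-24]). All names are illustrative.
N_SAMPLES = 750
LANES_PER_SAMPLE = 3
N_CHROMOSOMES = 24


def job_counts(n_samples=N_SAMPLES,
               lanes_per_sample=LANES_PER_SAMPLE,
               n_chromosomes=N_CHROMOSOMES):
    """Return the number of jobs each stage would run, keyed by stage."""
    return {
        # Workflow 1: reference preparation runs once per genome build.
        "per_genome": 1,
        # Workflow 2: alignment and cleaning runs once per lane.
        "per_lane": n_samples * lanes_per_sample,
        # Workflow 3: variant calling runs per sample and chromosome.
        "per_sample_chromosome": n_samples * n_chromosomes,
    }


if __name__ == "__main__":
    for stage, count in job_counts().items():
        print(f"{stage}: {count}")
```

Calling `job_counts(n_samples=100)` would give the same estimate for a smaller cohort; multiplying each count by a measured runtime per job would be one way to approach the timing questions in the discussion list.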