Version 41 (modified by 14 years ago) (diff) | ,
---|
Table of Contents
SNP calling pipeline
Status: Alpha
Authors: Freerk van Dijk, Morris Swertz
Based on Broad GATK pipeline.
To perform the analysis as fast and good as possible the pipeline has been divided into several small processes. These processes are all numbered and can be found below, including commands, input and output files starting with pre-alignment and ending with variation calling & filtering.
- SnpCallingPipeline/ReferencePreparation
- SnpCallingPipeline/AlignmentAndCleaning
- SnpCallingPipeline/VariantCalling
Simplified Overview
This simplified overview this schema hides intermediate sort and indexing steps and only shows data inputs/outputs first time they occur.
Workflow 1: genome reference file creation
This workflow creates reference files per chromosome including:
- genome, dbsnp and indel vcfs per chromosome
- realign targets for faster realignment target creation
- index files for samtools and bwa
Workflow inputs:
- genome.chr.fa - downloaded from genome supplier (now hg19)
- dbsnpXYZ.rod - downloaded reference SNPs from dbsnp (now 129)
- indelsXYZ.vcf - downloaded reference indels from 1KG
Workflow outputs:
- genome.chr.fa - cleaned headers
- genome.chr.fa.fa - index for samtools
- genome.chr.fa.<format> - multilple index files for bwa
- dbsnpXYZ.chr.rod - split per chromosome
- indelsXYZ.chr.vcf - split per chromosome
- genome.chr.realign.intervals - targets for realignment
clean-fasta-headers
Clean headers to only have '1' instead of Chr1, etc
tool: | |
inputs: | genome.chr.fa |
outputs: | genome.chr.fa |
doc: | internally developed |
split-vcf-chr for dbsnp and indels
Split vcf per chromosome
tool: | |
inputs: | dbsnpXYZ.rod, indelsXYZ.vcf |
outputs: | dbsnpXYz.chr.rod, indelsXYZ.vcf |
doc: |
Discussion:
Can we use http://vcftools.sourceforge.net/options.html ?
vcftools --vcf indelsXYZ.vcf --chr <i> --recode --out indelsXYZ.chr
index-chromosomes
Index reference sequence for each chromosome in the FASTA format
tool: | samtools faidx |
input: | genome.chr.fa |
output: | genome.chr.fa.fai |
doc: | http://samtools.sourceforge.net/samtools.shtml#3 |
bwa-index-chromosomes
Index reference sequence for each chromosome for bwa alignment
tool: | bwa index -a IS |
input: | genome.chr.fa |
output: | genome.chr.fa.xyz |
doc: | http://bio-bwa.sourceforge.net/bwa.shtml#3 |
RealignerTargetCreator
Generate realignment targets for known sites for each chromosome
tool: | GenomeAnalysisTK.jar -T RealignerTargetCreator? |
input: | genome.chr.fa, dbsnpXYz.chr.rod, indelsXYZ.vcf |
output: | genome.chr.realign.intervals |
doc: | http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Running_the_Indel_Realigner_only_at_known_sites |
Workflow 2: Alignment per Lane, per Chr
This workflow aligns reads per lane and chromosome, including:
- re-alignment to prevend false SNP calls caused by indels (using known indels)
- markduplicates to prevend false coverage caused by PCR errors (per library = lane)
- base quality recalibration to correct for false low scores caused by true variation
Workflow Inputs:
- lane.1.fq.gz - raw reads for lane, pair end 1
- lane.2.fq.gz - raw reads for lane, pair end 2
- genome.chr.fasta - reference genome split on chromosome
- genome.chr.realign.intervals - targets for realignment per chromosome
- genome.chr.dbsnpXYZ.rod - known snp variants, here from dpbsnp
- genome.chr.indelsXYZ.vcf - known indels from, here from 1KG
Workflow ouputs:
- lane.chr.1.sai - alignment index for first pair
- lane.chr.2.sai - alignment index for second pair
- lane.chr.sam - alignment map for
- lane.chr.bam - alignment map in binary format
- lane.chr.sorted.bam - sorted alignment map
- lane.chr.sorted.bai - sorted alignment index
- lane.chr.dedup.bam - marked duplicate PCR elements
- lane.chr.dedup.metrics - metrics describing deduplication
- lane.chr.realigned.bam - realigned based on known indels
- lane.chr.matefixed.bam - fixed the mate pair ends
- lane.chr.covariate_table.csv - table of countcovariates output for recalibration
- lane.chr.recal.bam - alignment map with recalibrated quality scores
align
Align each end of paired end.
tool: | bwa-align |
input: | chr.fasta, lane.1.fq.gz, lane.2.fq.gz |
output: | lane.chr.1.sai, lane.chr.2.sai |
docs: | http://bio-bwa.sourceforge.net/bwa.shtml |
align-pe
Align the pairs as one
tool: | bwa sampe |
inputs: | chr.fasta lane.1.fq.gz lane.2.fq.gz lane.chr.1.sai lane.chr.2.sai |
outputs: | lane.chr.sam |
docs: | http://bio-bwa.sourceforge.net/bwa.shtml |
sam-to-bam
Convert sam to bam
tool: | samtools view |
inputs: | lane.chr.sam |
outputs: | lane.chr.bam |
docs: | http://samtools.sourceforge.net/samtools.shtml |
(Question: can this not index and sort?)
sam-sort
Sort bam file on coordinate
tool: | samtools sort |
inputs: | lane.chr.bam |
outputs: | lane.chr.sorted.bam |
docs: | http://samtools.sourceforge.net/samtools.shtml |
sam-index
Index bam file for quicker access
tool: | samtools index |
inputs: | lane.chr.sorted.bam |
outputs: | lane.chr.sorted.bai |
docs: | http://samtools.sourceforge.net/samtools.shtml |
MarkDuplicates
Mark duplicate PCR fragments to be filtered in analysis
tool: | MarkDuplicates?.jar |
inputs: | lane.chr.sorted.bam |
outputs: | lane.chr.dedup.bam lane.chr.dedup.metrics |
docs: | http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates |
IndelRealigner-KnownsOnly
Improve the alignment using known indel information (will reduce false SNP calls)
tool: | GenomeAnalysisTK.jar -T IndelRealigner? |
inputs: | lane.chr.dedup.bam genome.chr.realign.intervals genome.chr.dbsnpXYZ.rod genome.chr.indelsXYZ.vcf |
outputs: | lane.chr.realigned.bam |
docs | http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Running_the_Indel_Realigner_only_at_known_sites |
FixMateInformation
Fix the paired end information as consequence of the realignment.
tool: | FixMateInformation?.jar |
inputs: | lane.chr.realigned.bam |
outputs: | lane.chr.matefixed.bam |
docs: | http://picard.sourceforge.net/command-line-overview.shtml#FixMateInformation, |
CountCovariates
Count covariants, such as machine cycle and bp position, to be used as basis for quality recalibration. Optionally: plot the results to pdf using AnalyzeCovariates?
tool: | GenomeAnalysisTK.jar -T CountCovariates?, AnalyzeCovariates?.jar |
inputs: | lane.chr.matefixed.bam genome.chr.dbsnpXYZ.rod |
outputs: | lane.chr.covariate_table.csv |
docs: | http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration#CountCovariates |
TableRecalibration
Recalibrate quality scores based on the covariate table
tool: | GenomeAnalysisTK.jar -T TableRecalibration? |
inputs: | lane.chr.matefixed.bam lanec.chr.recal_table.csv chr.fasta |
outputs: | lane.chr.recal.bam |
docs: | http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration#TableRecalibration |
Repeat: sam-sort, sam-index, countcovariates
See steps above for commands and docs.
inputs: | lane.chr.recal.bam |
outputs: | lane.chr.recal.sorted.bam, lane.chr.recal.sorted.bam.bai, lane.chr.recal.covariate_table.csv |
Discussion:
wy do we need to sort and index after recalibration? does it mess up the order of things?
Attachments (3)
- Figure1.png (349.2 KB) - added by 14 years ago.
- Figure2.png (311.5 KB) - added by 14 years ago.
- Figure3.png (224.0 KB) - added by 14 years ago.
Download all attachments as: .zip