74 | | == A. Pre-alignment == |
75 | | |
76 | | This process involves all needed steps to prepare the raw sequencing reads for an alignment using BWA. These steps are all conducted using PICARD. An overview of the steps can be found in Appendix A. |
77 | | === 1) Read converting === |
78 | | |
79 | | This step involves converting the output from the Illumina GAII, FASTQ format, to binary SAM format (BAM). To do this !FastqToSam.jar can be used. The following command was used: |
80 | | |
81 | | {{{ |
82 | | #!sh |
83 | | java -jar !FastqToSam.jar FASTQ=../testrun/s_2_sequence041090.txt QUALITY_FORMAT=Illumina OUTPUT=../testrun/s_2_sequence.bam SAMPLE_NAME=s_2_test_sample PLATFORM_UNIT=barcode_13434HWUSIsomething PLATFORM=illumina SORT_ORDER=coordinate |
84 | | }}} |
85 | | |
86 | | It is also possible to convert raw BUSTARD data to BAM using !BustardToSam.jar. The advantage of this tool is the multiple lane input, in theory all lanes from one flow cell can be converted parallel. To remove possible linkers from reads a tool named !ExtractIlluminaBarcodes.jar can be used. Since there isn’t any data including these barcodes this and the !BustardToSam haven’t been tested yet. |
87 | | |
88 | | === 2) Sort & index reads === |
89 | | |
90 | | The output of !FastqToSam.jar needs to be sorted and indexed in order to use it as input for the alignment. To sort these files SAMtools was used. The following commands can be executed on the BAM file: |
91 | | |
92 | | {{{ |
93 | | #!sh |
94 | | ./samtools sort ../testrun/s_2_sequence041090_aligned.bam ../testrun/s_2_aligned_sorted |
95 | | }}} |
96 | | |
97 | | The postfix .bam is automatically added to the output file. |
98 | | |
99 | | The corresponding BAM index file can be created using: |
100 | | |
101 | | {{{ |
102 | | #!sh |
103 | | ./samtools index ../testrun/s_2_aligned_sorted.bam ../testrun/s_2_aligned_sorted.bam.bai |
104 | | }}} |
105 | | |
106 | | == B. Alignment == |
107 | | This consists of alignment using BWA, coverting steps and the sorting and indexing of the output BAM file. |
108 | | |
109 | | === 3) Alignment] === |
110 | | |
111 | | '''The three commands in this section are identical. Copy&Paste error? -- Leon''' |
112 | | |
113 | | To align the reads to a reference genome the GATK uses BWA. The indices used for this step were already made for an earlier Galaxy project. GATK uses an own version of BWA which doesn’t support multiple threading yet. To overcome this problem alignment was performed using the standard BWA. The following command was executed:''''''' |
114 | | |
115 | | {{{ |
116 | | #!sh |
117 | | ./bwa aln -l 101 -k 4 -t 16 ../Sting/dist/resources/index_hg36.fa ../testrun/s_2_sequence041090.txt -f ../testrun/s_2_sequence041090_aligned.sai |
118 | | }}} |
119 | | |
120 | | This results in an alignment file in .sai format. To convert this to SAM format the following command was used: |
121 | | |
122 | | {{{ |
123 | | #!sh |
124 | | ./bwa aln -l 101 -k 4 -t 16 ../Sting/dist/resources/index_hg36.fa ../testrun/s_2_sequence041090.txt -f ../testrun/s_2_sequence041090_aligned.sai |
125 | | }}} |
126 | | |
127 | | To convert this SAM file to BAM SAMtools was used, the command: |
128 | | |
129 | | {{{ |
130 | | #!sh |
131 | | ./bwa aln -l 101 -k 4 -t 16 ../Sting/dist/resources/index_hg36.fa ../testrun/s_2_sequence041090.txt -f ../testrun/s_2_sequence041090_aligned.sai |
132 | | }}} |
133 | | |
134 | | === 4) Sort & index alignment === |
135 | | After this converting step the BAM file needs to be sorted & indexed again using the following command: |
136 | | |
137 | | {{{ |
138 | | #!sh |
139 | | ./samtools sort ../testrun/s_2_sequence041090_aligned.bam ../testrun/s_2_aligned_sorted |
140 | | }}} |
141 | | |
142 | | '''Above should be ../testrun/s_2_aligned_sorted.bam? --Leon''' |
143 | | |
144 | | The index was created using: |
145 | | |
146 | | {{{ |
147 | | #!sh |
148 | | ./samtools index ../testrun/s_2_aligned_sorted.bam ../testrun/s_2_aligned_sorted.bam.bai |
149 | | }}} |
150 | | |
151 | | '''NOTE. Keep track of the reference build used. UCSC uses human build hg18 and hg19, which correspond to build 36 and 37 of the NCBI. These builds contains the same information and sequences but different headers etc. The dbSNP ROD files which need to be used further on in the pipeline can be downloaded from UCSC. To overcome any compatibility problems it is advised to use hg18 and hg19 as reference during alignment and further calculations.''' |
152 | | |
153 | | == C. Post-alignment == |
154 | | This step involves all processes executed on the generated alignments. These steps include quality value recalibration, SNP/indel calling, annotation and the calculation of statistics as well. |
155 | | |
156 | | === 5) Quality value recalibration (Count covariates)] === |
157 | | |
158 | | To determine the covariates influencing base quality scores in the BAM file obtained after alignment this step is performed. This results in a CSV file. The following command was used to execute this step: |
159 | | |
160 | | {{{ |
161 | | #!sh |
162 | | java -jar GenomeAnalysisTK.jar -R resources/index_hg36.fa --default_platform illumina --DBSNP resources/hg36/snp130.txt -I ../../testrun/s_2_aligned_sorted.bam --max_reads_at_locus 20000 -T !CountCovariates -cov !ReadGroupcovariate -cov !QualityScoreCovariate -cov !CycleCovariate -cov !DinucCovariate -recalFile ../../testrun/recal_data.csv |
163 | | }}} |
164 | | |
165 | | The parameters used for this calculation need to be tested more. This to see the influence of each parameter on the output results. |
166 | | |
167 | | === 6) Quality value recalibration (Table recalibration) === |
168 | | |
169 | | This step includes the actual quality value recalibration. The recalibration takes place using the generated CSV file as input. The output of this process is the alignment file with recalibrated quality values in BAM format. This step was executed using the following command: |
170 | | |
171 | | {{{ |
172 | | #!sh |
173 | | java -jar GenomeAnalysisTK.jar -R resources/index_hg36.fa --default_platform illumina -I ../../testrun/s_2_aligned_sorted.bam -T !TableRecalibration -outputBam ../../testrun/s_2_aligned_recalibrated.bam -recalFile ../../testrun/recal_data.csv |
174 | | }}} |
175 | | |
176 | | === 7) Sort & index recalibrated alignment] === |
177 | | |
178 | | The resulting BAM file needs to be sorted and indexed again. This step needs to be repeated every time an analysis or calculation has been performed on a BAM file. To sort and index this file use the commands explained in step 4. |
179 | | |
180 | | === 8) Quality value recalibration (Count covariates) === |
181 | | |
182 | | This step is executed on the obtained BAM file after the quality value recalibration. This file is used as input, the command used further is exactly the same as the command used in step 5. |
183 | | |
184 | | === 9) Generate graphs/stats === |
185 | | To generate several statistics and graphs on the recalibrated alignment the !AnalyzeCovariates.jar tool can be used. This tool uses the generated .CSV file as an input. The output is a directory containing several statistics and graphs about the quality values etc. The following command was used: |
186 | | |
187 | | {{{ |
188 | | #!sh |
189 | | Java –jar !AnalyzeCovariates.jar –recalFile ../../testrun/recal_data.csv –outputDir ../../testrun/analyzecovariates_output –resources resources/index_hg36.fa –ignoreQ 5 |
190 | | }}} |
191 | | |
192 | | === 10) Generate graphs/stats === |
193 | | This step calculates the same as step 9, but used the CSV file from step 5 as input. The output from this step can be compared to the output of step 9. By comparing the output from both above mentioned steps the increase and changes in quality value etc. can be visualized. |
194 | | |
195 | | == D. SNP & indel calling == |
196 | | This part of the pipeline involves the tools for SNP and indel calling. This part also includes local realignment and filtering for potential SNPs. The following steps divide this part in sub processes. |
197 | | |
198 | | === 11) Local realignment around indels === |
199 | | |
200 | | The first step is to determine small suspicious intervals in the alignment. These intervals will then be realigned to get an optimal output result. To do this, these intervals were calculated using the following command: |
201 | | |
202 | | {{{ |
203 | | #!sh |
204 | | java -jar GenomeAnalysisTK.jar -T !RealignerTargetCreator -I ../../testrun/s_2_aligned_recalibrated_sorted.bam -R resources/index_hg36.fa -o ../../testrun/forRealigner.intervals -D resources/hg36/snp130.txt |
205 | | }}} |
206 | | |
207 | | This process outputs a file containing intervals on which realignment should be executed. |
208 | | |
209 | | === 12) Cleaning BAM alignment file by realignment === |
210 | | |
211 | | For each detect interval a local realignment takes place. This results in a cleaned BAM file on which SNP and indel calling takes place. The local alignment was performed using the following command: |
212 | | |
213 | | {{{ |
214 | | #!sh |
215 | | java -Djava.io.tmpdir=/tmp -jar GenomeAnalysisTK.jar -I ../../testrun/s_2_aligned_recalibrated_sorted.bam -R resources/index_hg36.fa -T !IndelRealigner -targetIntervals ../../testrun/forRealigner.intervals --output ../../testrun/s_2_aligned_cleaned.bam -D resources/hg36/snp130.txt |
216 | | }}} |
217 | | |
218 | | === 13) Sort & index cleaned alignment === |
219 | | |
220 | | The cleaned BAM file is again sorted and indexed using the command as described in step 4. |
221 | | |
222 | | === 14) Generate raw indel calls === |
223 | | In this step raw indel calls are made. This output two bed files, the first containing raw indels, the second detailed output. This process was executed using the following command: |
224 | | |
225 | | {{{ |
226 | | #!sh |
227 | | Java –jar GenomeAnalysisTK.jar –T IndelGenotyperV2 –R resources/index_hg36.fa --I ../../testrun/s_2_aligned_cleaned.bam –O ../../testrun/indels.raw.bed –o ../../testrun/detailed.output.bed –verbose –mnr 1000000 |
228 | | }}} |
229 | | |
230 | | The generated raw indel file will later be used in step 17. |
231 | | |
232 | | === 15) Filter raw indels === |
233 | | To retrieve the actual list of indels this last step needs to be performed. This results in a filtered list containing all the detected indels. The following command (Perl script) was used: |
234 | | |
235 | | {{{ |
236 | | #!sh |
237 | | perl ../perl/filterSingleSampleCalls.pl --calls ../../testrun/detailed.output.bed --max_cons_av_mm 3.0 --max_cons_nqs_av_mm 0.5 --mode ANNOTATE > ../../testrun/indels.filtered.bed |
238 | | }}} |
239 | | |
240 | | === 16) Making SNP calls === |
241 | | To make SNP calls this process needs to be executed. This results in a VCF file containing raw SNPs. This list will be filtered in step 18, using output from step 17. To generate this list of raw SNPs the following command was used: |
242 | | |
243 | | {{{ |
244 | | #!sh |
245 | | Java –jar GenomeAnalysisTK.jar –R resources/index_hg36.fa –T !UnifiedGenotyper -I ../../testrun/s_2_aligned_cleaned.bam -D resources/hg36/snp130.txt –varout ../../testrun/s2_snps.raw.vcf –stand_call_conf 30.0 |
246 | | }}} |
247 | | |
248 | | === 17) SNP filter pre-processing === |
249 | | In this step the raw indel file created in step 14 was used. Based on the regions described in this file the tool filters out SNPs located near these potential indels. The output of this file is used for filtering the SNP calls in step 18. The following command was used: |
250 | | |
251 | | {{{ |
252 | | #!sh |
253 | | Python ../python/makeIndelMask.py ../../testrun/indels.raw.bed 10 ../../testrun/indels.mask.bed |
254 | | }}} |
255 | | |
256 | | The number in this command stands for the number of bases that will be included on either side of the indel. |
257 | | |
258 | | === 18) Filter SNP calls === |
259 | | The last step is the filtering on the list of SNPs. This filtering step includes a list of parameters which haven’t been tested yet. These parameters aren’t included in this command because they have to be tested. The following command was executed: |
260 | | |
261 | | {{{ |
262 | | #!sh |
263 | | Java –jar GenomeAnalysisTK.jar –T !VariantFiltration –R resources/index_hg36.fa –O ../../testrun/s2_snps.filtered.vcf –B ../../testrun/s2_snps.raw.vcf –B ../../testrun/indels.mask.bed –maskName !InDel –clusterWindowSize 10 –filterExpression “AB > 0.75 && DP > 40 |
264 | | ||DP > 100||MQ0 > 40||SB > -0.10” –filterName “!StandardFilters” –filterExpression “MQ0 >= 4 && ((1.0 * DP) > 0.1)” –filterName “HARD_TO_VALIDATE”|| |
265 | | }}} |
266 | | |
267 | | The output of this step is the list of SNPs found in this individual. More information on the parameters used can be found on: http://www.broadinstitute.org/gsa/wiki/index.php/VariantFiltrationWalker & http://www.broadinstitute.org/gsa/wiki/index.php/Using_JEXL_expressions |
268 | | |
269 | | == E. Calculating quality metrics & removing duplicate reads == |
270 | | |
271 | | To remove duplicate reads after alignment PICARD can be used. This tool can also calculate different metrics on alignment data. Therefore PICARD was installed and tested on an aligned and sorted BAM file. A process overview can be seen in figure 3. This process starts at step 4 of the analysis pipeline and uses the sorted alignment file as input. A complete testrun also was performed on this tool and developed pipeline. The command, results etc. can be found in Complete_Picard_test_run.doc. |
272 | | == Remarks & Suggestions == |
273 | | There are several other tools which can be implemented into the pipeline, one of these tools is an implementation for Beagle. This tool is only usable in an experimental way at the moment. The only use cases supported by this interface are: |
274 | | |
275 | | * Inferring missing genotype data from call sets (e.g. for lack of coverage in low-pass data) |
276 | | * Genotype inference for unrelated individuals. |
277 | | |
278 | | The basic workflow for this interface is as follows: |
279 | | |
280 | | * After variants are called and possibly filtered, the GATK walker !ProduceBeagleInputWalker will take the resulting VCF as input, and will produce a likelihood file in BEAGLE format. |
281 | | * User needs to run BEAGLE with this likelihood file specified as input. |
282 | | * After Beagle runs, user must unzip resulting output files (.gprobs, .phased) containing posterior genotype probabilities and phased haplotypes. |
283 | | * User can then run GATK walker BeagleOutputToVCFWalker to produce a new VCF with updated data. The new VCF will contain updated genotypes as well as updated annotations. |
284 | | |
285 | | GATK also has the ability to run process in parallel. This option is only supported by a limited amount of tools. It can be imported in every tool using parallelism, more information on this can be found on: http://www.broadinstitute.org/gsa/wiki/index.php/Parallelism_and_the_GATK |
286 | | |
287 | | |
288 | | To construct the correct read groups after merging BAM files obtained using BWA a patch can be used. The patch can be found here: http://www.broadinstitute.org/gsa/wiki/index.php/BWA_patch_to_generate_read_group |
289 | | In the next version of BWA this patch will be included in the standard BWA version. |
290 | | = Manuals, help etc = |
291 | | More information about the GATK can be found at: |
292 | | |
293 | | http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit |
294 | | |
295 | | A forum containing a lot of background information can be found at: |
296 | | |
297 | | http://getsatisfaction.com/gsa |
298 | | |
299 | | Information on the other tools used can be found at; |
300 | | |
301 | | SAMtools: http://samtools.sourceforge.net/ |
302 | | |
303 | | PICARD: http://picard.sourceforge.net/index.shtml |
304 | | |
305 | | Another source of valuable information on anything NGS related: |
306 | | |
307 | | http://seqanswers.com/forums/index.php |
308 | | = Appendix A – Schematic overview of the pipeline = |
309 | | |
310 | | == Figure 1. Overview of the analysis pipeline created using the GATK. == |
311 | | The steps/processes are all numbered. More information on these steps can be found in the corresponding text sessions.'' |
312 | | |
313 | | [[Image(wiki:SnpCallingPipeline:Figure1.png)]] |
314 | | |
315 | | == Figure 2. Overview of the analysis pipeline created using the GATK. == |
316 | | |
317 | | This part starts at step 7 from figure 1. The steps/processes are all numbered. More information on these steps can be found in the corresponding text sessions.'' |
318 | | |
319 | | [[Image(wiki:SnpCallingPipeline:Figure2.png)]] |
320 | | |
321 | | == Figure 3. Overview of the PICARD pipeline. == |
322 | | |
323 | | This tool performs several quality checks after the alignment step completed.'' |
324 | | |
325 | | [[Image(wiki:SnpCallingPipeline:Figure3.png)]] |
| 74 | == Workflow 1: genome reference file creation == |
| 75 | |
| 76 | This workflow creates reference files per chromosome including: |
| 77 | * genome, dbsnp and indel vcfs per chromosome |
| 78 | * realign targets for faster realignment target creation |
| 79 | * index files for samtools and bwa |
| 80 | |
| 81 | Workflow inputs: |
| 82 | * genome.chr.fa - downloaded from genome supplier (now hg19) |
| 83 | * dbsnpXYZ.rod - downloaded reference SNPs from dbsnp (now 129) |
| 84 | * indelsXYZ.vcf - downloaded reference indels from 1KG |
| 85 | |
| 86 | Workflow outputs: |
| 87 | * genome.chr.fa - cleaned headers |
| 88 | * genome.chr.fa.fa - index for samtools |
| 89 | * genome.chr.fa.<format> - multilple index files for bwa |
| 90 | * dbsnpXYZ.chr.rod - split per chromosome |
| 91 | * indelsXYZ.chr.vcf - split per chromosome |
| 92 | * genome.chr.realign.intervals - targets for realignment |
| 93 | |
| 94 | === clean-fasta-headers === |
| 95 | Clean headers to only have '1' instead of Chr1, etc |
| 96 | |
| 97 | ||tool: || || |
| 98 | ||inputs: ||genome.chr.fa || |
| 99 | ||outputs: ||genome.chr.fa || |
| 100 | ||doc: ||internally developed || |
| 101 | |
| 102 | === split-vcf-chr for dbsnp and indels === |
| 103 | Split vcf per chromosome |
| 104 | ||tool: || || |
| 105 | ||inputs: ||dbsnpXYZ.rod, indelsXYZ.vcf || |
| 106 | ||outputs: ||dbsnpXYz.chr.rod, indelsXYZ.vcf || |
| 107 | ||doc: || || |
| 108 | |
| 109 | Discussion: |
| 110 | > Can we use http://vcftools.sourceforge.net/options.html ? |
| 111 | >> vcftools --vcf indelsXYZ.vcf --chr <i> --recode --out indelsXYZ.chr |
| 112 | |
| 113 | === index-chromosomes === |
| 114 | Index reference sequence for each chromosome in the FASTA format |
| 115 | |
| 116 | ||tool: ||samtools faidx || |
| 117 | ||input: ||genome.chr.fa || |
| 118 | ||output: ||genome.chr.fa.fai || |
| 119 | ||doc: ||http://samtools.sourceforge.net/samtools.shtml#3 || |
| 120 | |
| 121 | === bwa-index-chromosomes === |
| 122 | Index reference sequence for each chromosome for bwa alignment |
| 123 | |
| 124 | ||tool: ||bwa index -a IS || |
| 125 | ||input: ||genome.chr.fa || |
| 126 | ||output: ||genome.chr.fa.xyz || |
| 127 | ||doc: ||http://bio-bwa.sourceforge.net/bwa.shtml#3 || |
| 128 | |
| 129 | === !RealignerTargetCreator === |
| 130 | Generate realignment targets for known sites for each chromosome |
| 131 | |
| 132 | ||tool: ||GenomeAnalysisTK.jar -T RealignerTargetCreator || |
| 133 | ||input: ||genome.chr.fa, dbsnpXYz.chr.rod, indelsXYZ.vcf || |
| 134 | ||output: ||genome.chr.realign.intervals || |
| 135 | ||doc: ||http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Running_the_Indel_Realigner_only_at_known_sites || |
| 136 | |
| 137 | == Workflow 2: Alignment per Lane, per Chr == |
| 138 | |
| 139 | This workflow aligns reads per lane and chromosome, including: |
| 140 | * re-alignment to prevend false SNP calls caused by indels (using known indels) |
| 141 | * markduplicates to prevend false coverage caused by PCR errors (per library = lane) |
| 142 | * base quality recalibration to correct for false low scores caused by true variation |
| 143 | |
| 144 | Workflow Inputs: |
| 145 | * lane.1.fq.gz - raw reads for lane, pair end 1 |
| 146 | * lane.2.fq.gz - raw reads for lane, pair end 2 |
| 147 | * genome.chr.fasta - reference genome split on chromosome |
| 148 | * genome.chr.realign.intervals - targets for realignment per chromosome |
| 149 | * genome.chr.dbsnpXYZ.rod - known snp variants, here from dpbsnp |
| 150 | * genome.chr.indelsXYZ.vcf - known indels from, here from 1KG |
| 151 | |
| 152 | Workflow ouputs: |
| 153 | * lane.chr.1.sai - alignment index for first pair |
| 154 | * lane.chr.2.sai - alignment index for second pair |
| 155 | * lane.chr.sam - alignment map for |
| 156 | * lane.chr.bam - alignment map in binary format |
| 157 | * lane.chr.sorted.bam - sorted alignment map |
| 158 | * lane.chr.sorted.bai - sorted alignment index |
| 159 | * lane.chr.dedup.bam - marked duplicate PCR elements |
| 160 | * lane.chr.dedup.metrics - metrics describing deduplication |
| 161 | * lane.chr.realigned.bam - realigned based on known indels |
| 162 | * lane.chr.matefixed.bam - fixed the mate pair ends |
| 163 | * lane.chr.covariate_table.csv - table of countcovariates output for recalibration |
| 164 | * lane.chr.recal.bam - alignment map with recalibrated quality scores |
| 165 | |
| 166 | === align === |
| 167 | Align each end of paired end. |
| 168 | |
| 169 | ||tool: ||bwa-align || |
| 170 | ||input: ||chr.fasta, lane.1.fq.gz, lane.2.fq.gz || |
| 171 | ||output: ||lane.chr.1.sai, lane.chr.2.sai || |
| 172 | ||docs: ||http://bio-bwa.sourceforge.net/bwa.shtml || |
| 173 | |
| 174 | === align-pe === |
| 175 | Align the pairs as one |
| 176 | |
| 177 | ||tool: ||bwa sampe || |
| 178 | ||inputs: ||chr.fasta [[BR]] lane.1.fq.gz [[BR]] lane.2.fq.gz [[BR]] lane.chr.1.sai [[BR]] lane.chr.2.sai || |
| 179 | ||outputs: ||lane.chr.sam || |
| 180 | ||docs: ||http://bio-bwa.sourceforge.net/bwa.shtml || |
| 181 | |
| 182 | === sam-to-bam === |
| 183 | Convert sam to bam |
| 184 | |
| 185 | ||tool: ||samtools view || |
| 186 | ||inputs: ||lane.chr.sam || |
| 187 | ||outputs: ||lane.chr.bam || |
| 188 | ||docs: ||http://samtools.sourceforge.net/samtools.shtml || |
| 189 | |
| 190 | (Question: can this not index and sort?) |
| 191 | |
| 192 | === sam-sort === |
| 193 | Sort bam file on coordinate |
| 194 | |
| 195 | ||tool: ||samtools sort || |
| 196 | ||inputs: ||lane.chr.bam || |
| 197 | ||outputs: ||lane.chr.sorted.bam || |
| 198 | ||docs: ||http://samtools.sourceforge.net/samtools.shtml || |
| 199 | |
| 200 | === sam-index === |
| 201 | Index bam file for quicker access |
| 202 | |
| 203 | ||tool: ||samtools index || |
| 204 | ||inputs: ||lane.chr.sorted.bam || |
| 205 | ||outputs: ||lane.chr.sorted.bai || |
| 206 | ||docs: ||http://samtools.sourceforge.net/samtools.shtml || |
| 207 | |
| 208 | === !MarkDuplicates === |
| 209 | Mark duplicate PCR fragments to be filtered in analysis |
| 210 | |
| 211 | ||tool: ||MarkDuplicates.jar || |
| 212 | ||inputs: ||lane.chr.sorted.bam || |
| 213 | ||outputs: ||lane.chr.dedup.bam [[BR]] lane.chr.dedup.metrics || |
| 214 | ||docs: ||http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates || |
| 215 | |
| 216 | === !IndelRealigner-!KnownsOnly === |
| 217 | Improve the alignment using known indel information (will reduce false SNP calls) |
| 218 | |
| 219 | ||tool: ||GenomeAnalysisTK.jar -T IndelRealigner || |
| 220 | ||inputs: ||lane.chr.dedup.bam [[BR]] genome.chr.realign.intervals [[BR]] genome.chr.dbsnpXYZ.rod [[BR]] genome.chr.indelsXYZ.vcf || |
| 221 | ||outputs: ||lane.chr.realigned.bam || |
| 222 | ||docs ||http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Running_the_Indel_Realigner_only_at_known_sites || |
| 223 | |
| 224 | === !FixMateInformation === |
| 225 | Fix the paired end information as consequence of the realignment. |
| 226 | |
| 227 | ||tool: ||FixMateInformation.jar || |
| 228 | ||inputs: ||lane.chr.realigned.bam |
| 229 | ||outputs: ||lane.chr.matefixed.bam || |
| 230 | ||docs: ||http://picard.sourceforge.net/command-line-overview.shtml#FixMateInformation, |
| 231 | |
| 232 | http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Fixing_Mate_Pairs || |
| 233 | |
| 234 | === !CountCovariates === |
| 235 | Count covariants, such as machine cycle and bp position, to be used as basis for quality recalibration. |
| 236 | Optionally: plot the results to pdf using AnalyzeCovariates |
| 237 | |
| 238 | ||tool: ||GenomeAnalysisTK.jar -T CountCovariates, AnalyzeCovariates.jar || |
| 239 | ||inputs: ||lane.chr.matefixed.bam [[BR]] genome.chr.dbsnpXYZ.rod || |
| 240 | ||outputs: ||lane.chr.covariate_table.csv || |
| 241 | ||docs: ||http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration#CountCovariates [[BR]] |
| 242 | |
| 243 | http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration#AnalyzeCovariates.jar || |
| 244 | |
| 245 | === !TableRecalibration === |
| 246 | Recalibrate quality scores based on the covariate table |
| 247 | ||tool: ||GenomeAnalysisTK.jar -T TableRecalibration || |
| 248 | ||inputs: ||lane.chr.matefixed.bam [[BR]]lanec.chr.recal_table.csv [[BR]]chr.fasta || |
| 249 | ||outputs: ||lane.chr.recal.bam |
| 250 | ||docs: ||http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration#TableRecalibration || |
| 251 | |
| 252 | === Repeat: sam-sort, sam-index, countcovariates === |
| 253 | See steps above for commands and docs. |
| 254 | |
| 255 | ||inputs: ||lane.chr.recal.bam || |
| 256 | ||outputs: ||lane.chr.recal.sorted.bam, lane.chr.recal.sorted.bam.bai, lane.chr.recal.covariate_table.csv || |
| 257 | |
| 258 | Discussion: |
| 259 | > wy do we need to sort and index after recalibration? does it mess up the order of things? |
| 260 | |
| 261 | == Workflow 3: sample level variant calling == |
| 262 | This workflow will call variants for the samples including: |
| 263 | * sample level recalibration |
| 264 | * sample level realignment |
| 265 | N.B. no sample level MarkDuplicates is needed as lanes == libraries. |
| 266 | |
| 267 | Workflow inputs: |
| 268 | * lane.chr.recal.sorted.bam - for all sample lanes: dedupped, recalibrated, realigned, sorted and indexed bams (3) |
| 269 | * sample.chip.vcf - genotypes called from genotype chip |
| 270 | Reference: |
| 271 | * genome.chr.fasta - reference genome split on chromosome |
| 272 | * genome.chr.realign.intervals - targets for realignment per chromosome |
| 273 | * genome.chr.dbsnpXYZ.rod - known snp variants, here from dpbsnp |
| 274 | * genome.chr.indelsXYZ.vcf - known indels from, here from 1KG |
| 275 | |
| 276 | Workflow outputs: |
| 277 | * sample.chr.bam - merged bam files per sample |
| 278 | * sample.chr.realign.interval - realignment target intervals |
| 279 | * sample.chr.realigned.bam - realigned |
| 280 | * sample.chr.matesfixed.bam - fixed pairs in realignment |
| 281 | * sample.chr.indels.vcf - raw indels called |
| 282 | * sample.chr.indels.bed - raw indels annotations |
| 283 | * sample.chr.indels.txt - output from the indel calling |
| 284 | * sample.chr.indels.filtered.bed - indels filtered |
| 285 | * sample.chr.snps.vcf - raw snps called |
| 286 | * sample.chr.snps.filtered.vcf - snps filtered |
| 287 | |
| 288 | === merge-lanes === |
| 289 | Merge lanes into one sample bam |
| 290 | |
| 291 | ||tool: ||sam merge || |
| 292 | ||inputs: ||lane.chr.recal.sorted.bam || |
| 293 | ||outputs: ||sample.chr.bam || |
| 294 | ||docs: ||http://samtools.sourceforge.net/samtools.shtml || |
| 295 | |
| 296 | === !RealignerTargetCreator === |
| 297 | Create realignment targets based on the data (so not only knowns) |
| 298 | |
| 299 | ||tool: ||GenomeAnalysisTK.jar -T RealignerTargetCreator || |
| 300 | ||inputs: ||sample.chr.bam [[BR]]genome.chr.fa [[BR]]dbsnpXYz.chr.rod [[BR]]indelsXYZ.vcf |
| 301 | ||outputs: ||sample.chr.realign.intervals || |
| 302 | ||doc: ||http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Creating_Intervals || |
| 303 | |
| 304 | === !IndelRealigner === |
| 305 | Realign based on realignment targets in previous step |
| 306 | |
| 307 | ||tool: ||GenomeAnalysisTK.jar -T IndelRealigner || |
| 308 | ||inputs: ||sample.chr.bam [[BR]]genome.chr.realign.intervals [[BR]] genome.chr.dbsnpXYZ.rod [[BR]] genome.chr.indelsXYZ.vcf || |
| 309 | ||outputs: ||sample.chr.realigned.bam || |
| 310 | ||doc: ||http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Realigning || |
| 311 | |
| 312 | === !FixMateInformation === |
| 313 | See description in workflow2, now applied to sample |
| 314 | |
| 315 | ||inputs: ||sample.chr.realigned.bam || |
| 316 | ||ouputs: ||sample.chr.matesfixed.bam || |
| 317 | |
| 318 | === !IndelGenotyperV2 === |
| 319 | Call indels |
| 320 | |
| 321 | ||tool: ||GenomeAnalysisTK.jar -T IndelGenotyperV2 || |
| 322 | ||inputs: ||sample.chr.matesfixed.bam [[BR]]genome.chr.fa || |
| 323 | ||outputs: ||sample.chr.indels.vcf [[BR]]sample.chr.indels.bed [[BR]]sample.chr.indels.txt || |
| 324 | ||doc: ||http://www.broadinstitute.org/gsa/wiki/index.php/Indel_Genotyper_V2.0 [[BR]] |
| 325 | |
| 326 | http://www.broadinstitute.org/gsa/wiki/index.php/Firehose_Parameters#SampleIndelGenotyper || |
| 327 | |
| 328 | === !filterSingleSampleCalls === |
| 329 | Filter indels |
| 330 | |
| 331 | ||tool: ||filterSingleSampleCalls.pl || |
| 332 | ||inputs: ||sample.chr.indels.bed || |
| 333 | ||outputs: ||sample.chr.indels.filtered.bed || |
| 334 | ||doc: ||http://www.broadinstitute.org/gsa/wiki/index.php/Firehose_Parameters#SampleIndelGenotyper || |
| 335 | |
| 336 | === !UnifiedGenotyper === |
| 337 | Call SNPs |
| 338 | |
| 339 | ||tool: ||GenomeAnalysisTK.jar -T UnifiedGenotyper || |
| 340 | ||inputs: ||sample.chr.matesfixed [[BR]]genome.chr.fa [[BR]]dbsnpXYz.chr.rod || |
| 341 | ||outputs: ||sample.chr.snps.vcf || |
| 342 | ||doc: ||http://www.broadinstitute.org/gsa/wiki/index.php/Firehose_Parameters#SetUnifiedGenotypertoEval [[BR]] |
| 343 | |
| 344 | http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper || |
| 345 | |
| 346 | === !makeIndelMask === |
| 347 | Make indel mask |
| 348 | |
| 349 | ||tool: ||makeIndelMask.py || |
| 350 | ||inputs: ||sample.chr.indels.bed || |
| 351 | ||outputs: ||sample.chr.indels.mask.bed || |
| 352 | ||doc: ||http://www.broadinstitute.org/gsa/wiki/index.php/Indel_Genotyper_V2.0#Creating_a_indel_mask_file || |
| 353 | |
| 354 | === !VariantFiltration === |
| 355 | Filter variants to get the best calls possible |
| 356 | |
| 357 | ||tool: ||GenomeAnalysisTK.jar -T VariantFiltration || |
| 358 | ||inputs: ||sample.chr.snps.vcf [[BR]]genome.chr.fa [[BR]]dbsnpXYz.chr.rod || |
| 359 | ||outputs: ||sample.chr.snps.filtered.vcf || |
| 360 | ||doc: ||http://www.broadinstitute.org/gsa/wiki/index.php/Best_Practice_Variant_Detection_with_the_GATK_v2#Integrating_analyses:_getting_the_best_call_set_possible |
| 361 | |
| 362 | || |
| 363 | |
| 364 | === !MergeVcfs === |
| 365 | === !ChipVcf === |
| 366 | Produce vcf for the chips |
| 367 | |
| 368 | === !VariantEval === |
| 369 | Create summary information on the variations called for evaluation. |
| 370 | Run per sample.snps.filtered.vcf against chip. |
| 371 | |
| 372 | ||tool: ||GenomeAnalysisTK.jar -T VariantEval || |
| 373 | ||inputs: ||sample.snps.vcf [[BR]]sample.chip.vcf [[BR]]genome.chr.fa [[BR]]dbsnpXYz.chr.rod|| |
| 374 | ||outputs: ||sample.snps.eval || |
| 375 | ||doc: ||http://www.broadinstitute.org/gsa/wiki/index.php/VariantEval || |
| 376 | |
| 377 | |
| 378 | Discussion: |
| 379 | > Do we call SNPs based on the filtered indels or the raw indels? |
| 380 | > Should we realign AGAIN after merge of lanes? |
| 381 | > BAQ? |
| 382 | > MINDEL/PINDEL? |
| 383 | |
| 384 | |
| 385 | |
| 386 | |
| 387 | |
| 388 | |
| 389 | |