Changes between Version 3 and Version 4 of SnpCallingPipeline


Ignore:
Timestamp:
Sep 12, 2010 11:46:31 PM (11 years ago)
Author:
Morris Swertz
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • SnpCallingPipeline

    v3 v4  
    1717
    1818{{{
     19#!sh
    1920java -jar !FastqToSam.jar FASTQ=../testrun/s_2_sequence041090.txt QUALITY_FORMAT=Illumina OUTPUT=../testrun/s_2_sequence.bam SAMPLE_NAME=s_2_test_sample PLATFORM_UNIT=barcode_13434HWUSIsomething PLATFORM=illumina SORT_ORDER=coordinate
    2021}}}
     
    2627The output of !FastqToSam.jar needs to be sorted and indexed in order to use it as input for the alignment. To sort these files SAMtools was used. The following commands can be executed on the BAM file:
    2728
     29{{{
     30#!sh
    2831./samtools sort ../testrun/s_2_sequence041090_aligned.bam ../testrun/s_2_aligned_sorted
     32}}}
    2933
    3034The postfix .bam is automatically added to the output file.
     
    3236The corresponding BAM index file can be created using:
    3337
     38{{{
     39#!sh
    3440./samtools index ../testrun/s_2_aligned_sorted.bam ../testrun/s_2_aligned_sorted.bam.bai
     41}}}
    3542
    3643== B. Alignment ==
     
    4047To align the reads to a reference genome the GATK uses BWA. The indices used for this step were already made for an earlier Galaxy project. GATK uses an own version of BWA which doesn’t support multiple threading yet. To overcome this problem alignment was performed using the standard BWA. The following command was executed:'''''''
    4148
     49{{{
     50#!sh
    4251./bwa aln -l 101 -k 4 -t 16 ../Sting/dist/resources/index_hg36.fa ../testrun/s_2_sequence041090.txt -f ../testrun/s_2_sequence041090_aligned.sai
     52}}}
    4353
    4454This results in an alignment file in .sai format. To convert this to SAM format the following command was used:
    4555
     56{{{
     57#!sh
    4658./bwa aln -l 101 -k 4 -t 16 ../Sting/dist/resources/index_hg36.fa ../testrun/s_2_sequence041090.txt -f ../testrun/s_2_sequence041090_aligned.sai
     59}}}
    4760
    4861To convert this SAM file to BAM SAMtools was used, the command:
    4962
     63{{{
     64#!sh
    5065./bwa aln -l 101 -k 4 -t 16 ../Sting/dist/resources/index_hg36.fa ../testrun/s_2_sequence041090.txt -f ../testrun/s_2_sequence041090_aligned.sai
     66}}}
    5167
    5268=== 4) Sort & index alignment ===
    5369After this converting step the BAM file needs to be sorted & indexed again using the following command:
    5470
     71{{{
     72#!sh
    5573./samtools sort ../testrun/s_2_sequence041090_aligned.bam ../testrun/s_2_aligned_sorted
     74}}}
    5675
    5776The index was created using:
    5877
     78{{{
     79#!sh
    5980./samtools index ../testrun/s_2_aligned_sorted.bam ../testrun/s_2_aligned_sorted.bam.bai
     81}}}
    6082
    6183'''NOTE. Keep track of the reference build used. UCSC uses human build hg18 and hg19, which correspond to build 36 and 37 of the NCBI. These builds contains the same information and sequences but different headers etc. The dbSNP ROD files which need to be used further on in the pipeline can be downloaded from UCSC. To overcome any compatibility problems it is advised to use hg18 and hg19 as reference during alignment and further calculations.'''
     
    6890To determine the covariates influencing base quality scores in the BAM file obtained after alignment this step is performed. This results in a CSV file. The following command was used to execute this step:
    6991
     92{{{
     93#!sh
    7094java -jar GenomeAnalysisTK.jar -R resources/index_hg36.fa --default_platform illumina --DBSNP resources/hg36/snp130.txt -I ../../testrun/s_2_aligned_sorted.bam --max_reads_at_locus 20000 -T !CountCovariates -cov !ReadGroupcovariate -cov !QualityScoreCovariate -cov !CycleCovariate -cov !DinucCovariate -recalFile ../../testrun/recal_data.csv
     95}}}
    7196
    7297The parameters used for this calculation need to be tested more. This to see the influence of each parameter on the output results.
     
    76101This step includes the actual quality value recalibration. The recalibration takes place using the generated CSV file as input. The output of this process is the alignment file with recalibrated quality values in BAM format. This step was executed using the following command:
    77102
     103{{{
     104#!sh
    78105java -jar GenomeAnalysisTK.jar -R resources/index_hg36.fa --default_platform illumina -I ../../testrun/s_2_aligned_sorted.bam -T !TableRecalibration -outputBam ../../testrun/s_2_aligned_recalibrated.bam -recalFile ../../testrun/recal_data.csv
     106}}}
    79107
    80108=== 7) Sort & index recalibrated alignment] ===
     
    89117To generate several statistics and graphs on the recalibrated alignment the !AnalyzeCovariates.jar tool can be used. This tool uses the generated .CSV file as an input. The output is a directory containing several statistics and graphs about the quality values etc. The following command was used:
    90118
     119{{{
     120#!sh
    91121Java –jar !AnalyzeCovariates.jar –recalFile ../../testrun/recal_data.csv –outputDir ../../testrun/analyzecovariates_output –resources resources/index_hg36.fa –ignoreQ 5
     122}}}
    92123
    93124=== 10) Generate graphs/stats ===
     
    101132The first step is to determine small suspicious intervals in the alignment. These intervals will then be realigned to get an optimal output result. To do this, these intervals were calculated using the following command:
    102133
     134{{{
     135#!sh
    103136java -jar GenomeAnalysisTK.jar -T !RealignerTargetCreator -I ../../testrun/s_2_aligned_recalibrated_sorted.bam -R resources/index_hg36.fa -o ../../testrun/forRealigner.intervals -D resources/hg36/snp130.txt
     137}}}
    104138
    105139This process outputs a file containing intervals on which realignment should be executed.
     
    109143For each detect interval a local realignment takes place. This results in a cleaned BAM file on which SNP and indel calling takes place. The local alignment was performed using the following command:
    110144
     145{{{
     146#!sh
    111147java -Djava.io.tmpdir=/tmp -jar GenomeAnalysisTK.jar -I ../../testrun/s_2_aligned_recalibrated_sorted.bam -R resources/index_hg36.fa -T !IndelRealigner -targetIntervals ../../testrun/forRealigner.intervals --output ../../testrun/s_2_aligned_cleaned.bam -D resources/hg36/snp130.txt
    112 
    113 === 13) Sort & index cleaned alignment === The cleaned BAM file is again sorted and indexed using the command as described in step 4.
     148}}}
     149
     150=== 13) Sort & index cleaned alignment ===
     151
     152The cleaned BAM file is again sorted and indexed using the command as described in step 4.
    114153
    115154=== 14) Generate raw indel calls ===
    116155In this step raw indel calls are made. This output two bed files, the first containing raw indels, the second detailed output. This process was executed using the following command:
    117156
     157{{{
     158#!sh
    118159Java –jar GenomeAnalysisTK.jar –T IndelGenotyperV2 –R resources/index_hg36.fa --I ../../testrun/s_2_aligned_cleaned.bam –O ../../testrun/indels.raw.bed –o ../../testrun/detailed.output.bed –verbose –mnr 1000000
     160}}}
    119161
    120162The generated raw indel file will later be used in step 17.
     
    123165To retrieve the actual list of indels this last step needs to be performed. This results in a filtered list containing all the detected indels. The following command (Perl script) was used:
    124166
     167{{{
     168#!sh
    125169perl ../perl/filterSingleSampleCalls.pl --calls ../../testrun/detailed.output.bed --max_cons_av_mm 3.0 --max_cons_nqs_av_mm 0.5 --mode ANNOTATE > ../../testrun/indels.filtered.bed
     170}}}
    126171
    127172=== 16) Making SNP calls ===
    128173To make SNP calls this process needs to be executed. This results in a VCF file containing raw SNPs. This list will be filtered in step 18, using output from step 17. To generate this list of raw SNPs the following command was used:
    129174
     175{{{
     176#!sh
    130177Java –jar GenomeAnalysisTK.jar –R resources/index_hg36.fa –T !UnifiedGenotyper -I ../../testrun/s_2_aligned_cleaned.bam -D resources/hg36/snp130.txt –varout ../../testrun/s2_snps.raw.vcf –stand_call_conf 30.0
     178}}}
    131179
    132180=== 17) SNP filter pre-processing ===
    133181In this step the raw indel file created in step 14 was used. Based on the regions described in this file the tool filters out SNPs located near these potential indels. The output of this file is used for filtering the SNP calls in step 18. The following command was used:
    134182
     183{{{
     184#!sh
    135185Python ../python/makeIndelMask.py ../../testrun/indels.raw.bed 10 ../../testrun/indels.mask.bed
     186}}}
    136187
    137188The number in this command stands for the number of bases that will be included on either side of the indel.
     
    140191The last step is the filtering on the list of SNPs. This filtering step includes a list of parameters which haven’t been tested yet. These parameters aren’t included in this command because they have to be tested. The following command was executed:
    141192
     193{{{
     194#!sh
    142195Java –jar GenomeAnalysisTK.jar –T !VariantFiltration –R resources/index_hg36.fa –O ../../testrun/s2_snps.filtered.vcf –B ../../testrun/s2_snps.raw.vcf –B ../../testrun/indels.mask.bed –maskName !InDel –clusterWindowSize 10 –filterExpression “AB > 0.75 && DP > 40
    143 
    144196||DP > 100||MQ0 > 40||SB > -0.10” –filterName “!StandardFilters” –filterExpression “MQ0 >= 4 && ((1.0 * DP) > 0.1)” –filterName “HARD_TO_VALIDATE”||
     197}}}
    145198
    146199The output of this step is the list of SNPs found in this individual. More information on the parameters used can be found on: http://www.broadinstitute.org/gsa/wiki/index.php/VariantFiltrationWalker & http://www.broadinstitute.org/gsa/wiki/index.php/Using_JEXL_expressions
     
    158211The basic workflow for this interface is as follows:
    159212
    160 -          After variants are called and possibly filtered, the GATK walker !ProduceBeagleInputWalker will take the resulting VCF as input, and will produce a likelihood file in BEAGLE format.
    161 
    162 -          User needs to run BEAGLE with this likelihood file specified as input.
    163 
    164 -          After Beagle runs, user must unzip resulting output files (.gprobs, .phased) containing posterior genotype probabilities and phased haplotypes.
    165 
    166 -          User can then run GATK walker BeagleOutputToVCFWalker to produce a new VCF with updated data. The new VCF will contain updated genotypes as well as updated annotations.
     213 * After variants are called and possibly filtered, the GATK walker !ProduceBeagleInputWalker will take the resulting VCF as input, and will produce a likelihood file in BEAGLE format.
     214 * User needs to run BEAGLE with this likelihood file specified as input.
     215 *  After Beagle runs, user must unzip resulting output files (.gprobs, .phased) containing posterior genotype probabilities and phased haplotypes.
     216 * User can then run GATK walker BeagleOutputToVCFWalker to produce a new VCF with updated data. The new VCF will contain updated genotypes as well as updated annotations.
    167217
    168218GATK also has the ability to run process in parallel. This option is only supported by a limited amount of tools. It can be imported in every tool using parallelism, more information on this can be found on: http://www.broadinstitute.org/gsa/wiki/index.php/Parallelism_and_the_GATK
     
    187237http://seqanswers.com/forums/index.php
    188238
    189 '''[[BR]] ''' = [ Appendix A – Schematic overview of the pipeline] =
     239= Appendix A – Schematic overview of the pipeline =
    190240
    191241''Figure 1. Overview of the analysis pipeline created using the GATK. The steps/processes are all numbered. More information on these steps can be found in the corresponding text sessions.''