| 4 | == BGI hg19 Pipeline == |
| 5 | * Alignment: bwa aln -o 1 -e 63 -i 15 -L -l 31 -k 2 |
| 6 | * Parameters based on other projects for indel detection |
| 7 | * Remove duplicates: samtools, merge first then remove duplicates |
| 8 | * SNP Detection: [http://soap.genomics.org.cn/soapsnp.html SOAPsnp] |
| 9 | * Params: -r 0.0005 -3 0.001 -u -2 |
| 10 | * Depth > 4x |
| 11 | * Q > 20 |
| 12 | * CN < 2 |
| 13 | * >5bp between each SNP |
| 14 | * Indel Detection: [http://samtools.sourceforge.net/ samtools] pileup -ivcf |
| 15 | * RMS Qual > 20 |
| 16 | * Variation freq >= 0.1 |
| 17 | * Consensus quality > Q20 |
| 18 | * Supporting reads >= 2 |
| 19 | * Results: ~700k indels per individual |
| 20 | * CNV: [http://www.csie.ntu.edu.tw/~kmchao/tools/CNVDetector/ CNV_Detector] |
| 21 | * Parameters: -r *.fa -c *.depth.insgle.out -o *.cnv.txt -l 90 |
| 22 | * Depth-based: the GC% and average sequencing depth of every sliding window across the sequenced genome are computed, and for each level of GC content the depth distribution is modeled as a normal distribution with an estimated mean depth and SD. Regions meeting the criteria below are identified as potential CNVs: |
| 23 | * Depth significantly different from the whole-genome average for windows with the same GC content |
| 24 | * Flanking region... |
| 25 | * Results: ~150 CNVs > 10k, ~100 CNVs > 100k per individual |
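
The depth-based CNV criterion above can be illustrated with a minimal Python sketch. This is not CNV_Detector's implementation: the function name `flag_depth_outliers`, the input tuple layout, the GC binning to whole percentages and the z-score cutoff of 3 are all assumptions made for illustration. It only shows the idea of comparing each window's depth to the mean and SD of windows with the same GC content.

```python
from collections import defaultdict
from statistics import mean, pstdev


def flag_depth_outliers(windows, z_cutoff=3.0):
    """Flag windows whose depth deviates from other windows of equal GC content.

    windows: list of (window_id, gc_percent, mean_depth) tuples (hypothetical layout).
    z_cutoff: assumed significance threshold, not taken from the BGI pipeline.
    """
    # Group depths by GC content, binned to the nearest whole percent.
    by_gc = defaultdict(list)
    for _, gc, depth in windows:
        by_gc[round(gc)].append(depth)

    # Estimate mean and standard deviation of depth per GC bin.
    stats = {gc: (mean(d), pstdev(d)) for gc, d in by_gc.items()}

    flagged = []
    for window_id, gc, depth in windows:
        mu, sd = stats[round(gc)]
        if sd == 0:
            continue  # no spread in this GC bin, nothing to test against
        z = (depth - mu) / sd
        if abs(z) >= z_cutoff:
            flagged.append((window_id, z))
    return flagged
```

A real caller would still need to apply the flanking-region criterion and merge adjacent flagged windows into regions, which this sketch does not attempt.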
| 26 | |
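
As an illustration of the SNP filtering criteria listed in the pipeline (depth > 4x, Q > 20, CN < 2, more than 5 bp between SNPs), here is a small Python sketch. The record layout (dicts with `chrom`, `pos`, `depth`, `qual`, `cn`) and the function name `filter_snps` are hypothetical. It is also an assumption that the spacing rule is applied after the other filters and that every SNP with a neighbour within 5 bp is dropped; the pipeline description does not say which.

```python
def filter_snps(snps, min_depth=4, min_qual=20, max_cn=2.0, min_gap=5):
    """Apply the listed SNP filters to records sorted by (chrom, pos).

    snps: list of dicts with keys chrom, pos, depth, qual, cn (hypothetical layout).
    """
    # Per-site thresholds: depth > 4x, quality > 20, copy number < 2.
    passed = [
        s for s in snps
        if s["depth"] > min_depth and s["qual"] > min_qual and s["cn"] < max_cn
    ]

    def too_close(a, b):
        # Two SNPs violate the spacing rule if they are on the same
        # chromosome and no more than min_gap bp apart.
        return a["chrom"] == b["chrom"] and abs(a["pos"] - b["pos"]) <= min_gap

    kept = []
    for i, snp in enumerate(passed):
        prev_close = i > 0 and too_close(passed[i - 1], snp)
        next_close = i + 1 < len(passed) and too_close(snp, passed[i + 1])
        if not (prev_close or next_close):
            kept.append(snp)
    return kept
```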
| 27 | == BGI file format == |
| 28 | BGI delivered all its data both in the raw output format of the tools they use and as a VCF version. Below is a summary of the data and formats, along with notes and known caveats. |
| 29 | |
| 30 | * Alignment data |
| 31 | * BAM format |
| 32 | * CNV |
| 33 | * [http://www.csie.ntu.edu.tw/~kmchao/tools/CNVDetector/ CNV Detector] output format |
| 34 | * According to its documentation, CNV Detector takes microarray intensity log2 ratio values as input. It is currently unclear which array/files were used for our samples. |
| 35 | * [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 VCF format] |
| 36 | * Not assessed yet. |
| 37 | * Indels |
| 38 | * Samtools pileup format |
| 39 | * Cannot find a matching description on the samtools website. To be completed when BGI provides the complete format specs. |
| 40 | * [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 VCF format] |
| 41 | * Not assessed yet. |
| 42 | * SNP calls |
| 43 | * [http://soap.genomics.org.cn/soapsnp.html#output2 SOAPsnp output format] |
| 44 | * |
| 45 | * [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 VCF format] |
| 46 | * Some of the files are in VCF 3.0 format, others in VCF 4.0 format |
| 47 | * Some files are not sorted correctly; these need to be sorted again in order for most programs to work correctly |
| 48 | * Some files do not have a proper sample name set (only SAMPLE). This is usually problematic for any program that works on multiple VCF files at the same time. |
| 49 | * The filter column is either always 0 or always PASS; in both cases it only means that every listed SNP passed the filtering steps, so it gives no information about SNPs that were filtered out. |
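
The sorting and sample-name caveats above could be handled with a small clean-up step before using the SNP VCF files. The following Python sketch is only one possible approach, not a tool BGI or this project provides. The function name `normalize_vcf` is hypothetical, and the chromosome ordering (numeric chromosomes first, then the rest alphabetically) is an assumption: some programs instead expect the order of the reference FASTA. The version line is matched loosely (`##fileformat=` or `##format=`) because the exact header used in the older files has not been checked.

```python
import re


def _chrom_key(chrom):
    # Assumed ordering: chr1..chr22 numerically, then chrM, chrX, chrY, ... by name.
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    return (0, int(name), "") if name.isdigit() else (1, 0, name)


def normalize_vcf(text, sample_name):
    """Return (version, cleaned_text) for a single-sample VCF given as a string.

    - version: value found in the fileformat/format header line, or None
    - records are re-sorted by chromosome and position
    - a placeholder sample column named SAMPLE is renamed to sample_name
    """
    version = None
    header, records = [], []
    for line in text.splitlines():
        if line.startswith("##"):
            match = re.match(r"##(?:file)?format=VCFv(\d+(?:\.\d+)?)", line)
            if match:
                version = match.group(1)
            header.append(line)
        elif line.startswith("#CHROM"):
            cols = line.split("\t")
            if len(cols) == 10 and cols[9] == "SAMPLE":
                cols[9] = sample_name
            header.append("\t".join(cols))
        elif line.strip():
            records.append(line.split("\t"))

    records.sort(key=lambda r: (_chrom_key(r[0]), int(r[1])))
    body = ["\t".join(r) for r in records]
    return version, "\n".join(header + body) + "\n"
```

Reporting the detected version lets the caller decide whether a file needs converting before it is combined with the others.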
| 50 | |
| 51 | = Overview BGI datasets = |
| 52 | TODO: More info on statistics and how to access the dataset will be added here. BGI will share with us: SNP calling results (VCF), SOAP format for indels. |