= '''Overview BGI datasets''' =
= '''BGI Data''' =
Below is a description of the pipeline BGI used to process the GoNL data on hg19, followed by a short description of the format of the files delivered by BGI.

TODO: More information on statistics and on how to access the dataset will be added here.

BGI will share with us SNP calling results (VCF) and indel calls in SOAP format.
== BGI hg19 Pipeline ==
 * Alignment: bwa aln -o 1 -e 63 -i 15 -L -l 31 -k 2
   * Parameters are based on those used in other projects for indel detection
 * Remove duplicates: samtools; merge first, then remove duplicates
 * SNP Detection: SOAPsnp
   * Parameters: -r 0.0005 -3 0.001 -u -2
   * Depth > 4x
   * Q > 20
   * CN < 2
   * >5 bp between adjacent SNPs
 * Indel Detection: samtools pileup -ivcf
   * RMS Qual > 20
   * Variation freq >= 0.1
   * Consensus quality > Q20
   * Supporting reads >= 2
   * Results: ~700k indels per individual
 * CNV: CNV_Detector
   * Parameters: -r *.fa -c *.depth.insgle.out -o *.cnv.txt -l 90
   * Depth-based: the GC content and average sequencing depth of every sliding window across the sequenced genome are computed, and the depth distribution is modeled as a normal distribution with an estimated mean depth and standard deviation for each level of GC content. Regions meeting the criteria below are identified as potential CNVs:
     * Depth was significantly different from the whole-genome (WG) average for windows with the same GC content
     * Flanking region...
   * Results: ~150 CNVs > 10 kb and ~100 CNVs > 100 kb per individual
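
The depth-based CNV criterion above can be illustrated with a minimal sketch. This is not CNV_Detector's actual implementation: the function name, the input layout (one `(gc_percent, depth)` pair per sliding window) and the z-score cutoff of 3.0 are assumptions chosen for illustration only. It groups windows by GC percentage, estimates a mean and standard deviation of depth per GC level, and flags windows whose depth deviates strongly from windows with the same GC content.

```python
from collections import defaultdict
from statistics import mean, pstdev


def flag_depth_outliers(windows, z_cutoff=3.0):
    """Return indices of windows whose depth is unusual for their GC level.

    windows  -- list of (gc_percent, depth) tuples, one per sliding window
                (hypothetical input layout, not a CNV_Detector format)
    z_cutoff -- assumed threshold for "significantly different"
    """
    # Group depths by GC percentage rounded to the nearest integer.
    depths_by_gc = defaultdict(list)
    for gc, depth in windows:
        depths_by_gc[round(gc)].append(depth)

    # Estimate mean and standard deviation of depth for each GC level.
    stats = {gc: (mean(d), pstdev(d)) for gc, d in depths_by_gc.items()}

    flagged = []
    for index, (gc, depth) in enumerate(windows):
        mu, sd = stats[round(gc)]
        # Skip GC levels with no spread (for example a single window).
        if sd > 0 and abs(depth - mu) / sd > z_cutoff:
            flagged.append(index)
    return flagged
```

Flagged windows would only be candidates; the flanking-region criterion listed above is not covered by this sketch.
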
== BGI file format ==
BGI delivered all its data both in the raw format of the tools they used and as a VCF version. Below is a summary of the data and formats, along with notes and known caveats.
 * Alignment data
   * BAM format
 * CNV
   * CNV Detector output format
     * According to the documentation, CNV Detector uses microarray intensity log2 ratio values as input files. It is not yet clear which arrays/files were used for our samples.
   * VCF format
     * Not assessed yet.
 * Indels
   * Samtools pileup format
     * No matching description could be found on the samtools website. To be completed when BGI provides the complete format specification.
   * VCF format
     * Not assessed yet.
 * SNP calls
   * SOAPsnp output format
     *
   * VCF format
     * Some of the files are in VCF 3.0 format, others in VCF 4.0 format.
     * Some files are not sorted correctly; they need to be re-sorted for most programs to work correctly.
     * Some files do not have a proper sample name set (only SAMPLE). This is usually a problem for any program that works on multiple VCF files at the same time.
     * The FILTER column is either always 0 or always PASS; in both cases it just means that all the SNPs passed the filtering steps, and hence it gives no information about SNPs that were filtered out.
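
Two of the VCF caveats above (incorrect sort order and the placeholder sample name SAMPLE) could be worked around along the following lines. This is only a sketch under stated assumptions: the function names are hypothetical, the file is assumed to be tab-separated with a single sample in the tenth column, and the chromosome order used (numeric names first, then the remaining names alphabetically) may not match the order a given downstream tool or reference expects.

```python
def chrom_key(chrom):
    """Sort key: numeric chromosomes first (by number), then other names."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    if name.isdigit():
        return (0, int(name), "")
    return (1, 0, name)


def clean_vcf_lines(lines, sample_name):
    """Return header lines followed by records sorted by chromosome and position.

    lines       -- iterable of VCF lines (header and records)
    sample_name -- name to use when the only sample column is called SAMPLE
    """
    header, records = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            if line.startswith("#CHROM"):
                cols = line.split("\t")
                # Columns 0-8 are the fixed fields; index 9 is the first sample.
                if len(cols) == 10 and cols[9] == "SAMPLE":
                    cols[9] = sample_name
                line = "\t".join(cols)
            header.append(line)
        else:
            records.append(line)

    def record_key(record):
        chrom, pos = record.split("\t", 2)[:2]
        return (chrom_key(chrom), int(pos))

    records.sort(key=record_key)
    return header + records
```

The sample name would have to come from elsewhere, for instance the file name or the delivery sheet, since the VCF itself does not carry it.
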
|| '''Batch''' || '''Samples''' || '''Lanes''' || '''Size''' || '''Groningen storage''' || '''Grid storage (VO)''' || '''Storage additional sites''' || '''Analysis site''' || '''Status''' ||
== '''Directories''' ==
'''4th batch'''
