BGI Data
Below is the description of the pipeline used by BGI for processing the GoNL data on hg19 and a short description of the format of the files delivered by BGI.
BGI hg19 Pipeline
- Alignment: bwa aln -o 1 -e 63 -i 15 -L -l 31 -k 2
  - Parameters based on other projects for indel detection
- Remove duplicates: samtools, merge first then remove duplicates
- SNP Detection: SOAPsnp
  - Params: -r 0.0005 -3 0.001 -u -2
  - Depth > 4x
  - Q > 20
  - CN < 2
  - Distance between adjacent SNPs > 5bp
- Indel Detection: samtools pileup -ivcf
  - RMS Qual > 20
  - Variation freq >= 0.1
  - Consensus quality > Q20
  - Supporting reads >= 2
  - Results: ~700k per individual
- CNV: CNV_Detector
  - Parameters: -r *.fa -c *.depth.insgle.out -o *.cnv.txt -l 90
  - Depth-based: GC content and average sequencing depth are computed for every sliding window of the sequenced genome, and for each level of GC content the depth distribution is modeled as a normal distribution with an estimated mean depth and sd. Regions meeting the criteria below are identified as potential CNVs:
    - Depth was significantly different from the whole-genome average with the same GC content
    - Flanking region...
  - Results: ~150 CNV > 10k, ~100 > 100k per individual
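The depth-based criterion described for the CNV step can be illustrated with a minimal sketch. This is not CNV_Detector's implementation, which is not documented here. It only assumes that each sliding window has a GC percentage and an average depth, that GC levels are integer-rounded percentages, and that "significantly different" means a z-score beyond a cutoff. The function names `gc_depth_model` and `flag_candidate_windows` and the `z_cutoff` parameter are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def gc_depth_model(windows):
    """Estimate the mean and sd of depth for each GC level.

    windows: list of (gc_percent, depth) pairs, one per sliding window.
    GC levels are integer-rounded percentages (an assumption of this sketch).
    """
    depths_by_gc = defaultdict(list)
    for gc, depth in windows:
        depths_by_gc[round(gc)].append(depth)
    return {gc: (mean(d), pstdev(d)) for gc, d in depths_by_gc.items()}


def flag_candidate_windows(windows, z_cutoff=3.0):
    """Return (window_index, z_score) for windows whose depth deviates
    strongly from the mean depth of windows with the same GC level.

    z_cutoff is a hypothetical threshold, not a documented CNV_Detector value.
    """
    model = gc_depth_model(windows)
    flagged = []
    for index, (gc, depth) in enumerate(windows):
        mu, sd = model[round(gc)]
        if sd == 0:
            # All windows at this GC level have identical depth: nothing to flag.
            continue
        z = (depth - mu) / sd
        if abs(z) > z_cutoff:
            flagged.append((index, z))
    return flagged
```

Called on a list of windows where one window has a much higher depth than the others at the same GC level, `flag_candidate_windows` returns that window's index together with its z-score. A real caller would still have to merge adjacent flagged windows into regions and apply the flanking-region criterion, which is not specified on this page.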
BGI file format
BGI delivered all its data both in the raw output format of the tools they use and as a VCF version. Below is a summary of the data and formats, along with notes and known caveats. Unless specified otherwise, all data is aligned to hg19.
- Alignment data
  - BAM format
- CNV
  - CNV Detector output format
    - According to documentation, CNV detector uses Microarray intensity log 2 ratio values as input files. It is not clear what array/files were used for our samples at the moment.
  - VCF format
    - Not assessed yet.
- Indels
  - Samtools pileup format
    - Cannot find a matching description on the samtools website. To be completed when BGI provides the complete format specs.
  - VCF format
    - Not assessed yet.
- SNP calls
  - SOAPsnp output format
  - VCF format
    - Some of the files are in VCF 3.0 format, others in VCF 4.0 format
    - Some files are not sorted correctly; these need to be sorted again in order for most programs to work correctly
    - Some files do not have a proper sample name set (only SAMPLE). This is usually problematic for any program that works on multiple VCF files at the same time.
    - The filter column is either always 0 or always PASS; in both cases it just means that all the SNPs passed the filtering steps and hence does not give any information about unfiltered SNPs.
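Two of the SNP VCF caveats noted above (unsorted records and the placeholder sample name SAMPLE) can be worked around with a small script. The sketch below is one possible approach, not a BGI or GoNL tool. It assumes tab-separated VCF lines with the usual nine fixed columns before the sample columns, and a UCSC-style hg19 contig order (chr1..chr22, chrX, chrY, chrM). Pass a different `contig_order` if your reference uses another order. The names `fix_vcf_lines` and `HG19_CONTIGS` are hypothetical.

```python
# Assumed contig order; adjust to match the reference used by downstream tools.
HG19_CONTIGS = [f"chr{n}" for n in range(1, 23)] + ["chrX", "chrY", "chrM"]


def fix_vcf_lines(lines, sample_name, contig_order=HG19_CONTIGS):
    """Return VCF lines with records sorted and the SAMPLE placeholder renamed.

    lines: iterable of VCF lines (header and records), tab-separated.
    sample_name: name that replaces a sample column literally called "SAMPLE".
    """
    rank = {contig: i for i, contig in enumerate(contig_order)}
    header, records = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#CHROM"):
            cols = line.split("\t")
            # Columns 0-8 are fixed; any further columns are sample names.
            cols[9:] = [sample_name if c == "SAMPLE" else c for c in cols[9:]]
            header.append("\t".join(cols))
        elif line.startswith("#"):
            header.append(line)
        else:
            records.append(line)

    def sort_key(record):
        chrom, pos = record.split("\t", 2)[:2]
        # Known contigs follow contig_order; unknown ones go last, by name.
        return (rank.get(chrom, len(rank)), chrom, int(pos))

    return header + sorted(records, key=sort_key)
```

Sorting on an explicit contig rank rather than on the contig name avoids the lexical ordering in which chr10 would come before chr2. The script does not convert between VCF 3.0 and VCF 4.0 and does not touch the filter column.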
Overview BGI datasets
TODO: More info on statistics and how to access the datasets will be added here. BGI will share with us: SNP calling results (VCF), SOAP format for indels.
Batch | Samples | Lanes | Size | Groningen storage | Grid storage (VO) | Storage additional sites | Analysis site | Status |
1st batch aka Pilot phase | 60 | 183 | | yes | yes (vlemed) | | UMGC, AMC (for comparison) | aligned, in progress |
2nd batch | 90 | 295 | | yes | yes (vlemed) | | AMC | in progress |
3rd batch | 222 | 683 | 10TB | yes | no | | UMGC | in progress |
4th batch | 235 | 630 | 10TB | yes | yes (bbmri.nl) | Hubrecht, EMC | LUMC/TUdelft, UMGC, Hubrecht, EMC | in progress |
5th batch | 153 | | | no | no | | | not on storage yet |
6th batch | 10 | | | no | | | | not arrived yet |
Directories
1st batch
- Groningen
  - Raw data (fastq)
  - Results
- Grid (vlemed)
  - Raw data SRM (fastq) srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/data/vlemed/gvnl/data (access to members of vlemed/gvnl VO)
  - Raw data LFC (fastq) lfn://lfc.grid.sara.nl:5010/grid/vlemed/gvnl/data (access closed)
  - Results LFC lfn://lfc.grid.sara.nl:5010/grid/vlemed/gvnl/data/bam/markduplicates/realignment/fixmates/recalibrate/sorted (access closed)
  - Note: results will be copied to bbmri.nl VO when analysis is done
2nd batch
- Groningen
  - Raw data (fastq)
  - Results (bam)
- Grid (vlemed)
  - Raw data SRM (fastq) srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/data/vlemed/gvnl/data (access to members of vlemed/gvnl VO)
  - Raw data LFC (fastq) lfn://lfc.grid.sara.nl:5010/grid/vlemed/gvnl/data (access closed)
  - Results LFC lfn://lfc.grid.sara.nl:5010/grid/vlemed/gvnl/data/bam/markduplicates/realignment/fixmates/recalibrate/sorted (access closed)
  - Note: results will be copied to bbmri.nl VO when analysis is done
3rd batch
- Groningen
  - Raw data (fastq)
  - Results
4th batch
- Groningen
  - Raw data (fastq)
  - Results
- Grid (bbmri.nl)
  - Raw data SRM (fastq) srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/data/bbmri.nl/fourth_batch (access to members of bbmri.nl VO)
  - Raw data LFC (fastq) lfn://lfc.grid.sara.nl:5010/grid/bbmri.nl/gonl/data/input (access to members of bbmri.nl VO)
  - Results LFC lfn://lfc.grid.sara.nl:5010/grid/bbmri.nl/gonl/data/output (access to members of bbmri.nl VO)