This page is a work in progress describing the data management of GoNL.
UMCG SFTP
raw_data
Contains all the data coming from BGI, including their variant calls.
- /fastq
A4 trio fasta files
- /hg18
Pilot snp, cnv, indel, stat files sent by BGI at the beginning of 2011
- /hg19
A4 trio bam, snp, cnv files sent by BGI in April 2011
resources
- GoNL resources tarball (Thanks Freerk!)
results
Contains all the data that has gone through any kind of processing at UMCG.
- /bam/umcg/
A4 trio complete bam files
Pilot chromosomes 19, 20, X, Y, MT bam files
- /snp/hg18
Pilot cleaned-up VCF files from BGI on hg18 (sorted, updated to VCF 4.0)
- /snp/hg19
Pilot initial unfiltered calls from UMCG
Lifted-over files from BGI
Millipede File Structure
Access rights
- All data should be writable only by its owner
- All tools and resources should be readable and executable by the whole gcc group
- All project-specific data and results should be readable and executable by the gvnl group
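The rules above can be checked mechanically against a file's permission bits. The following is only a minimal sketch: the function names `access_violations` and `check_path` are hypothetical, and it inspects mode bits only, so it does not verify that a path actually belongs to the gcc or gvnl group.

```python
import os
import stat


def access_violations(mode: int, shared_with_group: bool = True) -> list:
    """Return a list of rule violations for a given st_mode value."""
    problems = []
    # Rule: data should be writable only by its owner.
    if mode & (stat.S_IWGRP | stat.S_IWOTH):
        problems.append("writable by non-owner")
    # Rule: shared tools, resources and results should be readable by the group.
    if shared_with_group and not (mode & stat.S_IRGRP):
        problems.append("not readable by group")
    # Directories also need the group execute bit to be traversable.
    if shared_with_group and stat.S_ISDIR(mode) and not (mode & stat.S_IXGRP):
        problems.append("directory not traversable by group")
    return problems


def check_path(path: str) -> list:
    """Apply the same checks to an existing file or directory."""
    return access_violations(os.stat(path).st_mode)
```

A directory with mode 750 passes all three checks, while a file with mode 664 is reported as writable by a non-owner.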
GCC-level Directory Structure
The root for all subsequent directories is /data/gcc/
- /tools
- Contains all GCC tools including GoNL tools
- All tools should be put in a folder using the naming convention: toolname-version
- Ex: Picard v1.32 should be found in /data/gcc/tools/picard-tools-1.32/
- /resources
- Contains all GCC resources including GoNL resources
- All resources should be put in a folder that specifies their version, normally following the naming convention: resource-version
- Ex: Human Genome build 19 should be found in /data/gcc/resources/hg-19/
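The two naming conventions above can be expressed as small path helpers. This is an illustrative sketch only; `tool_dir` and `resource_dir` are hypothetical names, and the root is taken from the examples above.

```python
from pathlib import PurePosixPath

# Root taken from the examples above.
GCC_ROOT = PurePosixPath("/data/gcc")


def tool_dir(toolname: str, version: str) -> PurePosixPath:
    """Folder for a tool, following the toolname-version convention."""
    return GCC_ROOT / "tools" / f"{toolname}-{version}"


def resource_dir(resource: str, version: str) -> PurePosixPath:
    """Folder for a resource, following the resource-version convention."""
    return GCC_ROOT / "resources" / f"{resource}-{version}"


print(tool_dir("picard-tools", "1.32"))  # /data/gcc/tools/picard-tools-1.32
print(resource_dir("hg", "19"))          # /data/gcc/resources/hg-19
```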
GoNL-level Directory structure
The root for all subsequent directories is /data/gcc/projects/gonl/
- /rawdata
- Contains all the raw unprocessed data by batch
- Ex: All raw data for the 1st batch is located in /data/gcc/projects/gonl/rawdata/first_batch/
- /results
- Contains all the results after processing the data
- /results/BGI
- Contains all the results from the BGI pipeline (snps, indels, metrics, etc.)
- The XXX.snp.sorted.vcf files are sorted according to position.
- The XXX.annotated.txt files are the sorted files annotated with ANNOVAR. The options that were used for the annotation are:
- --buildver hg18
- --annotationDirectory /data/gcc/tools/annovar_2011Jan31/annovar/humandb/
- --geneBasedAnnotations refgene,knowngene,ensgene
- --regionBasedAnnotations band,segdup,dgv,gwascatalog
- --filterBasedAnnotations snp130
Please refer to the ANNOVAR website for a thorough description of the annotations: http://www.openbioinformatics.org/annovar/annovar_db.html
- /results/immunochip
- Contains all the results from the immunochip data (cleaned/QCed data, metrics, etc.)
- /results/pipeline
- Contains all the results from the sequence data through the GoNL pipeline by batch
- Ex: Results for the first batch are in /data/gcc/projects/gonl/results/pipeline/first_batch
- The subdirectory structure for each of the batches should be the following:
- All results related to a sample should go in /sample_name
- Ex: All results related to sample A2a (first batch) should go in /data/gcc/projects/gonl/results/pipeline/first_batch/A2a
- All results related to a lane of a sample should go in /sample_name/lane_name
- Ex: All results related to sample A2a (first batch), Lane FC20005_L1 should go in /data/gcc/projects/gonl/results/pipeline/first_batch/A2a/FC20005_L1/
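The batch, sample and lane layout described above could be resolved with a helper along these lines. This is a sketch under the assumption that the pipeline root is /data/gcc/projects/gonl/results/pipeline, as in the sample and lane examples; `result_dir` is a hypothetical name.

```python
from pathlib import PurePosixPath
from typing import Optional

# Pipeline results root, as used in the sample and lane examples above.
PIPELINE_ROOT = PurePosixPath("/data/gcc/projects/gonl/results/pipeline")


def result_dir(batch: str, sample: str, lane: Optional[str] = None) -> PurePosixPath:
    """Return the sample-level folder, or the lane-level folder if a lane is given."""
    path = PIPELINE_ROOT / batch / sample
    return path / lane if lane else path


print(result_dir("first_batch", "A2a"))
print(result_dir("first_batch", "A2a", "FC20005_L1"))
```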
Pipeline Result Files Naming Convention
The following convention applies to all files that are generated by the pipeline. For containing folders, see sections above.
- General convention
- Filenames are composed of tokens identifying their content. The tokens are separated by '.' and, if necessary, the words within a token can be separated by '_' for readability.
- Except where they reference specific names that use another convention (ex: sample name), file names should be all lowercase.
- Sample-level files should be named using: sample_name.step_id.step_name.genome_build.time_stamp.extension
- Ex: A vcf file for the sample A2a produced by the step vc02 (step 2 of variant calling) with the tool UnifiedGenotyper using genome build human_g1k_v37 on a run that began on February 1st 2011 at 12:00 should be named: A2a.vc02.unified_genotyper.human_g1k_v37.2011_02_01_12_00.snp
- Lane-level files should be named using: sample_name.lane_name.step_id.step_name.genome_build.time_stamp.extension
- Ex: A bam file for the lane FC20005_L1 of the sample A2a produced by the step pe03 (step 3 of paired-end alignment) with the tool BWA sampe using genome build human_g1k_v37 on a run that began on February 1st 2011 at 12:00 should be named: A2a.FC20005_L1.pe03.bwa_sampe.human_g1k_v37.2011_02_01_12_00.bam
- Log file names should correspond to their output counterparts and have the .log extension.
- Ex: log file for the vcf sample-level step above should be: A2a.vc02.unified_genotyper.human_g1k_v37.2011_02_01_12_00.log
- Ex: log file for the bam lane-level step above should be: A2a.FC20005_L1.pe03.bwa_sampe.human_g1k_v37.2011_02_01_12_00.log
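The naming convention can be sketched as two small functions. The names `result_filename` and `log_filename` are hypothetical, and the time stamp format is assumed to be year_month_day_hour_minute, as in the sample-level example above.

```python
from datetime import datetime
from typing import Optional


def result_filename(sample: str, step_id: str, step_name: str, genome_build: str,
                    started: datetime, extension: str,
                    lane: Optional[str] = None) -> str:
    """Join the tokens with '.'; the lane token is added only for lane-level files."""
    tokens = [sample]
    if lane:
        tokens.append(lane)
    # Assumed time stamp format: YYYY_MM_DD_HH_MM.
    tokens += [step_id, step_name, genome_build,
               started.strftime("%Y_%m_%d_%H_%M"), extension]
    return ".".join(tokens)


def log_filename(output_filename: str) -> str:
    """Replace the last token (the extension) of an output file name with 'log'."""
    stem, _, _extension = output_filename.rpartition(".")
    return f"{stem}.log"


run_start = datetime(2011, 2, 1, 12, 0)
vcf_name = result_filename("A2a", "vc02", "unified_genotyper",
                           "human_g1k_v37", run_start, "snp")
print(vcf_name)                # A2a.vc02.unified_genotyper.human_g1k_v37.2011_02_01_12_00.snp
print(log_filename(vcf_name))  # A2a.vc02.unified_genotyper.human_g1k_v37.2011_02_01_12_00.log
```

Passing `lane="FC20005_L1"` inserts the lane token directly after the sample name, which gives the lane-level form.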
Logging
The logging strategy is currently under development but will be composed of both file logs and database entries in a Molgenis platform. The status is described below.
Log Files
- At each step of the pipeline a single log is produced and contains:
- PBS out and err
- Tool out and err
- Other tool-produced logs where applicable
- For log file naming, see section above.
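One possible way to assemble the single per-step log is to concatenate the individual outputs under labelled sections. The sketch below is an assumption, not the pipeline's actual implementation: the function name `write_step_log`, the section header format and the "(not produced)" marker are all hypothetical.

```python
from pathlib import Path
from typing import Iterable, Tuple


def write_step_log(log_path: Path, sources: Iterable[Tuple[str, Path]]) -> None:
    """Write one log file containing each source under its own labelled section."""
    with log_path.open("w") as log:
        for label, source in sources:
            # Hypothetical section header; the page does not define a layout.
            log.write(f"===== {label} =====\n")
            if source.exists():
                text = source.read_text()
                log.write(text)
                # Keep sections separated even if a source lacks a final newline.
                if text and not text.endswith("\n"):
                    log.write("\n")
            else:
                log.write("(not produced)\n")
```

The `sources` argument would list, in order, the PBS out and err files, the tool out and err files, and any other tool-produced log, each with a short label.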
Molgenis
The Molgenis platform will be used to provide a more advanced and general view of the status of the pipeline runs (including different views, sorting, etc.). The current status is:
- Molgenis instance created with proposed model
- Scripts for insertion under development