DataManagement – bbmri

Version 12 (modified by Morris Swertz, 14 years ago)

This page is a work in progress describing the data management of the GoNL project.

See also:

DataManagement/ProjectData - how raw data, intermediate data and result data are organized

UMCG SFTP (application20.target.rug.nl)

raw_data

Contains all the data coming from BGI, including their variant calls.

/fastq

A4 trio fasta files

/hg18

Pilot snp, cnv, indel, stat files sent by BGI at the beginning of 2011

/hg19

A4 trio bam, snp, cnv files sent by BGI in April 2011

resources

GoNL resources tarball (Thanks Freerk!)

results

Contains all the data that has gone through any kind of processing at UMCG.

/bam/umcg/

A4 trio complete bam files
Pilot chromosomes 19, 20, X, Y, MT bam files

/snp/hg18

Pilot cleaned-up VCF files from BGI on hg18 (sorted, updated to VCF 4.0)

/snp/hg19

Pilot initial unfiltered calls from UMCG
Lifted-over files from BGI

Millipede File Structure (millipede.service.rug.nl)

Important note: /data/gcc on millipede is soon to be discontinued and will be replaced by /target/gpfs2/gcc (accessible from both Millipede and Application20)

Access rights

All data should only be writable by their owners
All tools and resources should be read/executable by the whole gcc group
All project-specific data and results should be read/executable by the gvnl group
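The access rules above can be checked mechanically. The following is a minimal sketch, assuming the rules map onto standard POSIX permission bits; the function name `check_access` is hypothetical, and group ownership (gcc versus gvnl) is not verified here.

```python
import stat


def check_access(mode: int) -> list:
    """Return the rule violations for a file mode (as found in os.stat().st_mode).

    Assumed mapping of the rules to permission bits:
    - only the owner may write (no group or other write bit)
    - the group must be able to read
    - the group must be able to enter directories (execute bit)
    """
    problems = []
    if mode & (stat.S_IWGRP | stat.S_IWOTH):
        problems.append("writable by non-owner")
    if not mode & stat.S_IRGRP:
        problems.append("not group-readable")
    if stat.S_ISDIR(mode) and not mode & stat.S_IXGRP:
        problems.append("directory not group-executable")
    return problems
```

To check a real path, pass `os.stat(path).st_mode` as the argument; an empty list means no violation was found under these assumptions.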

GCC-level Directory Structure

The root for all subsequent directories is /data/gcc/

/tools
- Contains all GCC tools including GoNL tools
- All tools should be put in a folder using the naming convention: toolname-version
  - Ex: Picard v1.32 should be found in /data/gcc/tools/picard-tools-1.32/
/resources
- Contains all GCC resources including GoNL resources
- All resources should be put in a folder whose name specifies their version. Normally, the name should follow the convention: resource-version
  - Ex: Human Genome build 19 should be found in /data/gcc/resources/hg-19/
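The name-version convention for tools and resources can be sketched as a small path builder. This is only an illustration: the helper name `versioned_dir` is hypothetical, and the root /data/gcc is taken from the examples above.

```python
from pathlib import PurePosixPath

# Root taken from the examples above (an assumption if the data has moved).
GCC_ROOT = PurePosixPath("/data/gcc")


def versioned_dir(kind: str, name: str, version: str) -> PurePosixPath:
    """Build the directory for a tool or resource as <root>/<kind>/<name>-<version>."""
    if kind not in ("tools", "resources"):
        raise ValueError("kind must be 'tools' or 'resources'")
    return GCC_ROOT / kind / f"{name}-{version}"
```

For instance, `versioned_dir("tools", "picard-tools", "1.32")` yields /data/gcc/tools/picard-tools-1.32, matching the Picard example.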

GoNL-level Directory structure

The root for all subsequent directories is /data/gcc/projects/gonl/

/rawdata
- Contains all the raw unprocessed data by batch
  - Ex: All raw data for the 1st batch is located in /data/gcc/projects/gonl/rawdata/first_batch/
/results
- Contains all the results after processing the data
/results/BGI
- Contains all the results from the BGI pipeline (snps, indels, metrics, etc.)
- The XXX.snp.sorted.vcf files are sorted according to position.
- The XXX.annotated.txt files are the sorted files annotated with ANNOVAR. The options that were used for the annotation are:
  - --buildver hg18
  - --annotationDirectory /data/gcc/tools/annovar_2011Jan31/annovar/humandb/
  - --geneBasedAnnotations refgene,knowngene,ensgene
  - --regionBasedAnnotations band,segdup,dgv,gwascatalog
  - --filterBasedAnnotations snp130

Please refer to ANNOVAR website for a thorough description of the annotations: http://www.openbioinformatics.org/annovar/annovar_db.html

/results/immunochip
- Contains all the results from the immunochip data (cleaned/QCed data, metrics, etc.)
/results/pipeline
- Contains all the results from the sequence data through the GoNL pipeline by batch
  - Ex: Results on the first batch are in /data/gcc/projects/gonl/results/pipeline/first_batch
- The subdirectory structure for each of the batches should be the following:
  - All results related to a sample should go in /sample_name
    - Ex: All results related to sample A2a (first batch) should go in /data/gcc/projects/gonl/results/pipeline/first_batch/A2a
  - All results related to a lane of a sample should go in /sample_name/lane_name
    - Ex: All results related to sample A2a (first batch), Lane FC20005_L1 should go in /data/gcc/projects/gonl/results/pipeline/first_batch/A2a/FC20005_L1/
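The batch/sample/lane layout described above can be expressed as a small helper. This is a sketch under the assumption that the pipeline root is /data/gcc/projects/gonl/results/pipeline; the function name `result_dir` is hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional

# Pipeline results root, as used in the examples above.
PIPELINE_ROOT = PurePosixPath("/data/gcc/projects/gonl/results/pipeline")


def result_dir(batch: str, sample: str, lane: Optional[str] = None) -> PurePosixPath:
    """Return the sample-level directory, or the lane-level one if a lane is given."""
    path = PIPELINE_ROOT / batch / sample
    return path / lane if lane else path
```

Calling `result_dir("first_batch", "A2a", "FC20005_L1")` gives the lane-level directory from the example, and omitting the lane gives the sample-level directory.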

Pipeline Result Files Naming Convention

The following convention applies to all files that are generated by the pipeline. For containing folders, see sections above.

General convention
- Filenames are composed of tokens identifying their content. The tokens are separated by '.' and, if necessary, the words within a token can be separated by '_' for readability.
- Except where they reference specific names that follow another convention (ex: sample name), file names should be all lowercase.
Sample-level files should be named using: sample_name.step_id.step_name.genome_build.time_stamp.extension
- Ex: A vcf file for the sample A2a produced by the step vc02 (step 2 of variant calling) with the tool UnifiedGenotyper using genome build human_g1k_v37 on a run that began on February 1st 2011 at 12:00 should be named: A2a.vc02.unified_genotyper.human_g1k_v37.2011_02_01_12_00.snp
Lane-level files should be named using: sample_name.lane_name.step_id.step_name.genome_build.time_stamp.extension
- Ex: A bam file for the lane FC20005_L1 of the sample A2a produced by the step pe03 (step 3 of paired-end alignment) with the tool BWA sampe using genome build human_g1k_v37 on a run that began on February 1st 2011 at 12:00 should be named: A2a.FC20005_L1.pe03.bwa_sampe.human_g1k_v37.2011_02_01_12_00.bam
Log file names should correspond to their output counterparts and have the .log extension.
- Ex: log file for the vcf sample-level step above should be: A2a.vc02.unified_genotyper.human_g1k_v37.2011_02_01_12_00.log
- Ex: log file for the bam lane-level step above should be: A2a.FC20005_L1.pe03.bwa_sampe.human_g1k_v37.2011_02_01_12_00.log
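The naming convention can be sketched as two small helpers. The function names `result_filename` and `log_filename` are hypothetical, and the time stamp layout (year_month_day_hour_minute) is inferred from the sample-level example rather than stated explicitly on this page.

```python
from datetime import datetime
from typing import Optional


def result_filename(sample: str, step_id: str, step_name: str, genome_build: str,
                    started: datetime, extension: str,
                    lane: Optional[str] = None) -> str:
    """Join the naming tokens with '.'; the lane token is included only for lane-level files."""
    # Assumed time stamp layout: year_month_day_hour_minute of the run start.
    stamp = started.strftime("%Y_%m_%d_%H_%M")
    tokens = [sample]
    if lane:
        tokens.append(lane)
    tokens += [step_id, step_name, genome_build, stamp, extension]
    return ".".join(tokens)


def log_filename(output_name: str) -> str:
    """Replace the final extension of an output file name with 'log'."""
    stem, _, _ = output_name.rpartition(".")
    return stem + ".log"
```

With a run start of `datetime(2011, 2, 1, 12, 0)`, `result_filename("A2a", "vc02", "unified_genotyper", "human_g1k_v37", start, "snp")` reproduces the sample-level example, and passing that name to `log_filename` gives the matching log name.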

Logging

The logging strategy is currently under development but will be composed of both file logs and database entries in a Molgenis platform. The status is described below.

Log Files

At each step of the pipeline a single log is produced and contains:
- PBS out and err
- Tool out and err
- Other tool-produced log where applicable
For log file naming, see section above.

Molgenis

The Molgenis platform will be used to provide a more advanced and general view of the status of the pipeline runs (including different views, sorting, etc.). The current status is:

- Molgenis instance created with proposed model
- Scripts for insertion under development
