This page is a work in progress describing the data management of the GoNL project.

== UMCG SFTP ==
=== raw_data ===
Contains all the data coming from BGI, including their variant calls.
* /fastq
** A4 trio fasta files
* /hg18
** Pilot snp, cnv, indel, stat files sent by BGI at the beginning of 2011
* /hg19
** A4 trio bam, snp, cnv files sent by BGI in April 2011

=== resources ===
* GoNL resources tarball (Thanks Freerk!)

=== results ===
Contains all the data that has gone through any kind of processing at UMCG.
* /bam/umcg/
** A4 trio complete bam files
** pilot chromosomes 19, 20, X, Y, MT bam files
* /snp/hg18
** Pilot cleaned-up VCF files from BGI on hg18 (sorted, updated to VCF 4.0)
* /snp/hg19
** Pilot initial unfiltered calls from UMCG
** Lifted-over files from BGI

== Millipede File Structure ==
=== Access Rights ===
 * All data should only be writable by their owners
 * All tools and resources should be read/executable by the whole ''gcc'' group
 * All project-specific data and results should be read/executable by the ''gvnl'' group
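
As an illustration of the access rules above, the following Python sketch makes a directory tree writable by its owner only and readable/executable by its group. The function name `apply_access_rights` and the decision to give other users no access at all are assumptions, not part of the documented GoNL setup. Assigning the ''gcc'' or ''gvnl'' group itself (for example with `chgrp`) is deliberately left out.

```python
import os
import stat

# Owner: read/write/execute; group: read/execute; others: no access (assumption).
DIR_MODE = stat.S_IRWXU | stat.S_IRGRP | stat.S_IXGRP   # 0o750
# Owner: read/write; group: read; others: no access (assumption).
FILE_MODE = stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP  # 0o640


def apply_access_rights(root):
    """Walk `root` and restrict write access to the owner (hypothetical helper)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        os.chmod(dirpath, DIR_MODE)
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = FILE_MODE
            # Files that were executable for the owner (e.g. tools) stay
            # executable for the owner and become executable for the group.
            if os.stat(path).st_mode & stat.S_IXUSR:
                mode |= stat.S_IXUSR | stat.S_IXGRP
            os.chmod(path, mode)
```

Calling `apply_access_rights` on a tool, resource, or project folder would then cover the write restriction; which group owns the folder still has to be set separately.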

=== GCC-level Directory Structure ===
The root for all subsequent directories is /data/gcc/

 * /tools
   * Contains all GCC tools, including GoNL tools
   * All tools should be put in a folder using the naming convention: ''toolname-version''
     * Ex: Picard v1.32 should be found in ''/data/gcc/tools/picard-tools-1.32/''
 * /resources
   * Contains all GCC resources, including GoNL resources
   * All resources should be put in a folder whose name specifies their version, normally following the naming convention: ''resourcename-version''
     * Ex: Human Genome build 19 should be found in ''/data/gcc/resources/hg-19/''
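
The ''name-version'' folder convention for tools and resources can be sketched as a small path builder. This is only an illustration: the function `versioned_dir` and the constant `GCC_ROOT` are hypothetical names, and the root is taken from the examples above.

```python
import posixpath

# Root of the GCC-level structure, as used in the examples above.
GCC_ROOT = "/data/gcc"


def versioned_dir(kind, name, version):
    """Return the folder for a tool or resource named `name` at `version`.

    `kind` must be "tools" or "resources" (hypothetical helper).
    """
    if kind not in ("tools", "resources"):
        raise ValueError("kind must be 'tools' or 'resources'")
    return posixpath.join(GCC_ROOT, kind, f"{name}-{version}") + "/"


print(versioned_dir("tools", "picard-tools", "1.32"))
# /data/gcc/tools/picard-tools-1.32/
print(versioned_dir("resources", "hg", "19"))
# /data/gcc/resources/hg-19/
```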

=== GoNL-level Directory Structure ===
The root for all subsequent directories is /data/gcc/projects/gonl/

 * /rawdata
   * Contains all the raw unprocessed data by batch
     * Ex: All raw data for the first batch is located in ''/data/gcc/projects/gonl/rawdata/first_batch/''
 * /results
   * Contains all the results after processing the data
 * /results/BGI
   * Contains all the results from the BGI pipeline (snps, indels, metrics, etc.)
   * The XXX.snp.sorted.vcf files are sorted according to position.
   * The XXX.annotated.txt files are the sorted files annotated with ANNOVAR. The options used for the annotation are:
       * --buildver hg18
       * --annotationDirectory /data/gcc/tools/annovar_2011Jan31/annovar/humandb/
       * --geneBasedAnnotations refgene,knowngene,ensgene
       * --regionBasedAnnotations band,segdup,dgv,gwascatalog
       * --filterBasedAnnotations snp130
       
       Please refer to the ANNOVAR website for a thorough description of the annotations: http://www.openbioinformatics.org/annovar/annovar_db.html


 * /results/immunochip
   * Contains all the results from the immunochip data (cleaned/QCed data, metrics, etc.)
 * /results/pipeline
   * Contains all the results from the sequence data through the GoNL pipeline by batch
     * Ex: Results on the first batch are in ''/data/gcc/projects/gonl/results/pipeline/first_batch''
   * The subdirectory structure for each of the batches should be the following:
     * All results related to a sample should go in /sample_name
       * Ex: All results related to sample A2a (first batch) should go in ''/data/gcc/projects/gonl/results/pipeline/first_batch/A2a''
     * All results related to a lane of a sample should go in /sample_name/lane_name
       * Ex: All results related to sample A2a (first batch), lane FC20005_L1 should go in ''/data/gcc/projects/gonl/results/pipeline/first_batch/A2a/FC20005_L1/''
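
The batch/sample/lane layout for pipeline results can be sketched as follows. The function `pipeline_result_dir` and the constant `GONL_ROOT` are hypothetical names used for illustration; only the directory layout itself comes from this page.

```python
import posixpath

# Root of the GoNL-level structure, as stated above.
GONL_ROOT = "/data/gcc/projects/gonl"


def pipeline_result_dir(batch, sample, lane=None):
    """Return the results folder for a sample, or for one of its lanes.

    Hypothetical helper: sample-level results go in batch/sample,
    lane-level results go in batch/sample/lane.
    """
    parts = [GONL_ROOT, "results", "pipeline", batch, sample]
    if lane is not None:
        parts.append(lane)
    return posixpath.join(*parts)


print(pipeline_result_dir("first_batch", "A2a"))
# /data/gcc/projects/gonl/results/pipeline/first_batch/A2a
print(pipeline_result_dir("first_batch", "A2a", "FC20005_L1"))
# /data/gcc/projects/gonl/results/pipeline/first_batch/A2a/FC20005_L1
```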

=== Pipeline Result Files Naming Convention ===
The following convention applies to all files that are generated by the pipeline. For containing folders, see sections above.

 * General convention
   * Filenames are composed of tokens identifying their content. The tokens are separated by '.' and, if necessary, the words within a token can be separated by '_' for readability.
   * Except where they reference specific names that use another convention (ex: sample name), file names should be all lowercase.
 * Sample-level files should be named using: ''sample_name.step_id.step_name.genome_build.time_stamp.extension''
   * Ex: A vcf file for the sample A2a produced by the step vc02 (step 2 of variant calling) with the tool !UnifiedGenotyper using genome build human_g1k_v37 on a run that began on February 1st 2011 at 12:00 should be named: ''A2a.vc02.unified_genotyper.human_g1k_v37.2011_02_01_12_00.snp''
 * Lane-level files should be named using: ''sample_name.lane_name.step_id.step_name.genome_build.time_stamp.extension''
   * Ex: A bam file for the lane FC20005_L1 of the sample A2a produced by the step pe03 (step 3 of paired-end alignment) with the tool BWA sampe using genome build human_g1k_v37 on a run that began on February 1st 2011 at 12:00 should be named: ''A2a.FC20005_L1.pe03.bwa_sampe.human_g1k_v37.2011_02_01_12_00.bam''
 * Log file names should correspond to their output counterparts and have the .log extension.
   * Ex: log file for the vcf sample-level step above should be: ''A2a.vc02.unified_genotyper.human_g1k_v37.2011_02_01_12_00.log''
   * Ex: log file for the bam lane-level step above should be: ''A2a.FC20005_L1.pe03.bwa_sampe.human_g1k_v37.2011_02_01_12_00.log''
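
The naming convention can be summarized in a short Python sketch that assembles the tokens in the documented order. The function names `result_filename` and `log_filename` are hypothetical, and the time stamp format (year, month, day, hour, minute joined by '_') is inferred from the sample-level example above rather than stated as a rule.

```python
from datetime import datetime


def result_filename(sample, step_id, step_name, genome_build,
                    started, extension, lane=None):
    """Build a pipeline result file name (hypothetical helper).

    Sample and lane names keep their own convention; the step tokens and
    the extension are lowercased, as required for all other tokens.
    """
    tokens = [sample]
    if lane is not None:
        tokens.append(lane)  # lane-level files add the lane after the sample
    tokens += [
        step_id.lower(),
        step_name.lower(),
        genome_build,
        started.strftime("%Y_%m_%d_%H_%M"),  # assumed time stamp layout
        extension.lower(),
    ]
    return ".".join(tokens)


def log_filename(result_name):
    """Return the log file name matching a result file name."""
    stem, _, _ = result_name.rpartition(".")
    return stem + ".log"


run_start = datetime(2011, 2, 1, 12, 0)
vcf = result_filename("A2a", "vc02", "unified_genotyper",
                      "human_g1k_v37", run_start, "snp")
print(vcf)                # A2a.vc02.unified_genotyper.human_g1k_v37.2011_02_01_12_00.snp
print(log_filename(vcf))  # A2a.vc02.unified_genotyper.human_g1k_v37.2011_02_01_12_00.log
```

Passing `lane="FC20005_L1"` produces the lane-level form, with the lane name inserted directly after the sample name.
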
== Logging ==
The logging strategy is currently under development but will be composed of both file logs and database entries in a Molgenis platform. The status is described below.

=== Log Files ===
 * At each step of the pipeline a single log is produced and contains:
   * PBS out and err
   * Tool out and err
   * Other tool-produced logs where applicable
 * For log file naming, see section above.
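
Since the logging strategy is still under development, the following is only a rough sketch of how the separate outputs of one step could be merged into a single log file. The function `write_step_log`, the section header layout, and the "(not available)" placeholder are all assumptions and do not describe the actual pipeline.

```python
from pathlib import Path


def write_step_log(log_path, sections):
    """Merge several output files into one step log (hypothetical helper).

    `sections` is an ordered list of (title, source_path) pairs, for
    example the PBS out/err files followed by the tool out/err files.
    """
    with open(log_path, "w") as log:
        for title, source in sections:
            log.write(f"===== {title} =====\n")
            src = Path(source)
            if src.is_file():
                text = src.read_text()
                log.write(text)
                # Make sure every section ends with a newline.
                if text and not text.endswith("\n"):
                    log.write("\n")
            else:
                # A missing source is recorded rather than silently skipped.
                log.write("(not available)\n")
```

Keeping the sections in a fixed order (PBS first, then the tool, then any extra tool logs) would make logs from different steps easier to compare.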

=== Molgenis ===
The Molgenis platform will be used to provide a more advanced and general view of the status of the pipeline runs (including different views, sorting, etc.). The current status is:

 * Molgenis instance created with proposed model
 * Scripts for insertion under development