Changes between Version 14 and Version 15 of DataManagement

Jul 15, 2011 8:50:23 AM (12 years ago)
Morris Swertz



  • DataManagement

    v14 v15  
    4 4  * DataManagement/ProjectData - how raw data, intermediate data and result data are organized on the compute clusters
    5 5  * DataManagement/SftpServer - how to access the SFTP server for data sharing
    7 === GCC-level Directory Structure ===
    8 The root for all subsequent directories is ''/data/gcc/''.
    10  * /tools
    11    * Contains all GCC tools, including GoNL tools
    12    * All tools should be put in a folder using the naming convention: ''toolname-version''
    13      * Ex: Picard v1.32 should be found in ''/data/gcc/tools/picard-tools-1.32/''
    14  * /resources
    15    * Contains all GCC resources, including GoNL resources
    16    * All resources should be put in a folder whose name specifies their version, normally following the naming convention: ''resource-version''
    17      * Ex: Human Genome build 19 should be found in ''/data/gcc/resources/hg-19/''
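
The directory convention above can be sketched in a few lines of Python. This is only an illustration: the helper names `tool_dir` and `resource_dir` are hypothetical and are not part of any GCC tooling, and the root is assumed to be the absolute path /data/gcc used in the examples.

```python
from pathlib import PurePosixPath

# Root of the GCC-level directory structure (taken from the examples above).
GCC_ROOT = PurePosixPath("/data/gcc")


def tool_dir(toolname: str, version: str) -> PurePosixPath:
    """Hypothetical helper: folder for a tool, named toolname-version."""
    return GCC_ROOT / "tools" / f"{toolname}-{version}"


def resource_dir(resource: str, version: str) -> PurePosixPath:
    """Hypothetical helper: folder for a resource, named resource-version."""
    return GCC_ROOT / "resources" / f"{resource}-{version}"


print(tool_dir("picard-tools", "1.32"))  # /data/gcc/tools/picard-tools-1.32
print(resource_dir("hg", "19"))          # /data/gcc/resources/hg-19
```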
    19 === Pipeline Result Files Naming Convention ===
    20 The following convention applies to all files that are generated by the pipeline. For containing folders, see sections above.
    22  * General convention
    23    * Filenames are composed of tokens identifying their content. The tokens are separated by '.' and, if necessary, the words within a token can be separated by '_' for readability.
    24    * Except where they reference specific names that use another convention (ex: sample name), file names should be all lowercase.
    25  * Sample-level files should be named using: ''sample_name.step_id.step_name.genome_build.time_stamp.extension''
    26    * Ex: A vcf file for the sample A2a produced by the step vc02 (step 2 of variant calling) with the tool !UnifiedGenotyper using genome build human_g1k_v37 on a run that began on February 1st 2011 at 12:00 should be named: ''A2a.vc02.unified_genotyper.human_g1k_v37.2011_02_01_12_00.snp''
    27  * Lane-level files should be named using: ''sample_name.lane_name.step_id.step_name.genome_build.time_stamp.extension''
    28    * Ex: A bam file for the lane FC20005_L1 of the sample A2a produced by the step pe03 (step 3 of paired-end alignment) with the tool BWA sampe using genome build human_g1k_v37 on a run that began on February 1st 2011 at 12:00 should be named: ''A2a.FC20005_L1.pe03.bwa_sampe.human_g1k_v37.2011_02_01_12_00.bam''
    29  * Log file names should correspond to their output counterparts and have the .log extension.
    30    * Ex: log file for the vcf sample-level step above should be: ''A2a.vc02.unified_genotyper.human_g1k_v37.2011_02_01_12_00.log''
    31    * Ex: log file for the bam lane-level step above should be: ''A2a.FC20005_L1.pe03.bwa_sampe.human_g1k_v37.2011_02_01_12_00.log''
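
As a rough illustration of the naming convention above, the following Python sketch assembles sample-level, lane-level and log file names from their tokens. The function names (`time_stamp`, `result_filename`, `log_filename`) are hypothetical, and two points are assumptions rather than stated rules: the time stamp is formatted as year_month_day_hour_minute, as in the sample-level example, and every token except the sample and lane names is lowercased.

```python
from datetime import datetime
from typing import Optional


def time_stamp(started: datetime) -> str:
    """Format the run start time as YYYY_MM_DD_HH_MM (assumed from the examples)."""
    return started.strftime("%Y_%m_%d_%H_%M")


def result_filename(sample_name: str, step_id: str, step_name: str,
                    genome_build: str, started: datetime, extension: str,
                    lane_name: Optional[str] = None) -> str:
    """Join the tokens with '.'; lane_name is only used for lane-level files."""
    tokens = [sample_name]  # sample names keep their own convention
    if lane_name is not None:
        tokens.append(lane_name)  # lane names keep their own convention too
    # Assumption: the remaining tokens are the ones that must be lowercase.
    tokens += [step_id.lower(), step_name.lower(), genome_build.lower(),
               time_stamp(started), extension.lower()]
    return ".".join(tokens)


def log_filename(output_filename: str) -> str:
    """Replace the last token (the extension) of an output file name with 'log'."""
    stem, _, _extension = output_filename.rpartition(".")
    return f"{stem}.log"


run_start = datetime(2011, 2, 1, 12, 0)
vcf = result_filename("A2a", "vc02", "unified_genotyper",
                      "human_g1k_v37", run_start, "snp")
print(vcf)                # A2a.vc02.unified_genotyper.human_g1k_v37.2011_02_01_12_00.snp
print(log_filename(vcf))  # A2a.vc02.unified_genotyper.human_g1k_v37.2011_02_01_12_00.log

# Lane-level file: same call with the extra lane_name token.
bam = result_filename("A2a", "pe03", "bwa_sampe", "human_g1k_v37",
                      run_start, "bam", lane_name="FC20005_L1")
```

Because the log name is derived from the output name rather than built separately, the two cannot drift apart.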
    32 == Logging ==
    33 The logging strategy is currently under development but will consist of both file logs and database entries in a Molgenis platform. The status is described below.
    35 === Log Files ===
    36  * At each step of the pipeline a single log is produced and contains:
    37    * PBS out and err
    38    * Tool out and err
    39    * Other tool-produced log where applicable
    40  * For log file naming, see section above.
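
One possible way to combine the parts listed above into a single per-step log is sketched below in Python. The layout is purely an assumption, since the actual format is not specified here: the function name `assemble_step_log` and the "===== title =====" section headers are hypothetical.

```python
from typing import Iterable, Tuple


def assemble_step_log(sections: Iterable[Tuple[str, str]]) -> str:
    """Concatenate (title, text) pairs into one log, one headed section each.

    The header layout is an illustrative assumption, not a documented format.
    """
    parts = []
    for title, text in sections:
        parts.append(f"===== {title} =====")
        parts.append(text.rstrip("\n"))  # avoid blank lines piling up between sections
    return "\n".join(parts) + "\n"


step_log = assemble_step_log([
    ("PBS out", "job started\n"),
    ("PBS err", ""),
    ("Tool out", "processed 10 reads\n"),
    ("Tool err", ""),
])
print(step_log)
```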
    42 === Molgenis ===
    43 The Molgenis platform will be used to provide a more advanced and general view of the status of the pipeline runs (including different views, sorting, etc.). The current status is:
    45  * Molgenis instance created with proposed model
    46  * Scripts for insertion under development
      6  * DataManagement/ProjectResources - where the resources and tools used by the pipelines are stored
      7  * DataManagement/FileNameConventions - how files are named so we understand each other