This page is a work in progress describing the data management of GoNL.

== UMCG SFTP ==

=== raw_data ===
Contains all the data coming from BGI, including their variant calls.
* /fastq
** A4 trio fasta files
* /hg18
** Pilot snp, cnv, indel, stat files sent by BGI at the beginning of 2011
* /hg19
** A4 trio bam, snp, cnv files sent by BGI in April 2011

=== resources ===
* GoNL resources tarball (Thanks Freerk!)

=== results ===
Contains all the data that has gone through any kind of processing at UMCG.
* /bam/umcg/
** A4 trio complete bam files
** Pilot chromosomes 19, 20, X, Y, MT bam files
* /snp/hg18
** Pilot cleaned-up VCF files from BGI on hg18 (sorted, updated to VCF4.0)
* /snp/hg19
** Pilot initial unfiltered calls from UMCG
** Lifted-over files from BGI

== Millipede File Structure ==

=== '''Access rights''' ===
* All data should only be writable by their owners
* All tools and resources should be read/executable by the whole ''gcc'' group
* All project-specific data and results should be read/executable by the ''gvnl'' group

=== GCC-level Directory Structure ===
The root for all subsequent directories is /data/gcc/
* /tools
** Contains all GCC tools, including GoNL tools
** All tools should be put in a folder using the naming convention: ''toolname-version''
** Ex: Picard v1.32 should be found in ''/data/gcc/tools/picard-tools-1.32/''
* /resources
** Contains all GCC resources, including GoNL resources
** All resources should be put in a folder that specifies their version, normally following the convention ''resource-version''
** Ex: Human Genome build 19 should be found in ''/data/gcc/resources/hg-19/''

=== GoNL-level Directory Structure ===
The root for all subsequent directories is /data/gcc/projects/gonl/
* /rawdata
** Contains all the raw unprocessed data by batch
** Ex: All raw data for the 1st batch is located in ''/data/gcc/projects/gonl/rawdata/first_batch/''
* /results
** Contains all the results after processing the data
* /results/BGI
** Contains all the results from the BGI pipeline (snps, indels, metrics, etc.)
** The XXX.snp.sorted.vcf files are sorted according to position.
** The XXX.annotated.txt files are the sorted files annotated with ANNOVAR. The options that were used for the annotation are:
*** --buildver hg18
*** --annotationDirectory /data/gcc/tools/annovar_2011Jan31/annovar/humandb/
*** --geneBasedAnnotations refgene,knowngene,ensgene
*** --regionBasedAnnotations band,segdup,dgv,gwascatalog
*** --filterBasedAnnotations snp130
** Please refer to the ANNOVAR website for a thorough description of the annotations: http://www.openbioinformatics.org/annovar/annovar_db.html
* /results/immunochip
** Contains all the results from the immunochip data (cleaned/QCed data, metrics, etc.)
* /results/pipeline
** Contains all the results of running the sequence data through the GoNL pipeline, organized by batch
*** Ex: Results on the first batch are in ''/data/gcc/projects/gonl/results/pipeline/first_batch''
** The subdirectory structure for each of the batches should be the following:
*** All results related to a sample should go in /sample_name
**** Ex: All results related to sample A2a (first batch) should go in ''/data/gcc/projects/gonl/results/pipeline/first_batch/A2a''
*** All results related to a lane of a sample should go in /sample_name/lane_name
**** Ex: All results related to sample A2a (first batch), lane FC20005_L1 should go in ''/data/gcc/projects/gonl/results/pipeline/first_batch/A2a/FC20005_L1/''

=== Pipeline Result Files Naming Convention ===
The following convention applies to all files that are generated by the pipeline. For containing folders, see sections above.
* General convention
** Filenames are composed of tokens identifying their content. The tokens are separated by '.' and, if necessary, the words within a token can be separated by '_' for readability.
** Except where they reference specific names that use another convention (e.g. sample names), file names should be all lowercase.
* Sample-level files should be named using: ''sample_name.step_id.step_name.genome_build.time_stamp.extension''
** Ex: A vcf file for the sample A2a produced by the step vc02 (step 2 of variant calling) with the tool !UnifiedGenotyper using genome build human_g1k_v37 on a run that began on February 1st 2011 at 12:00 should be named: ''A2a.vc02.unified_genotyper.human_g1k_v37.2011_02_01_12_00.snp''
* Lane-level files should be named using: ''sample_name.lane_name.step_id.step_name.genome_build.time_stamp.extension''
** Ex: A bam file for the lane FC20005_L1 of the sample A2a produced by the step pe03 (step 3 of paired-end alignment) with the tool BWA sampe using genome build human_g1k_v37 on a run that began on February 1st 2011 at 12:00 should be named: ''A2a.FC20005_L1.pe03.bwa_sampe.human_g1k_v37.2011_02_01_12_00.bam''
* Log file names should correspond to their output counterparts and have the .log extension.
** Ex: The log file for the vcf sample-level step above should be: ''A2a.vc02.unified_genotyper.human_g1k_v37.2011_02_01_12_00.log''
** Ex: The log file for the bam lane-level step above should be: ''A2a.FC20005_L1.pe03.bwa_sampe.human_g1k_v37.2011_02_01_12_00.log''

== Logging ==
The logging strategy is currently under development but will be composed of both file logs and database entries in a Molgenis platform. The status is described below.

=== Log Files ===
* At each step of the pipeline a single log is produced and contains:
** PBS out and err
** Tool out and err
** Other tool-produced logs where applicable
* For log file naming, see the Pipeline Result Files Naming Convention section above.

=== Molgenis ===
The Molgenis platform will be used to provide a more advanced and general view of the status of the pipeline runs (including different views, sorting, etc.). The current status is:
* Molgenis instance created with proposed model
* Scripts for insertion under development
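
As an illustration of the rules in the Pipeline Result Files Naming Convention section and the per-batch results layout, the following Python sketch assembles file names and result directories from their tokens. It is only a sketch: the function names (''result_filename'', ''log_filename'', ''result_dir'') are hypothetical and not part of the GoNL pipeline, and the time stamp is assumed to follow the pattern year_month_day_hour_minute, as in the sample-level example.

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import Optional

# Root of the per-batch pipeline results, as given in the directory structure section.
PIPELINE_ROOT = PurePosixPath("/data/gcc/projects/gonl/results/pipeline")


def result_filename(sample: str, step_id: str, step_name: str, genome_build: str,
                    started: datetime, extension: str,
                    lane: Optional[str] = None) -> str:
    """Join the naming tokens with '.'; the lane token is added for lane-level files only."""
    # Assumed time stamp layout: year_month_day_hour_minute.
    time_stamp = started.strftime("%Y_%m_%d_%H_%M")
    tokens = [sample]  # sample names keep their own convention (not lowercased)
    if lane is not None:
        tokens.append(lane)  # lane names also keep their own convention
    tokens += [step_id.lower(), step_name.lower(), genome_build, time_stamp, extension.lower()]
    return ".".join(tokens)


def log_filename(output_filename: str) -> str:
    """Replace the last token (the extension) of an output file name with 'log'."""
    stem, _, _ = output_filename.rpartition(".")
    return stem + ".log"


def result_dir(batch: str, sample: str, lane: Optional[str] = None) -> str:
    """Return the results directory for a sample, or for one lane of that sample."""
    path = PIPELINE_ROOT / batch / sample
    if lane is not None:
        path = path / lane
    return str(path)


started = datetime(2011, 2, 1, 12, 0)
sample_file = result_filename("A2a", "vc02", "unified_genotyper",
                              "human_g1k_v37", started, "snp")
lane_file = result_filename("A2a", "pe03", "bwa_sampe",
                            "human_g1k_v37", started, "bam", lane="FC20005_L1")

print(sample_file)   # A2a.vc02.unified_genotyper.human_g1k_v37.2011_02_01_12_00.snp
print(lane_file)
print(log_filename(sample_file))
print(result_dir("first_batch", "A2a", "FC20005_L1"))
```

Deriving the log name from the output name, rather than rebuilding it from tokens, keeps each log file paired with its output counterpart even if the token list changes later.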