Changes between Version 3 and Version 4 of ChipBasedQcPipelineIdea
- Timestamp: Sep 26, 2010 7:44:00 PM
ChipBasedQcPipelineIdea
v3  v4
3   3   Below a detailed outline of steps which should be included in ChipBasedQcPipeline is provided. This document also suggest the CipBasedQcPipelineWorkflow sequence and principal ideas for solutions. This document does not specify exact tools to be used; as most of operations are data manipulations, it will be up to the involved analysts to decide what tool may be more convenient for them and formulate the CipBasedQcPipelineWorkflow. The actual implementation of the workflow should allow automatic reproduction of the results and application of the same workflow to new data. This is not only important from good practice point of view, but also keeping in mind that more data will come to the same pipeline in the future.
4   4
5       The document assumes VcfGtDataF romat (in particular, VCF v.4 format) is used; “+” strand is used for sequencing data. It is assumed that chip data come in ChipGtDataFormat.
    5   The document assumes VcfGtDataFormat (in particular, VCF v.4 format) is used; “+” strand is used for sequencing data. It is assumed that chip data come in ChipGtDataFormat.
6   6
7   7   == CHIP-VCF BUILD AND DBSNP MATCHING TABLE ==
…   …
45  45   * A1VCHIP: first allele the personal genotype, translated according to “+” strand on VCF build <single character, either “A”, “C”, “G” or “T”>
46  46   * A2VCHIP: second allele the personal genotype, translated according to “+” strand on VCF build <single character, either “A”, “C”, “G” or “T”>
47       * GTCHIP: genotype with alleles in alphabetic order, <two characters, each either “A”, “C”, “G” or “T”>
    47   * GTCHIP: genotype with alleles in alphabetic order, <two characters, each either “A”, “C”, “G” or “T”>
48  48
49  49  Questions:
…   …
73  73   * GTVCF: genotype with alleles in alphabetic order, <two characters, each either “A”, “C”, “G” or “T”>. This can be done by mapping the numbers provided in VCF GT field to REF and ALT and then ordering.
74  74   * GQ, DP: directly from VCF file
75       * …: factors potentially associated with quality of the sequencing data, summarized in FactorsRelatedToSeqDataQuality.
    75   * …: factors potentially associated with quality of the sequencing data, summarized in FactorsRelatedToSeqDataQuality.
76  76
77  77  Merge chip and VCF genotypic tables (“chip_genotypes_yyyy.mm.dd.txt” and “VCF_genotypes_yyyy.mm.dd.txt”) using ID and SNPV as key variables. Keep all chip genotypes, substituting missing (“.”) when no information is available from VCF. Name the table “merged_chip_and_VCF_genotypes_yyy.mm.dd.txt”.
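
The GTVCF derivation and the merge step described above could be sketched roughly as follows. This is only an illustration in Python, not a tool prescribed by the page: the function names `gt_to_alleles` and `merge_chip_and_vcf`, the in-memory list-of-dicts representation, and the choice to return “.” for non-diploid or non-single-base calls are all assumptions made for the example. The column names ID, SNPV, GTCHIP, GTVCF, GQ and DP are taken from the text.

```python
def gt_to_alleles(ref, alt, gt):
    """Map a VCF GT value (e.g. '0/1' or '1|2') to REF/ALT bases, ordered alphabetically."""
    alleles = [ref] + alt.split(",")           # index 0 is REF, 1..n are the ALT alleles
    indices = gt.replace("|", "/").split("/")  # treat phased and unphased calls alike
    if len(indices) != 2 or "." in indices:
        return "."                             # assumption: missing or non-diploid -> "."
    called = [alleles[int(i)] for i in indices]
    if any(a not in ("A", "C", "G", "T") for a in called):
        return "."                             # assumption: non-single-base alleles -> "."
    return "".join(sorted(called))


def merge_chip_and_vcf(chip_rows, vcf_rows, vcf_fields=("GTVCF", "GQ", "DP")):
    """Keep every chip row; attach VCF fields matched on (ID, SNPV), or '.' if absent."""
    vcf_by_key = {(row["ID"], row["SNPV"]): row for row in vcf_rows}
    merged = []
    for chip in chip_rows:
        vcf = vcf_by_key.get((chip["ID"], chip["SNPV"]), {})
        out = dict(chip)
        for field in vcf_fields:
            out[field] = vcf.get(field, ".")
        merged.append(out)
    return merged


# Hypothetical example data: two chip genotypes, only one of them covered by the VCF.
chip = [
    {"ID": "P1", "SNPV": "rs1", "GTCHIP": "AG"},
    {"ID": "P1", "SNPV": "rs2", "GTCHIP": "CC"},
]
vcf = [
    {"ID": "P1", "SNPV": "rs1", "GTVCF": gt_to_alleles("G", "A", "0/1"),
     "GQ": "99", "DP": "30"},
]
merged = merge_chip_and_vcf(chip, vcf)
```

Keying the VCF rows by (ID, SNPV) in a dictionary makes the merge a left join on the chip table, so chip genotypes without a VCF counterpart are retained with “.” in the VCF columns, as the text requires.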