Idea of the phasing pipeline
The main goal of this project is to provide accurately phased genotypes for use in imputation.
Three sources of phasing information are available:
- alleles observed on the same (paired-end) reads, as represented in the VCF data (see VcfGtDataFormat)
- the trio structure of the data; note that within a trio, phasing is unambiguous except at sites where all three members are heterozygous, and recombination points cannot be inferred unless other sources of information are used
- the information from remote relatives, i.e. population linkage disequilibrium structure
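As an illustration of the second source, here is a minimal sketch of trio-based phasing at a single biallelic site. The function name `phase_child` and the genotype encoding (0 = reference allele, 1 = alternative allele) are assumptions made for this example only and are not part of any pipeline named on this page. The function returns `None` when the trio alone does not determine the phase, for instance when all three members are heterozygous, or when the genotypes are Mendelian-inconsistent.

```python
from typing import Optional, Tuple

# Hypothetical encoding: an unordered pair of alleles, 0 = ref, 1 = alt.
Genotype = Tuple[int, int]


def phase_child(
    child: Genotype, father: Genotype, mother: Genotype
) -> Optional[Tuple[int, int]]:
    """Return (paternal_allele, maternal_allele) for the child, or None.

    None means the site is Mendelian-inconsistent or cannot be phased
    from the trio alone.
    """
    # Enumerate every transmission (one allele per parent) that
    # reproduces the child's unordered genotype.
    options = {
        (p, m)
        for p in set(father)
        for m in set(mother)
        if sorted((p, m)) == sorted(child)
    }
    # Exactly one compatible transmission: the trio determines the phase.
    return next(iter(options)) if len(options) == 1 else None
```

For example, a heterozygous child with a homozygous-reference father must carry the alternative allele on the maternal haplotype, so `phase_child((0, 1), (0, 0), (0, 1))` yields `(0, 1)`.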
If the sequencing data were free of errors and missing data, the phasing task could be formulated easily (although the technical implementation would still be demanding). In the first stage of this project, we will assume 'clean' data and use the output of the VCF QC pipelines (ChipBasedQcPipeline and MendelianQcPipeline).
Next, we will explore the opportunity to carry out phasing and call improvement at the same time, taking into account the uncertainty in the sequence data. This will be done using data that have not yet passed MendelianQcPipeline, by extending MendelianQcPipeline and integrating it into TrioAwarePhasingPipeline.
It is well recognized that at 12x coverage there is an appreciable chance (roughly estimated at ~1%) that a heterozygous genotype will not be called. Furthermore, for any given individual a certain proportion of the genome will not be covered well; genotypes in these regions cannot be called, or will be called with low quality. The effect of such errors and missing data on subsequent imputation may be large. Another factor affecting the quality of subsequent imputation is the quality of phasing.
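To get a feel for the order of magnitude of missed heterozygotes, here is a back-of-the-envelope sketch under strong simplifying assumptions that are not taken from this page: read depth is Poisson-distributed, each read samples either allele with probability 1/2, there are no sequencing errors, and a heterozygote is called only if each allele is supported by at least `min_reads` reads. Both the function name and the `min_reads` threshold are hypothetical; a real genotype caller uses a likelihood model rather than a hard threshold.

```python
import math


def prob_het_missed(mean_depth: float, min_reads: int = 2) -> float:
    """Probability that at least one allele of a heterozygote is
    supported by fewer than min_reads reads (toy model, see assumptions)."""
    # With Poisson depth and fair allele sampling, the per-allele read
    # counts are independent Poisson variables with mean mean_depth / 2.
    lam = mean_depth / 2.0
    # P(fewer than min_reads reads for one given allele).
    p_low = sum(
        math.exp(-lam) * lam**k / math.factorial(k) for k in range(min_reads)
    )
    # The site looks homozygous if either allele is under-supported.
    return 1.0 - (1.0 - p_low) ** 2


print(f"{prob_het_missed(12.0):.4f}")      # threshold of 2 reads per allele
print(f"{prob_het_missed(12.0, 1):.4f}")   # threshold of 1 read per allele
```

Under these assumptions the miss rate at 12x is of the order of a percent, and it is very sensitive to the chosen threshold, so the numbers should be read as illustrative only.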
These problems can be addressed within the same framework. Phasing information gives us the means to fill in missing genotypes and to correct erroneously called ones. For example, if a person's coverage is low in a certain region, we can use information from first-degree (and much more remote) relatives to infer the genotypes there. Sequencing errors can be detected in much the same way. Thus, phasing and imputation offer an attractive opportunity not only to phase the data, but also to reduce sequencing errors and the proportion of missing data.
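The simplest case of this idea, using only the parents of a trio, can be sketched as follows. The names `possible_child_genotypes` and `fill_or_flag` and the genotype encoding (0 = reference, 1 = alternative, `None` = missing call) are assumptions made for illustration; they do not describe MendelianQcPipeline or any other existing pipeline. A missing child genotype is filled in only when the parental genotypes allow a single outcome, and a called genotype is flagged when no Mendelian transmission can explain it.

```python
from itertools import product
from typing import Optional, Set, Tuple

# Hypothetical encoding: sorted pair of alleles, 0 = ref, 1 = alt.
Genotype = Tuple[int, int]


def possible_child_genotypes(father: Genotype, mother: Genotype) -> Set[Genotype]:
    """All child genotypes obtainable by one allele from each parent."""
    return {tuple(sorted((p, m))) for p, m in product(father, mother)}


def fill_or_flag(
    child: Optional[Genotype], father: Genotype, mother: Genotype
) -> Tuple[Optional[Genotype], bool]:
    """Return (genotype, flagged_as_inconsistent) for the child."""
    allowed = possible_child_genotypes(father, mother)
    if child is None:
        # Missing call: fill in only when the parents allow one genotype.
        filled = next(iter(allowed)) if len(allowed) == 1 else None
        return filled, False
    observed = tuple(sorted(child))
    # Called genotype: flag it when no Mendelian transmission explains it.
    return observed, observed not in allowed
```

For example, `fill_or_flag(None, (0, 0), (1, 1))` fills in `(0, 1)`, whereas `fill_or_flag((1, 1), (0, 0), (0, 1))` keeps the call but flags it as a likely error in one of the three genotypes. Using more remote relatives requires haplotype-level (linkage disequilibrium) models and is beyond this sketch.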
NOTE: We need to come up with a plan for using these sources of information and for coordinating MendelianQcPipeline, TrioAwareVariantDiscoveryPipeline and TrioAwarePhasingPipeline. The plan should be well tailored to ImputationPipeline.