Changes between Version 4 and Version 5 of ImputationPipeline

Sep 21, 2010 8:33:05 PM (14 years ago)
Morris Swertz



  • ImputationPipeline

    v4 v5  
    7070TODO: paste shell script descriptions of each step
     72== Discussion ==
     74=== Mixing platforms may influence imputation results ===
     76We ran into some trouble: our test statistic was highly inflated, which is indicative of false-positive results. We thought of some possible causes that might explain this effect, although we still need to test them:
     77 * '''SNPs with bad imputation quality''': we should remove SNPs with an R2 value < 0.90 prior to GWAS. These values are stored alongside the beagle imputation output. Taking a more stringent cutoff seemed to decrease the inflation, although you lose half of the SNPs.
     78 * '''batch effects caused by overrepresentation of a certain haplotype within an imputation batch''': for each batch of samples, beagle estimates a best-fitting model to predict the genotypes of the missing SNPs, which depends on both the input data and the reference dataset. Cases and controls should therefore be randomly distributed across the batches. Another option is to use impute rather than beagle, since it batches across parts of the genome instead of across samples.
     79 * '''difference in source platform''': different platforms have different SNP content. When you impute datasets coming from different platforms, the resulting model, which is based on the input data, also differs. When associating traits in a GWAS meta-analysis, these differences may show up as a platform-specific effect. We should therefore remove the SNPs that do not overlap between the platforms prior to imputation, and impute the samples after combining the datasets. This would remove the platform bias, but it would also cause a huge loss of available SNPs when the overlap between platforms is small. However, in my opinion, this problem is similar to the batch effect problem and can possibly be resolved by randomizing the sample content of the batches: the model would then be fitted to whatever data is available. In any case, the datasets that are used in a meta-analysis should be imputed together.
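
The three countermeasures above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's actual code: the function names, the dict and tuple input shapes, and the seed parameter are all hypothetical, and reading the R2 values from the beagle output is left out because the file layout is not described here. The batch assignment shuffles cases and controls separately so that each batch gets a similar case/control mix, which is one possible way to randomize the batches.

```python
import random


def filter_by_r2(r2_by_snp, threshold=0.90):
    """Return the SNP ids whose imputation R2 is at or above the threshold.

    r2_by_snp is a hypothetical dict {snp_id: r2}; how it is loaded from the
    imputation output is not shown here.
    """
    return {snp for snp, r2 in r2_by_snp.items() if r2 >= threshold}


def assign_batches(samples, n_batches, seed=0):
    """Randomly spread cases and controls over the imputation batches.

    samples is a hypothetical list of (sample_id, status) tuples. Each status
    group is shuffled and then dealt round-robin over the batches, so no
    batch ends up dominated by cases or by controls.
    """
    rng = random.Random(seed)
    by_status = {}
    for sample_id, status in samples:
        by_status.setdefault(status, []).append(sample_id)

    batches = [[] for _ in range(n_batches)]
    offset = 0  # continue dealing where the previous group stopped
    for status in sorted(by_status):
        ids = by_status[status]
        rng.shuffle(ids)
        for i, sample_id in enumerate(ids):
            batches[(offset + i) % n_batches].append((sample_id, status))
        offset += len(ids)
    return batches


def shared_snps(*platform_snp_sets):
    """Return the SNP ids that are present on every platform."""
    return set.intersection(*(set(s) for s in platform_snp_sets))


if __name__ == "__main__":
    print(sorted(filter_by_r2({"rs1": 0.95, "rs2": 0.50, "rs3": 0.90})))
    demo = [("s%d" % i, "case" if i < 6 else "control") for i in range(12)]
    for batch in assign_batches(demo, n_batches=3):
        print(batch)
    print(sorted(shared_snps({"rs1", "rs2"}, {"rs2", "rs3"})))
```

With shared_snps applied before imputation and assign_batches applied to the combined dataset, the two approaches discussed in the last bullet can be compared on the same samples.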