wiki:ImputationPipeline

Version 44 (modified by a.kanterakis, 13 years ago)

This page describes the Imputation pipelines developed by the GoNL - Impute team. Please contribute. For help with trac wiki formatting: http://trac.edgewall.org/wiki/WikiFormatting
All scripts presented here are located in our SVN repository: http://www.bbmriwiki.nl/svn/imputation
Minutes of our Team Calls: http://www.bbmriwiki.nl/wiki/Imputations/Minutes

Contributors and Teams

  • UMC Groningen: Alexandros Kanterakis alexandros.kanterakis@…

Study data

Reference data

  • 1000 Genomes October 2011 release
  • GoNL pilot 3, 48 trios, 192 haplotypes

Pre processing

Normalize beagle datasets

  • Location: http://www.bbmriwiki.nl/svn/Imputation/alex/scripts/Normalize_beagle_datasets.ftl

Takes a list of beagle and marker files and applies the following checks:

  • Checks whether the alleles of each SNP are compatible between study and reference. If an incompatibility cannot be corrected by SNP inversion (strand flipping), the SNP is discarded.
  • Checks whether a SNP has null alleles. If so, the SNP is removed from the study data.
  • Checks whether two SNPs with the same reference code (rs) are in the same position.
  • Checks whether two SNPs in the same position have the same reference code (rs).
  • Checks whether a SNP in the study has MAF < MAF_minimum, HWE < HWE_minimum or CR < CR_minimum. If any of these criteria is met, the SNP is discarded. (MAF = Minor Allele Frequency, HWE = Hardy-Weinberg Equilibrium, CR = Call Rate)
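
The allele compatibility check can be illustrated with a minimal Python sketch. This is not the code of Normalize_beagle_datasets.ftl; the function name harmonize_alleles, the return labels and the "0" encoding for a null allele are assumptions made for illustration only.

```python
# Complement of each nucleotide, used to flip a SNP to the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def harmonize_alleles(study_alleles, reference_alleles):
    """Return (status, alleles) for one SNP.

    Both arguments are sets of single-letter alleles, e.g. {"A", "G"}.
    status is "ok", "inverted" or "discard" (hypothetical labels).
    """
    # Assumption: a null allele is encoded as "0"; such SNPs are dropped.
    if "0" in study_alleles:
        return "discard", None
    # The study alleles already agree with the reference.
    if study_alleles <= reference_alleles:
        return "ok", set(study_alleles)
    # Try the opposite strand (SNP inversion). Unknown symbols map to "?",
    # which never matches the reference, so the SNP ends up discarded.
    flipped = {COMPLEMENT.get(allele, "?") for allele in study_alleles}
    if flipped <= reference_alleles:
        return "inverted", flipped
    return "discard", None


if __name__ == "__main__":
    print(harmonize_alleles({"T", "C"}, {"A", "G"}))
    print(harmonize_alleles({"A", "C"}, {"A", "G"}))
```

Note that a comparison of allele sets alone cannot detect a strand problem in A/T or C/G SNPs, because these look identical on both strands.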

It generates a log file with all inconsistencies found. At the end of this file there is a summary of the problems:

  • SNPs inverted: For example, A/G in the reference and T/C in the study.
  • Allele problems: Number of SNPs whose alleles are inconsistent between study and reference and could not be fixed by flipping.
  • Position problems (different rs numbers, same locus): These SNPs are NOT removed. We keep the rs number of the reference panel.
  • Unresolved single allele problems: SNPs in the study that have only one allele. These SNPs are filtered out.
  • Double rs code problems: The same rs code appears more than once. These SNPs are filtered out.
  • SNPs in study with MAF < MAF_minimum: SNPs with MAF below the configured MAF_minimum.
  • SNPs in study with HWE < HWE_minimum: SNPs with HWE below the configured HWE_minimum.
  • SNPs in study with CR < CR_minimum: SNPs with Call Rate below the configured CR_minimum.
  • SNPs that differ in Allele Frequencies: SNPs whose allele frequency (AF) differs between reference and study by more than the configured CR_minimum.
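
As a rough illustration of the MAF, HWE and CR thresholds, the following Python sketch computes the three metrics from the genotype counts of a single SNP. It is not taken from the script. The function names are hypothetical, and the HWE p-value uses a one-degree-of-freedom chi-square test, which is an assumption; the script may use a different test (for example an exact test).

```python
import math


def snp_qc(n_aa, n_ab, n_bb, n_missing):
    """Return (maf, hwe_p, call_rate) from the genotype counts of one SNP."""
    n_called = n_aa + n_ab + n_bb
    n_total = n_called + n_missing
    call_rate = n_called / n_total if n_total else 0.0
    if n_called == 0:
        return 0.0, 0.0, call_rate
    freq_a = (2 * n_aa + n_ab) / (2 * n_called)
    maf = min(freq_a, 1 - freq_a)
    if maf == 0.0:
        # Monomorphic SNP: HWE cannot be tested, report p = 1.
        return maf, 1.0, call_rate
    # Genotype counts expected under Hardy-Weinberg equilibrium.
    expected = (
        n_called * freq_a ** 2,
        2 * n_called * freq_a * (1 - freq_a),
        n_called * (1 - freq_a) ** 2,
    )
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    hwe_p = math.erfc(math.sqrt(chi2 / 2))
    return maf, hwe_p, call_rate


def passes_qc(counts, maf_minimum, hwe_minimum, cr_minimum):
    """A SNP is kept only if none of the three criteria is violated."""
    maf, hwe_p, call_rate = snp_qc(*counts)
    return maf >= maf_minimum and hwe_p >= hwe_minimum and call_rate >= cr_minimum


if __name__ == "__main__":
    # Threshold values below are arbitrary examples, not pipeline defaults.
    print(passes_qc((25, 50, 25, 0), 0.01, 1e-4, 0.95))
```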


Options:

  • input_beagle_study : The study in beagle format
  • input_beagle_reference : The reference in beagle format
  • input_markers_study : The study's markers in beagle format
  • input_markers_reference : The reference's markers in beagle format
  • output_beagle_study : The Normalized output of the study (Use this as "study" for imputation)
  • output_beagle_reference : The Normalized output of the reference (Normally you will not use this file)
  • output_markers_study : The Markers of the normalized study
  • output_markers_reference : The Markers of the normalized reference
  • output_log_filename : the log filename

Imputation software

  • Impute2
  • Beagle
  • MaCH / Minimac

Quality metrics

Convert impute2 gprobs to TPED

This method converts the results of an impute2 imputation to TPED. You can define an R2 threshold to filter the converted SNPs. The R2 used is the allelic R2 described in http://www.sciencedirect.com/science/article/pii/S0002929709000123#sec2.7.2 . To obtain a complete TPED / TFAM dataset, copy the TFAM file from the original study.

Options:

  • input_impute2_gprobs_filename : The gprobs file generated from impute2
  • output_TPED_filename : The output TPED filename
  • output_stats_filename : The file where the R2 estimates will be written. It contains ALL the R2 values, not only those exceeding the threshold
  • chromosome : The chromosome of this study
  • r2_threshold : The R2 threshold
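
The core of the conversion can be sketched in Python as follows. This is a simplified illustration, not the actual script: it turns one line of an impute2 genotype probability file into one TPED line by taking the most likely genotype of each individual. The function name and the call_threshold parameter are hypothetical, and the allelic R2 computation and the r2_threshold filter are left out.

```python
def gprobs_line_to_tped(line, chromosome, call_threshold=0.9):
    """Convert one impute2 gprobs line into one TPED line (hard calls)."""
    fields = line.split()
    # impute2 columns: SNP id, rs id, position, allele A, allele B,
    # followed by three probabilities per individual: P(AA), P(AB), P(BB).
    _snp_id, rs_id, position, allele_a, allele_b = fields[:5]
    probs = [float(value) for value in fields[5:]]
    if len(probs) % 3 != 0:
        raise ValueError("expected three probabilities per individual")
    genotypes = (
        f"{allele_a} {allele_a}",
        f"{allele_a} {allele_b}",
        f"{allele_b} {allele_b}",
    )
    calls = []
    for start in range(0, len(probs), 3):
        triple = probs[start:start + 3]
        best = max(range(3), key=lambda g: triple[g])
        # Assumption: uncertain genotypes are written as missing ("0 0").
        calls.append(genotypes[best] if triple[best] >= call_threshold else "0 0")
    # TPED columns: chromosome, SNP id, genetic distance (0 = unknown),
    # base-pair position, then two alleles per individual.
    return " ".join([str(chromosome), rs_id, "0", position] + calls)


if __name__ == "__main__":
    example = "--- rs1 1000 A G 0.95 0.05 0 0.1 0.8 0.1 0 0 1"
    print(gprobs_line_to_tped(example, 22))
```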

Statistics_of_imputation_results

Computes several statistics of imputation results. This is suitable when we have "real" genotype data to benchmark our imputation pipeline. The computed statistics are:

  • Allelic R2 : according to http://www.sciencedirect.com/science/article/pii/S0002929709000123#sec2.7.2
  • Real_Allelic_R2 : Computes the R2 (or coefficient of determination) between a real and an imputed genotype.
  • Imputation_Allele_Frequency and Standardized_allele_frequency_error : (From: http://www.sciencedirect.com/science/article/pii/S0002929709000123) Allele-frequency error is the difference between the true allele frequency in the sample and the estimated allele frequency in the sample computed from the posterior genotype probabilities. If the three posterior genotype probabilities for an individual are denoted pAA, pAB, and pBB, then the estimated A allele frequency is found by summing (2pAA + pAB) over all individuals and dividing by twice the number of individuals. However, allele-frequency error is difficult to interpret unless the true allele frequency and sample size are known, so it is standardized as:
    abs(p - q) / sqrt(p * (1 - p) / (2 * n))
    where p is the allele frequency in the sample of n individuals from a population in Hardy-Weinberg equilibrium, and q is the estimated allele frequency obtained from the imputed posterior genotype probabilities.
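
The allele frequency formulas above translate directly into a short Python sketch. The function names are hypothetical and not those of the script. Interpreting Real_Allelic_R2 as the squared Pearson correlation between real and imputed allele dosages is also an assumption.

```python
import math


def estimated_allele_frequency(posteriors):
    """Estimated A allele frequency from (pAA, pAB, pBB) tuples, one per individual."""
    n = len(posteriors)
    return sum(2 * p_aa + p_ab for p_aa, p_ab, _p_bb in posteriors) / (2 * n)


def standardized_allele_frequency_error(p, q, n):
    """abs(p - q) / sqrt(p * (1 - p) / (2 * n)) with p true and q estimated frequency."""
    if not 0 < p < 1:
        raise ValueError("undefined for a monomorphic SNP (p must lie in (0, 1))")
    return abs(p - q) / math.sqrt(p * (1 - p) / (2 * n))


def squared_correlation(real, imputed):
    """Squared Pearson correlation between two equally long lists of dosages."""
    n = len(real)
    mean_r = sum(real) / n
    mean_i = sum(imputed) / n
    cov = sum((r - mean_r) * (i - mean_i) for r, i in zip(real, imputed))
    var_r = sum((r - mean_r) ** 2 for r in real)
    var_i = sum((i - mean_i) ** 2 for i in imputed)
    if var_r == 0 or var_i == 0:
        return float("nan")  # correlation is undefined without variation
    return cov ** 2 / (var_r * var_i)


if __name__ == "__main__":
    posteriors = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 0)]
    q = estimated_allele_frequency(posteriors)
    print(q, standardized_allele_frequency_error(0.5, q, len(posteriors)))
```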


Options:

  • input_beagle_dosage_filename : The output of the beagle imputation
  • input_beagle_unimputed_filename : The beagle file with the "real", un-imputed genotypes
  • output_filename : Output filename for the stats

Complete pipelines

Results

References

See also