Test imputation pipeline

Version 10 (modified by a.kanterakis, 13 years ago)

Introduction

The purpose of this run is to test the efficiency of the existing imputation pipelines on the Grid.

Datasets

Reference

The reference dataset was created from the raw VCF data of the 1000 Genomes project.

Study panel

The study panel is an artificial genotype dataset containing the SNP set of the Illumina Hap550 platform. We generated it with the following steps:

  • This command requires 80 GB of memory.
  • Filter the generated gen files to the SNPs in the hap550 file. According to the hapgen2 documentation, you can specify a list of SNPs and restrict file generation to those SNPs (the -t option). I did not use this option; instead I filtered the generated data afterwards to the hap550 positions, in order to check whether the positions indeed matched. Some stats:
    SNPs in 1000Genomes and in Hap550: 546,233
    SNPs in 1000Genomes but not in Hap550: 36,265,509
    SNPs not in 1000Genomes but in Hap550: 935
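
The filtering step and the three counts above could be reproduced along these lines. This is only a minimal sketch, not the script that was actually used: the function names are hypothetical, matching is done on base-pair position alone (so it should be run per chromosome), and the gen files are assumed to carry the position in the third whitespace-separated column.

```python
def overlap_stats(reference_positions, chip_positions):
    """Count positions shared by, or unique to, the reference and the chip."""
    ref = set(reference_positions)
    chip = set(chip_positions)
    return {
        "in_both": len(ref & chip),
        "reference_only": len(ref - chip),
        "chip_only": len(chip - ref),
    }


def filter_gen_lines(gen_lines, chip_positions, position_column=2):
    """Yield only the gen lines whose position is on the chip.

    Assumption: the position is the third column (index 2) of each line.
    """
    keep = set(chip_positions)
    for line in gen_lines:
        fields = line.split()
        # Skip blank or malformed lines that have no position column.
        if len(fields) <= position_column:
            continue
        if int(fields[position_column]) in keep:
            yield line
```

For real data, gen_lines can be an open file handle, and the filtered lines can be written straight to a new file, so the full gen file never has to be held in memory.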
    
