wiki:GoNL_Immunochip_Data_Preparation

GoNL Immunochip Data Preparation for Concordance

This page describes the necessary steps to get a VCF Hg19 file containing the GoNL Immunochip data from the raw/QC'ed Immunochip data in PED format. This is using tools as available in early 2011 and should get much simpler when PLINK/Seq is released.

Here, the procedure is shown for a FORWARD strand PED file. If you have a TOP/TOP PED file, you will not be able to include strand-ambiguous SNPs (A/T, C/G) in your output.

PED to VCF

The following steps explain how to produce a VCF file from PLINK ped files. It is a rather cumbersome process at the moment and should be streamlined when PLINK/Seq is released.

Initial PED to VCF

The only tool to have an easy (yet not completely correct) conversion from PED to VCF is the beta version of PLINK/Seq available -along with instructions- here: http://www.broadinstitute.org/gsa/wiki/index.php/Converting_ped_to_vcf

  • plink108 --file in_plink_filename --recode-vcf --out out_filename

This pre-compiled version can only be run on Linux64 machines and some dependency problems may occur.

Correct the initial VCF file

The initial VCF file produced by PLINK 1.08 does contain the right information, however it is not actually in VCF format. The problem here is that PLINK files specify the Ref/Alt? alleles relative to the dataset where VCF specifies the Ref/Alt? alleles relative to the Human Genome Reference it is aligned on. To correct this VCF file, it is necessary to modify the initial VCF file so that the alleles are relative to the Human Genome Reference and not the dataset anymore.

Create a tab-delimited file containing your loci using BEDTools

First it has to be clear that here BED refers to the UCSC BED format and NOT the PLINK binary file format. To be able to sort the alleles with the Human Genome Reference, we need to access it. As it is a big file, BEDTools function fastaFromBed can extract only the loci of interest (in this case those on the chip) and report them in tab-delimited file.

fastaFromBed needs a UCSC BED file as input. This file is tab-delimited and contains 3 columns: Chrom Start_seq End_seq. As we are only interested in specific loci, Start_seq and End_seq will be 1 base appart so that only the locus of interest is reported in the output file. This file can very easily be generated either from the initial VCF file or the PLINK BIM file:

  • From VCF: grep -v '^#' in.vcf | awk '{OFS="\t";print $1,$2,$2+1}' > out.bed
  • OR (depending on the ref seq) From VCF: grep -v '^#' in.vcf | awk '{OFS="\t";print $1,$2-1,$2}' > out.bed

Once you have the input file, simply run fastaFromBed on it giving the Human Reference corresponding to the chip data as the other input. For more information on fastaFromBed, see the BEDTools Manual.

  • fastaFromBed -fi reference.fa -bed in.bed -fo out.fa.tab -tab

Re-arrange Ref/Alt? alleles based on the Human Genome Reference

Now that we have the Human Reference Genome loci, it is trivial to re-arrange the alleles so that the Ref and Alt alleles correspond to the Human Genome Reference. I wrote a small script, align-vcf-to-ref.pl that does the work provided the correct input. Note that when flipping the order of alleles in the VCF Ref/Alt? columns, one must also flipped the genotypes correctly.

If using align-vcf-to-ref.pl, one also has the option to flip the strand where relevant if the data was not all on forward strand. Note that by doing so, all strand-ambiguous (A/T,C/G) will be removed.

[Optional] Update SNP IDs

Depending on the chip platform, the SNP IDs might not correspond to dbSNP IDs (ex. Illumina Immonchip). Illumina provides a list of the corresponding Illumina-dbSNP names. A small script reannotate-vcf-snp-ids.pl can update the SNP IDs so that they correspond to dbSNP based on the Illumina list (not on SNP location).

[Optional] Flip Strand

A small script, flip-vcf-snp.pl, is available in case you have the following:

  • A VCF file coming from a TOP/TOP PLINK file set
  • A BIM file corresponding to the same dataset but in forward strand

The script can be used to flip the strand according to the BIM file.

Liftover file

The last step in preparing the immunochip data for comparison with the sequence data is to liftover the VCF file to the same Human Genome Reference as the Sequence data so that comparisons can be made. Here is how:

  1. Get the appropriate chain files
  2. Get the appropriate fasta files (you'll need both from-and-to builds fasta files)
  3. Index the fasta files appropriately to get .fai (samtools) and .dict files (Picard) as described on the GATK wiki
  4. Get and run the GATK LiftOver tool
  5. IMPORTANT: Check your VCF file header as some versions of the GATK liftover tool (e.g. v1.0.5083) might mix the individuals in the header (sort alphabetically rather than preserve original order). If the order is changed, then you should copy/paste the original order from the source VCF file.

Note that once the liftover VCF has successfully been created, it can be used to liftover the PLINK files. To do so:

  1. Remove all SNPs that are not present in the new reference VCF file (using plink --extract)
  2. Use the liftover VCF as an input to the liftover-bim.pl tool .

Downloads

All tools developed inhouse at UMCG are available here: http://www.bbmriwiki.nl/svn/ped_to_vcf

Last modified 13 years ago Last modified on Feb 13, 2012 2:30:43 PM