18 | | To calculate the concordance between the different files, [http://vcftools.sourceforge.net/ VCFTools] was used. More specifically: |
19 | | <pre>vcftools --vcf /data/lfrancioli/immunochip/hg19/GvNL.hg19.final.vcf --indv ${sample} --diff /data/lfrancioli/results/pilot/${sample}.human_g1k_v37.immuno.vcf --diff-site-discordance --diff-indv-discordance --diff-discordance-matrix</pre> |
20 | | This computes the concordance per file, site and individual as well as a discordance matrix. Because it was applied at the sample level, only the file, site and discordance matrix outputs were actually used. |
| 18 | To calculate the concordance between the different files, [http://vcftools.sourceforge.net/ VCFTools] was used. More specifically: <pre>vcftools --vcf /data/lfrancioli/immunochip/hg19/GvNL.hg19.final.vcf --indv ${sample} --diff /data/lfrancioli/results/pilot/${sample}.human_g1k_v37.immuno.vcf --diff-site-discordance --diff-indv-discordance --diff-discordance-matrix</pre> This computes the concordance per file, site and individual as well as a discordance matrix. Because it was applied at the sample level, only the file, site and discordance matrix outputs were actually used. |
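Since the command above is parameterised by `${sample}`, it was presumably run once per individual. The following is a minimal sketch of such a loop, not the script that was actually used: the function name `compare_sample`, the `DRY_RUN` switch and the `--out` prefix are assumptions added for illustration, while the paths and the `--diff-*` options are taken from the command above.

```shell
#!/bin/sh
# Sketch: run the vcftools comparison once per sample ID given on the command line.
CHIP_VCF=/data/lfrancioli/immunochip/hg19/GvNL.hg19.final.vcf
SEQ_DIR=/data/lfrancioli/results/pilot

compare_sample() {
    sample=$1
    # Assumption: when DRY_RUN is set and non-empty, the command is printed
    # instead of executed, so the loop can be checked without vcftools.
    ${DRY_RUN:+echo} vcftools --vcf "$CHIP_VCF" --indv "$sample" \
        --diff "$SEQ_DIR/$sample.human_g1k_v37.immuno.vcf" \
        --diff-site-discordance --diff-indv-discordance --diff-discordance-matrix \
        --out "$sample"   # assumption: one output prefix per sample
}

for sample in "$@"; do
    compare_sample "$sample"
done
```

Saved under a hypothetical name such as `run_concordance.sh`, it could be invoked as `DRY_RUN=1 sh run_concordance.sh A8a A8c` to print the commands before running them for real.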
40 | | For reporting purposes, R scripts were created. These scripts all take files created using vcftools-diff_site-condordance.pl or vcftools-discordance-matrix.py as input. |
41 | | The following scripts are available: |
42 | | * plot_shared_loci.R |
43 | | ** Plots the shared/unique loci in the two datasets per individual as a barplot |
44 | | ** Usage: Rscript plot_shared_loci.R <concordance_file> <out_plot.jpg> [name_dataset1] [name_dataset2] |
45 | | * plot_geno_concordance.R |
46 | | ** Plots the genotype concordance between two datasets per individual as a barplot |
47 | | ** Usage: Rscript plot_geno_concordance.R <concordance_file> <out_plot.jpg> [plot_title] |
48 | | * plot_discordance_matrix.R |
49 | | ** Plots the genotype discordance by "discordance type" (0/0 -> 0/1, 0/0 -> 1/1, 0/1 -> 0/0, etc.) |
50 | | ** Usage: Rscript plot_discordance_matrix.R <discordance_matrix_file> <out_plot.jpg> [dataset1_name] [dataset2_name] [show_concordant_data=FALSE] [<concordance_file>] |
51 | | ** Note: |
52 | | *** The last optional argument is a concordance file over the same data; it is used to plot as 'unknown' all loci that were not captured by the concordance matrix because the alleles were not exact matches (e.g. if one of the alleles was monomorphic in one set). |
| 38 | For reporting purposes, R scripts were created. These scripts all take files created using vcftools-diff_site-condordance.pl or vcftools-discordance-matrix.py as input. The following scripts are available: |
| 39 | |
| 40 | * plot_shared_loci.R |
| 41 | |
| 42 | ** Plots the shared/unique loci in the two datasets per individual as a barplot ** Usage: Rscript plot_shared_loci.R <concordance_file> <out_plot.jpg> [name_dataset1] [name_dataset2] |
| 43 | |
| 44 | * plot_geno_concordance.R |
| 45 | |
| 46 | ** Plots the genotype concordance between two datasets per individual as a barplot ** Usage: Rscript plot_geno_concordance.R <concordance_file> <out_plot.jpg> [plot_title] |
| 47 | |
| 48 | * plot_discordance_matrix.R |
| 49 | |
| 50 | ** Plots the genotype discordance by "discordance type" (0/0 -> 0/1, 0/0 -> 1/1, 0/1 -> 0/0, etc.) ** Usage: Rscript plot_discordance_matrix.R <discordance_matrix_file> <out_plot.jpg> [dataset1_name] [dataset2_name] [show_concordant_data=FALSE] [<concordance_file>] ** Note: *** The last optional argument is a concordance file over the same data; it is used to plot as 'unknown' all loci that were not captured by the concordance matrix because the alleles were not exact matches (e.g. if one of the alleles was monomorphic in one set). |
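To make the "discordance type" breakdown concrete, here is a hedged sketch of how an overall genotype concordance rate could be derived from a discordance matrix. It assumes a tab-separated file with one header row and one label column, where rows 2 to 4 and columns 2 to 5 hold the counts for 0/0, 0/1, 1/1 and missing genotypes; that layout, and the function name `concordance_rate`, are assumptions rather than a description of what the R scripts do. Missing genotypes (the last row and column) are ignored here.

```shell
#!/bin/sh
# Sketch: overall concordance = diagonal counts / all counts,
# restricted to the 0/0, 0/1 and 1/1 genotype classes.
concordance_rate() {
    awk -F'\t' '
        NR >= 2 && NR <= 4 {             # three genotype rows after the header
            for (i = 2; i <= 4; i++) {   # three genotype columns after the label
                total += $i
                if (i == NR) same += $i  # diagonal cell: same genotype in both sets
            }
        }
        END { if (total > 0) printf "%.4f\n", same / total }
    ' "$1"
}
```

Each off-diagonal cell corresponds to one of the discordance types listed above (for instance, row 0/0 and column 0/1), so the same loop could be adapted to report them individually.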
57 | | * Groningen |
58 | | ** Produced using Groningen pipeline on hg19 |
59 | | ** SNPs were only filtered for quality > Q10 (to avoid extreme numbers in VCF). It is considered an 'unfiltered' set and is expected to contain many false positives; however, sensitivity should be excellent, as any SNP not reported at this point will not be detected further in the pipeline. |
60 | | * BGI |
61 | | ** Produced using BGI pipeline on b36, then lifted over to hg19 |
62 | | ** SNPs filtered using standard BGI filter setup |
| 55 | |
| 56 | * Groningen |
| 57 | |
| 58 | ** Produced using Groningen pipeline on hg19 ** SNPs were only filtered for quality > Q10 (to avoid extreme numbers in VCF). It is considered an 'unfiltered' set and is expected to contain many false positives; however, sensitivity should be excellent, as any SNP not reported at this point will not be detected further in the pipeline. |
| 59 | |
| 60 | * BGI |
| 61 | |
| 62 | ** Produced using BGI pipeline on b36, then lifted over to hg19 ** SNPs filtered using standard BGI filter setup |
80 | | * Groningen |
81 | | ** Produced using Groningen pipeline on hg19 |
82 | | ** SNPs were only filtered for quality > Q10 (to avoid extreme numbers in VCF). It is considered an 'unfiltered' set and is expected to contain many false positives; however, sensitivity should be excellent, as any SNP not reported at this point will not be detected further in the pipeline. |
83 | | ** Homozygous reference loci corresponding to the Immunochip were added to the dataset as well |
84 | | * Immunochip |
85 | | ** ~165K loci after QC (both SNPs and homozygous reference) |
86 | | *** SNP HWE p-val > 1e-3 |
87 | | *** SNP callrate > 99% |
88 | | ** Exported from Genome Studio, QC'ed and lifted over from hg18 to hg19 |
| 80 | |
| 81 | * Groningen |
| 82 | |
| 83 | ** Produced using Groningen pipeline on hg19 ** SNPs were only filtered for quality > Q10 (to avoid extreme numbers in VCF). It is considered an 'unfiltered' set and is expected to contain many false positives; however, sensitivity should be excellent, as any SNP not reported at this point will not be detected further in the pipeline. ** Homozygous reference loci corresponding to the Immunochip were added to the dataset as well |
| 84 | |
| 85 | * Immunochip |
| 86 | |
| 87 | ** ~165K loci after QC (both SNPs and homozygous reference) *** SNP HWE p-val > 1e-3 *** SNP callrate > 99% ** Exported from Genome Studio, QC'ed and lifted over from hg18 to hg19 |
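The two Immunochip QC thresholds listed above can be expressed as a simple row filter. The sketch below is illustrative only: it assumes a hypothetical tab-separated table with a header row and the columns SNP ID, HWE p-value and call rate (as a fraction between 0 and 1). The actual QC was not necessarily performed this way, and the function name `qc_filter` is made up for the example.

```shell
#!/bin/sh
# Sketch: keep the header plus SNPs with HWE p-value > 1e-3 and call rate > 99%.
qc_filter() {
    # Assumed columns: 1 = SNP id, 2 = HWE p-value, 3 = call rate (fraction 0-1).
    awk -F'\t' 'NR == 1 || ($2 > 0.001 && $3 > 0.99)' "$1"
}
```

For example, `qc_filter snp_stats.tsv > snp_stats.qc.tsv` would write the passing SNPs to a new file, where both file names are placeholders.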
96 | | * A3b, A7b samples are contaminated |
97 | | * A8a, A8c and R5A: a problem occurred while processing one of their lanes, leaving them with 2/3 of the normal coverage. The figures should be updated once the lanes have been processed and these individuals corrected. |
| 95 | |
| 96 | * A3b, A7b samples are contaminated |
| 97 | * A8a, A8c and R5A: a problem occurred while processing one of their lanes, leaving them with 2/3 of the normal coverage. The figures should be updated once the lanes have been processed and these individuals corrected. |
| 98 | |
107 | | * BGI |
108 | | ** Produced using BGI pipeline on b36, then lifted over to hg19 |
109 | | ** SNPs filtered using standard BGI filter setup. Note that no homozygous reference locus is reported. |
110 | | * Immunochip |
111 | | ** ~165K loci after QC (both SNPs and homozygous reference) |
112 | | *** SNP HWE p-val > 1e-3 |
113 | | *** SNP callrate > 99% |
114 | | ** Exported from Genome Studio, QC'ed and lifted over from hg18 to hg19 |
| 107 | |
| 108 | * BGI |
| 109 | |
| 110 | ** Produced using BGI pipeline on b36, then lifted over to hg19 ** SNPs filtered using standard BGI filter setup. Note that no homozygous reference locus is reported. |
| 111 | |
| 112 | * Immunochip |
| 113 | |
| 114 | ** ~165K loci after QC (both SNPs and homozygous reference) *** SNP HWE p-val > 1e-3 *** SNP callrate > 99% ** Exported from Genome Studio, QC'ed and lifted over from hg18 to hg19 |