Posts for the month of December 2010

Proposed Quality Control.


  • Duplicate input checking.
  • Quality scores histogram from the BGI.
    • Maybe other graphs/data provided by the BGI.
  • Quality scores with the FastQC toolkit.
    • GC content.
    • Quality scores per base.
    • Quality scores per read.
    • N-content.
    • Length distribution.
    • Overrepresented reads.
    • A summary of this data provided by a script.
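
As a rough illustration of what such a summary script might compute, here is a minimal sketch in plain Python. It is not the FastQC implementation; the function names (`read_fastq`, `summarise`) are made up for this example, and it assumes four-line FASTQ records with Phred+33 quality encoding (older Illumina data may need `phred_offset=64`).

```python
import io
from collections import Counter


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().strip()
        yield header.strip()[1:], seq, qual


def summarise(handle, phred_offset=33):
    """Collect simple per-file QC numbers (assumed Phred+33 unless told otherwise)."""
    lengths = Counter()   # read length -> number of reads
    seen = Counter()      # sequence -> number of occurrences
    reads = bases = gc = n = qual_sum = 0
    for _, seq, qual in read_fastq(handle):
        seq = seq.upper()
        reads += 1
        bases += len(seq)
        lengths[len(seq)] += 1
        seen[seq] += 1
        gc += seq.count("G") + seq.count("C")
        n += seq.count("N")
        qual_sum += sum(ord(c) - phred_offset for c in qual)
    return {
        "reads": reads,
        "gc_fraction": gc / bases if bases else 0.0,
        "n_fraction": n / bases if bases else 0.0,
        "mean_quality": qual_sum / bases if bases else 0.0,
        "length_distribution": dict(lengths),
        # Reads whose exact sequence occurs more than once in the file.
        "duplicated_reads": sum(c for c in seen.values() if c > 1),
    }


# Tiny made-up input, only to show the call.
EXAMPLE = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nNNGG\n+\n!!!!\n"
summary = summarise(io.StringIO(EXAMPLE))
```

Keeping every sequence in memory is only workable for small files or subsamples; on a full lane the duplicate count would need sampling or hashing.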


  • Percentage aligned.
  • Insert size distribution.
  • Coverage.
    • Distribution.
    • Visualisation as a wiggle track.
      • Intra-sample distance calculation of the wiggle tracks.
  • Mapping quality distribution.
  • Look into Picard Tools.
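
Until the Picard Tools have been evaluated, the first few alignment numbers above could be sketched directly from SAM text. This is only an illustration under stated assumptions: the function name `alignment_stats` is hypothetical, the input is uncompressed SAM lines, secondary and supplementary records are skipped, and the insert size is taken from the TLEN column of properly paired reads, counting each pair once via its positive TLEN.

```python
from collections import Counter


def alignment_stats(sam_lines):
    """Percentage aligned, MAPQ distribution and insert sizes from SAM text lines."""
    total = mapped = 0
    mapq = Counter()          # mapping quality -> number of reads
    insert_sizes = Counter()  # template length -> number of pairs
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x900:
            continue  # secondary (0x100) or supplementary (0x800) record
        total += 1
        if flag & 0x4:
            continue  # read is unmapped
        mapped += 1
        mapq[int(fields[4])] += 1
        tlen = int(fields[8])
        # Properly paired (0x2); positive TLEN marks the leftmost mate only.
        if flag & 0x2 and tlen > 0:
            insert_sizes[tlen] += 1
    return {
        "total": total,
        "mapped": mapped,
        "percent_aligned": 100.0 * mapped / total if total else 0.0,
        "mapq_distribution": dict(mapq),
        "insert_size_distribution": dict(insert_sizes),
    }


# Made-up records, only to show the call.
EXAMPLE_SAM = [
    "@HD\tVN:1.0\tSO:coordinate\n",
    "r1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\n",
    "r1\t147\tchr1\t200\t60\t4M\t=\t100\t-104\tACGT\tIIII\n",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
    "r3\t0\tchr1\t300\t20\t4M\t*\t0\t0\tACGT\tIIII\n",
]
stats = alignment_stats(EXAMPLE_SAM)
```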


  • Any available statistics from GATK?

Variant calling:

  • Transition / transversion ratio.
  • X, Y coverage (check encoding of sample tags).
  • Mutation rate.
    • Autosomal.
    • Y.
    • M (mitochondrial).
  • Distribution of SNPs found in dbSNP.
  • Indel / substitution rate.
  • Cross check with the Immunochip.
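
Several of the variant-level numbers above can be derived from the first five VCF columns. The sketch below is an illustration, not a replacement for a proper VCF parser. The function name `variant_stats` is hypothetical, each alternative allele is counted separately, a variant is treated as "in dbSNP" when its ID column starts with `rs` (an assumption about how the calls were annotated), and anything that is not a single-base change is counted as an indel.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def variant_stats(vcf_lines):
    """Ti/Tv ratio, indel/substitution ratio and known fraction from VCF text lines."""
    ti = tv = indels = known = total = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # meta-information or column header
        _chrom, _pos, vid, ref, alt = line.rstrip("\n").split("\t")[:5]
        ref = ref.upper()
        for allele in alt.upper().split(","):
            if allele == ".":
                continue  # no alternative allele called
            total += 1
            if vid.startswith("rs"):
                known += 1  # assumed to mean "present in dbSNP"
            if len(ref) == 1 and len(allele) == 1:
                if (ref, allele) in TRANSITIONS:
                    ti += 1
                else:
                    tv += 1
            else:
                indels += 1  # simplification: every non-SNP allele lands here
    substitutions = ti + tv
    return {
        "ti_tv": ti / tv if tv else float("nan"),
        "indel_substitution": indels / substitutions if substitutions else float("nan"),
        "known_fraction": known / total if total else 0.0,
    }


# Made-up records, only to show the call.
EXAMPLE_VCF = [
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "chr1\t100\trs1\tA\tG\n",
    "chr1\t200\t.\tC\tA\n",
    "chr1\t300\trs2\tG\tA,T\n",
    "chr1\t400\t.\tAT\tA\n",
]
result = variant_stats(EXAMPLE_VCF)
```

Running the same function per chromosome group (autosomes, Y, M) would give the split listed under the mutation rate.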

In all steps, cross check with the data provided by the BGI.