Changes between Initial Version and Version 1 of DeNovoVariationPipelineIdea


Ignore:
Timestamp:
Sep 26, 2010 9:55:37 PM (14 years ago)
Author:
Yurii Aulchenko
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • DeNovoVariationPipelineIdea

    v1 v1  
     1[[TOC()]]
     2
     3= (how) Is discovery of ''de-novo'' mutations feasible in GvNL data? =
     4
     5Following the discussion of Yurii Aulchenko and Kai Ye, 2010.09.08 – 2010.09.12
     6
     7== PROBLEM ==
     8
     9When a ‘de-novo’ mutation occurs, we may see the following picture:
     10
     11Reads in one of the parents (r: reference. A: alternative variant)
     12{{{
     13rrrrrrrrrr
     14
     15rrrrrrrrrr
     16
     17rrrrrrrrrr
     18
     19rrrrrrrrrr
     20
     21rrrrrrrrrr
     22
     23rrrrrrrrrr
     24}}}
     25
     26Other parent reads
     27
     28{{{
     29rrrrrrrrrr
     30
     31rrrrrrrrrr
     32
     33rrrrrrrrrr
     34
     35rrrrrrrrrr
     36
     37rrrrrrrrrr
     38
     39rrrrrrrrrr
     40
     41rrrrrrrrrr
     42
     43}}}
     44
     45Reads in offspring
     46
     47{{{
     48rrrrrArrrr
     49
     50rrrrrArrrr
     51
     52rrrrrArrrr
     53
     54rrrrrArrrr
     55
     56rrrrrArrrr
     57
     58rrrrrrrrrr
     59
     60rrrrrrrrrr
     61
     62rrrrrrrrrr
     63
     64rrrrrrrrrr
     65
     66}}}
     67
     68The problem is that actually either of the parents can be a heterozygous carrier of “A”, and we have missed this allele just by chance (chance to miss it is estimated to be approximately ~1%). This means we will see above described scenario in tens of thousands of locations. Given expected number of ‘de novo’ mutations in a person is ~50, verification step will present a major problem.
     69
     70We see there may be two (potentially complementary) ways out with this problem.
     71
     72== (1) COVERAGE ==
     73
     74For few trios, increase coverage (potentially only in parents, or even one parent); this will decrease the chance that we miss a heterozygote. We did calculations of what coverage should be so we get chance of het missing becoming comparable to the mutation rate; e.g. we aim chance het missing = 1e-8 or so (see Box 1 below for computations). It appears that at ~32x only half of the situations described above will be attributable to inadequate coverage, while other half will be true ‘de novo’ mutations.
     75
     76
     77{{{
     78Box 1
     79
     80Assume the heterozygote call is made when at least two reads show the variant.
     81Let us also assume for the moment that coverage is always Nx. Denote reference
     82sequence as “R” and alternative as “A”, so in fact the person is R/A. Let us compute
     83the probability that we miss this heterozygote (i.e. will call it A/A or R/R):
     84
     85P(call R/A as R/R or A/A) = P(all N read R) + P([N-1] reads are R, and 1 read is A)
     86                                                + P(all N read A) + P([N-1] reads are A, and 1 read is R)
     87
     88Assuming that probability of reads follows binomial distribution, we get
     89
     90P(call R/A as R/R or A/A) = 2*(N+1)*(1/2)!^N
     91
     92P(call R/A as R/R or A/A) ~ 1e-8 at N ~ 32
     93}}}
     94
     95== (2) Exploit tagging of same-window reads ==
     96
     97Basically to detect ‘de novo’ we need a situation when WITHIN THE SAME READ WINDOW (or paired-end read window) both parental chromosomes are tagged by a variant, and we see the third variant appearing in this context in child only
     98
     99Below is a naïve example of a situation when we could be able to detect ‘de novo’ mutation “C” (in red). Note that this is only one situation when we can clearly see that “C” is ‘de novo’. More situations can be worked out following the same logic.
     100
     101r: reference. A, B: alternative variants tagging the sequence.
     102
     103Reads in one of the parents
     104
     105{{{
     106rrrrrArrrr
     107
     108rrrrrArrrr
     109
     110rrrrrArrrr
     111
     112rrrrrArrrr
     113
     114rrrrrArrrr
     115
     116rrrrrArrrr
     117
     118rrrrrrrrrr
     119
     120rrrrrrrrrr
     121
     122rrrrrrrrrr
     123
     124rrrrrrrrrr
     125
     126rrrrrrrrrr
     127
     128rrrrrrrrrr
     129
     130}}} 
     131
     132Other parent reads
     133
     134{{{
     135
     136rBrrrrrrrr
     137
     138rBrrrrrrrr
     139
     140rBrrrrrrrr
     141
     142rBrrrrrrrr
     143
     144rBrrrrrrrr
     145
     146rrrrrrrrrr
     147
     148rrrrrrrrrr
     149
     150rrrrrrrrrr
     151
     152rrrrrrrrrr
     153
     154rrrrrrrrrr
     155
     156rrrrrrrrrr
     157
     158rrrrrrrrrr
     159
     160}}}
     161
     162Reads in offspring
     163
     164{{{
     165
     166rrCrrArrrr
     167
     168rrCrrArrrr
     169
     170rrCrrArrrr
     171
     172rrCrrArrrr
     173
     174rrCrrArrrr
     175
     176rrCrrArrrr
     177
     178rrrrrrrrrr
     179
     180rrrrrrrrrr
     181
     182rrrrrrrrrr
     183
     184rrrrrrrrrr
     185
     186rrrrrrrrrr
     187
     188rrrrrrrrrr
     189
     190}}} 
     191
     192To address whether this is realistic scenario under which we can detect de novo mutations, we need to answer the question about probability that, given ‘de novo’ mutation occurs, what is the chance we will see that mutation in at least four reads (it is clear that for ‘de novo’ we must use more stringent calling criteria) and that in at least two of these 4 reads we will also see a heterozygote coming from a parent. Computations estimating this chance are provided below in the Box 2.
     193
     194From these computations, it appears that the chance to see ‘de novo’ in 4 or more reads, and see an existing (transmitted from a parent) variant in at least two of these reads is about 0.09. Thus, using outlined strategy we will be able to detect several de novo mutations per trio offspring, translating to hundreds (or thousands) de novo described from the whole data set. Note that in above we ignored the paired-end nature of our sequencing data, which, when properly accounted for, would probably double the numbers of detectable de novo mutations. Next, if cross-reads phasing works accurately and at longer distances this will allow to bring this proportion even higher.
     195
     196{{{
     197Box 2
     198
     199The probability that we see a ‘de novo’ in at least 4 reads out of 12 is 0.93.
     200The chance that an existing heterozygous site is covered in the same read
     201can be computed assuming the read length of 100, uniform distribution of
     202the read-start position across the genome, and heterozygote probability of
     2031/300 per site (Kai). Assume the ‘worst’ scenario of exactly 4 reads with
     204‘de novo’, what is the chance that in at least two of them we will see an
     205existing heterozygote?
     206
     207Denoting the ‘de novo’ position in the read as 0, the ‘coverable’ position
     208of a heterozygote may vary from -99 to +99. The chance that a heterozygote
     209at +99 is included in the read is 0.01; if heterozygote is at +1, the chance is
     2100.99. Thus, for a heterozygote at position ‘j’ (j in -99 to -1 and 1 to 99) the
     211chance to be included in the read is (1-abs(j/100)). We assume that a chance
     212to have a ‘linked’ alternative variant at a position is ½ * 1/300 = 1/600. Thus
     213the probability to detect a ‘linked’ variant in at least two reads out of 4 is:
     214
     215P(see variant in >=2 reads)
     216
     217= P(variant is at -99) * P(see variant in >=2 reads | variant is at -99)
     218   + P(variant is at -98) * P(see variant in >=2 reads | variant is at -98) +
     219   … + P(variant is at +99) * P(see variant in >=2 reads | variant is at +99)
     220
     221= 1/600 (P(see variant in >=2 reads | variant is at -99)
     222   + … P(see variant in >=2 reads | variant is at +99))
     223
     224= 1/600 [ 2 * SUM_{j=(1,99)} SUM_{k=2,4} (1-j/100)^k * (j/100)^(4-k) ]
     225
     226Evaluation of this expression gives 
     227
     228P(see variant in >=2 reads) = 0.09
     229
     230Thus the joint probability to see ‘de novo’ in >=4 reads and see an
     231established variant transmitted from a parent in at least 2 of these reads
     232is 0.93*0.09 = 0.086.
     233}}}
     234
     235== Conclusions ==
     236
     237From above computations it looks like both increasing coverage in parents (and may be offspring) of selected trios to >32x, and exploitation of information from the same reads may make detection of ‘de novo’ variants feasible.
     238
     239Note that above computations cannot be considered as final; multiple assumptions and approximations were made, it will depend on goodness of these assumptions and approximations how far off the true answer the provided figures are. Still, the true answer should be the same order of magnitude – which was our initial aim – to see if ‘de novo’ project looks at all feasible or not.
     240
     241Before any of these lines can be followed up, thorough discussion and further computations / simulations should be done to relax the assumption that N is constant (use N ~ Poisson(Lambda) instead) and also taking into account error probability (not considered above).