GENEALOGY-DNA-L ArchivesArchiver > GENEALOGY-DNA > 2011-02 > 1297371100
From: David Johnston <>
Subject: Re: [DNA] variance quiz game
Date: Thu, 10 Feb 2011 14:51:40 -0600
References: <002701cbc7a9$8c29ef30$c2482dae@Ken1><4D517E86.email@example.com> <014001cbc7d0$b20d4450$c2482dae@Ken1><015e01cbc7d2$ebe67640$c2482dae@Ken1> <4D51B0F0.firstname.lastname@example.org> <018b01cbc7dc$6cc407b0$c2482dae@Ken1><4D5369FB.email@example.com> <REME20110210145257@alum.mit.edu>
On 2/10/11 1:52 PM, John Chandler wrote:
> Dave wrote:
>> I am assuming that the chance of a non-mutation is (1-mu) and the chance
>> of an increase by 1 is mu/2 as is the chance of a decrease by one.
> The problem with this simple model is that father-son studies have
> shown two-step mutations to be roughly 5% of the total inventory.
> This makes a big difference for predicting the TMRCA of a pair of
> haplotypes with any discrepancy of two or more steps on one marker.
Yeah, it is easy to generalize to that. It is still a multi-Bernoulli
distribution. There are just more than three outcomes. However, you have
to decide how the two-step mutation works. Is there a separate mutation
rate for that or can you just assume it is the square of the first one?
>> but I have found a recurrence relation so that it is easily computed.
> Does your recursion formula generalize to the case of five-way
> branching, or better yet to seven-way branching (+3, -3, +2, -2, +1,
> -1, 0)? Brute-force calculation is doable up to the five-way case,
> but becomes impractical beyond that.
> John Chandler
Yeah, I think that is a trivial change. With just (+1,- 1, 0), the
recursion relation is
(D is the observed difference at the allele, N is the number of generations)
P(D | N) = (1-mu) *P(D | N-1) + mu/2*P(D-1 | N-1) + mu/2*P(D+1 | N-1)
It is just saying that the probability of being at the D state at time N
is sum of the probabilities of being at one of the other three states at
N-1 AND jumping to D from there.
So with 2*n+1 branches, this recursion relation becomes has 2n+1 terms
P(D | N) = (1-mu) *P(D | N-1) + SUM_OVER_i (nu_i/2) (P(D - i | N-1) +
P(D- i | N-1))
where mu=SUM_OVER_i nu_i.
Computation is not really a problem. If you want store all P(D|N)
values, you have N(N+3)/2 values to compute and store. If you have an
n-branch recursion relation, you are doing N*(N+3)/2*(2n+1) computations
approximately. I can easily do up to N=1000 in less than a second on my
laptop. That is 1000 generations. Going beyond that scales like N^2
obviously. For 3000 generations, it took about 13 seconds. At that
point, it is more about memory. But computing and storing every value is
surely not needed. I am assuming that after 1000 generations, you could
be a 1000 markers away. Obviously, there is no need to compute past 20
moves or so. So I believe it can be made to scale like N instead of N^2.
If you have different mutations at different markers, you would need to
compute and store for each but this at worst makes it 67 times slower if
you have 67 markers and less is some alleles share the same mutation rate.