GENEALOGY-DNA-L Archives

From: "Kenneth Nordtvedt" <>
Subject: [DNA] Asymptotic Distributions for General Mutation Models
Date: Wed, 8 Jun 2011 09:58:02 -0600

Consider the general one-step mutation model, in which m[k] r[k] is the probability that an STR with k repeats mutates “down” to k-1 repeats, and m[k] (1-r[k]) is the probability of an “up” mutation to k+1 repeats.
Both m[k] and r[k] may vary arbitrarily with k. m[k] is the total mutation rate from k repeats; r[k]=1/2 means an STR with k repeats is equally likely to mutate “down” as “up”.

If there is a stationary distribution of repeats with f[k] being the frequency of k repeats, the frequencies obey the equations:

m[k] f[k] = m[k+1] r[k+1] f[k+1] + m[k-1] (1-r[k-1]) f[k-1]
This equation is basically “outflow = inflow” for the frequency of k repeats.

The product quantities g[k] = m[k] f[k] can be defined to simplify the difference equations by hiding the mutation rates m[k] from the problem:

g[k] = r[k+1] g[k+1] + (1-r[k-1]) g[k-1]

These equations are solved by the recursive relations:

g[k] = g[k-1] (1-r[k-1])/r[k]

with solution, for k above any chosen starting place ks,

g[k] = g[ks] (1-r[ks]) (1-r[ks+1]) (1-r[ks+2]) ... (1-r[k-1]) / {r[ks+1] r[ks+2] r[ks+3] ... r[k]}

The same recursion run in the other direction gives g[k] for k below ks.

The frequencies are then obtained as f[k] = g[k] / m[k].
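
The recursion can be sketched in a few lines of Python. Everything numeric here is an assumption made only for illustration: the logistic shape of r(k), the linear m(k), the truncation to 1..40 repeats, and the function name stationary_distribution. None of it comes from a measured STR.

```python
import math

def r(k):
    # Hypothetical "down" fraction: below 1/2 for k < 15, above 1/2 for k > 15.
    return 1.0 / (1.0 + math.exp(-0.3 * (k - 15)))

def m(k):
    # Hypothetical total mutation rate per generation, growing with repeat count.
    return 0.0002 * k

def stationary_distribution(m, r, k_lo, k_hi, ks):
    """Return {k: f[k]} on k_lo..k_hi using g[k] = g[k-1] (1-r[k-1]) / r[k]."""
    g = {ks: 1.0}  # arbitrary scale; the final normalization removes it
    for k in range(ks + 1, k_hi + 1):        # walk upward from ks
        g[k] = g[k - 1] * (1.0 - r(k - 1)) / r(k)
    for k in range(ks - 1, k_lo - 1, -1):    # same relation solved for g[k-1]
        g[k] = g[k + 1] * r(k + 1) / (1.0 - r(k))
    f = {k: g[k] / m(k) for k in g}          # f[k] = g[k] / m[k]
    total = sum(f.values())
    return {k: v / total for k, v in f.items()}

f = stationary_distribution(m, r, 1, 40, ks=15)
for k in range(10, 21):
    print(k, round(f[k], 4))
```

Choosing a different ks should change nothing after normalization, which is a quick sanity check on the recursion.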

The sum over k of f[k], from very low values to very high values, must be finite so that it can be normalized to 1. That simply requires r[k] to stay a finite amount above 1/2 in the limit of large k, and a finite amount below 1/2 in the limit of low k.
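
A toy check of that condition, under two simplifying assumptions that are mine and not part of the model: r[k] is held constant over the upper tail, and m[k] is constant there too, so f[k] is proportional to g[k]. Each step up then multiplies g by (1-r)/r, and the tail is a geometric series. The function name partial_sum is hypothetical.

```python
def partial_sum(r_const, n_terms):
    """Sum the first n_terms of the tail g[ks], g[ks+1], ... with g[ks] = 1."""
    ratio = (1.0 - r_const) / r_const  # g[k] / g[k-1] when r is constant
    term, total = 1.0, 0.0
    for _ in range(n_terms):
        total += term
        term *= ratio
    return total

# r above 1/2: the tail sum settles to a finite limit, 1 / (1 - ratio).
print(partial_sum(0.6, 200))
# r exactly 1/2: every term equals 1, so the sum just keeps growing.
print(partial_sum(0.5, 200), partial_sum(0.5, 400))
```

With r = 0.6 the ratio is 2/3 and the sum approaches 3; with r = 1/2 the sum equals the number of terms and never converges.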

For an STR starting with any initial number of repeats, after a very large number of generations the probability distribution for the repeat value to which that STR will have mutated is given by this stationary distribution.

An interesting feature appears in such stationary distributions. The first equation of this message states that the flow of probability into any repeat number’s frequency is equaled by the outflow, a relationship involving three neighboring repeat frequencies. But the stationary distribution actually has balance between every neighboring pair of repeat frequencies. The bilateral flows between each pair balance on their own:

m[k] (1-r[k]) f[k] = m[k+1] r[k+1] f[k+1] for all k
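
Both balance statements can be checked numerically. The sketch below assumes the same kind of made-up rates as any illustration would need (a logistic r(k), a linear m(k), repeats truncated to 1..40); the names balance_gaps, pair_gap and node_gap are hypothetical. It builds f[k] from g[k] = g[k-1] (1-r[k-1]) / r[k] and reports the largest violation of each equation.

```python
import math

K_LO, K_HI = 1, 40  # assumed truncation of the repeat range

def r(k):
    # Hypothetical "down" fraction, crossing 1/2 at k = 15.
    return 1.0 / (1.0 + math.exp(-0.3 * (k - 15)))

def m(k):
    # Hypothetical total mutation rate per generation.
    return 0.0002 * k

def balance_gaps():
    g = {K_LO: 1.0}
    for k in range(K_LO + 1, K_HI + 1):
        g[k] = g[k - 1] * (1.0 - r(k - 1)) / r(k)
    f = {k: g[k] / m(k) for k in g}
    total = sum(f.values())
    f = {k: v / total for k, v in f.items()}

    # Pairwise balance: up-flow from k equals down-flow from k+1.
    pair_gap = max(
        abs(m(k) * (1 - r(k)) * f[k] - m(k + 1) * r(k + 1) * f[k + 1])
        for k in range(K_LO, K_HI)
    )
    # Three-term balance: outflow from k equals inflow from both neighbors
    # (interior k only, since the range is truncated).
    node_gap = max(
        abs(m(k) * f[k]
            - m(k + 1) * r(k + 1) * f[k + 1]
            - m(k - 1) * (1 - r(k - 1)) * f[k - 1])
        for k in range(K_LO + 1, K_HI)
    )
    return pair_gap, node_gap

pair_gap, node_gap = balance_gaps()
print("largest pairwise imbalance:", pair_gap)
print("largest three-term imbalance:", node_gap)
```

Both gaps should come out at floating-point rounding level, since the pairwise balance is just the recursion rewritten and the three-term equation follows by adding two neighboring pairwise balances.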

With a distribution approaching a stationary one for large generation number, the distribution’s variance approaches a limit and stops its growth with time. But for very short times, variance always starts its growth linear in generations G, with the relationship Var = m[kf] G, where kf is the repeat value of the population’s MRCA.
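
The two regimes can be seen by stepping the repeat distribution forward one generation at a time. This is only a sketch under assumptions of my own choosing: a logistic r(k), a linear m(k) with m[15] = 0.003, repeats confined to 1..40 with no mutation allowed past either edge, and a founder at kf = 15. The names step, variance and variance_curve are hypothetical.

```python
import math

K_LO, K_HI = 1, 40  # assumed truncation of the repeat range
KS = list(range(K_LO, K_HI + 1))

def r(k):
    # Hypothetical "down" fraction, equal to 1/2 at k = 15.
    return 1.0 / (1.0 + math.exp(-0.3 * (k - 15)))

def m(k):
    # Hypothetical total mutation rate per generation.
    return 0.0002 * k

# Per-generation step probabilities; moves off either edge are suppressed.
DOWN = [m(k) * r(k) if k > K_LO else 0.0 for k in KS]
UP = [m(k) * (1.0 - r(k)) if k < K_HI else 0.0 for k in KS]

def step(p):
    """Advance the probability distribution over repeats by one generation."""
    q = [0.0] * len(p)
    for i, pk in enumerate(p):
        q[i] += pk * (1.0 - DOWN[i] - UP[i])  # no mutation this generation
        if DOWN[i]:
            q[i - 1] += pk * DOWN[i]
        if UP[i]:
            q[i + 1] += pk * UP[i]
    return q

def variance(p):
    mean = sum(k * pk for k, pk in zip(KS, p))
    return sum((k - mean) ** 2 * pk for k, pk in zip(KS, p))

def variance_curve(kf, checkpoints):
    """Return {generation: variance} for a population founded at kf repeats."""
    p = [1.0 if k == kf else 0.0 for k in KS]
    out = {}
    for gen in range(1, max(checkpoints) + 1):
        p = step(p)
        if gen in checkpoints:
            out[gen] = variance(p)
    return out

curve = variance_curve(kf=15, checkpoints={1, 10, 15000, 20000})
for gen in sorted(curve):
    print(gen, "Var =", round(curve[gen], 4), " m[kf] G =", round(m(15) * gen, 4))
```

With these made-up rates the early checkpoints should track m[kf] G closely, while the late ones should sit far below the linear law and barely differ from each other.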

For now this is all fun and games. We don’t yet know the probability matrix p(a,b) for individual STRs. And it is the actual shape of the variance versus generations curve that we really want in order to make better TMRCA estimations. Only the very rare, extremely fast mutating marker will come close to producing anything like its stationary asymptotic distribution in the lifetime of our Y clades.