GENEALOGY-DNA-L Archives

From: "Ken Nordtvedt" <>
Subject: [DNA] Two Variances Better than One
Date: Thu, 14 Feb 2008 09:46:21 -0700

We can use a combination of both types of haplotype variance to provide upper and lower limits to population ages.

Expectation value of a marker's variance from a fixed, assumed founding repeat value is:

Vf = (mu+md) G + (mu-md)^2 G^2

with mu and md being the up and down mutation rates of the marker. Because the G^2 term can never be negative, Vf / (mu+md) provides an upper limit for G, the population age since the MRCA.
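As a rough numerical illustration of the Vf formula above, here is a minimal Python sketch. The function names and the mutation rates used in the example are made up for illustration; they are not measured values for any real marker.

```python
def expected_vf(mu, md, G):
    """Expected variance about the fixed founding value after G generations.

    mu, md: per-generation up and down mutation rates of one marker.
    """
    return (mu + md) * G + (mu - md) ** 2 * G ** 2


def upper_limit_from_vf(vf, mu, md):
    """Age estimate Vf / (mu+md); it overshoots G whenever mu differs from md."""
    return vf / (mu + md)


# Hypothetical rates, chosen only to show the effect of asymmetry.
G_true = 100
symmetric = upper_limit_from_vf(expected_vf(0.001, 0.001, G_true), 0.001, 0.001)
asymmetric = upper_limit_from_vf(expected_vf(0.003, 0.001, G_true), 0.003, 0.001)
print(symmetric, asymmetric)
```

With equal up and down rates the estimate recovers G; with unequal rates the G^2 term pushes it above G, which is why this estimator acts as an upper limit.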

Expectation value of a marker's variance from the mean repeat value is:

V = (mu+md) (G - Sum over c of fc^2)

with the Sum being a correction that depends on the details of the early history of the MRCA's descendant population. Because that correction is subtracted, V / (mu+md) provides a lower limit for G, the population age since the MRCA.

Doing both variances together then brackets the true age G, above and below.

But we don't (or shouldn't) do age estimates from the variance of single markers. We should use haplotypes with as many markers as possible and employ all the basic, single-copy markers in making an estimate of G. This is because the actual distribution of variances one gets for an individual marker after G generations is very asymmetric (lop-sided): its peak (the most likely outcome) deviates from its expected value. The distribution of variance outcomes for an individual marker is also quite broad, meaning a very wide confidence interval.

But one of the most amazing theorems of probability mathematics, Lyapunov's central limit theorem, states that whatever the shape of the individual probability distributions for independent outcomes (subject to mild conditions on their moments), the probability distribution for the sum of those outcomes approaches the Gaussian (normal) distribution. The expected value of the sum is the sum of the expected values of the individual outcomes, and the squared standard deviation of the sum is the sum of their squared standard deviations. Here the individual outcomes are the variances of the various markers of one's haplotypes.
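The effect of summing can be seen in a small simulation. The sketch below does not model real marker variances: it uses an exponential distribution purely as a stand-in for a lop-sided single-marker outcome, and the choice of 37 markers, 5000 trials and the seed are arbitrary assumptions.

```python
import random
import statistics


def skewness(xs):
    """Third central moment divided by the cubed standard deviation."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * sd ** 3)


def simulate_sums(n_markers, trials=5000, seed=1):
    """Each trial sums n_markers independent skewed draws.

    The exponential draw is only a stand-in for one marker's variance outcome.
    """
    rng = random.Random(seed)
    return [sum(rng.expovariate(1.0) for _ in range(n_markers))
            for _ in range(trials)]


single = simulate_sums(1)    # one marker alone
summed = simulate_sums(37)   # hypothetical 37-marker haplotype

skew_single = skewness(single)
skew_summed = skewness(summed)
# Relative width: standard deviation as a fraction of the mean.
width_single = statistics.pstdev(single) / statistics.fmean(single)
width_summed = statistics.pstdev(summed) / statistics.fmean(summed)

print(skew_single, skew_summed)
print(width_single, width_summed)
```

The summed outcome should come out both less lop-sided and relatively narrower than the single-marker outcome, which is the practical reason for using as many markers as possible.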

So by summing the variances of individual markers over all the markers, and summing the up plus down mutation rates of all the markers, we have two estimators for age since the MRCA with one estimator being above and the other below the true value.

G(upper) = Sum over markers i of Vf(i) / M
G(lower) = Sum over markers i of V(i) / M
with M being Sum over markers i of mu(i)+md(i)
V(i) for each marker can mathematically never exceed Vf(i); the two are equal only when the population mean of the marker equals the assumed founding value.
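The two estimators above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: variances are taken as averages over the N sampled haplotypes (dividing by N, not N-1), the function name is mine, and the haplotypes, founding values and rates in the example are invented toy numbers.

```python
def age_bounds(haplotypes, founder, rates):
    """Return (G_lower, G_upper) for a set of haplotypes.

    haplotypes: list of equal-length lists of repeat values, one per person
    founder:    assumed founding repeat value rf(i) for each marker
    rates:      mu(i) + md(i) for each marker
    """
    n = len(haplotypes)
    M = sum(rates)  # summed up-plus-down mutation rates
    sum_vf = 0.0
    sum_v = 0.0
    for i, rf in enumerate(founder):
        column = [h[i] for h in haplotypes]
        mean_r = sum(column) / n
        # Variance about the fixed founding value, Vf(i)
        sum_vf += sum((r - rf) ** 2 for r in column) / n
        # Variance about the population mean, V(i)
        sum_v += sum((r - mean_r) ** 2 for r in column) / n
    return sum_v / M, sum_vf / M


# Toy data: 4 haplotypes, 3 markers, all values hypothetical.
toy_haplotypes = [
    [13, 24, 14],
    [14, 24, 14],
    [13, 23, 14],
    [13, 24, 15],
]
toy_founder = [13, 24, 14]
toy_rates = [0.004, 0.004, 0.004]

g_lower, g_upper = age_bounds(toy_haplotypes, toy_founder, toy_rates)
print(g_lower, g_upper)
```

A real calculation would use many more haplotypes and markers, with a mutation rate appropriate to each marker.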

Let rf(i) be the assumed fixed, founding repeat value for the ith marker of the MRCA's founding haplotype
Let <r(i)> be the average value of the ith marker repeat value from the haplotype population

A little algebra then gives the spacing between the G(upper) and G(lower) estimates for G:

G(upper) - G(lower) = Sum over i of ( rf(i) - <r(i)> )^2 / M
The width G(upper) - G(lower) quantitatively captures all those complications that are difficult or impossible to model with present knowledge.
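For readers who want the algebra behind the spacing formula spelled out, here is one way to write it. It assumes both variances are plain averages over the N sampled haplotypes (dividing by N), with r_k(i) denoting the repeat value of marker i in haplotype k; that notation is introduced here only for the derivation.

```latex
\begin{align*}
V_f(i) &= \frac{1}{N}\sum_{k=1}^{N}\bigl(r_k(i)-r_f(i)\bigr)^2 \\
       &= \frac{1}{N}\sum_{k=1}^{N}\Bigl[\bigl(r_k(i)-\langle r(i)\rangle\bigr)
          +\bigl(\langle r(i)\rangle-r_f(i)\bigr)\Bigr]^2 \\
       &= V(i) + \bigl(r_f(i)-\langle r(i)\rangle\bigr)^2 ,
\end{align*}
% The cross term vanishes because \sum_k (r_k(i) - \langle r(i)\rangle) = 0.
\begin{equation*}
G_{\text{upper}} - G_{\text{lower}}
  = \frac{1}{M}\sum_i \bigl(V_f(i)-V(i)\bigr)
  = \frac{1}{M}\sum_i \bigl(r_f(i)-\langle r(i)\rangle\bigr)^2 .
\end{equation*}
```

If the variance about the mean is instead computed with N-1 in the denominator, the identity holds only approximately.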

The differences between the assumed integer repeat values rf(i) of the founding haplotype and the fractional averages <r(i)> obtained from the population will generally be less than 1 in magnitude, provided the sample represents the total population well and the clade is isolated, with no very strong sub-clades left within it.

Ken