
From: "Ken Nordtvedt" <>
Subject: [DNA] Intraclade Age Sigma "Unknowable"
Date: Sun, 7 Feb 2010 08:54:36 -0700

One way to estimate the TMRCA of a clade is to find the sum of STR variances of a sample of present-day clade haplotypes, with the variances measured relative to an assumed founding haplotype, and then divide by the sum of the STR mutation rates:

Gest = { Sum i,m [r(i,m) - rf(m)]^2 } / (N M) == Var / M

r(i,m) is the repeat value of the mth STR of the ith haplotype, and the sum runs over all haplotypes i and all STRs m. N is the number of haplotypes, and M is the sum of the STR mutation rates. For young clades the variances become essentially GD (genetic distance) counts.
rf(m) is the founder haplotype's repeat value for the mth STR.
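
A minimal sketch of this estimate in Python, assuming haplotypes are stored as lists of repeat counts in the same STR order as the founder. The function name and all the sample values are hypothetical, made up only to illustrate the arithmetic:

```python
def estimate_generations(haplotypes, founder, rates):
    """Return Gest = Var / M for a sample of haplotypes.

    haplotypes: list of N haplotypes, each a list of repeat values r(i, m)
    founder:    assumed founding haplotype, repeat values rf(m)
    rates:      per-generation mutation rate of each STR (their sum is M)
    """
    n = len(haplotypes)
    total_rate = sum(rates)  # M
    # Var: squared differences to the founder, summed over STRs,
    # averaged over the N haplotypes.
    var = sum(
        (r - rf) ** 2
        for hap in haplotypes
        for r, rf in zip(hap, founder)
    ) / n
    return var / total_rate


# Hypothetical sample: 4 haplotypes, 3 STRs, each with rate 0.002 per generation.
founder = [13, 24, 14]
haplotypes = [[13, 24, 14], [13, 25, 14], [12, 24, 14], [13, 24, 15]]
rates = [0.002, 0.002, 0.002]
g_est = estimate_generations(haplotypes, founder, rates)
```

With these made-up numbers Var is 0.75 and M is 0.006, so Gest comes out at about 125 generations.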

But due to the stochastic (random) nature of STR mutations, the right-hand side of the above equation (the sum of STR variances) will be a distribution which sometimes falls above its average value and sometimes below. We want to know the width of that distribution, so we can get a sense for the statistical uncertainty of the age estimate, which is based on what happens "on average." The more STRs we use in our haplotypes, the closer to average we can assume to be, but there is always this statistical uncertainty. How big is it for the intraclade age estimate? Basically we cannot tell without knowing the early demographics of the y tree which starts with the haplotype sample population's MRCA and ends with the N sample haplotypes G generations later.

The analytic formula for the statistical confidence interval for reasonably young clades is given by:

Variance of Var = M { Sum c f(c)^2 }
And the 1 sigma confidence interval for Gest is then SquareRoot {Variance of Var} / M

Variance of Var is simply unknowable without knowing the tree demographics --- the f(c), particularly their values in the tree's earliest generations. The fractions f(c) are relatively large early in the tree and get smaller and smaller as we approach the end of the tree, being 1/N on each of the N branch segments which terminate with our N sample haplotypes.

The label c stands for each male in the y tree which ends with our sample population of N haplotypes. f(c) is the fraction of those N haplotypes for which male "c" is an ancestor. The sum over c can be done as a sum over branch lines in each generation of the tree and then a sum over generations from 1 to G. The number of branch lines is 2 in the first generation after the MRCA, and it increases by one every time a tree node on one of the branch lines comes along, and that branch line number ends up being N in the last generation before the present. We can simply call the number of branch lines in generation g P(g), the tree population in generation g.
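
If the tree were known, the sum over c could be evaluated directly. The sketch below assumes the tree is supplied as a list of generations, each listing how many of the N sample haplotypes descend from each male on a branch line in that generation. That data layout, the function name, and the tiny example tree are assumptions for illustration only:

```python
import math


def sigma_gest(descendant_counts, n, total_rate):
    """1 sigma of Gest = sqrt(Variance of Var) / M, with Variance of Var = M * Sum c f(c)^2.

    descendant_counts: one list per generation; each entry is the number of the
                       n sample haplotypes descended from one male c, so f(c) = count / n
    total_rate:        M, the sum of the STR mutation rates
    """
    for gen in descendant_counts:
        if sum(gen) != n:
            raise ValueError("the f(c) of each generation must sum to one")
    sum_f2 = sum((k / n) ** 2 for gen in descendant_counts for k in gen)
    variance_of_var = total_rate * sum_f2
    return math.sqrt(variance_of_var) / total_rate


# Hypothetical tree: N = 4 haplotypes, G = 3 generations, branch lines 2 -> 3 -> 4.
tree = [[2, 2], [2, 1, 1], [1, 1, 1, 1]]
sigma = sigma_gest(tree, n=4, total_rate=0.1)
```

The early generations dominate the sum here: the first generation contributes 0.5 to Sum c f(c)^2, while the last contributes only 4 x (1/4)^2 = 0.25.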

Consider the sum of f(c)^2 in any particular generation of the tree. While the sum of f(c) for any particular generation must be one, the sum of squares can be nearly as big as one if one particular branch line hogs almost all the ancestry, but it can be no smaller than 1 / P(g), where P(g) is the number of branch lines in generation g. So we can produce an expression for the minimum size that Variance of Var can be under the most democratic tree ancestry scenario, in which every ancestor in the tree in every generation shares equally in the fraction of ultimate descendants.

Variance of Var >= M { Sum g from 1 to G of 1 / P(g) }

Even this lower limit for the Variance of Var depends on the "unknowable" --- the tree population each generation, and especially the earliest tree generations when P(g) is small. In particular, the Variance of Var is basically determined by how fast the early tree population grows from its initial size of 2.
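
To see how strongly the lower limit depends on early growth, the sketch below compares two invented population histories that both start at 2 branch lines and end at N = 64 after 63 generations. The histories, the value of M, and the function name are assumptions chosen only for illustration:

```python
import math


def min_sigma_gest(populations, total_rate):
    """Lower limit on the 1 sigma of Gest from M * Sum g 1 / P(g).

    populations: P(g) for g = 1..G, the number of branch lines each generation
    total_rate:  M, the sum of the STR mutation rates
    """
    min_variance_of_var = total_rate * sum(1.0 / p for p in populations)
    return math.sqrt(min_variance_of_var) / total_rate


G, N, M = 63, 64, 0.1  # hypothetical values

# Fast early growth: branch lines double each generation until reaching N.
fast = [min(2 ** g, N) for g in range(1, G + 1)]
# Slow growth: one new branch line per generation, 2, 3, ..., N.
slow = [g + 1 for g in range(1, G + 1)]

sigma_fast = min_sigma_gest(fast, M)
sigma_slow = min_sigma_gest(slow, M)
```

Both trees have the same N and G, yet the slowly growing tree has the larger lower limit, because its small early P(g) values persist for more generations.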

Note that Variance of Var does not keep getting smaller toward zero as N gets larger. If we used for our sample population the entire population today for some clade, perhaps numbering millions, we would still have a Variance of Var lower limit dominated by the reciprocal of the tree population in the early generations. P(g) begins at 2, regardless. In fact, a good sample population of haplotypes which covers the early generations of the tree as well as the full population does is all that is required to do about as well statistically as one practically needs to do.

Note that the only thing you can do to drive down the statistical confidence interval more and more is to increase M, the sum of STR mutation rates of your haplotypes.
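
The contrast between increasing N and increasing M can be checked numerically with the lower-limit formula. This sketch assumes a hypothetical tree whose branch lines double every generation until they reach N = 2^k; the function name and all parameter values are invented for illustration:

```python
import math


def min_sigma_doubling_tree(n_exp, generations, total_rate):
    """Lower-limit 1 sigma of Gest for a tree that doubles up to N = 2**n_exp.

    Uses Variance of Var >= M * Sum g 1 / P(g), with P(g) = min(2**g, N).
    """
    n = 2 ** n_exp
    populations = [min(2 ** g, n) for g in range(1, generations + 1)]
    min_variance_of_var = total_rate * sum(1.0 / p for p in populations)
    return math.sqrt(min_variance_of_var) / total_rate


G = 100  # hypothetical clade age in generations

# Growing the sample from 2^6 to 2^10 to 2^20 haplotypes at fixed M = 0.1.
sigma_by_n = [min_sigma_doubling_tree(k, G, 0.1) for k in (6, 10, 20)]

# Growing M from 0.05 to 0.1 to 0.2 at fixed N = 2^10.
sigma_by_m = [min_sigma_doubling_tree(10, G, m) for m in (0.05, 0.1, 0.2)]
```

Under these assumptions the values in sigma_by_n level off at a floor of sqrt(1 / M) generations set by the early terms 1/2 + 1/4 + ..., while the values in sigma_by_m keep falling as 1 / sqrt(M).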


PS: Interclade statistical confidence intervals, on the other hand, can be conservatively quoted by an upper bound which depends on no demographic knowledge. One can forget about the dispersion of the tree within each clade and consider the simple "V tree" consisting of two branch lines from the interclade node to the present. The statistical confidence interval for the TMRCA of that simplified tree is then straightforward to evaluate and is used in Generations4 for the 1 sigma values of the interclade age estimates.
