GENEALOGY-DNA-L Archives

From: "Ken Nordtvedt" <>
Subject: Re: [DNA] Intraclade Age Sigma "Unknowable"
Date: Sun, 7 Feb 2010 09:46:52 -0700
References: <009601caa80d$da9ca2a0$5e82af48@Ken1> <4B6EEA3B.3090005@san.rr.com> <000901caa814$2dd18ac0$5e82af48@Ken1>

I tried to separate the factors which contribute to the distribution of
pairwise TMRCAs of a population of N haplotypes. Suppose we could obtain
arbitrarily large numbers of STRs for our haplotypes. You could then drive
down the statistical confidence interval for each of your pairwise TMRCAs.
But you would still be left with a distribution of TMRCAs which reflects the
various true time depths of the common ancestors of your N haplotypes taken
two at a time. That latter distribution is what the demographics of the tree
determine. It is still there after eliminating the statistical part of
the distribution by going to the imaginary world of 500-STR haplotypes.
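
To illustrate the point above, here is a minimal Python sketch of the "true" pairwise TMRCA distribution for a toy tree. The representation is an assumption made for illustration only: each haplotype is described by a tuple of unique ancestor labels, one per generation after the MRCA, and the function name and sample labels are hypothetical. No STR data enter at all, yet the pairwise values still spread out.

```python
from itertools import combinations

def true_pairwise_tmrca(paths):
    """Return the true TMRCA (in generations) for every pair of haplotypes.

    paths maps a haplotype name to a tuple of ancestor labels, one per
    generation after the MRCA; all tuples are assumed to have length G.
    """
    result = {}
    for (a, pa), (b, pb) in combinations(paths.items(), 2):
        shared = 0
        for x, y in zip(pa, pb):
            if x != y:
                break
            shared += 1          # generations the two lines still share
        result[(a, b)] = len(pa) - shared
    return result

# Hypothetical 4-generation tree with N = 4 sample haplotypes.
paths = {
    "h1": ("a", "a1", "a2", "h1"),
    "h2": ("a", "a1", "a2", "h2"),
    "h3": ("a", "a3", "a4", "h3"),
    "h4": ("b", "b1", "b2", "h4"),
}

for pair, t in true_pairwise_tmrca(paths).items():
    print(pair, t)
```

The pair (h1, h2) has a true TMRCA of 1 generation while any pair involving h4 has the full 4, so the N(N-1)/2 values form a distribution that is set purely by the shape of the tree.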

----- Original Message -----
From: "Ken Nordtvedt" <>
To: <>
Sent: Sunday, February 07, 2010 9:39 AM
Subject: Re: [DNA] Intraclade Age Sigma "Unknowable"

> If you do the interclade age estimates for all the N(N-1)/2 pairs of
> haplotypes of your collection, you will get a distribution of age
> estimates --- that distribution being the result of two factors. 1) the
> values;
> and 2) the actual distribution of true TMRCAs between the pairs of
> haplotypes of your sample collection. For instance, if you happen to
> include Mr Jones and his great uncle in your sample collection, their
> actual TMRCA will be just a couple of generations, so their estimated TMRCA
> will follow the distribution appropriate to a G of 2, regardless of the
> whole collection's TMRCA.
>
>
> ----- Original Message -----
> From: "Al Aburto" <>
> To: <>
> Sent: Sunday, February 07, 2010 9:28 AM
> Subject: Re: [DNA] Intraclade Age Sigma "Unknowable"
>
>
>> We sure do need a sigma though. We have the data ... attached to it
>> there is indeed a sigma, but you are saying it is unknowable from
>> current haplotypes.
>>
>> I have tried doing TMRCA estimates in pairs of (unique) haplotypes
>> (using Walsh's infinite allele method) on a group of "n" haplotypes and
>> then getting the mean and sigma from that for the group (cluster) of "n"
>> haplotypes. Is this meaningful?
>> Al
>>
>> On 2/7/2010 7:54 AM, Ken Nordtvedt wrote:
>>> One way to estimate TMRCA of a clade is to find the sum of STR variances
>>> of a sample of clade haplotypes of today, with variances measured to an
>>> assumed founding haplotype. Then divide by sum of STR mutation rates:
>>>
>>> Gest = Sum over i and m of [r(i,m) - rf(m)]^2 / (N M) == Var / M
>>>
>>> r(i,m) is the repeat value of the mth STR of the ith haplotype. N is
>>> number of haplotypes, M is sum of STR mutation rates. For young clades
>>> the variances become essentially GD counts.
>>> rf(m) is founder haplotype's repeat value for the mth STR.
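
The estimate defined above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used in any existing tool: haplotypes are given as lists of repeat counts in a fixed STR order, the founder haplotype is supplied by the caller, and the function name, sample haplotypes and mutation rates are all made up for the example.

```python
def estimate_generations(haplotypes, founder, mutation_rates):
    """Gest = Var / M, with Var summed over STRs and averaged over haplotypes."""
    n = len(haplotypes)
    # Sum over haplotypes i and markers m of [r(i,m) - rf(m)]^2, divided by N.
    var = sum(
        (r - rf) ** 2
        for hap in haplotypes
        for r, rf in zip(hap, founder)
    ) / n
    m_total = sum(mutation_rates)  # M, the sum of STR mutation rates
    return var / m_total

# Hypothetical two-marker haplotypes and per-generation mutation rates.
haplotypes = [[13, 24], [14, 24], [13, 25], [13, 24]]
founder = [13, 24]
rates = [0.002, 0.003]

print(estimate_generations(haplotypes, founder, rates))  # roughly 100 generations
```

With only one-step differences from the founder, the squared deviations are just GD counts, as the text notes for young clades.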
>>>
>>> But due to the stochastic (random) nature of STR mutations, the right-hand
>>> side of the above equation (sum of STR variances) will be a distribution
>>> which sometimes falls above its average value and sometimes below. We
>>> want to know the width of that distribution, so we can get a sense for
>>> the statistical uncertainty of the age estimate, which is based on what
>>> happens "on average". The more STRs we use in our haplotypes, the better
>>> we can assume we are near average, but there is always this statistical
>>> uncertainty. How big is it for the intraclade age estimate? Basically
>>> we cannot tell without knowing the early demographics of the y tree,
>>> which starts with the haplotype sample population's MRCA and ends with
>>> the N sample haplotypes G generations later.
>>>
>>> The analytic formula for the statistical confidence interval for
>>> reasonably young clades is given by:
>>>
>>> Variance of Var = M { Sum c f(c)^2 }
>>> And the 1 sigma confidence interval for Gest is then
>>> SquareRoot {Variance of Var} / M
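
As a worked illustration of the two formulas above, the following Python sketch evaluates the 1 sigma interval for a tree whose fractions f(c) are assumed to be known. The tree used (two branch lines, each ancestral to half the sample, for G generations) and the values of G and M are hypothetical; real trees will not hand you their f(c).

```python
import math

def sigma_gest(fractions, m_total):
    """1 sigma of Gest from the list of f(c) over every male c in the tree."""
    var_of_var = m_total * sum(f * f for f in fractions)  # M * Sum_c f(c)^2
    return math.sqrt(var_of_var) / m_total

# Hypothetical tree: two branch lines, each with f(c) = 1/2, for G generations.
G = 50
M = 0.1
fractions = [0.5, 0.5] * G

print(sigma_gest(fractions, M))
```

For this particular tree the sum of f(c)^2 is G/2, so the result reduces to SquareRoot {G / (2 M)}, about 15.8 generations for the numbers chosen here.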
>>>
>>> Variance of Var is simply unknowable without knowing the tree
>>> demographics --- the f(c), particularly their values in the tree's
>>> earliest generations. The fractions f(c) are relatively large early in
>>> the tree and get smaller and smaller as we approach the end of the tree,
>>> being 1/N on each of the N branch segments which terminate with our N
>>> sample haplotypes.
>>>
>>>
>>> The label c stands for each male in the y tree which ends with our
>>> sample population of N haplotypes. f(c) is the fraction of those N
>>> haplotypes for which male "c" is an ancestor. The sum over c can be done
>>> as a sum over branch lines in each generation of the tree and then a sum
>>> over generations from 1 to G. The number of branch lines is 2 in the first
>>> generation after the MRCA, it increases by one every time a tree node
>>> comes along on one of the branch lines, and it ends up being N in the
>>> last generation before the present. We can simply call that number of
>>> branch lines in generation g P(g), the tree population in generation g.
>>>
>>> Consider the sum of f(c)^2 in any particular generation g of the tree.
>>> While the sum of f(c) for any particular generation must be one, the sum
>>> of squares can be nearly as big as one if one particular branch line hogs
>>> almost all the ancestry, but it can be no smaller than 1 / P(g). So we can
>>> produce an expression for the minimum size that Variance of Var can have,
>>> reached under the most democratic tree ancestry scenario in which every
>>> ancestor in the tree in every generation shares equally in the fraction of
>>> ultimate descendants.
>>>
>>> Variance of Var >= M { Sum g from 1 to G of 1 / P(g) }
>>>
>>> Even this lower limit for the Variance of Var depends on the
>>> "unknowable" --- the tree population each generation, and especially the
>>> earliest tree generations when P(g) is small. In particular, the
>>> Variance of Var is basically determined by how fast the early tree
>>> population grows from its initial size of 2.
>>>
>>> Note that Variance of Var does not keep getting smaller toward zero as N
>>> gets larger. If we used for our sample population the entire population
>>> today for some clade, perhaps numbering millions, we still have a
>>> Variance of Var lower limit dominated by the reciprocal of the tree
>>> population in the early generations. P(g) begins at 2, regardless. In
>>> fact, a good sample population of haplotypes which covers the early
>>> generations of the tree as does the full population is all that is
>>> required to statistically do about as well as one practically needs to
>>> do.
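
A quick numerical check of the remark above, as a sketch under an assumed growth model: the tree population is taken to double every generation until it reaches the sample size N and then stay there. The model, the function name and the numbers are illustrative assumptions, not a claim about any real clade.

```python
def bound_sum(generations, n_sample):
    """Sum over g = 1..G of 1 / P(g), with P(g) doubling until capped at N."""
    return sum(1.0 / min(2 ** g, n_sample) for g in range(1, generations + 1))

G = 200
for n_sample in (100, 10_000, 1_000_000):
    print(n_sample, round(bound_sum(G, n_sample), 4))
```

Under this model the sum settles near 1 as N grows instead of shrinking toward zero, since the terms 1/2 + 1/4 + 1/8 + ... from the earliest generations are there regardless of N.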
>>>
>>> Note that the only thing you can do to drive down the statistical
>>> confidence interval more and more is to increase M, the sum of STR
>>> mutation rates of your haplotypes.
>>>
>>> Ken
>>>
>>> PS: Interclade statistical confidence intervals, on the other hand, can be
>>> conservatively quoted by an upper bound which depends on no demographic
>>> knowledge. One can forget about the dispersion of the tree within each
>>> clade and consider the simple "V tree" consisting of two branch lines
>>> from the interclade node to the present. The statistical confidence
>>> interval for the TMRCA of that simplified tree is then straightforwardly
>>> evaluatable and is used in Generations4 for the 1 sigma values of the
>>>
>>>
>>
>>
>> -------------------------------
>> To unsubscribe from the list, please send an email to
>> with the word 'unsubscribe' without
>> the
>> quotes in the subject and the body of the message
>>