GENEALOGY-DNA-L Archives


From: "Ken Nordtvedt" <>
Subject: Re: [DNA] Intraclade Age Sigma "Unknowable"
Date: Sun, 7 Feb 2010 09:39:53 -0700
References: <009601caa80d$da9ca2a0$5e82af48@Ken1> <4B6EEA3B.3090005@san.rr.com>


If you compute the interclade age estimates for all N(N-1)/2 pairs of
haplotypes in your collection, you will get a distribution of age
estimates. That distribution results from two factors: 1) the
statistical distribution of interclade age estimates about their true values;
and 2) the actual distribution of true TMRCAs among the pairs of
haplotypes in your sample collection. For instance, if you happen to
include Mr Jones and his great uncle in your sample collection, their actual
TMRCA will be just a couple of generations, so their estimated TMRCA will
follow the distribution appropriate to a G of 2, regardless of the whole
collection's TMRCA.
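
The pairwise calculation can be sketched as below. This is only an illustration: it assumes a symmetric single-step mutation model (not Walsh's infinite allele method), and the function name, haplotypes and mutation rates are all made up for the example.

```python
from itertools import combinations


def pairwise_age_estimates(haplotypes, rates):
    """Return a TMRCA estimate (in generations) for every pair of haplotypes.

    Assumption: symmetric single-step mutations, so the expected squared
    repeat difference at a marker is 2 * rate * G for two lines that each
    run G generations back to their common ancestor.
    """
    total_rate = sum(rates)  # M, the sum of STR mutation rates
    estimates = {}
    for (i, a), (j, b) in combinations(enumerate(haplotypes), 2):
        squared_diff = sum((x - y) ** 2 for x, y in zip(a, b))
        estimates[(i, j)] = squared_diff / (2 * total_rate)
    return estimates


# Hypothetical 4-marker haplotypes; rows 0 and 1 stand in for close relatives.
haplotypes = [
    [13, 24, 14, 11],
    [13, 24, 14, 11],
    [14, 23, 14, 12],
    [12, 24, 15, 11],
]
rates = [0.002, 0.002, 0.002, 0.002]  # assumed per-generation mutation rates

estimates = pairwise_age_estimates(haplotypes, rates)
for pair, age in sorted(estimates.items()):
    print(pair, round(age, 1))
```

The close-relative pair (0, 1) gives an estimate of zero while the other pairs give much larger values, so the spread of the N(N-1)/2 estimates reflects the mix of true pairwise TMRCAs as well as statistical scatter.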


----- Original Message -----
From: "Al Aburto" <>
To: <>
Sent: Sunday, February 07, 2010 9:28 AM
Subject: Re: [DNA] Intraclade Age Sigma "Unknowable"


> We sure do need a sigma though. We have the data ... attached to it
> there is indeed a sigma, but you are saying it is unknowable from
> current haplotypes.
>
> I have tried doing TMRCA estimates in pairs of (unique) haplotypes
> (using Walsh's infinite allele method) on a group of "n" haplotypes and
> then getting the mean and sigma from that for the group (cluster) of "n"
> haplotypes. Is this meaningful?
> Al
>
> On 2/7/2010 7:54 AM, Ken Nordtvedt wrote:
>> One way to estimate TMRCA of a clade is to find the sum of STR variances
>> of a sample of clade haplotypes of today, with variances measured to an
>> assumed founding haplotype. Then divide by the sum of STR mutation rates:
>>
>> Gest = Sum over i and m of [r(i,m) - rf(m)]^2 / (N M) == Var / M
>>
>> r(i,m) is the repeat value of the mth STR of the ith haplotype. N is
>> number of haplotypes, M is sum of STR mutation rates. For young clades
>> the variances become essentially GD counts.
>> rf(m) is founder haplotype's repeat value for the mth STR.
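
A minimal sketch of the Gest calculation just described, with the sum taken over both haplotypes i and STRs m. The function name, the haplotypes, the founder and the rates are hypothetical values chosen only for illustration.

```python
def intraclade_age(haplotypes, founder, rates):
    """Gest = Var / M, with variances measured to the assumed founder."""
    n = len(haplotypes)       # N, number of haplotypes
    total_rate = sum(rates)   # M, sum of STR mutation rates
    # Var: squared deviations from the founder, summed over all STRs
    # and averaged over the N haplotypes.
    var = sum(
        (r - rf) ** 2
        for haplotype in haplotypes
        for r, rf in zip(haplotype, founder)
    ) / n
    return var / total_rate


# Hypothetical data: four 4-marker haplotypes and an assumed founder.
haplotypes = [
    [13, 24, 14, 11],
    [13, 24, 14, 11],
    [14, 23, 14, 12],
    [12, 24, 15, 11],
]
founder = [13, 24, 14, 11]
rates = [0.002, 0.002, 0.002, 0.002]  # assumed per-generation rates

g_est = intraclade_age(haplotypes, founder, rates)
print(round(g_est, 2))
```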
>>
>> But due to the stochastic (random) nature of STR mutations, the right hand
>> side of the above equation (sum of STR variances) will be a distribution
>> which sometimes falls above its average value and sometimes below. We
>> want to know the width of that distribution, so we can get a sense of
>> the statistical uncertainty of the age estimate, which is based on what
>> happens "on average." The more STRs we use in our haplotypes, the closer
>> we can assume we are to the average, but there is always this statistical
>> uncertainty. How big is it for the intraclade age estimate? Basically
>> we cannot tell without knowing the early demographics of the Y tree
>> which starts with the haplotype sample population's MRCA and ends with
>> the N sample haplotypes G generations later.
>>
>> The analytic formula for the statistical confidence interval for
>> reasonably young clades is given by:
>>
>> Variance of Var = M { Sum over c of f(c)^2 }
>>
>> The 1 sigma confidence interval for Gest is then SquareRoot {Variance
>> of Var} / M.
>>
>> Variance of Var is simply unknowable without knowing the tree
>> demographics --- the f(c), particularly their values in the tree's
>> earliest generations. The fractions f(c) are relatively large early in
>> the tree and get smaller and smaller as we approach the end of the tree,
>> being 1/N on each of the N branch segments which terminate with our N
>> sample haplotypes.
>>
>>
>> The label c stands for each male in the y tree which ends with our sample
>> population of N haplotypes. f(c) is the fraction of those N haplotypes
>> for which male "c" is an ancestor. The sum over c can be done as a sum
>> over branch lines in each generation of the tree and then a sum over
>> generations from 1 to G. The number of branch lines is 2 in the first
>> generation after the MRCA, and it increases by one every time a tree node
>> on one of the branch lines comes along, and that branch line number ends
>> up being N in the last generation before the present. We can simply call
>> that number of branch lines in generation g P(g), the tree population in
>> generation g.
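
To make f(c) concrete, here is a sketch that evaluates Sum over c of f(c)^2 and the resulting 1 sigma for a toy tree. Everything in it is hypothetical: the tree (4 sample haplotypes, 3 generations below the MRCA), the child-to-father dictionary used to represent it, and the value of M.

```python
import math
from collections import Counter


def descendant_fractions(father, samples):
    """f(c) for every male c below the MRCA.

    `father` maps each male to his father; the MRCA has no entry, so he is
    not counted. Each sampled male is counted as the last male on his own
    branch line, which gives him f = 1/N.
    """
    counts = Counter()
    for sample in samples:
        node = sample
        while node in father:  # climb until the MRCA is reached
            counts[node] += 1
            node = father[node]
    n = len(samples)
    return {c: k / n for c, k in counts.items()}


# Toy tree: branch lines number 2, 3 and 4 in generations 1, 2 and 3.
father = {
    "a": "mrca", "b": "mrca",             # generation 1
    "a1": "a", "b1": "b", "b2": "b",      # generation 2
    "s1": "a1", "s2": "b1", "s3": "b2", "s4": "b2",  # generation 3 (samples)
}
samples = ["s1", "s2", "s3", "s4"]
total_rate = 0.1  # M, an assumed sum of STR mutation rates

f = descendant_fractions(father, samples)
sum_f_squared = sum(v ** 2 for v in f.values())
variance_of_var = total_rate * sum_f_squared
sigma_generations = math.sqrt(variance_of_var) / total_rate
print(sum_f_squared, round(sigma_generations, 3))
```

In this toy tree branch "b" carries three of the four samples, so generation 1 contributes 10/16 to the sum rather than the 8/16 an even split would give.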
>>
>> Consider the sum of f(c)^2 in any particular generation g of the tree.
>> While the sum of f(c) for any particular generation must be one, the sum
>> of squares can approach one if one particular branch line hogs almost
>> all the ancestry, but it can be no smaller than 1 / P(g). So we can
>> produce an expression for the minimum size that Variance of Var can have
>> under the most democratic tree ancestry scenario, in which every ancestor
>> in the tree in every generation shares equally in the fraction of
>> ultimate descendants.
>>
>> Variance of Var >= M { Sum over g from 1 to G of 1 / P(g) }
>>
>> Even this lower limit for the Variance of Var depends on the
>> "unknowable" --- the tree population each generation, and especially the
>> earliest tree generations when P(g) is small. In particular, the
>> Variance of Var is basically determined by how fast the early tree
>> population grows from its initial size of 2.
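
The dependence on early growth can be illustrated by evaluating the lower limit for two invented population histories that end at the same size. Both P(g) sequences and the value of M are assumptions made up for this sketch, not data.

```python
import math


def min_sigma_generations(populations, total_rate):
    """Lower limit on the 1 sigma of Gest for a given P(g), g = 1..G.

    populations[g - 1] holds P(g); total_rate is M.
    """
    variance_of_var = total_rate * sum(1 / p for p in populations)
    return math.sqrt(variance_of_var) / total_rate


total_rate = 0.1  # M, an assumed sum of STR mutation rates

# Two hypothetical 10-generation trees, both starting at 2 and ending at 64.
fast_growth = [2, 4, 8, 16, 32, 64, 64, 64, 64, 64]
slow_growth = [2, 2, 2, 2, 2, 4, 8, 16, 32, 64]

sigma_fast = min_sigma_generations(fast_growth, total_rate)
sigma_slow = min_sigma_generations(slow_growth, total_rate)
print(round(sigma_fast, 2), round(sigma_slow, 2))
```

The slow-growing tree has the larger lower limit because its generations with P(g) = 2 each add 1/2 to the sum.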
>>
>> Note that Variance of Var does not keep getting smaller toward zero as N
>> gets larger. If we used for our sample population the entire population
>> today for some clade, perhaps numbering millions, we still have a
>> Variance of Var lower limit dominated by the reciprocal of the tree
>> population in the early generations. P(g) begins at 2, regardless. In
>> fact, a good sample of haplotypes which covers the early generations
>> of the tree as well as the full population does is all that is
>> required to do about as well statistically as one practically needs
>> to.
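
A quick numerical check of this point, under an invented growth pattern: the tree population doubles each generation from P(1) = 2 until it reaches the sample size N, then stays there. The growth rule, the two values of N and the 30 generations are all assumptions for illustration.

```python
def sum_inverse_population(final_size, generations):
    """Sum over g of 1 / P(g) for a tree that doubles from 2 up to final_size."""
    total = 0.0
    population = 2  # P(1) is 2 regardless of the sample size
    for _ in range(generations):
        total += 1 / population
        population = min(2 * population, final_size)
    return total


small_sample = sum_inverse_population(final_size=64, generations=30)
huge_sample = sum_inverse_population(final_size=2 ** 20, generations=30)
print(small_sample, huge_sample)
```

Going from 64 to about a million haplotypes only moves the sum from roughly 1.36 to just above 1 in this scenario, because the first few generations already contribute nearly 1.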
>>
>> Note that the only thing you can do to drive down the statistical
>> confidence interval more and more is to increase M, the sum of STR
>> mutation rates of your haplotypes.
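
The effect of M follows from the formula: SquareRoot {M S} / M is SquareRoot {S / M}, where S stands for Sum over c of f(c)^2. The sketch below uses a made-up S and two made-up values of M to show the scaling.

```python
import math


def sigma_generations(sum_f_squared, total_rate):
    """1 sigma of Gest: sqrt(M * S) / M, which equals sqrt(S / M)."""
    return math.sqrt(total_rate * sum_f_squared) / total_rate


tree_sum = 1.25  # hypothetical S, fixed by the (unknown) tree demographics

sigma_low_m = sigma_generations(tree_sum, total_rate=0.03)
sigma_high_m = sigma_generations(tree_sum, total_rate=0.12)
print(round(sigma_low_m, 2), round(sigma_high_m, 2))
```

Quadrupling M halves the 1 sigma for the same tree.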
>>
>> Ken
>>
>> PS: Interclade statistical confidence intervals on the other hand can be
>> conservatively quoted by an upper bound which depends on no demographic
>> knowledge. One can forget about the dispersion of the tree within each
>> clade and consider the simple "V tree" consisting of two branch lines
>> from the interclade node to the present. The statistical confidence
>> interval for the TMRCA of that simplified tree is then straightforwardly
>> evaluatable and is used in Generations4 for the 1 sigma values of the
>> interclade age estimates.
>>
>>
>
>


