GENEALOGY-DNA-L Archives


From:
Subject: Re: [DNA] Genetic Distance calculation - Comments re MacGregor and a further question
Date: Thu, 4 Dec 2003 19:26:38 -0500 (EST)
References: <1ec.13e271f9.2cf0ed2f@aol.com> <002b01c3b1c7$f967e610$799c89d9@helen> <REME20031201154816@alum.mit.edu> <00bc01c3b86f$b78fef90$3fb50818@c452380a> <REME20031202170433@alum.mit.edu> <014701c3b935$eb94f940$3fb50818@c452380a> <REME20031202220812@alum.mit.edu> <015b01c3b9c0$36087850$3fb50818@c452380a>
In-Reply-To: <015b01c3b9c0$36087850$3fb50818@c452380a> (ecbeaty@comcast.net)


Earl wrote:
> I am still having trouble with a point of principle. The numbers going into
> the formula are the genetic distances, rather than the numbers of mutations.
> (The difference has to do with the fast/slow designations.) Some mutations
> affect more than one testee, and I would expect to try to use the actual
> number of mutations for the group. Determination of the net number of
> mutations is not easy, and even if done with confidence, I don't see how
> to use the numbers in a variance calculation. You have suggested
> ignoring the issue of recent common ancestors for part of the group, and
> that is certainly convenient. Is there more to this than convenience? I am
> hoping you can say some more about the reasons behind the formula and the
> impact of better data on mutation rates.

If you take multiple samples at random from a modern cross-section of
a descendancy tree, you will be surprised if the first two share any
common ancestor more recent than the "root" of the whole tree, but the
chances of that eventuality rise with every new sample you take. The
more samples you have, the more likely you are to find pairs with more
and more recent common ancestors. This is unavoidable in a random
sampling, and it should therefore not be viewed as a flaw in the study.
The point is that every generation on every line adds a random variable
(with discrete values) to the genetic state. By hypothesis, the rate
of mutation has been constant all this time, and so these random
variables all have the same distribution. Therefore, the expected
variance of the sample population after t generations is just t times
the variance of that one-generation random variable. If your sample
is large enough and unbiased enough, you should get close to the
expected variance, despite the unavoidable recent shared ancestors.
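
As a rough illustration of the "t times the one-generation variance"
statement, here is a minimal Python sketch. It assumes a symmetric
single-step model in which a marker moves up or down by one repeat
with total probability mu per generation; that model, the value of
mu, and all the function names are assumptions made for the example,
not something specified above. It builds the exact distribution of
the net change along one line of descent and compares its variance
with t times the one-generation variance.

```python
from collections import defaultdict


def one_generation(mu):
    """Assumed one-generation change: -1 or +1 with probability mu/2 each."""
    return {-1: mu / 2, 0: 1 - mu, 1: mu / 2}


def convolve(p, q):
    """Distribution of the sum of two independent discrete variables."""
    out = defaultdict(float)
    for a, pa in p.items():
        for b, qb in q.items():
            out[a + b] += pa * qb
    return dict(out)


def variance(dist):
    """Variance of a discrete distribution given as {value: probability}."""
    mean = sum(x * p for x, p in dist.items())
    return sum((x - mean) ** 2 * p for x, p in dist.items())


def after_generations(mu, t):
    """Exact distribution of the net change after t independent generations."""
    dist = {0: 1.0}
    step = one_generation(mu)
    for _ in range(t):
        dist = convolve(dist, step)
    return dist


if __name__ == "__main__":
    mu, t = 0.004, 50  # hypothetical rate and number of generations
    print(variance(after_generations(mu, t)))  # variance after t generations
    print(t * variance(one_generation(mu)))    # t times one-generation variance
```

The sketch follows a single line of descent only. It does not model
the shared recent ancestors discussed above; it just shows that the
variance accumulated along any one line grows in proportion to t.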

Obviously, in order to estimate the mutation rate (as opposed to an
unknown TMRCA), you need to know exactly how everybody is related
and exactly how many mutations actually occurred. That is a separate
problem. If you are only guessing that people are on the same branch
because of a shared mutation, you can keep the starting assumption
that the sample was randomly chosen and go ahead with the TMRCA
estimate as planned.

If you want to know how to refine the TMRCA estimate in the future,
when separate mutation rates for the separate markers will be
available, it comes down to treating the locus-by-locus differences
with separate scale factors. Instead of squaring and averaging the
integer differences and then dividing by the average mutation rate at
the very end, you would divide each square by the appropriate rate
as you go. HOWEVER, such refinement is likely to make relatively
small changes in the results when compared to the statistical
uncertainty.
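
The two ways of scaling can be sketched in Python as follows. This is
only an illustration under stated assumptions: the differences are
taken to be integer repeat-count differences from an assumed ancestral
haplotype, the per-marker rates are made-up numbers, and the function
names are hypothetical.

```python
def tmrca_average_rate(diffs, rates):
    """Square and average all differences, then divide by the mean rate."""
    n_samples, n_loci = len(diffs), len(rates)
    mean_sq = sum(d * d for row in diffs for d in row) / (n_samples * n_loci)
    mean_rate = sum(rates) / n_loci
    return mean_sq / mean_rate


def tmrca_per_locus(diffs, rates):
    """Divide each squared difference by its own marker's rate, then average."""
    n_samples, n_loci = len(diffs), len(rates)
    total = sum(d * d / r for row in diffs for d, r in zip(row, rates))
    return total / (n_samples * n_loci)


if __name__ == "__main__":
    # Rows are samples, columns are markers (made-up data).
    diffs = [[0, 1], [1, 0], [0, 0], [0, -1]]
    rates = [0.001, 0.004]  # hypothetical per-marker mutation rates
    print(tmrca_average_rate(diffs, rates))
    print(tmrca_per_locus(diffs, rates))
```

When every marker has the same rate, the two functions give the same
answer; they differ only when the rates differ from marker to marker.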

John Chandler

