GENEALOGY-DNA-L Archives

From: (John Chandler)
Subject: Re: [DNA] TMRCA
Date: Fri, 16 Feb 2007 19:25:04 -0500 (EST)
References: <620805.49798.qm@web31509.mail.mud.yahoo.com>
In-Reply-To: <620805.49798.qm@web31509.mail.mud.yahoo.com> (message from Jonathan Day on Thu, 15 Feb 2007 23:48:46 -0800 (PST))


Jonathan wrote:
> You're technically correct. I'm assuming that there
> can be variations in mutation rates

Given the rest of your message, it's clear that you are mistaking rate
estimates for the rates themselves. There are indeed wide variations
in the rate estimates based on small data samples, but these
variations are *not* actually wide compared to the statistical
uncertainties in the estimates in question, only wide compared to the
uncertainties in estimates based on large data samples.
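
A minimal sketch of the point about sample size, assuming a simple
binomial model in which each father-son transmission of a marker either
mutates or does not. The rate of 0.002 and the two sample sizes are
illustrative assumptions, not measured values.

```python
import math

def rate_standard_error(mu, n_transmissions):
    """Binomial standard error of a mutation rate estimated from n transmissions."""
    return math.sqrt(mu * (1.0 - mu) / n_transmissions)

MU = 0.002  # assumed true rate per marker per generation (illustrative only)

for n in (500, 50_000):  # a small and a large sample (both hypothetical)
    se = rate_standard_error(MU, n)
    print(f"n={n:>6}: expected mutations={MU * n:6.1f}, "
          f"standard error={se:.5f}, relative error={se / MU:.0%}")
```

With the small sample the standard error is about as large as the rate
itself, so two small studies can disagree widely without the underlying
rate differing at all.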

> Why is that important? Well, let's say you have a
> genetic distance of 4. There are five possible ways
> that can occur:
>
> a) Person A has 4 mutations, person B has none
> b) Person A has 3 mutations, person B has 1
> c) Person A has 2 mutations, person B has 2
> d) Person A has 1 mutation, person B has 3
> e) Person A has no mutations, person B has 4
>
> If you have to include all five possibilities (ie: the
> entire bell-curve of possibilities) then you have a
> gigantic range of possible times to consider, which is
> why TMRCA calculators generally produce very ugly
> results. They aren't able to factor in skew, so must
> offer the widest possible range of answers and the
> least possible confidence on any of them.

You're partly right. A TMRCA calculator is *correct* in offering a
large range of possible times in a situation like this. It's not
because there are five ways the mutations could be distributed between
two lines, but rather because the range naturally increases as the
number of mutations rises. Indeed, it isn't possible to determine the
actual distribution of the four mutations by inspecting the two
haplotypes, and so your implied promise of somehow "doing better" is
illusory.
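
The growth of the range with the number of mutations can be sketched
under some loudly stated assumptions: mutations arrive as a Poisson
process, back mutations are ignored, and the prior on time is flat. The
posterior for the expected mutation count is then a Gamma distribution
with shape k+1. The 37 markers and the rate of 0.004 are hypothetical
values chosen only for illustration.

```python
import math

def gamma_cdf(x, k):
    """CDF of Gamma(shape=k+1, scale=1) for integer k, via the Poisson sum."""
    return 1.0 - sum(math.exp(-x) * x**i / math.factorial(i) for i in range(k + 1))

def gamma_quantile(p, k, lo=0.0, hi=200.0):
    """Invert gamma_cdf by bisection."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if gamma_cdf(mid, k) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def tmrca_interval(k, n_markers=37, mu=0.004):
    """95% interval, in generations, for k observed mutations between two haplotypes.

    n_markers and mu are assumed values; the factor 2 counts both lines of descent.
    """
    mutations_per_generation = 2 * n_markers * mu
    return (gamma_quantile(0.025, k) / mutations_per_generation,
            gamma_quantile(0.975, k) / mutations_per_generation)

for k in range(7):
    low, high = tmrca_interval(k)
    print(f"distance {k}: {low:6.1f} to {high:6.1f} generations (width {high - low:5.1f})")
```

The width of the interval rises steadily with the genetic distance, and
nothing in the calculation refers to which line carried which mutation.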

> Let's take our case of having a genetic distance of 4,
> and let's say that this is a freak case and it turns
> out that the changes map onto case (a) above.
>
> The greatest possible time to TMRCA is the greatest
> possible time in which you can sensibly have the
> second group be subject to absolutely no mutations
> over the N markers being looked at.

Now you're confusing absolute probabilities with conditional
probabilities. When you use the word "sensibly" above, you are
obviously referring to the unconstrained case where the number of
mutations in line A is unknown. With knowledge comes constraint. The
"sensible" probabilities are necessarily different in this case. By
declaring that Person A's line had four mutations in about the same
time that Person B's line had none, you have dramatically increased
the "sensible" expectation of time elapsed for line B, compared with
the unconstrained case. The fact is that the time estimate does *not*
depend on how the quota of four mutations is distributed between the
two lines. The more mutations you take away from Person B, the more
you have to give to Person A (within the imposed condition that the
total number of mutations is four).
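
This can be checked numerically with a small sketch, assuming each line
accumulates mutations as an independent Poisson process at the same
rate (37 markers at 0.004 per marker per generation, both hypothetical
values). The likelihood of the elapsed time is computed for every way
of splitting four mutations between the two lines and then normalized.

```python
import math

def normalized_likelihood(a, b, rate_per_line, times):
    """Likelihood of each time t given a mutations on line A and b on line B,
    normalized to sum to 1 over the grid of times."""
    raw = []
    for t in times:
        lam = rate_per_line * t  # expected mutations on one line after t generations
        p_a = math.exp(-lam) * lam**a / math.factorial(a)
        p_b = math.exp(-lam) * lam**b / math.factorial(b)
        raw.append(p_a * p_b)
    total = sum(raw)
    return [x / total for x in raw]

RATE = 37 * 0.004        # assumed mutations per generation on one line
TIMES = range(1, 201)    # generations considered (arbitrary grid)
SPLITS = [(4, 0), (3, 1), (2, 2), (1, 3), (0, 4)]

curves = {s: normalized_likelihood(s[0], s[1], RATE, TIMES) for s in SPLITS}
reference = curves[(2, 2)]
worst = {s: max(abs(x - y) for x, y in zip(curves[s], reference)) for s in SPLITS}
for s in SPLITS:
    print(f"split {s}: largest difference from the (2, 2) curve = {worst[s]:.2e}")
```

The split enters only through the factor 1/(a! b!), which does not
depend on time and cancels on normalization, so all five curves agree
to within rounding error.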

The bottom line is that you have *one* time quantity to describe, not
two. You have pointed out correctly that the generational time can
vary, but you have neglected the tendency of such variations to
average out in the long run. Therefore, you cannot make two estimates
with different assumptions and then arbitrarily "combine" them in some
artificial way. The only proper way to estimate TMRCA is to make a
single estimate (or rather a single description of its probability
distribution) using all of the information at once.

John Chandler

