GENEALOGY-DNA-L Archives



From: James Heald <>
Subject: [DNA] Puzzle, resolved? (was: TMRCAs for groups of haplotypes)
Date: Wed, 25 Feb 2009 22:10:13 +0000
References: <49A2D0B4.5070808@ucl.ac.uk>
In-Reply-To: <49A2D0B4.5070808@ucl.ac.uk>


Question:

Suppose you have a perfect 25/25 match, and use your favourite method to
estimate a TMRCA we can call T.

In ASCII art:

  /\    = T
 /  \

Now suppose you find a third perfect match. Should the new group TMRCA
be more, less, or the same as the previous T?

   /\
  /  \   = ?
 /\   \
/  \   \

A couple of days ago, I made the following argument, which suggests
that it should be *larger* than the original T:

> Even if you knew (from an infallible oracle) how
> far back you had to go to get back to only two lines left standing,
> there would still be a remaining uncertainty in the coalescence time for
> the last two lines -- viz the confidence interval in the TMRCA for a
> 12/12 match or a 25/25 match or a 37/37 match (as appropriate) that one
> can look up on Bruce Walsh's page at
> http://nitro.biosci.arizona.edu/ftdna/TMRCA.html
> So, just on the basis of this, there must be a 95% interval of at least 650
> years in any 37 marker calculation, 975 years in any 25 marker
> calculation, and 2225 years in a calculation based only on 12 markers.
>
> (In fact, reasonable uncertainties must be substantially
> larger, since this is the unavoidable contribution to the uncertainty
> from only the very last step of the coalescence. But even having a
> minimum ballpark figure is at least a start).


On the other hand, reading through some of the discussions from
June/July last year, it is striking how successful the
total-number-of-mutations divided by number-of-haplotypes divided by
mutation-rate estimator can be, and of course its big brother, the
variance method.

(The zero in the total number of mutations isn't a problem. If you do
the maths properly to estimate the parameter lambda of a Poisson
distribution given n counts, the likelihood function you get for
lambda is, up to a factor of two in scale, a chi-squared distribution
with k=2n+2 degrees of freedom. This has its maximum at lambda=n, but
more meaningful is its mean at lambda=n+1. For making estimates, we
ought therefore to add 1 to the count of the total number of
mutations.)
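That claim is easy to check numerically. Here is a minimal sketch (my
own toy code, names hypothetical): with a flat prior on lambda, the
posterior given n Poisson counts peaks at n but has mean n+1.

```python
import math

def poisson_likelihood(lam, n):
    # Probability of observing n mutations given Poisson mean lam.
    return math.exp(-lam) * lam**n / math.factorial(n)

def posterior_summary(n, lam_max=50.0, steps=100000):
    # With a flat prior, the posterior for lam is proportional to the
    # likelihood (a Gamma(n+1, 1) density; 2*lam is chi-squared with
    # k = 2n + 2).  Find its mode and mean on a midpoint grid.
    dx = lam_max / steps
    grid = [(i + 0.5) * dx for i in range(steps)]
    w = [poisson_likelihood(lam, n) for lam in grid]
    total = sum(w)
    mean = sum(lam * wi for lam, wi in zip(grid, w)) / total
    mode = grid[max(range(steps), key=lambda i: w[i])]
    return mode, mean

mode0, mean0 = posterior_summary(0)  # zero mutations observed:
                                     # mode ~ 0 (= n), mean ~ 1 (= n+1)
```

The same check for any n gives mode ~ n and mean ~ n+1, which is the
justification for the "+1" correction above.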

So this would suggest that, with an extra 25/25 perfect-match
haplotype, the estimated TMRCA should get *smaller*: we should change
our estimate from 1/(2m) to 1/(3m).
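To make the arithmetic concrete, here is a tiny sketch of that
estimator (the function name and the per-marker mutation rate are my
own illustrative assumptions, not values from this thread):

```python
def tmrca_estimate(total_mutations, n_haplotypes, rate_per_haplotype):
    # Total observed mutations plus the +1 Poisson correction from
    # the previous paragraph, divided by the number of haplotypes,
    # divided by the per-haplotype mutation rate per generation.
    return (total_mutations + 1) / (n_haplotypes * rate_per_haplotype)

# Illustrative rate: 25 markers at 0.002 mutations/marker/generation.
m = 25 * 0.002

two_way = tmrca_estimate(0, 2, m)    # 1/(2m) = 10 generations
three_way = tmrca_estimate(0, 3, m)  # 1/(3m) ~ 6.7 generations
```

The third perfect match pulls the point estimate down, exactly as the
paragraph above says.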


Furthermore, back in July, Sandy Paterson ran some simulations, in
which he apparently did *not* see an irreducible 2225-year uncertainty
in his 95% confidence interval.


So what's going on here, and what is the right answer to the question
of the TMRCA?


I puzzled over this for a bit, but I think I have come to a
resolution, one that explains why the two lines of thought give
different answers, and that may also be quite revealing about
simulations.

So here goes.

There would actually be *two* important inputs affecting the TMRCA, if
one were doing a full-on Bayesian coalescent sampling using something
like BATWING; both would affect the relative probabilities being
assessed for the possible histories being generated. One, of course,
is the mutation rate, which acts directly on the number of mutations.
But a second is the branching rate at which ultimately-surviving
branches appear at a given time t in the history. Thinking about it
forwards, this would depend on the branching probability in the
Poisson process, times the Galton-Watson survival probability for a
branch from time t to the end of the history, times the probability
that the branch is one of those picked up at the end from the overall
final population.

The coalescent approach rather neatly reverses this by instead looking
at the history backwards. If you trace a selected branch from the end
of the history back to time t, modelling how other branches merge with
it (coalesce) as time runs backwards, what is the probability that it
merges with another branch in the step going back from time t+1 to
time t? This is much the same as the branching probability considered
in the paragraph above, but the coalescent answer is very simple. For
each other branch at time t+1, the probability that it merges with the
one we are considering is simply 1/N, where N is the size of the
"effective population" at time t. And the argument for this is quite
simple: just assume a priori that each member of the population at
time t+1 could equally well have been fathered by any of the N members
of the population at time t. Then the probability that a second member
of the population at time t+1 has the same father is simply 1/N.
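That last step can be checked by direct simulation. A minimal sketch
(my own toy code; the function name is hypothetical):

```python
import random

def coalesced_in_one_step(N, trials=200000, seed=1):
    # One backwards-in-time step: two lineages at time t+1 each pick
    # a father uniformly at random from the N members at time t.
    # Count how often they pick the same father; the coalescent
    # argument predicts a fraction of 1/N.
    random.seed(seed)
    hits = 0
    for _ in range(trials):
        if random.randrange(N) == random.randrange(N):
            hits += 1
    return hits / trials

p = coalesced_in_one_step(50)  # expect close to 1/50 = 0.02
```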

The model helps us to understand quite a lot of the structure we see in
real haplotype histories. Close to the present, there are a lot of
independent lines in play, each with a chance (1/N) of merging, so we
see a lot of merges (even more, if we model the population as undergoing
rapid growth). But as we go far back, there are very few lines leading
to our final set present in the population, so there are far fewer
mergers; for example somebody cited a clade in Haplogroup I that appears
to go 600 generations without a branch to any other currently known clade.
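That pattern (lots of merges while many lines are in play, long waits
once only a few remain) drops out of even the simplest fixed-population
simulation. A sketch, using my own toy Wright-Fisher-style parent
picking rather than anyone's published simulation:

```python
import random

def mean_coalescence_time(k, N, trials=2000, seed=2):
    # Trace k sampled lines backwards through a fixed population of
    # size N: each generation every surviving line picks a father
    # uniformly at random, and lines picking the same father merge.
    # Returns the average generations until a single line remains.
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        lines = set(range(k))
        gens = 0
        while len(lines) > 1:
            lines = {rng.randrange(N) for _ in lines}
            gens += 1
        total += gens
    return total / trials

# With N = 100, ten lines collapse to a handful quite quickly, but
# the final pair alone waits on average ~N generations, dominating
# the total (coalescent theory gives roughly 2N(1 - 1/k) overall).
t_pair = mean_coalescence_time(2, 100)
t_ten = mean_coalescence_time(10, 100)
```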

So what does all this have to do with the question at the top?

Well, I think it all depends on how much we let the data affect our
estimate of the branching rate at time t.

And how a simulation has been constructed may also strongly affect the
branching rate at time t.

If a simulation has been constructed with a fast rate of growth, quite
a small final population, and all of that final population used rather
than just a thin sampling, then that makes for a much smaller
effective population N than is biologically reasonable. This makes for
an artificially high branching ratio (or, equivalently, an
artificially short tree), which may not much affect the body of the
history, but is likely to suppress the potential for excursions into
long coalescence times in the earliest stages of the tree.

In the real world, Bruce Walsh's larger values of N are more
realistic, and with them come the much larger uncertainties in the
time intervals for the final coalescences at the very start of the
history.
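The dependence on N is easiest to see for just the last pair of lines:
the wait until they merge is geometric with success probability 1/N,
so both its mean and its spread scale with N. A toy sketch (the N
values are illustrative, not Walsh's figures):

```python
import random

def last_pair_time(N, trials=2000, seed=3):
    # Generations until the final two surviving lines pick the same
    # father: geometric with success probability 1/N, so the mean is
    # about N generations and the spread is comparably wide.
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        g = 1
        while rng.randrange(N) != rng.randrange(N):
            g += 1
        total += g
    return total / trials

small = last_pair_time(50)    # artificially small N: short final wait
large = last_pair_time(1000)  # larger N: far longer, far vaguer wait
```

An undersized simulated population therefore clips exactly the long
early-tree excursions discussed above.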

On the other hand, the branching rate also sheds light on the TMRCA
puzzle at the top, because (I think) it means the third match /can/
change the inference about what the convergence time for the last two
lines will be: it feeds into the inference that the branching rate is
faster than would have been assessed with only two matching
haplotypes, perhaps because the effective population is smaller, or
the overall population growth rate is faster. But if we were to fix
these parameters in the inference, then the third haplotype could not
have this effect.

It would be interesting to experiment with BATWING, a Bayesian
sampling program which can use the coalescent model to estimate full
probability distributions of everything, including a "skyline"
effective population size, along with samples of typical histories.

There are certainly mtDNA papers where it has been used to give those
sorts of estimates. Has anybody tried it, or considered adapting it for
ySTR data?

-- James.


