GENEALOGY-DNA-L Archives

Archiver > GENEALOGY-DNA > 2009-02 > 1235686790


From: James Heald <>
Subject: Re: [DNA] Puzzle, resolved?
Date: Thu, 26 Feb 2009 22:19:50 +0000
References: <mailman.32326.1235629349.2246.genealogy-dna@rootsweb.com><9893F71D7E7F40AC9001C114C67076E9@alfap43400ak>
In-Reply-To: <9893F71D7E7F40AC9001C114C67076E9@alfap43400ak>


Anatole Klyosov wrote:
> This reminds me on the Tortoise and Achilles
>
> Anatole
>

Maybe. But this evening here's a happy tortoise posting because he
thinks he has finally got there in the end!

I have been having a play with the BATWING software, which uses Bayesian
sampling to try to estimate probabilities of everything given a set of
haplotypes.

There was one crucial bit of information that wasn't in the manual,
which took me a while to spot, which was needed to make the results make
sense. But now I have got there, I now think I can see how everything
fits together, and that I was basically on the right track yesterday.


So here is what I got:

1. Put in two 12-marker haplotypes, with a 12/12 match, under the
assumption of their having sampled from a large final population, and no
population growth.

/\
/ \

Looking at the results from the BATWING program in R, I get a nice
probability distribution just like the one on Brian Walsh's page, and
the summary statistics,

Mean TMRCA = 22.86 generations
2.5% Quantile / Median / 97.5% Quantile: 0.6 / 15.36 / 87.29 generations

-- bang on the numbers on Walsh's site.

Total mutation rate = 12 * 0.002 = 0.024, and 1/(2 * 0.024) = 20.83, so
the mean TMRCA is just a little bit larger than that, due to the small
but non-zero possibility of back-mutations.

So far so good.


2. Putting in three 12-marker haplotypes, with a 12/12/12 match, again
under the assumption of no population growth, and their having been
sampled from a large final population.

/\
/ \
/\ \
/ \ \

Mean TMRCA = 36.38 generations
2.5% Quantile / Median / 97.5% Quantile: 4 / 30 / 108 generations.

This can be understood as follows. Under the assumption of a large
final population and known population growth, the time from 2->1 is
conditionally independent of the time from 3->2. (This is essentially
the key insight in coalescent theory). The mean time from 2->1 is
therefore again about 21 as above; the mean time from 3->2 is two-thirds
of that, so giving a total of roughtly 35.


3. Again, a three-way 12/12/12 match; but this time releasing the
population growth rate to be some unknown quantity.

/\
/ \
/\ \
/ \ \

Mean TMRCA = 15.92 generations
2.5% Quantile / Median / 97.5% Quantile: 0.6 / 12 / 99 generations.

Estimated population growth rate: 2.5%
(95% Confidence range 0.6% to 7.7%)

1/(3 * 0.024) = 13.89, so the TMRCA from the Batwing sampling is just a
little higher than the Klysov value.

Introducing the population growth rate parameter (with a Gamma(1,100)
prior distribution, mean 0.01) has allowed the simulation to adjust the
branching rate, making the tree much flatter. (Rapid population growth
means branches coalesce more rapidly). The prior distribution holds
back the population growth rate from going quite as high as the data
would like, so the TMRCA isn't reduced quite as far as just the data
would indicate. But close.

The 97.5% quartile remains high, at 99 generations, because with just
three data points the inference can't be *sure* that the population
growth rate is so high; and also the 3->2 stage might have happened
quite late, leaving the possibility of a 2->1 stage with a long TMRCA as
before.


4. A two-way 12/12 match, with a population growth rate of 2.5%

/\
/ \

Mean TMRCA = 9.9 generations
2.5% Quantile / Median / 97.5% Quantile: 1.0 / 11.0 / 14.7

It is the population growth rate, rather than the lack of mutations,
that is driving the probabilities here.

With a population growth rate of 2.5%, it is likely that two 12/12
matches will converge before the time for the "mutation budget" to be spent.

The mean TMRCA is very close to three-fifths the mean TMRCA in
experiment 3, the same ratio as experiment 1 to experiment 2. This
reflects the key idea, that for a given population growth and final
population, the time from 2->1 in a graph like (3) should have the same
distribution as the time from 2-> in a graph like (4).


So all this essentially resolves the puzzle I originally posted, and
shows why *both* ways of thinking were correct - but both need to be
aware that estimates can be altered by knowledge of the population
growth rate.

It also shows that a higher population growth rate may make what Ken
called the "intrinsic noise uncertainty" in the times for the earliest
branchings somewhat less than one might think from the Walsh charts.

On the other hand, though the population has been increasing fast
recently, was it increasing that fast at the time of the start of the
tree? Perhaps not, and if not then multiples of Bruce Walsh's
uncertainties are appropriate after all.

BATWING in fact has a third population model, which estimates a
break-point from the data, before which the population is assumed
constant, and after which it is assumed to have grown exponentially.
Newer software (such as "BEAST") goes even further, and introduces a
"skyline" population curve to estimate. Unfortunately, BEAST's data
interface is designed to accept DNA sequences, not STR repeat numbers -
good for mtDNAs, but not good for ySTRs. But from my afternoon's
entertainment, I am quite impressed just with what BATWING can do --
plus probabilities for tree configurations, too.


Footnote (the missing bit of the manual)

Rather than numbers of generations, internally Batwing uses and reports
a "scaled time", T , measured in numbers of generations divided by N,
the variable for the final population size. At least, that's what the
manual leads one to believe.

In fact what it actually does, I think, is to scale T by the *initial*
population size, /not/ the final size. If you do this, suddenly all the
shapes and numbers make sense, whereas otherwise they don't. It doesn't
make any difference for model 0 (constant population size), and I
haven't tried model 2 (break from constant to exponential). But if
you're analysing the data in R and using Batwing's model 1 (exponential
growth), I think you need a line like this to convert the output from
Batwing into a vector of samples of times in generations:

o <- read.table("out",header=T) # read in the data
o <- o[-c(1:1000),] # drop the first 1000 samples
t1 <- o[,"T"] * o[,"N"] # extract T and N and multiply them
alpha <- o[,"alpha"] # extract growth rate alpha
t2 <- t1 * exp(-alpha * t1) # adjust N to N_initial


But it seems to me Batwing is well worth a play.

-- James.


This thread: