GENEALOGY-DNA-L Archives

From: James Heald <>
Subject: Re: [DNA] Central Limit Theorem in Action
Date: Tue, 04 Mar 2008 19:26:53 +0000
References: <013801c87d66$d40ce450$6400a8c0@Ken1> <REME20080303195544@alum.mit.edu>
In-Reply-To: <REME20080303195544@alum.mit.edu>


John Chandler wrote:
> Ken wrote:
>
>>For two such markers each with that distribution, the probability
>>distribution for the sum s = v(1)+v(2) is the convolution integral
>
>
> True, but this isn't the problem that James pointed out the other day.
> Although the ASD for a combination of two markers is simply the
> average of the two ASDs computed for the markers individually, the
> combined estimate of TMRCA is not the average of the two
> individual-marker estimates. The issue can be seen most clearly by
> looking at the probability distribution for the TMRCA of two testees
> who match exactly on all markers tested. In this case, the most
> likely TMRCA is actually zero, regardless of how many markers are
> included in the test, and the shape of the distribution is
> approximately an exponential whose mean (expectation) value is the
> reciprocal of the sum of the individual mutation rates. Clearly, the
> mean value does move closer and closer to zero as more markers are
> added, but the distribution never acquires a flat top, or in any other
> way becomes more like a gaussian.
>
> John Chandler
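
As a quick numerical check of John's exact-match point: with a flat
prior, the posterior for t is proportional to P(no mutations | t)
= exp(-t * Sum mu_i), i.e. an exponential whose mean is 1/Sum(mu_i).
A minimal sketch in Python, with made-up per-marker rates (not figures
from the thread):

    import numpy as np

    # Flat prior => posterior ~ exp(-rate * t), an exponential with
    # mean 1/rate.  The rates below are assumptions, for illustration.
    mu = np.full(37, 0.002)                # 37 markers at 0.002/generation
    rate = mu.sum()                        # summed mutation rate, 0.074

    t = np.linspace(0.0, 2000.0, 200_001)  # TMRCA grid (generations)
    post = np.exp(-rate * t)
    post /= post.sum() * (t[1] - t[0])     # normalise on the grid

    print("mode:", t[np.argmax(post)])     # 0 -- most likely TMRCA is zero
    print("mean:", (t * post).sum() * (t[1] - t[0]))  # ~ 1/rate = 13.5

The mode sits at zero and the mean at the reciprocal of the summed
rate, just as John describes.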

Let me see if I can clarify:

I suspect Ken is quite right that, with enough markers, P(T | t)
rapidly becomes approximately Gaussian because of the Central Limit
Theorem, with the mean of T equal to the mean number of steps, mu t.
(Here t is the TMRCA, and T is the observed statistic -- in effect the
step count inferred from the squared repeat differences.)

But that is not the end of the story. We also need to consider the
variance of T. I suspect that is dominated by the Poisson noise in the
number of steps, which (because of the properties of a Poisson
distribution) for one marker will have a variance also equal to the
mean number of steps, mu t.
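
A quick way to check both the Gaussian shape and the
variance-equals-mean property is to simulate the step counts directly.
A minimal sketch, assuming each marker's step count is Poisson with
mean mu t (all numbers illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    mu_t = 2.0        # assumed mean number of steps per marker
    n = 37            # assumed number of markers in the panel
    trials = 200_000

    # Total step count over the panel: a sum of n independent
    # Poisson(mu_t) draws, itself Poisson with mean and variance n*mu_t.
    total = rng.poisson(mu_t, size=(trials, n)).sum(axis=1)

    print("mean:", total.mean(), " variance:", total.var())  # both ~ 74
    # A histogram of `total` is already close to Gaussian here (CLT).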

We can now apply Bayes' theorem to find
P(t | T) ~ P(t | I) P(T | t).

If (for simplicity) we take a flat (uniform) prior for P(t | I),
then
P(t | T) ~ P(T | t)

The two sides of the equation have the same algebraic form (up to a
normalising constant).

But while P(T | t) is a Gaussian distribution for T (it depends on T
only through the squared difference in the numerator of the
exponential), the probability for t given T, P(t | T), is *not* a
Gaussian distribution, because it has a form something like

P(t | T) ~ (1/sqrt(t)) exp{ -(T - mu t)^2 / (2 mu t) }

where the t dependence is *not* confined to the numerator of the
exponential.
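
To see the skewness numerically, it is enough to evaluate that density
on a grid. A minimal sketch, with an assumed summed rate mu and an
assumed observed value T (illustrative numbers, not from the thread):

    import numpy as np

    mu, T_obs = 0.074, 3.0               # assumed summed rate, observed T

    t = np.linspace(1e-3, 500.0, 100_000)
    dens = np.exp(-(T_obs - mu * t)**2 / (2.0 * mu * t)) / np.sqrt(t)
    w = dens / dens.sum()                # normalised weights on the grid

    m = (t * w).sum()
    sd = np.sqrt(((t - m)**2 * w).sum())
    skew = (((t - m) / sd)**3 * w).sum()
    print("mean:", m, "sd:", sd, "skewness:", skew)  # skewness near 1.5

Even though P(T | t) here is the Gaussian of the Central Limit
Theorem, the resulting P(t | T) comes out strongly right-skewed.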


As a result, P(t | T) is much more skewed than P(T | t) -- even when
*lots* of markers are being tested.

Eventually, as John has shown, P(t | T) becomes less skewed. But this
only happens when, in effect, P(T | t) becomes *so* sharply peaked
around T = mu t that essentially no significantly different values of t
can contribute.

That takes many, *many* more markers than are required just to get
P(T | t) to become Gaussian.
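
Repeating that grid calculation while the summed rate mu grows in
proportion to the number of markers makes the point concrete (again,
all numbers are assumptions for illustration):

    import numpy as np

    def posterior_skew(mu, t_true):
        # Evaluate p(t | T) ~ (1/sqrt(t)) exp(-(T - mu t)^2 / (2 mu t))
        # on a grid, taking T at its mean value mu * t_true, and return
        # the skewness of the normalised density.
        T_obs = mu * t_true
        t = np.linspace(1e-3, 20.0 * t_true, 200_000)
        dens = np.exp(-(T_obs - mu * t)**2 / (2.0 * mu * t)) / np.sqrt(t)
        w = dens / dens.sum()
        m = (t * w).sum()
        sd = np.sqrt(((t - m)**2 * w).sum())
        return (((t - m) / sd)**3 * w).sum()

    # Summed rate grows in proportion to the number of markers tested.
    for n_markers in [37, 370, 3700]:
        print(n_markers, posterior_skew(mu=0.002 * n_markers, t_true=40.0))

Under these assumptions each tenfold increase in the number of markers
only cuts the skewness by about a factor of three (roughly as
1/sqrt(n)), which is why so many markers are needed before P(t | T)
looks Gaussian.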

----

Incidentally, there is one other very important consequence if the
variance associated with the squared deviation statistic for each marker
is proportional to mu t.

It means that, in accordance with the usual rules for averaging
normally-distributed sample measurements with different variances, we
should prefer the statistic

T = (1/n) Sum( X_i^2 / mu_i )

(writing X_i for the observed repeat difference at marker i, and mu_i
for its mutation rate), rather than

T = Sum( X_i^2 ) / Sum( mu_i )


The first statistic should be much less noisy than the second.

The second will tend to be dominated by the noise from the markers
with the largest variance; the first appropriately equalises the
contribution from each datum.
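
In code, with made-up numbers for the marker rates and the observed
repeat differences, the two statistics look like this:

    import numpy as np

    # Illustrative inputs only -- not real marker data.
    mu = np.array([0.0007, 0.002, 0.004, 0.006])  # mutation rates
    X  = np.array([0,      1,     -1,    2])      # repeat differences

    T_first  = (X**2 / mu).mean()       # (1/n) Sum( X_i^2 / mu_i )
    T_second = (X**2).sum() / mu.sum()  # Sum( X_i^2 ) / Sum( mu_i )

    print(T_first, T_second)

The first weights every marker's estimate of t equally; the second is
equivalent to weighting each marker's estimate by its mutation rate.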


-- James.



