

From: James Heald <>
Subject: [DNA] Mutation rates (particularly for John Chandler)
Date: Fri, 16 Feb 2007 12:54:28 +0000


John (and list),

On Feb 16, 2007 at 02:38 Vincent Vizachero wrote:

> I don't think there is much dispute that
> the rates published by John Chandler in the Journal of Genetic
> Genealogy last fall are the best estimates of mutation rates for the
> most commonly used markers.

On the basis of a quick look at the paper, I've got some anxieties:


1. The assumption of no back-mutations, on which the exponential
distribution is based.

-- Given that your cut-off was 25/37 matches and closer, this starts to
look unsafe. At least for the fastest mutators, there has to be a real
chance of a mutation followed by a back-mutation in the time it takes
the other loci to accumulate 12 differences.

This could cause a systematic under-estimate of the mutation rates for
the fastest mutators. But it should be relatively straightforward to
calculate a reasonable correction.
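
To put a rough number on this: under a symmetric single-step mutation
model (my assumption, not necessarily the paper's), a pair separated by
t transmissions still matches at a locus with probability
exp(-mu*t)*I0(mu*t), where I0 is the modified Bessel function -- which
is larger than the no-mutation probability exp(-mu*t). An estimator
that reads every match as "no mutation" therefore comes out low. A
quick Python sketch, with made-up values of mu and t:

    import numpy as np
    from scipy.special import iv   # modified Bessel function I_nu

    def apparent_rate(mu, t):
        """Rate recovered by treating every observed match as zero mutations."""
        p_match = np.exp(-mu * t) * iv(0, mu * t)  # includes mutate-then-revert paths
        return -np.log(p_match) / t

    t = 100                              # transmissions separating the pair (hypothetical)
    for mu in (0.0002, 0.002, 0.006):    # slow, medium and fast markers (hypothetical)
        mu_hat = apparent_rate(mu, t)
        print(f"mu = {mu:.4f}  apparent = {mu_hat:.5f}  bias = {100*(1 - mu_hat/mu):.1f}%")

With these invented numbers the naive estimate comes out roughly 15%
low for the fastest marker, and the effect is negligible for the
slowest.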


2. The error uncertainties (the most important issue).

-- I think these could be /way/ too low.

Least squares and chi-squared estimates are based on the assumption
that your N observations are independent and identically distributed
(IID) Gaussians.

My fear is the paper hugely underestimates the effect of shared
histories, and shared mutation paths, in reducing the effective number
of independent observations.
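
To quantify that worry: if the pairwise observations carry an average
mutual correlation rho, the variance of their mean is
sigma^2/N * (1 + (N-1)*rho), so the effective number of independent
observations shrinks to N_eff = N / (1 + (N-1)*rho). A two-line Python
illustration (the rho here is pure invention):

    from math import sqrt

    N, rho = 1000, 0.05                  # hypothetical numbers
    n_eff = N / (1 + (N - 1) * rho)      # effective independent observations
    print(f"N = {N}, rho = {rho}: N_eff ~ {n_eff:.0f}")
    print(f"quoted error bars too small by ~{sqrt(N / n_eff):.1f}x")

Even a modest average correlation of 0.05 would shrink 1000 pairs to
about 20 effective observations -- error bars roughly seven times too
small.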

The calculation is based on analysing the conditional probability P_AB =
P_AB(j,b-1|b) of a mismatch at locus j (and b-1 others) between two
haplotypes A and B, _given_ b mismatches in total.

*But*, once you know that a mismatch has occurred (or not) for locus j
on the path from A to B, this informs the question of whether a mismatch
may also have occurred for locus j on the path from A to C, because some
of that path may be shared with the path from A to B. It also informs
the chances of a mismatch on the path from C to D, if C is close to A,
because some of that path may be shared with the path from A to B.

Conclusion: given P_AB(j,b-1|b), you cannot assume P_CD(j,b-1|b) is
independent of P_AB, if haplotype C is anywhere near haplotype A. So
you cannot assume you have two independent parameter observations, given
these probabilities.

If the mutation occurred only once, and that event is included in both
the path A->B and the path C->D, then you have only one observation of
its rate, not two.
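
The effect is easy to see in a toy simulation. Take the four-leaf tree
((A,C),(B,D)): the comparison A-vs-B and the comparison C-vs-D share
the two internal edges, so any mutation there shows up in both mismatch
indicators. A Python sketch with invented branch lengths and rate
(back-mutation ignored for brevity):

    import numpy as np

    rng = np.random.default_rng(42)
    mu = 0.002                       # hypothetical per-transmission rate
    t_tip, t_internal = 50, 200      # hypothetical branch lengths
    n = 200_000                      # Monte Carlo replicates

    # Poisson mutation counts per edge at one locus
    a, b, c, d = (rng.poisson(mu * t_tip, n) for _ in range(4))
    shared = rng.poisson(mu * 2 * t_internal, n)   # the two shared internal edges

    mismatch_ab = (a + b + shared) > 0   # any mutation on the A-B path
    mismatch_cd = (c + d + shared) > 0   # any mutation on the C-D path

    print("corr(A-B, C-D) =", np.corrcoef(mismatch_ab, mismatch_cd)[0, 1])

The correlation comes out large and positive (around 0.7 with these
made-up numbers): the two "observations" are far from independent.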


According to the paper, the number of independent observations was taken
to be the smaller of "the number of pairs found in a given b-bin, and
the total number of haplotypes".

I am anxious that this may actually be a huge over-estimate of the real
effective number of independent observations, leading to a huge
under-estimate of the possible error uncertainty.


To assess this, one thing to try might be to do separate estimates for
the haplotypes from R1b, R1a and I1 (or whatever happen to be useful
groups to partition the data into).

How does the "sigma N-1" estimated error for each mutation rate, based
just on those three numbers, compare with the standard errors you were
estimating from the dataset as a whole?
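
Purely to illustrate the arithmetic of that check (all the numbers
below are invented placeholders, not values from the paper):

    import numpy as np

    per_group = np.array([0.0021, 0.0034, 0.0017])   # hypothetical R1b, R1a, I1 estimates
    quoted_se = 0.0003                               # hypothetical published standard error

    sigma_n1 = per_group.std(ddof=1)                 # the "sigma N-1" scatter
    se_of_mean = sigma_n1 / np.sqrt(len(per_group))  # implied error of the pooled estimate

    print(f"sigma_(N-1)      = {sigma_n1:.5f}")
    print(f"implied SE(mean) = {se_of_mean:.5f}")
    print(f"quoted SE        = {quoted_se:.5f}")

If the implied standard error came out several times the quoted one,
that would be direct evidence the effective N was over-estimated.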


3. (Technical point)

The mutation rates mu are scale parameters which cannot go negative. So
probably you should be estimating log(mu), rather than mu itself.

An accurate final probability distribution for mu is more likely to be
log-normal than normal. So on a non-log scale, the distribution is
likely to appear skewed, with the peak shifted to the left and a long
tail to the right, and with the 95% quantile much further above the
median than the 5% quantile is below it.

In such a situation, a least squares estimate for mu (or even a
straightforward average) will overweight high values, and underweight
low ones.

This can be taken care of by estimating log(mu) rather than mu.
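
A quick demonstration of the asymmetry (Python, with an invented
log-normal spread):

    import numpy as np

    rng = np.random.default_rng(0)
    log_mu = rng.normal(np.log(0.002), 0.5, 100_000)   # log(mu) ~ Normal (hypothetical)
    mu = np.exp(log_mu)

    print("median(mu)        =", np.median(mu))          # ~0.0020
    print("mean(mu)          =", mu.mean())              # pulled up by the long right tail
    print("exp(mean(log mu)) =", np.exp(log_mu.mean()))  # ~0.0020 again

The plain average lands noticeably above the median -- exactly the
overweighting of high values described above; averaging on the log
scale removes it.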


Interested to know what you think,

Best regards,

James.


