GENEALOGY-DNA-L Archives
From: David Johnston <>
Subject: Re: [DNA] DYS385 and Other Mysteries...
Date: Fri, 09 Sep 2011 18:35:26 -0500
References: <472b4.34a9633.3b9becd9@aol.com>
In-Reply-To: <472b4.34a9633.3b9becd9@aol.com>
The IAM model as published in Walsh actually allows for differing
mutation rates. However, it is pretty easy to show that in this model the
result really only depends on the sum (or average) of the mutation rates,
the number of markers, and the number of mismatches. I think that is what Ken
is saying. That isn't exact. It is an approximation to the IAM which
itself is an approximation to the correct model. You get it from making
the approximation 1 - exp(-T*mu) approx T*mu (if I remember correctly), and
that is indeed true unless mu*T begins to approach unity. When including
some of the fast markers, this may break down for some time scales. But
I don't think it is a big deal. This approximation to the IAM is good
enough.
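To illustrate the point about the sum of the rates, here is a minimal Python sketch. It assumes the usual IAM form in which a marker with rate mu still matches after T generations with probability exp(-2*mu*T); the factor 2 for the two lineages, the function names, and the sample rates are all assumptions made for illustration. The approximate curve is taken to be T^k * exp(-2*T*sum(mu)) up to a constant factor, which is one way of writing the small mu*T limit.

```python
import math

def iam_loglik(T, rates, mismatched):
    """IAM log-likelihood with a separate mutation rate per marker."""
    total = 0.0
    for mu, off in zip(rates, mismatched):
        x = 2.0 * mu * T  # expected mutations on both lineages (assumed form)
        # Mismatch: at least one mutation. Match: no mutation at all.
        total += math.log(-math.expm1(-x)) if off else -x
    return total

def iam_loglik_approx(T, rates, mismatched):
    """Small mu*T approximation, dropping terms that do not depend on T."""
    k = sum(1 for off in mismatched if off)  # number of mismatched markers
    return k * math.log(T) - 2.0 * T * sum(rates)

rates = [0.002, 0.004, 0.001, 0.003]      # hypothetical per-generation rates
mismatched = [True, False, False, True]   # hypothetical comparison result

# Compare how each curve changes between T=10 and T=20 generations.
exact_change = iam_loglik(20, rates, mismatched) - iam_loglik(10, rates, mismatched)
approx_change = (iam_loglik_approx(20, rates, mismatched)
                 - iam_loglik_approx(10, rates, mismatched))
print(exact_change, approx_change)
```

Shuffling which marker carries which rate leaves the approximate curve unchanged, since only the sum and the mismatch count enter. The exact IAM curve does shift a little, and the gap grows as mu*T approaches 1.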
However, that doesn't mean that the IAM is necessarily a good
approximation itself. The IAM makes the approximation that
mutations are rare enough that at most 1 will happen at a given marker.
That is obviously a good approximation on genealogical time-scales with
most markers, though maybe not the fastest ones. It works fine for two
haplotypes where there are, say, 3 markers off by 1, a genetic distance of
3. Here "works fine" means the likelihood curve looks pretty much like
the exact calculation. But it won't be a good approximation way into the
tails of the distribution. IAM has an exponential tail (at high T). The
real result is a power law at high T.
If you run into the case where a marker is off by 2 or more, then you
have to decide how to apply IAM or whether to scrap it entirely.
Remember that this is not supposed to happen in the IAM, so you are
already outside the scope of the idea. Some choose to call this just
one mutation. Some call it two. If you are going to stick with IAM, you
pretty much need to pick one of them. I think calling it two mutations
is better. If you have more than one marker off by two, then you
really should consider scrapping IAM, or at least ignoring the fast
markers where this situation comes up more often.
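The two counting conventions can be sketched in a few lines of Python. The function names and the haplotypes are made up for illustration. Counting a marker that is off by 3 as three mutations is an extension of the "call it two" convention, not something stated above.

```python
def count_mutations(hap_a, hap_b, multi_step_as_one=False):
    """Count mutations between two haplotypes given as lists of repeat counts."""
    total = 0
    for a, b in zip(hap_a, hap_b):
        diff = abs(a - b)
        if multi_step_as_one:
            total += 1 if diff else 0   # any difference counts as one mutation
        else:
            total += diff               # each step counts as one mutation
    return total

def markers_off_by_two_or_more(hap_a, hap_b):
    """Number of markers where the IAM assumption is already violated."""
    return sum(1 for a, b in zip(hap_a, hap_b) if abs(a - b) >= 2)

hap_a = [13, 24, 14, 11, 30]   # made-up repeat counts
hap_b = [13, 25, 14, 13, 30]

print(count_mutations(hap_a, hap_b))                          # steps summed
print(count_mutations(hap_a, hap_b, multi_step_as_one=True))  # markers differing
print(markers_off_by_two_or_more(hap_a, hap_b))
```

If the last number is greater than 1, that is the situation where scrapping IAM, or dropping the fast markers, is worth considering.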
How do you do it exactly? Well, I sent a write-up around on this a while
ago.
http://www.scribd.com/doc/57151963/TMRCA-Estimates
The exact calculation involves modified Bessel functions. This takes
into account all possible mutation routes. That is, a marker being off by 0
might mean that it mutated up once and back once. Being off by two might
mean it jumped three in one direction and then back 1. I am pretty sure
this is not an original result, but I don't know of any published paper on
it (not that I would know where to look). I should say "exact" in
quotes. Really this is just a more general model. We are still assuming
that only 1 jump, either up or down, occurs per generation at some known
and fixed mutation rate. Whether that is correct is another issue. But
we aren't assuming that mu*T is small enough that no marker will
experience more than one mutation in time T.
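As a rough illustration of where the Bessel functions come from, here is a minimal Python sketch. It assumes a symmetric single-step model in its continuous-time form, where up and down steps on the two lineages arrive as Poisson events with total expected count lam = 2*mu*T, so the net difference d has probability exp(-lam) * I_|d|(lam). That form, the factor 2, and the function names are assumptions here. The formula in the write-up may differ in detail.

```python
import math

def bessel_i(k, x, terms=200):
    """Modified Bessel function of the first kind I_k(x), integer k >= 0,
    from its power series sum_m (x/2)^(2m+k) / (m! (m+k)!)."""
    half = x / 2.0
    term = half ** k / math.factorial(k)     # m = 0 term
    total = term
    for m in range(1, terms):
        term *= half * half / (m * (m + k))  # ratio of consecutive terms
        total += term
    return total

def p_step_diff(d, mu, T):
    """Probability that two haplotypes differ by d repeats at one marker."""
    lam = 2.0 * mu * T   # expected number of steps over both lineages (assumed)
    return math.exp(-lam) * bessel_i(abs(d), lam)

def ssm_likelihood(T, diffs, rates):
    """Likelihood of the observed per-marker differences at time T."""
    like = 1.0
    for d, mu in zip(diffs, rates):
        like *= p_step_diff(d, mu, T)
    return like

mu, T = 0.004, 100                       # hypothetical rate and time
print(p_step_diff(0, mu, T), math.exp(-2.0 * mu * T))
print(ssm_likelihood(T, [0, 1, 2], [mu, mu, mu]))
```

The first printed pair shows the effect of back mutations: the chance of seeing no difference is larger than the chance of no mutation at all. For large lam each factor falls off roughly like 1/sqrt(lam), which is a power law in T instead of the exponential tail of the IAM.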
The IAM result is indeed really simple to compute, which is part of its
beauty. The exact calculation is somewhat computationally demanding.
Evaluating modified Bessel functions is not exactly easy, and you have to
evaluate them at every marker and every time T. I have written some C code
which calculates it pretty quickly. I have a csv file with 17 people
with 67 markers. I can compute the P(TMRCA) curves for all 17*16/2 pairs
in about 3 seconds. To run that on all of I1 (3000 people) would take
more than a day, well, on my laptop anyway. Using IAM on all of I1 would
probably only take a few minutes.
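The scaling estimate is simple arithmetic on the number of pairs. A quick Python check, assuming the cost per pair stays the same as in the 17-person run:

```python
def n_pairs(n_people):
    """Number of distinct pairs among n_people haplotypes."""
    return n_people * (n_people - 1) // 2

small_pairs = n_pairs(17)                 # the 17*16/2 pairs in the test file
seconds_per_pair = 3.0 / small_pairs      # about 3 seconds for all of them
i1_pairs = n_pairs(3000)                  # all of I1
hours = i1_pairs * seconds_per_pair / 3600.0

print(small_pairs, i1_pairs, round(hours, 1))   # 136 4498500 27.6
```

That comes to roughly 27 to 28 hours, which matches "more than a day".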
I have some Python wrappers which read a csv file (spreadsheet) and
spit out a matrix of TMRCA values (at some specified percentile) which
can then be input to kitsch, for example. I would be happy to share this
code, though it is quite new and probably needs testing/debugging.
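For readers who want a feel for what such a wrapper does, here is a hypothetical Python sketch; it is not the actual code. It assumes a csv layout with a name in the first column and repeat counts in the rest (no header row), a single made-up mutation rate for every marker, a flat prior on T, and the simple IAM likelihood instead of the Bessel calculation. Writing the matrix in the format kitsch expects is left out.

```python
import csv
import io
import math

def read_haplotypes(text):
    """Parse csv text: first column is a name, the rest are repeat counts."""
    rows = csv.reader(io.StringIO(text))
    return {r[0]: [int(v) for v in r[1:]] for r in rows if r}

def tmrca_percentile(h1, h2, mu, pct, t_max=2000):
    """Smallest T (generations) where the cumulative posterior reaches pct."""
    n = len(h1)
    k = sum(a != b for a, b in zip(h1, h2))     # mismatched markers
    weights = []
    for T in range(1, t_max + 1):
        p_match = math.exp(-2.0 * mu * T)       # IAM, same rate on every marker
        weights.append(p_match ** (n - k) * (1.0 - p_match) ** k)
    total = sum(weights)
    running = 0.0
    for T, w in enumerate(weights, start=1):
        running += w
        if running >= pct * total:
            return T
    return t_max

def tmrca_matrix(haps, mu=0.003, pct=0.5):
    """Square matrix of TMRCA values; mu and pct defaults are placeholders."""
    names = sorted(haps)
    matrix = [[0 if a == b else tmrca_percentile(haps[a], haps[b], mu, pct)
               for b in names] for a in names]
    return names, matrix

sample = "A,13,24,14,11,30\nB,13,24,14,11,30\nC,13,25,14,12,30\n"  # made-up data
names, matrix = tmrca_matrix(read_haplotypes(sample))
for name, row in zip(names, matrix):
    print(name, row)
```

Swapping in per-marker rates or the more general likelihood only changes how the weights are computed; the percentile search and the matrix layout stay the same.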
Dave
On 9/9/11 5:27 PM, wrote:
>
> Hi Dave,
>
>
> I saw your reply to the post about the DYS385 issue. I wonder if I could
> ask a clarifying question about what you're saying about the infinite allele
> calculation.
>
>
> In my study, I'm comparing my Y-111 marker tests between a small group of
> men sharing the same surname. The genetic distance is fairly small (anywhere
> from 2 to 5). I noticed that where the group differs is primarily on the
> same few markers (DYS 570, 413, 710, and 712). Are you suggesting in your
> reply back to Julian that you have a program that somehow takes into account
> mutation rates of specific genes in calculating the TMRCA? I would be
> interested in learning more.
>
>
> Rick Wilson
>
>
>
> In a message dated 9/9/2011 2:43:11 P.M. Eastern Daylight Time,
> writes:
>
> Message: 5
> Date: Fri, 09 Sep 2011 12:27:19 -0500
> From: David Johnston<>
> Subject: Re: [DNA] DYS385 and Other Mysteries...
> To:
> Message-ID:<>
> Content-Type: text/plain; charset=UTF-8; format=flowed
>
> On 9/9/11 12:07 PM, julian grubatz wrote:
>
>> Because of the 3-step discrepancy on the latter marker in conjunction
>> with the other 2 mismatches, the conclusion (as I understand it) is that a
>> genetic distance of 5 is assumed. On that basis, they were classified as no
>> match for the past 28? generations.
>
> This is an example of where the infinite alleles approximation used by
> most people (including FTDNA) breaks down completely.
>
> The fact is that different markers have different mutation rates.
> Genetic distance therefore is not informative enough. If you send me
> your data, I can run my code on it and give you a better estimate of the
> TMRCA probability density function. My guess is that it will not rule
> out the 3rd great-grandfather hypothesis at very high significance.
> Dave
>
>
>
>