GENEALOGY-DNA-L Archives



From: "Sandy Paterson"
Subject: [DNA] Asymptotic Distributions for General Mutation Models
Date: Sat, 11 Jun 2011 08:06:54 +0100


[Not sure what you mean by one factor model. Just saying that if you have
two haplotypes and no other data then the Bayesian method that I
described is the optimal solution. The IAM is just an approximation to
that method which works well enough for small time scales and works less
well for large time scales. I will just use the phrase Bayesian rather
than IAM.]

If you have two haplotypes and are looking at small time scales, then you
can be virtually 100% certain that the two people are from the same
haplogroup. Ignoring that information doesn't make sense to me.

By a factor, I mean, for example, number of matches. Another factor would be
genetic distance (the sum of absolute marker differences). As far as I know, IA
uses only number of matches (i.e. it's a one-factor model). It's probably more
accurate to say that the only example I have seen that claims to be using
the infinite alleles model uses only one factor, namely number of matches.
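
To make the two factors concrete, here is a minimal sketch that computes both from a pair of haplotypes. The function names and the marker values are made up for illustration; the only definitions taken from the text are "number of matches" and "sum of absolute marker differences".

```python
def match_count(hap_a, hap_b):
    """Number of markers where both haplotypes carry the same repeat count."""
    return sum(1 for a, b in zip(hap_a, hap_b) if a == b)


def genetic_distance(hap_a, hap_b):
    """Sum of absolute marker differences between the two haplotypes."""
    return sum(abs(a - b) for a, b in zip(hap_a, hap_b))


# Hypothetical 8-marker haplotypes (illustrative values only).
hap_a = [13, 24, 14, 11, 11, 14, 12, 12]
hap_b = [13, 25, 14, 11, 11, 16, 12, 12]
hap_c = [13, 25, 14, 11, 11, 15, 12, 12]

# Both pairs have the same number of matches but different genetic distances,
# so a model that looks only at matches cannot tell them apart.
print(match_count(hap_a, hap_b), genetic_distance(hap_a, hap_b))
print(match_count(hap_a, hap_c), genetic_distance(hap_a, hap_c))
```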


[The Walsh paper shows how to use IAM for differing mutation rates. It is
possible to show that simply using the mean mutation rate gives nearly the
same results.]

Can it be shown empirically? I have plenty of simulated data that can be
used to check this. You're welcome to use it.
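
One way such a check could be set up is sketched below. It is not taken from the Walsh paper. It assumes the usual infinite-alleles match probability for a marker with rate mu when each person is t generations from the common ancestor, P(match) = exp(-2 * mu * t), and it compares the maximum-likelihood t obtained with per-marker rates against the one obtained with the mean rate. The rates and the match pattern are hypothetical placeholders; real or simulated data would replace them.

```python
import math


def log_lik_per_marker(t, rates, matched):
    """Log-likelihood of a match/mismatch pattern at t generations to the MRCA.

    Assumption: infinite alleles, so marker i matches with probability
    exp(-2 * rates[i] * t).
    """
    total = 0.0
    for mu, is_match in zip(rates, matched):
        p_match = math.exp(-2.0 * mu * t)
        total += math.log(p_match) if is_match else math.log(1.0 - p_match)
    return total


def log_lik_mean_rate(t, rates, matched):
    """Same likelihood, but every marker is given the mean mutation rate."""
    mean_rate = sum(rates) / len(rates)
    return log_lik_per_marker(t, [mean_rate] * len(rates), matched)


def ml_generations(log_lik, rates, matched, t_max=2000):
    """Grid search over whole generations 1..t_max for the most likely t."""
    return max(range(1, t_max + 1), key=lambda t: log_lik(t, rates, matched))


# Hypothetical per-marker rates and match pattern (True = markers match).
rates = [0.001, 0.002, 0.004, 0.0005, 0.003, 0.002]
matched = [True, True, False, True, True, False]

print(ml_generations(log_lik_per_marker, rates, matched))
print(ml_generations(log_lik_mean_rate, rates, matched))
```

Running this over many simulated pairs and comparing the two estimates would show how close "nearly the same" is for a given set of rates.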

[But if you just have two haplotypes then you don't have other factors.
You don't know the modal when you just have two haplotypes. But if you
have a whole group of haplotypes, then it is true that there is more
information in the TMRCA between person A and B besides the haplotypes
of A and B. Using that information properly should indeed tighten the
confidence intervals.]

As far as I know, virtually all serious studies are done on specific
haplogroups or sub-clades. So I think you are saying then that virtually all
serious attempts at TMRCA estimation would benefit from using information
about the specific haplogroup that they are examining. If that's what you
are saying, I agree.

[Your answer is that you want to bring in more data. That is fine. But it
doesn't mean you have found a better method of solving THAT problem.
Rather you have modified the problem.

So in fairness, if you are going to bring in more data, then you have to
allow for the Bayesian method to use that data too. Actually there is an
exact Bayesian solution to this problem. I won't elaborate too much but
it works like this. If you have a group of N haplotypes (the full data
set), you have a finite number of ways to put them into a tree
structure. There are programs that exist which will tabulate every
possible tree of N leaves. Then for every possible tree, there are
variables for the length of time between each node. Then there are the
haplotypes for each node. That is a complete list of the parameters of
the complete likelihood.]
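
To give a feel for "a finite number of ways", the sketch below counts tree shapes under one specific assumption that the quoted text does not state: rooted, strictly bifurcating trees with labelled leaves. For N such leaves the count is the double factorial (2N-3)!!. The function name is invented for this example.

```python
def rooted_binary_tree_count(n_leaves):
    """Number of rooted, bifurcating trees with n labelled leaves: (2n-3)!!.

    Assumption: only strictly binary rooted topologies are counted; branch
    lengths and ancestral haplotypes are extra parameters on top of this.
    """
    if n_leaves < 1:
        raise ValueError("need at least one leaf")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-3 (empty product for n <= 2).
    for k in range(3, 2 * n_leaves - 2, 2):
        count *= k
    return count


for n in (2, 3, 4, 5, 10):
    print(n, rooted_binary_tree_count(n))
```

The count grows very quickly with N, so tabulating every tree is only practical for small groups of haplotypes.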

I'm a bit lost here - I'll read it again later.

I get the impression though that what you are trying to do is to build
trees. I am not. All I am trying to do is to sharpen up pairwise TMRCA
estimation. The only TMRCA estimation I've seen that purports to use IA uses
only number of matches as input. That (perhaps incorrectly) led me to
believe that IA is seriously limited in value as a tool in pairwise TMRCA
estimation.

A clarification: I'm not using regression for this exercise. I simply
stratify and store data for lookups. That way no one can argue about the
mathematics behind it because there is none.
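
A minimal sketch of what "stratify and store for lookups" could look like is given below. It is not the actual procedure or data described here. Everything in it is an assumption made for illustration: a symmetric single-step mutation model, one hypothetical rate for all markers, true TMRCA drawn uniformly from 1 to max_t generations, and strata keyed by (number of matches, genetic distance).

```python
import random
from collections import defaultdict
from statistics import median


def simulate_pair(rng, n_markers, mu, max_t):
    """Simulate one pair; return (matches, genetic distance, true TMRCA)."""
    t = rng.randint(1, max_t)  # assumed uniform over 1..max_t generations
    matches = 0
    distance = 0
    for _marker in range(n_markers):
        diff = 0
        # 2*t transmissions separate the two people (t on each line).
        for _generation in range(2 * t):
            if rng.random() < mu:
                diff += rng.choice((-1, 1))  # single-step up or down
        if diff == 0:
            matches += 1
        distance += abs(diff)
    return matches, distance, t


def build_table(n_pairs, n_markers=20, mu=0.003, max_t=40, seed=1):
    """Stratify simulated pairs by (matches, distance); store their TMRCAs."""
    rng = random.Random(seed)
    table = defaultdict(list)
    for _pair in range(n_pairs):
        matches, distance, t = simulate_pair(rng, n_markers, mu, max_t)
        table[(matches, distance)].append(t)
    return table


def lookup(table, matches, distance):
    """Return (count, median, min, max) of stored TMRCAs, or None if empty."""
    values = table.get((matches, distance))
    if not values:
        return None
    return len(values), median(values), min(values), max(values)


table = build_table(2000)
print(lookup(table, 19, 1))  # pairs with 19/20 matches and distance 1
```

An estimate for a real pair is then just a lookup of its stratum; the spread of stored values in that stratum plays the role of a confidence interval, and it depends on the simulation assumptions listed above.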

Just so you don't miss this, I'll repeat. You are welcome to use any of the
simulated data that I have for anything you want to examine or test.



-----Original Message-----
From:
[mailto:] On Behalf Of David Johnston
Sent: 10 June 2011 21:22
To:
Subject: Re: [DNA] Asymptotic Distributions for General Mutation Models


