From: "Alister John Marsh" <>
Subject: Re: [DNA] Asymptotic Distributions for General Mutation Models-NAUGHTY CHILDREN
Date: Thu, 9 Jun 2011 12:52:38 +1200
References: <972D673E3D084E2DBBD515DD90077C82@kenPC> <000001cc2606$b16669c0$14333d40$@com><EBA9BB3E99AC4CB99EA75C277824D7F5@kenPC><000301cc2622$cf657c60$6e307520$@com><F82EF8FEB4AC41B1B4D8D4478AA0EEBE@kenPC>
I still have difficulty with the mathematics of mutations.
I think of mutations as NAUGHTY CHILDREN. They never seem to do what the
formulas assume they do.
In my surname project, I have a person of my surname who according to
FTDNA's customer matching programme matches me at 12 markers (12/12), but
not at 25, not at 37, not at 67, but matches me at 111 markers (at 9
mutation "steps" from me). There has been no RecLOH that I can see. Based
on paper research, I am fairly certain this person shares a common direct
male line ancestor of my surname since 1500, but possibly even since 1700.
It appears my Y-DNA line did not have my surname until after about 1200, so
the common ancestor should at the longest stretch be since 1200.
This person has 9 mutation steps to me at 111 markers, but one of those
markers appears to have a somewhat rare 4 step mutation on a single marker.
So depending on whether it was a single mutation event (one 4 step jump),
or 4 separate single step events (or something in between), it throws off
the TMRCA formula in different ways. However, it is possible that the 4
step difference included plus and minus steps which have cancelled out,
meaning the 4 step observed difference may have resulted from 6, 8, 10 or
more actual mutation events.
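The counting behind this can be sketched in a few lines. This is only an
illustration of the arithmetic, not any testing company's method. It assumes
the other 5 observed steps (9 in total, less the 4 on the odd marker) are one
event each, and it stops arbitrarily at 10 events on the odd marker. The
function names are made up for the example.

```python
# Figures taken from the post: 9 observed steps in total, 4 of them on one marker.
TOTAL_OBSERVED_STEPS = 9
NET_STEPS_ON_ODD_MARKER = 4
# Assumption: each of the remaining observed steps is exactly one mutation event.
OTHER_EVENTS = TOTAL_OBSERVED_STEPS - NET_STEPS_ON_ODD_MARKER


def same_direction_counts(net_steps):
    """Event counts with no cancelling: one big jump up to all single steps."""
    return list(range(1, net_steps + 1))


def cancelling_counts(net_steps, max_events):
    """Single-step events where each hidden plus/minus pair adds 2 events."""
    return list(range(net_steps + 2, max_events + 1, 2))


# Arbitrary cut-off of 10 events on the odd marker, as in the text.
marker_counts = (same_direction_counts(NET_STEPS_ON_ODD_MARKER)
                 + cancelling_counts(NET_STEPS_ON_ODD_MARKER, 10))
totals = [OTHER_EVENTS + n for n in marker_counts]

print(marker_counts)  # [1, 2, 3, 4, 6, 8, 10]
print(totals)         # [6, 7, 8, 9, 11, 13, 15]
```

With single steps only, a net difference of 4 can only come from an even
number of events, which is why 5, 7 and 9 events on that marker do not appear.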
The actual number of mutation events between me and this person could be 6,
7, 8, 9, 11, 13, or 15 or more, before you start getting into speculations
on whether there have been parallel or back mutations on others of the 111
markers. Because most of the mutation activity is on a handful of the
faster mutating markers, the chances of parallel or back mutations in the
past 1000 years are, I suggest, much higher than one might intuitively
think if you are thinking in terms of 111 markers without distinguishing
between very fast and very slow markers. In the cluster of same surname
Y-DNA matches I belong to, there are several cases of known parallel or back
mutations, but the question is how many do I not know about because they are
hidden?
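To put a rough number on the fast versus slow marker point, here is a minimal
sketch. It assumes mutation events on a marker arrive as a Poisson process,
30 years per generation, and two purely illustrative mutation rates (0.0005
and 0.01 per generation). None of these figures come from the post or from
any published rate table. It computes the chance that a single marker has
been hit two or more times along the two lines back to a common ancestor,
which is the precondition for a hidden back or parallel mutation.

```python
import math


def prob_multiple_hits(rate_per_generation, transmissions):
    """P(2 or more mutation events on one marker), assuming Poisson arrivals."""
    lam = rate_per_generation * transmissions
    return 1.0 - math.exp(-lam) * (1.0 + lam)


YEARS_TO_ANCESTOR = 1000
YEARS_PER_GENERATION = 30  # assumption
# Two lines of descent, each running back to the common ancestor.
transmissions = 2 * YEARS_TO_ANCESTOR / YEARS_PER_GENERATION

# Hypothetical rates, chosen only to contrast a slow and a fast marker.
p_slow = prob_multiple_hits(0.0005, transmissions)
p_fast = prob_multiple_hits(0.01, transmissions)

print(f"slow marker: {p_slow:.4%}")
print(f"fast marker: {p_fast:.4%}")
print(f"ratio fast/slow: {p_fast / p_slow:.0f}")
```

Under these assumed rates the fast marker is more than a hundred times as
likely as the slow one to have taken multiple hits, so averaging over all 111
markers as if they were alike understates the risk on the fast ones.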
The mutation "experts" seem to think this situation is simple, either you
arbitrarily decide to use stepwise mutation models, or infinite alleles, or
arbitrarily delete that marker from the formula just for that case. And of
course, you arbitrarily disregard the possibility of hidden parallel or
back mutations, because not being able to see them, we can't talk about
them. It seems the thing to do is to pick over the list of possible mutation
assumptions until you find one that best fits the assumptions you have made
about the probable TMRCA. If it best fits my expectations to assume that
the 4 step difference on one marker is a single 4 step jump, and if it best
fits my expectations to assume that no parallel or back mutations have
occurred, then this is what I should do.
Now this seems fine, but if the formula can't predict relationship
probability blind, without first being told which outcome you require it to
find, then of how much value is the formula, other than for very broad ball
park estimates? (For the record, in spite of my concerns, I will still
continue to use the TMRCA formula as a valuable ball park tool, but
accepting its limitations.)
But my concerns grow when we move into population genetics TMRCA formulas.
In population genetics, we don't have inferred genealogy from genealogical
time frame information or shared surnames to decide if steps are more likely
multiple or single steps, or conceal back mutations or parallel mutations.
And with population genetics, population growth history, bottlenecks, and
possible evolutionary selection may spice things up.
Ken may say that variance methods for population genetics overcome the
weaknesses of simple TMRCA formula, by having an inbuilt correction for
multiple step/ back/ parallel mutations. But in my ignorance of things
mathematical, I still can't help feeling we are still a bit vulnerable to
the very numerous unknowns we are making assumptions about. It all boils
down I guess to the fact that I am not a mathematical wizard, so I just
don't understand things well enough to make suitably informed comment.
THE VALUE OF TESTING 111 MARKERS:
But that aside, the interesting thing in this case is that a presumed
multiple step mutation has concealed this person as a match for me at 25,
37, and 67 markers, but he comes back into the frame at 111 markers. I
think it shows that for persons considering if it is worth testing to 111
markers, there is the possibility that they may find matches at 111 markers
which are not surfacing in the FTDNA customer database at fewer than 111
markers.
On the other side, I have a very close match to me at 25 markers, with what
looks like a variant spelling of my surname. However at 67 and 111 markers
this match moves to another planet, and in the end turned out to be in a
different R1b haplogroup subclade from me.
I think if it can be afforded, the 111 markers are useful to try. If you
have no matches at 12 to 67 markers, you can't say with certainty that you
will not have a match turn up in the FTDNA customer database at 111 markers.
Another option might be for FTDNA to give their current match predictions as
their standard most realistic matches, but allow customers who request to do
so (or even just project administrators) to have searches of the customer
database for matches say 2 steps greater than the current matching formula.
I guess that will never happen, but I would love it to be the case. I am in
particular looking for matches at the 1000 year range, as I have some lines
with genealogies that deep and deeper, and wider match criteria could very
likely help me to crack some brick walls. So far, the most helpful matches
to me have been from different surname matches, which appear related to me
in the 1200 to 1350 time frame, when surnames were less regularly used.
It would also help if all FTDNA customers put all of their markers on
Y-Search, but I guess that is not going to happen either.
GENERAL SEARCH ENGINE FOR HAPLOTYPE MATCHES IN ALL PUBLIC SURNAME PROJECTS-
COMMENTS WELCOME ON THIS SUGGESTION:
Another option might be for FTDNA to allow haplotype searches which covered
in a single search query all of the members of all of their surname projects
which have allowed public access to results. Although a person may have his
haplotype viewable in say the Smith or Jones surname project, he may not
have it on Y-Search or allow matches in the FTDNA customer database, simply
because he forgot to tick a box. I once spent a few hours searching for
matches in some of the larger surname projects, particularly some of the
large regional projects. At the time I found a few matches of interest
which had not surfaced elsewhere, but it was a hard search method as I had
to visually search for matches on key family markers. It is not possible to
do a computer search for a 70% or 90% haplotype match on a surname project,
or a regional project with 3000 or more members.
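The kind of search meant here could look something like the sketch below. It
is a minimal illustration only, not how FTDNA or Y-Search work. The kit
labels and marker values are invented, the function names are hypothetical,
and "percent match" is assumed to mean the share of jointly tested markers
with identical values.

```python
def percent_match(query, candidate):
    """Percentage of markers tested by both people that have identical values."""
    shared = [marker for marker in query if marker in candidate]
    if not shared:
        return 0.0
    same = sum(1 for marker in shared if query[marker] == candidate[marker])
    return 100.0 * same / len(shared)


def search_project(query, project, min_percent):
    """Return (kit, score) pairs at or above the threshold, best match first."""
    hits = []
    for kit, haplotype in project.items():
        score = percent_match(query, haplotype)
        if score >= min_percent:
            hits.append((kit, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Invented example data: five markers, three made-up project members.
my_haplotype = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11, "DYS388": 12}
public_project = {
    "kit-A": {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 10, "DYS388": 12},
    "kit-B": {"DYS393": 13, "DYS390": 23, "DYS19": 15, "DYS391": 10, "DYS388": 12},
    "kit-C": {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11, "DYS388": 12},
}

matches = search_project(my_haplotype, public_project, min_percent=70)
print(matches)  # [('kit-C', 100.0), ('kit-A', 80.0)]
```

A real search would more likely rank by step distance than by exact marker
agreement, but the same loop over every public project member applies.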
If a person allows their data to be viewable on a public access surname
project, does this mean that there would be no legal impediment to FTDNA
setting up a search engine to search all public access surname projects?
The value of genealogical DNA testing is not in the testing, but in the
ability to find matches. If FTDNA were able to have a search engine capable
of searching all public access surname projects at their company for
haplotype matches, in the same search process as Y-Search, then it might
increase the number of success stories, which would in turn increase the
demand for genealogical DNA testing.