
From: "Anatole Klyosov" <>
Subject: Re: [DNA] Variance Assessment of R:U106 DYS425Null Cluster
Date: Mon, 8 Feb 2010 23:13:52 -0500
References: <>

> From: "Lancaster-Boon" <>
> Dear Anatole
> Yes I do think the main problem is "talking past each other" and so I
> really want to thank you for not dropping this in frustration. I know
> that this must be very tempting. People drop conversations in these
> situations too quickly sometimes.

Dear Andrew,

It has been my profession for many, many (good) years to explain things to
people who often did not have a clue. Fortunately, I taught at the finest
(according to my taste) universities, that is Moscow University and Harvard
University, so there were not too many really damn folks. Nevertheless, a
certain patience was a part of the job. For my part, let me thank you for
being not only patient but thoughtful and positive. Unlike many people you
really want to find answers to your questions. Most people do not care about
answers, they just love to dismiss without much thinking. That is what
life means for them.

> You find my remarks about family trees mysterious and ask me to define
> what I mean. (...) the family tree question I am talking of
> "which sub-sets share common ancestors".
> So amongst all the things you have shown, the one I am questioning is,
> as per your latest listing of steps in your method "(4) how to dissect a
> tree to branches". And this is the step in your method I think you write
> the least about.

Because it is the most obvious one. One has to compose a haplotype tree and
see if the tree consists of distinct branches of haplotypes. If
the dissection is done correctly, one has a number (minimum two) of subsets,
each of which should be analyzed separately. Each one is supposed to have its
own most recent common ancestor. The logarithmic criterion can be applied to
each separate branch, to verify that it is derived from one common ancestor.

For example, the well-known work by Hammer, Behar, Skorecki et al. on the "Cohen
Modal Haplotype" and its TMRCA was done fundamentally incorrectly. They took
a series of J1/J2 haplotypes (J1 haplotypes some time later), counted mutations
and calculated a TMRCA, which happened "to be" around 3,200 ybp. In fact,
their dataset contained two distinctly different branches, one with a TMRCA of
around 4,200 ybp, another around 1,000 ybp. Neither of them was applicable
to "Aaron", contrary to what was claimed. They did not realize that they
mixed different "common ancestors" and as a result obtained some "phantom"

To mix "common ancestors", aka different branches, is the most frequent
mistake made by authors of "academic" papers. It is still worse that
afterwards they employ the "Zhivotovsky coefficient" and divide that mess by
about three. Typically they get no more (or no less) than sheer nonsense.

That was essentially the gist of my "Comment" in Human Genetics on the paper
by Hammer, Behar, Skorecki, Zhivotovsky, Karafet, etc., in which they did
exactly that. In their "Response" they "responded" that their method was good
and approved by the scientific community. That was the gist of their "response".

It all boils down to the amazing infancy of "academic" papers dealing with
TMRCA calculations. It is really a shame when fine work on identifying new
subclades and typing hundreds or even thousands of haplotypes ends up in
complete misinformation and confusion regarding the chronology of historical
events, because the authors are not educated in the simple rules governing
the dynamics of mutations in series of haplotypes.

> Obviously any counting of mutations requires
>this step FIRST. It is critical to everything.

I could not agree with you more.

> You ask for an example. Maybe this is a good idea.

No maybe's. It is the ONLY good idea.

> I think that there is no point showing marker values which are merely
> the same, because obviously you can not use these to make your
> phylogeny, so I'll show markers which differ in the same way for more
> than one haplotype.

Now, that IS a mistake. I do not know what "phylogeny" you are talking about,
which I "cannot use". Haplotypes should be shown in their entirety, not in
fragments. What is "merely the same"? Do you mean that only the alleles shown
are different from each other, and all other alleles in the 37- or 67-marker
haplotypes are practically the same? Allow me not to believe it.

The haplotypes (fragments) that you have shown apparently belong to different
subsets, each of which has its own common ancestor. I can easily compose a
haplotype tree and show it. My guess, though I cannot be certain with these
bastardized haplotypes, is that they belong to several "shallow" branches (that
is, several hundred years "old" each); however, they are derived from a
rather ancient "most recent" common ancestor. There is nothing like
"shooting in the dark" here.

> 24-12-17-18-18-17-37-38
> 24-12-17-18-19-17-38-40
> 25-12-17-18-19-17-38-39
> 25-12-17-18-19-17-38-38
> 25-12-17-18-18-17-37-39
> 25-12-17-18-18-18-38-38
> 25-12-17-18-19-18-37-38
> 25-12-16-18
> 25-11-16-19-20-17-38-38
> 25-11-16-19-18-18-37-40
> 25-11-17-19-19-18-38-39
> 25-11-17-19-19-17-38-38
> 25-12-17-19-19-18-37-38
> 26-12-17-19-18-17-37-38

> Now, when you remove the irrelevant markers...

There are no irrelevant markers in haplotypes. By "irrelevant" you apparently
meant something else; however, all of them, when available, form a unified,
integrated system.

> to me this real example looks very much like the example of Ken's which
> you criticized as a case of changing the subject by giving a sample which
> showed no obvious structure?

Absolutely not, the way I see it. Unfortunately, my repeated comment did
not get across. The thing was that we had discussed a simple case of almost
300 extended haplotypes, and my point was that for that many extended
haplotypes it is easy to follow well-defined rules of DNA genealogy and
illustrate basic principles of calculating TMRCAs and margins of error.
At this point the opponent stopped, made a U-turn and showed an unwillingness
to confirm that my approach was basically correct. Instead, he made up a
deliberately distorted small "haplotype set" which, of course, would result
in a much higher margin of error. That is how I saw it, and that is why I
called it not a fair game.

In reality, of course, you can find all kinds of distorted haplotype series,
however, most of them can be treated following the same rules.

> In my example above, which mutation happened first and is more
> ancestral, on the following "slow moving markers":-
> *The 11 or 12 on the second marker?
> *The 16 or 17 on the third marker?
> *The 18 or 19 on the fourth marker?

This is what can be shown on a haplotype tree. You cannot say what is "more
ancestral" since there are several ancestors there. THE ancient "most
recent" common ancestor might not even be seen clearly from the listed
haplotypes. It can be deduced, though, by comparing the base haplotypes of
several subsets, which can be identified on the haplotype tree. Some of them
can be poorly identifiable due to insufficient statistics.

> I'd say there has either been back mutations, or else the same mutation
> happened twice in parallel, or else one of the minority (non modal) values
> is actually ancestral? And this is on slow markers.

Back mutations produce a negligible effect over the first 26 generations, and
a minimal one over the first 40-50 generations. I do not think that your family
genealogy goes that deep in time.

> As you can see, there are numerous possibilities. I understand that what
> you objected to in Ken's example is that there were numerous
> possibilities...

No. It means that you did not understand my objection. If so, I cannot help
with it.

Now, let ME give you a couple of examples, out of many. A reader sent me a
25-marker haplotype set of eight relatives in Britain. He asked me to
determine when a common ancestor lived, but did not disclose the actual
date. The list was as follows:

13 25 14 10 11 14 12 12 10 13 11 30 15 9 10 11 11 23 14 20 35 15 15 15 16

13 25 15 10 11 14 12 12 10 13 11 29 15 9 10 11 11 23 14 20 35 15 15 15 16

13 25 15 10 11 14 12 12 10 13 11 30 15 9 9 11 11 23 14 20 35 15 15 15 15

13 25 15 10 11 14 12 12 10 13 11 30 15 9 10 11 11 23 14 20 35 15 15 15 16

13 25 15 10 11 14 12 12 10 13 11 30 15 9 10 11 11 23 14 20 35 15 15 15 16

13 25 15 10 11 14 12 12 10 13 11 30 15 9 10 11 11 23 14 20 35 15 15 15 16

13 25 15 10 11 14 12 12 10 13 11 30 15 9 10 11 11 23 14 20 35 15 15 15 16

13 25 15 10 11 14 12 12 10 13 11 30 15 9 10 11 11 23 14 20 35 15 15 15 16

Clearly, the base, ancestral haplotype is as follows:

13 25 15 10 11 14 12 12 10 13 11 30 15 9 10 11 11 23 14 20 35 15 15 15 16
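
As a minimal sketch of how the base haplotype above can be read off the list,
the Python below takes the most common allele at each marker position and
counts how many of the eight haplotypes are identical to it. The function
names are hypothetical, and using the per-marker mode is an assumption made
for illustration; for this small set it reproduces the base haplotype given
above.

```python
from collections import Counter

# The eight 25-marker haplotypes listed above, one string per person.
HAPLOTYPES = [
    "13 25 14 10 11 14 12 12 10 13 11 30 15 9 10 11 11 23 14 20 35 15 15 15 16",
    "13 25 15 10 11 14 12 12 10 13 11 29 15 9 10 11 11 23 14 20 35 15 15 15 16",
    "13 25 15 10 11 14 12 12 10 13 11 30 15 9 9 11 11 23 14 20 35 15 15 15 15",
    "13 25 15 10 11 14 12 12 10 13 11 30 15 9 10 11 11 23 14 20 35 15 15 15 16",
    "13 25 15 10 11 14 12 12 10 13 11 30 15 9 10 11 11 23 14 20 35 15 15 15 16",
    "13 25 15 10 11 14 12 12 10 13 11 30 15 9 10 11 11 23 14 20 35 15 15 15 16",
    "13 25 15 10 11 14 12 12 10 13 11 30 15 9 10 11 11 23 14 20 35 15 15 15 16",
    "13 25 15 10 11 14 12 12 10 13 11 30 15 9 10 11 11 23 14 20 35 15 15 15 16",
]


def parse(rows):
    """Turn each whitespace-separated row into a tuple of integer alleles."""
    return [tuple(int(allele) for allele in row.split()) for row in rows]


def base_haplotype(haplotypes):
    """Assumed rule: the most common allele at each marker position."""
    return tuple(
        Counter(column).most_common(1)[0][0] for column in zip(*haplotypes)
    )


def count_base(haplotypes, base):
    """Number of haplotypes identical to the base haplotype."""
    return sum(1 for haplotype in haplotypes if haplotype == base)


haps = parse(HAPLOTYPES)
base = base_haplotype(haps)
print(" ".join(str(allele) for allele in base))
print(count_base(haps, base), "of", len(haps), "haplotypes match the base")
```

The count of haplotypes identical to the base is the quantity that enters the
logarithmic estimate.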

All eight haplotypes have three mutations per 200 alleles. This gives
3/8/0.046 = 8 generations from a common ancestor. At the same time the series
contains five base haplotypes, which gives ln(8/5)/0.046 = 10 generations.
The average value is 9 generations, plus-minus some margin of error.
However, a formula (given in my publication I have referred to) shows that
for a 200-marker series with three mutations in it, for fully asymmetrical
mutations (which the series shows), the standard deviation theoretically
equals 57.7%. At the 68% confidence level ("one sigma") we obtain
9±3 generations, and at the 95% confidence level ("two sigma")
9±5 generations to the common ancestor. Therefore, at 25 years per
generation, the common ancestor lived around the year 1784±75 (68%
confidence), or 1784±125 (95% confidence), and, since he was born some 25
years earlier, this gives his birth year around 1759, give or take a century.
In fact, as I was later informed, Robert, the common ancestor of all the eight
individuals, was born in 1767.
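
The arithmetic above can be retraced with a short Python sketch. The function
names are hypothetical. The mutation count (3), the number of base haplotypes
(5) and the rate of 0.046 per 25-marker haplotype per generation are taken
from the text as given. The reference year 2009 is an assumption inferred
from the numbers above (9 generations of 25 years counted back to 1784). No
correction for back mutations is applied, and the margins of error are not
recomputed here.

```python
import math

RATE_25_MARKER = 0.046       # mutations per haplotype per generation (from the text)
YEARS_PER_GENERATION = 25    # implied by the text: 3 generations = 75 years
REFERENCE_YEAR = 2009        # assumption inferred from 1784 + 9 * 25


def linear_generations(mutations, n_haplotypes, rate):
    """Mutation-counting estimate: mutations per haplotype divided by the rate."""
    return mutations / n_haplotypes / rate


def log_generations(n_haplotypes, n_base, rate):
    """Logarithmic estimate from the share of unmutated (base) haplotypes."""
    return math.log(n_haplotypes / n_base) / rate


linear = linear_generations(3, 8, RATE_25_MARKER)
logarithmic = log_generations(8, 5, RATE_25_MARKER)
average = round((linear + logarithmic) / 2)   # the text rounds to whole generations

years_back = average * YEARS_PER_GENERATION
lived_around = REFERENCE_YEAR - years_back
born_around = lived_around - YEARS_PER_GENERATION

print(f"linear: {linear:.1f} generations, logarithmic: {logarithmic:.1f} generations")
print(f"average: {average} generations, {years_back} years back")
print(f"lived around {lived_around}, born around {born_around}")
```

The two estimates are computed independently, so their closeness is what the
text relies on when treating the eight haplotypes as one branch.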

I am not cherry-picking; I have tons of examples like this one.

Now, a funny story. When my would-be publication in JOGG was going through the
review process, a reviewer, who did not believe in my approach, sent me
his personal example, in order (apparently) to pin me down. His case
considered 19 relatives, all having a common ancestor who was born in...
well, he did not say at the beginning (naturally). Their 19 37-marker
haplotypes contained 23 mutations. Therefore, 23 mutations in 703 markers
would give 23/703/0.00243 = 13.464 generations (being ridiculously precise
for the sake of these particular calculations), that is 337 years to a
common ancestor. That is, their common ancestor lived around 1672, and was
born some 25 years before that, that is around 1647. That was the reviewer's
condition: when was the ancestor born? I tried to explain that I do
not operate with single generations, since a typical margin of error
precludes it, but the reviewer wanted to know. O.K.
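
The reviewer's example can be retraced the same way. This is a sketch with
hypothetical names; the rate of 0.00243 is read here as mutations per marker
per generation, as the division by 703 markers implies, and the reference
year 2009 is an assumption inferred from the text (337 years counted back to
1672).

```python
RATE_PER_MARKER = 0.00243    # mutations per marker per generation (from the text)
YEARS_PER_GENERATION = 25    # as used in the text
REFERENCE_YEAR = 2009        # assumption inferred from 1672 + 337

n_haplotypes = 19
markers_per_haplotype = 37
mutations = 23

total_markers = n_haplotypes * markers_per_haplotype          # 703
generations = mutations / total_markers / RATE_PER_MARKER      # about 13.464
years_back = round(generations * YEARS_PER_GENERATION)
lived_around = REFERENCE_YEAR - years_back
born_around = lived_around - YEARS_PER_GENERATION

print(f"{total_markers} markers, {generations:.3f} generations")
print(f"{years_back} years back: lived around {lived_around}, born around {born_around}")
```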

It turned out that the ancestor of those 19 haplotypes was born in 1642. I
had obtained 1647. I fully understand that the margin of error, which was
plus-minus 78 years, would allow me to be less precise. However, I object
when people say that my error margins are "too optimistic". They just say it
without seeing data, and most of them without ever calculating margins of
error. I have tons of data. Is there a difference?

I would say that for many cases my margins of error are toooo conservative.
It is O.K., I can live with it.

Now, the funny part of it. I had asked the reviewer to include those data
about his relatives (of course without names) and my calculations into my
would-be paper, to illustrate "genealogical" calculations. He did not want
it, so I dropped it. He did not want me to give a positive example for the
paper. He still remained skeptical. His mindset did not allow him to believe
that mutations do follow certain rules. By the way, his name is well known
around here. Don't ask, don't tell.


Anatole Klyosov
