GENEALOGY-DNA-L Archives

From: "Anatole Klyosov" <>
Subject: Re: [DNA] Variance Assessment of R:U106 DYS425Null Cluster
Date: Sat, 6 Feb 2010 18:46:44 -0500
References: <mailman.3199.1265484783.2099.genealogy-dna@rootsweb.com>


Dear Ken, Andrew and Sandy,

Apparently, I did not realize that things that are so clear to me look like
"absolutely shots in the dark", as Andrew put it. Maybe because my
professional fields are chemical kinetics and physical chemistry, I am used
to MUCH more severe complications in scientific problems, compared to which
the dynamics of mutations in Y-chromosomal DNA is just peanuts. Time and
again I am surprised that such a clear and simple system causes such
exaggerated disbelief among people who deal with haplotypes, as if it were
something SO complicated that it cannot be comprehended.

In reality, the whole system (DNA genealogy) is based on a few very
straightforward assumptions, which in fact have proven to be experimental
facts. Or, rather, nobody has proven that they are incorrect.

-- First, that mutations in haplotypes happen randomly, and actually serve
as a molecular clock over hundreds of thousands of years, and probably
millions of years.
-- Second, that mutations have their own mutation rate constant in each
locus, and by combining those loci (as haplotypes) we can sum those mutation
rates and operate with average mutation rate constants per marker and per
haplotype. I have found that those mutation rate constants give the best
results at 0.022 mutations per generation for the 12-marker haplotype
(0.00183 mutations per marker in that panel), 0.046 for the 25-marker
haplotype (0.00184 mutations per marker), 0.090 for the 37-marker haplotype
(0.00243 mutations per marker), and 0.145 for the 67-marker haplotype
(0.00216 mutations per marker).
-- Third, any given or chosen dataset (a series of haplotypes) could be
derived from one (technically) common ancestor, or from several (two or
more) of them. There are several criteria that can be employed in order to
figure out whether it was one or several common ancestors (lineages). One,
for example, is to compose a haplotype tree and see whether it splits into
several distinct branches; another is to employ the logarithmic method and
compare it to the linear method.
-- Fourth, the most frequent haplotype in the dataset is most likely the
ancestral one, BUT ONLY if the dataset has only one common ancestor (see
item three).
-- Fifth, the number of mutations counted in a dataset with respect to the
base haplotype is connected directly, but not linearly, with the time span
to a common ancestor. "Not linearly" - because there were back mutations in
the system over that time span.
-- Sixth, back mutations occur randomly, and follow the same kinetics as the
"primary" mutations. In fact, there is no difference between "back" and
"primary" mutations. They have the same mutation rate constant.
-- Seventh, a transition from the base (ancestral) haplotype to a mutated
haplotype has two consequences: (1) mutations are accumulated in the
dataset, and (2) base haplotypes disappear from the dataset. Both events are
FIRMLY and QUANTITATIVELY connected to each other when considered with a
statistically sound number of haplotypes in a dataset. That number can be as
small as 30-40-50 haplotypes. Of course, 100, 200, 300 or 1000 haplotypes in
a dataset is better, since it would give a lower margin of error. The said
connection is as follows: ln(N/n) = M/N, in which N is the total number of
haplotypes in the dataset, "n" is the number of base haplotypes left in the
dataset, and M is the number of mutations in the dataset (with respect to
the base haplotype). This is the gist of the logarithmic method. One can see
that ln(N/n) does not involve mutations at all. It just shows how far in
time "n" has gone from "N". If ALL haplotypes in a dataset are identical to
each other, that is, N=n, then ln(N/n)=0, which means that the haplotypes
are very recent, within the respective margin of error for the time span. In
that case, naturally, M=0, and M/N = 0. The same conclusion, that is, very
recent haplotypes. (A minimal calculation sketch of these two methods
follows this list.)
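
For those who like to see the arithmetic spelled out, here is a minimal
sketch of the two estimates in Python. The function names and the 25-year
generation length are choices made here for illustration (the generation
length is inferred from the quoted figures, e.g. 168 generations = 4200
years); the mutation rate constants are the ones listed above.

import math

# Mutations per haplotype per generation, as listed above.
MU_PER_HAPLOTYPE = {12: 0.022, 25: 0.046, 37: 0.090, 67: 0.145}
YEARS_PER_GENERATION = 25  # inferred from 168 generations = 4200 years

def generations_logarithmic(N, n, mu):
    """Logarithmic method: ln(N/n)/mu, where N is the total number of
    haplotypes, n the number of base (unmutated) haplotypes, and mu the
    mutation rate per haplotype per generation. No mutation counting."""
    return math.log(N / n) / mu

def generations_linear(M, N, mu):
    """Linear method: the average number of mutations per haplotype (M/N)
    divided by the mutation rate per haplotype per generation."""
    return M / N / mu

Both functions return the time to the common ancestor in generations,
before any correction for back mutations.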

In the example I gave earlier, for the 284 12-marker haplotypes of R-U106,
N=284, n=12 (base haplotypes), hence ln(284/12) = 3.16. Since the mutation
rate constant is 0.022 for 12-marker haplotypes, 3.16/0.022 = 144
generations (without a correction for back mutations), or 168 generations
with the correction, that is, 4200 years to a common ancestor. I did not
bother to count mutations in the 12-marker haplotypes; however, in the
25-marker series (which I prefer) there were 1853 mutations, that is,
1853/284/0.046 = 142 generations (without the correction), or 166
generations with the correction, that is, 4150 years to a common ancestor.
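
Continuing the sketch above, these numbers come out as follows. For the
back-mutation correction the sketch uses the first-order relation
lambda_corrected = -ln(1 - lambda_observed) per marker; this is an
approximation chosen here because it reproduces the corrected generation
counts quoted above to within a couple of generations for moderate mutation
loads, not necessarily the exact correction table from the JoGG paper.

def corrected_generations(gen_uncorrected, mu_per_haplotype, n_markers):
    # Approximate back-mutation correction (an assumption, see above).
    mu_per_marker = mu_per_haplotype / n_markers
    lam_obs = gen_uncorrected * mu_per_marker   # observed mutations/marker
    return -math.log(1.0 - lam_obs) / mu_per_marker

# R-U106, 12-marker panel, logarithmic method:
g12 = generations_logarithmic(284, 12, MU_PER_HAPLOTYPE[12])
g12c = corrected_generations(g12, MU_PER_HAPLOTYPE[12], 12)
print(round(g12), round(g12c), round(g12c * YEARS_PER_GENERATION))
# -> about 144, 167 and 4175 (the post quotes 144, 168 and 4200)

# R-U106, 25-marker panel, linear method (1853 mutations, 284 haplotypes):
g25 = generations_linear(1853, 284, MU_PER_HAPLOTYPE[25])
g25c = corrected_generations(g25, MU_PER_HAPLOTYPE[25], 25)
print(round(g25), round(g25c), round(g25c * YEARS_PER_GENERATION))
# -> about 142, 164 and 4110 (the post quotes 142, 166 and 4150)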

Another example: R-L21/S145, 509 haplotypes in the dataset; the 25-marker
haplotypes contain 2924 mutations, which gives 2924/509/0.046 = 125
generations without the correction, or 143 generations with the correction,
that is, 3575 years to a common ancestor. The whole series contained 770
12-marker haplotypes, among them 49 base haplotypes. [ln(770/49)]/0.022 =
125 generations without corrections... well, you get the picture. It is
exactly the same 3575 years to a common ancestor. With a margin of error it
gives 3575+/-370 ybp.
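
The same functions, with the approximate correction, give essentially the
quoted figures for this dataset as well (again, a sketch continuing the
Python code above, not the exact published procedure):

# R-L21/S145, linear method: 2924 mutations in 509 25-marker haplotypes.
g = generations_linear(2924, 509, MU_PER_HAPLOTYPE[25])
gc = corrected_generations(g, MU_PER_HAPLOTYPE[25], 25)
print(round(g), round(gc * YEARS_PER_GENERATION))
# -> about 125 generations and 3550 years (the post quotes 125 and 3575)

# Logarithmic method on the 770 12-marker haplotypes, 49 of them base:
print(round(generations_logarithmic(770, 49, MU_PER_HAPLOTYPE[12])))
# -> about 125 generations, agreeing with the linear estimate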

Where do you see "shooting in the dark"? If those things are handled
properly, there is nothing complicated about it.

The basis of counting mutations is that where there are no mutations in a
dataset, the common ancestor was a very recent one. If there is a moderate
number of mutations in a dataset, say, 0.05 mutations per marker (in a
25-marker haplotype set), the common ancestor lived 700 years ago (plus or
minus an error margin, which is determined by the size of the dataset). If
there are 0.20 mutations per marker, the common ancestor lived 3000 years
ago. If there are 0.50 mutations per marker, the common ancestor lived 9300
years ago. If there is 1.00 mutation per marker, the common ancestor lived
27,000 years ago. All these figures are corrected for back mutations.
Mutations are counted as an average over the whole dataset. Yes, a dataset
which contains present-day haplotypes. This is the beauty of the method:
present-day haplotypes have accumulated those mutations, through their
predecessors, over hundreds and thousands of years.
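
As a rough check on the first three of those figures, the approximate
correction from the sketch above converts a mutation load per marker
directly into years (25-marker rate of 0.046/25 per marker per generation,
25 years per generation). The approximation breaks down as the load
approaches 1 mutation per marker, where the 27,000-year figure depends on
the fuller correction:

def years_from_load(mut_per_marker, mu_per_marker=0.046 / 25):
    # Load per marker -> years to the common ancestor, using the same
    # approximate back-mutation correction as above (an assumption; not
    # valid as the load approaches 1 mutation per marker).
    gens = -math.log(1.0 - mut_per_marker) / mu_per_marker
    return gens * YEARS_PER_GENERATION

for load in (0.05, 0.20, 0.50):
    print(load, round(years_from_load(load)))
# -> about 700, 3030 and 9420 years (the post quotes 700, 3000 and 9300)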

The above is what I have called "peanuts". It is a very smooth and
straightforward system. However, as in each and every field of science,
there is a context in which experiments are conducted and interpretations
are made. If one climbs to the top of a high mountain and finds that water
boils at 80 C instead of 100 C, as he would expect, he should not blame
science and claim that scientists "absolutely shoot in the dark". However,
the same attitude haunts people who deal with haplotypes and mutations, and
their own ignorance they take as "IT IS AWFULLY COMPLICATED" and "shooting
in the dark". They take small, statistically unsound datasets, sometimes as
small as two or three haplotypes, and expect to get something out of them;
they do not separate branches (lineages); they use wrong mutation rate
constants; they do not consider back mutations, etc., etc.

Now, I will consider some specific comments.

1. Ken said: "I maintain that you can not count mutations because they
happened in the past and would require an inference about the whole tree
history. One can only count outcomes of mutations in the sample set of
final haplotypes seen today, after making an inference about a founding
haplotype."

This is a typical misunderstanding. I would not call it "ignorance"; this
is either a different mindset or a simple misunderstanding. What we measure
(by counting) is the average number of mutations accumulated in the "final
haplotypes seen today". If none are accumulated, the common ancestor lived
recently. If a lot of mutations are accumulated, the common ancestor lived a
looooong time ago. I hope it makes sense to you now. There is nothing
mysterious about it.

2. Andrew said: "Apparently there is something "wrong" with the example Ken
gives, which makes it give a bigger margin of error than the real example
you were discussing. You call it absurd and fuzzy. However it does not seem
absurd or fuzzy to me as a genetic genealogist".

It was absurd in the way Ken switched the subject. There is nothing absurd
in the haplotypes themselves; it was just an improper and incorrect
selection (or, rather, a made-up series of haplotypes). Those haplotypes did
not represent one lineage. When we discussed U106, it was one lineage. The
absurdity was in the manner of discussion. The "fuzziness" was a lack of
focus on the subject of the preceding discussion.

> I presume this is because it implies no clear ancestral modal and no clear
> family tree structure? Is this correct?

Not quite. The example was deliberately constructed so that it does not
have one ancestral haplotype. It was deliberately made in such a way as to
justify that a margin of error can be high. Of course it can be high, but
not in the case of the 284 25-marker haplotypes that we discussed earlier.
If there is just one mutation in a dataset, the margin of error will be
100%. So what?
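
For what it is worth, the 100% figure is simply what the usual counting
statistics give: the relative error from counting M mutations alone is
about 1/sqrt(M). A small illustrative snippet (the margins quoted above
presumably also fold in other sources of error, such as the mutation rate
constants themselves):

# Relative counting error ~ 1/sqrt(M): 100% for a single mutation,
# under 2% for the 2924 mutations in the R-L21/S145 dataset.
for M in (1, 10, 100, 2924):
    print(M, "%.1f%%" % (100 / M ** 0.5))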

> Or if not can you explain more precisely what is "wrong" with the example?

There were three "wrongs". First, Ken dismissed, without any consideration,
the margin of error which I had quoted. Second, when I suggested that he
provide me with HIS estimate of a margin of error based on the 284 25-marker
haplotypes and the 2924 mutations in them, he did not give an answer. Third,
he switched the subject and gave me an irrelevant and distorted example.
That is what was wrong. And, unfortunately, it was a typical example of how
some people "discuss" here. Their goal is not to find an answer but to
dismiss, dismiss, and dismiss. This is very far from science. In proper
science there is acceptance, support, and/or further development of a
concept. It is not a bunch of spiders in a can. It is VERY unproductive when
a burning desire to dismiss by any means overrides common sense.

> To me Ken's example looks like something we see all the time even within
> solidly defined family groups, or SNP defined clades.

It means that there is something very wrong with your approach. First,
Ken's "example" can be approximately described with a TMRCA of more than
10,000 ybp. That simply cannot be the case in your family studies. Either
you are hugely exaggerating, or you are actually saying "we do not know what
we are doing". You boil water on top of a mountain, figuratively speaking,
without realizing that you operate in a wrong context. Or you take two
haplotypes, look at five mutations between them, and scratch your head.

> My experience in practice is that the implied family tree and ancestral
> haplotype coming from a particular set of, lets say 284 x 25 marker
> haplotypes, can be completely changed by just using a few different
> markers or a few different haplotypes.

Dear Andrew, I normally do not discuss things in such fuzzy terms. What
does "completely changed" mean? What is "a few different markers"? What are
"a few different haplotypes", particularly when you refer to a 284 x 25
marker dataset? It takes A LOT more than "just a few" haplotypes to change
such a robust system, let alone "completely change" it. Why don't you give
a specific example, to illustrate your point and to show what is
"completely changed" with 284 x 25 marker haplotypes?

> I am not only talking about small groups. As you know the E-M35 project
> has well over a thousand people, but we still see that predicting a family
> tree and ancestral modal for any group of these haplotypes is very
> sensitive to relatively small changes in the markers being looked at, or
> the individuals being considered.

I respectfully disagree, and I disagree on two items. One: again, with your
way of describing it, what is "predicting a family tree"? What does it
specifically mean? What are "relatively small changes"? What is "very
sensitive"? Those are, I am sorry, empty words in such a context. A thousand
haplotypes is a heluvalot, and provides huge possibilities for describing
the system. Two: I disagree with the attitude in general. Have you heard
about archaeology? Do you think THEY do not change their conclusions from
time to time? How about linguistics? How about chemistry, physics, and any
other discipline? What makes you think that DNA genealogy, or population
genetics if you wish (though I absolutely differentiate the two in their
methodology and goals), should be the holder of truth, once and forever?
Things, conclusions, will be changed, and changed big time. What is wrong
with that? This is what we call the development of science.

> From: "Sandy Paterson" <>
>From Anatole's most recent posting.. it means that he observes what
> you've described as the outcome of mutations.

Dear Sandy, who cares what you call it? If you see a chunk of corroded
iron, you can call it "the outcome of corrosion". Does that prevent you from
analyzing that chunk? Do you really want to watch the progress of that
corrosion over the whole time period in order to draw your conclusion? When
I have two datasets of haplotypes, and in one the average number of
mutations is 0.01 mutations per marker, and in the other one it is 0.236
mutations per marker, it tells me pretty much everything about the timespan
to a common ancestor, particularly after I have built a haplotype tree and
verified whether it contains a single lineage or not.
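
To make that concrete (using the years_from_load sketch above with the
25-marker rate and the same approximate correction; these particular ages
are illustrative, not figures from the thread):

for load in (0.01, 0.236):
    print(load, round(years_from_load(load)))
# -> roughly 140 years and 3660 years to the respective common ancestors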

>From: "Ken Nordtvedt" < >
> Some of those differences are probably the consequences of a singular past
> mutation.

This happens only when you do not analyze the dataset properly. When a
dataset contains "a singular past mutation", it typically shows up as a
separate branch of the haplotype tree. I have considered this issue in
detail in my paper in JoGG, since a reviewer raised the same question. I
even added a separate section on it. It is a typical misunderstanding that
"a singular past mutation" causes "overcounting" and that one can do
nothing about it. Take the case of DYS388=10 in the R1a1 haplogroup. Yes,
this double mutation (12-->10 or 10-->12) has stayed with R1a1 for at least
4000 years, and it makes a separate branch of the tree. That branch contains
only DYS388=10; hence, you do not count it within the branch. There is no
"overcounting". Again and again: one has to consider branches in haplotype
datasets, otherwise the results will be incorrect.

Regards,

Anatole Klyosov

