Archiver > GENEALOGY-DNA > 2010-02 > 1266034170

From: "Anatole Klyosov" <>
Subject: Re: [DNA] Y Tree SNPs can not be counted
Date: Fri, 12 Feb 2010 23:09:32 -0500
References: <>

>From: "Ken Nordtvedt" <>

Dear Ken,

Many words, little substance.

I suspect that by "counting mutations" you mean something VERY different
from what I mean.

By "counting mutations" I mean the following. If we have, say, 500
67-marker haplotypes, of which 14 are identical to each other (the base
haplotype), and the other 486 mutated haplotypes carry (collectively)
1788 mutations from the base, then those 1788 mutations are the ones
that we count.
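
The tally described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the base haplotype is taken to be the most frequent one in the sample, a difference of k repeats at a marker is counted as k mutations, and the five 4-marker haplotypes are made-up toy values, not data from this thread. The function name `count_mutations_from_base` is hypothetical.

```python
from collections import Counter


def count_mutations_from_base(haplotypes):
    """Return (base haplotype, copies of the base, total mutations from the base)."""
    # Assumption: the base haplotype is the most frequent one in the sample.
    base, n_base = Counter(haplotypes).most_common(1)[0]
    total = 0
    for hap in haplotypes:
        # Assumption: an allele differing by k repeats counts as k mutations.
        total += sum(abs(allele - ref) for allele, ref in zip(hap, base))
    return base, n_base, total


# Toy sample of 4-marker haplotypes (hypothetical values).
sample = [
    (13, 24, 14, 11),
    (13, 24, 14, 11),
    (13, 25, 14, 11),
    (12, 24, 14, 11),
    (13, 24, 16, 11),
]
base, n_base, total = count_mutations_from_base(sample)
print(base, n_base, total)  # prints: (13, 24, 14, 11) 2 4
```

Counting a two-repeat difference as one mutation instead of two would change the total, so that choice has to be stated whenever such a count is reported.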

True or false? (a rhetorical question).

Thank you for your (anticipated) honest and direct answer.

Anatole Klyosov


> Since some have recently been repeatedly saying they make y tree age
> estimates and their statistical confidence intervals of those estimates by
> "counting mutations" in the tree (though I think they mean something quite
> different), I want to document the reasons why it is impossible to count
> mutations in a y tree such as one for I1 haplogroup, etc. It would not be
> good if this repeated claim led people to believe that proper TMRCA
> estimates for ancient clades involve trying to do the impossible ---
> count mutations (see P.S. for young genealogical trees). The reasons that
> counting mutations in a y tree is impossible are multiple. What we can
> count are features seen in the tree's resulting haplotypes in hand.
> 1. A y tree from a single clade MRCA to the N present descending
> haplotypes of one's representative sample has a structure determined
> solely by the demographics through all of its history. First we have the
> full tree for the entire clade population, which might be millions, and
> then we have the particular tree for the sampling of the present clade
> population one has taken. The latter tree is some pruning of the former
> tree; if the sampling has quality, its tree will capture important
> features of the full tree for the entire clade.

> Even if SNPs and STRs did not exist, that
> y tree structure exists, whatever it is. How many male children did
> clade males have through time which produced next generation males, etc.
> in the face of all the forces which acted on the clade? Y mutations had
> nothing to do with what tree structure resulted. Births and deaths did.
> The mutations are only fortunate tools permitting us to peer back into the
> heart of the tree.

> So the number of father/son transitions which took place in the tree
> (sample population tree or full tree) is unknown, and the locations of
> all the nodes in the tree --- places where a father and two or more sons
> are part of the tree --- are unknown. When mutations are eventually
> considered, we don't have a specific count of the number of chances each
> mutation has had to happen in the tree. But perhaps more importantly,
> because the node structure of the tree is unknown, and the locations
> where the mutations took place are unknown, we can not say how many final
> haplotypes in our sample will show the consequences of most of the
> mutations which do happen in the tree. All nodes in the tree downstream
> of any mutation enlarge the number of final haplotypes in our sample
> which show the particular mutation.

> A theory of the entire demographic history of the clade is a very iffy
> thing. And even then, some of the tree structure will be the result of
> ultimate luck --- such and such males had mostly sons or mostly
> daughters, etc. --- facts not determinable by the demographic theory.
> So the basic tree structure is unknown, not just its total time depth to
> origin --- this specifically mentioned tree property being perhaps the
> tree parameter standing the best chance of decently being estimated from
> our final evidence --- the N haplotypes.
> 2. Now we consider throwing STR mutations into the tree. Unlike the
> unknown demographic history, we do have a theory of occurrences of the STR
> mutations.
> At every father/son transition in the tree each STR will mutate in some
> way with some tiny probability, and with the remaining large probability
> stay the same. Because this behavior is probabilistic, it results in two
> uncertainties about the occurrences of mutations in the tree. How many
> times did each STR mutate in the tree? Where did those
> mutations take place in the tree? We can calculate only the statistical
> answers to these questions and get distributions of outcomes. Since nature
> has performed this tree once for the case at hand, it is just a specific
> case from the probabilistic distribution of outcomes. Example: A clade
> tree of about 140 generations in age, ending with a sample population of
> 64, could have in the ball park of 3000 father/son transitions in it. A
> fast STR with a mutation rate of 1/150 would on average have mutated
> about 20 times in the tree. But the 2-sigma statistical confidence
> interval for the number of occurrences of that STR mutation spans the
> range of about 12 to 30 occurrences. And to make matters even more
> uncountable, about half the mutations would have been up and half down,
> but the actual up/down split is also subject to a statistical confidence
> interval. And so on for each STR of our haplotypes.
> So even if we knew the tree structure (which we don't), we would not know
> the actual number of occurrences of each STR's mutations in the tree; we
> can only calculate or determine by simulations the distributions for those
> numbers, including the average or expected values of those distributions.
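
The arithmetic in the example above can be checked with a short Python sketch. It assumes each of the 3000 father/son transitions is an independent trial with mutation probability 1/150, so the number of mutations is binomial, and that each mutation is up or down with probability 1/2. The function names are hypothetical. The plain mean plus or minus two standard deviations gives roughly 11 to 29, close to the quoted range of about 12 to 30; the exact quantiles are computed as well.

```python
import math


def mutation_count_summary(transitions, rate):
    """Mean and standard deviation of one STR's mutation count, treating
    every father/son transition as an independent trial (an assumption)."""
    mean = transitions * rate
    sd = math.sqrt(transitions * rate * (1.0 - rate))
    return mean, sd


def binomial_quantile(transitions, rate, q):
    """Smallest count k whose cumulative binomial probability reaches q."""
    pmf = (1.0 - rate) ** transitions  # probability of zero mutations
    cdf = pmf
    k = 0
    while cdf < q and k < transitions:
        # Recurrence from P(k) to P(k + 1) for the binomial distribution.
        pmf *= (transitions - k) / (k + 1) * rate / (1.0 - rate)
        k += 1
        cdf += pmf
    return k


TRANSITIONS = 3000  # father/son transitions in the example tree
RATE = 1.0 / 150.0  # mutation rate of the fast STR in the example

mean, sd = mutation_count_summary(TRANSITIONS, RATE)
low = binomial_quantile(TRANSITIONS, RATE, 0.025)
high = binomial_quantile(TRANSITIONS, RATE, 0.975)
print("expected count:", round(mean, 1), "sd:", round(sd, 2))
print("mean +/- 2 sd:", round(mean - 2 * sd, 1), "to", round(mean + 2 * sd, 1))
print("central 95% of outcomes:", low, "to", high)

# Up/down split: with n mutations, the number of "up" steps is
# Binomial(n, 1/2), so its standard deviation is sqrt(n) / 2.
n_mut = round(mean)
print("ups out of", n_mut, ":", n_mut / 2, "+/-", round(math.sqrt(n_mut) / 2, 2))
```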
> Secondly, the locations of each STR's multiple mutations are the outcome
> of a random process. Each father/son transition stands equal chance of
> being the site. So that fast STR with an expected 20 mutations in the
> tree has those 20 locations sprinkled randomly over the 3000 or so
> locations in the tree. The consequences of location distribution are
> immense, as are the impacts on the resulting haplotypes.

> If an STR mutation happens to occur in one of the earliest branch
> segments, from the two sons of the clade's MRCA, such a mutation on
> average will appear in (affect) 50 percent of the final haplotype sample
> population. But if, at the other extreme, it occurs in one of those last
> N branch segments of the tree which terminate with one of the N
> haplotypes of the sample population, that mutation will appear in
> (affect) just the single haplotype at the terminal. So a mutation has a
> potency on average of affecting from 50 percent down to 1/N of the
> observed haplotypes. So there is no way to go from the observed
> haplotypes backwards in this process and count occurrences of the
> underlying STR mutation from what's seen in the haplotypes. And of course
> that 50 percent consequence of a very early STR mutation itself is only
> an average or expected value; we'd have to know the tree structure in
> detail to know the true fraction of sample haplotypes affected by it;
> that fraction depends on what fraction of the final haplotypes descend
> from each of the sons of the MRCA, and which son's descending branch
> segment had the STR mutation in question.
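
To make the range from 50 percent down to 1/N concrete, here is a minimal Python sketch for one idealized case. It assumes a perfectly balanced bifurcating tree with 64 sampled haplotypes and branch segments of equal length at every level, so that a mutation is equally likely to land on any segment. Real trees are not like this, and the function names are hypothetical.

```python
def affected_fraction(depth):
    """Fraction of sampled haplotypes that inherit a mutation occurring on a
    branch `depth` splits below the MRCA, in a perfectly balanced tree."""
    return 1.0 / 2**depth


def expected_haplotypes_affected(levels):
    """Average number of sampled haplotypes showing one mutation that lands
    on a uniformly chosen branch segment (assumes equal-length segments)."""
    n_sample = 2**levels
    total_branches = sum(2**d for d in range(1, levels + 1))
    expected = 0.0
    for d in range(1, levels + 1):
        p_level = 2**d / total_branches  # chance the mutation lands at depth d
        expected += p_level * n_sample * affected_fraction(d)
    return expected


LEVELS = 6  # 2**6 = 64 haplotypes in the sample
for depth in range(1, LEVELS + 1):
    share = affected_fraction(depth)
    print("depth", depth, "->", round(100 * share, 2), "% =", int(64 * share), "haplotypes")
print("average haplotypes per mutation:", round(expected_haplotypes_affected(LEVELS), 2))
```

In this toy tree about half of all segments are terminal ones, so a typical mutation shows up in only a handful of haplotypes, while a rare early one shows up in 32 of the 64.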

> But fortunately, because we have a theory of the probabilistic occurrences
> of these STR mutations, we can determine by analytic calculation or by PC
> simulations the distribution of outcomes for various properties of the
> tree, including their expected values and statistical confidence
> intervals.
> In other words, analytics or PC simulations through a tremendous number of
> cases can essentially consider all possible probabilistic insertions of
> the STR mutations into any given tree structure and produce the
> distributions of outcomes for key parameters.
> When the smoke clears, some of these parameter distributions have expected
> values which amazingly are independent of tree structure details, but less
> often does one find that the statistical confidence intervals are
> independent of tree details. I'll briefly describe the good and the ugly
> parameter properties.
> A. If you can accurately guess the MRCA's founding haplotype of a clade,
> it turns out that the expected value for the average variance of the N
> final haplotypes from the founding haplotype of the clade tree is
> INDEPENDENT of the tree structure other than its TMRCA; and
> <TMRCA> = Sum of STR Variances / Sum of STR mutation rates.
> B. If you have sample populations of haplotypes from two clades with
> interclade MRCA earlier than MRCAs of both individual clades, then the
> expected value of the variances between any and all pairs of haplotypes,
> one taken from one clade and the other taken from the other clade, is
> INDEPENDENT of the tree structure other than the TMRCA of the interclade
> node ancestral to both clades, and that
> interclade <TMRCA> = Sum of interclade STR Variances / (2 x Sum of STR
> mutation rates). No founding haplotype need be guessed in this parameter
> estimation: all variances are between present day pairs of haplotypes
> which we see.
> C. If you evaluate the average sum of STR variances between the N(N-1)/2
> pairs of your N sample haplotypes of a single clade, and divide by the sum
> of mutation rates, you get a Coalescence age (also called Expansion age)
> whose meaning is DEPENDENT on the tree structure details; this age
> estimate is of something necessarily younger than the clade's MRCA age.
> D. All statistical confidence intervals for the above estimates are
> DEPENDENT on the tree structure details which we don't know. However, the
> statistical confidence interval for B above can rather simply have a
> conservative upper limit determined in terms of the tree depth.
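
Estimators A and B above can be written down directly. The following Python sketch is an illustration only: it assumes "variance" means the squared difference in repeat counts at a marker, averaged over haplotypes (A) or over cross-clade pairs (B), and it uses made-up 3-marker haplotypes and mutation rates. The function names are hypothetical, and results are in generations.

```python
def tmrca_from_founder(haplotypes, founder, rates):
    """Estimator A: sum over markers of the mean squared deviation from the
    founding haplotype, divided by the sum of the mutation rates."""
    n = len(haplotypes)
    summed_variance = 0.0
    for m, founder_allele in enumerate(founder):
        summed_variance += sum((h[m] - founder_allele) ** 2 for h in haplotypes) / n
    return summed_variance / sum(rates)


def interclade_tmrca(clade_a, clade_b, rates):
    """Estimator B: sum over markers of the mean squared difference over all
    cross-clade pairs, divided by twice the sum of the mutation rates."""
    n_pairs = len(clade_a) * len(clade_b)
    summed_variance = 0.0
    for m in range(len(rates)):
        summed_variance += sum((a[m] - b[m]) ** 2 for a in clade_a for b in clade_b) / n_pairs
    return summed_variance / (2.0 * sum(rates))


# Hypothetical per-generation mutation rates for three markers.
rates = [0.002, 0.004, 0.004]

# Estimator A on a toy clade with an assumed founding haplotype.
founder = (13, 24, 14)
clade = [(13, 24, 14), (13, 25, 14), (12, 24, 14), (13, 24, 15)]
age_a = tmrca_from_founder(clade, founder, rates)

# Estimator B on two toy clades; no founding haplotype is needed.
clade_a = [(13, 24, 14), (13, 25, 14)]
clade_b = [(14, 24, 15), (14, 24, 16)]
age_b = interclade_tmrca(clade_a, clade_b, rates)

print(round(age_a, 1), round(age_b, 1))  # prints: 75.0 200.0
```

The factor of 2 in B reflects that the two haplotypes of a pair sit at the ends of two separate lines of descent from the interclade node, both of which accumulate mutations.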
> For those who use GDs rather than variances, the above discussion is
> applicable as well, though slightly modifiable in some details. In fact,
> for trees that are not too old, even "young ancient" trees, GD counts are
> very close to the same as variance measures.
> So for ancient trees we can not count the number of STR mutations that
> happened or where those mutations happened in the tree. We can only assume
> a generic structure or class of structures for the tree, then acquire
> the probabilistic distributions for the number and locations of the STR
> mutations, and then obtain distributions for parameters of interest about
> the tree which depend on the first-said distributions about how many and
> where the STR mutations occurred.
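
The procedure just described (assume a structure, insert mutations at random, collect the distribution of a parameter) can be sketched as a small simulation. This Python example is a toy under loud assumptions: single-step symmetric mutations, one hypothetical rate of 0.01 for all markers, a known founding haplotype, and two extreme structures of the same depth of 50 generations. In the "star" tree all 10 lineages split at the founder; in the "late split" tree they share a trunk for 45 generations. The age is estimated from the variance relative to the founder.

```python
import random
import statistics


def net_change(generations, rate, rng):
    """Net repeat change of one STR along one line of descent: in each
    generation it steps +1 or -1 with total probability `rate`."""
    change = 0
    for _ in range(generations):
        if rng.random() < rate:
            change += rng.choice((-1, 1))
    return change


def estimate_age(deviations, rate):
    """Sum over markers of the mean squared deviation from the founder,
    divided by the sum of the mutation rates (all markers share `rate`)."""
    n_haps = len(deviations)
    n_markers = len(deviations[0])
    summed_variance = sum(d * d for hap in deviations for d in hap) / n_haps
    return summed_variance / (n_markers * rate)


def simulate(n_haps, n_markers, depth, shared, rate, rng):
    """One random tree outcome: a trunk of `shared` generations common to
    all haplotypes, then independent lineages down to the present."""
    trunk = [net_change(shared, rate, rng) for _ in range(n_markers)]
    deviations = [
        [trunk[k] + net_change(depth - shared, rate, rng) for k in range(n_markers)]
        for _ in range(n_haps)
    ]
    return estimate_age(deviations, rate)


rng = random.Random(1)  # fixed seed so the run is repeatable
N_HAPS, N_MARKERS, DEPTH, RATE, TRIALS = 10, 10, 50, 0.01, 300

star = [simulate(N_HAPS, N_MARKERS, DEPTH, 0, RATE, rng) for _ in range(TRIALS)]
late = [simulate(N_HAPS, N_MARKERS, DEPTH, 45, RATE, rng) for _ in range(TRIALS)]

for name, ages in (("star", star), ("late split", late)):
    print(name, "mean:", round(statistics.mean(ages), 1),
          "sd:", round(statistics.stdev(ages), 1))
```

Both structures should give an average estimate near the true depth of 50 generations, while the spread of individual outcomes is much wider for the late-split tree, because its haplotypes are not independent samples of the mutation process.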
> Why are expected or average values of distributions so central in all
> this? Because of the central limit theorem of applied probability math.
> When extended haplotypes with many, many STRs are used in a combined
> fashion, the many independent stochastic distributions of the STRs can be
> combined to form a collective distribution for outcomes of interest which
> is much better behaved, with the most likely outcomes being squished
> toward the expected values of collective parameters composed from the
> many STR distributions.
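
The effect of combining many STRs can be seen in a minimal Python sketch. It follows a single line of descent of 50 generations from a known founder, with single-step symmetric mutations and one hypothetical rate of 0.01 for every marker, and compares the spread of the age estimate when 1, 10 or 100 markers are combined. All names and numbers are illustrative assumptions.

```python
import random
import statistics


def net_change(generations, rate, rng):
    """Net repeat change of one STR along one line of descent: in each
    generation it steps +1 or -1 with total probability `rate`."""
    change = 0
    for _ in range(generations):
        if rng.random() < rate:
            change += rng.choice((-1, 1))
    return change


def age_from_markers(n_markers, depth, rate, rng):
    """Age estimate from one haplotype: summed squared deviation from the
    founder over all markers, divided by the summed mutation rate."""
    squared = [net_change(depth, rate, rng) ** 2 for _ in range(n_markers)]
    return sum(squared) / (n_markers * rate)


rng = random.Random(2)  # fixed seed so the run is repeatable
DEPTH, RATE, TRIALS = 50, 0.01, 300

spread = {}
for n_markers in (1, 10, 100):
    ages = [age_from_markers(n_markers, DEPTH, RATE, rng) for _ in range(TRIALS)]
    spread[n_markers] = statistics.stdev(ages)
    print(n_markers, "markers -> mean:", round(statistics.mean(ages), 1),
          "sd:", round(spread[n_markers], 1))
```

With independent markers the spread of the combined estimate should shrink roughly as one over the square root of the number of markers, which is the squeezing toward the expected value described above.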

> P.S. Genealogical trees for which the genealogical information permits
> robust determination of the tree structure can allow you to make informed
> guesses about the number of and placements of the STR mutations that
> happened in the tree.
