From: "Ken Nordtvedt" <>
Subject: [DNA] Y tree STR Mutations can not be counted
Date: Fri, 12 Feb 2010 18:00:27 -0700
References: <003d01caac47$8e3f5090$5e82af48@Ken1>

Title should have read: Y Tree STR Mutations can not be counted.

----- Original Message -----
From: "Ken Nordtvedt" <>
To: <>
Sent: Friday, February 12, 2010 5:57 PM
Subject: [DNA] Y Tree SNPs can not be counted

>> Since some have recently been saying repeatedly that they make y tree age
>> estimates, and the statistical confidence intervals of those estimates, by
>> "counting mutations" in the tree (though I think they mean something quite
>> different), I want to document the reasons why it is impossible to count
>> mutations in a y tree such as one for the I1 haplogroup, etc. It would not
>> be good if this repeated claim led people to believe that proper TMRCA
>> estimates for ancient clades involve trying to do the impossible ---
>> counting mutations (see the P.S. for young genealogical trees). The reasons
>> that counting mutations in a y tree is impossible are multiple. What we can
>> count are features seen in the tree's resulting haplotypes in hand.
>> 1. A y tree from a single clade MRCA to the N present descending
>> haplotypes of one's representative sample has a structure determined
>> solely by the demographics through all of its history. First we have the
>> full tree for the entire clade population, which might be millions, and
>> then we have the particular tree for the sampling of the present clade
>> population one has taken. The latter tree is some pruning of the former
>> tree; if the sampling has quality, its tree will capture important
>> features of the full tree for the entire clade.
>> Even if SNPs and STRs did not exist, that y tree structure exists,
>> whatever it is. How many male children did clade males have through time
>> which produced next generation males, etc., in the face of all the forces
>> which acted on the clade? Y mutations had nothing to do with what tree
>> structure resulted. Births and deaths did. The mutations are only
>> fortunate tools permitting us to peer back into the heart of the tree.
>> So the number of father/son transitions which took place in the tree
>> (sample population tree or full tree) is unknown, and the locations of
>> all the nodes in the tree --- places where a father and two or more sons
>> are part of the tree --- are unknown. When mutations are eventually
>> considered, we don't have a specific count of the number of chances each
>> mutation has had to happen in the tree. But perhaps more importantly,
>> because the node structure of the tree is unknown, and the locations
>> where the mutations took place are unknown, we can not say how many final
>> haplotypes in our sample will show the consequences of most of the
>> mutations which do happen in the tree. All nodes in the tree downstream
>> of any mutation enlarge the number of final haplotypes in our sample
>> which show the particular mutation.
>> A theory of the entire demographic history of the clade is a very iffy
>> thing. And even then, some of the tree structure will be the result of
>> ultimate luck --- such and such males had mostly sons or mostly
>> daughters, etc. --- facts not determinable by the demographic theory.
>> So the basic tree structure is unknown, not just its total time depth to
>> origin --- this specifically mentioned tree property being perhaps the
>> tree parameter standing the best chance of decently being estimated from
>> our final evidence --- the N haplotypes.
>> 2. Now we consider throwing STR mutations into the tree. Unlike the
>> unknown demographic history, we do have a theory of occurrences of the
>> STR mutations.
>> At every father/son transition in the tree each STR will mutate in some
>> way with some tiny probability, and with the remaining large probability
>> stay the same. Because this behavior is probabilistic, it results in two
>> uncertainties about the occurrences of mutations in the tree. How many
>> times did each STR mutate in the tree? Where did those mutations take
>> place in the tree? We can calculate only the statistical answers to
>> these questions and get distributions of outcomes. Since nature has
>> performed this tree once for the case at hand, it is just a specific
>> case from the probabilistic distribution of outcomes. Example: A clade
>> tree of about 140 generations in age ending with a sample population of
>> 64 could have in the ball park of 3000 father/son transitions in it. A
>> fast STR with a mutation rate of 1/150 would on average have mutated
>> about 20 times in the tree. But the 2-sigma statistical confidence
>> interval for the number of occurrences of that STR mutation spans the
>> range of about 12 to 30 occurrences of the mutation. And to make matters
>> even more uncountable, about half the mutations would have been up and
>> half down, but the actual up/down split is also subject to a statistical
>> confidence interval. And so on for each STR of our haplotypes.
>> So even if we knew the tree structure (which we don't), we would not know
>> the actual number of occurrences of each STR's mutations in the tree; we
>> can only calculate or determine by simulations the distributions for
>> those numbers, including the average or expected values of those
>> distributions.
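A rough check of the arithmetic in the example above, as a minimal sketch only. It assumes every father/son transition is an independent trial (a binomial model) and uses a symmetric 2-sigma band, so it lands slightly lower than the quoted "about 12 to 30". The function names, and the equal up/down probability, are assumptions made for illustration.

```python
import math


def mutation_count_summary(transitions, rate):
    """Mean, standard deviation and 2-sigma range for how many times one STR
    mutates, treating every father/son transition as an independent trial."""
    mean = transitions * rate
    sd = math.sqrt(transitions * rate * (1.0 - rate))  # binomial spread
    return mean, sd, mean - 2.0 * sd, mean + 2.0 * sd


def up_count_sd(mutations):
    """Standard deviation of the number of 'up' steps among a given number of
    mutations, assuming (for illustration) up and down are equally likely."""
    return math.sqrt(mutations * 0.5 * 0.5)


# Figures from the example in the post: about 3000 transitions, rate 1/150.
mean, sd, low, high = mutation_count_summary(3000, 1.0 / 150.0)
print(f"expected mutations: {mean:.1f}, sd: {sd:.2f}")
print(f"2-sigma range: {low:.1f} to {high:.1f}")
print(f"sd of the up count among 20 mutations: {up_count_sd(20):.2f}")
```

The point of the sketch is only that the count itself is a random quantity with a wide spread, even before the tree structure enters.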
>> Secondly, the locations of each STR's multiple mutations are the outcome
>> of a random process. Each father/son transition stands equal chance of
>> being the site. So that fast STR with an expected 20 mutations in the
>> tree has those 20 locations sprinkled randomly over the 3000 or so
>> locations in the tree. The consequences of location distribution are
>> immense, as are the impacts on the resulting haplotypes.
>> If an STR mutation happens to occur in the earliest branch segments from
>> the two sons of the clade MRCA, such a mutation on average will appear in
>> (affect) 50 percent of the final haplotype sample population. But if, in
>> the other extreme, it occurs in one of those last N branch segments of
>> the tree which terminate with one of the N haplotypes of the sample
>> population, that mutation will appear in (affect) just the single
>> haplotype at the terminal. So a mutation has the potency on average of
>> affecting from 50 percent down to 1/N of the observed haplotypes. So
>> there is no way to go backwards in this process from the observed
>> haplotypes and count occurrences of the underlying STR mutation from
>> what's seen in the haplotypes. And of course that 50 percent consequence
>> of a very early STR mutation is itself only an average or expected value;
>> we'd have to know the tree structure in detail to know the true fraction
>> of sample haplotypes affected by it; that fraction depends on what
>> fraction of the final haplotypes descend from each of the sons of the
>> MRCA, and which son's descending branch segment had the STR mutation in
>> question.
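To make the "50 percent down to 1/N" range concrete, here is a small sketch that builds one random binary tree and lists, for every branch, what fraction of the sample would carry a mutation placed on that branch. The tree-building rule (merge two randomly chosen lineages at a time) is an assumption chosen for simplicity, not a claim about real clade demographics, and branch lengths in generations are ignored. `branch_leaf_counts` is a made-up name.

```python
import random


def branch_leaf_counts(n_leaves, rng):
    """Build one random binary tree by merging random pairs of lineages
    backwards in time. Returns (counts, root_split): counts holds, for every
    branch, how many sample haplotypes descend from it; root_split holds the
    counts under the two sons of the MRCA."""
    lineages = [1] * n_leaves   # leaf counts of the lineages still unmerged
    counts = [1] * n_leaves     # the N terminal branches each carry one leaf
    while len(lineages) > 2:
        a, b = rng.sample(range(len(lineages)), 2)
        merged = lineages[a] + lineages[b]
        lineages = [c for i, c in enumerate(lineages) if i not in (a, b)]
        lineages.append(merged)
        counts.append(merged)   # branch above the newly formed node
    return counts, tuple(lineages)


rng = random.Random(1)          # fixed seed so the run is repeatable
n = 64
counts, root_split = branch_leaf_counts(n, rng)
fractions = sorted(c / n for c in counts)
print("sons of the MRCA carry", root_split, "of", n, "haplotypes")
print("smallest fraction affected:", fractions[0])
print("largest fraction affected:", fractions[-1])
```

In any single tree the split under the two sons of the MRCA is whatever it happens to be; 50 percent is only the average over many such trees.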
>> But fortunately, because we have a theory of the probabilistic
>> occurrences of these STR mutations, we can determine by analytic
>> calculation or by PC simulations the distribution of outcomes for various
>> properties of the tree, including their expected values and statistical
>> confidence intervals. In other words, analytics or PC simulations through
>> a tremendous number of cases can essentially consider all possible
>> probabilistic insertions of the STR mutations into any given tree
>> structure and produce the distributions of outcomes for key parameters.
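A toy version of such a PC simulation, offered as a sketch under loudly stated assumptions: the tree is a balanced binary tree with equal branch lengths, every STR follows a symmetric single-step model with the same rate, and the founder value is known (set to 0). None of these choices come from the post; they are hypothetical and only keep the example short. Each run sprinkles mutations into the same tree and records the variance-based age estimate.

```python
import random
import statistics


def mutate(value, generations, rate, rng):
    """Pass one STR down `generations` father/son transitions."""
    for _ in range(generations):
        if rng.random() < rate:
            value += rng.choice((-1, 1))   # single step, up or down
    return value


def leaf_values(value, levels, branch_gens, rate, rng):
    """STR values at the tips of a balanced binary tree below `value`."""
    if levels == 0:
        return [value]
    tips = []
    for _ in range(2):                     # two sons at every node
        son = mutate(value, branch_gens, rate, rng)
        tips += leaf_values(son, levels - 1, branch_gens, rate, rng)
    return tips


def age_estimate(levels, branch_gens, rates, rng):
    """Sum of STR variances about the founder / sum of mutation rates."""
    total = 0.0
    for rate in rates:
        tips = leaf_values(0, levels, branch_gens, rate, rng)  # founder = 0
        total += sum(t * t for t in tips) / len(tips)
    return total / sum(rates)


rng = random.Random(2010)
rates = [0.005] * 10            # hypothetical: ten STRs, same rate
levels, branch_gens = 4, 25     # hypothetical tree: 16 tips, 100 generations
estimates = [age_estimate(levels, branch_gens, rates, rng) for _ in range(200)]
mean = statistics.mean(estimates)
sd = statistics.stdev(estimates)
print(f"true depth: {levels * branch_gens} generations")
print(f"mean estimate: {mean:.0f}, sd over runs: {sd:.0f}")
```

The run-to-run spread is the simulated distribution of outcomes; swapping in a different tree shape would change that spread, which is the tree dependence of the confidence intervals.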
>> When the smoke clears, some of these parameter distributions have
>> expected values which amazingly are independent of tree structure
>> details, but less often does one find that the statistical confidence
>> intervals are independent of tree details. I'll briefly describe the good
>> and the ugly parameter properties.
>> A. If you can accurately guess the MRCA's founding haplotype of a clade,
>> it turns out that the expected value for the average variance of the N
>> final haplotypes from the founding haplotype of the clade tree is
>> INDEPENDENT of the tree structure other than its TMRCA; and
>> <TMRCA> = Sum of STR Variances / Sum of STR mutation rates.
>> B. If you have sample populations of haplotypes from two clades with
>> interclade MRCA earlier than the MRCAs of both individual clades, then
>> the expected value of the variances between any and all pairs of
>> haplotypes, one taken from one clade and the other taken from the other
>> clade, is INDEPENDENT of the tree structure other than the TMRCA of the
>> interclade node ancestral to both clades, and that interclade
>> <TMRCA> = Sum of interclade STR Variances / (2 x Sum of STR mutation rates).
>> No founding haplotype need be guessed in this parameter estimation: all
>> variances are between present day pairs of haplotypes which we see.
>> C. If you evaluate the average sum of STR variances between the N(N-1)/2
>> pairs of your N sample haplotypes of a single clade, and divide by the
>> sum of STR mutation rates, you get a Coalescence age (also called
>> Expansion age) whose meaning is DEPENDENT on the tree structure details;
>> this age estimate is of something necessarily younger than the clade's
>> MRCA age.
>> D. All statistical confidence intervals for the above estimates are
>> DEPENDENT on the tree structure details which we don't know. However, the
>> statistical confidence interval for B above can rather simply have a
>> conservative upper limit determined in terms of the tree depth.
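The estimates in A and B can be sketched in a few lines. This is an illustration under assumptions, not the author's own procedure: "variance" is taken here to mean the average squared difference in repeat counts, mutation rates are per generation (so ages come out in generations), and the haplotypes and rates below are invented toy numbers. Item C is left out because its normalisation depends on how the pairwise variance is defined.

```python
def tmrca_from_founder(haplotypes, founder, rates):
    """Item A: sum over STRs of the average squared deviation from the
    founding haplotype, divided by the sum of the mutation rates."""
    n = len(haplotypes)
    variance_sum = 0.0
    for i in range(len(rates)):
        variance_sum += sum((h[i] - founder[i]) ** 2 for h in haplotypes) / n
    return variance_sum / sum(rates)


def interclade_tmrca(clade_1, clade_2, rates):
    """Item B: sum over STRs of the average squared difference across all
    cross-clade pairs, divided by twice the sum of the mutation rates."""
    pairs = [(a, b) for a in clade_1 for b in clade_2]
    variance_sum = 0.0
    for i in range(len(rates)):
        variance_sum += sum((a[i] - b[i]) ** 2 for a, b in pairs) / len(pairs)
    return variance_sum / (2.0 * sum(rates))


# Invented two-STR toy data, purely to exercise the functions.
rates = [0.002, 0.002]
founder = [10, 12]
clade = [[10, 12], [11, 12], [10, 13], [9, 12]]
age_a = tmrca_from_founder(clade, founder, rates)

clade_1 = [[10, 12], [11, 12]]
clade_2 = [[13, 12], [12, 14]]
age_b = interclade_tmrca(clade_1, clade_2, rates)
print(age_a, age_b)
```

The factor of 2 in B reflects that two lines of descent, one in each clade, have been accumulating differences since the interclade node.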
>> For those who use GDs rather than variances, the above discussion is
>> applicable as well, though slightly modifiable in some details. In fact,
>> for trees that are not too old, even "young ancient" trees, GD counts are
>> very close to the same as variance measures.
>> So for ancient trees we can not count the number of STR mutations that
>> happened or where those mutations happened in the tree. We can only
>> assume a generic structure or class of structures for the tree, then
>> acquire the probabilistic distributions for the number and locations of
>> the STR mutations, and then obtain distributions for parameters of
>> interest about the tree which depend on the first-said distributions
>> about how many and where the STR mutations occurred.
>> Why are expected or average values of distributions so central in all
>> this? Because of the central limit theorem of applied probability math.
>> When extended haplotypes with many, many STRs are used in a combined
>> fashion, the many independent stochastic distributions of the STRs can be
>> combined to form a collective distribution for outcomes of interest which
>> is much better behaved, with the most likely outcomes being squished
>> toward the expected values of collective parameters composed from the
>> many STR distributions.
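The squeezing effect can be seen in a deliberately stripped-down sketch: a single line of descent (no tree at all), a symmetric single-step mutation model, and identical rates for every STR. All of these are simplifying assumptions for illustration, and the function names are made up. The relative spread of the combined age estimate is compared for a short and a long haplotype.

```python
import random
import statistics


def squared_drift(generations, rate, rng):
    """Squared net change of one STR along one line of descent."""
    value = 0
    for _ in range(generations):
        if rng.random() < rate:
            value += rng.choice((-1, 1))
    return value * value


def relative_spread(n_strs, generations, rate, trials, rng):
    """Std. dev. / mean of the age estimate built from n_strs STRs."""
    estimates = []
    for _ in range(trials):
        total = sum(squared_drift(generations, rate, rng) for _ in range(n_strs))
        estimates.append(total / (n_strs * rate))
    return statistics.stdev(estimates) / statistics.mean(estimates)


rng = random.Random(12)
few = relative_spread(4, 140, 1.0 / 150.0, 200, rng)
many = relative_spread(64, 140, 1.0 / 150.0, 200, rng)
print(f"relative spread with 4 STRs:  {few:.2f}")
print(f"relative spread with 64 STRs: {many:.2f}")
```

With independent STRs the relative spread should shrink roughly as one over the square root of the number of STRs, which is why long haplotypes give estimates that sit much closer to their expected values.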
>> P.S. Genealogical trees for which the genealogical information permits
>> robust determination of the tree structure can allow you to make informed
>> guesses about the number of and placements of the STR mutations that
>> happened in the tree.