GENEALOGY-DNA-L Archives



From: "Ken Nordtvedt" <>
Subject: [DNA] Y Tree SNPs can not be counted
Date: Fri, 12 Feb 2010 17:57:44 -0700


> Since some have recently and repeatedly said that they make y tree age
> estimates, and the statistical confidence intervals for those estimates, by
> "counting mutations" in the tree (though I think they mean something quite
> different), I want to document the reasons why it is impossible to count
> mutations in a y tree such as one for the I1 haplogroup. It would not be
> good if this repeated claim led people to believe that proper TMRCA
> estimates for ancient clades involve trying to do the impossible: counting
> mutations. (See the P.S. for young genealogical trees.) There are multiple
> reasons why counting mutations in a y tree is impossible. What we can
> count are features seen in the tree's resulting haplotypes in hand.
>
> 1. A y tree from a single clade MRCA to the N present descending
> haplotypes of one's representative sample has a structure determined
> solely by the demographics through all of its history. First we have the
> full tree for the entire clade population, which might number in the
> millions, and then we have the particular tree for the sample of the
> present clade population one has taken. The latter tree is some pruning of
> the former tree; if the sampling is of good quality, its tree will capture
> important features of the full tree for the entire clade.
>
> Even if SNPs and STRs did not exist, that y tree structure would exist,
> whatever it is. How many male children did clade males have through time
> who in turn produced next-generation males, in the face of all the forces
> which acted on the clade? Y mutations had nothing to do with what tree
> structure resulted. Births and deaths did. The mutations are only
> fortunate tools permitting us to peer back into the heart of the tree.
>
> So the number of father/son transitions which took place in the tree
> (sample population tree or full tree) is unknown, and the locations of all
> the nodes in the tree (places where a father and two or more sons are part
> of the tree) are unknown. When mutations are eventually considered, we
> don't have a specific count of the number of chances each mutation has had
> to happen in the tree. But perhaps more importantly, because the node
> structure of the tree is unknown, and the locations where the mutations
> took place are unknown, we can not say how many final haplotypes in our
> sample will show the consequences of most of the mutations which do happen
> in the tree. All nodes in the tree downstream of any mutation enlarge the
> number of final haplotypes in our sample which show that particular
> mutation.
>
> A theory of the entire demographic history of the clade is a very iffy
> thing. And even then, some of the tree structure will be the result of
> pure luck (such and such males had mostly sons or mostly daughters, etc.),
> facts not determinable by the demographic theory.
>
> So the basic tree structure is unknown, not just its total time depth to
> origin. That time depth is perhaps the tree parameter standing the best
> chance of being decently estimated from our final evidence, the N
> haplotypes.
>
> 2. Now we consider throwing STR mutations into the tree. Unlike the
> unknown demographic history, we do have a theory of occurrences of the STR
> mutations. At every father/son transition in the tree each STR will mutate
> in some way with some tiny probability, and with the remaining large
> probability stay the same. Because this behavior is probabilistic, it
> results in two uncertainties about the occurrences of mutations in the
> tree. How many times did each STR mutate in the tree? Where did those
> mutations take place in the tree? We can calculate only the statistical
> answers to these questions and get distributions of outcomes. Since nature
> has performed this tree once for the case at hand, it is just one specific
> draw from the probabilistic distribution of outcomes. Example: A clade
> tree of about 140 generations in age ending with a sample population of 64
> could have in the ball park of 3000 father/son transitions in it. A fast
> STR with mutation rate of 1/150 would on average have mutated about 20
> times in the tree. But the 2-sigma statistical confidence interval for the
> number of occurrences of that STR mutation spans the range of about 12 to
> 30 occurrences. And to make matters even more uncountable, about half the
> mutations would have been up and half down, but the actual up/down split
> is also subject to a statistical confidence interval. And so on for each
> STR of our haplotypes. So even if we knew the tree structure (which we
> don't) we would not know the actual number of occurrences of each STR's
> mutations in the tree; we can only calculate, or determine by simulations,
> the distributions for those numbers, including the average or expected
> values of those distributions.
>
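The 140-generation example above can be checked with a small simulation. This is only a sketch, assuming each of the roughly 3000 father/son transitions is an independent trial at rate 1/150; the function names and the number of trials are illustrative choices, not anything from the post.

```python
import math
import random


def simulate_mutation_counts(transitions=3000, rate=1 / 150, trials=1000, seed=1):
    """Replay the same tree size many times and count mutations of one STR.

    Assumption: every father/son transition is an independent Bernoulli
    trial with the same mutation probability `rate`.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        hits = sum(1 for _ in range(transitions) if rng.random() < rate)
        counts.append(hits)
    return counts


def two_sigma_interval(transitions=3000, rate=1 / 150):
    """Binomial mean and a symmetric 2-sigma range for the mutation count."""
    mean = transitions * rate
    sigma = math.sqrt(transitions * rate * (1 - rate))
    return mean, mean - 2 * sigma, mean + 2 * sigma


if __name__ == "__main__":
    mean, low, high = two_sigma_interval()
    print(f"expected count {mean:.1f}, 2-sigma range {low:.1f} to {high:.1f}")
    counts = simulate_mutation_counts()
    print(f"simulated min {min(counts)}, max {max(counts)}")
```

Under the symmetric normal approximation the range comes out near 11 to 29, in the same ball park as the 12 to 30 quoted above; the true distribution is slightly skewed toward higher counts. Nature ran the tree once, so the real count is a single draw from this spread.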
> Secondly, the locations of each STR's multiple mutations are the outcome
> of a random process. Each father/son transition stands an equal chance of
> being the site. So that fast STR with an expected 20 mutations in the tree
> has those 20 locations sprinkled randomly over the 3000 or so locations in
> the tree. The consequences of the location distribution are immense, as
> are the impacts on the resulting haplotypes.
>
> If an STR mutation happens to occur in the earliest branch segments from
> the two sons of the clade MRCA, such a mutation on average will appear in
> (affect) 50 percent of the final haplotype sample population. But if, at
> the other extreme, it occurs in one of those last N branch segments of the
> tree which terminate with one of the N haplotypes of the sample
> population, that mutation will appear in (affect) just the single
> haplotype at the terminal. So a mutation has a potency on average of
> affecting from 50 percent down to 1/N of the observed haplotypes. So there
> is no way to go backwards in this process from the observed haplotypes and
> count occurrences of the underlying STR mutation from what's seen in the
> haplotypes. And of course that 50 percent consequence of a very early STR
> mutation is itself only an average or expected value; we'd have to know
> the tree structure in detail to know the true fraction of sample
> haplotypes affected by it; that fraction depends on what fraction of the
> final haplotypes descend from each of the sons of the MRCA, and which
> son's descending branch segment had the STR mutation in question.
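How strongly the location of a mutation controls its footprint can be illustrated on a made-up tree. The sketch below assumes a topology built by randomly merging pairs of lineages backwards in time until only the two sons of the MRCA remain; that construction, the function name, and the seed are assumptions for illustration, since the real tree structure is exactly what is unknown. Branch lengths (the number of father/son transitions on each branch) are ignored.

```python
import random


def random_tree_leaf_counts(n_leaves=64, seed=0):
    """Build a random binary topology and report, for every branch, how many
    sampled haplotypes descend from it.

    Returns (counts for all branches, counts for the two branches that
    leave the MRCA). A mutation on a branch shows up in exactly that many
    of the sampled haplotypes.
    """
    rng = random.Random(seed)
    lineages = [1] * n_leaves        # each terminal branch carries one haplotype
    branch_counts = [1] * n_leaves   # record terminal branches first
    while len(lineages) > 2:
        i, j = rng.sample(range(len(lineages)), 2)   # pick two lineages to join
        merged = lineages[i] + lineages[j]
        lineages = [c for k, c in enumerate(lineages) if k not in (i, j)]
        lineages.append(merged)
        branch_counts.append(merged)  # the branch above the new node
    return branch_counts, lineages


if __name__ == "__main__":
    n = 64
    counts, root_branches = random_tree_leaf_counts(n)
    print("fractions under the two sons of the MRCA:",
          [c / n for c in root_branches])
    print("branches affecting a single haplotype:", counts.count(1))
```

The two fractions under the sons of the MRCA always average 50 percent, but in any one tree they can be very unequal, which is the point made above about needing the detailed structure to know the true fraction.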
>
> But fortunately, because we have a theory of the probabilistic occurrences
> of these STR mutations, we can determine by analytic calculation or by PC
> simulations the distribution of outcomes for various properties of the
> tree, including their expected values and statistical confidence
> intervals. In other words, analytics or PC simulations run through a
> tremendous number of cases can essentially consider all possible
> probabilistic insertions of the STR mutations into any given tree
> structure and produce the distributions of outcomes for key parameters.
>
> When the smoke clears, some of these parameter distributions have expected
> values which amazingly are independent of tree structure details, but less
> often does one find that the statistical confidence intervals are
> independent of tree details. I'll briefly describe the good and the ugly
> parameter properties.
>
> A. If you can accurately guess the MRCA's founding haplotype of a clade,
> it turns out that the expected value for the average variance of the N
> final haplotypes from the founding haplotype of the clade tree is
> INDEPENDENT of the tree structure other than its TMRCA; and
> <TMRCA> = Sum of STR Variances / Sum of STR mutation rates.
>
> B. If you have sample populations of haplotypes from two clades with an
> interclade MRCA earlier than the MRCAs of both individual clades, then the
> expected value of the variances between any and all pairs of haplotypes,
> one taken from one clade and the other taken from the other clade, is
> INDEPENDENT of the tree structure other than the TMRCA of the interclade
> node ancestral to both clades, and that interclade
> <TMRCA> = Sum of interclade STR Variances / (2 x Sum of STR mutation rates).
> No founding haplotype need be guessed in this parameter estimation: all
> variances are between present day pairs of haplotypes which we see.
>
> C. If you evaluate the average sum of STR variances between the N(N-1)/2
> pairs of your N sample haplotypes of a single clade, and divide by the sum
> of STR mutation rates, you get a Coalescence age (also called Expansion
> age) whose meaning is DEPENDENT on the tree structure details; this age
> estimate is of something necessarily younger than the clade's MRCA age.
>
> D. All statistical confidence intervals for the above estimates are
> DEPENDENT on the tree structure details which we don't know. However, the
> statistical confidence interval for B above can rather simply have a
> conservative upper limit determined in terms of the tree depth.
>
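The two formulas in items A and B translate directly into a few lines of code. The sketch below assumes that a haplotype is a list of STR repeat counts, that "variance" means the mean squared difference in repeat count, and that mutation rates are per generation, so the results are in generations. The function names, haplotypes, and rates are made up for illustration.

```python
from itertools import product


def squared_distance(h1, h2):
    """Sum over STRs of the squared difference in repeat counts."""
    return sum((a - b) ** 2 for a, b in zip(h1, h2))


def tmrca_from_founder(haplotypes, founder, rates):
    """Item A: average variance from a guessed founding haplotype,
    divided by the sum of STR mutation rates."""
    total_var = sum(squared_distance(h, founder) for h in haplotypes) / len(haplotypes)
    return total_var / sum(rates)


def interclade_tmrca(clade_a, clade_b, rates):
    """Item B: average variance over all cross-clade pairs,
    divided by twice the sum of STR mutation rates."""
    pairs = list(product(clade_a, clade_b))
    total_var = sum(squared_distance(a, b) for a, b in pairs) / len(pairs)
    return total_var / (2 * sum(rates))


if __name__ == "__main__":
    rates = [0.002, 0.004, 0.002]      # hypothetical per-generation rates
    founder = [14, 23, 10]             # hypothetical guessed founding haplotype
    sample = [[14, 23, 10], [15, 23, 10], [14, 22, 10], [14, 23, 11]]
    print(tmrca_from_founder(sample, founder, rates))   # about 93.75 generations

    clade_a = [[14, 23, 10], [15, 23, 10]]
    clade_b = [[16, 23, 10], [16, 24, 10]]
    print(interclade_tmrca(clade_a, clade_b, rates))    # about 187.5 generations
```

These return only the expected-value estimates. As item D says, the confidence intervals depend on tree structure and are not addressed by this sketch.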
> For those who use GDs rather than variances, the above discussion applies
> as well, with slight modifications in some details. In fact, for trees
> that are not too old, even "young ancient" trees, GD counts are very
> close to the same as variance measures.
>
> So for ancient trees we can not count the number of STR mutations that
> happened or where those mutations happened in the tree. We can only assume
> a generic structure or class of structures for the tree, then acquire the
> probabilistic distributions for the number and locations of the STR
> mutations, and then obtain distributions for parameters of interest about
> the tree which depend on those first distributions of how many STR
> mutations occurred and where.
>
> Why are expected or average values of distributions so central in all
> this? Because of the central limit theorem of applied probability math.
> When extended haplotypes with many, many STRs are used in a combined
> fashion, the many independent stochastic distributions of the STRs can be
> combined to form a collective distribution for outcomes of interest which
> is much better behaved, with the most likely outcomes being squished
> toward the expected values of collective parameters composed from the many
> STR distributions.
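The squishing toward expected values can be seen in a toy simulation. This sketch follows a single lineage (not a full tree) for 140 generations, assumes every STR has the same rate of 1/150 with symmetric one-step changes, and forms a variance-based age estimate from however many STRs are combined. All names and parameter values are illustrative assumptions.

```python
import random
import statistics


def net_change(generations, rate, rng):
    """Net repeat-count change of one STR along one lineage,
    assuming symmetric +1/-1 steps at each mutation."""
    change = 0
    for _ in range(generations):
        if rng.random() < rate:
            change += rng.choice((-1, 1))
    return change


def tmrca_estimates(n_strs, generations=140, rate=1 / 150, trials=200, seed=7):
    """Repeat the experiment `trials` times; each time estimate the age as
    (sum of squared changes over the STRs) / (sum of mutation rates)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        total_sq = sum(net_change(generations, rate, rng) ** 2 for _ in range(n_strs))
        estimates.append(total_sq / (n_strs * rate))
    return estimates


def relative_spread(estimates):
    """Standard deviation of the estimates as a fraction of their mean."""
    return statistics.pstdev(estimates) / statistics.mean(estimates)


if __name__ == "__main__":
    for n_strs in (5, 20, 80):
        est = tmrca_estimates(n_strs)
        print(n_strs, "STRs: mean", round(statistics.mean(est), 1),
              "relative spread", round(relative_spread(est), 2))
```

The mean estimate stays near the true 140 generations regardless of how many STRs are used, while the relative spread shrinks as more STRs are combined.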

> P.S. Genealogical trees for which the genealogical information permits
> robust determination of the tree structure can allow you to make informed
> guesses about the number of and placements of the STR mutations that
> happened in the tree.


