**GENEALOGY-DNA-L Archives**

From: "Ken Nordtvedt" <>
Subject: [DNA] Y Tree SNPs can not be counted
Date: Fri, 12 Feb 2010 17:57:44 -0700

> Since some have recently been repeatedly saying they make y tree age

> estimates and their statistical confidence intervals of those estimates by

> "counting mutations" in the tree (though I think they mean something quite

> different), I want to document the reasons why it is impossible to count

> mutations in a y tree such as one for I1 haplogroup, etc. It would not be

> good if this repeated claim led people to believe that proper TMRCA

> estimates for ancient clades involve trying to do the impossible ---

> count mutations. (see P.S. for young genealogical trees) The reasons that

> counting mutations in a y tree is impossible are multiple. What we can

> count are features seen in the tree's resulting haplotypes in hand.

>

> 1. A y tree from a single clade MRCA to the N present descending

> haplotypes of one's representative sample has a structure determined

> solely by the demographics through all of its history. First we have the

> full tree for

> the entire clade population which might be millions, and then we have the

> particular tree for the sampling of the present

> clade population one has taken. The latter tree is some pruning of the

> former tree; if the sampling has quality, its tree will capture important

> features of the full tree for the entire clade.

> Even if SNPs and STRs did not exist, that

> y tree structure exists, whatever it is. How many male children did

> clade males have through time which produced next generation males, etc.

> in the face of all the forces which acted on the clade? Y mutations had

> nothing to do with what tree structure resulted. Births and deaths did.

> The mutations are only fortunate tools permitting us to peer back into the

> heart of the tree.

> So the number of father/son transitions which took place in the tree (sample

> population tree or full tree) is unknown, and the location of all the

> nodes in the tree --- places where a father and two or more sons are part

> of the tree --- are unknown. When mutations are eventually considered, we

> don't have a specific count of the number of chances each mutation has had

> to

> happen in the tree. But perhaps more importantly, because the node

> structure of the tree is unknown, and locations where the mutations took

> place are unknown, we can not say how many final haplotypes in our sample

> will show

> the consequences of most of the mutations which do happen in the tree. All

> nodes in the tree downstream of any mutation enlarge the

> number of final haplotypes in our sample which show the particular

> mutation.

> A theory of the

> entire demographic history of the clade is a very iffy thing. And even

> then, some of the tree structure will be the result of ultimate luck ---

> such and such males had mostly sons or mostly daughters, etc. --- facts

> not determinable by the demographic theory.

>

> So the basic tree structure is unknown, not just its total time depth to

> origin --- this specifically mentioned tree property being perhaps the

> tree parameter standing the best chance of decently being estimated from

> our final evidence --- the N haplotypes.

>

> 2. Now we consider throwing STR mutations into the tree. Unlike the

> unknown demographic history, we do have a theory of occurrences of the STR

> mutations.

> At every father/son transition in the tree each STR will mutate in some

> way with some tiny probability, and with the remaining large probability

> stay the same. Because this behavior is probabilistic, it results in two

> uncertainties about the occurrences of mutations in the tree. How many

> times did each STR mutate in the tree? Where did those

> mutations take place in the tree? We can calculate only the statistical

> answers to these questions and get distributions of outcomes. Since nature

> has performed this tree once for

> the case at hand, it is just a specific case from the probabilistic

> distribution of outcomes. Example: A clade tree of about 140 generations

> in age ending with sample population of 64 could have in the ball park of

> 3000 father/son transitions in it. A fast STR with mutation rate of 1/150

> on average would have mutated about 20 times in the tree. But the 2-sigma

> statistical confidence interval for number of occurrences of that STR

> mutation spans the range of about 12 to 30 occurrences of the mutation.

> And to make matters even more uncountable, about half the mutations would

> have been up and half down, but the actual up/down split is also subject

> to statistical confidence interval. And on for each STR of our haplotypes.

> So even if we knew the tree structure (which we don't) we would not know

> the actual number of occurrences of each STR's mutations in the tree; we

> can only calculate or determine by simulations the distributions for those

> numbers, including

> the average or expected values of those distributions.
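The arithmetic in the example above can be sketched as follows. This is a minimal illustration, not the author's own calculation: it assumes each of the roughly 3000 father/son transitions is an independent trial (a binomial model), that up and down steps are equally likely, and the function names are hypothetical. Under the symmetric normal approximation the 2-sigma range comes out at roughly 11 to 29; the exact distribution is skewed toward higher counts, which nudges the range toward the 12 to 30 quoted above.

```python
import math


def mutation_count_summary(transitions: int, rate: float) -> dict:
    """Mean, sigma and a rough 2-sigma range for how often one STR mutates.

    Assumption: every father/son transition is an independent trial with
    probability `rate` of a mutation (binomial model).
    """
    mean = transitions * rate
    sigma = math.sqrt(transitions * rate * (1.0 - rate))
    return {
        "mean": mean,
        "sigma": sigma,
        "low": mean - 2.0 * sigma,
        "high": mean + 2.0 * sigma,
    }


def up_count_sigma(n_mutations: int) -> float:
    """Sigma of the number of 'up' steps among n_mutations.

    Assumption: each mutation is up or down with probability 1/2.
    """
    return math.sqrt(n_mutations * 0.25)


# Numbers taken from the example in the post: 3000 transitions, rate 1/150.
summary = mutation_count_summary(transitions=3000, rate=1 / 150)
split_sigma = up_count_sigma(20)

print(summary)
print("sigma of the up/down split for 20 mutations:", split_sigma)
```

Even with the tree structure held fixed, the count for this one fast STR is uncertain by several mutations either way, and the up/down split adds a further uncertainty of about two mutations.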

>

> Secondly, the locations of each STR's multiple mutations are the outcome

> of

> a random process. Each father/son transition stands an equal chance of being

> the site. So that fast STR with an expected 20 mutations in the tree has

> those 20 locations sprinkled randomly over the 3000 or so locations in the

> tree. The consequences of location distribution are immense, as are the

> impacts on the resulting haplotypes.

> If an STR mutation happens

> to occur in the earliest branch segments from the two sons of the clade

> MRCA,

> such a mutation on average will appear in (affect) 50 percent of the final

> haplotype

> sample population. But if in the other extreme, it occurs in one of those

> last N branch segments of

> the tree which terminate with one of the N haplotypes of the sample

> population,

> that mutation will appear in (affect) just the single haplotype at the

> terminal. So a mutation has potency on average of affecting from 50

> percent to 1/N of the observed haplotypes. So there is no way to go from

> the

> observed haplotypes backwards in this process and count occurrences of the

> underlying STR mutation from what's seen in the haplotypes. And of course

> that 50 percent consequence of a

> very early STR mutation itself is only an average or expected value; we'd

> have to know the tree structure in detail to know the true fraction of

> sample haplotypes affected by it; that fraction depends on what fraction

> of the final haplotypes descend from each of the sons of the MRCA, and

> which son's descending branch segment had the STR mutation in question.
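To make the dependence on location concrete, here is a small sketch that builds one random binary tree for 64 sampled haplotypes and lists, for every branch segment, what fraction of the sample would carry a mutation that landed on it. The tree-building rule (joining two lineages chosen at random, going back in time) is an assumption chosen only for illustration, not a model of any real clade, and the function name is hypothetical. The sketch also ignores branch lengths; in a real tree a mutation is more likely to land on a long branch than on a short one.

```python
import random


def random_tree_leaf_counts(n_leaves: int, seed: int = 0) -> list[int]:
    """Return, for each branch segment of a random binary tree, the number
    of sampled haplotypes descending from that branch.

    Assumption: the topology is made by repeatedly joining two lineages
    picked uniformly at random until only the MRCA remains.
    """
    rng = random.Random(seed)
    lineages = [1] * n_leaves  # each active lineage carries its leaf count
    branch_sizes = []
    while len(lineages) > 1:
        i, j = rng.sample(range(len(lineages)), 2)
        a, b = lineages[i], lineages[j]
        branch_sizes.extend([a, b])  # the two branches ending at this node
        for idx in sorted((i, j), reverse=True):
            lineages.pop(idx)
        lineages.append(a + b)  # the parent lineage carries both subtrees
    return branch_sizes


n_sample = 64
sizes = random_tree_leaf_counts(n_sample)
fractions = sorted(size / n_sample for size in sizes)

print("branch segments:", len(sizes))
print("smallest fraction affected:", fractions[0])
print("largest fraction affected:", fractions[-1])
print("split between the two sons of the MRCA:", sizes[-2], sizes[-1])
```

The last two entries are the branches from the two sons of the MRCA; they sum to the sample size but need not split it evenly, which is why the 50 percent figure is only an expected value.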

>

> But fortunately, because we have a theory of the probabilistic occurrences

> of these STR mutations, we can determine by analytic calculation or by PC

> simulations the distribution of outcomes for various properties of the

> tree, including their expected values and statistical confidence

> intervals.

> In other words, analytics or PC simulations through a tremendous number of

> cases,

> can essentially consider all possible probabilistic insertions of the STR

> mutations into any given tree structure and produce the distributions of

> outcomes for key parameters.

>

> When the smoke clears, some of these parameter distributions have expected

> values which amazingly

> are independent of tree structure details, but less often does one find

> that the statistical confidence intervals are independent of tree details.

> I'll briefly describe the good and the ugly parameter properties.

>

> A. If you can accurately guess the MRCA's founding haplotype of a clade,

> it turns out that the expected value for the average variance of the N

> final haplotypes from the founding haplotype of the clade tree is

> INDEPENDENT of the tree structure

> other than its TMRCA; and <TMRCA> = Sum of STR Variances / Sum of STR

> mutation rates.

>

> B. If you have sample populations of haplotypes from two clades with

> interclade MRCA earlier than MRCAs of both individual clades, then the

> expected value of the variances between any and all pairs of haplotypes,

> one taken

> from one clade and the other taken from the other clade, is INDEPENDENT of

> the tree structure other than the TMRCA of the interclade node ancestral

> to both clades, and that interclade <TMRCA> = Sum of interclade STR

> Variances / (2 x Sum of STR mutation rates). No founding haplotype need be

> guessed in this parameter estimation: all variances are between present

> day pairs of haplotypes which we see.

>

> C. If you evaluate the average sum of STR variances between the N(N-1)/2

> pairs of your N sample haplotypes of a single clade, and divide by sum of

> STR

> mutation rates, you get a Coalescence age (also called Expansion age)

> whose meaning is DEPENDENT on

> the tree structure details; this age estimate is of something necessarily

> younger than the clade's MRCA age.

>

> D. All statistical confidence intervals for the above estimates are

> DEPENDENT on the

> tree structure details which we don't know. However, the statistical

> confidence interval for B

> above can rather simply have a conservative upper limit determined in

> terms of the tree depth.
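The estimators in A and B can be written out as a short sketch. Everything here is illustrative: the marker values, the mutation rates, the founding haplotype and the function names are made-up assumptions, and haplotypes are represented simply as lists of repeat counts in the same marker order. Estimator C is left out because, as noted above, its meaning depends on tree details. Results are in generations.

```python
from itertools import product


def tmrca_from_founder(haplotypes, founder, rates):
    """Estimator A: sum over STRs of the mean squared difference from the
    (guessed) founding haplotype, divided by the sum of mutation rates."""
    n = len(haplotypes)
    total_variance = 0.0
    for m, founder_value in enumerate(founder):
        total_variance += sum((h[m] - founder_value) ** 2 for h in haplotypes) / n
    return total_variance / sum(rates)


def interclade_tmrca(clade_a, clade_b, rates):
    """Estimator B: sum over STRs of the mean squared difference across all
    pairs (one haplotype from each clade), divided by twice the sum of
    mutation rates. No founding haplotype is needed."""
    n_pairs = len(clade_a) * len(clade_b)
    total_variance = 0.0
    for m in range(len(rates)):
        squared = sum((a[m] - b[m]) ** 2 for a, b in product(clade_a, clade_b))
        total_variance += squared / n_pairs
    return total_variance / (2.0 * sum(rates))


# Hypothetical toy data: three STRs with assumed per-generation rates.
rates = [1 / 150, 1 / 500, 1 / 300]
founder_a = [14, 12, 23]
clade_a = [[14, 12, 23], [15, 12, 23], [14, 12, 22], [13, 12, 23]]
clade_b = [[16, 13, 23], [16, 13, 24]]

age_a = tmrca_from_founder(clade_a, founder_a, rates)
age_ab = interclade_tmrca(clade_a, clade_b, rates)

print("clade A TMRCA estimate (generations):", age_a)
print("interclade TMRCA estimate (generations):", age_ab)
```

These functions return only the expected-value estimates. As point D says, the confidence intervals around them depend on tree structure and are not computed here.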

>

> For those who use GDs rather than variances, the above discussion is

> applicable as well, though slightly modifiable in some details. In fact,

> for trees that are not too old, even "young

> ancient" trees, GD counts are very close to the same as variance measures.

>

> So for ancient trees we can not count number of STR mutations that

> happened or where those mutations happened in the tree. We can only assume

> a generic structure or class of structures for the tree, and then acquire

> the probabilistic distributions for number and locations of the STR

> mutations, and then obtain distributions for parameters of interest about

> the tree which depend on the first-said distributions about how many and

> where the STR mutations occurred.

>

> Why are expected or average values of distributions so central in all

> this? Because of

> the central limit theorem of applied probability math. When extended

> haplotypes with many, many STRs are used in a combined fashion, the many

> independent stochastic distributions of the STRs can be combined to form a

> collective distribution for outcomes of interest which are much better

> behaved with most likely

> outcomes being squished toward the expected values of collective

> parameters composed from the many STR distributions.
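The narrowing effect of combining many STRs can be seen in a small simulation. This is a sketch under strong simplifying assumptions that are not from the post: a star-shaped tree (every sampled lineage descends independently from the founder), mutations arriving as a Poisson process, single-step changes that are up or down with equal chance, and the same rate for every STR. All names are hypothetical. It compares the relative spread of the variance-based age estimate when 4 STRs are combined against 40.

```python
import random
import statistics


def simulate_displacement(rng, rate, generations):
    """Net repeat-count change along one lineage (Poisson arrivals, +/-1 steps)."""
    position = 0
    t = rng.expovariate(rate)
    while t < generations:
        position += rng.choice((-1, 1))
        t += rng.expovariate(rate)
    return position


def estimate_tmrca(rng, n_strs, n_lineages, rate, generations):
    """Sum of STR variances from the founder divided by the sum of rates."""
    total_variance = 0.0
    for _ in range(n_strs):
        steps = [simulate_displacement(rng, rate, generations)
                 for _ in range(n_lineages)]
        total_variance += sum(s * s for s in steps) / n_lineages
    return total_variance / (n_strs * rate)


def relative_spread(n_strs, trials=200, seed=1):
    """Standard deviation of the estimate divided by its mean over many trials."""
    rng = random.Random(seed)
    estimates = [estimate_tmrca(rng, n_strs, 16, 1 / 150, 140)
                 for _ in range(trials)]
    return statistics.stdev(estimates) / statistics.mean(estimates)


spread_few = relative_spread(4)
spread_many = relative_spread(40)

print("relative spread with 4 STRs:", spread_few)
print("relative spread with 40 STRs:", spread_many)
```

In a real tree the lineages share ancestry, so their contributions are correlated and the spread shrinks more slowly than in this star-tree idealization.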

> P.S. Genealogical trees for which the genealogical information permits

> robust determination of the tree structure can allow you to make informed

> guesses about the number of and placements of the STR mutations that

> happened in the tree.
