**GENEALOGY-DNA-L Archives**

From: "Ken Nordtvedt" <>

Subject: [DNA] Y tree STR Mutations can not be counted

Date: Fri, 12 Feb 2010 18:00:27 -0700

References: <003d01caac47$8e3f5090$5e82af48@Ken1>

Title should have read: Y Tree STR Mutations can not be counted.

----- Original Message -----

From: "Ken Nordtvedt" <>

To: <>

Sent: Friday, February 12, 2010 5:57 PM

Subject: [DNA] Y Tree SNPs can not be counted


>> Since some have recently been repeatedly saying they make y tree age

>> estimates and their statistical confidence intervals of those estimates

>> by

>> "counting mutations" in the tree (though I think they mean something

>> quite

>> different), I want to document the reasons why it is impossible to count

>> mutations in a y tree such as one for I1 haplogroup, etc. It would not

>> be

>> good if this repeated claim led people to believe that proper TMRCA

>> estimates for ancient clades involve trying to do the impossible ---

>> count mutations. (see P.S. for young genealogical trees) The reasons that

>> counting mutations in a y tree is impossible are multiple. What we can

>> count are features seen in the tree's resulting haplotypes in hand.

>>

>> 1. A y tree from a single clade MRCA to the N present descending

>> haplotypes of one's representative sample has a structure determined

>> solely by the demographics through all of its history. First we have the

>> full tree for

>> the entire clade population which might be millions, and then we have the

>> particular tree for the sampling of the present

>> clade population one has taken. The latter tree is some pruning of the

>> former tree; if the sampling has quality, its tree will capture important

>> features of the full tree for the entire clade.

>>

>> Even if SNPs and STRs did not exist, that

>> y tree structure exists, whatever it is. How many male children did

>> clade males have through time which produced next generation males, etc.

>> in the face of all the forces which acted on the clade? Y mutations had

>> nothing to do with what tree structure resulted. Births and deaths did.

>> The mutations are only fortunate tools permitting us to peer back into

>> the

>> heart of the tree.

>>

>> So the number of father/son transitions which took place in the tree (sample

>> population tree or full tree) is unknown, and the locations of all the

>> nodes in the tree --- places where a father and two or more sons are part

>> of the tree --- are unknown. When mutations are eventually considered,

>> we

>> don't have a specific count of the number of chances each mutation has

>> had

>> to

>> happen in the tree. But perhaps more importantly, because the node

>> structure of the tree is unknown, and locations where the mutations took

>> place are unknown, we can not say how many final haplotypes in our sample

>> will show

>> the consequences of most of the mutations which do happen in the tree.

>> All nodes in the tree downstream of any mutation enlarge the

>> number of final haplotypes in our sample which show the particular

>> mutation.

>>

>> A theory of the

>> entire demographic history of the clade is a very iffy thing. And even

>> then, some of the tree structure will be the result of ultimate luck ---

>> such and such males had mostly sons or mostly daughters, etc. --- facts

>> not determinable by the demographic theory.

>>

>> So the basic tree structure is unknown, not just its total time depth to

>> origin --- this specifically mentioned tree property being perhaps the

>> tree parameter standing the best chance of decently being estimated from

>> our final evidence --- the N haplotypes.

>>

>> 2. Now we consider throwing STR mutations into the tree. Unlike the

>> unknown demographic history, we do have a theory of occurrences of the

>> STR

>> mutations.

>> At every father/son transition in the tree each STR will mutate in some

>> way with some tiny probability, and with the remaining large probability

>> stay the same. Because this behavior is probabilistic, it results in two

>> uncertainties about the occurrences of mutations in the tree. How many

>> times did each STR mutate in the tree? Where did those

>> mutations take place in the tree? We can calculate only the statistical

>> answers to these questions and get distributions of outcomes. Since

>> nature

>> has performed this tree once for

>> the case at hand, it is just a specific case from the probabilistic

>> distribution of outcomes. Example: A clade tree of about 140 generations

>> in age ending with a sample population of 64 could have in the ballpark of

>> 3000 father/son transitions in it. A fast STR with mutation rate of 1/150

>> on average would have mutated about 20 times in the tree. But the 2-sigma

>> statistical confidence interval for number of occurrences of that STR

>> mutation spans the range of about 12 to 30 occurrences of the mutation.

>> And to make matters even more uncountable, about half the mutations would

>> have been up and half down, but the actual up/down split is also subject

>> to a statistical confidence interval. And so on for each STR of our

>> haplotypes.

>> So even if we knew the tree structure (which we don't) we would still not know

>> the actual number of occurrences of each STR's mutations in the tree; we

>> can only calculate or determine by simulations the distributions for

>> those

>> numbers, including

>> the average or expected values of those distributions.
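
The arithmetic in the example above can be checked with a short Python sketch. It assumes each father/son transition is an independent trial with the same mutation probability (a binomial model), and it uses the symmetric "mean plus or minus two standard deviations" shortcut, which ignores the slight right skew of such a count. The function name `binomial_summary` is made up for this illustration.

```python
import math


def binomial_summary(trials, prob, sigmas=2.0):
    """Return mean, standard deviation and (mean - k*sd, mean + k*sd)
    for a binomial count with the given number of trials and probability."""
    mean = trials * prob
    sd = math.sqrt(trials * prob * (1.0 - prob))
    return mean, sd, (mean - sigmas * sd, mean + sigmas * sd)


# Figures taken from the example in the text.
transitions = 3000   # father/son transitions in the tree
rate = 1 / 150       # mutation probability per transition for a fast STR

mean, sd, (low, high) = binomial_summary(transitions, rate)
print(f"expected mutations: {mean:.1f}  (sd {sd:.2f})")
print(f"approximate 2-sigma range: {low:.1f} to {high:.1f}")

# Up/down split, conditional on exactly 20 mutations having happened and
# assuming up and down steps are equally likely.
ups_mean, ups_sd, (ups_low, ups_high) = binomial_summary(20, 0.5)
print(f"expected 'up' mutations: {ups_mean:.1f}  (sd {ups_sd:.2f})")
print(f"approximate 2-sigma range: {ups_low:.1f} to {ups_high:.1f}")
```

The symmetric shortcut lands in the same ballpark as the "about 12 to 30" quoted above; an exact interval from the skewed count distribution would sit slightly higher than the symmetric one.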

>>

>> Secondly, the locations of each STR's multiple mutations are the outcome of

>> a random process. Each father/son transition stands an equal chance of being

>> the site. So that fast STR with an expected 20 mutations in the tree has

>> those 20 locations sprinkled randomly over the 3000 or so locations in

>> the

>> tree. The consequences of location distribution are immense, as are the

>> impacts on the resulting haplotypes.

>>

>> If an STR mutation happens

>> to occur in the earliest branch segments from the two sons of the clade

>> MRCA,

>> such a mutation on average will appear in (affect) 50 percent of the

>> final

>> haplotype

>> sample population. But if in the other extreme, it occurs in one of those

>> last N branch segments of

>> the tree which terminate with one of the N haplotypes of the sample

>> population,

>> that mutation will appear in (affect) just the single haplotype at the

>> terminal. So a mutation has potency on average of affecting from 50

>> percent to 1/N of the observed haplotypes. So there is no way to go from

>> the

>> observed haplotypes backwards in this process and count occurrences of

>> the

>> underlying STR mutation from what's seen in the haplotypes. And of course

>> that 50 percent consequence of a

>> very early STR mutation itself is only an average or expected value; we'd

>> have to know the tree structure in detail to know the true fraction of

>> sample haplotypes affected by it; that fraction depends on what fraction

>> of the final haplotypes descend from each of the sons of the MRCA, and

>> which son's descending branch segment had the STR mutation in question.
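
To make the "50 percent to 1/N" range concrete, here is a minimal Python sketch. It assumes an idealized, perfectly balanced binary tree, which a real clade tree will not be; the function name `affected_fraction_by_depth` is hypothetical, and the sample size must be a power of two for this toy layout.

```python
def affected_fraction_by_depth(num_leaves):
    """For a perfectly balanced binary tree, map branch depth
    (1 = the two branches leaving the MRCA) to the fraction of sampled
    haplotypes that inherit a mutation placed on one branch at that depth."""
    fractions = {}
    depth = 1
    leaves_below = num_leaves // 2
    while leaves_below >= 1:
        fractions[depth] = leaves_below / num_leaves
        depth += 1
        leaves_below //= 2
    return fractions


# Sample size of 64 haplotypes, as in the example in the text.
for depth, fraction in affected_fraction_by_depth(64).items():
    print(f"branch depth {depth}: mutation shows up in {fraction:.4f} of the sample")
```

In this toy tree the same single mutation event is seen in anything from half of the sample down to one haplotype in 64, depending only on where it landed, which is why the observed haplotypes cannot simply be read backwards into a mutation count.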

>>

>> But fortunately, because we have a theory of the probabilistic occurrences

>> of these STR mutations, we can determine by analytic calculation or by PC

>> simulations the distribution of outcomes for various properties of the

>> tree, including their expected values and statistical confidence

>> intervals.

>> In other words, analytics or PC simulations through a tremendous number

>> of

>> cases,

>> can essentially consider all possible probabilistic insertions of the STR

>> mutations into any given tree structure and produce the distributions of

>> outcomes for key parameters.

>>

>> When the smoke clears, some of these parameter distributions have

>> expected

>> values which amazingly

>> are independent of tree structure details, but less often does one find

>> that the statistical confidence intervals are independent of tree

>> details.

>> I'll briefly describe the good and the ugly parameter properties.

>>

>> A. If you can accurately guess the MRCA's founding haplotype of a clade,

>> it turns out that the expected value for the average variance of the N

>> final haplotypes from the founding haplotype of the clade tree is

>> INDEPENDENT of the tree structure

>> other than its TMRCA; and <TMRCA> = Sum of STR Variances / Sum of STR

>> mutation rates.

>>

>> B. If you have sample populations of haplotypes from two clades with

>> interclade MRCA earlier than MRCAs of both individual clades, then the

>> expected value of the variances between any and all pairs of haplotypes,

>> one taken

>> from one clade and the other taken from the other clade, is INDEPENDENT

>> of

>> the tree structure other than the TMRCA of the interclade node ancestral

>> to both clades, and that interclade <TMRCA> = Sum of interclade STR

>> Variances / (2 x Sum of STR mutation rates). No founding haplotype need be

>> guessed in this parameter estimation: all variances are between present

>> day pairs of haplotypes which we see.

>>

>> C. If you evaluate the average sum of STR variances between the N(N-1)/2

>> pairs of your N sample haplotypes of a single clade, and divide by sum of

>> STR

>> mutation rates, you get a Coalescence age (also called Expansion age)

>> whose meaning is DEPENDENT on

>> the tree structure details; this age estimate is of something necessarily

>> younger than the clade's MRCA age.

>>

>> D. All statistical confidence intervals for the above estimates are

>> DEPENDENT on the

>> tree structure details which we don't know. However, the statistical

>> confidence interval for B

>> above can rather simply have a conservative upper limit determined in

>> terms of the tree depth.
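
Estimates A and B can be sketched in Python as follows. This is an illustration only: it reads "variance" as the mean squared difference in repeat counts per STR, takes mutation rates per generation (so ages come out in generations), and uses made-up toy haplotypes. The function names are hypothetical, and estimate C is left out because its interpretation depends on the tree structure.

```python
def tmrca_from_founder(haplotypes, founder, rates):
    """Estimate A: sum over STRs of the mean squared deviation of the
    sample from the assumed founding haplotype, over the sum of rates."""
    n = len(haplotypes)
    total_variance = sum(
        sum((h[i] - founder[i]) ** 2 for h in haplotypes) / n
        for i in range(len(rates))
    )
    return total_variance / sum(rates)


def interclade_tmrca(clade_a, clade_b, rates):
    """Estimate B: sum over STRs of the mean squared difference across all
    cross-clade pairs, over twice the sum of rates. No founder is needed."""
    pairs = [(a, b) for a in clade_a for b in clade_b]
    total_variance = sum(
        sum((a[i] - b[i]) ** 2 for a, b in pairs) / len(pairs)
        for i in range(len(rates))
    )
    return total_variance / (2 * sum(rates))


# Toy data, invented for illustration: three STRs per haplotype.
rates = [0.002, 0.003, 0.005]          # assumed per-generation rates
founder = [14, 22, 13]                 # assumed founding haplotype
sample = [[14, 22, 13], [15, 22, 13], [14, 21, 13], [14, 22, 14]]
clade_a = [[14, 22, 13], [15, 22, 13]]
clade_b = [[16, 22, 13], [16, 23, 13]]

print(f"A, age from founder: {tmrca_from_founder(sample, founder, rates):.1f} generations")
print(f"B, interclade age:   {interclade_tmrca(clade_a, clade_b, rates):.1f} generations")
```

Neither function returns a confidence interval; as point D notes, that would require assumptions about the tree structure that the haplotypes alone do not supply.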

>>

>> For those who use GDs rather than variances, the above discussion is

>> applicable as well, though slightly modifiable in some details. In fact,

>> for trees that are not too old, even "young

>> ancient" trees, GD counts are very close to the same as variance measures.
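
A small Python sketch shows why the two measures agree for young trees. It assumes GD means the sum of absolute repeat-count differences (a stepwise count), which is one common convention and not something the post specifies; the haplotypes are invented.

```python
def genetic_distance(h1, h2):
    """Stepwise GD: sum of absolute repeat-count differences (assumed definition)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def squared_distance(h1, h2):
    """Variance-style measure: sum of squared repeat-count differences."""
    return sum((a - b) ** 2 for a, b in zip(h1, h2))


# Invented pairs: a closely related pair and a more distant one.
young_pair = ([14, 22, 13, 30], [15, 22, 12, 30])   # only one-step differences
old_pair = ([14, 22, 13, 30], [16, 22, 12, 33])     # multi-step differences

for label, (h1, h2) in (("young", young_pair), ("old", old_pair)):
    print(label, "GD =", genetic_distance(h1, h2),
          "squared =", squared_distance(h1, h2))
```

When every STR differs by at most one step, each absolute difference equals its square, so the two sums coincide; they separate only once multi-step differences accumulate.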

>>

>> So for ancient trees we can not count number of STR mutations that

>> happened or where those mutations happened in the tree. We can only

>> assume

>> a generic structure or class of structures for the tree, and then acquire

>> the probabilistic distributions for number and locations of the STR

>> mutations, and then obtain distributions for parameters of interest about

>> the tree which depend on the first-said distributions about how many and

>> where the STR mutations occurred.
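
The procedure just described (assume a tree structure, insert mutations at random, read off the parameter of interest) can be sketched in Python. This is a toy under stated assumptions, not anyone's actual method: every STR has the same rate, each mutation is a single step up or down with equal probability, the founder value is taken as 0, and the two tree shapes are invented. One is a "star" (all lineages separate at the MRCA) and the other is lopsided (one lone lineage, with the rest sharing a long trunk before splitting). All names are hypothetical.

```python
import random


def walk(start, generations, rate, rng):
    """Pass one STR down a lineage: at each father/son transition the
    repeat count moves by +1 or -1 with probability `rate`."""
    value = start
    for _ in range(generations):
        if rng.random() < rate:
            value += rng.choice((-1, 1))
    return value


def simulate_estimate(total, shared, n_leaves, n_markers, rate, rng):
    """One leaf descends alone for `total` generations; the other leaves
    share a trunk for `shared` generations, then evolve independently.
    shared=0 gives a star tree. Returns sum of variances / sum of rates."""
    total_variance = 0.0
    for _marker in range(n_markers):
        trunk = walk(0, shared, rate, rng)
        leaves = [walk(0, total, rate, rng)]
        leaves += [walk(trunk, total - shared, rate, rng)
                   for _leaf in range(n_leaves - 1)]
        total_variance += sum(v * v for v in leaves) / n_leaves
    return total_variance / (n_markers * rate)


if __name__ == "__main__":
    rng = random.Random(2010)
    star = simulate_estimate(140, 0, 16, 2000, 1 / 150, rng)
    lopsided = simulate_estimate(140, 130, 16, 2000, 1 / 150, rng)
    print(f"star tree estimate:     {star:.1f} generations (true depth 140)")
    print(f"lopsided tree estimate: {lopsided:.1f} generations (true depth 140)")
```

Both shapes should give estimates scattered around the true depth of 140 generations, in line with point A. Rerunning with different seeds would be expected to show wider scatter for the lopsided tree, because most of its leaves share the same trunk mutations, which is the tree dependence of confidence intervals noted in point D.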

>>

>> Why are expected or average values of distributions so central in all

>> this? Because of

>> the central limit theorem of applied probability math. When extended

>> haplotypes with many, many STRs are used in a combined fashion, the many

>> independent stochastic distributions of the STRs can be combined to form a

>> collective distribution for outcomes of interest which is much better

>> behaved with most likely

>> outcomes being squished toward the expected values of collective

>> parameters composed from the many STR distributions.
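
The squeezing effect of combining many STRs can be illustrated with a rough Python simulation. It assumes a single lineage of 140 generations, identical rates of 1/150 for every STR, and single-step up/down mutations; these are illustration choices, and the function names are made up.

```python
import random
import statistics


def lineage_estimate(generations, n_markers, rate, rng):
    """Age estimate from one haplotype: sum of squared deviations from the
    founder (taken as 0) over the sum of rates."""
    total = 0
    for _marker in range(n_markers):
        value = 0
        for _gen in range(generations):
            if rng.random() < rate:
                value += rng.choice((-1, 1))
        total += value * value
    return total / (n_markers * rate)


def estimate_spread(generations, n_markers, rate, runs, rng):
    """Repeat the estimate `runs` times; return its mean and standard deviation."""
    estimates = [lineage_estimate(generations, n_markers, rate, rng)
                 for _run in range(runs)]
    return statistics.mean(estimates), statistics.stdev(estimates)


if __name__ == "__main__":
    rng = random.Random(2010)
    for n_markers in (10, 100):
        mean, sd = estimate_spread(140, n_markers, 1 / 150, 100, rng)
        print(f"{n_markers:3d} STRs: mean estimate {mean:.0f}, spread (sd) {sd:.0f} generations")
```

With more STRs combined, the mean estimate stays near the true 140 generations while the run-to-run spread shrinks, roughly in proportion to one over the square root of the number of STRs.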

>

>> P.S. Genealogical trees for which the genealogical information permits

>> robust determination of the tree structure can allow you to make informed

>> guesses about the number of and placements of the STR mutations that

>> happened in the tree.

