
From: "Ken Nordtvedt"
Subject: [DNA] Sampling Sigma and Mutational Sigma
Date: Sun, 28 Feb 2010 11:24:14 -0700

There are at least two important sources of uncertainty in the confidence intervals of the various variance-based or GD-based (genetic distance) age estimates: one due to the random placement of the Y-DNA STR mutations in the tree, and one due to using sample populations of haplotypes to represent the clade rather than the entire clade population (which could number in the millions in some cases). One can compare the age estimates that result from separate, independent haplotype samples, and the variation among those estimates will give you an idea of the sigma (confidence interval) contribution due to sampling. But that will tell you nothing about the other (probably larger) contribution, the one due to the random placement of the STR mutations in the tree.

This latter statistical confidence interval (due to fluctuations from the average in the STR mutations' numbers and locations) is there even if you knew all the millions of haplotypes of the clade today.
The clade's tree is a single tree that came to be once in nature. At each father/son transition the STR mutations had a single opportunity to take place or not take place. One complete set of STR mutations did take place through that one tree, and from that one set a variance or average GD came to be; that's what we can observe. But that one set of STR mutations is just one case from the distribution of all possible sets of STR mutations which could have occurred for that tree. How much does that one case vary in the number and locations of its STR mutations from the average outcome? That's what the calculable intrinsic mutational statistical confidence interval tells us: on average, how much should we expect our observed variance or average GD to deviate from the expected (average) value of these quantities? It is, after all, the expected (average) value of variance or GD which is the estimator of age, but we are using a single outcome taken from a distribution of possible outcomes and hoping it is close to the average (the estimator).
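To make the idea of "one outcome drawn from a distribution of possible outcomes" concrete, here is a rough simulation sketch. Everything in it is an assumption chosen for illustration, not something taken from real data: the tree is a balanced binary tree, every branch has the same length, all markers share one mutation rate `mu`, mutations are single-step and symmetric, and variance is measured about the (known) founder value. The names `simulate_leaves` and `founder_variance` are made up for this sketch. The tree is held fixed and the mutation history is replayed many times, so the spread of the results reflects the mutational sigma alone.

```python
import numpy as np

def simulate_leaves(rng, depth, branch_len, n_markers, mu):
    """Run one random mutation history down a fixed balanced binary tree.

    Returns the present-day haplotypes as step offsets from the founder,
    one row per haplotype, one column per marker.
    """
    haps = np.zeros((1, n_markers), dtype=np.int64)  # the founder haplotype
    rate = mu * branch_len / 2.0  # half the mutations step up, half step down
    for _ in range(depth):
        haps = np.repeat(haps, 2, axis=0)  # every lineage splits into two sons' lines
        ups = rng.poisson(rate, size=haps.shape)
        downs = rng.poisson(rate, size=haps.shape)
        haps = haps + ups - downs  # net repeat change along each new branch
    return haps

def founder_variance(haps):
    """Mean squared offset from the founder, averaged over haplotypes and markers."""
    return float(np.mean(haps.astype(float) ** 2))

# Assumed illustrative parameters (hypothetical, not fitted to any clade).
depth, branch_len, n_markers, mu = 8, 15, 20, 0.002
expected = mu * depth * branch_len  # expected variance = mu * age in generations

rng = np.random.default_rng(0)
outcomes = np.array([
    founder_variance(simulate_leaves(rng, depth, branch_len, n_markers, mu))
    for _ in range(500)  # 500 alternative histories on the same tree
])

print("expected variance     :", expected)
print("mean over histories   :", outcomes.mean())
print("sigma over histories  :", outcomes.std())
print("relative sigma of age :", outcomes.std() / expected)
```

Each entry of `outcomes` uses every haplotype at the tips of the tree, so there is no sampling error at all in this sketch; the scatter that remains is the intrinsic mutational one. Nature gave us only one such entry.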

On top of this sigma (confidence interval) driven by the random nature of mutations, there is the additional sigma due to sample populations. Each sample population of haplotypes has its own tree, which is a certain pruning of the underlying total tree of the clade. So each tree for a sample of haplotypes represents a slightly different simplified version of the total tree and uses in part (but only in part) different branch segments of the underlying tree. But all these different sample sets of haplotypes are the result of the same single history of actual STR mutations placed in the underlying tree. So the variances or GDs obtained from different samples are driven in part by the same actual STR mutations, namely those which occurred on the branch segments of the underlying tree that the samples use in common. The differences among these variances and GDs are driven just by the STR mutations which happened on the branch segments not shared by the separate pruned trees pertinent to the separate samples of haplotypes.
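The sampling contribution can be sketched the same way, again under assumptions picked only for illustration (balanced binary tree, equal branch lengths, one shared rate `mu`, single-step symmetric mutations, variance taken about the founder value; `one_history` is a made-up name). Here the mutation history is generated once and then frozen, and only the choice of which haplotypes get sampled changes from one estimate to the next.

```python
import numpy as np

def one_history(rng, depth, branch_len, n_markers, mu):
    """Generate a single mutation history on a fixed balanced binary tree."""
    haps = np.zeros((1, n_markers), dtype=np.int64)  # founder haplotype
    rate = mu * branch_len / 2.0  # up-steps and down-steps equally likely
    for _ in range(depth):
        haps = np.repeat(haps, 2, axis=0)  # each lineage splits into two
        haps = haps + rng.poisson(rate, size=haps.shape) \
                    - rng.poisson(rate, size=haps.shape)
    return haps

# Assumed illustrative parameters (hypothetical, not fitted to any clade).
depth, branch_len, n_markers, mu = 10, 12, 20, 0.002
expected = mu * depth * branch_len  # average over all possible histories

rng = np.random.default_rng(1)
clade = one_history(rng, depth, branch_len, n_markers, mu)  # the one history that "happened"
clade_var = float(np.mean(clade.astype(float) ** 2))  # variance using the whole clade

sample_size = 100
sample_vars = np.array([
    # each draw picks a different subset of haplotypes, i.e. a different pruned tree
    np.mean(clade[rng.choice(len(clade), size=sample_size, replace=False)]
            .astype(float) ** 2)
    for _ in range(300)
])

print("expected (all histories)   :", expected)
print("whole clade, this history  :", clade_var)
print("mean over samples          :", sample_vars.mean())
print("sigma over samples         :", sample_vars.std())
```

The sample variances scatter around `clade_var`, the value for this one realized history, not around `expected`. However many independent samples are compared, their scatter says nothing about how far `clade_var` itself lies from `expected`.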

So in a nutshell, comparing the variances or GDs which result from different sample haplotype populations is an interesting new statistic which should be explored in its own right. I don't think this has been explored very much in the literature, if at all.

But it is not telling us about the intrinsic statistical confidence interval valid for the underlying total clade tree. Since that particular clade tree and all its STR mutations happened only once in nature, the only way to find its intrinsic statistical confidence interval for variance or average GD, the one due to the random nature of STR mutations, is to calculate, or obtain by simulation, the distribution of variances or GDs which could have occurred given the rules of mutation for the STRs.
