GENEALOGY-DNA-L Archives



From: "Ken Nordtvedt" <>
Subject: [DNA] For Age Estimates beware of tossing STRs away
Date: Wed, 15 Jul 2009 11:50:42 -0600


To illustrate how statistical confidence intervals quickly kill you if you throw away all the fast and even medium rate STRs, draw a picture of an interclade tree with its root node 15,000 years ago but with two clade populations just a few thousand years old spreading out at the two "present" ends. That represents many of the applications being made recently --- there really aren't any clade populations with great ages.

You should see that most of the mutating between any pair of haplotypes --- one from clade A and the other from clade B --- occurs on the single ancestral branch lines that go back to the root node. No matter how many haplotypes you use from clade A and from clade B to try to "smooth" out the noise, you can't smooth out the noise on those ancestral branch lines, because all haplotype pairs share the same mutations occurring there. All that can be smoothed out is the contribution to variance growth during the brief clade lives on each "present" end of the tree.
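The point can be put in numbers with a minimal sketch. It assumes a simple Poisson count of mutations, star-shaped clades (each haplotype descends independently from its clade founder), 30 years per generation, and a clade age of 3,000 years (100 generations); the clade age and the function name are illustrative choices, not figures from the tree described above.

```python
def interclade_variance(rate_sum, branch_gens, clade_gens, n_per_clade):
    """Variance of the averaged A-to-B mutation count under a simple Poisson model."""
    # Mutations on the two ancestral branch lines are shared by every A-B pair,
    # so averaging over more haplotypes does nothing to this term.
    shared = rate_sum * 2 * branch_gens
    # Mutations inside each clade are independent per haplotype (star-shaped
    # clade assumed), so this term shrinks as 1/n.
    within = 2 * rate_sum * clade_gens / n_per_clade
    return shared + within

RATE_SUM = 1 / 200   # sum of mutation rates per generation (assumed marker set)
BRANCH_GENS = 400    # root node to clade founder, per side (assumed)
CLADE_GENS = 100     # clade founder to present (assumed)

for n in (1, 10, 100):
    var = interclade_variance(RATE_SUM, BRANCH_GENS, CLADE_GENS, n)
    print(f"n = {n:3d} per clade: variance = {var:.2f}, sigma = {var ** 0.5:.2f}")
```

Under these assumptions, going from one haplotype per clade to a hundred removes only the small within-clade term; the shared-branch term stays put.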

So to get an idea of the interclade variance estimate, just imagine the two clades are represented by a single haplotype each. Make the situation simple. This should be called the "basic V tree" to streamline our discussions.

The basic interclade V tree is just a line back from one haplotype to the interclade node, and then another line forward from there to the other haplotype. The problem is isomorphic to a TMRCA estimate for a pair of haplotypes. The only difference is that much deeper times are usually being estimated for interclade nodes, which should help the fractional accuracy of the TMRCA estimate.

If you keep just the 18 slowest STRs of the common 75 STR set (FTDNA and SMGF) your sum of mutation rates is about 1/200. The two clades with a TMRCA of about 15,000 years ago (500 generations) are separated by 1000 generations in which mutations could have occurred. So the total number of mutations to expect between them among the 18 slowest STRs is 1000/200 = 5 mutations. That means about 13 of your 18 STRs did not mutate at all. But the probability is such that getting 4 or 6, or 3 or 7, etc. mutations is appreciable. In fact the 67 percent confidence interval is about plus or minus square root of 5 = 2.24, or 45 percent of the estimate. The 95 percent confidence interval is about twice that but somewhat skewed asymmetrically between the high and low error side.
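The arithmetic above can be checked with a short sketch. It treats the total mutation count as a Poisson variable, which is an assumption of this sketch (it ignores back mutations and multi-step changes), and the helper name is illustrative.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

rate_sum = 1 / 200            # sum of rates for the 18 slowest STRs
generations_apart = 2 * 500   # two branches of 500 generations each
expected = rate_sum * generations_apart
sigma = math.sqrt(expected)

print(f"expected mutations: {expected:.1f}")
print(f"one-sigma spread:   {sigma:.2f} ({sigma / expected:.0%} of the estimate)")

# How likely each observed count is, from 0 to 10 mutations.
for k in range(11):
    print(f"P({k:2d} mutations) = {poisson_pmf(k, expected):.3f}")

# Exact Poisson probability of landing within about one sigma (3 to 7).
p_3_to_7 = sum(poisson_pmf(k, expected) for k in range(3, 8))
print(f"P(3 to 7 mutations) = {p_3_to_7:.3f}")
```

Because counts are whole numbers, the exact probability for the 3 to 7 range does not match the nominal one-sigma figure exactly, and the distribution is visibly lopsided at this small expected count.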

The mutational behavior of some of the STRs could very well differ from the simple models used up until now, and gathering the data to improve the mutation model is certainly in order. But one thing will be about the same as the model changes --- the statistical confidence interval size --- that noise will always be there. If you want to make TMRCA estimates of any usefulness you have got to maintain a decent "sum of marker mutation rates".
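To see how the noise scales with the sum of mutation rates, here is a rough sketch under the same Poisson counting assumption. Only the 1/200 figure comes from the discussion above; the larger rate sums are hypothetical multiples chosen for illustration, and the sketch ignores back mutations, which matter more for fast markers over deep times.

```python
import math

def fractional_sigma(rate_sum, generations_apart):
    """One-sigma error as a fraction of the estimate: 1 / sqrt(expected count)."""
    return 1.0 / math.sqrt(rate_sum * generations_apart)

GENERATIONS_APART = 1000  # two branches of 500 generations each

cases = [
    ("18 slowest STRs (rate sum 1/200)", 1 / 200),
    ("hypothetical 4x rate sum", 4 / 200),
    ("hypothetical 16x rate sum", 16 / 200),
]

for label, rate_sum in cases:
    frac = fractional_sigma(rate_sum, GENERATIONS_APART)
    print(f"{label:34s} fractional one-sigma error = {frac:.0%}")
```

The fractional error falls only as the square root of the rate sum, so each marker thrown away costs something and a fourfold larger rate sum is needed to halve the interval.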

Hoard your "sum of mutation rates" as if it were gold --- don't throw any away unless under severe duress!

Ken




