GENEALOGY-DNA-L Archives


From: "Ken Nordtvedt" <>
Subject: [DNA] Age Estimate Confidence Intervals
Date: Wed, 8 Jul 2009 10:25:27 -0600


If you estimate ages back to nodes between two haplotypes or two populations, assume a simple mutation model, and optimally weight the STRs in your estimate for G (the node age in generations), the fractional statistical precision (squared) of your estimate will be:

Var(G) / G^2 = 2 / Sum i [4m(i)G / {1+4m(i)G}]

This can be evaluated in the young clade limit (all m(i)G << 1)
and in the very old clade limit (all m(i)G >> 1, which we don't actually reach for modern man's history since Adam).

Young: Var(G) / G^2 = 1 / (2MG)

Old: Var(G) / G^2 = 2 / N

with M being the sum of the marker mutation rates m(i),
and N being the number of STRs.
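
As a quick numerical check of the two limits, here is a minimal Python sketch. The function name and the marker rates are made up for illustration; only the formula itself comes from the expression above.

```python
def fractional_variance(rates, G):
    """Var(G)/G^2 = 2 / Sum_i [4 m_i G / (1 + 4 m_i G)]."""
    total = sum(4.0 * m * G / (1.0 + 4.0 * m * G) for m in rates)
    return 2.0 / total

# Hypothetical marker set: 20 STRs, each at 0.002 mutations per generation.
rates = [0.002] * 20
M = sum(rates)   # sum of the marker mutation rates
N = len(rates)   # number of STRs

young = fractional_variance(rates, G=1.0)    # m*G = 0.002, far below 1
old = fractional_variance(rates, G=1.0e6)    # m*G = 2000, far above 1

print(young, 1.0 / (2.0 * M * 1.0))  # full sum vs. young limit 1/(2MG)
print(old, 2.0 / N)                  # full sum vs. old limit 2/N
```

In both cases the full sum should land within about a percent of the limiting form.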

Note several things.

1) the fractional confidence interval is atrocious as G goes to zero (the genealogical time frame)
2) the fractional confidence interval gets huge if you chop M down by throwing away the fast STRs
3) the fractional confidence interval for old G still wants as many STRs (N) as possible
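
Point 2 can be illustrated with a small sketch. The marker rates below are invented (ten "slow" and ten "fast" STRs), so the numbers only show the direction of the effect, not results for any real marker panel.

```python
import math

def fractional_sigma(rates, G):
    """Square root of Var(G)/G^2 from the full sum over markers."""
    total = sum(4.0 * m * G / (1.0 + 4.0 * m * G) for m in rates)
    return math.sqrt(2.0 / total)

# Hypothetical panel: rates are assumptions, in mutations per generation.
slow = [0.001] * 10
fast = [0.005] * 10
G = 150  # node age in generations

sigma_all = fractional_sigma(slow + fast, G)
sigma_slow_only = fractional_sigma(slow, G)

print("all markers:      ", round(sigma_all, 2))
print("fast STRs dropped:", round(sigma_slow_only, 2))
```

With these made-up rates, dropping the fast STRs raises the one-sigma fractional error from roughly 0.42 to roughly 0.73.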

Some rough examples:

Example: G = 250 (7500 years), M = 1/50
95 percent confidence interval = plus/minus 5000 years.
This example was suggested by the use of YHRD markers.

Example: G = 150 (4500 years), M = 1/10
95 percent confidence interval = plus/minus 1200 years

Example: G = 2000 (60,000 years), N = 24
95 percent confidence interval = plus/minus 40,000 years
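
For anyone who wants to redo the arithmetic, here is a sketch using the limiting forms. Two things are assumptions on my part: 30 years per generation (implied by G = 250 being 7500 years) and a 95 percent interval taken as 1.96 standard deviations of a roughly normal error.

```python
import math

YEARS_PER_GENERATION = 30  # implied by G = 250 <-> 7500 years
Z95 = 1.96                 # assumption: roughly normal errors

def half_width_years(G, frac_var):
    """95 percent half-width in years, given Var(G)/G^2."""
    return Z95 * math.sqrt(frac_var) * G * YEARS_PER_GENERATION

# Young-limit form 1/(2MG) for the first two examples.
young_1 = half_width_years(250, 1.0 / (2.0 * (1.0 / 50.0) * 250))
young_2 = half_width_years(150, 1.0 / (2.0 * (1.0 / 10.0) * 150))
# Old-limit form 2/N for the third example.
old_3 = half_width_years(2000, 2.0 / 24)

print(round(young_1), round(young_2), round(old_3))
```

Under these assumptions the half-widths come out near 4,600, 1,600 and 34,000 years. The first and third match the rough figures above; the second comes out somewhat wider than the quoted plus/minus 1200 years.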

I used the limiting forms of the equation to simplify the work. You can apply the actual sum for real m(i) and G if you wish.
The actual sum will do somewhat worse than my simplifications, so the above numbers are optimistic.

The above is based on the node age estimator:

G = Sum i [Var(i) w(i)] / {2 Sum i [m(i) w(i)]}

with w(i) = 1 / [1+4m(i)G]
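
Since the weights w(i) themselves contain G, the estimator has to be solved self-consistently. One simple way, and this is my assumption rather than anything stated above, is to start from the unweighted estimate and re-evaluate the weights a fixed number of times. The function name and the per-marker variances and rates below are hypothetical.

```python
def estimate_node_age(variances, rates, iterations=50):
    """Repeatedly re-evaluate G = Sum[Var w] / (2 Sum[m w]), w = 1/(1 + 4 m G)."""
    # Start from the unweighted estimate (all w(i) = 1).
    G = sum(variances) / (2.0 * sum(rates))
    for _ in range(iterations):
        weights = [1.0 / (1.0 + 4.0 * m * G) for m in rates]
        numerator = sum(v * w for v, w in zip(variances, weights))
        denominator = 2.0 * sum(m * w for m, w in zip(rates, weights))
        G = numerator / denominator
    return G

# Hypothetical inputs: observed variance and mutation rate for three STRs.
rates = [0.001, 0.002, 0.004]
variances = [0.5, 0.7, 1.7]

G_hat = estimate_node_age(variances, rates)
print(round(G_hat, 1), "generations")
```

Each pass is a weighted average of the single-marker estimates Var(i) / (2 m(i)), so the result always stays between the smallest and largest of them. Convergence of the repeated passes is not guaranteed in general; this is only a sketch.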

I realize some are throwing away STRs for various reasons; I just wanted to remind everyone that there is a cost to doing so: larger statistical confidence intervals for the estimates.

