GENEALOGY-DNA-L Archives

From: "Kenneth Nordtvedt"
Subject: [DNA] Pushing Toward TMRCA
Date: Thu, 20 Oct 2011 17:44:02 -0600

Suppose a haplotype population is divided into two parts, let’s call them U and L (choice of letters to be explained later), with f[U] and f[L] being the fractions of the haplotype population in each part, so that f[U]+f[L]=1.

A little algebra then yields an expression for the INTER-POPULATION variance “age average” or whatever the namers would choose to call this.

G[UL] == Sum over STRs, Sum over u, Sum over l {r[u] - r[l]}^2 / (2 M N[U] N[L])
Here u is summed over the N[U] haplotypes of U, l is summed over the N[L] haplotypes of L, r are the haplotype repeat values, and M is the sum of the STR mutation rates.

G[UL] = G* + {G*-G*[U]} f[U]/(2f[L]) + {G*-G*[L]} f[L]/(2f[U])

(The equation above is true for an ARBITRARY division of the haplotypes into U and L parts.)

G* is the self-variance age of the whole population; G*[U] is the self-variance age of the “U part”, and G*[L] is the self-variance age of the “L part”.
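The identity can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes the self-variance age to be the sum over STRs of the population variance (dividing by N, not N-1) divided by M, and it uses randomly generated haplotypes and made-up mutation rates. The function names are hypothetical. With the N-1 normalization the identity would hold only approximately.

```python
import math
import random

def self_variance_age(haps, rates):
    """Assumed G*: sum over STRs of the population variance (divide by N), over M."""
    n = len(haps)
    total = 0.0
    for s in range(len(rates)):
        mean = sum(h[s] for h in haps) / n
        total += sum((h[s] - mean) ** 2 for h in haps) / n
    return total / sum(rates)

def inter_population_age(U, L, rates):
    """G[UL]: squared differences over all (u, l) pairs, over 2 M N[U] N[L]."""
    total = sum((u[s] - l[s]) ** 2
                for s in range(len(rates)) for u in U for l in L)
    return total / (2 * sum(rates) * len(U) * len(L))

def identity_rhs(U, L, rates):
    """Right-hand side: G* + {G*-G*[U]} f[U]/(2f[L]) + {G*-G*[L]} f[L]/(2f[U])."""
    f_u = len(U) / (len(U) + len(L))
    f_l = 1 - f_u
    g = self_variance_age(U + L, rates)
    g_u = self_variance_age(U, rates)
    g_l = self_variance_age(L, rates)
    return g + (g - g_u) * f_u / (2 * f_l) + (g - g_l) * f_l / (2 * f_u)

random.seed(0)
rates = [0.002, 0.003, 0.0025]   # made-up per-STR mutation rates
haps = [[random.randint(10, 16) for _ in rates] for _ in range(40)]
U, L = haps[:15], haps[15:]      # an arbitrary split, not a meaningful one

lhs = inter_population_age(U, L, rates)
rhs = identity_rhs(U, L, rates)
print(math.isclose(lhs, rhs, rel_tol=1e-9))
```

Because the split here is arbitrary, the check says nothing about TMRCA; it only confirms that the two sides agree for any division into U and L.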

I suggest the division of the population into U and L can be done under certain circumstances so that the G[UL] estimate is greater than G* and indeed crowds toward the TMRCA estimate for the whole population (which we cannot obtain directly, since we have not assumed we know the founding haplotype).

Recently I have discussed that, for haplotypes with many STRs, the STRs whose self-variance deviates most on the up side of MG* (G* estimated from all the STRs in the usual way) are the best candidates for having had their earliest mutation in the tree happen in the first branch segments descending from the founder. My site http://knordtvedt.home.bresnan.net shows some tables which confirm this statistically. This is especially so if the distribution of repeat values for these “early mutators” shows up as quite asymmetric and suggestive of the superposition of two normal distributions centered on neighboring repeat values, supporting that early mutational split.

Let’s say such an STR is found by examination of the STR distributions from the full set of haplotypes. Picking a boundary for that STR, U will consist of all haplotypes above the boundary and L of all haplotypes below it (hence the letters: U for upper, L for lower). For example, if DYS390 is the STR showing a bimodal distribution and unusually large variance, and comparable haplotype counts are seen with 22 repeats and 23 repeats, then U consists of all haplotypes with 23 or more repeats, and L consists of all haplotypes with 22 or fewer repeats.
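The selection and split described here might be sketched as follows. This is one possible reading, not a tested procedure: it assumes “deviates most on the up side” can be scored per STR as that STR's variance divided by (its own mutation rate times G*), and it uses a tiny invented data set whose first column stands in for the bimodal STR. The function names, rates and haplotypes are all hypothetical.

```python
def population_variance(values):
    """Variance with division by N."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def rank_early_mutator_candidates(haps, rates):
    """Order STR indices by variance / (rate * G*), largest ratio first.

    The ratio is an assumed score for 'variance on the up side of expectation'.
    """
    variances = [population_variance([h[s] for h in haps])
                 for s in range(len(rates))]
    g_star = sum(variances) / sum(rates)
    ratios = [variances[s] / (rates[s] * g_star) for s in range(len(rates))]
    return sorted(range(len(rates)), key=lambda s: ratios[s], reverse=True)

def split_at_boundary(haps, str_index, lowest_upper_value):
    """U: haplotypes at or above lowest_upper_value on the chosen STR; L: the rest."""
    U = [h for h in haps if h[str_index] >= lowest_upper_value]
    L = [h for h in haps if h[str_index] < lowest_upper_value]
    return U, L

# Invented haplotypes: column 0 is split between 22 and 23 repeats.
rates = [0.003, 0.002, 0.002]
haps = [
    [22, 13, 14], [22, 13, 15], [21, 13, 14], [22, 14, 14],
    [23, 13, 14], [23, 13, 14], [24, 13, 15], [23, 14, 14],
]

ranking = rank_early_mutator_candidates(haps, rates)
U, L = split_at_boundary(haps, ranking[0], 23)
print(ranking[0], len(U), len(L))
```

The boundary value itself (23 here) is still chosen by eye from the repeat-value distribution; the sketch does not try to detect bimodality.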

So with minor exceptions (errors of assignment), our U and L populations are probably the descendants of one or the other of the sons of the founder, respectively. Almost all the branch routes from a haplotype of U to a haplotype of L will then travel back to the founder and forward again. Our G[UL] estimate should then be quite close to the TMRCA estimate.

The expression above then indicates that G*[U] and G*[L] should be diminished from G* as G[UL] goes up toward TMRCA, and that makes sense, as we are hopefully almost completely segregating haplotypes into those that did and did not experience that earliest mutation of the chosen STR.
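A small numeric illustration of that trade-off, using only the identity itself. The ages plugged in are invented round numbers, not estimates from any data set, and the function name is hypothetical.

```python
def g_ul_from_identity(g, g_u, g_l, f_u):
    """G[UL] = G* + {G*-G*[U]} f[U]/(2f[L]) + {G*-G*[L]} f[L]/(2f[U])."""
    f_l = 1 - f_u
    return g + (g - g_u) * f_u / (2 * f_l) + (g - g_l) * f_l / (2 * f_u)

# Parts as varied as the whole (what a random split tends to give):
# G[UL] is no larger than G*.
print(g_ul_from_identity(1000, 1000, 1000, 0.3))   # 1000.0

# Equal halves whose self-variance ages are below G*: G[UL] rises above G*.
print(g_ul_from_identity(1000, 700, 600, 0.5))     # 1350.0
```

For equal fractions the identity reduces to G[UL] = 2G* - (G*[U]+G*[L])/2, so every year taken off the average of the two parts' self-variance ages is added to G[UL].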

This scheme will be tried out as time goes by. I’m sure many are just so excited for another type of variance age estimate coming forth! Ken