GENEALOGY-DNA-L Archives

From: "Ken Nordtvedt" <>
Subject: [DNA] Some Variance Relationships
Date: Thu, 4 Mar 2010 11:35:29 -0700

A nice thing about the form of STR variances and similar moments is that there are different ways to manipulate them mathematically. GD (genetic distance) is composed of a sum of absolute values whose properties are difficult to examine arithmetically; by contrast, the math maker in the sky has shined his blessing upon the arithmetical properties of expectation values for variances and higher moments.

Here are some results which end up being expressed in forms some may find more intuitive or revealing.

1. The square of the sigma of the G estimate, when computing variance with respect to an assumed founding haplotype:

[SigmaG]^2 = {<Gij> + [G - <Gij>] / N } / M

Here M is the sum of the STR mutation rates, G is the clade TMRCA, <Gij> is the average shared initial tree branch length over the N(N-1)/2 haplotype pairs, and N is the number of haplotypes.
<Gij> is the property of the tree that requires knowing the tree structure in order to evaluate it. So there is an intrinsic part of SigmaG which cannot be reduced by enlarging the haplotype sample size, plus a part which is reducible. The latter comes from the parts of the tree which are the independent terminal branch lines to the N haplotypes. The former comes from the part of the tree where multiple haplotype branch lines share the same tree branch segments.
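
As a rough illustration of the intrinsic/reducible split, here is a minimal Python sketch. The function name and the numbers (G = 100 generations, <Gij> = 20 generations, M = 0.25 per generation) are hypothetical, chosen only to show how the second term falls off with N while the first does not.

```python
def sigma_g_squared_parts(g, mean_gij, n, m):
    """Split [SigmaG]^2 = {<Gij> + [G - <Gij>] / N} / M into its two terms."""
    intrinsic = mean_gij / m              # shared-branch part, independent of N
    reducible = (g - mean_gij) / (n * m)  # terminal-branch part, shrinks as 1/N
    return intrinsic, reducible


# Hypothetical clade values, for illustration only.
for n in (10, 100, 1000):
    intrinsic, reducible = sigma_g_squared_parts(100.0, 20.0, n, 0.25)
    sigma_g = (intrinsic + reducible) ** 0.5
    print(n, intrinsic, reducible, sigma_g)
```

With these made-up inputs the intrinsic term stays fixed while the reducible term drops by a factor of ten each time N grows tenfold, so SigmaG approaches the floor set by <Gij> / M.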

But remarkably, the right-hand side of the above is also the expected value of [G - G*] / M (TMRCA estimate minus coalescence age estimate, divided by M) for the same sample size N.

-------------------------------------------------------------------
So SigmaG = SquareRoot { [G - G*] / M }
------------------------------------------------------------------

If one is willing to estimate both G and G* using the same N haplotype sample, the size of the confidence interval for the G estimate can be estimated.

Some people have been looking at how G estimates bounce around as they use independent collections of N haplotypes to estimate G. The analytics predict:

<[ G(SA) - G(SB) ]^2> = 2 [G - <Gij>] / [NM]
Here G(SA) is the G estimate using the N-haplotype sample A, and G(SB) is the G estimate using the N-haplotype sample B.
This should get pretty small as N gets large. This expected squared fluctuation in the G estimate from one sample collection to another is twice the reducible part of the total [SigmaG]^2 above.
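
To get a feel for the size of this sample-to-sample scatter, the prediction can be evaluated directly. This is only a sketch: the function name is made up, and the inputs (G = 100 generations, <Gij> = 20 generations, M = 0.25 per generation) are hypothetical values, not taken from any real clade.

```python
import math


def expected_squared_difference(g, mean_gij, n, m):
    """Predicted <[G(SA) - G(SB)]^2> = 2 [G - <Gij>] / [N M]."""
    return 2.0 * (g - mean_gij) / (n * m)


# Hypothetical clade values, for illustration only.
for n in (10, 100, 1000):
    rms = math.sqrt(expected_squared_difference(100.0, 20.0, n, 0.25))
    print(f"N = {n:5d}: RMS difference between two G estimates = {rms:.1f} generations")
```

Because the expression scales as 1/N, the RMS difference between two independent G estimates falls only as 1/sqrt(N).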

The two estimators used above are:

G = Sum i {r(i) - rf}^2 / (N M)

G* = Sum i Sum j {r(i) - r(j)}^2 / (2 M N^2)
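
A minimal Python sketch of these two estimators and the resulting SigmaG follows. It assumes that r(i) is the vector of repeat values of haplotype i, that rf is the assumed founding haplotype, and that the squared differences are also summed over the STR markers (so that M, the sum of mutation rates, is the matching normalization). The function names, the four toy haplotypes, and M = 0.004 are all hypothetical.

```python
import math


def estimate_g_and_g_star(haplotypes, founder, m):
    """Return (G, G*) for a list of haplotypes given as lists of repeat values."""
    n = len(haplotypes)
    # G: squared deviations from the assumed founder, over haplotypes and markers.
    sum_founder = sum(
        (r - f) ** 2 for hap in haplotypes for r, f in zip(hap, founder)
    )
    # G*: squared differences over all ordered haplotype pairs and markers.
    sum_pairs = sum(
        (a - b) ** 2
        for hap_i in haplotypes
        for hap_j in haplotypes
        for a, b in zip(hap_i, hap_j)
    )
    return sum_founder / (n * m), sum_pairs / (2 * m * n ** 2)


def sigma_g(g, g_star, m):
    """SigmaG = SquareRoot { [G - G*] / M }."""
    return math.sqrt((g - g_star) / m)


# Toy data, for illustration only: 4 haplotypes, 2 markers.
haplotypes = [[12, 14], [13, 14], [12, 15], [12, 14]]
founder = [12, 14]
m = 0.004  # hypothetical sum of mutation rates for these 2 markers

g, g_star = estimate_g_and_g_star(haplotypes, founder, m)
print(g, g_star, sigma_g(g, g_star, m))
```

One design note: with these definitions G - G* works out algebraically to the squared offset of the sample mean from the founder value, summed over markers and divided by M, so it is never negative and the square root is always defined.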