From: "Sandy Paterson"
Subject: Re: [DNA] variance quiz game
Date: Thu, 10 Feb 2011 08:49:42 -0000
References: <002701cbc7a9$8c29ef30$c2482dae@Ken1> <000001cbc842$3e876b10$bb964130$@com><000c01cbc86c$2641c9e0$c2482dae@Ken1> <000301cbc875$0ed2dca0$2c7895e0$@com><00c901cbc891$ae5ec290$c2482dae@Ken1> <000001cbc895$b479baf0$1d6d30d0$@com><010c01cbc8a7$460557c0$c2482dae@Ken1>

I'm with you up to the point where you consider the entire N(N-1)/2 possible
pair-wise comparisons.

And yes, I can see that some of them will have a low (young) TMRCA, some
will be middle of the road, some would be close to the group TMRCA and some
would equal that of the group.

But surely the weightings are obvious? The TMRCA of the pair judged to have
the largest TMRCA must have a weighting of 1 with a weighting of 0 for all
others? Looking at it from the point of view of a distribution:

The most common pair-wise TMRCA will be 1, followed by 2, then 3, and so on,
up to T (unknown), the group TMRCA?

Or am I on a different planet?

Sandy

[[[ The windy version was meant for something other than being windy, but I
judged that it ended up being "sufficiently" windy --- it met the threshold
--- to justify a brief version.

Anyway, this may be relevant to what I believe you are working on.

Weighting different haplotype pairs in a sum of their distance measures to
form a time estimate for a clade/haplogroup population of haplotypes seems
to be possible.

One of the three varieties of variance I mentioned in my previous message
manifestly does involve haplotype pairs of different time depths. When we
consider the N(N-1)/2 pairs of haplotypes from a population of N haplotypes,
clearly some in reality have a young TMRCA, some have middle-of-the-road
TMRCAs and some have large TMRCAs equal to the actual TMRCA of the whole
tree for those N haplotypes. This justifies weighting. Here's my formal
take on how to do that. We're talking about the coalescence age.

Gcoal = Sum over p of [ Var(p) w(p) ] / { 2M Sum over p of [ w(p) ] }

where the label "p" runs from 1 to N(N-1)/2 and means a specific pair of
haplotypes.

Expectation value <Var(p)> = 2M TMRCA(p)
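
As a rough illustration of the weighted sum above, here is a minimal Python
sketch. The function name gcoal, the value of M and the three pair TMRCAs are
made up for the example, and each Var(p) is simply set to its expectation
value 2M TMRCA(p) rather than measured from real haplotypes. It compares
equal weights with the "weight 1 on the deepest pair, 0 on all others" choice.

```python
def gcoal(var, w, M):
    """Weighted estimate: Sum[Var(p) w(p)] / (2M Sum[w(p)])."""
    if len(var) != len(w):
        raise ValueError("need one weight per haplotype pair")
    total_w = sum(w)
    if total_w == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * wp for v, wp in zip(var, w)) / (2.0 * M * total_w)


# Hypothetical inputs: N = 3 haplotypes give N(N-1)/2 = 3 pairs.
M = 0.05                              # assumed value, illustration only
tmrca = [10.0, 20.0, 40.0]            # made-up pair TMRCAs
var = [2.0 * M * t for t in tmrca]    # each Var(p) set to 2M TMRCA(p)

# Equal weights: the estimate is the average of the pair TMRCAs.
uniform = gcoal(var, [1.0] * len(var), M)

# Weight 1 on the pair with the largest variance, 0 on all others.
deepest = max(range(len(var)), key=lambda p: var[p])
one_hot = [1.0 if p == deepest else 0.0 for p in range(len(var))]
deepest_only = gcoal(var, one_hot, M)

print(uniform, deepest_only)
```

With noise-free inputs like these, equal weights return the mean of the pair
TMRCAs and the one-hot weights return the largest one; with real data each
Var(p) fluctuates about its expectation, which is where the choice of weights
matters.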

Consider the expectation value of the correlation matrix C(p,p') of
statistical fluctuations of the Var(p) about their expected values:

<[Var(p)-<Var(p)>][Var(p')-<Var(p')>]> = C(p, p')
C(p, p') is a matrix of dimension N(N-1)/2 by N(N-1)/2

The best weights for minimizing the sigma of the Gcoal estimate then obey the
matrix eigenvalue equation for its smallest eigenvalue k:

Sum over p' of [ C(p, p') w(p') ] = k w(p)

If the Correlation Matrix were known, the best weights could be determined.
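
Purely to show what solving the eigenvalue equation above would involve if C
were known, here is a small Python sketch. The 3 by 3 matrix is invented for
illustration and is not an estimate of any real Correlation Matrix, and the
helper name smallest_eigenpair is hypothetical. It uses shifted power
iteration, one simple way among several to get the smallest eigenvalue of a
symmetric matrix.

```python
import math


def smallest_eigenpair(C, iters=2000):
    """Return (k, w) with C w = k w for the smallest eigenvalue k of symmetric C."""
    n = len(C)
    # Gershgorin bound: no eigenvalue exceeds the largest absolute row sum.
    s = max(sum(abs(x) for x in row) for row in C)
    # B = s*I - C has the same eigenvectors as C, and its largest
    # eigenvalue is s - k, so plain power iteration on B finds w.
    B = [[(s if i == j else 0.0) - C[i][j] for j in range(n)] for i in range(n)]
    w = [1.0 + 0.1 * i for i in range(n)]  # arbitrary starting vector
    for _ in range(iters):
        v = [sum(B[i][j] * w[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    Cw = [sum(C[i][j] * w[j] for j in range(n)) for i in range(n)]
    k = sum(a * b for a, b in zip(w, Cw))  # Rayleigh quotient, w has unit length
    return k, w


# Made-up symmetric matrix standing in for C(p, p') over three pairs.
C = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]

k, w = smallest_eigenpair(C)
print(k, w)
```

The eigenvector is fixed only up to overall scale and sign, so the weights
would still need a normalization, and in this made-up example they do not all
come out with the same sign.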

Unfortunately I see no workable way right now to get an estimate for this
Correlation Matrix. Only the diagonal entries could be estimated.

A similar analysis of the other two types of variance could be done, but the
same problem emerges --- how are the off-diagonal elements of the
Correlation Matrix estimated? One certainly cannot set them to zero. ]]].

