
From: "Ken Nordtvedt"
Subject: Order or Information (Anti-Entropy) in Haplotype Correlations
Date: Sat, 18 Jun 2005 13:46:43 -0600

I will get the more formal discussion of correlations in haplotype populations out of the way in this message, and then in a later message discuss the practical search techniques for sub-populations within haplotype populations.

About 50 years ago, MIT-trained electrical engineer Claude Shannon, then at Bell Labs, devised a measure of information content within a string of symbols which turns out to be identical to the negative of the entropy (disorder) measure invented by physicists a century ago to quantify thermal or statistical disorder in physical systems. Shannon's information measure was created to help communications engineers figure out optimal ways to code messages for economical transmission, and for related signal detection applications for military and civilian purposes.

If we have a population of haplotypes h with each showing up at frequency f{h}, then the "information" or order or negative entropy of that collection would be

I(1) = [Sum over haplotypes h] of f{h} Log f{h}

("Log" indicates logarithm.)
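As a rough illustration of the I(1) sum, here is a minimal Python sketch. The function name joint_information, the use of the natural logarithm, and the toy sample are my own assumptions for illustration; the allele values are made up, not taken from any database.

```python
from collections import Counter
from math import log

def joint_information(haplotypes):
    """I(1): sum over distinct haplotypes h of f{h} * log f{h}.

    `haplotypes` is a list of equal-length tuples of allele lengths.
    The natural logarithm is an assumption; the text leaves the base open.
    """
    counts = Counter(tuple(h) for h in haplotypes)
    total = sum(counts.values())
    return sum((c / total) * log(c / total) for c in counts.values())

# Made-up toy sample: four haplotypes, three markers each.
sample = [(13, 24, 14), (13, 24, 14), (13, 25, 14), (14, 23, 15)]
print(joint_information(sample))  # negative; 0 only if every haplotype is identical
```

The value is always zero or negative, and it gets more negative as the population spreads over more distinct haplotypes.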

If all the M markers used to define these haplotypes had evolved by independent mutations from a single founder (the no correlation case), then this "information" in the population can be expressed as the sum of the information in the individual marker distributions of allele length. Call m the marker label running from 1 to M, and call n(m) the allele length variable for marker m. Then the uncorrelated information in the population is given by

I(2) = [Sum over markers m] of [Sum over allele lengths n(m)] of f{n(m)} Log f{n(m)}

In this second expression f{n(m)} are the population's single-marker frequency distributions of allele lengths, such as Gordon Hamilton produced for haplogroup I1a from the Sorenson database last summer, and others have more recently produced for R1b and other haplogroups, subclades, and varieties within haplogroups.
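The uncorrelated sum I(2) can be sketched the same way, one marker at a time. Again the function name marker_information, the natural logarithm, and the made-up toy sample are assumptions for illustration only.

```python
from collections import Counter
from math import log

def marker_information(haplotypes):
    """I(2): for each marker, sum f * log f over its allele-length frequencies,
    then add the per-marker sums together."""
    total = len(haplotypes)
    info = 0.0
    # zip(*haplotypes) yields one tuple of allele lengths per marker.
    for alleles in zip(*haplotypes):
        for count in Counter(alleles).values():
            f = count / total
            info += f * log(f)
    return info

# Made-up toy sample: four haplotypes, three markers each.
sample = [(13, 24, 14), (13, 24, 14), (13, 25, 14), (14, 23, 15)]
print(marker_information(sample))
```

Only the single-marker frequency tables are needed here, so this quantity could be computed from published allele-length distributions without access to the full haplotypes.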

I(2) will necessarily show at least as much disorder as the actual measure I(1). The difference is a measure of how strongly the allele lengths at the different markers are correlated with each other.

How can independently mutating markers get correlated? If one or more additional founders among the descendants of the original founder go off and start their own populations of descendants which flourish unusually well, then two or more of the markers in those descendant populations may develop mutational distributions of allele lengths centered on different modal values than those of the original population. This correlates the distributions of the different markers, which means simply that your knowledge of the allele length at one marker gives you some information which alters the expected distribution of allele lengths at other markers. Sometimes this correlation is very strong. For instance, an actual situation: if I know that a haplotype has DYS455 = 8, then with very high probability I know its YCAIIa,b = 19,21 and its DYS388 = 14. Sometimes the correlations are weaker and harder to directly perceive.

We would like a first quantitative measure of all the hidden correlations in a population before we begin the task of actually discovering what they are specifically. If I(1) were very nearly equal to I(2), there would be little reason to invest lots of effort in searching for correlations or sub-populations. It would be interesting to evaluate I(1) and I(2) for R1b and do the same for haplogroup I. I suspect we could objectively measure the apparently much higher level of correlations in haplogroup I than in R1b.
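To make concrete what it means for one marker to alter the expected distribution at another, here is a small Python sketch that tabulates the allele frequencies at one marker among only those haplotypes showing a given value at a second marker. The function name conditional_distribution and the toy records are hypothetical; the numbers are invented to mimic the DYS455 = 8 situation and are not real data.

```python
from collections import Counter

def conditional_distribution(haplotypes, given_marker, given_value, target_marker):
    """Allele frequencies at target_marker among haplotypes
    whose given_marker equals given_value."""
    subset = [h[target_marker] for h in haplotypes if h[given_marker] == given_value]
    counts = Counter(subset)
    return {allele: c / len(subset) for allele, c in counts.items()}

# Invented toy records, one dict of marker -> allele length per haplotype.
toy = [
    {"DYS455": 8, "DYS388": 14},
    {"DYS455": 8, "DYS388": 14},
    {"DYS455": 8, "DYS388": 14},
    {"DYS455": 11, "DYS388": 12},
    {"DYS455": 11, "DYS388": 13},
    {"DYS455": 11, "DYS388": 12},
]

print(conditional_distribution(toy, "DYS455", 8, "DYS388"))   # {14: 1.0}
print(conditional_distribution(toy, "DYS455", 11, "DYS388"))  # spread over 12 and 13
```

When the conditional table differs from the whole-population table, the two markers are correlated in the sense used above.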

So the quantity 1 - I(1) / I(2) is some measure of the correlation content in the population of haplotypes. I still need to establish norms and levels of significance for this measure before I publish this approach.
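A self-contained Python sketch of the measure 1 - I(1) / I(2) follows. The function names, the natural logarithm (the ratio does not depend on the base), the made-up toy populations, and the choice to return 0 when I(2) is zero are all my assumptions for illustration.

```python
from collections import Counter
from math import log

def joint_information(haplotypes):
    """I(1): sum over distinct haplotypes of f * log f."""
    total = len(haplotypes)
    counts = Counter(tuple(h) for h in haplotypes)
    return sum((c / total) * log(c / total) for c in counts.values())

def marker_information(haplotypes):
    """I(2): sum over markers of each marker's own sum of f * log f."""
    total = len(haplotypes)
    return sum(
        (c / total) * log(c / total)
        for alleles in zip(*haplotypes)       # one column of alleles per marker
        for c in Counter(alleles).values()
    )

def correlation_measure(haplotypes):
    """1 - I(1) / I(2); returns 0.0 when I(2) is zero (no variation at all),
    which is an assumed convention, not something stated in the text."""
    i2 = marker_information(haplotypes)
    if i2 == 0:
        return 0.0
    return 1.0 - joint_information(haplotypes) / i2

# Made-up two-marker populations.
independent = [(a, b) for a in (13, 14) for b in (23, 24)]  # every combination once
linked = [(13, 23), (13, 23), (14, 24), (14, 24)]           # markers move together

print(correlation_measure(independent))  # close to 0
print(correlation_measure(linked))       # 0.5
```

With independent markers the measure sits at zero, and it rises as knowledge of one marker pins down the others. In the two-marker toy case with perfectly linked markers it reaches one half.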

