
From: "John McEwan" <>
Subject: FW: [DNA] Re: Ken's Haplogroup I website: part 2
Date: Sat, 29 Oct 2005 17:08:42 +1300

Part 2
Note here that the previous method placed EZVQF in I1aSTR6, but the
cluster method places him in I1aSTR5! Why? Well there are a number of
answers to this. The first answer is that you need to consider the next layer
of complexity: the variability observed at each marker and the
variability across markers observed within each "cluster". The second is
that the cluster approach places an individual where its closest
"neighbours" are located and this may not be where the closest modal
match for a cluster is.
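To make the second point concrete, here is a minimal Python sketch. The
clusters "A" and "B", the six-marker haplotypes and the simple mismatch
count are all invented for illustration (they are not real I1a data, and
the mismatch count is not the distance measure discussed later). It only
shows that "closest modal" and "closest neighbour" can disagree.

```python
from collections import Counter

def mismatches(h1, h2):
    """Count the markers at which two haplotypes differ."""
    return sum(a != b for a, b in zip(h1, h2))

def modal(haplotypes):
    """Per-marker most common value across a cluster's members."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*haplotypes))

# Hypothetical clusters: A has one outlying member (the last one).
clusters = {
    "A": [
        (13, 22, 10, 11, 14, 30),
        (13, 22, 10, 11, 14, 30),
        (13, 22, 10, 11, 14, 30),
        (14, 23, 10, 12, 14, 30),
    ],
    "B": [
        (14, 23, 11, 12, 15, 30),
        (14, 23, 11, 12, 15, 30),
        (14, 23, 11, 12, 15, 30),
    ],
}
query = (14, 23, 10, 12, 14, 31)

# Placement by the closest cluster modal.
by_modal = min(clusters, key=lambda c: mismatches(query, modal(clusters[c])))

# Placement by the single closest member of any cluster.
by_neighbour = min(
    clusters, key=lambda c: min(mismatches(query, h) for h in clusters[c])
)

print(by_modal, by_neighbour)
```

Here the query sits 3 mismatches from the modal of B and 4 from the modal
of A, yet its closest individual (1 mismatch) is the outlying member of A,
so the two approaches place it differently.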

The reasons above now get rather slippery to explain, but as a
generalisation, where you are placed depends on two things: the "distance
measure" used and the method of joining members together.

There are at least as many possible distance measures that can be used as
there are fingers on my hands. All use certain properties of the variability
in the data and all have underlying assumptions about the mutation
process that may, or may not, hold true on average and in a specific

I tend to use a distance measure called Da

Explicitly the equation is

Da = 1 - (1/r) * sum(j = 1 to r) sum(i = 1 to mj) sqrt(Xij * Yij)

where r = number of loci, Xij and Yij are the frequencies of the ith allele
at the jth locus of the populations X and Y respectively (in this case the X
population is the haplogroup and Y is the individual being compared, but for
close matches both are individuals), and mj is the number of
alleles at the jth locus.
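As a rough illustration, the Da equation can be sketched in Python as
follows. The function names, the three-marker cluster and its allele
frequencies are made up for the example; an individual is treated as a
"population" in which the allele carried has frequency 1 at each marker.

```python
from math import sqrt

def da_distance(freqs_x, freqs_y):
    """Da between two populations.

    freqs_x, freqs_y: one dict per locus, mapping allele -> frequency.
    """
    if len(freqs_x) != len(freqs_y):
        raise ValueError("both populations must cover the same loci")
    r = len(freqs_x)
    total = 0.0
    for locus_x, locus_y in zip(freqs_x, freqs_y):
        # Alleles missing from either population contribute sqrt(0) = 0.
        shared = set(locus_x) & set(locus_y)
        total += sum(sqrt(locus_x[a] * locus_y[a]) for a in shared)
    return 1.0 - total / r

def individual_as_population(haplotype):
    """Represent one haplotype as frequencies of 1.0 for the alleles carried."""
    return [{allele: 1.0} for allele in haplotype]

# Hypothetical cluster frequencies at three markers.
cluster = [
    {13: 0.9, 14: 0.1},
    {22: 0.25, 23: 0.5, 24: 0.25},
    {10: 1.0},
]
person = individual_as_population([13, 23, 10])
d_cluster = da_distance(cluster, person)

# Two individuals differing at one of three markers.
d_pair = da_distance(
    individual_as_population([13, 23, 10]),
    individual_as_population([13, 24, 10]),
)
print(round(d_cluster, 4), round(d_pair, 4))
```

Between two individuals Da reduces to the fraction of markers that differ,
which is why for close matches it behaves like a simple mismatch count.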

A quick examination of this measure shows:
1) each mutation is treated equally (infinite alleles): a change of 2
steps counts the same as a change of 1. In practice, however, some of this
information is captured because if there are random walk single step
changes you get an orderly decline in frequency from the ancestral
variant and the frequencies ARE used. The Y STR markers commonly used
also do not always behave as if they mutate over time via a random walk
of single step changes; periodic larger step changes are observed
(typically to smaller values) and some compound markers also alter due
to recombinational loss of heterozygosity, RecLOH (i.e. a 15,17 for
example may converge to 15,15). Other compound markers have scoring
conventions that cause further problems, e.g. 464. Using a method that
assumes change by
single steps means it will have difficulty where RecLOH or larger
changes have occurred. There are a number of likely cases where this has
happened referred to on the list. The Da distance handles these events
2) each marker is treated equally, and they are NOT weighted by prior
(outside) knowledge of their mutation rate (but they ARE weighted by the
properties of the variability observed within the data set which
IMPLICITLY means these weights are correlated to the mutation rate). A
mutation in a highly conserved marker has more impact overall, than the
same change in a highly variable marker.
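A small worked example of both points, again in Python with invented
frequencies. When an individual is compared with a cluster, each marker
contributes 1 - sqrt(frequency of the individual's allele in the cluster)
to Da (before dividing by the number of markers), so the "weight" comes
entirely from the cluster's own frequency data.

```python
from math import sqrt

def locus_penalty(cluster_freqs, allele):
    """One marker's contribution to Da for an individual carrying `allele`."""
    return 1.0 - sqrt(cluster_freqs.get(allele, 0.0))

# Hypothetical within-cluster allele frequencies.
conserved = {10: 0.98, 11: 0.02}           # nearly everyone carries 10
variable = {22: 0.30, 23: 0.40, 24: 0.30}  # widely spread around 23

# Extra penalty for carrying a one-step variant instead of the modal value.
cost_conserved = locus_penalty(conserved, 11) - locus_penalty(conserved, 10)
cost_variable = locus_penalty(variable, 24) - locus_penalty(variable, 23)

# An allele never seen in the cluster gets the full penalty of 1.0,
# whether it is one step or five steps from the modal value.
unseen_near = locus_penalty(conserved, 9)
unseen_far = locus_penalty(conserved, 5)

print(round(cost_conserved, 3), round(cost_variable, 3))
```

With these numbers the one-step change costs about ten times more at the
conserved marker than at the variable one, without any outside mutation
rate being supplied.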

The method that Whit Athey uses in his haplotype predictor
also explicitly makes use of allele variability information at each
marker for each group. If a haplotype carries an allele that is rare at a
marker within a haplogroup, it is much less likely to have come from that
group. However, currently he restricts it to SNP-defined haplogroups.
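The general idea of scoring a haplotype against per-group allele
frequencies can be sketched as below. This is NOT Whit Athey's actual
implementation; the geometric-mean score, the `floor` value for unseen
alleles, the group names and all frequencies are assumptions chosen only
to show how one rare allele can outweigh good matches elsewhere.

```python
from math import prod

def frequency_score(group_freqs, haplotype, floor=0.001):
    """Geometric mean of the group's frequencies for the alleles carried.

    `floor` (an assumed value) stops an unseen allele from zeroing the score.
    """
    freqs = [
        max(locus.get(allele, 0.0), floor)
        for locus, allele in zip(group_freqs, haplotype)
    ]
    return prod(freqs) ** (1.0 / len(freqs))

# Hypothetical allele frequencies for two groups at three markers.
group_x = [{13: 0.9, 14: 0.1}, {22: 0.2, 23: 0.8}, {10: 0.99, 11: 0.01}]
group_y = [{13: 0.5, 14: 0.5}, {22: 0.5, 23: 0.5}, {10: 0.6, 11: 0.4}]

typical = (13, 23, 10)     # common alleles for group X at every marker
rare_in_x = (13, 23, 11)   # same, but 11 is rare in X at the third marker

typical_x = frequency_score(group_x, typical)
typical_y = frequency_score(group_y, typical)
rare_x = frequency_score(group_x, rare_in_x)
rare_y = frequency_score(group_y, rare_in_x)

print(typical_x > typical_y, rare_x < rare_y)
```

The first haplotype scores best against group X, but changing a single
marker to an allele that is rare in X is enough to make group Y the
better fit.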

Ken on the other hand tends to talk about slow moving markers and places
more weight on their changes (as does Da above) and he implicitly uses
the frequency distribution of the marker as well to determine WHEN the
change occurred: low frequency means recent. In some cases this verges
on the assumption that it has been a UEP (i.e. only happened once) so
that it can be used unambiguously.

The methodology of clustering also has an important effect, but is less
relevant in this discussion.

So.... what am I saying?

My conclusion is that when the list is asked about a haplotype most of
us continue to use our quick and dirty, but simple to understand, marker
or genetic distance based methods to place people in R1b Scots, Irish,
Frisian, I1c Isles or whatever group, but we need a better tool which
at least also uses frequency data within clusters. One option is
for Whit to update his calculator to include these STR defined clusters
(and this to some extent means that there is general agreement that those
chosen really exist in the first place) or a list member needs to derive
a better alternative. Dean McGee already has a site as well where such a
tool could also be added. Whoever does it will have the most highly
visited site on the web :-)


John McEwan
