GENEALOGY-DNA-L Archives



From: "John McEwan" <>
Subject: Re: [DNA] Tree Algorithms
Date: Sat, 30 Dec 2006 09:46:15 +1300
In-Reply-To: <7678E4A4-14AA-4F80-BC85-3A800E7F385B@vizachero.com>


Vincent gave a very good summary, at least to my level of knowledge, of
tree-building programs.

A few more comments:
* Most methods tend to identify the same major features, hence the
simplistic approach of trying 5 methods and seeing how they differ.
Most of the major groups we discuss on the list are also robust across
several methods. An example would be the R1bSTR19Irish, aka North West
Irish, aka M222+, group in R1b.
* The distance measure chosen for distance-based methods has an effect,
as STRs have all sorts of strange peculiarities. Making too many
assumptions where they clearly do not hold can cause problems. Multicopy
markers are a very clear example of this. You get to the stage where you
can either include them in a simple model or discard them in a model
with more assumptions; you lose either way.
* Assuming an evolutionary clock is at work has two faults. The first
is that it may not be the case. More importantly, such methods for STRs
are also subject to "bottleneck effects", which have been much discussed
on the list and certainly ARE present in most of the data we deal with.
Extensive simulations show that tree-building algorithms using STR data
that do not make the evolutionary clock assumption give a better
approximation of reality.
* There is definitely a trade-off between the method used and the size
of the problem.
* Be aware that the best method to build a tree may not be one which has
a linear relationship with time.
* Character-state-based methods (which are normally used for, say, SNPs)
are more powerful, but can't be used on many individuals (computational
complexity). They typically also have to make much stronger assumptions,
which may be okay for SNPs but probably not for STRs. However, using a
state-based method for STRs within family groups (say GD<5 out of 37
markers, where only single mutations are likely) may well give better
results.
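
To illustrate the point about the choice of distance measure, here is a
rough sketch in Python. It is not taken from any particular tree-building
program. The haplotype values are made up, and the two function names
(stepwise_distance, infinite_alleles_distance) are my own labels for two
common ways of counting differences: one sums the repeat-count
differences, the other counts each differing marker once. Multicopy
markers are deliberately left out, since neither simple count handles
them well.

```python
from itertools import combinations

def stepwise_distance(a, b):
    """Sum of absolute repeat-count differences across markers.

    Assumes each marker changes one repeat at a time, so a
    difference of 3 repeats counts as 3 steps.
    """
    return sum(abs(x - y) for x, y in zip(a, b))

def infinite_alleles_distance(a, b):
    """Number of markers that differ at all.

    Assumes any difference at a marker is a single event,
    regardless of how many repeats apart the values are.
    """
    return sum(1 for x, y in zip(a, b) if x != y)

def distance_matrix(haplotypes, dist):
    """Pairwise distances for every pair of labelled haplotypes."""
    return {
        (p, q): dist(haplotypes[p], haplotypes[q])
        for p, q in combinations(sorted(haplotypes), 2)
    }

# Hypothetical 4-marker haplotypes, for illustration only.
haplotypes = {
    "A": [13, 24, 14, 11],
    "B": [13, 25, 14, 11],
    "C": [13, 27, 14, 12],
}

stepwise = distance_matrix(haplotypes, stepwise_distance)
infinite = distance_matrix(haplotypes, infinite_alleles_distance)

for pair in sorted(stepwise):
    print(pair, "stepwise:", stepwise[pair], "infinite alleles:", infinite[pair])
```

For close pairs, where every difference is a single step, the two
measures agree. They diverge once a marker differs by more than one
repeat, and any tree built from the distance matrix inherits that
difference.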

Cheers

John McEwan




