GENEALOGY-DNA-L Archives

Archiver > GENEALOGY-DNA > 2006-12 > 1167420704


From: Vincent Vizachero <>
Subject: Re: [DNA] Tree Algorithms
Date: Fri, 29 Dec 2006 13:31:44 -0600
References: <000801c729e8$357c5280$6400a8c0@Ken1>
In-Reply-To: <000801c729e8$357c5280$6400a8c0@Ken1>


On Dec 27, 2006, at 12:52 PM, Ken Nordtvedt wrote:

> A variety of phylogenetic tree drawing software algorithms are
> being used to produce trees for populations of y haplotypes, and we
> often are referred to such trees now in list messages.
>
> Each fork in the trees represents occurence of one or more
> mutations of haplotype markers. I wonder if any of the software
> can be told to keep track of how many times it uses mutations of
> each marker? This would be an interesting test of how close to
> "most likely" tree in a probabilistic sense the complete tree is.
> I think Doug MacDonald first mentioned a property which a good tree
> should have: the various mutations should have occured in the tree
> in proportion to their respective mutation rates.

There are essentially two types of tree building algorithms:
distance based and character-state. Many popular software programs,
including some of the ones most appropriate for building phylogenies
from STR data, are distance based. In short, they accept as an input
the distance (often genetic distance, but TMRCA works too) between
each pair of taxa and never directly work with the markers themselves.

Character-state programs would have no trouble doing the thing you
ask, and many of them do.

I like using Kitsch, which constructs ultrametric trees from a
distance-matrix. Ultrametric trees assume an evolutionary clock is
at work, such that all taxa are considered to be contemporaneous and
are equidistant from the root. Often in biology these are poor
assumptions, but when evaluating non-coding DNA trees (like those
based on Y-STRs) the assumptions are not so bad. I cannot think of
any reason that a character-state program should not be able to
construct an ultrametric tree, but I have yet to find one that does.

> Since some of the algorithms start at the end of the branches and
> join close cousins together, and then work backward in time, and
> the simple minded GD (genetic distance) is often employed which
> treats all markers the same, there is the possibility that the
> algorithm eventually works itself into a corner sometimes and has
> to employ very slow mutating markers too often in completing the tree.

It sounds like you are describing a neighbor joining or UPGMA
algorithm, and these are not generally considered to be reliable at
producing accurate phylogenies (though the clusters they produce at
the tips of the tree are often reasonable). They do have the
advantage of speed, making it possible to use them on very large data-
sets. Other methods are often impossible to use on more than a few
hundred taxa without access to a supercomputer cluster. Their speed
also makes bootstrapping possible, which helps alert users to
unreliable clusters.

Anybody constructing a tree will need to choose a method that is
appropriate for the circumstance. Perhaps someone with more
knowledge of systematics than I have can comment on whether the
inherent conflict I see between having accurate clustering at the
tips and having an internally consistent evolutionary model at the
nodes is, indeed, inherent.

Vince



This thread: