How does FastME compare with Weighbor?


(under construction; mail comments to billb@lanl.gov)

Desper and Gascuel have introduced an innovative new tree building algorithm called FastME. It is certainly fast, dramatically faster than Neighbor Joining (NJ) in fact. Desper and Gascuel found that in their tests using large, randomly generated trees with a fairly minor amount of molecular clock variation, FastME gives more accurate trees than Weighbor. When we generated random trees that strongly violate the molecular clock, Weighbor was found to be more accurate than FastME. Here is a plot based on randomly generated 50 taxa trees:
error plot
Each point represents the average of 100 trees randomly generated using PAML (see Methods). We see that on trees that strongly violate the molecular clock, FastMe is not always better than NJ, and is not as good as BIONJ or Weighbor 1.2.

Why should FastME do so well when there is a molecular clock? One reason is that FastME has a "long branch attraction" (LBA) bias. On four taxa, FastME is equivalent to NJ, which means they both have a tendancy to join long branches. When the molecular clock holds, joining long branches will tend to give the correct tree.

If LBA is the explanation, why doesn't NJ also do well on the D-G trees? In fact NJ, a single-pass method, actually does better than Desper & Gascuel's single-passs GME and BME methods in most of their tests. It is only by virtue of branch swapping that FastME does better than NJ in their tests. These swaps do not consistently improve the tree from the point of view of long branch distraction (see below), but without the swaps the LBA bias is reduced on 8taxon test trees.

Bias towards joining long branches in a tree with two four-taxon star trees connected by a branch of length 0.2.

So does that mean that FastME is only better when the data approximately satify a molecular clock? We think so. Here is an example of a tree from the Desper-Gascuel test set on which FastME performed better than Weighbor:

tree violating clock on some branches
Note that there are regions of the tree where the molecular clock is evident visually.

The next graph shows the distributions used to vary rates so as to violate the molecular clock. We used a standard exponential distribution, which is the least arbitrary distribution in the sense that it is the maximum entropy distribution for a positive variable with finite mean. D-G use an exponential distribution that has been chopped, leaving a sharp, discontinuous peak at the value 1.
rate plots
PDF's of the Desper-Gascuel (DG) distribution in red, and our distribution (BHS) in green.

We have also tested FastME against the trees published in our 2000 MBE paper.

On the 8-taxon trees that test for "long branch distraction," we find that FastME is almost as bad as NJ, and not as good as BIONJ or Weighbor.

In the 4-taxon bias test, FastME gives exactly the same results as NJ, meaning both have a noticeable bias on distance matrices from finite-length sequences; the bias increases as the lengths of the longer branches increase.

Methods

Trees were generated using evolver from Ziheng Yang's PAML 3.0c package. Parameters were chosen as 6.7 for species birth, 2.5 for species death, and 0.3333 for sampling rate, corresponding to Yang and Rannala's estimates for primate speciation [Mol. Biol. Evol. 14:717 (1997)]. The trees were then made to violate the molecular clock by multiplying each branch length by a random number chosen from an exponential distribution. Branches were then rescaled by a constant, so that the longest branch length equaled each value to be plotted. Sequences were simulated and distances calculated under the Jukes-Cantor model, and input order of the taxa was randomized. Infinite values were replaced by the largest obtainable finite value plus .75 ln(4).