Consider the sequences:

    seq1 CCCCC
    seq2 CCCCD
    seq3 CCCCE
    seq4 CCCCF
    seq5 DDDDG
    seq6 DDDDH

The last position seems to be highly variable, and one might believe that C through H are equally likely there. In the first 4 positions, C might be more likely than D, but it is also conceivable that there is really only one ancestral instance each of C and D in each of positions 1-4, and that they are about equally probable, with C perhaps having an edge. Any sequence weighting scheme that makes C less than twice as probable as D in the first 4 positions will make C through F less probable than G and H in the last position. My program finds that C gets 1.75 counts in the first 4 positions, and D gets 1.1. This is a ratio of 1.6:1, much less than 2:1. On the other hand, C through H all get one count in the last position, so they are all equally probable.

Interestingly, in an earlier version of the documentation, I used a similar example:

    seq1 CCCCCCCC
    seq2 CCCCDDDD
    seq3 CCCCEEEE
    seq4 CCCCFFFF
    seq5 DDDDGGGG
    seq6 DDDDHHHH

It turns out that in the new version of the program, this example gives C:D a ratio greater than 2:1 in the first 4 positions (2.57:1.18). This is not what one initially expects. The reason for this strange answer is that these sequences have the unlikely feature that they exhibit many more changes in the last four positions (20) than in the first four (4). Thus the most likely interpretation is that the first four sites undergo substitutions at a slower rate than the other four. In the model, the only way to slow the rate at a site is to decrease the variability of that site, which is done here by making C more probable. The file "virtcts" shows the number of counts ("virtual counts") caused by this aspect of the likelihood function. In this case, the C's in the first 4 positions have 0.48 virtual counts.
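The two arguments above can be checked with a short sketch. It is not the program described here and does not reproduce its counts (1.75, 1.1, and so on). It only illustrates (a) how a plain sequence weighting scheme ties the first-position C:D ratio to the last-position counts, and (b) one way to arrive at the change counts of 20 and 4. The function names, the weights, and the rule "distinct letters minus one per column" are assumptions made for illustration.

```python
from collections import Counter

EXAMPLE_1 = ["CCCCC", "CCCCD", "CCCCE", "CCCCF", "DDDDG", "DDDDH"]
EXAMPLE_2 = ["CCCCCCCC", "CCCCDDDD", "CCCCEEEE",
             "CCCCFFFF", "DDDDGGGG", "DDDDHHHH"]


def weighted_counts(seqs, weights, pos):
    """Sum the weights of the sequences carrying each letter at column pos."""
    counts = Counter()
    for seq, weight in zip(seqs, weights):
        counts[seq[pos]] += weight
    return dict(counts)


def min_changes(seqs, columns):
    """Assumed counting rule: (distinct letters - 1), summed over the columns."""
    return sum(len({seq[c] for seq in seqs}) - 1 for c in columns)


# Hypothetical weights that make C and D equally probable in position 1.
weights = [0.75, 0.75, 0.75, 0.75, 1.5, 1.5]
first = weighted_counts(EXAMPLE_1, weights, 0)
last = weighted_counts(EXAMPLE_1, weights, 4)
print("first position:", first)
print("last position: ", last)

# Change counts for the second example: first four vs. last four columns.
print(min_changes(EXAMPLE_2, range(0, 4)), min_changes(EXAMPLE_2, range(4, 8)))
```

With these weights C and D tie in the first position, but G and H each end up with twice the count of C, D, E, or F in the last position. That is the trade-off a pure weighting scheme cannot avoid.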
Next consider:

    seq1  DDCCCCCCCC
    seq2  CCDDCCCCCC
    seq3  CCCCDDCCCC
    seq4  CCCCCCDDCC
    seq5  CCCCCCCCDD
    seq6  GGEEEEEEEE
    seq7  EEGGEEEEEE
    seq8  EEEEGGEEEE
    seq9  EEEEEEGGEE
    seq10 EEEEEEEEGG

Here we clearly have two separate families, and the 5 sequences within each family are very closely related. By symmetry, any sequence weighting method must weight each sequence the same, and will conclude that C is 4 times more common than D. However, since we know that the first 5 sequences are closely related, it is likely that the 4 C's in each position are the result of inheritance, and that C is really not much more likely than D. My program finds about 1.38 counts for C (the exact value depends on initialization) and 1 for D in each position. Similarly for E and G.

If only half the data is given,

    seq1 DDCCCCCCCC
    seq2 CCDDCCCCCC
    seq3 CCCCDDCCCC
    seq4 CCCCCCDDCC
    seq5 CCCCCCCCDD

it is no longer clear that these sequences are closely related. The abundance of C may reflect a serious selective pressure. In this case the program finds 3.71 C's and 1 D in each position. These two examples show the importance of including distantly related sequences in the data when they are available.
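The structure of this example can be verified with a small sketch. It does not implement the program's likelihood model. It only confirms that the sequences within a family are much closer to each other than to the other family, and that equal weights give a 4:1 ratio in every column. The helper names are hypothetical, and Hamming distance is used here only as a convenient stand-in for "closely related".

```python
from collections import Counter
from itertools import combinations

FAMILY_1 = ["DDCCCCCCCC", "CCDDCCCCCC", "CCCCDDCCCC", "CCCCCCDDCC", "CCCCCCCCDD"]
FAMILY_2 = ["GGEEEEEEEE", "EEGGEEEEEE", "EEEEGGEEEE", "EEEEEEGGEE", "EEEEEEEEGG"]


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def column_counts(seqs):
    """Unweighted letter counts for each column of the alignment."""
    return [Counter(column) for column in zip(*seqs)]


# Distances within the first family and between the two families.
within = {hamming(a, b) for a, b in combinations(FAMILY_1, 2)}
between = {hamming(a, b) for a in FAMILY_1 for b in FAMILY_2}
print("within family:", within, "between families:", between)

# With equal weights, every column has 4 C's and 1 D (and 4 E's and 1 G).
columns = column_counts(FAMILY_1 + FAMILY_2)
print(columns[0])

# C:D ratios: equal weights vs. the counts quoted in the text.
print(columns[0]["C"] / columns[0]["D"], 1.38 / 1.0, 3.71 / 1.0)
```

Every pair within a family differs at 4 of the 10 positions, while every cross-family pair differs at all 10. This is the information that is lost when only the first five sequences are supplied.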