Relative speed of the different programs

C compilers differ in efficiency of the code they generate, and some deal with some features of the language better than with others. Thus a program which is unusually fast on one computer may be unusually slow on another. Nevertheless, as a rough guide to relative execution speeds, I have tested the programs on three data sets, each of which has 10 species and 20 characters. The first is an imaginary one in which all characters are compatible - ("The Willi Hennig Memorial Data Set" as J. S. Farris once called it). The second is the binary recoded form of the fossil horses data set of Camin and Sokal (1965). The third data set has data that is completely random: 10 species and 20 characters with a 50% chance that each character state is 0 or 1 (or A or G). The data sets range from a completely compatible one in which there is no homoplasy (paralellism or convergence), through the horses data set, which requires 29 steps where the possible minimum number would be 20, to the random data set, which requires 49 steps. We can thus see how this increasing messiness of the data affects running times.

Here are the nucleotide sequence versions of the three data sets:

   10   20

   10   20

   10   20
Here are the timings of many of the version 3.5 programs on these three data sets as run after being compiled by Microsoft Quick C on an 16 MHz 80386SX computer under PCDOS 5.0. An 80387 math co-processor was present and was used by the compiled code.
                  Hennigian Data    Horses Data        Random Data
    PROTPARS         82.83              86.23             148.03
    DNAPARS           5.98               5.66              11.54
    DNAPENNY         46.03              23.51            5305.97
    DNACOMP           7.14               6.43              11.86
    DNAINVAR          0.61               0.66               0.61
    DNAML          1928.99            2069.32            2611.48
    DNAMLK         2247.12            6094.81            4993.00
    DNADIST           3.57               4.50               5.38
    RESTML         6818.34           13422.15           28418.34
    FITCH            35.92              48.61              38.17
    KITSCH           12.42              12.36              13.18
    NEIGHBOR          2.20               2.14               2.903
    CONTML           56.85              57.56              59.15
    GENDIST           1.00               1.00               1.00
    MIX              13.62              14.60              25.92
    PENNY             8.41              21.31            3851.1
    DOLLOP           26.69              26.86              46.30
    DOLPENNY         12.25              56.57           23934.22
    CLIQUE            0.77               0.71               0.77
    FACTOR            0.39               0.44               0.44
In all cases the programs were run under the default options, except as specified here. The data sets used for the discrete characters programs have 0's and 1's instead of A's and C's. For CONTML the 0's and 1's were made into 0.0's and 1.0's and considered as 20 2-allele loci. For the distance programs 10 x 10 distance matrices were computed from the three data sets. Nor does it make much sense to benchmark MOVE, DOLMOVE, or DNAMOVE, although when there are many characters and many species the response time after each alteration of the tree should be proportional to the product of the number of species and the number of characters. For DNAML and DNAMLK the frequencies of the four bases were set to be equal rather than determined empirically as is the default. For RESTML the number of enzymes was set to 1.

Several patterns will be apparent from this. The algorithms (MIX, DOLLOP, CONTML, FITCH, KITSCH, PROTPARS, DNAPARS, DNACOMP, and DNAML, DNAMLK, RESTML) that use the above-described addition strategy have run times that do not depend strongly on the messiness of the data. The only exception to this is that if a data set such as the Random data requires one extra round of global rearrangements it takes longer. The programs differ greatly in run time: the likelihood programs RESTML, DNAML and CONTML are quite a bit slower than the others. The protein sequence parsimony program, which has to do a considerable amount of bookkeeping to keep track of which amino acids can mutate to each other, is also relatively slow.

Another class of algorithms includes PENNY, DOLPENNY, DNAPENNY and CLIQUE. These are branch-and-bound methods: in principle they should have execution times that rise exponentially with the number of species and/or characters, and they might be much more sensitive to messy data. This is apparent with PENNY, DOLPENNY, and DNAPENNY, which go from being reasonably fast with clean data to very slow with messy data. DOLPENNY is paritcularly slow on messy data -- this is because this algorithm cannot make use of some of the lower-bound calculations that are possible with DNAPENNY and PENNY. CLIQUE is very fast on all data sets. Although in theory it should bog down if the number of cliques in the data is very large, that does not happen with random data, which in fact has few cliques and those small ones. Apparently the "worst-case" data sets are much rarer for CLIQUE than for the other branch-and-bound methods.

NEIGHBOR is quite fast compared to FITCH and KITSCH, and should make it possible to run much larger cases, although the results are expected to be a bit rougher than with those programs.

Speed with different numbers of species

How will the speed depend on the number of species and the number of characters? For the sequential-addition algorithms, the speed should be proportional to the cube of the number of species, and to the number of characters. Thus a case that has, instead of 10 species and 20 characters, 20 species and 50 characters would take 2 x 2 x 2 x 2.5 = 20 times as long. This implies that cases with more than 20 species will be slow, and cases with more than 40 species VERY slow. This places a premium on working on small subproblems rather than just dumping a whole large data set into the programs.

An exception to these rules will be some of the DNA programs that use an aliasing device to save execution time. In these programs execution time will not necessarily increase proportional to the number of sites, as sites that show the same pattern of nucleotides will be detected as identical and the calculations for them will be done only once, which does not lead to more execution time. This is particularly likely to happen with few species and many sites, or with data sets that have small amounts of evolutionary divergence.

For programs FITCH and KITSCH, the distance matrix is square, so that when we double the number of species we also double the number of "characters", so that running times will go up as the fourth power of the number of species rather than the third power. Thus a 20-species case with FITCH is expected to run sixteen times more slowly than a 10-species case.

For programs like PENNY and CLIQUE the run times will rise faster than the cube of the number of species (in fact, they can rise faster than any power since these algorithms are not guaranteed to work in polynomial time). In practice, PENNY will frequently bog down above 11 species, while CLIQUE easily deals with larger numbers.

For NEIGHBOR the speed should vary only as the square of the number of species, so a case twice as large will take only four times as long. This will make it an attractive alternative to FITCH and KITSCH for large data sets.

If you are unsure of how long a program will take, try it first on a few species, then work your way up until you get a feel for the speed and for what size programs you can afford to run.

Execution time is not the most important criterion for a program, particularly as computer time gets much cheaper than your time or a programmer's time. With workstations on which background jobs can be run all night, execution speed is not overwhelmingly relevant. Some of us have been conditioned by an earlier era of computing to consider execution speed paramount. But ease of use, ease of adaptation to your computer system, and ease of modification are much more important in practice, and in these respects I think these programs are adequate. Only if you are engaged in 1960's style mainframe computing is minimization of execution time paramount.

Nevertheless it would have been nice to have made the programs faster. The present speeds are a compromise between speed and effectiveness: by making them slower and trying more rearrangements in the trees, or by enumerating all possible trees, I could have made the programs more likely to find the best tree. By trying fewer rearrangements I could have speeded them up, but at the cost of finding worse trees. I could also have speeded them up by writing critical sections in assembly language, but this would have sacrificed ease of distribution to new computer systems. There are also some options included in these programs that make it harder to adopt some of the economies of bookkeeping that make other programs faster. However to some extent I have simply made the decision not to spend time trying to speed up program bookkeeping when there were new likelihood and statistical methods to be developed.

Relative speed of different machines

It is interesting to compare different machines using DNAPARS as the standard task. One can rate a machine on the DNAPARS benchmark by summing the times for all three of the data sets. Here are relative total timings over all three data sets (done with various versions of DNAPARS) for some machines, taking Microsoft Quick C running under PCDOS on a 16 MHz 80386 clone as the standard. Pascal benchmarks from version 3.4 of the program are also included -- they are compared only with each other and their times are in parentheses.

This use of two separate standards is necessary not because of different languages but because different versions of the package are being compared. Thus, the "Time" is the ratio of the Total to that for the 386SX, for the appropriate standard, so that the Time for the Macintosh Classic for DNAPARS 3.4 on Think Pascal 3 is compared to the Time for the 386/SX running DNAPARS 3.4 on Turbo Pascal 6.0, but the Time for the Macintosh Classic running version 3.5 on Think C is compared to the Time for the 386SX running version 3.5 on Quick C. The Speed is the reciprocal of the Time.

  Machine             DOS        Compiler            Total     Time     Speed
  -------             ---        --------            -----     ----     -----

  Toshiba T1100+      PCDOS    Turbo Pascal 3.01A   (269)      7.912      0.126
  Apple Mac Plus      MacOS    Lightspeed Pascal 2  (175.84)   5.172      0.193
  Toshiba T1100+      PCDOS    Turbo Pascal 5.0     (162)      4.765      0.210
  Macintosh Classic   MacOS    Think Pascal 3       (160)      4.706      0.212
  Macintosh Classic   MacOS    Think C                43.0     3.58       0.279
  IBM PS2/60          PCDOS    Turbo Pascal 5.0      (58.76)   1.728      0.579
  80286 (12 Mhz)      PCDOS    Turbo Pascal 5.0      (47.09)   1.385      0.722
  Apple Mac IIcx      MacOS    Think Pascal 3        (42)      1.235      0.810
  Apple Mac SE/30     MacOS    Think Pascal 3        (42)      1.235      0.810
  Apple Mac IIcx      MacOS    Lightspeed Pascal 2   (39.84)   1.172      0.853
  Apple Mac IIcx      MacOS    Lightspeed Pascal 2#  (39.69)   1.167      0.857
  Zenith Z386 (16MHz) PCDOS    Turbo Pascal 5.0      (38.27)   1.155      0.866
  Macintosh SE/30     MacOS    Think C                13.6     1.132      0.883
  80386SX (16 MHz)    PCDOS    Turbo Pascal 6.0      (34)      1.0        1.0
  80386SX (16 MHz)    PCDOS    Microsoft Quick C      12.01    1.0        1.0
  Sequent-S81         DYNIX    Silicon Valley Pascal (13.0)    0.382      2.615
  VAX 11/785          Unix     Berkeley Pascal       (11.9)    0.35       2.857
  80486-33            PCDOS    Turbo Pascal 6.0      (11.46)   0.337      2.967
  Sun 3/60            SunOS    Sun C                   3.93    0.327      3.056
  NeXT Cube (68030)   Mach     Gnu C                   2.608   0.217      4.605
  Sequent S-81        DYNIX    Sequent Symmetry C      2.604   0.217      4.612
  VAXstation 3500     Unix     Berkeley Pascal        (7.3)    0.215      4.658
  Sequent S-81        DYNIX    Berkeley Pascal        (5.6)    0.1647     6.07
  Unisys 7000/40      Unix     Berkeley Pascal        (5.24)   0.1541     6.49
  VAX 8600            VMS      DEC VAX Pascal         (3.96)   0.1165     8.59
  Sun SPARC IPX       SunOS    Gnu C version 2.1       1.28    0.1066     9.383
  VAX 6000-530        VMS      DEC C                   0.858   0.0714    13.998
  VAXstation 4000     VMS      DEC C                   0.809   0.0674    14.845
  IBM RS/6000 540     AIX      XLP Pascal             (2.276)  0.0669    14.94
  NeXTstation(040/25) Mach     Gnu C                   0.75    0.0624    16.013
  Sun SPARC IPX       SunOS    Sun C                   0.68    0.0566    17.662
  486DX (33 MHz)      Linux    Gnu C #                 0.63    0.0525    19.063
  Sun SPARCstation-1+ Unix     Sun Pascal             (1.7)    0.05      20.00
  DECstation 5000/200 Unix     DEC Ultrix C            0.45    0.0375    26.69
  Sun SPARC 1+        SunOS    Sun C                   0.40    0.0333    30.025
  DECstation 3100     Unix     DEC Ultrix RISC Pascal (0.77)   0.0226    44.16
  IBM 3090-300E       AIX      Metaware High C         0.27    0.0225    44.48
  DECstation 5000/125 Unix     DEC Ultrix RISC C       0.267   0.0222    44.98
  DECstation 5000/200 Unix     DEC Ultrix RISC C       0.256   0.0222    44.98
  Sun SPARC 4/50      SunOS    Sun C                   0.249   0.02073   48.23
  DEC 3000/400 AXP    Unix     DEC C                   0.224   0.01865   53.62
  DECstation 5000/240 Unix     DEC Ultrix RISC C       0.1889  0.01573   63.58
  SGI Iris R4000      Unix     SGI C                   0.184   0.1532    65.27
  IBM 3090-300E       VM       Pascal VS              (0.464)  0.0136    73.28
  DECstation 5000/200 Unix     DEC Ultrix RISC Pascal (0.39)   0.0114    87.18
The Toshiba T1100+ should be exactly as fast as an 8 MHz PC clone. For a couple of the machines I am not sure that this benchmark is representative of timings on non-numerical programs in PHYLIP. This is particularly the case for the DEC 3000/400 AXP (the DEC "Alpha") which is probably quite a bit faster than indicated here. The numerical programs benchmark below gives it a fairer test. The IBM RS/6000 is probably up to ten times faster than shown here: it may have been ill-served by its Pascal compiler.

Note that parallel machines like the Sequent are not really as slow as indicated by the data here, as these runs did nothing to take advantage of their parallelism.

For a picture of speeds for a more numerically intensive program, here are benchmarks using DNAML, with the 16 MHz 386SX with math co-processor active as the standard. Numbers are total run times (total user time in the case of Unix) over all three data sets.

  Machine             System         Compiler       Seconds   Time    Speed
  -------             ---------      --------       -------   ----    -----

  386SX 16 Mhz          PCDOS   Turbo Pascal 6    (7826)     1.0        1.0
  386SX 16 Mhz          PCDOS   Quick C            6549.79   1.0        1.0
  Compudyne 486DX/33    Linux   Gnu C              1599.9    0.2441     4.096
  SUN Sparcstation 1+   SunOS   Sun C              1402.8    0.2142     4.669
  Everex STEP 386/20    PCDOS   Turbo Pascal 5.5  (1440.8)   0.1841     5.432
  486DX/33              PCDOS   Turbo C++          1107.2    0.1690     5.916
  Compudyne 486DX/33    PCDOS   Waterloo C/386     1045.78   0.1597     6.263
  Sun SPARCstation IPX  SunOS   Gnu C               960.2    0.1466     6.821
  NeXTstation(68040/25) Mach    Gnu C               916.6    0.1399     7.146
  486DX/33              PCDOS   Waterloo C/386      861.0    0.1314     7.607
  Sun SPARCstation IPX  SunOS   Sun C               787.7    0.1203     8.315
  486DX/33              PCDOS   Gnu C               650.9    0.0994    10.063
  VAX 6000-530          VMS     DEC C               637.0    0.0973    10.282
  DECstation 5000/200   Unix    DEC Ultrix RISC C   423.3    0.0646    15.473
  IBM 3090-300E         AIX     Metaware High C     201.8    0.0308    32.46
  Convex C240/1024      Unix    C                   101.6    0.01551   64.47
  DEC 3000/400 AXP      Unix    DEC C                98.29   0.01501   66.64
You are invited to send me figures for your machine for inclusion in future tables. Use the data sets above and compute the total times for DNAPARS and for DNAML for the three data sets (setting the frequencies of the four bases to 0.25 each for the DNAML runs). Be sure to tell me the name and version of your compiler, and the version of PHYLIP you tested.

Published Benchmarks

Some of you may have seen the "benchmark" published by Luckow and Pimentel (1985). PHYLIP's WAGNER (an immediate ancestor of MIX) did not do well in it, either in terms of the quality of result or execution speed. I do not believe that this was a fair benchmark. WAGNER was run only with one order of input species, not ten as recommended here. Had it been, perhaps the shortest tree would have been found more often. No credit was given to PHYLIP in that article for its free distribution, availability on microcomputers, availability in source code form, or portability to new computers. Pimentel's laboratory commissioned the development of a competing package, PHYSYS, which is a commercial product, and that involvement was not stated in the article.

The benchmarks by Fink (1986) are fairer, although there are some impressions given by that article which do not apply to the present version. In particular, I have since added to many of the programs the ability to save multiple equally-parsimonious trees, and have changed the outputs so that reconstruction of states in the hypothetical ancestral nodes is much easier, thus answering Fink's major criticisms. I have since eliminated the Metropolis annealing method algorithms which he criticized. I disagree with Fink's view OF PHYLIP that one should "be wary of published results from an analysis using it", as I do not think that a tree slightly longer than the most parsimonious one should be rejected out of hand. Nor do I agree that "it is really too slow to use as a teaching tool", as in teaching one uses small data sets and speed is not of the essence. Rather, simplicity of user interface is paramount, and there PHYLIP does very well (so is ability to run on a variety of computers, in which respect PHYLIP is also superior). In fact, it is widely used as a teaching tool.

Nevertheless MIX is undoubtably not as fast or as sophisticated as PAUP or Hennig86. The present version of PHYLIP is closer to its competitors in quality of result than was the version Fink reviewed.

Platnick's (1987) benchmarks concentrated, as did the other benchmarkers (all of them members of the same school of systematists) on parsimony as the only phylogeny criterion worthy of attention. He concluded that PHYLIP could be used effectively, especially if up to ten different input orders of species were used. Again, as with the other benchmarks, no credit was given for diversity of methods, portability, price, or availability of source code.

Platnick's second benchmark paper (1989) concentrates on Hennig86 and Paup, and concludes that PHYLIP has not kept up with those programs in its features. Again, the review is entirely concerned with parsimony, and only the barest mention is made of ... (you can complete this sentence).

Sanderson's (1990) benchmark paper breaks with the method of the others by specifying 36 features of the packages rated and giving separate ratings in each. Like the other benchmark papers it concentrates almost exclusively on parsimony as applied to morphological characters, but does at least give some credit where credit is due.

My own, obviously biased, feeling is that there is a discrepancy between the benchmarkers' projections of how satisfied users of PHYLIP will be, and how satisfied they actually are. And that this discrepancy is in PHYLIP's favor.


Here are some comments about PHYLIP. Explanatory material in square brackets is my own: From the pages of Cladistics:

"Under no circumstances can we recommend PHYLIP/WAG [their name for the Wagner parsimony option of MIX]."
Luckow, M. and R. A. Pimentel (1985)
"PHYLIP has not proven very effective in implementing parsimony (Luckow and Pimentel, 1985)."
J. Carpenter (1987a)
"... PHYLIP. This is the computer program where every newsletter concerning it is mostly bug-catching, some of which have been put there by previous corrections. As Platnick (1987) documents, through dint of much labor useful results may be attained with this program, but I would suggest an easier way: FORMAT b:"
J. Carpenter (1987b)
"PHYLIP is bug-infested and both less effective and orders of magnitude slower than other programs ...."
"T. N. Nayenizgani" [J. S. Farris] (1990)
"Hennig86 [by J. S. Farris] provides such substantial improvements over previously available programs (for both mainframes and microcomputers) that it should now become the tool of choice for practising systematists."
N. Platnick (1989)
and in the pages of other journals:
"The availability, within PHYLIP of distance, compatibility, maximum likelihood, and generalized 'invariants' algorithms (Cavender and Felsenstein, 1987) sets it apart from other packages .... One of the strengths of PHYLIP is its documentation ...."
Michael J. Sanderson (1990)

(Sanderson also criticizes PHYLIP for slowness and inflexibility of its parsimony algorithms, and compliments other packages on their strengths).

"This package of programs has gradually become a basic necessity to anyone working seriously on various aspects of phylogenetic inference .... The package includes more programs than any other known phylogeny package. But it is not just a collection of cladistic and related programs. The package has great value added to the whole, and for this it is unique and of extreme importance .... its various strengths are in the great array of methods provided ...."
Bernard R. Baum (1989)
(see also above under Benchmarks for W. Fink's critical remarks (1986) on version 2.8 of PHYLIP).

Back to the main PHYLIP page
Back to the SEQNET home page
Maintained 15 Jul 1996 -- by Martin Hilbers(