**EFitch** estimates phylogenies from distance matrix data using the Fitch-Margoliash method and some related least squares methods.

**EFitch** is a modified version of the PHYLIP version 3.572c's FITCH,
by Joseph Felsenstein,
with command line control added.

**EFitch** estimates phylogenies from distance matrix data under the "additive tree model" according to which the distances are expected to equal the sums of branch lengths between the species.
It uses the Fitch-Margoliash criterion and some related least squares criteria and does not assume an evolutionary clock.

The input file for **EFitch** is the output file from EDnaDist
or EProtDist.

This program was originally written by Joe Felsenstein (E-mail: joe@evolution.genetics.washington.edu; Post: Department of Genetics, University of Washington, Box 357360, Seattle, Washington 98195-7360, U.S.A.).

This version was modified for inclusion in EGCG by Maria Jesus Martin (E-mail: martin@ebi.ac.uk; Post: EMBL Outstation Hinxton, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SQ or E-mail: martin@tdi.es; Post: Tecnologia para Diagnostico e Investigacion, Condes de Torreanaz 5, 28028 Madrid).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).

Here is a session with **EFitch**.

    % efitch -options

     EFITCH of what distance matrix file ?  fmdv.ednadist

     What should I call the output file (* fmdv.efitch *) ?

     Use user trees in input file (* No *) ?

     What Power (* 2.0 *) ?

     Allow negative branch lengths (* No *) ?

     OutGroup root (* No *) ?

     Data matrix form :

        S)quare.
        L)ower-triangular.
        U)pper-triangular.

     Choose the matrix form (* S *) ?

     Use subreplicates (* No *) ?

     Do global rearrangements (* No *) ?

     Randomize input order of sequences (* No *) ?

     Analyze multiple data sets (* No *) ?

     Print out the data at start of run (* No *) ?

    Adding species:
       APHAVP1C
       APHAVP1D
       APHAVP1A
       APHAVP1B
       APHAVP1E
       APHAVP1F

    Output written to fmdv.efitch

    Tree also written onto fmdv.efitchtrees

    %

The input file for **EFitch** is the output file from EDnaDist
or EProtDist.
The first line of the input file contains the number of species.
There follows species data,
starting with a species name.
The species name is ten characters long,
and must be padded out with blanks if shorter.
For each species there then follows a set of distances to all the other species (options allow the distance matrix to be upper or lower triangular or square).
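As an illustration of the layout just described, here is a minimal Python sketch that reads a square distance matrix. It is not part of EFitch: the function name `parse_square_matrix` is hypothetical, and the sketch assumes that each species occupies exactly one line and that the matrix is in square form.

```python
def parse_square_matrix(text):
    """Parse a square distance matrix.

    Assumed layout: the first line holds the number of species; each
    following line holds a 10-character name field and then the distances.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])
    names, rows = [], []
    for line in lines[1:n + 1]:
        names.append(line[:10].strip())          # name field is 10 characters wide
        rows.append([float(x) for x in line[10:].split()])
    if any(len(row) != n for row in rows):
        raise ValueError("expected %d distances per species" % n)
    return names, rows


# The example input file from this document, names padded to 10 characters.
EXAMPLE = """\
    6
APHAVP1C  0.0000 0.0083 0.0392 0.0348 0.0324 0.0353
APHAVP1D  0.0083 0.0000 0.0412 0.0368 0.0344 0.0406
APHAVP1A  0.0392 0.0412 0.0000 0.0034 0.0106 0.0178
APHAVP1B  0.0348 0.0368 0.0034 0.0000 0.0071 0.0189
APHAVP1E  0.0324 0.0344 0.0106 0.0071 0.0000 0.0232
APHAVP1F  0.0353 0.0406 0.0178 0.0189 0.0232 0.0000
"""

names, matrix = parse_square_matrix(EXAMPLE)
print(names)
print(matrix[0][1])   # distance between the first two species
```

Slicing the first ten characters, rather than splitting on whitespace, is what makes the blank padding of short names matter.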

Here is the input file for the example session.

        6
    APHAVP1C  0.0000 0.0083 0.0392 0.0348 0.0324 0.0353
    APHAVP1D  0.0083 0.0000 0.0412 0.0368 0.0344 0.0406
    APHAVP1A  0.0392 0.0412 0.0000 0.0034 0.0106 0.0178
    APHAVP1B  0.0348 0.0368 0.0034 0.0000 0.0071 0.0189
    APHAVP1E  0.0324 0.0344 0.0106 0.0071 0.0000 0.0232
    APHAVP1F  0.0353 0.0406 0.0178 0.0189 0.0232 0.0000

**EFitch** writes two output files,
one containing an ASCII representation of an unrooted tree and the lengths of its segments,
and another containing the tree in nested-pairs parenthesis notation.

Here is the output file from the example session.

       6 Populations

                      __ __             2
                      \  \   (Obs - Exp)
    Sum of squares =  /_ /_  ------------
                                    2
                       i  j      Obs

    Negative branch lengths not allowed

      +APHAVP1D
      !
      !  +APHAVP1F
    --1--4
      !  !  +APHAVP1E
      !  +--3
      !     !  +APHAVP1B
      !     +--2
      !        +APHAVP1A
      !
      +APHAVP1C

    remember: this is an unrooted tree!

    Sum of squares =     0.17109

    Average percent standard deviation =     7.81696

    examined   28 trees

    Between        And            Length
    -------        ---            ------
       1          APHAVP1D          0.00554
       1             4              0.02305
       4          APHAVP1F          0.01058
       4             3              0.00471
       3          APHAVP1E          0.00477
       3             2              0.00223
       2          APHAVP1B          0.00035
       2          APHAVP1A          0.00305
       1          APHAVP1C          0.00276

Here is the output tree file from the example session.

(APHAVP1D:0.00554,(APHAVP1F:0.01058,(APHAVP1E:0.00477,(APHAVP1B:0.00035, APHAVP1A:0.00305):0.00223):0.00471):0.02305,APHAVP1C:0.00276);
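The nested-pairs notation can be read back with a few lines of code. The following Python sketch is only an illustration: the helper names `parse_newick` and `tip_lengths` are hypothetical, and it assumes plain, unquoted species names with a branch length after each colon.

```python
def parse_newick(text):
    """Parse nested-pairs notation into (name, length, children) tuples."""
    s = "".join(text.split()).rstrip(";")   # drop whitespace and the final ';'
    pos = 0

    def node():
        nonlocal pos
        children = []
        if s[pos] == "(":
            pos += 1                         # consume '('
            children.append(node())
            while s[pos] == ",":
                pos += 1                     # consume ','
                children.append(node())
            pos += 1                         # consume ')'
        start = pos
        while pos < len(s) and s[pos] not in ",():":
            pos += 1
        name = s[start:pos]                  # empty for interior nodes
        length = None
        if pos < len(s) and s[pos] == ":":
            pos += 1
            start = pos
            while pos < len(s) and s[pos] not in ",()":
                pos += 1
            length = float(s[start:pos])
        return (name, length, children)

    return node()


def tip_lengths(tree):
    """Map each species name to the length of its terminal segment."""
    name, length, children = tree
    if not children:
        return {name: length}
    result = {}
    for child in children:
        result.update(tip_lengths(child))
    return result


TREE = ("(APHAVP1D:0.00554,(APHAVP1F:0.01058,(APHAVP1E:0.00477,"
        "(APHAVP1B:0.00035,APHAVP1A:0.00305):0.00223):0.00471):0.02305,"
        "APHAVP1C:0.00276);")

root = parse_newick(TREE)
print(len(root[2]))          # number of subtrees at the base of the tree
print(tip_lengths(root))
```

The terminal segment lengths recovered this way should agree with the "Between / And / Length" table in the output file.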

PileUp creates a multiple sequence alignment from a group of related sequences using progressive pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. LineUp creates and edits multiple sequence alignments. Pretty displays multiple sequence alignments. Distances creates a table of the pairwise distances within a group of aligned sequences. GrowTree creates a phylogenetic tree from a distance matrix created by Distances using either the UPGMA or neighbor-joining method. You can create a text or graphics output file.

EDnaDist computes a distance matrix from nucleic acid sequences, under four different models of nucleotide substitution (Jukes and Cantor (1969), Kimura (1980), Jin and Nei (1990) and a maximum likelihood model (Felsenstein, 1981)). EProtDist computes a distance measure for protein sequences, using maximum likelihood estimates based on the Dayhoff PAM matrix, Kimura's 1983 approximation to it, or a model based on the genetic code plus a constraint on changing to a different category of amino acid. ENeighbor estimates phylogenies from distance matrix data using the Neighbor-Joining method or the UPGMA method of clustering. EKitsch estimates phylogenies from distance matrix data under the "ultrametric" model, which is the same as the additive tree model except that an evolutionary clock is assumed. EDnaPars estimates phylogenies from nucleic acid sequences using the parsimony method. EProtPars estimates phylogenies from amino acid sequences using the parsimony method. EDnaML estimates phylogenies from nucleotide sequences by maximum likelihood. EDnaMLK does the same as EDnaML but assumes a molecular clock. ESeqBoot produces multiple data sets from a molecular sequence data set by bootstrap, jackknife, or permutation resampling. EConsense computes consensus trees by the majority-rule consensus tree method. It can be used as the final step in doing bootstrap analyses.

**EFitch** carries out the method of Fitch and Margoliash (1967)
for fitting trees to distance matrices.
It can also carry out the least squares method of Cavalli-Sforza and Edwards (1967),
plus a variety of other methods of the same family.

The objective of these methods is to find that tree which minimizes

$$\text{Sum of squares} = \sum_{i} \sum_{j} \frac{n_{ij}\,(D_{ij} - d_{ij})^{2}}{D_{ij}^{P}}$$

where D is the observed distance between species i and j and d is the expected distance, computed as the sum of the lengths (amounts of evolution) of the segments of the tree from species i to species j. The quantity n is the number of times each distance has been replicated. In simple cases this is taken to be one, but the user can, as an option, specify the degree of replication for each distance. The distance is then assumed to be the mean of those replicates. The power P is what distinguishes the various methods. For the Fitch-Margoliash method, which is the default method with this program, P is 2.0. For the Cavalli-Sforza and Edwards least squares method it should be set to 0 (so that the denominator is always 1). An intermediate method is also available in which P is 1.0, and any other value of P, such as 4.0 or -2.3, can also be used. This generates a whole family of methods.
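The criterion can be evaluated by hand for the example session. The Python sketch below is an illustration, not the program's own code. It takes the segment lengths from the example output file and the observed distances from the example input file. Two points are assumptions rather than statements from this document: the sum is taken over all ordered pairs of different species (so each distance is counted twice), and the average percent standard deviation is computed as 100 * sqrt(SS / (N - 2)), with N the number of off-diagonal distances. Both were chosen because they appear to reproduce the reported values to within rounding of the printed segment lengths.

```python
import math
from itertools import permutations

# Segment lengths from the "Between / And / Length" table of the example output.
EDGES = {
    ("1", "APHAVP1D"): 0.00554, ("1", "4"): 0.02305, ("4", "APHAVP1F"): 0.01058,
    ("4", "3"): 0.00471, ("3", "APHAVP1E"): 0.00477, ("3", "2"): 0.00223,
    ("2", "APHAVP1B"): 0.00035, ("2", "APHAVP1A"): 0.00305,
    ("1", "APHAVP1C"): 0.00276,
}

NAMES = ["APHAVP1C", "APHAVP1D", "APHAVP1A", "APHAVP1B", "APHAVP1E", "APHAVP1F"]
OBSERVED = [
    [0.0000, 0.0083, 0.0392, 0.0348, 0.0324, 0.0353],
    [0.0083, 0.0000, 0.0412, 0.0368, 0.0344, 0.0406],
    [0.0392, 0.0412, 0.0000, 0.0034, 0.0106, 0.0178],
    [0.0348, 0.0368, 0.0034, 0.0000, 0.0071, 0.0189],
    [0.0324, 0.0344, 0.0106, 0.0071, 0.0000, 0.0232],
    [0.0353, 0.0406, 0.0178, 0.0189, 0.0232, 0.0000],
]


def path_length(edges, a, b):
    """Expected distance d: sum of segment lengths on the path from a to b."""
    graph = {}
    for (u, v), w in edges.items():
        graph.setdefault(u, []).append((v, w))
        graph.setdefault(v, []).append((u, w))
    stack = [(a, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == b:
            return dist
        for nxt, w in graph[node]:
            if nxt != parent:                # a tree has no cycles
                stack.append((nxt, node, dist + w))
    raise ValueError("no path between %s and %s" % (a, b))


def sum_of_squares(names, observed, edges, power=2.0, reps=None):
    """Sum over ordered pairs of n * (D - d)^2 / D^P (observed D must be > 0)."""
    total = 0.0
    for i, j in permutations(range(len(names)), 2):
        D = observed[i][j]
        d = path_length(edges, names[i], names[j])
        n = 1 if reps is None else reps[i][j]
        total += n * (D - d) ** 2 / D ** power
    return total


ss = sum_of_squares(NAMES, OBSERVED, EDGES, power=2.0)
n_distances = len(NAMES) * (len(NAMES) - 1)
print(round(ss, 5))                                        # compare with 0.17109
print(round(100 * math.sqrt(ss / (n_distances - 2)), 3))   # compare with 7.81696
```

Calling `sum_of_squares` with `power=0` gives the unweighted Cavalli-Sforza and Edwards criterion for the same tree.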

These methods assume that the variance of the measurement error is proportional to the P-th power of the expectation (hence the standard deviation will be proportional to the P/2-th power of the expectation). If you have reason to think that the measurement error of a distance is the same for small distances as it is for large, then you should set P=0 and use the least squares method, but if you have reason to think that the relative (percentage) error is more nearly constant than the absolute error, you should use P=2, the Fitch-Margoliash method. In between, P=1 would be appropriate if the sizes of the errors were proportional to the square roots of the expected distances.
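In symbols, with D the observed and d the expected distance between species i and j, the assumption described above can be summarized as follows. This is only a restatement of the paragraph, and the subscript notation is introduced here for illustration.

$$\operatorname{Var}(D_{ij}) \propto d_{ij}^{P} \quad\Longrightarrow\quad \operatorname{SD}(D_{ij}) \propto d_{ij}^{P/2}$$

$$P = 0:\ \operatorname{SD} \text{ constant}; \qquad P = 1:\ \operatorname{SD} \propto \sqrt{d_{ij}}; \qquad P = 2:\ \operatorname{SD} \propto d_{ij} \text{ (constant relative error)}$$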

For more information, please see the Distance Matrix Programs documentation file "distance.doc" and the "fitch.doc" file from PHYLIP (Phylogeny Inference Package) distribution Version 3.57c by Joseph Felsenstein, available by anonymous FTP at evolution.genetics.washington.edu in directory pub/phylip.

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

Minimum Syntax: % efitch [-INfile=]file.ednadist -default

Prompted Parameters:

    [-OUTfile=]file.efitch  output file.
    -POWer=2.0              P value in the least squares formula. The default
                            value is 2 (the Fitch-Margoliash method).
                            (See documentation)
    -MATrix=S               form of the data matrix, where:
                              S)quare.
                              L)ower-triangular.
                              U)pper-triangular.

Optional Parameters:

    -OPTions        makes the program ask for further specific options.
    -USERTree       one or more user-defined unrooted trees is to be provided
                    for evaluation in the input file.
    -LENGth         the tree is evaluated with their lengths fixed.
                    (only for "user tree" option).
    -NEGallowed     allows negative segment lengths in the tree.
    -OUTGroup=1     species used to root the tree.
    -SUBREPlicates  subreplication option.
    -GLOBal         Global rearrangements.
                    (not available with the USERTree parameter)
    -RANDom=3       uses a random number generator to choose the input order
                    of species. The seed should be an integer between 1 and
                    32767.
    -JUMnumber=1    number of times to restart the process (with different
                    orders of species).
    -SETS=2         multiple data sets.
    -SHOWData       prints data in the output file.
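For scripted runs, a command line can be assembled from the qualifiers in the summary above. The Python sketch below only builds the argument list and does not run anything. The function name `build_efitch_command` is hypothetical, and it is an assumption that the program accepts the qualifiers spelled out in full exactly as they are printed in the summary.

```python
def build_efitch_command(infile, outfile=None, power=2.0, matrix="S",
                         global_rearrange=False, outgroup=None):
    """Return an efitch argument list using qualifiers from the summary."""
    if matrix not in ("S", "L", "U"):
        raise ValueError("matrix must be 'S', 'L' or 'U'")
    cmd = ["efitch", "-INfile=%s" % infile]
    if outfile:
        cmd.append("-OUTfile=%s" % outfile)
    cmd.append("-POWer=%s" % power)          # 2 = Fitch-Margoliash, 0 = least squares
    cmd.append("-MATrix=%s" % matrix)
    if global_rearrange:
        cmd.append("-GLOBal")
    if outgroup is not None:
        cmd.append("-OUTGroup=%d" % outgroup)
    cmd.append("-default")                   # as in the minimum syntax line
    return cmd


print(" ".join(build_efitch_command("fmdv.ednadist", power=0)))
```

Keeping the command as a list, rather than a single string, avoids quoting problems if it is later handed to a process launcher.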

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-OPTions

makes the program ask for all specific options.

-USERTree

tells the program that one or more user-defined trees are to be provided for evaluation in the input file. The trees are regarded as unrooted, and are specified with a trifurcation (three-way split) at their base, e.g.: ((A,B),C,(D,E));
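Whether a user tree has the required three-way split at its base can be checked before the run. The Python sketch below is an illustration only: the function name `base_degree` is hypothetical, and it assumes plain species names without quoted labels or embedded parentheses.

```python
def base_degree(tree):
    """Count the subtrees joined at the outermost pair of parentheses."""
    depth = 0
    count = 0
    for ch in tree:
        if ch == "(":
            depth += 1
            if depth == 1:
                count = 1                    # first subtree at the base
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 1:
            count += 1                       # a comma at the base starts another subtree
    return count


print(base_degree("((A,B),C,(D,E));"))       # three-way split at the base
print(base_degree("((A,B),(C,(D,E)));"))     # two-way split: a rooted form
```

A result of 3 corresponds to the unrooted form shown in the example above.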

-LENGth

uses the branch lengths supplied with the user trees instead of iterating them, so that each tree is evaluated with its lengths fixed. Only available with the "user tree" option.

-NEGallowed

indicates that negative segment lengths are to be allowed in the tree (default is to require that all branch lengths be non-negative).

-OUTGroup=1

specifies which species is to be used to root the tree by having it become the outgroup (species are numbered in the order in which they occur in the input file).

-SUBREPlicates

tells the program that each distance in the input file is followed by an integer giving the number of replicates of which that distance is the mean, so that a line of data looks like this:

    Delta     3.00 5 3.21 3 1.84 9

the 5, 3, and 9 being the number of times each measurement was replicated. When the number of replicates is zero, a distance value must still be provided, although its value will not affect the result.
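A data line in this subreplicate layout can be split into (distance, replicates) pairs as in the Python sketch below. It is an illustration only: the function name `parse_subreplicate_row` is hypothetical, and the 10-character name field is taken from the input file description.

```python
def parse_subreplicate_row(line, name_width=10):
    """Split one data line into a species name and (distance, replicates) pairs."""
    name = line[:name_width].strip()
    fields = line[name_width:].split()
    if len(fields) % 2 != 0:
        raise ValueError("each distance must be followed by a replicate count")
    pairs = [(float(fields[k]), int(fields[k + 1]))
             for k in range(0, len(fields), 2)]
    return name, pairs


row = "Delta".ljust(10) + "3.00 5 3.21 3 1.84 9"
name, pairs = parse_subreplicate_row(row)
print(name, pairs)
```

The replicate counts are the weights n that multiply each squared difference in the sum of squares.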

-GLOBal

performs global rearrangements, an additional stage in the search for the best tree. Each possible subtree is removed from the tree and added back in all possible places. This process continues until all subtrees can be removed and added again without any improvement in the tree. The purpose of this extra rearrangement is to make it less likely that one or more species get "stuck" in a suboptimal region of the space of all possible trees. The use of global rearrangements approximately triples the run time. It is not available with the "user tree" option.

-RANDom=3

uses a random number generator to choose the input order of species. The seed must be an odd integer between 1 and 32767.

-JUMnumber=1

specifies how many times to restart the process with different orders of species. With a value of 10, the program will try ten different orders of species in constructing the trees, and the results printed out will reflect this entire search process (that is, the best trees found among all 10 runs will be printed out, not the best trees from each individual run). Of course this is slow, taking 10 times longer than a single run.
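The restart strategy amounts to keeping the best result over several random input orders. The Python sketch below shows the idea under stated assumptions. The function `best_of_jumbles` and the callable `build_tree` are hypothetical; `build_tree` stands in for a tree-building step that returns a tree and its sum of squares. Python's random number generator is also not the one used by the program, so the orders it produces will differ.

```python
import random


def best_of_jumbles(species, build_tree, jumbles=10, seed=3):
    """Try several random input orders and keep the tree with the lowest score.

    build_tree(order) is a hypothetical callable returning (tree, score),
    where a lower score (sum of squares) means a better fit.
    """
    rng = random.Random(seed)
    best_tree, best_score = None, None
    for _ in range(jumbles):
        order = list(species)
        rng.shuffle(order)                   # a new random input order
        tree, score = build_tree(order)
        if best_score is None or score < best_score:
            best_tree, best_score = tree, score
    return best_tree, best_score


# Stand-in scoring function, used only to make the sketch runnable.
def toy_build(order):
    return tuple(order), float(order.index("APHAVP1A"))


tree, score = best_of_jumbles(
    ["APHAVP1C", "APHAVP1D", "APHAVP1A", "APHAVP1B"], toy_build, jumbles=10)
print(tree, score)
```

Only the overall best result is returned, which mirrors the behaviour described above of reporting the best trees over all runs rather than one per run.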

-SETS=2

tells the program how many data sets there are in the input file. This makes it possible (when the output tree file is analyzed in EConsense) to do a bootstrap (or delete-half-jackknife) analysis with the distance matrix programs.

-SHOWData

prints the input data in the output file at the start of the run.

Cavalli-Sforza, L. L., and A. W. F. Edwards. 1967. Phylogenetic analysis: models and estimation procedures. Evolution 21: 550-570 (also Amer. J. Human Genetics 19: 233-257).

Farris, J. S. 1981. Distance data in phylogenetic analysis. pp. 3-23 in Advances in Cladistics: Proceedings of the first meeting of the Willi Hennig Society, ed. V. A. Funk and D. R. Brooks. New York Botanical Garden, Bronx, New York.

Farris, J. S. 1985. Distance data revisited. Cladistics 1: 67-85.

Farris, J. S. 1986. Distances and statistics. Cladistics 2: 144-157.

Felsenstein, J. 1984. Distance methods for inferring phylogenies: a justification. Evolution 38: 16-24.

Felsenstein, J. 1986. Distance methods: a reply to Farris. Cladistics 2: 130-144.

Felsenstein, J. 1988. Phylogenies from molecular sequences: inference and reliability. Annual Review of Genetics 22: 521-565.

Fitch, W. M., and E. Margoliash. 1967. Construction of phylogenetic trees. Science 155: 279-284.

For further information please refer to the "distance.doc" and "fitch.doc" files from the PHYLIP (Phylogeny Inference Package) distribution Version 3.57c by Joseph Felsenstein (available by anonymous FTP at evolution.genetics.washington.edu in directory pub/phylip).

Printed: November 15, 1996 11:46 (1162)