EKitsch estimates phylogenies from distance matrix data using the Fitch-Margoliash method and some related least squares methods, with the assumption that there is a evolutionary clock.
EKitsch is a modified version of the PHYLIP version 3.572c's KITSCH, by Joseph Felsenstein, with command line control added.
EKitsch estimates phylogenies from distance matrix data under the "ultrametric" model which is the same as the additive tree model in EFitch except that an evolutionary clock is assumed. It uses the Fitch-Margoliash criterion and some related least squares criteria.
The input file for EKitsch is the output file from EDnaDist and EProtDist.
This program was originally written by Joe Felsenstein (E-mail:firstname.lastname@example.org. Post: Department of Genetics, University of Washington, Box 357360, Seattle, Washington 98195-7360, U.S.A.)
This version was modified for inclusion in EGCG by Maria Jesus Martin (E-mail: email@example.com; Post: EMBL Outstation Hinxton, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SQ or E-mail: firstname.lastname@example.org; Post: Tecnologia para Diagnostico e Investigacion, Condes de Torreanaz 5, 28028 Madrid).
All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (email@example.com).
Here is a session with EKitsch
% ekitsch -options EKITSCH of what distance matrix file ? fmdv.dnadist What should I call the output file (* fmdv.ekitsch *) ? Use user trees in input file (* No *) ? What power (* 2.0 *) ? Allow negative branch lengths (* No *) ? Data matrix form : S)quare. L)ower-triangular. U)pper-triangular. Choose matrix form (* S *) ? Use subreplicates (* No *) ? Randomize input order of sequences (* No *) ? Analyze multiple data sets (* No *) ? Print out the data at start of run (* No *) ? Adding species: APHAVP1C APHAVP1D APHAVP1A APHAVP1B APHAVP1E APHAVP1F Doing global rearrangements !-----------! ........... Output written into fmdv.ekitsch Tree also written into fmdv.ekitsch_trees %
The input file for EKitsch is the output file from EDnaDist and EProtDist. The first line of the input file contains the number of species. There follows species data, starting with a species name. The species name is ten characters long, and must be padded out with blanks if shorter. For each species there then follows a set of distances to all the other species (options allow the distance matrix to be upper or lower triangular or square).
Here is the input file for the example session.
6 APHAVP1C 0.0000 0.0083 0.0392 0.0348 0.0324 0.0353 APHAVP1D 0.0083 0.0000 0.0412 0.0368 0.0344 0.0406 APHAVP1A 0.0392 0.0412 0.0000 0.0034 0.0106 0.0178 APHAVP1B 0.0348 0.0368 0.0034 0.0000 0.0071 0.0189 APHAVP1E 0.0324 0.0344 0.0106 0.0071 0.0000 0.0232 APHAVP1F 0.0353 0.0406 0.0178 0.0189 0.0232 0.0000
The output from EKitsch are two files, one containing an ASCII representation of an unrooted tree and the lengths of the interior segments, and another containing the tree in nested-pairs parenthesis notation.
Here is the output file from the example session.
6 Populations __ __ 2 \ \ (Obs - Exp) Sum of squares = /_ /_ ------------ 2 i j Obs negative branch lengths not allowed +APHAVP1B +--3 +--4 +APHAVP1A ! ! +--5 +APHAVP1E ! ! --2 +APHAVP1F ! ! +APHAVP1D +--1 +APHAVP1C Sum of squares = 0.324 Average percent standard deviation = 10.75753 examined 121 trees From To Length Time ---- -- ------ ---- 3 APHAVP1B 0.00170 0.01818 4 3 0.00239 0.01648 3 APHAVP1A 0.00170 0.01818 5 4 0.00565 0.01409 4 APHAVP1E 0.00409 0.01818 2 5 0.00844 0.00844 5 APHAVP1F 0.00974 0.01818 1 APHAVP1D 0.00415 0.01818 2 1 0.01403 0.01403 1 APHAVP1C 0.00415 0.01818Here is the output tree file from the example session.
PileUp creates a multiple sequence alignment from a group of related sequences using progressive pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. LineUp creates and edits multiple sequence alignments. Pretty displays multiple sequence alignments. Distances creates a table of the pairwise distances within a group of aligned sequences. GrowTree creates a phylogenetic tree from a distance matrix created by Distances using either the UPGMA or neighbor-joining method. You can create a text or graphics output file.
Phylip2Tree displays trees computed with one of the PHYLIP-programs or with EProtPars, EDnaPars, EDnaML, EDnaMLK, ENeighbor, EFitch and EKitsch in GCG style. ESeqBoot produces multiple data sets from a molecular sequence data set by bootstrap, jackknife, or permutation resampling. EDnaDist computes a distance matrix from nucleic acid sequences, under four different models of nucleotide substitution (Jukes and Cantor (1969), Kimura (1980), Jin and Nei(1990) and a model of maximum likelihood (Felsenstein, 1981)). EProtDist computes a distance measure for protein sequences, using maximum likelihood estimates based on the Dayhoff PAM matrix, Kimura's 1983 approximation to it, or a model based on the genetic code plus a constraint on changing to a different category of amino acid. ENeighbor estimates phylogenies from distance matrix data using the Neighbor-Joining method or the UPGMA method of clustering. EFitch estimates phylogenies from distance matrix data under the "additive tree model" according to which the distances are expected to equal the sums of branch lengths between the species. EDnaPars estimates phylogenies from nucleic acid sequences using the parsimony method. EProtPars estimates phylogenies from amino acid sequences using the parsimony method. EDnaML estimates phylogenies from nucleotide sequences by maximum likelihood. EDnaMLK does the same as EDnaML but assumes a molecular clock. EConsense computes consensus trees by the majority-rule consensus tree. It can be used as the final step in doing bootstrap analyses.
EKitsch carries out the Fitch-Margoliash (1967) and Least Squares (Cavalli-Sforza and Edwards, 1967) methods, plus a variety of others of the same family, with the assumption that all tip species are contemporaneous, and that there is an evolutionary clock (in effect, a molecular clock). This means that branches of the tree cannot be of arbitrary length, but are constrained so that the total length from the root of the tree to any species is the same.
The objective of the least squares methods is to find that tree which minimizes
Sum of squares = Si) Sj) (n(ij) (D(ij) - d(ij))(2) / D(ij)P))
where D is the observed distance between species i and j and d is the expected distance, computed as the sum of the lengths (amounts of evolution) of the segments of the tree from species i& to species j. The quantity n is the number of times each distance has been replicated. In simple cases this is taken to be one, but the user can, as an option, specify the degree of replication for each distance. The distance is then assumed to be a mean of those replicates. The power P is what distinguished the various methods. For the Fitch- Margoliash method, which is the default method with this program, P is 2.0. For the Cavalli-Sforza and Edwards least squares method it should be set to 0 (so that the denominator is always 1). An intermediate method is also available in which P is 1.0, and any other value of P, such as 4.0 or -2.3, can also be used. This generates a whole family of methods.
The method may be considered as providing an estimate of the phylogeny. Alternatively, it can be considered as a phenetic clustering of the tip species. This method minimizes an objective function, the sum of squares, not only setting the levels of the clusters so as to do so, but rearranging the hierarchy of clusters to try to find alternative clusterings that give a lower overall sum of squares. If P = 0.0, we are minimizing a simple sum of squares of the differences between the observed distance matrix and the expected one, and the method is very close in spirit to Unweighted Pair Group Arithmetic Average Clustering (UPGMA), also called Average-Linkage Clustering. If the topology of the tree is fixed and there turn out to be no branches of negative length, its result should be the same as UPGMA in that case. But since it tries alternative topologies and it combines nodes that otherwise could result in a reversal of levels, it is possible for it to give a different, and better, result than simple sequential clustering. UPGMA itself is available as an option in program ENeighbor.
For more information, please see the Distance Matrix Programs documentation file "distance.doc" and the "kitsch.doc" file from PHYLIP (Phylogeny Inference Package) distribution Version 3.57c by Joseph Felsenstein, available by anonymous FTP at evolution.genetics.washington.edu in directory pub/phylip.
An important use of this method will be to do a formal statistical test of the evolutionary clock hypothesis. This can be done by comparing the sums of squares achieved by EFitch and by EKitsch but some caveats are necessary. First, the assumption is that the observed distances are truly independent, that no original data item contributes to more than one of them (not counting the two reciprocal distances from i to j and from j to i). This will not hold if the distances are obtained from gene frequencies, from morphological characteres, or from molecular sequences. It may be invalid even for immunological distances and levels of DNA hybridization, provided that the use of a common standard for all members of a row or column allows an error in the measurement of the standard to affect all these distances simultaneously. It will also be invalid if the numbers have been collected in experimental groups, each measured by taking differences from a common standard which itself is measured with error. Only if the numbers in different cells are measured from independent standards can we depend on the statistical model. The details of the test and the assumptions are discussed in the review paper by Felsenstein (1984). For further and sometimes irrelevant controversy on these matters see the papers by Farris (1981, 1985, 1986) and by Felsenstein (1986, 1988). A second caveat is that the distances must be expected to rise linearly with time, not according to any other curve. Thus it may be necessary to transform the distances to achieve an expected linearity. If the distances have an upper limit beyond which they could not go, this is a signal that linearity may not hold. It is also VERY important to choose the power P at a value that results in the standard deviation of the variation of the observed from the expected distances being the P/2-th power of the expected distance.
To carry out the test, fit the same data with both EFtich and EKitsch and record the two sums of squares. If the topology has turned out the same, we have N = n(n-1)/2 distances which have been fit with 2n-3 parameters in EFitch, and with n-1 parameters in EKitsch Then the difference between S(K) and S(F) has d1 = n-2 degrees of freedom. It is statistically independent of the value of S(F), which has d2 = N-(2n-3) degrees of freedom. The ratio of mean squares ([S(K)-S(F)]/d1)/(S(F)/d2) should, under the evolutionary clock, have an F distribution with n-2 and N-(2n-3) degrees of freedom respectively. The test desired is that the F ratio is in the upper tail (say the upper 5%) of its distribution. If the S (subreplication) option is in effect, the above degrees of freedom must be modified by noting that N is not n(n-1)/2 but is the sum of the numbers of replicates of all cells in the distance matrix read in, which may be either square or triangular. A further explanation of the statistical test of the clock is given in my more recent paper (Felsenstein, 1986).
The "user tree" option requires a bifurcating tree in the input file, unlike EFitch, which requires an unrooted tree with a trifurcation at its base. Thus the tree shown below would be written:
((D,E),(C,(A,B)));If a tree with a trifurcation at the base is by mistake fed into EKitsch then some of its species will be ignored and too small a tree read in. This should result in an error message and the program should stop. It is important to understand the difference between the User Tree formats for Ekitsch and EFitch.
The output is a rooted tree, together with the sum of squares, the number of tree topologies searched, and, if the power P is at its default value of 2.0, the Average Percent Standard Deviation is also supplied. The lengths of the branches of the tree are given in a table, that also shows for each branch the time at the upper end of the branch. "Time" here really means cumulative branch length from the root, going upwards (on the printed diagram, rightwards). For each branch, the "time" given is for the node at the right (upper) end of the branch. It is important to realize that the branch lengths are not exactly proportional to the lengths drawn on the printed tree diagram! In particular, short branches are exaggerated in the length on that diagram so that they are more visible.
All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
Minimum Syntax: % ekitsch [-INfile=]file.ednadist -default
Prompted Parameters: [-OUTfile=]file.ekitsch output file. -POWer=2.0 P value in the least squares formula. The default value is 2 (the Fitch-Margoliash method). (See documentation) -MATrix=S form of the data matrix, where: S)quare. L)ower-triangular. U)pper-triangular. Optional Parameters: -OPTions makes the program ask for further specific options. -USERTree one or more user-defined rooted trees is to be provided for evaluation in the input file. -NEGallowed allows negative segment lengths in the tree. -SUBREPlicates subreplication option. -RANDom=1 use a random number generator to choose the input order of species. The seed should be an integer between 1 and 32767. -JUMnumber=10 number of times to restart the process (with different orders of species). -SETS=2 multiple data sets. -SHOWData print data in the output file.
The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
makes the program ask for all specific options.
tells the program that one or more user-defined trees are to be provided for evaluation in the input file. The trees are regarded as rooted and have a bifurcation at the base:
indicates that negative segment lengths are to be allowed in the tree (default is to require that all branch lengths be nonnegative).
tells the program that after each distance will be provided an integer indicating that the distance is a mean of that many replicates. Each distance must be followed by an integer indicating the number of replicates, so that a line of data looks like this:
Delta 3.00 5 3.21 3 1.84 9the 5, 3, and 9 being the number of times the measurement was replicated. When the number of replicates is zero, a distance value must still be provided, although its vale will not afect the result.
use a random number generator to choose the input order of species. Must be odd.
causes the program to ask you how many times you want to restart the process. If you answer 10, the program will try ten different orders of species in constructing the trees, and the results printed out will reflect this entire search process (that is, the best trees found among all 10 runs will be printed out, not the best trees from each individual run). Of course this is slow, taking 10 times longer than a single run. The program will print out the best tree found overall.
tells the program how many data sets there are from the input file. It allows us (when the output tree file is analyzed in EConsense) to do a bootstrap (or delete-half-jackknife) analysis with the distance matrix programs.
prints the sequences data on the output file before the distance matrix.
Cavalli-Sforza, L. L., and A. W. F. Edwards. 1967. Phylogenetic analysis: models and estimation procedures. Evolution 32: 550-570 (also Amer. J. Human Genetics 19: 233-257).
Farris, J. S. 1981. Distance data in phylogenetic analysis.pp. 3-23 in Advances in Cladistics: Proceedings of the first meeting of the Willi Hennig Society, ed. V. A. Funk and D. R. Brooks. New York Botanical Garden, Bronx, New York.
Farris, J. S. 1985. Distance data revisited. Cladistics 1: 67-85.
Farris, J. S. 1986. Distances and statistics. Cladistics 2: 144-157.
Felsenstein, J. 1984. Distance methods for inferring phylogenies: a justification. Evolution 38: 16-24.
Felsenstein, J. 1986. Distance methods: a reply to Farris. Cladistics 2: 130-144.
Felsenstein, J. 1988. Phylogenies from molecular sequences: inference and reliability. Annual Review of Genetics 22: 521-565.
Fitch, W. M., and E. Margoliash. 1967. Construction of phylogenetic trees. Science 155: 279-284.
For further information please refer to the "distance.doc" and "kitsch.doc" files from the PHYLIP (Phylogeny Inference Package) distribution Version 3.57c by Joseph Felsenstein (available by anonymous FTP at evolution.genetics.washington.edu in directory pub/phylip).
Printed: November 15, 1996 11:46 (1162)