Most of the programs allow various options that alter the amount of information the program is provided or what it is to do with the information. Most options are selected in the menu. However, a few are specified in the input file, or require part of their specification to be in the input file.
The Jumble option also causes the program to ask you how many times you
want to restart the process. If you answer 10, the program will try ten
different orders of species in constructing the trees, and the results printed
out will reflect this entire search process (that is, the best trees found
among all 10 runs will be printed out, not the best trees from each individual
run).
Weights can be used to analyze different subsets of characters (by
weighting the rest as zero). Alternatively, in the discrete characters
programs they can be used to force a certain group to appear on the phylogeny
(in effect confining consideration to only phylogenies containing that group).
This is done by adding an imaginary character that has 1's for the members of
the group, and 0's for all the other species. That imaginary character is then
given the highest weight possible: the result will be that any phylogeny that
does not contain that group will be penalized by such a heavy amount that it
will not (except in the most unusual circumstances) be considered. Of course,
the new character brings extra steps to the tree, but the number of these can
be calculated in advance and subtracted out of the total when reporting the
results. This use of weights is an important one, and one sadly ignored by
many users who could profit from it. In the case of molecular sequences we
cannot use weights this way, so that to force a given group to appear we have
to add a large extra segment of sites to the molecule, with (say) A's for that
group and C's for every other species.
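The construction of the imaginary character can be sketched in a few lines of Python. This is only an illustration, not part of the package: the function name, the dictionary layout of the data, and the species names are all hypothetical.

```python
def add_group_character(data, group):
    """Append an imaginary 0/1 character marking membership in `group`.

    `data` maps each species name to its string of discrete character
    states (hypothetical layout, chosen for this sketch only).
    """
    return {
        name: states + ("1" if name in group else "0")
        for name, states in data.items()
    }


# Hypothetical four-species data set; force Alpha and Beta to group together.
data = {"Alpha": "0101", "Beta": "0111", "Gamma": "1100", "Delta": "1000"}
augmented = add_group_character(data, {"Alpha", "Beta"})
```

The added character would then be given the highest weight in the weights line, and the steps it contributes subtracted from the reported total.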
Options Information in the Input File
In such cases, the program is notified that an option has been invoked by
the presence of one or more letters after the last number on the first line of
the input file. These letters may or may not be separated from each other by
blanks, though it is usually necessary to separate them from the number by a
blank. They can be in any order. Thus to invoke options A and W, the input
file starts with the line:
12 20 WA
or
12 20 A W
The options are described individually in the other documents of this package.
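As a rough illustration of the rule just described, the following Python sketch separates the numbers on the first line from the option letters that follow them. It is not the package's own parsing code; the function name is hypothetical, and it assumes the letters are separated from the last number by a blank.

```python
def parse_first_line(line):
    """Split a first line such as '12 20 WA' into numbers and option letters."""
    numbers, letters = [], []
    for token in line.split():
        if token.isdigit() and not letters:
            numbers.append(int(token))
        else:
            # Everything after the last number is treated as option letters,
            # whether or not the letters are separated from each other by blanks.
            letters.extend(ch.upper() for ch in token if ch.isalpha())
    return numbers, set(letters)


numbers, options = parse_first_line("12 20 A W")
```

Because the letters are collected into a set, their order on the line does not matter, which matches the statement that they can be in any order.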
For the options that require information to be in the input file, additional
information must be provided. For all but one of these, this information is
provided by placing a line after the first line of the file, but before the
beginning of the species data. The first character of that line should match
the option letter. These auxiliary information lines can be in any order.
Thus if options A and W are both invoked, both of the following formats (and
two others as well) are legal:
   12   20 AW                       12   20 A W
A         0001111000                Weights   00112221A0
Weights   00112221A0                A         0001111000
(then the species information)      (then the species information)
One of the options requires special discussion. Many of the programs have in
their menu the option U, which signals that one or more user-defined trees is
to be provided for evaluation. This "user tree" is supplied in the input file
(not the tree file), but AFTER the species data, rather than before it. It does
not require any indication to be placed in the first line of the input file, as
do the options that place information before the species data. After the data,
there is a line containing the number of user-defined trees being defined.
Each user-defined tree starts on a new line. It is in the same form as the
trees in the tree files mentioned above, namely the New Hampshire standard.
Here is an example with one user-defined tree:
In using the user tree option, check the pattern of parentheses carefully.
The programs do not always detect whether the tree makes sense, and if it does
not there will probably be a crash (hopefully, but not inevitably, with an
error message indicating the nature of the problem).
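Since the programs may not catch a malformed tree, it can help to check the parentheses before running them. The following is a minimal sketch in Python, not part of the package: the function name is hypothetical, the species names A to D are made up, and the check ignores any parentheses that might appear inside species names.

```python
def check_parentheses(tree):
    """Return None if parentheses balance, else a short description of the problem."""
    depth = 0
    for position, ch in enumerate(tree):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A ')' appeared with no '(' left to match it.
                return f"unmatched ')' at position {position}"
    if depth > 0:
        return f"{depth} unclosed '('"
    return None


problem = check_parentheses("((A,B),(C,D);")
```

A balanced pattern is necessary but not sufficient: a tree can pass this check and still fail to make sense in other ways, for example by naming a species that is not in the data.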
Common Options in the Menu
Seven options from the menu, the U (User tree), G (Global), J (Jumble), O
(Outgroup), T (Threshold), M (multiple data sets), and the tree output options,
are used so widely that it is best to discuss them in this document.
The U (User tree) option
This option toggles between the default
setting, which allows the program to search for the best tree, and the User
tree setting, which reads a tree or trees ("user trees") from the input file
and evaluates them. The user trees must follow the other information in the
data set, and be preceded by a line specifying the number of user trees that
are to be evaluated. Each user tree then is given in standard form, each
starting on a new line. The form that the user trees must take is described in
some detail below, under the description of the program output of tree files.
In some cases a program may require that the trees fed in be rooted trees, even
though the program cannot infer the placement of the root. In those cases you
can place the root anywhere. Program RETREE can be used to convert between
rooted and unrooted trees.
The G (Global) option
In the programs which construct trees (except
for NEIGHBOR, the "...PENNY" programs and CLIQUE, and of course the "...MOVE"
programs where you construct the trees yourself), after all species have been
added to the tree a rearrangements phase ensues. In most of these programs the
rearrangements are automatically global, which in this case means that subtrees
will be removed from the tree and put back on in all possible ways so as to
have a better chance of finding a better tree. Since this can be time
consuming (it roughly triples the time taken for a run) it is left as an option
in some of the programs, specifically CONTML, FITCH, and DNAML. In these
programs the G menu option toggles between the default of local rearrangement
and global rearrangement. The rearrangements are explained more below.
The J (Jumble) option
In most of the tree construction programs
(except for the "...PENNY" programs and CLIQUE), the exact details of the
search of different trees depend on the order of input of species. In these
programs the J option enables you to tell the program to use a random number
generator to choose the input order of species. This option is toggled on and
off by selecting option J in the menu. The program will then prompt you for a
"seed" for the random number generator. The seed should be an integer between
1 and 32767, and should be of the form 4n+1, which means that it must give a remainder
of 1 when divided by 4. This can be judged by looking at the last two digits
of the number. Each different seed leads to a different sequence of addition
of species. By simply changing the random number seed and re-running the
programs one can look for other, and better trees. If the seed entered is not
odd, the program will not proceed, but will prompt for another seed.
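The recommended form of the seed can be checked ahead of time. The sketch below, in Python, is an illustration only and the function names are hypothetical. It tests the range and the 4n+1 form, and shows why the last two digits are enough: 100 is divisible by 4, so the remainder on division by 4 depends only on those digits. Note that, as described above, the program itself only refuses seeds that are not odd.

```python
def is_recommended_seed(seed):
    """True if seed lies between 1 and 32767 and has the form 4n+1."""
    return 1 <= seed <= 32767 and seed % 4 == 1


def last_two_digits_test(seed):
    """The same 4n+1 test using only the last two digits of the seed."""
    return (seed % 100) % 4 == 1


ok = is_recommended_seed(4321)       # 21 leaves remainder 1 when divided by 4
not_ok = is_recommended_seed(4323)   # odd, but 23 leaves remainder 3
```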
The O (Outgroup) option
This specifies which species is to be used
to root the tree by having it become the outgroup. This option is toggled on
and off by choosing O in the menu. When it is on, the program will then prompt
for the number of the outgroup (the species being taken in the numerical order
that they occur in the input file). Responding by typing "6" and then a
carriage-return (Enter) character indicates that the sixth species in the data
is the outgroup. Outgroup-rooting will not be attempted if the data have
already established a root for the tree from some other consideration, and may
not be if it is a user-defined tree, despite your invoking the option. Thus
programs such as DOLLOP that produce only rooted trees do not allow the
Outgroup option. It is also not available in KITSCH, DNAMLK, or CLIQUE. When
it is used, the tree as printed out is still listed as being an unrooted tree,
though the outgroup is connected to the bottommost node so that it is easy to
visually convert the tree into rooted form.
The T (Threshold) option
This sets a threshold such that if the
number of steps counted in a character is higher than the threshold, it will be
taken to be the threshold value rather than the actual number of steps. The
default is a threshold so high that it will never be surpassed. The T menu
option toggles on and off asking the user to supply a threshold. The use of
thresholds to obtain methods intermediate between parsimony and compatibility
methods is described in my 1981b paper. When the T option is in force, the
program will prompt for the numerical threshold value. This will be a positive
real number greater than 1. In programs MIX, MOVE, PENNY, PROTPARS, DNAPARS,
DNAMOVE, and DNAPENNY, do not use threshold values less than or equal to 1.0,
as they have no meaning and lead to a tree which depends only on considerations
such as the input order of species and not at all on the character state data!
In programs DOLLOP, DOLMOVE, and DOLPENNY the threshold should never be 0.0 or
less, for the same reason. The T option is an important and underutilized one:
it is, for example, the only way in this package (except for program DNACOMP)
to do a compatibility analysis when there are missing data. It is a method of
de-weighting characters that evolve rapidly. I wish more people were aware of
its properties.
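The effect of the threshold on the step count can be sketched as follows. This Python fragment is only an illustration of the rule stated above, not the programs' own code; the function name and the step counts are hypothetical.

```python
def thresholded_steps(steps_per_character, threshold):
    """Total steps, with each character's count capped at the threshold."""
    return sum(min(steps, threshold) for steps in steps_per_character)


# Hypothetical step counts for four characters on one tree.
steps = [1, 2, 5, 7]
uncapped = thresholded_steps(steps, 1000)  # threshold too high to matter
capped = thresholded_steps(steps, 3)       # the 5 and the 7 each count as 3
```

With a very high threshold the total is the ordinary parsimony count; as the threshold is lowered, characters that change many times contribute no more than the threshold, which is how rapidly evolving characters are de-weighted.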
The M (Multiple data sets) option
In menu programs there is an M
menu option which allows one to toggle on the multiple data sets option. The
program will ask you how many data sets it should expect. The data sets have
the same format as the first data set. Here is a (very small) input file with
two five-species data sets:
The main use of this option will be to allow all of the methods in these
programs to be bootstrapped. Using the program SEQBOOT one can take any DNA,
protein, restriction sites, or binary character data set and make multiple data
sets by bootstrapping. Trees can be produced for all of these using the M
option. They will be written on the tree output file if that option is left in
force. Then the program CONSENSE can be used with that tree file as its input
file. The result is a majority rule consensus tree which can be used to make
confidence intervals. The present version of the package allows, with the use
of SEQBOOT and CONSENSE and the M option, bootstrapping of many of the methods
in the package.
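For readers who want to inspect a multiple data sets file outside the programs, the following Python sketch splits one into its separate data sets. It rests on assumptions that are not stated in this document and may not hold for every program: that the first number on each header line is the number of species, and that each species occupies exactly one line. The function name and the tiny data are hypothetical.

```python
def split_data_sets(lines, n_sets):
    """Split input-file lines into n_sets data sets.

    Assumes (for this sketch only) that each data set is a header line whose
    first number is the species count, followed by one line per species.
    """
    data_sets, pos = [], 0
    for _ in range(n_sets):
        n_species = int(lines[pos].split()[0])
        data_sets.append(lines[pos:pos + 1 + n_species])
        pos += 1 + n_species
    return data_sets


# Hypothetical file with two data sets of two species each.
lines = [
    "2 4",
    "Alpha     ACGT",
    "Beta      ACGA",
    "2 4",
    "Alpha     ACCT",
    "Beta      ACGA",
]
first, second = split_data_sets(lines, 2)
```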
The option to write out the trees into a tree file
This specifies that you want the program to write out the tree not only on its usual output,
but also onto a file in nested-parenthesis notation (as described above). This
option is sufficiently useful that it is turned on by default in all programs
that allow it. You can optionally turn it off if you wish, by typing the
appropriate number from the menu (it varies from program to program). This
option is useful for creating tree files that can be read directly into the
plotting programs and the consensus tree program, and that can be incorporated
into the input file to specify user-defined trees in many of the other programs.
The (0) terminal type option
The program will default to one
particular assumption about your terminal (except in the case of Macintoshes,
the default will be an ANSI compatible terminal). You can alternatively select
it to be either an IBM PC, a DEC VT52, or nothing. This affects the ability of
the programs to clear the screen when they display their menus, and the
graphics characters used to display trees in the programs DNAMOVE, MOVE,
DOLMOVE, and RETREE. If you are running a PCDOS system and have the ANSI.SYS
driver installed in your CONFIG.SYS file, you may find that the screen clears
correctly even with the default setting of ANSI.
Common Options Requiring Information in the Input File
There are a number of options (Ancestor, Factors, Categories and Weights)
that are specified in the input file. Some of them must also be selected in
the menu. Of these, the Ancestor and Factors options are specific to the
Discrete Characters programs and are described in their group document. The
Categories option is specific to some of the molecular sequence programs and is
described in their group document. The Weights option is used throughout the
package and is best introduced here.
The Weights Option
This allows us to specify weights on the individual characters. Weights
are invoked by placing a W on the first line of the file. The weights are then
specified by a line or lines which start with W and then have enough characters
or blanks to complete the full length of a species name. Then they have a
single character (0-9 or A-Z) for each character. Thus they look like the data
for a species:
Weights   00112221A0
The weights cause a character to be counted as if it were n characters, where n
is the weight. The values 0-9 give weights 0 through 9, and the values A-Z
give weights 10 through 35. By use of the weights we can give overwhelming
weight to some characters, and drop others from the analysis. In the molecular
sequence programs only two values of the weights, 0 or 1, are allowed.
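The coding of the weights can be made concrete with a short Python sketch. It is an illustration only, with hypothetical function names and step counts; it takes just the weight characters, without the leading name field of the weights line.

```python
import string

# '0'-'9' stand for weights 0 through 9, 'A'-'Z' for weights 10 through 35.
WEIGHT_SYMBOLS = string.digits + string.ascii_uppercase


def decode_weights(weight_chars):
    """Convert weight characters such as '00112221A0' into integer weights."""
    return [WEIGHT_SYMBOLS.index(ch) for ch in weight_chars.upper() if not ch.isspace()]


def weighted_steps(steps_per_character, weights):
    """Count each character as if it were n characters, where n is its weight."""
    return sum(w * s for w, s in zip(weights, steps_per_character))


weights = decode_weights("00112221A0")
total = weighted_steps([1] * 10, weights)  # hypothetical: one step per character
```

A character with weight 0 drops out of the total entirely, and a character with weight A counts ten times, which is the sense in which weights can drop some characters and give overwhelming weight to others.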
Maintained 15 Jul 1996 -- by Martin Hilbers(e-mail:M.P.Hilbers@dl.ac.uk)