


Homologies makes a table of the pair-wise distances within a group of aligned sequences.


This program was written by Jack A.M. Leunissen (E-mail: jackl@caos.kun.nl; Post: CAOS/CAMM Center, University of Nijmegen, 6525 ED Nijmegen, The Netherlands).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).


Here is a sample session with Homologies:

  % homologies
   HOMOLOGIES uses any sequences
   HOMOLOGIES of what sequence(s) ?  pileup.msf{*}
           pileup.msf{Hs70_Plafa}, len: 738
           pileup.msf{Hs70_Thean}, len: 738
           pileup.msf{Hs77_Yeast}, len: 738
           pileup.msf{Dnak_Ecoli}, len: 738
                    Start (* 1 *) ? 31
                  End (*   738 *) ? 230
   What is the threshold for a match (* 0.60 *) ?
   How should gaps be handled:
  I)nclude gaps in the comparisons
  L)ength-independent gap inclusion
  E)xclude gaps from comparison
  N)one of the gaps in any sequence
   Please choose one (* I *):
   How should END-gaps be handled:
  I)nclude them in the comparison
  E)xclude them
   Please choose one (* I *):
   Divide the sum of the matches by
  R)esidues compared
  S)horter sequence length
  A)verage sequence length
   Please choose one (* R *):
   What should I call the output file (* pileup.homologies *) ?


Here is part of the output file from a session with Homologies:

   HOMOLOGIES within: @gendocdata:pretty.list  May 23, 1995 14:21
 Number of sequences: 11
First residue number: 1
 Last residue number: 349
  Threshold of comparison: 0.60
  Symbol Comparison Table: pepdistances.cmp
         Denominator: "Number of residues compared"
    Gap handling, General: "Including gaps"
    Gap handling, Termini: "Including terminal gaps"
         Gap penalty: 0.00
  Default scoring matrix used by OLDDISTANCES for the comparison of
  protein sequences.
  Dayhoff table (Schwartz, R. M. and Dayhoff, M. O. [1979] in Atlas of
  Protein Sequence and Structure, Dayhoff, M. O. Ed, pp. 353-358, National
  Biomedical Research Foundation, Washington D.C.) rescaled by dividing
  each value by the sum of its row and column, and normalizing to a mean . . .
  Key for column and row indices:
    1                 fa10.ugly  Length: 349       Length without gaps: 212
    2                 fa12.ugly  Length: 349       Length without gaps: 213
    3                 fo1k.ugly  Length: 349       Length without gaps: 213
    4                    e.ugly  Length: 349       Length without gaps: 288
    5                  p1m.ugly  Length: 349       Length without gaps: 302
    6                  p1s.ugly  Length: 349       Length without gaps: 302
    7                  p2s.ugly  Length: 349       Length without gaps: 301
    8                  p3s.ugly  Length: 349       Length without gaps: 300
    9                  cb3.ugly  Length: 349       Length without gaps: 288
   10                  r14.ugly  Length: 349       Length without gaps: 289
   11                   r2.ugly  Length: 349       Length without gaps: 289
   Similarity Matrix Part: 1
              1         2         3         4         5   ...
  ______________________________ ... ..
  |    1   |    0.6000    0.5606    0.4605    0.1634    0.1787 ...
  |    2   |              0.6000    0.4744    0.1677    0.1809 ...
  |    3   |                        0.6000    0.1586    0.1809 ...
  |    4   |                                  0.6000    0.1694 ...



Homologies calculates pair-wise homology scores, or distances, within a group of previously aligned sequences. Optionally, the program creates an output file suitable for use as input for the PHYLIP programs.


Homologies can handle gaps in the input sequences in various ways: they can be included in the calculations, or they can be ignored. When gaps are included in the calculation, either every gap position is treated as an individual mismatch, or each gap is treated as one single mismatch regardless of its length. Likewise, when gaps are ignored, they can either be ignored for each sequence pair individually, or any gap occurring in any sequence can be ignored in all sequences.

A special case is the treatment of end-gaps, i.e. gaps occurring at the beginning or end of the sequences, where the termini do not align. They can be switched on or off separately (this, of course, does not apply when the general gap handling was already switched off!).
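
The gap-handling choices described above can be illustrated with a minimal Python sketch. This is not the Homologies source code: the function names, the "." gap symbol, the use of plain identity instead of a symbol comparison table, and the decision to skip columns that are gapped in both sequences are all assumptions made for illustration.

```python
GAP = "."  # assumed gap symbol; adjust to whatever your alignment uses


def trim_end_gaps(a, b):
    """Drop leading and trailing columns where either sequence has a gap."""
    start, end = 0, len(a)
    while start < end and GAP in (a[start], b[start]):
        start += 1
    while end > start and GAP in (a[end - 1], b[end - 1]):
        end -= 1
    return a[start:end], b[start:end]


def strip_gapped_columns(seqs):
    """'None' mode: remove every column that has a gap in any sequence."""
    keep = [i for i in range(len(seqs[0])) if all(s[i] != GAP for s in seqs)]
    return ["".join(s[i] for i in keep) for s in seqs]


def count_pair(a, b, gaps="include"):
    """Return (matches, mismatches) for two aligned sequences.

    gaps = "include": every gap position counts as one mismatch
    gaps = "length":  each run of gaps counts as one mismatch
    gaps = "exclude": gap positions are skipped for this pair
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if gaps not in ("include", "length", "exclude"):
        raise ValueError("unknown gap mode: " + gaps)
    matches = mismatches = 0
    gap_owner = None  # which sequence holds the gap run currently open
    for x, y in zip(a, b):
        if x == GAP and y == GAP:
            continue  # assumption: a column gapped in both is not scored
        if x == GAP or y == GAP:
            owner = "a" if x == GAP else "b"
            if gaps == "include" or (gaps == "length" and owner != gap_owner):
                mismatches += 1
            gap_owner = owner
            continue
        gap_owner = None
        if x == y:  # assumption: identity stands in for the comparison table
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches
```

For the pair "AC..GT" and "ACTTGA", the two-position gap adds two mismatches in "include" mode, one in "length" mode, and none in "exclude" mode. Excluding end-gaps corresponds to calling trim_end_gaps on the pair before counting.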


The homology or mismatch value is usually expressed as the number of (mis)matches per residue. The number of matching characters - or the mismatch value - can therefore be divided by the number of residues compared, the length of the shorter sequence, the average sequence length, or nothing. With the -PERCent option, the sum of matching residues is (by default) divided by the number of residues compared, and multiplied by 100.
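
The effect of the denominator choice can be sketched as follows. The function name and arguments are hypothetical, and the sequence lengths are assumed to be the lengths without gaps; the program's own bookkeeping may differ.

```python
def normalise(matches, compared, len_a, len_b,
              denominator="residues", percent=False):
    """Scale a raw match count by the chosen denominator.

    matches  : sum of the matches for one sequence pair
    compared : number of residues actually compared
    len_a/b  : sequence lengths (assumed here to exclude gaps)
    """
    if denominator == "residues":
        divisor = compared
    elif denominator == "shorter":
        divisor = min(len_a, len_b)
    elif denominator == "average":
        divisor = (len_a + len_b) / 2
    elif denominator == "nothing":
        divisor = 1
    else:
        raise ValueError("unknown denominator: " + denominator)
    value = matches / divisor
    # A percentage is the same fraction multiplied by 100.
    return value * 100 if percent else value
```

With 150 matches over 200 compared residues and sequence lengths of 212 and 288, "residues" gives 0.75, "average" gives 0.6, "nothing" returns the raw count of 150, and the percentage form of "residues" gives 75.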


Homologies uses a correction method known as augmentation. When distantly related sequences are compared, the number of mismatches (or evolutionary events) is usually underestimated, due to a process known as "multiple hits". This simply means that any homologous position in a pair of sequences may have undergone numerous substitutions, while we just notice a difference at that particular position. In fact, the position may even be identical, while both lineages may have undergone several substitutions since their common ancestor, before arriving at their current (identical) state. To compensate for this underestimation of the number of substitutions, especially in distantly related sequences, several formulas ("augmentation" schemes) have been published. A number of them have been implemented in Homologies.
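
As an illustration of what such a correction does, the sketch below implements the textbook forms of two published formulas: the Jukes-Cantor correction and Kimura's empirical protein distance. Whether Homologies uses exactly these forms is an assumption; the function names are hypothetical, and p stands for the observed fraction of differing positions.

```python
import math


def jukes_cantor(p, states=4):
    """Jukes-Cantor correction: d = -b * ln(1 - p / b), b = (n - 1) / n.

    states = 4 for nucleotides; 20 gives the usual protein variant.
    """
    b = (states - 1) / states
    if not 0 <= p < b:
        raise ValueError("p must lie in [0, %.3f)" % b)
    return -b * math.log(1 - p / b)


def kimura_protein(p):
    """Kimura's empirical protein distance: d = -ln(1 - p - 0.2 * p**2)."""
    x = 1 - p - 0.2 * p * p
    if p < 0 or x <= 0:
        raise ValueError("p is outside the range where the formula is defined")
    return -math.log(x)
```

Both functions return a value larger than p for any p above zero, and the gap between the observed and corrected values widens as the sequences diverge, which is the intended compensation for multiple hits.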


The default settings in Homologies are by no means imperative; they merely reflect the author's preferences!


The input file for Homologies is a GCG MSF sequence file.


All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  Syntax: % homologies [-INfile=]pileup.msf{*} -Default
  Required Parameters:
  -THReshold=1.0       minimum symbol comparison score for a match
  -DENOMinator=Residue divides the sum of the matches by:
                  Residues = Number of residues compared
                  Shorter  = Length of shorter sequence
                  Average  = Average length
                  Nothing  = Nothing
                  Include = Include gaps in the comparison
                  Length = length-independent gaps
                  Exclude = Exclude gaps from the comparison
                  None = Exclude EVERY gap in EVERY sequence
                  Include = Include end-gaps in the comparison
                  Exclude = Exclude end-gaps in the comparison
  -GAPValue=           Gap penalty
  [-OUTfile=]pileup.distances  output file
  Local Data Files:
  -DATa=pepdistances.cmp  comparison table for peptide sequences
  -DATa=dnadistances.cmp  comparison table for nucleotide sequences
  Optional Parameters:
  -DISTances              calculate sequence differences
  -NASscore               Doolittle's NAS score
  -PERCent                print percentage homology
  -AUGmentation=          correct sequences for multiple hits:
                     Jukes = Jukes-Cantor
                     Kimura = Kimura's method
  -SQRT                   take square root of distances
  -PHYlip=PileUp.Phylip   output comparison matrix in PHYLIP format
  -NAMELength=10          length of name-field in PHYLIP output


The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.


The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.


-THReshold defines the minimum symbol comparison score for a match.


-DENOMinator selects what the sum of the matches is divided by: Residues: the number of residues compared; Shorter: the length of the shorter sequence; Average: the average sequence length; or Nothing: nothing.


The gap-handling qualifier instructs the program how to handle gaps in the sequences. Include: include gaps in the comparison; Length: length-independent gaps, i.e. every gap is treated as one single mismatch, regardless of its length; Exclude: exclude gaps from the comparison; or None: exclude EVERY gap in EVERY sequence.


The end-gap qualifier tells the program how to operate on sequences of unequal length. Valid responses are: Include: include end-gaps in the comparison; or Exclude: exclude end-gaps in the comparison.


-GAPValue sets the gap penalty to a user-specified value.


-DISTances calculates sequence differences instead of sequence similarities.


-NASscore calculates Doolittle's NAS score.


-PERCent prints homology values as a percentage, rather than a fraction.


-AUGmentation corrects sequences for multiple hits. Currently implemented methods are Jukes: Jukes-Cantor formula, and Kimura: Kimura's method.


-PHYlip writes the output comparison matrix in PHYLIP format.


-NAMELength specifies the length of the name field in the PHYLIP output file. Changing this parameter usually requires you to change the same value in the PHYLIP source code as well, and to recompile those programs!
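
To show why the name-field length matters, here is a minimal sketch of writing a square distance matrix in the layout read by the PHYLIP distance programs: a first line with the number of sequences, then one row per sequence that starts with a fixed-width name field. The function name, the number formatting, and the example values are assumptions for illustration, not the program's actual output routine.

```python
def phylip_matrix(names, dist, name_length=10):
    """Format a square distance matrix as PHYLIP-style text.

    names       : sequence names, one per row
    dist        : list of rows, each a list of distances
    name_length : width of the fixed name field (PHYLIP expects 10)
    """
    if len(names) != len(dist) or any(len(row) != len(names) for row in dist):
        raise ValueError("matrix must be square and match the name list")
    lines = ["%5d" % len(names)]
    for name, row in zip(names, dist):
        # Truncate or pad so every name occupies exactly name_length columns.
        label = name[:name_length].ljust(name_length)
        lines.append(label + " ".join("%8.4f" % d for d in row))
    return "\n".join(lines) + "\n"


# Illustrative values only.
example = phylip_matrix(["fa10", "fa12"], [[0.0, 0.4394], [0.4394, 0.0]])
```

Because the reading program takes the first name_length characters of each row as the name, a file written with a different width is misread unless the reader was compiled with the same value.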

Printed: April 22, 1996 15:53 (1162)