EConsensus calculates a consensus sequence for a set of pre-aligned short nucleic acid sequences by tabulating the percent of G, A, T, and C for each position in the set. GCG's FitConsensus uses the EConsensus output table as a probe to search for the best examples of the derived consensus in other nucleotide sequences.
EConsensus reads a file of aligned sequences for which you want to know the consensus pattern. EConsensus constructs a consensus table with the percent of each nucleotide at each position. The total number of nucleotides contributing to each position in the sequence shown in the table is also reported. Below the table, EConsensus writes the least ambiguous expression of the consensus sequence for a confidence level that you request.
This GCG program was modified by Jaakko Hattula (Tampere University of Technology, Finland) and Peter Rice (E-mail: firstname.lastname@example.org Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).
All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (email@example.com).
Here is a session using EConsensus to find the consensus of the intervening sequence acceptor splice sites from the file acceptor.dat:
  % econsensus

   ECONSENSUS of what file ?  acceptor.dat

   Find consensus to what percent certainty (* 75.0 *) ?

   What should I call the output file (* acceptor.csn *) ?

   ................

  %
Here is the output file, which is a legal GCG sequence file:
  ECONSENSUS of: acceptor.dat

  IVS Acceptor Splice Site Sequences from Stephen Mount
  NAR 10(2); 459-472 figure 1 page 460

  Acceptor

  *****

  %G      15  22  10  10  10   6   7   9   7   5   5  24   1   0
  %A      15  10  10  15   6  15  11  19  12   3  10  25   4 100
  %T      52  44  50  54  60  49  48  45  45  57  58  30  31   0
  %C      18  25  30  21  24  30  34  28  36  35  27  21  64   0
  Total  114 114 115 127 127 127 128 128 128 130 131 131 131 131

  %G     100  52  24  19
  %A       0  22  17  20
  %T       0   8  37  29
  %C       0  18  22  32
  Total  131 131 131 131

  *****

  ECONSENSUS sequence to a certainty level of 75.0 percent
  at each position:

  Length: 18  December 12, 1995 16:15  Type: N  Check: 3343  ..

         1  BBYHYYYHYY YDYAGVBH
FitConsensus uses the file written by EConsensus to search for the best places in a nucleotide sequence where the consensus table fits. The mapping programs can be run with the command line option -ALL to search for all potential restriction sites in an ambiguous sequence.
ProfileMake creates a position-specific scoring table, called a profile, that quantitatively represents the information from a group of aligned sequences. The profile can then be used for database searching (ProfileSearch) or sequence alignment (ProfileGap).
EConsensus makes no attempt to align the sequences in the input file, so you should be sure that they are optimally aligned before running the program. The input file structure is described below. The ambiguous representation of the sequence may be arbitrary if there are equal numbers of observations of some nucleotides.
EConsensus counts the number of G's, A's, T's, and C's in each position of the prealigned sequences. G, A, T, and C each have a value of one. The value of an ambiguous nucleotide code is divided among the nucleotides it represents: R, for instance, represents A or G and therefore contributes 0.5 to G and 0.5 to A. Periods (gaps) have no value. When the count is complete, the counts of each nucleotide at each position are totaled, normalized to 100, and rounded to the nearest integer. The normalized integers are reported as the %G, %A, etc., at each position. The total number of observations used to generate the percent figures is also shown. An observation is any IUB code (see Appendix III); periods do not count as observations.
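The counting rule can be sketched in a few lines of Python. This is an illustration only, not the EConsensus source: the names IUB and tabulate are hypothetical, and the sketch assumes that three- and four-base codes (B, D, H, V, N) are split evenly in the same way the text describes for R. It returns unrounded percentages; the printed table rounds them to integers.

```python
# IUB nucleotide codes and the bases each one stands for.
IUB = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def tabulate(sequences):
    """Return one (percents, total) pair per column of the alignment.

    percents maps G, A, T, C to an unrounded percentage; total is the
    number of observations (IUB codes) seen in that column.
    """
    columns = []
    for column in zip(*sequences):
        counts = dict.fromkeys("GATC", 0.0)
        total = 0
        for symbol in column:
            bases = IUB.get(symbol.upper())
            if bases is None:
                continue  # periods (gaps) have no value and are not observations
            total += 1
            for base in bases:
                # Assumption: an ambiguous code is shared equally among its bases.
                counts[base] += 1.0 / len(bases)
        percents = {
            base: (100.0 * counts[base] / total if total else 0.0)
            for base in "GATC"
        }
        columns.append((percents, total))
    return columns

if __name__ == "__main__":
    for percents, total in tabulate(["RA", ".A", "GT"]):
        print(percents, total)
```

In the first column of the toy input, R adds 0.5 to G and 0.5 to A, the period is skipped, and the total number of observations is 2.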
For the certainty level that you set, EConsensus writes the least ambiguous expression of the sequence in the table using the IUB ambiguity codes. For each column (position) in the table, the program starts with the largest member (G, A, T, or C) and adds successively smaller members until the sum is equal to or greater than the certainty level. If two nucleotides have the same score, EConsensus arbitrarily picks one of them to add to the consensus, so the symbol reported at such a position may be somewhat misleading.
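A minimal Python sketch of this selection rule follows. The names CODE_FOR and consensus_symbol are hypothetical, and ties are broken by the order in which the bases appear in the input dictionary, which is just one arbitrary choice. The percentages passed in below are the rounded figures from the example table; the program presumably works from unrounded values, so a borderline column can come out differently (in the example output the fourth column shows T 54 and C 21, which sum to 75, yet the reported symbol is H rather than Y).

```python
# Map each set of bases to the IUB code that represents it.
CODE_FOR = {
    frozenset(bases): code
    for code, bases in {
        "G": "G", "A": "A", "T": "T", "C": "C",
        "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
        "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
    }.items()
}

def consensus_symbol(percents, certainty=75.0):
    """Pick the least ambiguous IUB code for one column.

    percents maps G, A, T, C to percentages. Bases are added from the
    largest downward until their sum reaches the certainty level.
    """
    # sorted() is stable, so tied bases keep their input order (arbitrary).
    ranked = sorted(percents.items(), key=lambda item: item[1], reverse=True)
    chosen = set()
    running = 0.0
    for base, percent in ranked:
        chosen.add(base)
        running += percent
        if running >= certainty:
            break
    return CODE_FOR[frozenset(chosen)]

if __name__ == "__main__":
    # Column 16 of the example table: G 52 + A 22 = 74, so C is added too.
    print(consensus_symbol({"G": 52, "A": 22, "T": 8, "C": 18}))
```

With the column above, G and A together reach only 74 percent, so C is added and the symbol covering A, C, and G (V) is chosen, which matches position 16 of the example consensus.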
The input file has a heading of indefinite length, followed by a line containing two adjacent periods (..). The sequences follow with one sequence per line. Every sequence is the same length. The maximum size for the sequences is 130 bases. EConsensus assumes optimal alignment of the sequences. Here is part of the input file for the example above:
  IVS Acceptor Splice Site Sequences compiled by Stephen Mount
  NAR 10(2); 459-472 figure 1 page 460 / ..

  .........AAATAGGAT
  .........TTGTAGGTG
  ..........TGTAGGTG
  TTTATTTATTTCAAGATT
  //////////////////
  GTCACTTGTCACTAGGTA
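The file layout described above can be read with a short Python sketch like the one below. It is an assumption-laden illustration, not EConsensus code: read_aligned is a hypothetical name, the first line containing two adjacent periods is taken as the end of the heading, and blank lines are skipped. The sample text leaves out the row of slashes from the excerpt, on the assumption that it only marks lines omitted from the listing.

```python
MAX_LENGTH = 130  # maximum sequence size stated in the documentation

def read_aligned(text):
    """Return the aligned sequences that follow the '..' heading divider."""
    lines = text.splitlines()
    divider = next((i for i, line in enumerate(lines) if ".." in line), None)
    if divider is None:
        raise ValueError("no heading divider (..) found")
    sequences = [line.strip() for line in lines[divider + 1:] if line.strip()]
    lengths = {len(sequence) for sequence in sequences}
    if len(lengths) > 1:
        raise ValueError("sequences are not all the same length")
    if lengths and max(lengths) > MAX_LENGTH:
        raise ValueError("sequences are longer than 130 bases")
    return sequences

if __name__ == "__main__":
    sample = """IVS Acceptor Splice Site Sequences compiled by Stephen Mount
NAR 10(2); 459-472 figure 1 page 460 / ..

.........AAATAGGAT
.........TTGTAGGTG
..........TGTAGGTG
TTTATTTATTTCAAGATT
GTCACTTGTCACTAGGTA
"""
    for sequence in read_aligned(sample):
        print(sequence)
```

Leading periods in a sequence line are kept, since they are gap positions that hold the alignment in place.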
All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
  Minimal Syntax: % econsensus [-INfile=]acceptor.dat -Default

  Prompted Parameters:

  [-OUTfile=]acceptor.csn   output file name
  -CERtainty=75.0           percent certainty at which to find consensus

  Local Data Files: None

  Optional Parameters: None

  -NOMONitor                turn off monitoring of progress
Printed: April 22, 1996 15:52 (1162)