The Short Descriptions section tells you what tools are available in the EGCG Package and contains a one- or two-sentence description of every EGCG command.
You run GCG and EGCG programs by typing the name of the program next to the % (percent sign) UNIX prompt. The text next to the % in the descriptions below is the command that you enter to run each program; you don't type the percent sign. The bold type (also called the typewriter font) in any of its qualifiers indicates the minimum number of characters that you must type on the command line to run the program. You must type the full UNIX program name; partial names are ignored.
Whenever the word file(s) or sequence(s) appears in text, the (s) means that the program works on one or more sequences simultaneously. (See the Introduction section of the GCG users guide for more information about GCG documentation conventions.)
The % gcg command is used at login to initialize the GCG Package. The % egcg command is used in the same way to initialize the EGCG Package. The EGCG Package is ready to run when a banner appears on your screen that looks something like this:
    Welcome to the EGCG extensions to the

    WISCONSIN PACKAGE
    Version 8.1, November 1995
    Installed on irix

    Copyright 1982, 1983, 1984, 1985, 1986, 1987, 1989, 1991, 1992, 1994
    Genetics Computer Group, Inc. All rights reserved.

    Published research assisted by this software should always cite:

    Program Manual for the Wisconsin Package, Version 8, August 1994,
    Genetics Computer Group, 575 Science Drive, Madison, Wisconsin, USA 53711

    and also for the Extension programs:

    Program Manual for the EGCG Package, Peter Rice,
    The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, England.

    Additional code by Peter Rice, The Sanger Centre, Hinxton, England
    and other members of the EGCG team.

    Help is available with the commands % egenhelp and % egenmanual
The NewFeatures program is used by the EMBL Data Library to edit feature tables and to update sequence database entries. It is also a useful tool for maintaining your own version of a feature table, or for exploring large feature tables.
NewFeatures is an interactive editor for entering and modifying the feature table and for minor editing of the sequence itself.
The Fragment Assembly programs provide additional functions for the GCG Fragment Assembly system.
GelStatus reads a GCG Fragment Assembly database, and produces a summary report of the quality of each contig.
GelPicture reads a contig from the Fragment Assembly database and displays a diagram of the gel alignments and a printout of the aligned gel sequences and consensus. GelPicture has been modified to include the sequence direction in both sections of the output, and to mark with '=======' any consensus sequence that is correct (agrees with every fragment) and has been sequenced in both directions.
GelFigure produces a graphical report of the status of a contig in a fragment assembly project.
GelAnalyze reads a GelStatus report from a shotgun project, and produces project statistics by the method of Lander and Waterman.
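As an illustration of the kind of project statistics involved, the sketch below evaluates the standard Lander and Waterman expectations for a shotgun project. It is not GelAnalyze's code: the function name, argument names, and the choice of reported quantities are assumptions made for this example, and the covered fraction uses the usual Poisson approximation.

```python
import math

def lander_waterman(genome_len, read_len, n_reads, min_overlap):
    """Expected shotgun statistics under the Lander-Waterman model.

    All argument names are hypothetical; lengths are in bases.
    """
    c = n_reads * read_len / genome_len        # redundancy of coverage
    sigma = 1.0 - min_overlap / read_len       # 1 - (overlap needed / read length)
    return {
        "coverage": c,
        # expected number of contigs (islands): N * exp(-c * sigma)
        "expected_contigs": n_reads * math.exp(-c * sigma),
        # Poisson approximation of the fraction of the target covered
        "fraction_covered": 1.0 - math.exp(-c),
    }

if __name__ == "__main__":
    for key, value in lander_waterman(100_000, 500, 1000, 50).items():
        print(f"{key}: {value:.3f}")
```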
A new program makes the task of creating your own restriction enzyme file for GCG programs easy.
MapSelect selects restriction enzymes by name or by their ability to cut a given sequence, and writes them to a new enzyme file for use in other programs.
EFingerPrint identifies the products of T1 ribonuclease digestion. EFingerPrint is a version of GCG's old FingerPrint with command line control.
Command line control is added to GCG's program Diverge.
EDiverge is a version of Diverge with command line control. Diverge measures the percent divergence of two protein coding sequences using the method of Perler and Efstratiadis.
EOverlap compares two sets of DNA sequences to each other in both orientations using a WordSearch style comparison. EOverlap is an extended version of GCG's Overlap for use in database nonredundancy checks, together with the FilterOverlap program.
BigEOverlap compares two sets of DNA sequences to each other in both orientations using a WordSearch style comparison. EOverlap is an extended version of GCG's Overlap for use in database nonredundancy checks, together with the FilterOverlap program. BigEOverlap has a very high limit on total sequence length for genome scale sequence analysis, but is too large for general use on most systems.
FilterOverlap reads the output file from EOverlap and filters out only those overlaps which meet specified values when the alignments are built. Output from GCG's Overlap program may also be used, but only if generated from a self comparison of a single database.
Output from (T)Fasta can be screened for significance. TWordSearch searches can compare a protein sequence to the nucleotide databases. The EQuickSearch program can run far faster with far smaller memory requirements, and output can be screened for the best hits using QuickMatch.
FastaCheck selects significant alignments from a (T)Fasta output file.
TWordSearch identifies DNA sequences similar to a protein query sequence using a six frame translation of the database and a Wilbur and Lipman-style search. The output is a list of significant diagonals whose alignments can be displayed with TSegments.
TSegments aligns and displays the segments of similarity found by TWordSearch.
EQuickSearch rapidly identifies places where query sequence(s) occur in a nucleotide sequence database. The output is a file of overlaps that can be displayed with QuickMatch or EQuickShow. You can make up your own sequence database or use GenEMBL, which consists of GenBank and those sequences in EMBL that are not represented in GenBank (or the other way around at some sites).
QuickMatch displays the overlaps found by EQuickSearch with either optimal alignments or dot-plots. The alignments can be selected by quality to discard poor matches. The dot-plots can be reviewed rapidly with a graphic screen.
EQuickIndex builds hash tables from sequence(s) in data libraries, and stores them as map sections. These tables make up the database that is searched by EQuickSearch.
NewFetch copies GCG sequences, fragments, or data files from the GCG database into your directory, or displays them on your terminal screen. You can specify a sequence range.
StsSearch looks for primer pairs in a set of sequences.
RFindPatterns identifies sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal. The output is a series of files called r1.rfind, r2.rfind, and so on, each containing a single extracted sequence. These can be fed through Pileup or manipulated in other ways.
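To show what an ambiguously defined pattern means in practice, the following sketch expands the standard IUPAC nucleotide codes into a regular expression and reports every 1-based match position. It is only an illustration: the function name is hypothetical, mismatches are not handled, and nothing here reflects how RFindPatterns is actually implemented.

```python
import re

# Standard IUPAC nucleotide ambiguity codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def find_pattern(pattern, sequence):
    """Return 1-based start positions of every (possibly overlapping) match."""
    regex = "".join(IUPAC[ch] for ch in pattern.upper())
    # A lookahead is used so that overlapping occurrences are all reported.
    return [m.start() + 1 for m in re.finditer("(?=" + regex + ")", sequence.upper())]

if __name__ == "__main__":
    print(find_pattern("GAATTC", "AAGAATTCAA"))
    print(find_pattern("YRYR", "CACACA"))
```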
PatternPlot produces a graphical representation of the results of GCG's FindPatterns program.
The first four programs in this section allow you to display multiple sequence alignments. The last three programs are modified versions of the GCG profile programs, supporting automatic translation of nucleotide sequence database entries, and modifications to allow searches of far larger databases.
PrettyPlot displays multiple sequence alignments and calculates a consensus sequence. It does not create the alignment, it simply displays it.
PrettyBox displays multiple sequence alignments as shaded boxes in PostScript format (i.e., the output file must be printed and/or displayed on a PostScript-compatible device). PrettyBox will optionally calculate a consensus sequence. The program does not create the alignment; it simply displays it.
PlotAlign takes a GCG format sequence alignment, and plots the mean and range of values for any amino acid parameter you supply. The "panel file" contains a list of parameters to be plotted. The main database of parameters is taken from Nakai et al. (1988), and the default panel file uses selected parameters from the 13 discrete clusters in that paper. This program is experimental. Any suggestions would be most welcome.
PepAllWindow plots measures of protein hydrophobicity according to the method of Kyte and Doolittle.
EPlotSimilarity plots the running average of the similarity among the sequences in a multiple sequence alignment.
PolyDot compares two sets of sequences, draws a dotplot for each pair of sequences, and reports all identical matches of a specified length.
TProfileSearch uses a profile (representing a group of aligned protein sequences) as a probe to search the nucleotide database for new sequences with possible protein products having some similarity to the group. The profile is created with the program ProfileMake.
TProfileSegments makes optimal alignments showing the segments of similarity found by TProfileSearch.
TProfileGap makes an optimal alignment between a profile and a sequence.
ProfilePlot produces a graphical report of the frequency of patterns in a protein or nucleotide sequence.
SortConsensus identifies the strong consensus regions of an alignment in an MSF file and reports them in sorted order.
ELineUp is a screen editor for editing multiple sequence alignments. You can edit up to 500 sequences simultaneously. New sequences can be typed in by hand or added from existing sequence files. A consensus sequence identifies places where the sequences are in conflict.
MultAlign does a simultaneous alignment for two or more DNA or protein sequences. It introduces a certain number of gaps into either pairwise aligned sequences or groups of sequences to find a minimal global distance. The user can influence the result by defining the order in which the sequences will be aligned. The program is based on a generalization of the algorithm of Waterman, Smith and Beyer by Krueger and Osterburg.
EClustAlW calculates a multiple alignment of nucleic acid or protein sequences according to the method of Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). This is part of the original ClustalW distribution, modified for inclusion in EGCG.
ClusTree computes a phylogenetic tree according to the Neighbor-Joining Method of Saitou and Nei (1987). This is part of the original ClustalW distribution, modified for inclusion in EGCG. The tree will be displayed graphically.
ProfAlign takes two existing alignments (or single sequences) and aligns them with each other. The result is one larger alignment. This is part of the original ClustalW distribution, modified for inclusion in EGCG.
These programs measure the pairwise homologies between a set of sequences, and provide a conversion to the format required by the Phylip program.
Homologies makes a table of the pair-wise distances within a group of aligned sequences.
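A table of pairwise distances can be pictured with the sketch below. The distance used here (fraction of differing positions, skipping any column where either sequence has a gap) is an assumption chosen for illustration; the measure that Homologies actually reports is not stated above, and the function name is hypothetical.

```python
GAPS = ".-"  # gap characters assumed for this sketch

def pairwise_distances(aligned):
    """aligned: dict of name -> aligned sequence (all the same length).

    Returns {(name_a, name_b): fraction of differing non-gap positions}.
    """
    names = list(aligned)
    table = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pairs = [(x, y) for x, y in zip(aligned[a].upper(), aligned[b].upper())
                     if x not in GAPS and y not in GAPS]
            diffs = sum(x != y for x, y in pairs)
            table[(a, b)] = diffs / len(pairs) if pairs else 0.0
    return table

if __name__ == "__main__":
    demo = {"s1": "ACGT", "s2": "ACGA", "s3": "A.GT"}
    for pair, dist in pairwise_distances(demo).items():
        print(pair, round(dist, 3))
```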
ToPhylip writes GCG sequences into a single file in Phylip format.
Phylip2Tree displays trees computed with one of the PHYLIP programs in GCG style.
A search for inverted repeats in DNA sequences is now provided. Some GCG programs in this section now have command line control added.
Palindrome searches for perfect inverted repeats in a nucleic acid sequence.
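A perfect inverted repeat is a stretch of sequence followed, possibly after a short loop, by its exact reverse complement. The sketch below is a naive search for such pairs; the function and parameter names (stem length, maximum gap) are assumptions for this example and are not Palindrome's own options.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def inverted_repeats(sequence, stem=6, max_gap=10):
    """Return (left_start, right_start, stem_sequence) for each perfect
    inverted repeat; positions are 1-based."""
    seq = sequence.upper()
    hits = []
    for i in range(len(seq) - 2 * stem + 1):
        left = seq[i:i + stem]
        target = revcomp(left)
        for gap in range(max_gap + 1):
            j = i + stem + gap
            if j + stem > len(seq):
                break
            if seq[j:j + stem] == target:
                hits.append((i + 1, j + 1, left))
    return hits

if __name__ == "__main__":
    print(inverted_repeats("GGGAATTCCC", stem=5, max_gap=0))
```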
EWindow is a version of Window with command line control. Window makes a table of the frequencies of different sequence patterns within a window as it is moved along a sequence. A pattern is any short sequence like GC or R or ATG. You can plot the output with the program StatPlot.
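The idea of tabulating pattern frequencies in a moving window can be sketched as follows. This is a minimal illustration that handles literal patterns only (an ambiguity symbol such as R would need to be expanded first); the function name, window size, and shift are assumptions, not EWindow's defaults.

```python
def window_frequencies(sequence, patterns, window=100, shift=50):
    """Count each pattern in successive windows.

    Returns a list of (1-based window start, {pattern: count}) rows.
    """
    seq = sequence.upper()
    rows = []
    for start in range(0, max(len(seq) - window, 0) + 1, shift):
        chunk = seq[start:start + window]
        counts = {
            p: sum(chunk.startswith(p, i) for i in range(len(chunk) - len(p) + 1))
            for p in patterns
        }
        rows.append((start + 1, counts))
    return rows

if __name__ == "__main__":
    for position, counts in window_frequencies("GCGCATAT", ["GC", "AT"], window=4, shift=4):
        print(position, counts)
```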
EStatPlot is a version of StatPlot with command line control. StatPlot plots a set of parallel curves from a table of numbers like the table written by the Window program. The statistics in each column of the table are associated with a position in the analyzed sequence.
ECodonFrequency tabulates codon usage from sequences and/or existing codon usage tables. The output file is correctly formatted for input to the CodFish, CodonPreference, Correspond, and Frames programs.
ECodonFrequency is a modified version of GCG version 7's CodonFrequency with command line control added.
EConsensus calculates a consensus sequence for a set of pre-aligned short nucleic acid sequences by tabulating the percent of G, A, T, and C for each position in the set. GCG's FitConsensus uses the EConsensus output table as a probe to search for the best examples of the derived consensus in other nucleotide sequences.
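The tabulation described above can be sketched in a few lines. The function names are hypothetical, the input sequences are assumed to be pre-aligned and of equal length, and ties are broken in G, A, T, C order purely for convenience.

```python
def consensus_table(sequences):
    """Percent of G, A, T, and C at each position of pre-aligned sequences."""
    table = []
    for pos in range(len(sequences[0])):
        column = [s[pos].upper() for s in sequences]
        table.append({b: 100.0 * column.count(b) / len(column) for b in "GATC"})
    return table

def consensus(table):
    """Most frequent base at each position (ties resolved in G, A, T, C order)."""
    return "".join(max("GATC", key=row.get) for row in table)

if __name__ == "__main__":
    sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
    table = consensus_table(sites)
    for i, row in enumerate(table, start=1):
        print(i, row)
    print(consensus(table))
```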
ECorrespond looks for similar patterns of codon usage by comparing codon frequency tables.
ERepeat finds direct repeats in sequences. You must set the size, stringency, and range within which the repeat must occur; all the repeats of that size or greater are displayed as short alignments. ERepeat is a version of GCG's old Repeat with command line control.
ETerminator searches for prokaryotic factor-independent RNA polymerase terminators according to the method of Brendel and Trifonov. ETerminator is a version of GCG's old Terminator with command line control.
Melting temperature and GC content of a sequence can be analyzed and displayed on a plot. The variation in di-nucleotide composition along a sequence can be plotted.
Melt calculates the melting temperature (Tm) and the percent G+C of a nucleic acid sequence using the algorithms described by Breslauer et al. Proc. Natl. Acad. Sci. USA 83, 3746-3750 and Baldino et al. Methods in Enzymol. 168, 761-777.
MeltPlot plots the melting curve for a nucleic acid sequence using the algorithms described by Breslauer et al. Proc. Natl. Acad. Sci. USA 83, 3746-3750 and Baldino et al. Methods in Enzymol. 168, 761-777.
BasePairPlot plots the percentage occurrence and the observed over expected frequency of a di-nucleotide pair relative to their position in a nucleic acid sequence.
CpGPlot plots the frequency of occurrence of CpG di-nucleotides and C and G percentage relative to their position in a sequence by the method described by Gardiner-Garden (1987).
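The two quantities usually plotted in this kind of analysis are the percent C+G and the observed over expected CpG ratio, (number of CpG x window length) / (number of C x number of G), computed in a moving window. The sketch below assumes that formula and uses hypothetical names and window settings; it is not CpGPlot's code.

```python
def cpg_profile(sequence, window=100, shift=1):
    """Return (1-based window start, percent C+G, observed/expected CpG) rows."""
    seq = sequence.upper()
    rows = []
    for start in range(0, len(seq) - window + 1, shift):
        w = seq[start:start + window]
        c, g = w.count("C"), w.count("G")
        cpg = w.count("CG")  # "CG" cannot overlap itself, so count() is exact
        obs_exp = cpg * len(w) / (c * g) if c and g else 0.0
        rows.append((start + 1, 100.0 * (c + g) / len(w), obs_exp))
    return rows

if __name__ == "__main__":
    for row in cpg_profile("CGCGCGCGAT", window=10):
        print(row)
```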
CpGReport looks for potential CpG islands in a nucleotide sequence.
Chaos makes a CHAOS game representation of a nucleic acid sequence using the method of Jeffrey (1990) Nucleic Acids Research 18: 2163-2170.
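In a chaos game representation, each base pulls the current point halfway toward the corner of the unit square assigned to that base. The sketch below assumes the corner layout A (0,0), C (0,1), G (1,1), T (1,0) and a starting point at the centre; the names are hypothetical and the layout used by Chaos itself may differ.

```python
# Assumed corner assignment for this sketch.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def chaos_points(sequence):
    """Return the list of (x, y) points, one per unambiguous base."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in sequence.upper():
        if base not in CORNERS:
            continue  # skip ambiguity codes and gaps
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2  # move halfway toward the corner
        points.append((x, y))
    return points

if __name__ == "__main__":
    print(chaos_points("AGCT"))
```

The resulting points can be passed to any plotting tool to draw the picture.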
CodFish calculates a set of codon usage statistics for a sequence using a specified codon usage table.
WordCount counts the commonest words in a sequence and reports them in order of frequency and sequence.
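Counting words and reporting them by frequency, then alphabetically, can be sketched as below. The function name and the default word size are assumptions for this example.

```python
from collections import Counter

def word_count(sequence, word_size=6):
    """Return (word, count) pairs, most frequent first, ties in sequence order."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + word_size] for i in range(len(seq) - word_size + 1))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

if __name__ == "__main__":
    for word, n in word_count("ACACACGT", word_size=2):
        print(word, n)
```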
WordUp is based on a first order Markov analysis and detects statistically significant oligonucleotide patterns from six to nine nucleotides long in the sequences under investigation. WordUp dynamically detects significant signals of any length in the same analysis.
The program Poland simulates transition curves of double-stranded nucleic acids (DNA as well as RNA). Calculation is based on Poland, D. (1974) 'Recursion Relation Generation of Probability Profiles for Specific-Sequence Macromolecules with Long-Range Correlations'.
GeneTrans extracts and/or translates coding regions as defined in the feature table of sequences stored in the EMBL or GenBank databases.
GapFrame moves all gaps in a DNA sequence reading frame to be at codon boundaries.
Prima selects oligonucleotide primers for a template DNA sequence. The primers may be useful for the polymerase chain reaction (PCR) or for DNA sequencing. You can allow Prima to choose primers from the whole template or limit the choices to a particular set of primers listed in a file.
QuickTandem scans for potential tandem repeats in a nucleotide sequence.
Tandem looks for multiple tandem repeats of a given size in a nucleotide sequence.
Inverted looks for imperfect inverted repeats in a nucleotide sequence.
EComposition determines the composition of sequence(s). For nucleotide sequence(s), EComposition also determines dinucleotide and trinucleotide content.
The first four programs provide graphical analyses of protein sequences. The first three provide different approaches to finding coiled-coil regions and amphipathic helices. PepWindow provides a general hydropathy plot. PepStats calculates physical properties of proteins. The last three programs look for specific sequence motifs: signal peptide cleavage sites, potential epitopes (antigenic surface regions), and helix turn helix DNA binding domains.
PepCoil identifies potential coiled-coil regions of protein sequences using the algorithm of Lupas A, van Dyke M & Stock J (1991).
PepNet is a program to view the two-dimensional helical representation of protein sequences.
PepWheel is a program to view the periodic distribution of amino acid residues in protein sequences.
PepWindow plots measures of protein hydrophobicity according to the method of Kyte and Doolittle.
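A Kyte and Doolittle hydropathy profile is the average of per-residue hydropathy values in a window that slides along the protein. The sketch below uses the published Kyte-Doolittle scale; the function name and the window size of 7 are assumptions, not PepWindow's defaults, and residues outside the 20 standard amino acids raise a KeyError.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy(sequence, window=7):
    """Return the mean hydropathy of each window along the sequence."""
    values = [KD[residue] for residue in sequence.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

if __name__ == "__main__":
    for i, score in enumerate(hydropathy("MKTLLILAVVAAALA"), start=1):
        print(i, round(score, 2))
```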
PepStats gives a short statistical summary on the composition of a protein sequence and gives the molecular weight and isoelectric point.
SigCleave uses the von Heijne method to locate signal sequences, and to identify the cleavage site. The method is 95% accurate in resolving signal sequences from non-signal sequences with a cutoff score of 3.5, and 75-80% accurate in identifying the cleavage site. The program reports all hits above a minimum value.
Antigenic looks for potential antigenic regions using the method of Kolaskar.
HelixTurnHelix uses the method of Dodd and Egan to determine the significance of possible helix-turn-helix matches in protein sequences.
DoDayhoffStat compares the composition of a protein sequence against the Dayhoff statistic for protein composition. The closer the Dayhoff Stat value is to 1.0 the better the composition of the protein sequence fits with the theoretical value.
PepCount reports the number of occurrences of residues at a given position in protein sequences.
EPeptideSort shows the peptide fragments from a digest of an amino acid sequence. It sorts the peptides by weight, position, and HPLC retention at pH 2.1, and shows the composition of each peptide. It also prints a summary of the composition of the whole protein. EPeptideSort is a modified version of GCG's PeptideSort which has additional options to control output of peptides sorted by weight, retention and position.
One GCG program has command line control added. The second program translates aligned nucleic acid sequences into aligned protein sequences.
ETranslate is a version of GCG's old Translate program with command line control added.
EExtractPeptide is a version of ExtractPeptide with command line control. ExtractPeptide writes a peptide sequence from one or more of the translation frames displayed in the output from Map. Translate supersedes ExtractPeptide for most applications.
AllTrans translates a set of aligned nucleotide sequences into protein.
MyTrans is a simple EGCG application that translates part of a nucleotide sequence into protein.
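Translating part of a nucleotide sequence amounts to selecting a range and reading it codon by codon. The sketch below uses the standard genetic code; the function name and the begin/end arguments are hypothetical, and codons containing ambiguity symbols are written as X, which is a choice made for this example only.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(sequence, begin=1, end=None):
    """Translate positions begin..end (1-based, inclusive) in frame 1 of the range."""
    seq = sequence.upper().replace("U", "T")
    region = seq[begin - 1:end]
    codons = (region[i:i + 3] for i in range(0, len(region) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

if __name__ == "__main__":
    print(translate("CCATGGCCTAA", begin=3))
```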
A GCG program has command line control added.
EAssemble is a version of GCG's old Assemble program with command line control added.
ECompTable creates a scoring matrix using equivalences defined in a simplification scheme such as the one used for Simplify. ECompTable is a version of GCG's CompTable with command line control added.
EReverse reverses and/or complements a sequence. EReverse is a version of GCG's Reverse with command line control.
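Reversing and complementing are independent operations, which is why either or both can be requested. The sketch below shows this with the IUPAC nucleotide codes; the function name and its two switches are assumptions for illustration and do not correspond to EReverse's command line qualifiers.

```python
# IUPAC nucleotide codes and their complements (upper and lower case).
_CODES = "ACGTRYKMSWBDHVN"
_COMPS = "TGCAYRMKSWVHDBN"
_TABLE = str.maketrans(_CODES + _CODES.lower(), _COMPS + _COMPS.lower())

def reverse_complement(sequence, reverse=True, complement=True):
    """Reverse and/or complement a nucleotide sequence."""
    result = sequence.translate(_TABLE) if complement else sequence
    return result[::-1] if reverse else result

if __name__ == "__main__":
    print(reverse_complement("AACG"))
    print(reverse_complement("AACG", complement=False))
    print(reverse_complement("AACG", reverse=False))
```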
PepCorrupt randomly introduces small numbers of substitutions, insertions, and deletions into protein sequence(s). Note that substitutions are Residue to other Residue, and that back mutations to the original are allowed!
EPublish is a version of Publish that allows command line control. No other Display programs are released, but there has been some interest in a modified version of Red to provide alternative forms of documentation.
EPublish is a version of Publish with command line control. Publish arranges sequences for publication. It creates a text file that you can modify to your own needs with a text editor.
ELibGen creates formatted versions of EGCG documentation for the on-line help facilities egenhelp and egenmanual.
RedToHtml is a modification of GCG's program Red to convert documentation source files into HTML documents.
The first program converts any sequence to plain text. The next two programs provide a way to generate the original database entry format from a GenBank/EMBL entry in a GCG database. The ToPirAll program provides a way to extract a set of subsequences in PIR format. The last program produces input files for the Primer program.
CReformat rewrites sequence file(s), scoring matrix file(s), or enzyme data file(s) so that they can be read by GCG programs. For sequence files, a base range can be selected or excluded.
ToText converts a sequence into plain text format.
ToGenBank is a simple utility program that reads a GenBank entry from a GCG sequence database, and writes it out in GenBank flat file format.
ToEmbl is a simple utility program that reads an EMBL entry from a GCG sequence database, and writes it out in EMBL flat file format.
ToPirAll is a utility program that converts a list of sequences, or ranges of sequences, into PIR format for use in other non-GCG programs, especially CLUSTALV.
ToPrimer formats a GCG sequence file into a PRIMER compatible file.
ToRelate creates an input file for the NBRF RELATE program.
EFromFastA reformats one or more sequences from FastA format into individual files in GCG format.
EFromStaden changes a sequence from Staden format into GCG format. If the file contains a nucleotide sequence, the ambiguity codes are converted as shown in Appendix III of the GCG Program Manual. EFromStaden is a version of GCG's old FromStaden with command line control.
EToStaden writes a GCG sequence into a file in Staden format. If the file contains a nucleotide sequence, the ambiguity codes are converted as shown in Appendix III of the GCG Program Manual. EToStaden is a version of GCG's ToStaden with command line control.
EGetSeq reads a sequence from a computer that is acting as a terminal and writes it into a new sequence file in GCG format on the computer running the Wisconsin Package. EGetSeq is a version of GCG's GetSeq with command line control.
These utilities act on text files.
NoReturn removes trailing carriage return or line feed control characters from text files.
CppJL converts EGCG VMS FORTRAN source code to UNIX FORTRAN source code.
CRtoLF converts carriage return characters to linefeed characters in text files.
AddComment rewrites a text file with every line commented out.
ECrypt writes an encrypted version of a file using a key word that you choose. Run ECrypt a second time with the same keyword to restore the encrypted output file to its original state.
ECodeSearch searches through FORTRAN source files for references to mnemonics such as procedure names. You must provide the mnemonics in a separate file. The default parameters show you some suitable inputs.
ECLSort sorts the output of ECodeSearch on the first argument of each procedure. The heading is lost. EGCG will use ECLSort to make up the command line dictionary in the Procedure Library chapter of the future EGCG System Support Manual.
These programs do not fit the other categories.
Test is provided as a skeleton for programmers to test ideas.
CTest is provided as a skeleton for programmers to test ideas.
KeyFind reports the characters passed to the program by keys on the keyboard.
These programs are used at several sites to build additional databases in GCG format.
DbStats counts the number of entries and the total lengths of sequence entries in a GCG formatted database.
GbOnly creates a list of GenBank entries that have accession numbers not found in the latest release of the EMBL database.
PirOnly and related programs select entries from PIR that are not included in the latest release of SwissProt.
CheckLen calculates five checksums and the sequence length for each entry in a database, and writes them to a file for use in a quick cross check for identical sequences.
CheckLenComp compares two sorted CheckLen output files, and produces a list of entries from the first file which are not found in the second.
KabatToGCG creates GCG data libraries from Kabat distribution files.
SeqDbToGcg converts the SeqDb database distribution files into a database in GCG format.
ConvertEnz reads lines extracted from the ENZYME database, and converts them to lower case.
Ig2Nbrf is a utility program that converts an IG formatted file into an NBRF formatted database which PirToGcg can index.
EMBLToGCGSC is the Sanger Centre's modification of GCG's EMBLtoGCG which reformats EMBL and SWISS-PROT flat sequence files into GCG data libraries.
EGenHelp displays an index of all the programs in the EGCG Program Manual. To view the topics for an individual program on your screen, type in the program name. To select a topic for a program, type in the topic name (including any underscores). Program documentation always includes a picture of the screen for a typical session with the program.
EGenManual displays an index of all the sections of the EGCG Program Manual. To view the programs in a section, type in the section name (including any underscores). To select a program, type in the program name. To select a topic for a program, type in the topic name (including any underscores).
These commands are defined by EGCG for general use. Most are similar to GCG commands.
These commands are defined by EGCGSUPPORT for system maintenance. Most are similar to GCGSUPPORT commands.
builds the complete help libraries for egenhelp and egenmanual from the documentation source files.
Printed: April 23, 1996 16:26 (1162)