EFromStaden changes a sequence from Staden format into GCG format. If the file contains a nucleotide sequence, the ambiguity codes are converted as shown in Appendix III of the GCG Program Manual. EFromStaden is a version of GCG's old FromStaden with command line control.
Any sequence created with the Staden database programs can be converted with EFromStaden into a format suitable for use with Wisconsin Package(TM) programs. All of the compatible ambiguity codes are converted. If more than one contig is present in the Staden file, then the sequences are concatenated into a single sequence. The contig markers in the Staden sequences are retained in the heading of the GCG output file. In the example below, the 11th to the 29th bases contain all of the Staden ambiguity codes. You can see how they are converted in Appendix III or in the example below. If the sequence is a peptide sequence then no conversion is made.
The command % seqformat Staden sets a global switch that makes Wisconsin Package programs accept sequences in Staden format without running EFromStaden (see "Using Global Switches" in Chapter 3, Basic Concepts: Using Programs of the User's Guide). Use the EFromStaden program only to convert sequences that you wish to keep in GCG format.
This GCG program was modified by Jaakko Hattula (Tampere University of Technology, Finland) and Peter Rice (E-mail: email@example.com Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).
All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (firstname.lastname@example.org).
Here is a session using EFromStaden to convert the Staden file origin.sdn into a GCG-format file:
% efromstaden
EFROMSTADEN of what file ? origin.sdn
What should I call the output file (* origin.seq *) ?
%
Here is the complete output of the file origin.seq:
EFROMSTADEN of: origin.sdn check: 5867 from: 1 to: 200

<---ORIGIN.001----->

origin.seq Length: 200 March 8, 1991 14:50 Type: N Check: 5867 ..

  1 ATGGATCCTA ctagctagct agRYMKWSXC TGAGGAGGAG ATTCACTTGT
 51 TTAGAGGCTG GGAGTGGTGG CTCACGCCTG TAATCCCAGA ATTTTGGGAG
101 GCCAAGGCAG GCAGATCACC TGAGGTCAAG AGTTCAAGAC CAACCTGGCC
151 AACATGGTGA AATCCCATCT CTACAAAAAT ACAAAAATTA GACAGGCATG
The following programs convert sequences between other formats and GCG format: FromEMBL, FromGenBank, FromIG, FromPIR, FromStaden, FromFastA, ToIG, ToPIR, ToStaden and ToFastA.
DataSet creates a GCG data library from any set of sequences in GCG format. ToBLAST creates a database that can be searched by the BLAST program from any set of sequences in GCG format.
Staden nucleotide ambiguity codes are not all strictly comparable to IUB-IUPAC ambiguity codes (see Appendix III). If contigs are present in the Staden file, all of the contigs are concatenated into a single sequence. If this is not what you want, put the contigs into separate files with a text editor and run EFromStaden on each of them individually. The contig comments cannot be more than 130 characters long.
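The concatenation described above can be illustrated with a minimal Python sketch. It is not the EFromStaden source. The marker pattern is an assumption based on the single marker shown in this document (<---ORIGIN.001----->), the second marker in the sample text is hypothetical, and the function name is invented for illustration.

```python
import re

# Assumed marker shape, inferred from the one example "<---ORIGIN.001----->".
CONTIG_MARKER = re.compile(r"<-[^<>]*->")


def concatenate_contigs(staden_text):
    """Return (markers, sequence) for a Staden-style text.

    markers  : every contig marker found, in file order
    sequence : the sequence characters of all contigs joined into one string
    """
    markers = CONTIG_MARKER.findall(staden_text)
    # Remove the markers, then drop line breaks and any other whitespace.
    body = CONTIG_MARKER.sub("", staden_text)
    sequence = "".join(body.split())
    return markers, sequence


# Hypothetical two-contig input; only ORIGIN.001 appears in this document.
sample = "<---ORIGIN.001----->ATGGATCCTA\nGGAG<---ORIGIN.002----->TTAGAG\n"
markers, sequence = concatenate_contigs(sample)
print(markers)   # the markers that would be kept for the heading
print(sequence)  # the single concatenated sequence
```

The markers are returned separately because the documentation states that they are retained in the heading of the output file rather than in the sequence itself.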
Here is the input file used for the example above:
<---ORIGIN.001----->ATGGATCCTA1234DVBHKLMNRY5678-CTGAGGAGGAG
ATTCACTTGTTTAGAGGCTGGGAGTGGTGGCTCACGCCTGTAATCCCAGAATTTTGGGAG
GCCAAGGCAGGCAGATCACCTGAGGTCAAGAGTTCAAGACCAACCTGGCCAACATGGTGA
AATCCCATCTCTACAAAAATACAAAAATTAGACAGGCATG
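Comparing bases 11 to 29 of this input with the same positions in the output file shown earlier suggests a one-to-one mapping of Staden codes to GCG symbols. The Python sketch below records that mapping. The table is inferred position by position from this one example, not copied from Appendix III, so treat it as an assumption and check it against the appendix before relying on it. The function name is invented for illustration, and the sketch applies to nucleotide sequences only, since peptide sequences are not converted.

```python
# Mapping inferred from the example input and output in this document
# (assumption: Appendix III remains the authoritative table).
STADEN_TO_GCG = {
    "1": "c", "2": "t", "3": "a", "4": "g",
    "D": "c", "V": "t", "B": "a", "H": "g",
    "K": "c", "L": "t", "M": "a", "N": "g",
    "R": "R", "Y": "Y",
    "5": "M", "6": "K", "7": "W", "8": "S",
    "-": "X",
}


def convert_staden_codes(sequence):
    """Replace each Staden ambiguity code with its GCG symbol.

    Characters that are not in the table (for example A, C, G and T)
    are passed through unchanged; that pass-through is an assumption.
    """
    return "".join(STADEN_TO_GCG.get(ch, ch) for ch in sequence)


# First sequence line of the example input, without its contig marker.
staden = "ATGGATCCTA1234DVBHKLMNRY5678-CTGAGGAGGAG"
print(convert_staden_codes(staden))
# prints ATGGATCCTActagctagctagRYMKWSXCTGAGGAGGAG
```

The printed string matches the first 40 bases of origin.seq once the spaces between the ten-base blocks are removed.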
When EFromStaden writes GCG sequence files, it assigns the sequence type based on the composition of the sequence characters. This method is not foolproof, so to ensure that the output files are written with the correct sequence type, use the -PROtein or -NUCleotide command-line option when running EFromStaden.
If EFromStaden is run interactively, you can watch the program monitor to see if the sequences are assigned the correct type. As each new file is written, its name and the number of bases (bp) or amino acids (aa) appear on the screen. If the wrong abbreviation appears (for example, bp appears for a protein sequence), the sequence file was assigned the wrong type. The sequence type also appears in the sequence file. Look on the last line of the text heading just above the sequence itself for Type: N or Type: P.
If the sequence type was incorrectly assigned, turn to Appendix VI for information on how to change or set the type of a sequence.
All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
Minimal Syntax: % efromstaden [-
Prompted Parameters: [-
Local Data Files: None
Optional Parameters: - - -
The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the User's Guide.
-PROtein and -NUCleotide set the program to expect either protein or nucleic acid sequences. Normally, EFromStaden determines whether an input sequence is protein or nucleic acid by looking at its composition. If the first 300 alphanumeric characters in a sequence are composed entirely of Staden nucleotide codes (see Appendix III), it is reformatted as a nucleic acid sequence in GCG format; otherwise it is reformatted as a protein sequence. Using these command-line options, you can insist that your sequences are proteins (-PROtein) or nucleic acids (-NUCleotide).
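The composition test described above can be sketched in Python as follows. This is an illustration of the rule as stated, not the program's own code. The set of Staden nucleotide codes is an assumption (the four bases plus the alphanumeric codes seen in the example input; Appendix III is authoritative), upper-casing the input is also an assumption, and the function name is invented.

```python
# Assumed code set: A, C, G, T plus the alphanumeric Staden codes that
# appear in the example input. The "-" code is not alphanumeric, so the
# rule as stated never looks at it.
STADEN_NUCLEOTIDE_CODES = set("ACGT" "1234" "DVBHKLMNRY" "5678")


def guess_sequence_type(sequence, window=300):
    """Return "N" if the first `window` alphanumeric characters are all
    Staden nucleotide codes, otherwise "P".

    An empty sample is reported as "N" because all() of nothing is True.
    """
    sample = [ch.upper() for ch in sequence if ch.isalnum()][:window]
    if all(ch in STADEN_NUCLEOTIDE_CODES for ch in sample):
        return "N"
    return "P"


print(guess_sequence_type("ATGGATCCTA1234DVBHKLMNRY5678-CTGAGGAGGAG"))  # N
print(guess_sequence_type("MKTAYIAKQRQISFVKSHFSRQ"))                    # P
print(guess_sequence_type("MKVLHT"))                                    # N
```

The last call shows why the method can fail: a short peptide made up only of letters that are also Staden nucleotide codes is reported as a nucleic acid. That is the situation the -PROtein option is meant to override.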
This program normally monitors its progress on your screen. However, when you use the -Default option to suppress all program interaction, you also suppress the monitor. You can turn it back on with this option. If your program is running in batch, the monitor will appear in the log file. If the monitor is slowing the program down, suppress it with -NOMONitor.
Printed: April 22, 1996 15:52 (1162)