Sequence Analysis FAQ

Sequence Analysis Software

 

General purpose UNIX software

Other UNIX software

Free Macintosh Software

Commercial Macintosh & PC Software

(which is available through the Bioinformatics Resource)

WWW versions of Programs


Available Databases

The following databases are available locally

For more information on the GCG versions of these databases, click here.

Here are some Databases on the Net.


Database Searching

There are several different methods you can use to search the databases

GCG PROGRAM | WHAT IT DOES | WHEN TO USE IT
lookup | Searches the local databases for keywords, author names, sequence names | When you want to obtain a specific sequence
fetch | Retrieves a sequence from the database | When you have an accession or locus number
fasta | Searches the local databases for similar sequences | When you have a DNA or protein sequence
blast | Searches the local or the remote database at NCBI for similar sequences | When you have a DNA or protein sequence
motif | Searches for protein motifs | When you want to know the function of a protein
findpattern | Searches for sequence patterns or searches the TFD | When you want to look for a specific sequence motif in your sequence
profilesearch | Searches the databases for a pattern created by profilemake | When you want to search for sequences that may be similar to your sequence
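To make the idea of a sequence pattern search more concrete, here is a minimal Python sketch. It is not how motif or findpattern are implemented. It assumes a simplified subset of the PROSITE pattern syntax (elements separated by "-", "x" for any residue, "[..]" for allowed residues, "{..}" for forbidden residues, and an optional repeat count in parentheses). The function names and the test sequence are made up for illustration.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as x(2) or x(2,4).
        match = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = match.groups()
        if core == "x":
            regex = "."                      # any residue
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"  # any residue except these
        else:
            regex = core                     # literal residue or [ABC] set
        if repeat:
            regex += "{" + repeat + "}"
        parts.append(regex)
    return "".join(parts)

def find_pattern(sequence, pattern):
    """Return (1-based start, matched text) for every match, including overlaps."""
    # The lookahead lets overlapping matches be reported.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]
```

For example, find_pattern("MKNGTAANPSQ", "N-{P}-[ST]-{P}") searches a made-up peptide for the N-glycosylation site pattern and returns [(3, 'NGTA')].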

In GCG, the strategy is

 

You can also search by using the NCBI web site or any of the other database web sites.

Search Program | What it does
ENTREZ | Combination of GenBank and the Molecular Biology subset of Medline
BLAST | Searches the sequence databases

The local Decypher search program

Advantages | Disadvantages
Very fast; uses the Smith-Waterman algorithm, which inserts gaps to optimize the sequence alignment; most thorough search | May not have the most recent databases

You can use the program DBWatcher on PMGM to schedule automatic searches of the database once a week or so.

If you want to send multiple sequences to a server for searching, visit the BLASTALL web page for instructions on how to do this and interpret the results.


Submitting Sequences to GenBank

There are several ways to submit a sequence to GenBank at the National Center for Biotechnology Information (NCBI). You do not need to submit the sequence to EMBL as well, because GenBank and EMBL exchange information.

For more information about submitting sequences to GenBank, visit this site.


Exchanging Sequences Between Programs

Each sequence analysis program and database uses its own data format. For example, a sequence file must be converted into GCG format before any GCG program can use it.

If you use Microsoft Word to edit your sequence file, save the file as "TEXT ONLY". Do not save it in Microsoft Word or Normal format, because the sequence analysis programs cannot read those formats.

The easiest way to move a sequence from a Macintosh or a PC into your PMGM account is to use an FTP program. On a Macintosh, you would use the program called Fetch.

For more detailed information about exchanging sequence information, click here.


Multiple Alignment

The best way to create a multiple alignment is to use the PileUp or ClustalW program in GCG. For more information on Clustal, including its applicability and limitations, see TIBS, October 1998, page 403.

Once the alignment has been created by the computer, it is possible to manually edit this alignment using the SeqLab editor or by using LineUp.

This multiple alignment file (msf file) can then be sent to phylogenetic analysis programs, or the alignment can be sent to programs which will let you create a nicer looking display of your multiple alignment.

For example, to display your alignment file nicely, you can:

1. Run Pretty on PMGM.

2. Run prettybox on PMGM (here is a detailed example).

3. Convert the GCG MSF file to Excel format.

4. Use the SeqWeb PileUp alignment program, which gives a nice colored output.

5. Use MacBox.

On PMGM, there is also the X-Clustal program, and a Macintosh version of ClustalX is available on MacArchives. These programs expect the sequences to be gathered together into one big "fasta" formatted file; you cannot input sequences one at a time. Other Mac and PC alignment programs are also available, including MacVector and DNAstar (Multalign).
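If your sequences are in separate places, a small script can gather them into one "fasta" formatted file. The following Python sketch is only one way to do it. The function names are made up, and it assumes you already have each sequence as a (name, sequence) pair of plain text.

```python
from pathlib import Path

def to_fasta_record(name, sequence, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    # Keep letters plus gap and stop symbols; this drops whitespace and digits.
    cleaned = "".join(ch for ch in sequence if ch.isalpha() or ch in "-*")
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"

def merge_to_fasta(named_sequences, out_path):
    """Write (name, sequence) pairs into a single multi-sequence FASTA file."""
    text = "".join(to_fasta_record(name, seq) for name, seq in named_sequences)
    Path(out_path).write_text(text)
    return text
```

For example, merge_to_fasta([("seq1", "ACGT"), ("seq2", "TTGA")], "all.fasta") writes both sequences to the hypothetical file all.fasta, one record after the other.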


Sequence Comparison

In GCG, there are two main programs for comparing two sequences:

GAP - This program does a "global" alignment, and tries to insert gaps to align one sequence completely on top of the other.

BestFit - Does a local alignment, and will find the region of the two sequences that has the best alignment.
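The difference between a global and a local alignment can be seen in a small dynamic programming sketch. This is not the GCG code. It is a minimal Python illustration of the Needleman-Wunsch (global) and Smith-Waterman (local) scoring schemes, and it assumes a simple match/mismatch score and a single per-position gap penalty instead of the separate gap creation and extension penalties that real programs use. The function name and the default scores are made up.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score by dynamic programming.

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring pair of subsequences.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for gaps at the sequence ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + pair,  # align two residues
                       score[i - 1][j] + gap,       # gap in b
                       score[i][j - 1] + gap)       # gap in a
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]
```

With these scores, comparing "ACGT" to "TTACGTTT" gives a local score of 4 (the matching ACGT region) but a global score of -4, because the global alignment must pay for four gap positions to stretch the short sequence over the long one.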

Programs like fasta can be used for sequence comparison, but they are best used for searching entire databases.

We also have programs like MUMmer that can compare entire genomes. Here's a link to the Web version of MUMmer


Phylogenetic Analyses

GCG Evolutionary Analysis Programs

1. Perform the multiple alignment using PileUp or Clustal.

2. Manually adjust the alignment if necessary. It is very important to start with a correct alignment. You can do this by using the GCG program LineUp, or better yet, move the "msf" file over to the SeqLab Editor. Sometimes you have biological knowledge that you need to incorporate into the alignment. For example, you might not want to put a gap in the middle of an alpha helix, but instead, move the gap into a loop or a turn in the protein sequence.

3. Calculate the evolutionary distances using the Distances program.

4. Pass the results of the Distances program over to the GrowTree program.

You can also apply parsimony methods using the PAUP programs within GCG.
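To show what "calculating evolutionary distances" from an alignment means in step 3, here is a minimal Python sketch. It computes the uncorrected fraction of differing sites between two aligned nucleotide sequences and applies the Jukes-Cantor correction, which is one common correction and not necessarily the one you would pick in the Distances program. The function names are made up, and treating "-", "." and "~" as gap symbols is an assumption.

```python
import math

GAP_SYMBOLS = "-.~"  # assumed gap characters

def p_distance(seq1, seq2):
    """Fraction of differing sites between two aligned sequences, skipping gapped columns."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(x, y) for x, y in zip(seq1.upper(), seq2.upper())
             if x not in GAP_SYMBOLS and y not in GAP_SYMBOLS]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)

def jukes_cantor(p):
    """Jukes-Cantor corrected distance for nucleotide sequences."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)

def distance_matrix(alignment):
    """Pairwise corrected distances for a dict of name -> aligned sequence."""
    return {(a, b): jukes_cantor(p_distance(alignment[a], alignment[b]))
            for a in alignment for b in alignment}
```

A table of pairwise distances like the one returned by distance_matrix is the kind of input that a tree-building step works from.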

GDE within GCG

You can switch over to the SeqLab editor, and access the Phylip phylogenetic software. You can find all this software under the GDE menu while in the SeqLab Editor.

You can also run the Phylip software separately, but it's a bit more difficult.

Here is a discussion of some potential strategies for doing phylogenetic analysis.


Predicting Function

There are several different programs you can use to predict the function of an unknown protein or gene. I suggest using them all.

1. Search the sequence databases

Program

Advantages/Disadvantages

Fast; most recent database / May not find all sequences of interest

More thorough / Databases may be 2 weeks old / No EST databases

Fast; most thorough search / Database may be 2 months old / No EST databases

2. Search for Motifs within your unknown protein

3. Search for Blocks within your unknown protein

4. Search for sequence patterns within your unknown protein using Identify

5. Hidden Markov search of the PFAM database.

The best strategy is to use the programs together, especially BLAST, Decypher and Identify. Comparing the results from the different programs will give you the best insight into the function of your unknown protein.


Predicting Genes

Given a genomic sequence, there are several different computer programs that you can use to predict

These programs are not 100% accurate, but they can predict some of the coding regions within a genomic sequence.

Some programs are available locally, such as GCG's Frames and GrailPro (which works in SeqWeb), and others are out on the web. Try a couple of different programs. Grail EXP and GenScan seem to work the best.
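The simplest form of coding region prediction is a scan for open reading frames in all six reading frames. The Python sketch below illustrates only that idea. It is not how Frames, Grail or GenScan work, and it will miss genes that are interrupted by introns. The function names and the default minimum length are made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def find_orfs(dna, min_codons=30):
    """Return (strand, frame, start, end) for each ATG-to-stop open reading frame.

    start and end are 0-based offsets on the strand that was read;
    end is exclusive and includes the stop codon.
    """
    orfs = []
    forward = dna.upper()
    for strand, seq in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            start = None
            for pos in range(frame, len(seq) - 2, 3):
                codon = seq[pos:pos + 3]
                if start is None and codon == "ATG":
                    start = pos  # first ATG opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    # Count the codons before the stop codon.
                    if (pos - start) // 3 >= min_codons:
                        orfs.append((strand, frame + 1, start, pos + 3))
                    start = None
    return orfs
```

For example, find_orfs("CCATGAAATTTGGGTAACC", min_codons=3) returns [('+', 3, 2, 17)], a four-codon reading frame followed by a stop codon in the third forward frame of a made-up sequence.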

On PMGM, from an X-Window terminal, you can run the program X-Grail.


Printing or Viewing GCG Images

The recommended method is to use MacX, but the alternative methods can be seen on these web pages


Analyzing Arrays

Currently we have the following software

Online analysis of expression arrays using Significance Analysis

Don't forget to check out the Stanford Microarray Database


Genomic Analysis and Databases

ENSEMBL - Database of the Human Genome


Other Sequence Analysis Classes

There are several other classes about sequence analysis at Stanford

and also out on The Net.

Also, check out the BioCompanion