
CAMIS Symposium

Genomics and Bioinformatics:
Their Impact on Medicine and Molecular Biology

Monday, November 6, 1995
8:45 AM to 5:30 PM

Fairchild Auditorium
Stanford University Medical Center

The CAMIS symposium will cover leading-edge methods of genomics and bioinformatics and demonstrate current efforts to apply these methods to molecular biology and to the diagnosis and treatment of disease. Sequence information produced by the genome projects yields useful diagnostic probes immediately. In the longer term, analyses of inherited and infectious disease genes will provide deeper understanding of the disease and ultimately better therapies.

Symposium Schedule

Introduction: From Genes to Symptoms and Back
Doug Brutlag ( 8:45 - 9:00)

Automated DNA Sequencing and Analysis
Ron Davis ( 9:00 - 9:45)

The Human Genome Project: Current Progress and Future Directions
Rick Myers ( 9:45 - 10:30)

Posters & Demos (10:30 - 11:15)

Biomolecular Sequence Analysis: Score-based Methods
Sam Karlin (11:15 - 12:00)

Discovering the Meaning in Molecular Sequences
Doug Brutlag (12:00 - 12:45)

Lunch, Posters & Demos (12:45 - 2:00)

Computing with RNA Structure and Sequence: The 16S Ribosomal Subunit
Russ Altman ( 2:00 - 2:45)

Towards Large-Scale Modeling of Structure from Sequence
Michael Levitt ( 2:45 - 3:30)

Genetic Circuit Simulation: Stochastic Chemistry in the Phage Lambda Switch
Harley McAdams ( 3:30 - 4:15)

Summary: Russ Altman ( 4:15 - 4:30)

Posters & Demos ( 4:30 - 5:30)

Sponsored By
Affymetrix, Inc.
Perkin-Elmer's Bioresearch Division
Beckman Instruments
Hyseq, Inc.
Molecular Applications Group
SmithKline Beecham

The symposium is open and free to the public.

Parking permits for the day of the symposium in the "C" parking lots will be on sale for $1 at the Fairchild Center.

For more information, contact Pat Swift at (415) 725-2972.

Poster and Demonstration Titles

Detection of Virulence Genes in Salmonella Using Molecular Probes.
Baerbel Raupach & Stanley Falkow

Tracking Tuberculosis Transmission: Computer Assisted Molecular Epidemiology.
Greg Woelffer, Melvin Javonillo, Gary Schoolnik, & Peter Small

ENTREZ: Searching Molecular Biology Literature and Sequence Databases.
Heidi Heilemann & Jane Goh

Saccharomyces Genome Database:
One Stop Shopping for Yeast Information on the Internet.

Mike Cherry & David Botstein

Extracting Structural Implications from Sequence Information.
Tod Klingler

Parsing Gene Structure Using Phase-Specific Dynamic Programming.
Tom Wu

Automated Discovery of Discrete Motifs in Related Proteins:
Application to Reverse Transcriptase.

Tom Wu

Evolution of HIV-1 Reverse Transcriptase Mutations In vivo:
Constraints Imposed by Combination Antiretroviral Therapy.

Bob Shafer & Tom Merigan

The Bioinformatics Resource and the Molecular Modeling Laboratory.
Lee Kozar

Stochastica: Exact, Chemical-Event Level Simulation of Genetic Mechanisms.
Adam Arkin

The Structure of the 16S Ribosomal Subunit.
Doran Fink, Richard Chen, Ramon Felciano & Russ Altman

RiboWeb: Online Access to Structural Biology Information.
Ramon Felciano, Richard Chen, Doran Fink, & Russ Altman

Characterizing the Microenvironments within Proteins.
Liping Wei, Carol Cheng, Steve Bagley & Russ Altman

A Library of Protein Family Cores.
Robert Schmidt, Mark Gerstein & Russ Altman

ProteanD: Methods for Displaying Molecular Structural Uncertainty.
Russ Altman

Charged Residue Clusters in Protein 3-D Structures.
Zhan-Yang Zhu & Samuel Karlin

Genomic Functional Analysis in Yeast by Genetic Footprinting
Victoria Smith, Karen Chou, David Botstein, & Patrick Brown

Gene Disruption and Functional Analysis of All the Yeast Genes.
Dan Shoemaker, Deval Lashkari & Ron Davis

Ecocyc: An Interactive Encyclopedia of E. coli Metabolism.
Peter Karp & Suzanne Paley

The JBC Online - ASBMB and Stanford's HighWire Press present a full text WWW
version of the Journal of Biological Chemistry

Todd McGee, Maureen Phayer and Vicky Reich

Molecular fingerprinting in infectious diseases epidemiology: vignettes.
Lynette Waring, M.D., Patricia Mickelsen, Ph.D., Lucy S. Tompkins, M.D., Ph.D.

Exploring the diversity of uncultivated microbial pathogens with amplified
ribosomal DNA sequences.

Kristine Yoder, Ian Kroes, Dave Fredricks, David Relman

Tracking Y chromosomes genes during human evolution (tentative title).
Peter Oefner and Peter Underhill

Sequencing by Hybridization: A Genome Analysis Platform.
Radoje Drmanac, Hyseq, Inc.

A homology model of Dictyostelium myosin light chain kinase based on cyclic
AMP-dependent protein kinase, and assessment of the model.

Jonathan M. Goldberg & Janet L. Smith

Full Abstracts

Computing with RNA Structure and Sequence: The 16S Ribosomal Subunit



From Genes to Symptoms and Back

Doug Brutlag

Molecular biology has been described as the study of the flow of
information from genome to phenotype. Classically, the steps in this flow
include transcription, translation, processing and targeting of the gene
products in the organism. It is the goal of bioinformatics to be able to
predict this flow of information and the transformations the informational
molecules take during their flow from genome to phenotype.

Medicine is being revolutionized by these molecular methods. Genome
mapping and sequencing has led to novel diagnostic probes for both
inherited and infectious disease. However, improved therapies depend on a
deeper understanding of the disease based on the genetic changes. It is
through the use of bioinformatics to understand the nature of the inherited
defect or the infectious agent that better, more rational therapies will be
designed (Brutlag, D. L. (1994). Understanding the Human Genome. In Leder,
P., Clayton, D. A. and Rubenstein, E. (Eds.), Scientific American:
Introduction to Molecular Medicine (pp. 153-168). New York NY: Scientific
American Inc.).

Discovering the Meanings in Molecular Sequences

Doug Brutlag

When confronted with a new DNA or protein sequence, biologists often argue
by homology. Comparison of the new sequence with all the known sequences
often (more than 50% of the time) yields a similar sequence in another
organism, or a related gene in the same organism. If sequence-similarity
tests fail, then there are databases of consensus motifs or sequence
patterns that can also be compared with the new sequence. Motifs often
identify short regions of functional significance that give a clue to a
possible function of the new protein. Motifs are generated from families
of related sequences either by heuristic procedures or by methods commonly
known as machine learning. Motifs are of several forms including consensus
patterns, blocks, profiles, templates and hidden Markov models.

All of the sequence-similarity and motif methods are based on evolutionary
principles involving conservation of residues, or conservation of some
property of the residues, at specific positions in the motif. They also
all make the implicit assumption that the residues at each position are
independent of the residues at all other positions. While they yield
insight as to what is conserved, they do not often yield an understanding
why those residues are conserved.

Recently we have examined sets of sequences of common structure or function
for correlations between non-conserved positions. We have found that many
highly variable positions have properties that are correlated with
the properties of the residues at another position in a motif. Tod
Klingler has discovered 12 pairs of correlated residues in α-helices which
stabilize the helical structure (Klingler, T. M. and Brutlag, D. L.
(1994). Discovering Structural Correlations in α-Helices. Protein Science
3, 1847-1857). He has also discovered a motif that is found at the ends
of α-helices. Correlated positions within sequence families, unlike
conserved positions, reveal structural features.
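
One common way to quantify correlation between two positions in a set of aligned sequences is mutual information. The sketch below is only an illustration under simple assumptions (ungapped, equal-length sequences, raw frequency estimates with no small-sample correction); the function name and the toy alignment are hypothetical and are not taken from the work cited above.

```python
from collections import Counter
from math import log2


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    `alignment` is a list of equal-length strings. Probabilities are
    estimated directly from observed frequencies.
    """
    n = len(alignment)
    pair_counts = Counter((seq[i], seq[j]) for seq in alignment)
    counts_i = Counter(seq[i] for seq in alignment)
    counts_j = Counter(seq[j] for seq in alignment)

    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment: columns 0 and 2 vary together, columns 0 and 1 do not.
toy_alignment = ["AKL", "AEL", "DKV", "DEV"]
linked = mutual_information(toy_alignment, 0, 2)
unlinked = mutual_information(toy_alignment, 0, 1)
```

In this toy case neither column 0 nor column 2 is conserved, yet knowing one residue determines the other, which is the kind of signal a conservation-only motif would miss.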

Hence there are at least two fundamentally different types of information
present in sequence families of common structure or function. On the one
hand, there is the classical information represented by conserved residues
or properties at distinct positions in a sequence. Conserved positions can
point out important functional groups without explaining the nature of the
function. On the other hand, correlations between positions in sets of
sequences (the so-called "mutual information") have direct structural

Automated DNA Sequencing and Analysis


Towards Large-Scale Modeling of Structure from Sequence


Towards Large-Scale Automatic Modeling of Structure from Sequence

Michael Levitt, Doug Laurents*, Enoch Huang and S. Subbiah
Department of Structural Biology, Stanford Medical School, Stanford, CA 94305

With the dramatic increase in the rate of determination of DNA and hence
protein sequences, experimental structural studies simply cannot keep up.
Today we know the sequences for many thousands of genes for which there is no
structure and this structure-sequence gap is doubling every 18 months.
Fortunately, proteins that have similar sequences also have similar structures.

This, together with recent improvements in automatic modeling of structures by
homology, opens up the way to producing reasonable structures for an increasing
number of genes.

In this talk, we describe our work in this area. First, we present
our method of homology modeling by segment matching, and show how it can be used to
build a structure for a new sequence aligned to an old sequence of known
structure. Second, we show that proteins without any significant level of
sequence similarity often have similar structures. Our method for structural
alignment, Structal, which compares structures in much the same way as sequence
alignment compares sequences, is used to examine all known proteins. Third, a
new and very simple energy function we have developed is being combined with
alignment algorithms to help improve the sensitivity. The power of our
approach will be greatly enhanced if we can better detect whether a new structure can be
accurately modeled from an old one.

This work, which has already given a number of interesting results, is presently
under active study in the hope that we can maximize the value of the sequence
information in the genome.

Genetic Circuit Simulation: Stochastic Chemistry in the Phage Lambda Switch


The Human Genome Project: Current Progress and Future Directions


Biomolecular Sequence Analysis: Score-based Methods


Title: Biomolecular Sequence Analysis: Score-based Methods

Author: Samuel Karlin

Email: fd.zgg@forsythe.stanford.edu

Web page: http://gnomic.stanford.edu/Group.html/#Sam


Mathematical (statistical) ideas and computational approaches play an increasing role in
modern molecular biology research. I review the method of score-based sequence analysis
with the objective of discerning distinctive segments in single sequences and in resolving
significant common segments in sequence comparisons. Examples will focus on identifying
charge clusters in sequences and in protein structures, on exon prediction, and on characterizing
heterogeneity in genomes. Some interpretations will be discussed.
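
To make the idea of a high-scoring segment concrete, the following sketch assigns a score to each residue and finds the contiguous segment with the largest total score. It is a minimal illustration only: the scoring scheme (+2 for charged residues, -1 otherwise) and the example sequence are hypothetical, and the statistical assessment of whether a segment score is significant is not included.

```python
CHARGED = set("KRDE")


def charge_score(residue):
    """Hypothetical scoring scheme: reward charged residues, penalize others."""
    return 2 if residue in CHARGED else -1


def max_scoring_segment(seq, score_of):
    """Return (score, start, end) of the highest-scoring contiguous segment.

    `end` is exclusive, so the segment is seq[start:end]. A single pass
    keeps a running sum that is reset whenever it drops to zero or below.
    """
    best_score, best_start, best_end = 0, 0, 0
    running, start = 0, 0
    for i, residue in enumerate(seq):
        running += score_of(residue)
        if running <= 0:
            running, start = 0, i + 1
        elif running > best_score:
            best_score, best_start, best_end = running, start, i + 1
    return best_score, best_start, best_end


example = "AAAKKEEDKAAA"
score, start, end = max_scoring_segment(example, charge_score)
cluster = example[start:end]
```

Changing the scoring function (for example, scoring only acidic residues) changes which kind of cluster the same procedure picks out.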

Posters and Demonstrations:

1) Detection of Virulence genes in Salmonella Using Molecular Probes.
Baerbel Raupach & Stanley Falkow

Detection of Virulence Genes in Salmonella using Molecular Probes
Baerbel Raupach & Stanley Falkow, Department of Microbiology and
Immunology, Stanford
e-mail address: raupach@cmgm.stanford.edu

In the past, several different genetic approaches have been employed to identify
bacterial virulence factors. However, comprehensive screens of bacterial genomes
for virulence genes in vivo have not been possible because of the inability to
identify mutants with attenuated virulence within pools of mutagenized bacteria,
and the inability to assess separately the virulence of each of several thousand
mutants necessary to screen a bacterial genome.

Recently, this problem has been circumvented by a new methodology, called signature-tagged
mutagenesis (STM), which was developed by Hensel et al. (Science 7/95). Each transposon
mutant is tagged by a unique DNA sequence. Mutants of virulence factors can be identified
by negative selection, i.e., by comparing the identity of mutagenized bacteria recovered
post-infection with the total pool of transposon mutants used as the inoculum. Hensel et al.
have evaluated the signature-tagged mutagenesis system by identifying S. typhimurium mutants
with attenuated virulence after intraperitoneal infection of mice.

We are currently utilizing the same set of signature-tagged mutants to conduct a comprehensive
screen of the Salmonella genome for virulence determinants -- both in vitro and in vivo.
We have screened for S. typhimurium mutants that are either non-invasive for cultured
epithelial cells or non-replicating in phagocytes (murine macrophage-like cells RAW264.7).
Moreover, we are in the process of identifying the factors that are
required for the interaction of S. typhimurium with its host in vivo by oral
inoculation of BALB/c mice with pools of transposon mutants.
The technique of signature-tagged mutagenesis will be introduced and the virulence
genes of Salmonella typhimurium identified so far using signature tags as molecular
probes will be presented.
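
The negative-selection step described above amounts to comparing two sets of tags: those in the inoculum and those recovered after infection. The sketch below shows only that comparison; the tag identifiers are hypothetical, and in practice tag presence is read from hybridization signals rather than given as clean lists.

```python
def attenuated_candidates(inoculum_tags, recovered_tags):
    """Tags present in the inoculum pool but not recovered post-infection.

    Each tag identifies one transposon mutant, so a missing tag marks a
    mutant that may have attenuated virulence.
    """
    return sorted(set(inoculum_tags) - set(recovered_tags))


# Hypothetical pool of four tagged mutants.
inoculum = ["tag01", "tag02", "tag03", "tag04"]
recovered = ["tag01", "tag03"]
candidates = attenuated_candidates(inoculum, recovered)
```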

2) Tracking Tuberculosis Transmission: Computer Assisted Molecular Epidemiology.
Greg Woelffer, Melvin Javonillo, Gary Schoolnik & Peter Small

A Computer Assisted Molecular Epidemiologic Approach for Confronting
the Re-emergence of Tuberculosis

Gregory B. Woelffer, B.A., Melvin Javonillo, B.A., Gary K. Schoolnik,
M.D., and Peter M. Small, M.D.
Division of Infectious Disease and Geographic Medicine, Stanford Medical
School and Division of Tuberculosis Control, SF Dept of Public Health

WWW page: http://molepi.stanford.edu/tb.www.html

Abstract: Molecular epidemiologic approaches have provided important insights into
the pathogenesis and epidemiology of tuberculosis (TB). However, continued progress
in this field will be reliant upon the development of computerized information management
systems capable of analyzing large numbers of bacterial DNA fingerprints and integrating
them with data collected as part of conventional disease surveillance. The specific
attributes of these computer systems must be tailored to the nature and scope of the
research question. In this paper we describe a system which is being used for the
surveillance of M. tuberculosis (MTB) strains in San Francisco. The current performance
characteristics are described and potential future developmental directions are outlined.
This system demonstrates several general principles of computerized molecular epidemiology
which are likely to be of increasing applicability to a variety of pathogens.

3) ENTREZ: Searching Molecular Biology Literature and Sequence Databases.
Heidi Heilemann & Jane Goh

Entrez and Internet Grateful Med Demonstrations
Jane Goh and Heidi Heilemann
Lane Medical Library
Stanford University Medical Center

Entrez is an integrated multi-database searching system for protein and
nucleotide molecular sequences and associated MEDLINE references. This
menu-driven system was developed by the National Center for Biotechnology
Information (NCBI), a division of the National Library of Medicine. Unique
characteristics of Entrez and its applications will be presented. Its
Sequences database portion includes records from GenBank, PIR, SWISS-PROT,
and other databases. Sequence-associated MEDLINE records plus the MEDLINE
records indexed with the "molecular sequence data" MeSH term make up the
Reference portion. Both the Network Entrez (nEntrez) and the Web Browser
Entrez interfaces will be demonstrated.

The Internet Grateful Med prototype is a new program being developed by the
National Library of Medicine for assisted searching in MEDLINE via the World
Wide Web. Using a web browser like Netscape, users can create, submit and
refine a search in MEDLINE. Internet Grateful Med also offers direct links
to the full text of the Clinical Practice Guidelines supported by the Agency
for Health Care Policy and Research (AHCPR) available from NLM on the Health
Services/Technology Assessment Text (HSTAT) World Wide Web server. Internet
Grateful Med also offers direct access to more than 60,000 images in NLM's
collection of Online Images from the History of Medicine division on another
NLM web server. HIV-AIDS information in the files AIDSLINE, AIDSTRIALS and
AIDSDRUGS will soon be available via Internet Grateful Med.

4) Saccharomyces Genome Database: One Stop Shopping for Yeast Information
on the Internet.
Mike Cherry & David Botstein

Saccharomyces Genome Database: One Stop Shopping for Yeast Information
on the Internet

J.M. Cherry, C. Adler, C. Ball, B. Dunn, R. Schmidt, S. Dwight, and
D. Botstein
Department of Genetics, Stanford University School of Medicine,
Stanford, California 94305-5120

A genomic database for Saccharomyces cerevisiae is available from the Saccharomyces
Genome Database project (SGD). The project acts as the S. cerevisiae gene name
registry and the depository for genetic and physical mapping information.
Information on the literature, biology and biochemical/molecular interactions
of yeast gene products are included. The results of the completed chromosomal
sequencing projects are integrated into the database. Connections to the Yeast
Protein Database from CSHL, GenBank from NCBI, and SwissProt from EMBL are also

The database is accessible through the Internet using Gopher from the host
genome-gopher.stanford.edu using port 70. The World Wide Web (WWW) form of
the database provides access to the full database and allows queries to be
entered through specialized forms. Many types of information are presented
graphically, including the genetic and physical maps and features of large DNA
sequences. The WWW version of SGD is available from
URL: http://genome-www.stanford.edu/

5) Extracting Structural implications from Sequence Information.
Tod Klingler

6) Parsing Gene Structure Using Phase-Specific Dynamic Programming.
Tom Wu and Doug Brutlag

I present a general segment-based dynamic programming algorithm for predicting
gene structure from a sequence of genomic DNA. This algorithm finds the most
likely gene structure based on three types of criteria: (1) Junctional constraints,
which specify possible sites of initiation, termination, and splicing;
(2) Frame constraints, which ensure that the total exon length is a multiple
of three and that no stop codons are formed internally or at exon-exon junctions; and
(3) Scoring functions that estimate the likelihood that a region is coding or that a
site is a true intron-exon junction. The algorithm is general because it can optimize
any set of scoring functions over all gene structures that satisfy the junctional and
frame constraints. In contrast with existing dynamic programming algorithms, which
compute over individual nucleotides or over clusters of exon candidates, segment-based
dynamic programming computes over all possible intron and exon segments. Because the
algorithm maintains reading frame and phase information for each segment, it can both
assemble exons in-frame and score them in-frame. I present experimental evidence on the
computational power of in-frame exon assembly and in-frame exon scoring. Using a simple
scoring measure based on Markov hexamer frequencies, my results show that in-frame assembly
improves specificity only slightly over frame-independent assembly, whereas in-frame
scoring improves specificity by a factor of three over frame-independent scoring.
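
The frame constraints in criterion (2) can be illustrated with a small check on a candidate gene structure. This sketch is not the segment-based dynamic programming algorithm itself; it only tests whether one proposed set of exons satisfies the frame constraints. The function names, the half-open coordinate convention, and the toy sequences are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice(dna, exons):
    """Concatenate exon segments given as half-open (start, end) intervals."""
    return "".join(dna[start:end] for start, end in exons)


def satisfies_frame_constraints(dna, exons):
    """Check total exon length and the absence of premature stop codons.

    The final codon is allowed to be a stop codon (the termination site).
    Stops formed across an exon-exon junction are caught because the check
    runs on the spliced sequence.
    """
    coding = splice(dna, exons)
    if not coding or len(coding) % 3 != 0:
        return False
    codons = [coding[k:k + 3] for k in range(0, len(coding), 3)]
    return all(codon not in STOP_CODONS for codon in codons[:-1])


# Toy gene: exon ATGGCT, intron GTAAGTAG, exon GAATAA.
gene_ok = "ATGGCTGTAAGTAGGAATAA"
ok = satisfies_frame_constraints(gene_ok, [(0, 6), (14, 20)])

# Toy gene where splicing joins T + AA into an in-frame TAA stop.
gene_bad = "ATGTGTCCAGAAGCCTGA"
junction_stop = satisfies_frame_constraints(gene_bad, [(0, 4), (10, 18)])
```

A full gene finder would apply a check of this kind to every candidate structure implicitly, by carrying frame and phase along each segment rather than re-splicing the sequence.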

7) Automated Discovery of Discrete Motifs in Related Proteins: Application
to Reverse Transcriptase.
Tom Wu and Doug Brutlag

We present a method for discovering discrete motifs from a set of related
protein sequences. Our method addresses a fundamental difficulty in motif
identification: the possible presence of incoherent data. Several sources
of incoherent data--such as the presence of subclasses, contaminating sequences,
errors, or misalignments--may obscure an underlying discrete motif. Our
algorithm relies on a search process to detect incoherent data by constructing
alternative partitions of the sequence data. The goal is to find a subset
of the sequence data that generates a sensitive and specific motif. The search
process is guided by knowledge about meaningful groups of amino acids, or
alphabets, organized according to biochemical and physical properties, such
as charge, volume, or hydrophobicity. These alphabets may be based on
theoretical grounds or on empirical studies of amino acid substitutions.
We have developed a statistical technique for identifying amino acid alphabets
empirically and have derived an alphabet using the BLOCKS database of related
protein sequences. We have tested the overall motif identification algorithm
on a set of reverse transcriptases that contains subclasses, sequence errors,
misalignments, and contaminating sequences. Despite these sources of incoherent
data, our program identifies a novel motif for the subclass of retroviral and
retrovirus-related reverse transcriptases. This motif has a much higher
specificity than previously reported discrete motifs and suggests the importance
of conserved hydrophilic and hydrophobic residues in the structure or function
of reverse transcriptases.
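
As a rough illustration of how amino acid alphabets turn an alignment into a discrete motif, the sketch below maps each aligned column to a single residue, to the smallest covering group, or to a wildcard. The groups listed are hypothetical examples, not the alphabet derived from the BLOCKS database, and the search over alternative partitions of the data described above is not implemented.

```python
# Hypothetical amino acid groups, chosen only for illustration.
GROUPS = [
    set("DE"),    # acidic
    set("KRH"),   # basic
    set("ILVM"),  # aliphatic hydrophobic
    set("FWY"),   # aromatic
    set("STNQ"),  # polar uncharged
]


def column_symbol(residues):
    """Describe one alignment column as a residue, a group, or 'x'."""
    observed = set(residues)
    if len(observed) == 1:
        return next(iter(observed))
    covering = [group for group in GROUPS if observed <= group]
    if not covering:
        return "x"  # no single group explains this column
    smallest = min(covering, key=len)
    return "[" + "".join(sorted(smallest)) + "]"


def discrete_motif(aligned):
    """Build a discrete motif string from equal-length aligned sequences."""
    return "".join(column_symbol(column) for column in zip(*aligned))


motif = discrete_motif(["KDLFG", "RELYG", "KDIWA"])
```

A contaminating or misaligned sequence would typically push several columns to "x", which is one way to see why detecting incoherent data matters before the motif is built.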

8) Evolution of HIV-1 Reverse Transcriptase Mutations In vivo: Constraints
Imposed by Combination Antiretroviral Therapy.
Bob Shafer & Tom Merigan

Evolution of HIV-1 Reverse Transcriptase (RT) Mutations In vivo:
Constraints Imposed by Combination Antiretroviral Therapy. Robert
W. Shafer, Astrid K. Iversen, Mark A. Winters, Michael J. Kozal,
and Thomas C. Merigan.

Introduction: HIV-1 has an enormous capacity to mutate and become resistant
to antiretroviral agents. However, certain RT mutations conferring resistance
to one drug may suppress resistance to a second drug, suggesting that during
combination therapy HIV-1 may be prevented from developing certain combinations
of mutations. For example, in molecular HIV-1 constructs, the ddI-resistance
RT mutation L74V suppresses drug resistance conferred by the AZT-resistance RT
mutation T215Y. Methods: We studied the drug susceptibility, RT sequence, and
plasma virus burden of HIV-1 strains from patients receiving combination therapy
with AZT+ddI for r2 years. Results: Among 38 patients receiving AZT+ddI, the
AZT-resistance mutation T215Y occurred in 21 (55%) patients but the ddI-resistance
mutation L74V was prevented. Three (8%) patients developed multidrug-resistant
HIV-1 strains with 5 newly identified mutations, Q151M, A62V, V75I, F77L,
and F116Y. These mutations developed in a sequential and cumulative pattern.
Certain combinations of mutations associated with attenuated virus replication
were replaced by nonattenuating combinations. Analysis of the 3-D structure
of HIV-1 RT shows that positions 75, 77, and 151 are in the fingers subdomain
and closely associated with nucleotides of the template strand.

Drug resistance and viral replicative capacity both play a role in
selection of HIV-1 RT mutations in vivo. Antiretroviral regimens
which select for HIV-1 strains with even a partially reduced level
of viral function may be of clinical benefit if the extent of replication
impairment reduces the steady state level of the resistant mutant virus.

9) The Bioinformatics Resource and the Molecular Modeling Laboratory.
Lee Kozar

10) Stochastica: Exact, Chemical-Event Level Simulation of Genetic Mechanisms.
Adam Arkin

Stochastica: Exact, Chemical-Event Level Simulation of Genetic Mechanisms

Stochastica is a simulation language complete with graphical
point-and-click user interface for the simulation of genetic
and biochemical mechanisms. Most computer simulations of biochemical
networks are based on the assumption that macroscopic mass-action
kinetics are valid for cellular reactions. This assumption breaks
down when the number of molecules per unit volume is small and/or
probabilities per unit time for a sub-set of reactions are small.
This is the case, for example, in genetic transcription which involves
relatively infrequent single molecule events which must be modeled at
the so-called mesoscopic rather than macroscopic scale in order to
better capture the actual dynamics. In addition, reactions involving
small numbers of signaling molecules and metabolites must also be
modeled in this way. Stochastica efficiently and transparently implements
the Master-Equation approach to chemical kinetics, which allows both the
"average" dynamic behavior of a biochemical network and the important
fluctuational dynamics to be calculated. These more accurate dynamics
can differ considerably from the macroscopic predictions. Initial
analyses of genetic feed-back loops, the Lambda phage lysis/lysogeny
decision and other systems using Stochastica have indicated that the
fundamental statistical nature of genetic transcription and translation,
which leads to irregular bursts of genetic activity and protein production,
can have profound implications for cellular control and can lead to a
multiplicity of phenotypes from a single genotype starting with identical
initial conditions. These results emphasize the necessity of such simulations
for testing hypotheses about the behaviour of even relatively simple genetic
networks. This demonstration will explain the basics of model building,
simulation and analysis with Stochastica using examples of small genetic
feed-back circuits.
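
The kind of chemical-event level simulation described above can be sketched with Gillespie's direct method, a standard exact stochastic simulation procedure consistent with the master equation. This is not Stochastica's own implementation; the two-reaction model (constant production, first-order decay of one product), the rate constants, and the function name are assumptions chosen to keep the example short.

```python
import random


def gillespie_birth_death(k_make, k_decay, t_end, seed=0):
    """Simulate production and decay of one molecular species.

    Reactions: (1) nothing -> X at rate k_make, (2) X -> nothing at rate
    k_decay * n. Returns lists of event times and molecule counts.
    Requires k_make > 0 so that the total rate is never zero.
    """
    rng = random.Random(seed)
    t, n = 0.0, 0
    times, counts = [t], [n]
    while True:
        rate_make = k_make
        rate_decay = k_decay * n
        total = rate_make + rate_decay
        # Waiting time to the next reaction is exponentially distributed.
        t += rng.expovariate(total)
        if t > t_end:
            break
        # Choose which reaction fires, in proportion to its rate.
        if rng.random() * total < rate_make:
            n += 1
        else:
            n -= 1
        times.append(t)
        counts.append(n)
    return times, counts


# Hypothetical rate constants; two runs differ only in their random seed.
times_a, counts_a = gillespie_birth_death(k_make=2.0, k_decay=0.1, t_end=50.0, seed=1)
times_b, counts_b = gillespie_birth_death(k_make=2.0, k_decay=0.1, t_end=50.0, seed=2)
```

Runs with different seeds start from identical initial conditions yet follow different trajectories, which is the fluctuation behavior a deterministic mass-action model cannot show.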

11) The Structure of the 16S Ribosomal Subunit.
Doran Fink, Richard Chen, Ramon Felciano & Russ Altman

12) RiboWeb: Online Access to Structural Biology Information.
Ramon Felciano, Richard Chen, Doran Fink & Russ Altman

13) Characterizing the Microenvironments within Proteins.
Liping Wei, Carol Cheng, Steve Bagley & Russ Altman

14) A Library of Protein Family Cores.
Robert Schmidt, Mark Gerstein, & Russ Altman

15) ProteanD: Methods for Displaying Molecular Structural Uncertainty.
Russ Altman

16) Charged Residue Clusters in Protein 3-D Structures.
Zhan-Yang Zhu & Samuel Karlin

Charged Residue Clusters in Protein 3-D Structures

Zhan-Yang Zhu & Samuel Karlin

Department of Mathematics
Stanford University
Stanford, CA 94305-2125


A protein 3-D structure with N residues is represented by a matrix of N linear
sequences or tree-like structures. The statistical methods of high scoring
segments and binomial model are applied to the N linear sequences or tree-
like structures respectively to identify significant clusters of a desired
type of residues, e.g., charge residues (acidic residues, positive charge
residues and mixed charge residues) or buried residues as defined by solvent

The residues in a charge cluster are usually from different regions of its
primary sequence or from two or more chains. It is shown that charge clusters
can provide local stability and inter-chain stability by forming hydrogen bonds
and salt bridges. They can also provide the necessary binding environment for calcium
and other substrates. Charge clusters will be illustrated with several
examples.

17) Genomic Functional Analysis in Yeast by Genetic Footprinting.
Victoria Smith, Karen Chou, David Botstein, & Patrick Brown

Genomic Functional Analysis in Yeast by Genetic Footprinting
Victoria Smith (1), Karen Chou (1), David Botstein (2), Patrick O. Brown (1,3).
Departments of Biochemistry (1) and Genetics (2), Stanford University, and the
Howard Hughes Medical Institute (3).

As a result of the yeast genome sequencing effort, thousands of
putative genes of unknown function are being identified. We have
developed a genomic strategy, genetic footprinting, aimed at rapidly and
efficiently providing information on the biological function of new DNA
sequences in Saccharomyces cerevisiae. Insertional mutagenesis, using an
inducible Ty1 transposable element, followed by growth under different
selections, is performed en masse in large populations of cells. PCR
amplification using a primer specific to a sequence of interest and a second
primer specific to the Ty1 element detects Ty1 insertions at that sequence.
If that sequence was important for growth under a particular selection, the
cells harboring these Ty1 insertions are depleted from the cell population with
each generation of growth. The corresponding depletion of PCR product bands,
the "genetic footprint," indicates a role for that sequence in the selection.
As the effects of mutations in any DNA sequence under any particular selection
can be determined retrospectively by a single PCR, each selection need only be
performed once to enable the functions of thousands of genes to be readily

In the last year, we have obtained data for 249 predicted genes on
chromosome V, analyzed under several different selections, by genetic
footprinting. Ty1 insertion mutations were detected using
fluorescently-labeled primers specific for each gene. Mutations in
32 genes conferred a severe growth disadvantage in all selections
(essential genes). Mutations in another 44 genes conferred more subtle
growth disadvantages under all selections. In addition, another 32 genes
were found to be necessary for growth in minimal medium, or medium containing
lactate as the sole carbon source, or medium containing salt or caffeine,
or for growth at high temperature. Overall, more than half of these genes
are novel genes that have not been previously characterized. We are currently
adding new selections to expand our understanding of gene function in S. cerevisiae.

18) Gene Disruption and Functional Analysis of All the Yeast Genes.
Dan Shoemaker, Deval Lashkari & Ron Davis

The use of Genomic Mismatch Scanning in conjunction with microarray
technology to map yeast multigenic traits.


Joe DiRisi, Lolita Penland and Pat Brown

Most of the complex phenotypes observable in organisms ranging from
humans to yeast are likely due to the interaction of multiple genes, as
opposed to single gene traits, which have been the practical focus of
classical genetics. Although these multigenic traits are among the most
important, especially for human health and industrial purposes, they are
among the most difficult to map. Genomic Mismatch Scanning (GMS) is a
mapping technique which has the ability to greatly reduce the complexity
and labor of mapping such traits and yeast is an excellent model system to
test and resolve issues concerning GMS as a mapping tool.

The results of the GMS technique, which consist of identical-by-descent
DNA sequences shared between two individuals, can be easily assayed on a
DNA microarray. With an ordered set of array elements, such as those
presented in this poster, high resolution recombination data can also be
directly obtained.

20) Ecocyc: An Interactive Encyclopedia of E. coli Metabolism.
Peter Karp & Suzanne Paley

EcoCyc: Encyclopedia of E. coli Genes and Metabolism
Peter D. Karp
SRI International

EcoCyc is a knowledge base of E. coli genes and metabolism that runs
through the World Wide Web and on Unix workstations. Its graphical user
interface creates drawings of metabolic pathways, of individual reactions,
and of the E. coli genomic map. Users can call up objects through a variety
of queries (such as retrieving an enzyme by a substring search), and then
navigate to related entities shown in the resulting display window. For
example, a user could zoom in on a region of the genetic map, click on a
gene to obtain detailed information about it, and then navigate to the
enzyme product of the gene, and then to the metabolic pathway containing the
enzyme. Metabolic pathway drawings are produced automatically, and can be drawn
in several styles, such as with compound structures present or absent. The
EcoCyc knowledge base currently contains information about 100 pathways, 400
enzymes, 600 reactions, 1100 metabolic compounds, and 2030 E. coli genes.
It will eventually contain information about all pathways, enzymes, and reactions
of E. coli metabolism. EcoCyc contains extensive information about each enzyme,
including its cofactors, activators and inhibitors (qualified by type), subunit
composition, substrate specificity, and molecular weight. Individual values in
the knowledge base are extensively annotated with citations to the literature,
as are comment fields.

21) The JBC Online - ASBMB and Stanford's HighWire Press present a full
text WWW version of the Journal of Biological Chemistry
Todd McGee, Maureen Phayer and Vicky Reich

Title: The JBC Online - ASBMB and Stanford's HighWire Press present
a full text WWW version of the Journal of Biological Chemistry

Presenters: Todd McGee, Maureen Phayer, Vicky Reich

Abstract: Full text of the Journal of Biological Chemistry is on the web beginning with the
April 14, 1995 issue. New issues are online within 48 hours of the cover date. Some
features of the online version include: full text searching, expandable line art and
gray scale images, links to tables of contents of upcoming issues, links to Medline and GenBank,
and print ready versions (PDF) of articles. This service is available free through 1995.
URL: http://www-jbc.stanford.edu/jbc/

22) Molecular fingerprinting in infectious diseases epidemiology: vignettes
Lynette Waring, M.D., Patricia Mickelsen, Ph.D., Lucy S. Tompkins, M.D., Ph.D.

"Molecular fingerprinting in infectious diseases epidemiology: vignettes"
Lynette Waring, M.D., Patricia Mickelsen, Ph.D., Lucy S. Tompkins, M.D. Ph.D.

Molecular fingerprinting of infectious disease agents has become an important
tool in epidemiological investigations of outbreaks of infectious disease.
These methods apply conventional techniques to examine DNA or RNA sequence
differences among a collection of isolates. Isolates with identical
"fingerprints" are considered to be derivatives of the same strain or clone for
epidemiological analysis. Among the techniques used in diagnostic microbiology
laboratories are plasmid profile analysis, analysis of genomic DNA by restriction
endonuclease digestion and separation by conventional or pulsed-field gel
electrophoresis, Southern hybridization, PCR amplification using specific
or random primers, and DNA sequence analysis. We present several examples
of the application of these methods to solve infectious disease "mysteries".

23) Exploring the diversity of uncultivated microbial pathogens with
amplified ribosomal DNA sequences
Kristine Yoder, Ian Kroes, Dave Fredricks, David Relman

24) Tracking Y chromosome genes during human evolution (tentative title)
Peter Oefner and Peter Underhill

Tracking the Y chromosome during human evolution.
Peter A. Underhill (1), Li Jin (1), and Peter J. Oefner(2)
Departments of Genetics (1) and Biochemistry (2), Stanford University, CA

The non-recombining portion of the human Y chromosome provides a unique system
for the study of human origins, migration and admixture. However, few Y-chromosome
polymorphisms have been identified to date, presumably due to a smaller effective
population size relative to autosomes, polygyny and, possibly, the reduction of
variation on the entire chromosome due to selection at a single locus. Recently,
the search for polymorphisms on the Y chromosome has been significantly accelerated
by means of denaturing high-performance liquid chromatography, which enables the
detection of single-base changes in DNA fragments as large as 1.5 kb. The method
exploits the differential retention, under partially denaturing conditions, of
heteroduplex molecules formed upon reannealing PCR products generated from the same
Y-specific region of two individual genomes. Heteroduplexes are generally retained less
than homoduplexes, and both are readily identified by on-line UV absorbance detection.
With the exception of very AT- or GC-rich sequences, which are best analyzed at temperatures
slightly lower or higher than 56°C, respectively, a single set of conditions is sufficient.
No sample pretreatment is required, and analyses can be carried out within minutes in a
completely automated fashion. Using this technique, several new Y-specific
polymorphisms have been identified, the most significant of which is a C to T transition
found exclusively within the Western Hemisphere. The Pre-Columbian T allele occurs at more
than 90% frequency within the native South and Central American populations examined, while
its occurrence in North America is about 50%. Concomitant genotyping at the polymorphic
tetranucleotide microsatellite locus DYS19 revealed significant linkage disequilibrium
of the C to T mutation with the 186-bp allele. The data suggest a single origin of
linguistically diverse native Americans with subsequent haplotype differentiation within
radiating indigenous populations as well as Post-Columbian European and African gene flow.
Overall, the polymorphisms found to date indicate a pattern of pronounced geographical
localization of Y-specific nucleotide substitutions. Furthermore, it becomes evident that
the level of variation on the Y chromosome, which is approximately 3.5-fold lower
than on autosomes, is mainly due to the smaller effective population size of the Y
chromosome rather than to selective sweeps.

25) Sequencing by Hybridization: A Genome Analysis Platform.
Radoje Drmanac, Hyseq Inc.

26) A homology model of Dictyostelium myosin light chain kinase based on
cyclic AMP-dependent protein kinase, and assessment of the model
Jonathan M. Goldberg & Janet L. Smith


A homology model of Dictyostelium myosin light-chain kinase
based on cyclic AMP-dependent protein kinase, and evaluation
of the model


Jonathan M. Goldberg & Janet L. Smith

CONTACT PERSON: Jon Goldberg, B439 Beckman Ctr., 723-6902, FAX 725-6044


Dictyostelium myosin light-chain kinase (MLCK, 295 residues) is activated by
intramolecular phosphorylation of Thr-289, and by phosphorylation of Thr-166
by another kinase. The catalytic domain of MLCK is about 35% identical to
that of cyclic AMP-dependent kinase (PKA, 350 residues) for which the crystal
structure of the ternary complex with ATP and an inhibitor peptide (24 residues)
has been solved [Knighton, D. R. et al. Science 253, 407-413 (1991)]. Sequence
alignment and homology modeling were done iteratively using LOOK 1.0 (Molecular
Applications Group, Palo Alto, CA). The interactions of conserved residues of
the MLCK model and PKA are nearly identical. The viability of the model was also
assessed using the Profile-3D module of INSIGHT 2.0 (San Diego, CA), which
indicated that the primary sequence of MLCK and the 3-dimensional profile of
the model are compatible. The model predicts that conserved residues Arg-129
and Lys-156 interact with the phosphate group on conserved Thr-166. A His
residue which also interacts with the homologous phosphothreonine-197 in PKA
becomes Asn in the MLCK sequence, but the modeling process positions a non-conserved
Lys residue, which is one position N-terminal from the Asn residue, to interact with
the phosphothreonine. Autophosphorylation of Thr-289 occurs intramolecularly, and
in the model this residue is well positioned to accept the gamma phosphate of ATP.
A conserved Arg residue three residues N-terminal of the phosphate acceptor is
positioned to interact with a conserved Glu residue in the catalytic core.
Part of the intervening loop connecting the carboxyl terminus of the model
to the remainder of the molecule interacts with a 5 residue insertion after
the catalytic core region; thus, two non-conserved loops of the MLCK sequence
form a single patch on the surface of the model, suggesting that this region
confers distinct regulatory characteristics to MLCK.

27) Sequencing the Chlamydia trachomatis Genome
Sue Kalman, Ed Allen, Rina Araujo, John Carpenter, Ed Chung, Fred Dietrich,
Caridad Komp, Gary Otto, Howard Lew, David Lin, Surya Mirthipati, Allen
Namath, Peter Oefner, Fabien Petel, Neil Shroff, Ronald Davis, and Richard

28) Application of Genomic Mismatch Scanning to a Mammalian Genome.
Linda McAllister & Patrick O. Brown

Application of Genomic Mismatch Scanning to a Mammalian Genome.
Linda McAllister & Patrick O. Brown
Dept of Biochemistry, Div of Cardiology, and HHMI
Stanford University

Analysis of complex genetic traits, such as diabetes or high blood pressure, using
current methods is arduous and costly. Genomic Mismatch Scanning (GMS), first developed
in Saccharomyces cerevisiae (Nelson et al.), is a powerful alternative to the current
approaches to linkage analysis. By utilizing the mismatch repair system of E. coli
to recognize single base pair mismatches in total genomic DNA, GMS isolates the identical
by descent (IBD) DNA shared by two affected related individuals. This DNA can be mapped
in a single hybridization. Although for any two related individuals a large fraction
of the total DNA is IBD (the fraction is determined by their relatedness), the superposition
of many of the IBD maps will reveal trait-associated loci. We present here the successful
application of GMS to a mammalian genome. We employ a simple mouse pedigree: a cross between
two very divergent laboratory strains to generate a heterozygous F1. GMS was performed on
total genomic DNA from the F1 in combination with one of the parents. PCR genotyping of
microsatellite repeats was employed to test the GMS selection. Only the alleles IBD between
the two mice were detected in the final GMS product. The size and repeat content of human DNA
are very similar to those of mouse DNA; thus, we anticipate little difficulty in applying GMS to humans.

Nelson SF, McCusker JH, Sander MA, Kee Y, Modrich P, Brown PO.
Genomic mismatch scanning: a new approach to genetic linkage mapping.
Nature Genetics 4, 11-18 (1993).

29) The Protein and Nucleic Acid (PAN) Facility
Al Smith

DNA Probe Arrays - Accessing Genetic Diversity

R. J. Lipshutz, D. Morris, M. Chee, E. Hubbell, M. J. Kozal, N. Shah,
N. Shen, R. Yang and S. P. A. Fodor

Affymetrix, 3380 Central Expressway, Santa Clara, CA 95051

As the Human Genome Program and related efforts identify and determine
the sequence of human genes, it is important that highly reliable and
efficient mechanisms are found to access individual genetic variation.
It is only through more effective access to genetic information that the
true benefit of the Human Genome Program will be realized.
Light-directed chemical synthesis has been used to generate miniaturized,
high-density arrays of oligonucleotide probes. Application-specific
oligonucleotide probe array designs have been developed to rapidly resequence
known genes. These probe arrays are then used for parallel DNA hybridization
analysis, directly yielding sequence information from genomic DNA.
Dedicated instrumentation and software have been developed for array hybridization,
fluorescent detection, and data acquisition and analysis. Experiments demonstrating
the effectiveness of these methods will be described. More generally, DNA probe
arrays are proving to be a powerful tool for rapid investigations in sequence
checking, pathogen detection, expression monitoring, and DNA molecular recognition.

Molecular Applications Group

Molecular Applications Group will be presenting the latest versions
of our bioinformatics and macromolecular analysis and modeling software.

LOOK v2.0

LOOK is a revolutionary software tool designed to access, integrate, and apply relevant
information in a uniquely useful way. LOOK's informatics tools allow you to seamlessly
access and manage protein sequences, structures, literature, and data. LOOK's science
applications help molecular biologists use this information to design experiments and
interpret their results, saving them valuable research time. LOOK offers molecular
biologists the tools they need to get information on a protein, to test hypotheses
before doing the actual experiments, and to communicate their insights with others,
all in one package. LOOK's intuitive, user-friendly interface makes the full power
of all of these features readily accessible to any user, making it an invaluable partner
to laboratory work.

LOOK's SegMod (Segment Match Modeling) accessory package brings a novel and highly accurate
approach to homology modeling. SegMod combines automated simplicity with a powerful and accurate
algorithm to make homology modeling accessible to modelers and non-modelers alike.

MacImdad v5.2

MacImdad is a versatile and easy to use research and teaching tool designed for your macromolecular
visualization and analysis needs. It is ideal for display and analysis of complex macromolecular
structure, for creation of publication-quality graphics and for interactive classroom presentations.

Demonstrators: Michael Mueller, PhD and Charlene Son

Automated DNA Sequencing and Analysis
Ronald W. Davis, Ph.D.

We are developing a highly automated, totally integrated system for very high throughput DNA sequencing.

The system will consist of six modules. Each module will be interconnected with a custom built
random access microtiter plate server, and electronically connected with a sample-tracking database.
The first module will shear source DNA to a uniform size that can be used in library construction.
The DNA is of such quality that it does not need to be size fractionated or enzymatically repaired.
This system will be capable of generating DNA samples for construction of up to 500 libraries per day.
Plaque plates will be sent to the automated plaque/colony picker which has already been fabricated and
can pick 50,000 plaques/colonies per day. Phage or cells will be grown and sent to a new template
preparation station. This station will prepare sequencing templates and be capable of generating
10,000 to 20,000 templates per day.

These templates will then be sent to a cycle sequencing station which will be capable of
carrying out cycle sequencing on all of the templates in a volume of 1 ml or less.
The sequencing reactions will then be sent to another module that will load and electrophorese
polyacrylamide gels. This instrument will use a very high density of lanes and be capable of
producing 10,000 lanes of DNA sequence per day. Sequences will be called with DNA base-calling
and lane-tracking software that has already been developed. Finishing of
the sequence will be conducted by synthesizing a new primer. The new primer will be produced
on a 96 well oligosynthesizer. This module will be a more automated design from an existing

We are also conducting a project to improve the quality of the sequence data and thereby
the accuracy of the base calls. This will largely be through modifications of the cycle
sequencing enzyme and the chemistry of sequencing. This system should be capable of generating
10,000 sequencing lanes per day for approximately 5 million bases of raw sequence per day.
The best estimate of cost if this system is successful is $0.01 per base of finished sequence.
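
The two daily figures quoted above imply an average read length per lane. The short check below works that out and, as a rough illustration only, estimates how many days of raw output a 1.04 Mb bacterial genome (the size quoted in the Chlamydia trachomatis abstract in this program) would need. The 8-fold shotgun redundancy is a hypothetical value chosen for the example, not a figure from the abstract, and the function names are likewise invented here.

```python
def implied_read_length(raw_bases_per_day, lanes_per_day):
    """Average raw bases per lane implied by the two daily figures."""
    return raw_bases_per_day / lanes_per_day


def days_of_raw_sequencing(genome_size_bases, coverage, raw_bases_per_day):
    """Days of raw output needed to cover a genome `coverage` times over."""
    return genome_size_bases * coverage / raw_bases_per_day


LANES_PER_DAY = 10_000          # from the abstract
RAW_BASES_PER_DAY = 5_000_000   # from the abstract
ASSUMED_COVERAGE = 8            # hypothetical shotgun redundancy, not from the source

read_length = implied_read_length(RAW_BASES_PER_DAY, LANES_PER_DAY)
days = days_of_raw_sequencing(1_040_000, ASSUMED_COVERAGE, RAW_BASES_PER_DAY)

print(f"implied read length: {read_length:.0f} bases per lane")
print(f"days for a 1.04 Mb genome at {ASSUMED_COVERAGE}x: {days:.2f}")
```

The quoted cost of $0.01 per finished base cannot be checked the same way, because it depends on the redundancy and finishing effort, neither of which is stated.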

Gene networks present a new challenge after DNA sequences are known. Clusters
of genes and related biochemical reactions are being identified with logic and
control functions such as switches and environmental sensors. Networks of many
genes are being identified with larger cellular functions such as flagella
synthesis, chemotaxis, and regulation of metabolic pathways. We need a
simulation capability to help us understand the dynamical behavior of these
genetic circuits and to verify hypothesized regulatory circuits by comparing
predicted behavior to observations. Intracellular protein concentrations of a
few tens of nanomolar often determine outcomes in genetic circuits. At these
low concentrations, the assumptions underlying chemical modeling using coupled
systems of macroscopic kinetic equations are not valid. Both the time interval
between transcripts and the number of proteins produced per transcript are
determined by stochastic processes governed by highly skewed probability
distributions with long tails. Exact modeling of these phenomena and other
reactions in the bacteriophage lambda decision circuit has been accomplished
using a stochastic simulation algorithm. Results show that fluctuations in low
concentration protein signals are important both to the design and function of
mechanisms that determine genetic outcomes. The modeling approach and software
architecture developed in this work suggest directions for development of
"user-friendly" software tools suitable for modeling large genetic networks
comprised of tens to hundreds of genes.
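
The abstract does not say which stochastic simulation algorithm was used or give any rate constants. As a minimal sketch of the general idea, the code below applies Gillespie's direct method to a hypothetical two-reaction model (constant-rate protein production and first-order protein decay). The function name, the model and the rate values are assumptions made for illustration and are not taken from the phage lambda work.

```python
import random


def gillespie_birth_death(k_make, k_deg, n0, t_end, seed=0):
    """Simulate a protein count n with Gillespie's direct method.

    Reactions (hypothetical toy model):
      production:  n -> n + 1   with propensity k_make
      decay:       n -> n - 1   with propensity k_deg * n
    Returns the event times and the protein count after each event.
    """
    rng = random.Random(seed)
    t, n = 0.0, n0
    times, counts = [t], [n]
    while True:
        a_make = k_make
        a_deg = k_deg * n
        a_total = a_make + a_deg
        if a_total == 0.0:
            break                      # nothing can happen any more
        # Waiting time to the next reaction is exponential with rate a_total.
        t += rng.expovariate(a_total)
        if t > t_end:
            break
        # Choose which reaction fires, in proportion to its propensity.
        if rng.random() * a_total < a_make:
            n += 1
        else:
            n -= 1
        times.append(t)
        counts.append(n)
    return times, counts


times, counts = gillespie_birth_death(k_make=2.0, k_deg=0.1, n0=0, t_end=100.0)
print(f"{len(times) - 1} reaction events, final protein count {counts[-1]}")
```

Because each molecule is counted individually, the trajectory fluctuates around the deterministic steady state (k_make / k_deg in this toy model) instead of converging smoothly to it. This is the kind of low-copy-number behavior that macroscopic kinetic equations miss.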

Correlated changes in biological sequences are important indicators of
structural and functional interactions. In previous work we have described
a system for the discovery, representation and use of sequence correlations.
We have reported in detail the application of this system to tRNA sequences
(Klingler & Brutlag, ISMB-93) and to α-helical sequences (Klingler & Brutlag,
Protein Science, 1995). In these and other applications we have been required
to define residue alphabets in order to discover significant correlations.
In this poster we describe an automated procedure for discovery of correlation
alphabets. Instead of letting our system run exhaustive correlation searches
with a list of alphabets, this procedure derives alphabets specific for any
two positions in a sequence alignment. The procedure consists of:

1. Start with a full alphabet for each position in the sequence pair
(i.e. individual building blocks) and a full contingency table.
2. Remove groups which occur infrequently in either position (i.e. rows
or columns with small sums in the contingency table).
3. In both the rows and columns, iteratively combine groups with the most
similar frequencies (i.e. combine rows and columns that have the most
similar distributions).
4. Test all combinations of row and column groups for the most significant
dependency using statistical and informational measures.
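
The steps above can be sketched on a toy contingency table. This is an illustration under stated assumptions, not the authors' implementation. The counts are made up, the similarity measure (L1 distance between normalized rows) and the chi-square statistic are one plausible choice among the "statistical and informational measures" mentioned, and the sketch performs a single merge per axis and does not test all groupings. All function names are hypothetical.

```python
def drop_rare(table, labels, min_count):
    """Step 2: drop rows whose total count is below min_count."""
    keep = [i for i, row in enumerate(table) if sum(row) >= min_count]
    return [table[i] for i in keep], [labels[i] for i in keep]


def transpose(table):
    return [list(col) for col in zip(*table)]


def profile(row):
    """Normalize a row of counts to frequencies (all zeros if the row is empty)."""
    total = sum(row)
    return [x / total if total else 0.0 for x in row]


def merge_most_similar(table, labels):
    """Step 3, one iteration: merge the two rows with the closest profiles.

    Assumes the table has at least two rows.
    """
    best = None
    for i in range(len(table)):
        for j in range(i + 1, len(table)):
            dist = sum(abs(a - b)
                       for a, b in zip(profile(table[i]), profile(table[j])))
            if best is None or dist < best[0]:
                best = (dist, i, j)
    _, i, j = best
    rest = [k for k in range(len(table)) if k not in (i, j)]
    merged_row = [a + b for a, b in zip(table[i], table[j])]
    return ([table[k] for k in rest] + [merged_row],
            [labels[k] for k in rest] + [labels[i] + labels[j]])


def chi_square(table):
    """Step 4 (one candidate measure): Pearson chi-square for dependency."""
    row_sums = [sum(r) for r in table]
    col_sums = [sum(c) for c in zip(*table)]
    n = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


# Made-up counts: rows are residues at position i, columns residues at position j.
row_names = ["A", "L", "K", "E", "W"]
col_names = ["A", "L", "K", "E"]
counts = [
    [30, 28, 5, 6],
    [29, 31, 4, 5],
    [5, 4, 2, 40],
    [6, 5, 38, 3],
    [1, 0, 0, 1],
]

rows, row_labels = drop_rare(counts, row_names, min_count=5)
rows, row_labels = merge_most_similar(rows, row_labels)
cols, col_labels = drop_rare(transpose(rows), col_names, min_count=5)
cols, col_labels = merge_most_similar(cols, col_labels)
final = transpose(cols)

print(row_labels, col_labels)
print(final, round(chi_square(final), 1))
```

In this toy table the rare residue W is dropped and A and L, which have nearly identical distributions, collapse into one group on each axis, leaving a 3 x 3 table over the groups K, E and AL.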

We have tested this procedure on the α-helix data, generating groups of amino
acids for each position in (i,i+1), (i,i+2), (i,i+3), (i,i+4) and (i,i+5)
sequence pairs. Based on our previous analyses we have been able to confirm
that the derived alphabets, particularly for the (i,i+3) and (i,i+4) relative
positions, are meaningful and interesting. We have also been able to
automatically interpret alphabet groups by assigning them semantic meanings
based on common properties (e.g. hydrophobicity, charge, size). Furthermore,
using randomly generated data sets, we have been able to characterize the
usefulness of this approach.


The PAN Facility is a support facility designed to meet the needs
of members of the Program for Molecular & Genetic Medicine and the Beckman
Center at Stanford University Medical Center. It is located in rooms B065
and B017 of the Beckman Center. The facility contains state-of-the-art DNA
synthesis and DNA sequencing. The facility currently has a full-time staff
of nine, including the Director, Alan Smith. The PAN personnel can be
reached at (415) 723-1907 and 723-3189. It is operated as a fee-for-service
facility and administered through the School of Medicine (see attachment
for fee schedule).

Amino Acid Analysis

Automated amino acid analysis is performed on a ninhydrin-based
Beckman 7300 analyzer. The system is configured for quantitative and
qualitative analyses of acid hydrolysates. Analysis can be performed on
submicrogram quantities of proteins and subnanomole amounts of peptides.

Protein & Peptide Sequencing

The facility contains three high sensitivity automated protein
sequencers equipped with on-line HPLC's. An Applied Biosystems 470A gas
phase sequencer and Applied Biosystems 473 and 477 pulsed liquid sequencers
are available. These instruments are capable of sequencing proteins and
peptides at the low picomole level.

Peptide Synthesis

PAN contains a Rainin Symphony 12 column peptide synthesizer and an
Applied Biosystems 431A automated peptide synthesizer. Both instruments
employ FMOC chemistry for synthesis and are equipped with efficiency
monitoring systems. The standard synthesis level is 0.1 - 0.2 mmolar which
should produce approximately 200 mg of a crude 20-mer for the ABI
instrument. The Symphony will make 10mmolar syntheses for immunogen
production. Peptides are purified on a Waters Deltaprep 3000 preparative
HPLC. All peptides, either crude or purified, will be returned with an
analytical scale HPLC profile and amino acid composition.

DNA Synthesis

The facility contains Applied Biosystems 380B 3-column and 394
4-column automated DNA synthesizers and a 3894 48-column instrument. The
standard syntheses are performed using the cyanoethyl phosphoramidite
chemistry. The standard synthesis level is 40 nanomolar although 0.2, 1.0
and 10 micromolar syntheses are also available. Crude oligos are analyzed
by a U.V. shadowed photograph or a capillary electrophoresis profile. A
variety of modifications are also available including phosphorothioation,
biotinylation, 5' amine, 3' and 5' phosphate and 5' thiol. The facility
maintains a supply of commonly used primers.

DNA Sequencing

Automated DNA sequencing utilizes an Applied Biosystems 373A
instrument with Seq. Ed. software. Template sources can be either single
or double stranded DNA and can be sequenced using Sequenase or Taq
polymerase. PCR products can also be sequenced but care must be taken to
remove excess primers. As a general rule, better quality sequence data is
obtained from primer extension sequencing chemistry which utilizes
manufacturer supplied template kits. However, gene walking using custom
synthesized primers and dye terminator chemistry is also available.

Additional Capabilities

The PAN Facility contains an Applied Biosystems 140 narrow bore
HPLC system, a Beckman Instruments standard bore HPLC and a Beckman
Instruments 168 diode array detector. Using this instrumentation, PAN is
able to perform protein digestions in solution or in gels and separate the
resulting peptides for subsequent sequence and amino acid analyses. In
addition, proteins can be blotted onto PVDF or isolated as stained bands on
a gel, subjected to proteolytic digestion, narrowbore HPLC separation and
subsequent protein sequencing.

The facility also contains a Beckman Instrument P/ACE capillary
electrophoresis instrument which is used for a variety of protein and DNA

The Bioinformatics Resource and the Molecular Modeling Laboratory.
Lee Kozar

The Bioinformatics Resource provides computer support for over 1300
biological researchers in several departments on campus who are members of
the Program in Genetic and Molecular Medicine, as well as many local
companies who are affiliated with the Stanford Spectrum program. Both
commercial and public domain software for sequence analysis, image analysis
and molecular modeling are available. Popular programs include the Genetics
Computer Group and the Intelligenetics sequence analysis packages, and the
Insight, Sybyl and Look molecular modeling software. Most of the common
molecular biology databases such as GenBank, the Protein Databank of X-Ray
Crystallographic Data, and the Protein Information Resource of protein
sequences are accessible either locally or over the Internet. The computer
laboratory, comprised of SUN and Silicon Graphics workstations, is often
used for regular courses and special courses are regularly taught on the
use of the various software packages.

Sequencing by hybridization (SBH): A Genome Analysis Platform

Radoje Drmanac, Hyseq, Inc.
670 Almanor Ave., Sunnyvale CA 94086

We have developed a high throughput hybridization facility by
integrating i) high throughput DNA amplification, ii) fast spotting
of 31,000 DNA samples on a 6 x 9 inch membrane using an 864-pin tool
and iii) automated hybridization of 96 probes per day per machine.
The production of 10 million DNA-probe scores per day allows
similar turn-around times for both preparation and SBH analysis of
a shotgun-BAC,-YAC,-bacterial genome, or a cDNA library. The recent
analysis with newly developed image processing and clone matching
software on 200,000 cDNA clones shows the unique performance levels
of SBH in gene discovery and gene expression studies. As many as 50
Mb of DNA comprising new genes or clones uniformly covering a
genomic sample is processed as a single batch on a set of replica

Hyseq is also developing super-chips consisting of an array of
oligonucleotide arrays spaced by hydrophobic strips. The first
generation of super-chips has been manufactured by dispensing of
pre-made oligonucleotides to the support. The process is 100-fold
more effective in terms of speed, cost and performance than in situ
synthesis. Long oligonucleotides (10- to 14-mers) are scored
efficiently by ligating support-bound and labeled unbound 5- to 7-mer
probes using target DNA as hybridization templates (SBH format
3). The full-match/mismatch discrimination obtained is about 10- to
50-fold, providing a basis for accurate and low-cost horizontal
(--ATTC---CTGC--) or vertical (--ATTC--; --AgTC--) sequencing of over
10,000 bases in a single reaction.

Sequencing the Chlamydia trachomatis Genome

Sue Kalman (1), Ed Allen (2), Rina Araujo (1), John Carpenter, Ed Chung (1), Fred Dietrich,
Caridad Komp (1), Gary Otto (2), Howard Lew (1), David Lin (1), Surya Mirthipati (1), Allen
Namath (1), Peter Oefner (1), Fabien Petel (2), Neil Shroff (1), Ronald Davis (1), and Richard

Department of Biochemistry (1), Department of Genetics (2), Stanford University,
Stanford, CA 94305, and Program in Infectious Diseases (3), School of Public
Health, University of California, Berkeley, CA 94720

Chlamydia trachomatis is an obligate intracellular bacterium which causes
several human diseases. It is the leading cause of ocular trachoma, a form of
preventable blindness. C. trachomatis also causes the most prevalent sexually
transmitted disease in the U.S. Yet, relatively little is known regarding the
biology of chlamydiae and their interaction with eukaryotic host cells.

We have begun sequencing the C. trachomatis genome, which
consists of a circular chromosome of approximately 1.04 Mb and a 7.4 kb plasmid.
The entire genome was shotgun-cloned in M13. Sequence distribution and gene similarity
analysis will be presented.

The structure of the procaryotic 16S ribosomal subunit is the subject of intense
interest because of its importance in the translation of the genetic code into
polypeptides. This subunit, the site of translation initiation, is also the site of
action of many commonly used antibiotics, including streptomycin and gentamicin.
The subunit is composed of a single strand of RNA with 1542 bases (in E. coli)
and 21 separate polypeptides, ranging in size from 9 kD to 60 kD.
So far, this structure is too large (and possibly too flexible)
to submit to X-ray diffraction studies. Thus, the primary sources of data are the
results of biochemical experiments (cross-linking, RNA protection, labelling) as
well as comparative analysis of the RNA from multiple species. We have developed
methods for exploring the range of conformations that are compatible with these
data sources, and have built a model of the 16S ribosomal subunit that is
approximately 14 Å equivalent resolution. We have also docked the tRNA
crystal structure into the binding cleft within the 16S, and can begin to
think about the detailed contacts that are made in the complex, and the
possible implications for mechanisms of translation.

The two principal scientific issues that arise in computing this structure are:

1) How can we generate reasonable structures using data that are noisy,
uncertain and of relatively low abundance? We have employed a modification
of the PROTEAN program that was developed for the interpretation of NMR
data on proteins.

2) How can we organize the myriad information about the 30S subunit?
We are building a prototype next-generation structural information
resource that allows World Wide Web-based browsing, evaluation and
computation of structure. This resource, called RiboWeb, employs a
knowledge base of the structural components, structural data, and their
occurrence in the biochemical literature on the 30S ribosomal subunit to
provide integrated data analysis services.

Currently, the Protein Data Bank (PDB) holds structural information
describing approximately 3,000 structures. That number is expected to
grow exponentially to 30,000 over the next several years.

The explosive rate at which new structural data is becoming
available exceeds the rate at which the data can be visualized and
studied carefully.

One approach to tackling this problem is to group proteins
into families. The basis of such a classification is that members of
a protein family have similar overall folds, but differences in
detailed structure.

In our work, we focus on one commonality among protein family
members - a shared set of atoms that occupy roughly the same relative
positions in space. Our focus here is on identifying this set of
atoms -- the core -- and using it as a concise statistical summary of
a protein family.
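
The poster text does not describe how the core is computed. One simple way to make the idea concrete, offered here only as a sketch, is to take structures that have already been aligned and superimposed and keep the atom positions whose spread about their mean location is small. The function name, the threshold and the coordinates below are all hypothetical.

```python
import math


def core_positions(structures, max_spread):
    """Return indices of aligned atoms whose RMS spread about their mean
    position, across all structures, is at most max_spread.

    `structures` is a list of equally long lists of (x, y, z) tuples and is
    assumed to be already aligned and superimposed.
    """
    n_atoms = len(structures[0])
    core = []
    for i in range(n_atoms):
        points = [s[i] for s in structures]
        mean = [sum(p[k] for p in points) / len(points) for k in range(3)]
        mean_sq_dev = sum(math.dist(p, mean) ** 2 for p in points) / len(points)
        if math.sqrt(mean_sq_dev) <= max_spread:
            core.append(i)
    return core


# Made-up coordinates for three family members with three aligned atoms each.
family = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(0.1, 0.0, 0.0), (1.0, 0.1, 0.0), (2.0, 3.0, 0.0)],
    [(0.0, 0.1, 0.0), (1.1, 0.0, 0.0), (5.0, 0.0, 0.0)],
]

print(core_positions(family, max_spread=0.5))
```

In this toy family the first two atoms sit in nearly the same place in every member and are kept, while the third moves by several units and is excluded from the core.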

Our specific objective has been to produce, from multiple
alignments of structures, a library of core structures for protein
families. Today we are in the process of making core structural
information (data and images) for fifty protein families available
over the World Wide Web.

Exploring the diversity of uncultivated microbial pathogens with amplified ribosomal DNA sequences
K Yoder, I Kroes, D Fredricks, D Relman

Traditional approaches for the identification of microbial pathogens have relied
heavily upon propagation or purification of the microorganism in the laboratory,
and subsequent phenotypic analysis. These approaches are cumbersome and inaccurate;
furthermore, they ignore microorganisms that resist cultivation. These microorganisms
are known or suspected to cause a variety of human diseases. We have taken advantage
of small subunit ribosomal DNA (ss rDNA) sequences as a reliable means for determining
phylogenetic relationships, and have developed methods for amplifying microbial ss rDNA
directly from infected host tissue. Example applications include the following:
1) identification and characterization of uncultivated bacterial pathogens such as
the Whipple's disease bacillus; 2) development of broad range primers for fungal
identification; 3) phylogenetic analysis of the human intestinal coccidian pathogen
Cyclospora; 4) analysis of bacterial diversity within a human commensal microflora;
and 5) investigation of possible microbial etiologies for a number of chronic
unexplained inflammatory diseases.