Biochemistry 218 Computational Molecular Biology Final Project
Peter Feng December 10, 1998
A survey of computational algorithms to analyze patterns of gene expression from cDNA/DNA chip Microarray data.
INTRODUCTION
The cDNA microarray and DNA chip (microarrays packed densely with short oligonucleotides) technologies allow rapid, parallel analysis of gene expression across an entire genome (e.g. yeast), or of the functional state of a population of cells responding to a stimulus. The technology has been applied to further our understanding of various biological processes: the identification of genes involved in complex human diseases such as cancer and inflammation (rheumatoid arthritis and inflammatory bowel disease), and basic processes such as sporulation and cell-cycle regulation in yeast.
DNA chip technology is also being developed to screen for genetic mutations (10), e.g. to examine tissue samples simultaneously for known mutations in cystic fibrosis, for drug-resistant HIV variants, or for mutations implicated in colon cancer. These applications have primarily diagnostic value. While they are very promising, the computer algorithms I will discuss are not intended to be applied to data generated by this type of microarray experiment.
The expression patterns of 1,000-10,000 different genes provided by this technology promise to shed light on basic biological function in lower organisms, and to help identify genetic keys to the onset and progression of human disease as well as to differential responses to pharmaceuticals. Goals that the technology hopes to address include the characterization of a signalling pathway’s transcriptional program, the screening of potential drug candidates for toxicity signatures, the identification of putative roles played by genes for which only sequence information is available, and the identification of previously undetected regulatory elements in the promoter regions of co-regulated genes.
Microarray technologies should significantly help biologists keep pace with the current world-wide sequencing projects (by adding to the annotation of sequenced, yet uncharacterized genes whose sequence information resides in computer databases) and provide clues to understanding the complex genetic circuitry that explains life on a molecular level.
The contribution of an individual gene, novel or known, to a process is quantified by comparing the fluorescence of experimental mRNA and control mRNA labeled with different fluorescent markers. A step-by-step description follows. First one isolates RNA from samples representing an experimental time point and a reference, e.g. RNA harvested at time 0. Then mRNA is isolated with an oligo dT column that recognizes polyA tails (which distinguish mRNA from tRNA and rRNA).
Fluorescently labeled cDNA probes are then prepared by reverse transcription in the presence of Cy3 (green) or Cy5 (red) labeled dUTP and hybridized to the microarray. Two types of microarray have been used: one densely packed with cDNA sequences printed in a high-density array on a glass microscope slide (e.g. Pat Brown), and one carrying 25-mer oligonucleotides (e.g. Affymetrix).
After hybridization, the arrays are scanned on a scanning laser fluorescence microscope (a separate scan for each fluorophore, followed by combination of the two scans for analysis). Genes induced or repressed appear as red or green spots, while genes expressed at roughly equal levels in both samples appear in the image as yellow spots. Moreover, the ratio of the fluorescence intensities measured for the two fluorophores at each distinct gene provides a reliable measure of the relative abundance of the corresponding mRNA in the two cell populations.
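The ratio measurement just described can be illustrated with a minimal Python sketch. It assumes the experimental sample carries the Cy5 (red) label and the reference the Cy3 (green) label, so that induced genes come out red; the function names and the 2-fold cutoff are hypothetical choices for illustration, not values taken from the papers.

```python
import math

def log2_ratio(cy5, cy3):
    """Log-transformed ratio of experimental (Cy5) to reference (Cy3) signal.

    Assumption: Cy5 labels the experimental sample, Cy3 the reference.
    """
    if cy5 <= 0 or cy3 <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(cy5 / cy3)

def spot_color(cy5, cy3, fold=2.0):
    """Call a spot red (induced), green (repressed) or yellow (unchanged).

    The fold cutoff is a hypothetical threshold chosen for this example.
    """
    ratio = log2_ratio(cy5, cy3)
    cutoff = math.log2(fold)
    if ratio >= cutoff:
        return "red"
    if ratio <= -cutoff:
        return "green"
    return "yellow"
```

Working on the log scale makes a 4-fold induction (+2) and a 4-fold repression (-2) symmetric around zero, which is why the later similarity calculations start from log-transformed ratios.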
One of the basic questions scientists using the microarray initially sought to answer was of a screening nature: the identification of genes that are differentially expressed in certain conditions. The initial work was presented as an approach that was ‘more powerful’ than other molecular biological screening methods for identifying novel genes, such as differential display and subtractive cloning.
In the ‘genomic era’ where all protein-encoding sequences of an organism(such as yeast) are available, the study of higher-order relationships found in microarray data is sought by clustering genes based on similar expression patterns. While the identification of genes which play important roles is also an aim of global monitoring procedures, this paper hopes to show that the choice of algorithm affects the amount of knowledge that can be extracted from raw data.
I will explore the transition from simple approaches that examine extremes to statistical/mathematical approaches that reveal clusters of genes with similar patterns of expression. These new approaches not only simplify the way conclusions are drawn compared with the simple method, but also increase one’s confidence in the reliability of the microarray as a tool to study complicated biological processes.
Because the data sets produced by microarray experiments are quite large, sophisticated clustering and classification algorithms are needed to order genes based on their patterns of expression, so that biologists can rapidly assimilate and interpret the data. A technique developed by Eisen et al. uses color coding and intensity to display gene expression quantitatively: induced genes are red and repressed genes are green. Previously, the entire range of microarray data (e.g. 40,000 points for 4000 genes at 10 time points) was arranged in tables of numbers representing the degree of induction or repression.
An example of the power of clustering algorithms is the following: upon a rapid scan of the color-coded patterns in Spellman et al. Figure 1B (http://genome-www.stanford.edu/cellcycle/figures/figure1B.html), one notices the conservation of a histone cluster’s response throughout the course of the cell cycle. Just as the sequence of histones is highly conserved across evolution, it appears that the function and the contribution of the individual family members to cell-cycle progression is also conserved across the time course of the experiment. This type of higher-order relationship might be uncovered using the manual method, but would require a more laborious procedure: finding a group of genes that display similar patterns of expression (through examination of extremes between time points) and fitting the curves on a 2D plot of fluorescence ratio vs. time (discussed in more detail below). This procedure might actually group more genes into the expression pattern than simply the histones: for example, CLN2 displays the same pattern of high induction followed by dramatic repression, repeating over time in the four experimental datasets (Figure 1B, Spellman et al.).
A fundamental goal in designing algorithms to cluster genes is to simplify the raw data (reading digits is a rate-limiting step for the human brain in the interpretation of data) and to facilitate the visualization (through color-coded schemes) and interpretation of the global gene expression program represented in the large body of data. Ideally, one desires an approach that encourages both intuitive and integrative interpretations of the raw data, both verifying previously reported data and establishing new relationships of structural, sequence, or functional similarity within groups.
I will be giving a detailed description of four methods to analyze data compiled from microarray experiments to study gene expression patterns, highlighting the advantages and disadvantages of each approach in comparison to each other. Although there are many statistical and mathematical methods that might be used to analyze gene chip data, I will restrict my discussion to only those methods that have been successfully used to establish new biological findings and verify previous experimental findings.
The four methods are:
1. A manual classification: simple examination of extremes and of the relationship between each class that is highly induced.
2. Unsupervised clustering (no a priori comparison to standard vectors).
3. A hybrid clustering approach: parametric unsupervised clustering (a Fourier transform to gauge periodicity) followed by a supervised clustering algorithm (comparison to standard vectors).
4. A neural network approach using self-organizing maps, combined with supervised clustering.
Because I didn’t find any publications using classification algorithms to analyze gene expression patterns, I will briefly mention the potential use of a classification algorithm called CART, Classification and Regression Trees, in my conclusion.
MANUAL APPROACH
Why?
One uses the manual approach to analyze gene expression patterns in order to filter out the ‘important’ genes which make the most significant contributions to a particular cellular process. It was the only method of analyzing microarray data before the work of Michael Eisen in the Botstein lab.
What is it?
Genes are ordered by the extremes of their induction or repression, usually requiring a signal two or three standard deviations above background (estimated from unrelated genes such as Arabidopsis genes). The main method of organizing the data for interpretation is a plot of the log-transformed ratio of fluorescence of experimental RNA vs. control RNA (y axis) against time (x axis). These 2D plots have been used to monitor the transcriptional activity of selected groups: cytokines, chemokines and transcription factors in the study of inflammation-related genes, as well as genes whose transcript level fluctuates during the synchronized progression of the cell cycle (Cho et al.).
A second approach involves selecting genes which display the same dramatic changes between two time points. This is used to illustrate that temporal patterns of gene expression do exist and that groups of genes coordinately act together on a molecular level to change the physiological state of an organism(yeast). Please see figure 4 in this paper
http://www.sciencemag.org/cgi/content/full/278/5338/680
The table format is used to order up-regulated genes by the elevation of experimental fluorescence over control. This can be achieved by simply ordering genes by their "fold-increase" over control.
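The ordering by fold-increase amounts to a filter followed by a sort. The following Python sketch shows the idea under stated assumptions: the gene names, the ratios and the 2-fold cutoff are invented for illustration and do not come from any of the papers discussed.

```python
def rank_by_fold_increase(ratios, min_fold=2.0):
    """Return (gene, ratio) pairs at or above min_fold, largest ratio first.

    ratios maps a gene name to its experimental/control fluorescence ratio.
    min_fold is a hypothetical cutoff for calling a gene up-regulated.
    """
    kept = [(gene, ratio) for gene, ratio in ratios.items() if ratio >= min_fold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Invented example data: four genes at one time point.
example = {"geneA": 7.3, "geneB": 1.1, "geneC": 3.2, "geneD": 0.4}
ranked = rank_by_fold_increase(example)
```

Note that geneB and geneD simply drop out of the table, which is exactly the incomplete use of data that the manual method is criticized for later on.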
Please see table I in the metabolic transition paper mentioned above.
http://www.sciencemag.org/cgi/content/full/278/5338/680
Advantages:
1. Efficient in screens for identifying potential tumor markers or drug targets.
The manual approach yields fast, consistent results to identify genes which are consistently up-regulated and probably play the most important roles in a particular process.
2. It is possible to construct temporal patterns of induction or repression to group genes.
Go to the following url to see this conserved pattern of expression
http://www.sciencemag.org/cgi/content/full/278/5338/680 (Figures 4 and 5) for an example of the manual classification of data on a 2D plot of the ratio of induction or repression over time. This shows that it is possible to group patterns of gene expression representative of genes of similar function in a manner which is visually satisfying. This application of the manual method effectively classifies pre-replication genes as a class whose peak activity precedes that of replication genes.
Please see:
http://www.molecule.org/cgi/content/full/2/1/65/F4. I apologize if you don’t have a Molecular Cell on-line account.
Disadvantages:
1. The conclusions are too simple.
For example, in Heller et al.’s report on discovery and analysis of inflammatory disease related genes, there is only mention of three genes out of 1000 genes from a cDNA peripheral blood library that were differentially expressed in rheumatoid arthritis tissue in comparison to inflammatory bowel disease. There must be more than three genes which contribute significantly to the arthritic condition that aren’t responsible for the pathology of inflammatory bowel disease.
They also used a 96-element microarray selected for genes with functional roles that implicate them in promoting inflammation. The data that they present is simply a two-dimensional graph plotting the log ratio of experimental/reference fluorescence vs. time for four known classes (transcription factors, chemokines, cytokines and metalloproteinases). Please go to:
http://www.pnas.org/cgi/content/full/94/6/2150/F2
This approach is very simplistic in that it doesn’t reveal relationships, such as which transcription factor is linked to which chemokine, that a clustering algorithm might reveal. Their conclusions were also very simple: three genes out of 96 showed novel participation in promoting both disease conditions.
Table I (http://www.sciencemag.org/cgi/content/full/278/5338/680) in DeRisi et al’s report of the yeast transcriptional response to a metabolic shift lists, in order, genes significantly induced by overexpression of YAP1, which promotes stress resistance. The only parameter reported besides fold increase above normal was the distance of a YAP1 binding site from the ATG. This table merely lists a few enzymes which are part of an energy production biochemical pathway. The distance from the ATG doesn’t seem to affect expression patterns significantly in my opinion, so one is left with only a group of genes which are up-regulated by another gene. I call this a simple conclusion because one could easily establish this relationship of induction with standard molecular biological techniques such as the northern blot (if one picked the ‘right markers’).
2. The approach doesn’t facilitate discovery of "higher order" relationships.
In DeRisi et al’s report exploring the transcriptional response to the diauxic shift (fermentation to respiration), there appears a consistent pattern of gene expression among the ribosomal proteins (112 genes in the data set). While this pattern suggests the success of the manual approach in clustering genes of similar function, it fails to answer the question of the involvement of other clusters of genes and their potential contributions to the overall process, e.g. establishing the early groundwork to maintain the biological decision to switch energy metabolism machinery. Intuitively, it is easy to imagine that there are other clusters of genes with smaller induction levels which might never make the threshold cut-off to be considered for plotting.
The conclusions drawn in Cho et al’s report on genome-wide transcriptional analysis of the cell cycle merely shed light on genes with periodic expression. The studies could only give a binary answer: yes or no, this family of genes displays periodic changes in transcript level. The scientists reported very simply that the proteolytic effectors (known elements of the APC, CDC34 or ubiquitin complexes), whose expression is important for regulation of cell-cycle periodicity, are not regulated by periodic transcription. It is very hard for the manual approach to home in on a cluster of genes whose expression peaks just before the initial transcription of the proteolytic machinery and later disappears (if the levels of induction were not dramatic), and which might be responsible for that initial transcription.
3. The process is time-consuming, doesn’t make complete use of the data and is subject to experimental errors.
One can only imagine the time it took to characterize each particular distinct temporal pattern of gene expression in the diauxic shift described by DeRisi et al. In order to facilitate the identification of more than the five patterns(early, late induction, induction followed by late repression), one must plot out a gene’s induction levels over time and compare the curves of genes induced at a similar time. Also because the simple algorithms don’t classify every gene, the ambiguity lies in deciding thresholds for gene expression that determine whether it is worth searching for replications of one gene’s pattern of expression. Furthermore, this process is plagued by a laborious data analysis procedure for building patterns and incomplete usage of data.
Heller et al mentioned that a complete survey of the genes induced and repressed in response to inflammatory conditions, out of 1000, was beyond the scope of their study. In my opinion, however, the time required and the lack of sophisticated algorithms in the manual classification method (deciding which sets to analyze and which genes belong to a set) were the largest limiting factors in the under-reporting of their experimental results.
Since experimenters are only looking at the extremes, induced and repressed genes, experimental artifacts weaken conclusions and can hide potential findings. The manual approach is highly susceptible to cases in which duplicate genes have differential regulation for a certain time point or a single gene has ‘outlier’ properties, a gradient of induction preceded by high induction or repression instead of background. This fact weakens any conclusions about genes that are induced or repressed and necessitates that the experiment be re-performed and re-analyzed. The need for multiple observations of the same conditions is quite laborious as I have mentioned before.
UNSUPERVISED CLUSTERING
Why?
Simplistic manual approaches don’t reveal to the scientist the overall state of a cell during a particular process, nor do they encourage the identification of specific groups of genes with similar patterns of expression acting together to contribute to a biological process.
What is it?
Pairwise average-linkage cluster analysis. It is a form of hierarchical clustering very similar to phylogenetic analysis. This algorithm sorts through all the data to find the pairs of genes that behave most similarly in each experiment and then progressively adds other genes to the initial pairs to form clusters of "co-regulated genes". This method is based on calculating the standard correlation coefficient, i.e. the dot product of two normalized vectors. This statistic conforms well to the intuitive biological notion of "co-expression" because it captures similarity in shape instead of emphasizing the magnitude of the two measurements.
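The equation used to calculate the similarity score (correlation coefficient) is the following.
Let $G_i$ equal the log-transformed primary data for gene G in condition i (i.e. a time point).
For any two genes X and Y observed over a series of N conditions, a similarity score can be computed as follows:
$$S(X,Y) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_i - X_{\mathrm{off}}}{\Phi_X} \right) \left( \frac{Y_i - Y_{\mathrm{off}}}{\Phi_Y} \right)$$
where
$$\Phi_G = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( G_i - G_{\mathrm{off}} \right)^2 }$$
and $G_{\mathrm{off}}$ represents the mean of the observations on G. With this normalization, S(X,Y) is the standard correlation coefficient described above.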
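As a minimal sketch, the similarity score can be written in a few lines of Python. I am assuming here that phi is normalized by N (a root-mean-square deviation), which is what makes the score equal the standard correlation coefficient the text describes; the function name and the optional offset arguments are my own for illustration.

```python
import math

def similarity(x, y, x_off=None, y_off=None):
    """Similarity score S(X, Y) between two log-ratio profiles.

    x_off / y_off default to the mean of each profile, in which case the
    result is the standard (Pearson) correlation coefficient.
    Assumption: phi is the root-mean-square deviation, i.e. divided by N.
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("profiles must be non-empty and of equal length")
    if x_off is None:
        x_off = sum(x) / n
    if y_off is None:
        y_off = sum(y) / n
    phi_x = math.sqrt(sum((xi - x_off) ** 2 for xi in x) / n)
    phi_y = math.sqrt(sum((yi - y_off) ** 2 for yi in y) / n)
    if phi_x == 0 or phi_y == 0:
        raise ValueError("a flat profile has no defined similarity")
    total = sum(((xi - x_off) / phi_x) * ((yi - y_off) / phi_y)
                for xi, yi in zip(x, y))
    return total / n
```

Two profiles with the same shape but a tenfold difference in magnitude score 1, and mirror-image profiles score -1, which is the "shape, not magnitude" property mentioned above.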
The clustering algorithm and its application to microarray data were described by Eisen et al., who used it to construct trees, analogous to phylogenetic trees, that establish higher-order relationships between genes with similar expression patterns.
Just as this method is used to evaluate the conservation of sequence homology across different organisms, the same principles apply: branch lengths of the trees represent degree of similarity between individual genes.
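The tree-building step can be sketched as a small agglomerative loop in Python. This is one common reading of average linkage (the similarity of two clusters is the mean of all pairwise correlations between their members); Eisen et al.'s software may differ in detail, for instance in how a joined node is represented. The gene names and profiles are invented, and flat profiles are assumed not to occur.

```python
import math

def correlation(x, y):
    """Standard correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def average_linkage(profiles):
    """Repeatedly join the two most similar clusters.

    profiles maps gene name -> list of log ratios.
    Returns the joins in order as (left label, right label, similarity).
    """
    names = list(profiles)
    sim = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sim[(a, b)] = sim[(b, a)] = correlation(profiles[a], profiles[b])
    clusters = {name: [name] for name in names}  # node label -> member genes
    merges = []
    while len(clusters) > 1:
        labels = list(clusters)
        best = None
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                pairs = [sim[(g, h)] for g in clusters[a] for h in clusters[b]]
                score = sum(pairs) / len(pairs)
                if best is None or score > best[0]:
                    best = (score, a, b)
        score, a, b = best
        # The new node label records the tree structure, e.g. "(g1,g2)".
        clusters[f"({a},{b})"] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, score))
    return merges

# Invented profiles: two rising genes and two falling genes.
demo = {
    "g1": [0, 1, 2, 3],
    "g2": [0, 1, 2, 3.5],
    "g3": [3, 2, 1, 0],
    "g4": [3, 2, 1, -1],
}
merges = average_linkage(demo)
```

The similarity recorded at each join plays the role of branch length in the tree: the two rising genes join first, then the two falling genes, and the final join between the two groups has a negative score.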
Advantages:
1. Easy to read and zoom in on detailed patterns to identify potential modes of regulation by clusters of genes.
The color-coded scheme describing relative induction and repression ratios is more natural than a table of numbers displaying the degree of induction. The regulation of clusters of genes can be interpreted as an indication of the status of cellular processes and of that pattern’s relative contribution over time to the overall biological effect.
2. Redundant representations of genes cluster together.
Genes represented by more than one array element, or genes of high sequence identity (alternate human cDNA splice products, or homologous genes in yeast), have consistently displayed similar patterns of gene expression.
3. Genes of similar function cluster together.
Genes downregulated in response to stress (diminished glucose or transfer to nutrient-limited sporulation media) were dominated by ribosomal proteins and other proteins involved in translation. This finding supports previous reports that yeast respond to favorable growth conditions by increasing production of ribosomes through transcriptional regulation of ribosomal genes. In studying these clusters, one can answer questions about the particular state of a cellular compartment, such as the translational machinery, energy production or the proteasome, over time and deduce its contribution to the overall process being studied. These inferences are impossible to make when looking at the extremes of gene expression because no pattern matching is taken into account.
The study of genes regulated during the recovery of human primary fibroblasts from serum starvation reported clusters of cholesterol biosynthesis, cell-cycle related, signalling, and immediate early genes. See
http://rana.stanford.edu/clustering/serum.html
One can easily imagine the potential power of monitoring clusters of functionally related proteins in more specific growth conditions and their relative contributions to the phenotype, e.g. a tumor transformed due to a mutation in a particular tumor suppressor.
4. This method requires few assumptions about the nature of the data.
It isn’t biased by previous reports or by fitting data to paradigms built in the past. I will discuss this problem in connection with the sporulation paper, Chu et al, in which the scientists use a supervised clustering approach based on hand-picked expression profiles. The fact that the clustered patterns are not found with randomized data verifies that this method doesn’t alter the existing data but merely finds an ordering that brings out the biological information in it.
5. Can be used to infer function of uncharacterized genes
The functional concordance of co-expressed genes that are found in the clusters along with a few unknowns makes inferring functional information about these uncharacterized genes attractive.
6. Solves(reduces) some problems that manual classification presents:
a) Noise (excessive hybridization which hides ‘true’ induction) present in single observations (e.g. an early time point) that necessitates repeated experiments.
b) Experimental artifacts amongst similar or highly homologous genes, such as a high signal for one gene and a low signal for an identical gene at a different location at the same time point.
All these problems cloud the classification of a particular gene as an extreme and necessitate repeat experiments to verify previous observations and filter errors. In using the manual method, an experimental artifact at a particular time point might prevent the detection of a gene’s importance in contributing to an overall biological effect. The manual approach encourages the experimenter to make repeat observations under identical conditions to standardize for these errors. As mentioned before, the manual approach is more laborious in terms of data analysis and relies on unsophisticated methods of filtering out occasional experimental artifacts (for example, the same gene responding differently in one experiment).
The approach of clustering genes with similar expression patterns allows one to sample a small number of non-identical yet functionally equivalent conditions, which should increase the strength of pattern recognition for a particular group and allow one to disregard patterns that are obviously experimental artifacts. For example, if one notices by eye that a group of ten genes responds similarly (a gradual induction effect) and that at one point there was significant repression, the experimenter can simply enlarge the data set to filter out that piece of experimental error.
Since the algorithm can handle massive amounts of data and groups genes based on overall pattern of expression and not ‘significant induction’, one can be confident that the clustering of genes represents a true expression signature. Because each gene will be classified in every experiment regardless of experimental error, the robustness of the algorithm facilitates a decrease in noise when averaged out over many time points.
Disadvantages:
1. No biological bias to help classify genes.
While this is a positive when comparing results with earlier data whose conclusions might have been weak or even incorrect, it is a negative when one wants to cluster genes into known, previously assigned groups. Without parameterization, it’s harder to cluster genes into groups previously defined as contributing to a biological process during one period of time, such as a phase of the cell cycle.
2. Forced to form classifications on some data that is ambiguous.
In biological processes, the question of which individual contributions are important is highly complex and sometimes not easy to answer computationally. Some genes will perhaps be misclassified.
PARAMETRIC APPROACH CLUSTERING ALGORITHMS(unsupervised)
http://genome-www.stanford.edu/cellcycle/figures/figure1Bnames.html
Why?
In studying the cell cycle, one would like to classify genes based on their peak time of expression and be able to draw conclusions about the critical genes necessary to promote or inhibit progression past these points. One would first like to divide genes into these functional categories, so that identifying common promoter elements becomes a question that is more easily answered. Spellman et al use a hybrid approach of unsupervised (parameterized) clustering followed by supervised clustering to group the genes into the five categories representing the different phases of the cell cycle. The manual approach has been used by Cho et al and will be compared and contrasted with the clustering algorithm approach in this section.
What is it?
The algorithm first seeks to identify genes which exhibit periodic expression (a peak of expression during the cell cycle) by applying a Fourier transform to each gene’s data series (from three separate experiments).
The Fourier transform equations are as follows. The raw data for each point is $x(t_i) = \log_2 \mathrm{ratio}(t_i)$, where $t_i$ is time i.
$$A = \sum_i \sin(\omega t_i + \phi)\, x(t_i)$$
$$B = \sum_i \cos(\omega t_i + \phi)\, x(t_i)$$
$$C = (A, B)$$
$$D = \sqrt{A^2 + B^2}$$
where $\omega = 2\pi/T$ is set by the period T of the cell cycle, t is the time, and $\phi$ is the phase offset. The vector C is stored for each gene, and D is its magnitude.
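A minimal Python sketch of this calculation follows. It assumes that each log ratio is multiplied by the sine and cosine terms and that w is the angular frequency 2*pi/period; the 60-minute period, the 10-minute sampling and both example profiles are invented for illustration and are not Spellman et al.'s data.

```python
import math

def fourier_score(times, log_ratios, period, phase=0.0):
    """Return the vector C = (A, B) and its magnitude D for one gene.

    Assumptions: omega = 2*pi/period, and each log2 ratio is weighted by
    sin/cos of (omega * t + phase) before summing over time points.
    """
    omega = 2 * math.pi / period
    a = sum(math.sin(omega * t + phase) * x for t, x in zip(times, log_ratios))
    b = sum(math.cos(omega * t + phase) * x for t, x in zip(times, log_ratios))
    return (a, b), math.hypot(a, b)

# Invented example: samples every 10 minutes over two 60-minute cycles.
times = list(range(0, 120, 10))
periodic_gene = [math.cos(2 * math.pi * t / 60) for t in times]
flat_gene = [0.3 for _ in times]

c_periodic, d_periodic = fourier_score(times, periodic_gene, period=60)
c_flat, d_flat = fourier_score(times, flat_gene, period=60)
```

A gene that oscillates with the cell-cycle period gets a large magnitude D, while a constantly expressed gene scores essentially zero no matter how high its expression level, so ordering genes by D separates periodic from non-periodic profiles.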
The genes were then correlated (using a standard correlation coefficient calculation) with five different profiles representative of known genes expressed at different times in the cell cycle (G1, S, G2, M and M/G1).
After ordering the 6200 genes by CDC score, a threshold (the top 800) was set to define the set of cell-cycle regulated genes; the phasing algorithm was then used to subdivide these genes into the five categories by means of a correlation function. I believe this correlation function is identical to the similarity score used in the pairwise average-linkage cluster analysis above.
The decision as to where to set the threshold was based on the CDC score at which 91% (95 of 104) of genes previously shown to be cell-cycle regulated are included.
The calculation of a single CDC score for an individual gene is as follows. The experiment took data from ~400,000 points (800 genes at ~50 different time points) from cultures synchronized by arrest in G1 (alpha factor and cdc28, data set from Cho et al.) and in M (cdc15). One must first calculate a vector for each gene for each of the three experiments and then add the three vectors together to get the final vector that will be used for subdividing into the five groups.
Since phi must be calculated differently for the cdc15 and cdc28 experiments (relative to the alpha factor experiment, whose phi = 0), Spellman et al. altered the period value used for w for cdc15 (60-80) and cdc28 (80-100). Before formally adding the three vectors, they constructed an ideal phi constant (for cdc15 and cdc28) to maximize the average magnitude of the summed vectors.
The standard profiles against which each of the 800 vectors was compared were generated using the published timing of gene expression for known genes. How many genes Spellman et al used was not mentioned, but I am assuming that it was at most the 91% of known genes that passed the CDC score threshold.
The profiles of known gene classes were obtained by calculating a reference vector representing the average log ratio data for each ‘selected set of genes’ known to peak in each of the five time periods. To determine which group each individual gene in the data set belonged to, the vector calculated from the Fourier transform for each gene was scaled by a peak correlation value (computed with the same equation used for the similarity score). The peak correlation score was defined as the highest correlation (similarity score) between the data series for each gene and any of the 5 profiles.
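The peak correlation step can be sketched in Python as follows. The five phase labels, the reference profiles and the example gene are placeholders invented for illustration; real reference profiles would be averages over the known genes of each class, and the scaling of the Fourier vector by the peak correlation is left out of this sketch.

```python
import math

def correlation(x, y):
    """Standard correlation coefficient of two equal-length data series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def assign_phase(series, profiles):
    """Return (phase label, peak correlation) for one gene's data series.

    profiles maps a phase label to its reference profile; the gene is
    assigned to the profile it correlates with most strongly.
    """
    scores = {phase: correlation(series, ref) for phase, ref in profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Placeholder reference profiles, each peaking at a different time point.
reference = {
    "M/G1": [1, 0, 0, 0, 0],
    "G1":   [0, 1, 0, 0, 0],
    "S":    [0, 0, 1, 0, 0],
    "G2":   [0, 0, 0, 1, 0],
    "M":    [0, 0, 0, 0, 1],
}
gene = [0.1, 0.2, 1.5, 0.3, 0.0]  # invented log ratios peaking at the third point
phase, peak = assign_phase(gene, reference)
```

Because every gene is assigned to whichever profile scores highest, two genes with nearly identical timing can still land in adjacent groups, which is the forced-clustering problem raised under the disadvantages of this method.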
Advantages:
1. Because the algorithm is robust it allows the integration of previous data compiled and analyzed using the manual method.
This method used previous data from Cho et al to identify cell-cycle regulated genes. This addition of data improves the signal-to-noise ratio of the total data set. In comparison to the previous analysis, in which 421 genes were found to be cell-cycle regulated, this method included 304 of those as well as 496 genes not included in the Cho et al paper. One technical difference (besides 25mer vs. complete cDNA arrays) that might account for false positives is Cho et al’s method of attempting to filter out the effects of heat shock: analyzing gene expression at a later time point. Because the parametric approach takes this data into account and attempts to decrease this noise, there is high confidence in stating that the 304 genes found by both methods are truly cell-cycle regulated.
2. The clustering method makes it possible to identify the characteristic pattern of cell cycle regulation.
By examining the patterns of gene expression and clustering them into the different phases of the cell cycle, one may in the future arrive at a signature pattern for G1 checkpoint genes that could serve as a gold standard when comparing G1 progression in other organisms.
The promoter studies are helpful in identifying conservation of upstream regulators, and the groups with similar expression profiles can be linked to regulatory elements in promoters, e.g. for G1, 58% (vs. 6% of controls) have a copy of a perfect MCB element known to be bound by MBF, whose activity depends on cyclin activation. For example, in Spellman et al., 70% of the genes were found to respond to both inducible Cln3p (peak in G1/S) and Clb2p (peak in G2/M) expression.
Disadvantages:
1. Where did the false negatives go?
9% of the known cell-cycle regulated genes were still missing (9/104). Of the 9, only one displayed visual evidence of fluctuating expression, and then only in one condition. The inability to detect fluctuation in the other genes might be due to a high noise-to-signal ratio (the signal detection is not linear, but on a log scale); however, it was discussed that some of these genes don’t always oscillate, or oscillate very little (2.5 fold; this study constructed its data set from genes displaying 3 fold changes), when measured by traditional molecular biological techniques such as Northern and Western blots. This is an example of experiments done in the past whose conclusions were weak.
2. The forced clustering of genes with similar patterns of gene expression into different groups.
The last gene in the G2 group and the first gene in the M group are expressed at virtually the same time, but are in different groups. This problem will appear again in the sporulation paper, in which a similar supervised clustering method is used.
3. This algorithm doesn’t account for genes whose expression peaks at two time points.
Cho et al found ten genes whose expression didn’t display a single prominent peak, but two peaks instead. The algorithm (Fourier transform) penalizes multiple peaks; however, visual inspection didn’t reveal multiple peaks.
SELF ORGANIZING MAPS
The sporulation paper, Chu et al., uses both unsupervised clustering (using a S.O.M.) and supervised clustering. I will methodically outline the approach below, starting with the construction of a S.O.M. In fact, I have spoken to Michael Eisen, and he admits that the S.O.M. presented in figure 4 was included merely for aesthetic purposes and was not used for the supervised clustering; the analysis of the sporulation data was done with the supervised clustering method.
Why?
Self Organizing Maps, developed by Teuvo Kohonen (7), are useful for taking large, complex data sets that represent many points in space (e.g. 400,000) and ordering the points so that humans can analyze and draw conclusions from the relationships between them.
What are they and how can they be applied to analyzing microarray data?
The purpose of constructing a self-organizing map is to solve the classical "Travelling Salesman" problem: a salesman has a thousand stops and needs to find the optimal path, maximizing work and minimizing travel time. Its application to the microarray data has been to construct an optimal ordering, a visual representation of gene expression patterns that roughly reflects an order of genes based on time of first repression (near the top) and induction.
The "travelling salesman" problem, as we can see, has been significantly expanded, as one seeks to construct an optimal ordering of the genes from 7000 vectors in Euclidean space. The procedure for generating a self-organizing map for this data is as follows: convert the data from each time point, represented by the log transformed ratio of experimental fluorescence vs. control fluorescence, into a vector in 3D space normalized to a magnitude of 1.
Then feed this set of input vectors into Kohonen’s neural network model (software), which constructs a self-organizing map by finding the optimal path along a line (the smoothest path in space) by calculating dot products between vectors. Kohonen’s self-organizing map algorithm finds this optimal path amongst all the vectors by minimizing a parameter, the distance between adjacent vectors in space (the vectors are plotted in 3D on a sphere, on which the calculations are carried out). The output of the algorithm is an optimal ordering of the 1116 genes.
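The procedure above can be sketched as a generic one-dimensional Kohonen map. This is not the software used for the sporulation paper: the number of nodes, the learning-rate schedule, the Gaussian neighbourhood and the toy profiles are all assumptions. The best-matching node is picked by Euclidean distance, which for unit vectors gives the same ranking as the dot product.

```python
import math
import random

def normalize(vec):
    """Scale a vector to magnitude 1 (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def best_node(nodes, x):
    """Index of the node closest to x (Euclidean distance)."""
    return min(range(len(nodes)), key=lambda j: math.dist(nodes[j], x))

def train_som_1d(data, n_nodes=5, n_iter=2000, seed=0):
    """Train a chain of nodes so that neighbouring nodes hold similar profiles."""
    rng = random.Random(seed)
    dim = len(data[0])
    nodes = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_nodes)]
    for step in range(n_iter):
        frac = step / n_iter
        rate = 0.5 * (1.0 - frac)                          # learning rate decays
        radius = max(0.5, (n_nodes / 2.0) * (1.0 - frac))  # neighbourhood shrinks
        x = rng.choice(data)
        bmu = best_node(nodes, x)
        for j in range(n_nodes):
            # Nodes near the winner on the chain are pulled toward the input too
            pull = rate * math.exp(-((j - bmu) ** 2) / (2.0 * radius ** 2))
            nodes[j] = [w + pull * (xi - w) for w, xi in zip(nodes[j], x)]
    return nodes

def order_genes(profiles, n_nodes=5, seed=0):
    """Return gene names sorted by the chain position of their best-matching node."""
    names = sorted(profiles)
    unit = {name: normalize(profiles[name]) for name in names}
    nodes = train_som_1d([unit[n] for n in names], n_nodes=n_nodes, seed=seed)
    return sorted(names, key=lambda n: best_node(nodes, unit[n]))

# Hypothetical log-ratio profiles over seven time points
toy = {
    "early1": [0.0, 2.0, 3.0, 2.0, 1.0, 0.0, 0.0],
    "early2": [0.0, 1.8, 2.9, 2.2, 0.8, 0.1, 0.0],
    "late1": [0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 3.0],
    "late2": [0.0, 0.1, 0.0, 0.9, 2.1, 2.8, 3.1],
    "repressed": [0.0, -2.0, -2.5, -1.0, 0.0, 0.0, 0.0],
}
order = order_genes(toy)
print(order)
```

Because neighbouring nodes are updated together, genes that map to adjacent nodes tend to have similar profiles, which is what produces a smooth ordering rather than discrete classes.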
The S.O.M. in the sporulation paper minimizes a parameter representing differences in the temporal pattern between consecutive genes. This parameter can be thought of as an energy function representing the summation of distances in Euclidean space between each gene and its nearest neighbor. This algorithm has been used in Chu et al.’s description of the yeast transcriptional response during sporulation. The number of genes that were significantly induced was 1116, selected by the manual method of taking genes affected more than 3 fold (repression or induction); the number of time points was 6 (plus time zero).
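The 3 fold selection step can be sketched as follows. I am assuming the ratios are stored as log base 2 (the text does not state the base), in which case a 3 fold change corresponds to an absolute value above log2(3), roughly 1.58. The gene names and values are invented for illustration.

```python
import math

def changed_genes(log2_ratios, fold=3.0):
    """Names of genes induced or repressed more than `fold` at any time point."""
    threshold = math.log2(fold)
    return [name for name, series in log2_ratios.items()
            if any(abs(v) > threshold for v in series)]

# Hypothetical log2(experimental / control) ratios: time zero plus six time points
toy_ratios = {
    "geneA": [0.0, 0.5, 1.0, 2.0, 1.5, 0.5, 0.0],       # peaks at 4 fold induction
    "geneB": [0.0, 0.3, 0.9, 1.32, 1.0, 0.4, 0.0],      # peaks near 2.5 fold
    "geneC": [0.0, -1.0, -3.0, -2.0, -1.0, 0.0, 0.0],   # 8 fold repression
}

print(changed_genes(toy_ratios))
```

A gene such as geneB, which peaks near 2.5 fold, never enters the data set, so no later clustering step can recover it.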
Advantages:
1. Hard boundaries of classification don’t need to be drawn for ambiguous data.
Since neural nets are not subjected to previous empirical categorical measurements, they form the most logical patterns a priori, and the ordering scheme is not forced to make rigid classifications on ambiguous data. The self-organizing map algorithm produced a nice picture representing the expression profiles of the 1116 genes affected by a switch to sporulation media, as seen in figure 4 (http://cmgm.stanford.edu/pbrown/sporulation/fig4.html).
2. The optimal ordering is sometimes better (visually) than other clustering approaches.
This neural network approach has ordered the genes in such a manner that the top fourth of the 1116 ordered genes is very repressed (GREEN) at early time points and then released (BLACK), while the bottom fourth of the genes are either not induced (BLACK) or repressed (GREEN) at early time points and highly induced (RED) at later time points. The ordering of the genes appears visually satisfying to the biologist, as it reflects an ordering based on time of first induction or repression.
Although I never saw the clusters from Eisen’s unsupervised clustering algorithm on the sporulation data (he said they were less impressive), the neural net based ordering of the sporulation data, in comparison to the clustering algorithm used on serum starvation of primary fibroblasts in figure 2 of Eisen et al. (http://rana.stanford.edu/clustering/serum.html), looks like a very effective approach to ordering genes based on their expression patterns.
3. It handles massive amounts of data and orders them in a fashion that is easily assimilated by the biologist.
In ordering vectors in space, one needs an algorithm without biological bias in order to discover patterns of similarity that might not be found if the algorithm were weighted towards a biological preference, which might itself lead to incorrect classification.
Disadvantages:
1. Incorrect classifications that make no biological sense.
Just as neural nets used to predict secondary structure can often score a residue inside an alpha helix as beta strand, this algorithm might also fail to group genes of similar function into meaningful groups, despite arranging genes based on first induction or repression. I don’t have any examples, as I didn’t examine the data closely (figure 5, Chu et al.) and my knowledge of yeast sporulation genes is insufficient to draw any conclusions which might support this point.
SUPERVISED CLUSTERING:
Why?
For the sporulation paper, no parameterization is applied to each point, since it is ambiguous how to use a mathematical weight such as periodicity to divide the induced genes, within the set of 1116 genes significantly affected by sporulation, into the seven classes. I am assuming this based on the lack of mention of periodic fluctuation of transcripts during the course of sporulation.
The scientists sought an approach which would agree with previous reports of distinct sets of genes assigned to various stages of sporulation.
What is it?
In this case, the authors performed the same basic correlation coefficient test as in the parameterization method above, except that I can’t be sure how the reference vectors were constructed (probably with the same algorithm as the cell-cycle selected groups), since I couldn’t get in contact with Joe DeRisi, who constructed figure 5 (http://cmgm.stanford.edu/pbrown/sporulation/fig5.html).
Advantages:
1. This approach is mathematically sound and more visually appealing than the manual approach.
It does appear that the overall trend of induction is preserved, meaning the early genes appear induced at early time points, while the late ones are repressed at early time points and induced at later time points. The fact that on average 70% of the genes within each class had a correlation coefficient >0.9 suggests that the classification is accurate. The genes which correlated very closely with adjacent groups are ordered close to the boundaries, as expected (each gene was binned into its group by best fit).
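The best-fit binning can be sketched as follows. I am assuming a standard Pearson correlation coefficient; since I don’t know how the real reference vectors were constructed, the two reference profiles, the gene names and the function names here are purely illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient (0.0 if either profile is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def assign_classes(profiles, references):
    """Bin each gene into the class whose reference vector it correlates with best."""
    result = {}
    for gene, series in profiles.items():
        best = max(references, key=lambda c: pearson(series, references[c]))
        result[gene] = (best, pearson(series, references[best]))
    return result

# Hypothetical reference vectors (e.g. averages of known early and late genes)
references = {
    "early": [0.0, 2.0, 3.0, 2.0, 1.0, 0.0, 0.0],
    "late": [0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 3.0],
}
genes = {
    "g1": [0.0, 1.8, 3.1, 2.2, 0.9, 0.1, 0.0],
    "g2": [0.0, 0.1, 0.0, 0.8, 2.1, 2.9, 3.2],
}

classes = assign_classes(genes, references)
for gene, (label, r) in classes.items():
    print(gene, label, round(r, 3))
```

Every gene is forced into exactly one class, so a gene that correlates almost equally well with two adjacent references still lands on one side of the boundary.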
The fact that the visual representation (color coded) of induced genes into temporal classes in figure 5 (http://cmgm.stanford.edu/pbrown/sporulation/fig5.html) preserves the overall patterns seen in the shape of the (manual) plots of the known early and late genes used to construct the reference vectors lends credence to the effectiveness of the method for classifying genes known to be important in the transcriptional program of sporulation.
Disadvantages:
1. Unappealing classification of genes clustered near boundaries of the classes.
As was the case in Spellman et al., this report classifies some genes which seem to have identical expression patterns into different sets: e.g. the genes near the end of the "early middle" set look identical to the ones at the beginning of the "middle" set: see Figure 5 (http://cmgm.stanford.edu/pbrown/sporulation/fig5.html).
The metabolic genes near the top of the set might be better classified as early genes, as they appear to be induced early and repressed later. The reverse argument can also be made visually for a few early I genes with stable expression throughout the time course, whose induction decreased only at the last time point.
Perhaps the problem lies in the method of averaging the log ratios of the chosen ‘metabolic’ genes, or in the choice of genes that were ‘forced’ (by the scientists in the field) to take part in the temporal profile.
I’m sure that better parameterized functions might be developed in the future to make the data look more visually appealing.
An alternative ‘objective’ approach is to construct an ordering of the genes into groups that are not predefined. The algorithm used here would be identical to the neural net, except that it would not take into account the distance and smoothing functions that the self organizing maps do. I’m not sure what the precise parameters are of the equation used to sort the data into an arbitrary number of groups, but the general principle is to minimize an energy function representing the summation of the distances between a vector and its nearest neighbor (Eisen, personal communication). Whether this approach can be classified as a neural net, I’m not absolutely sure.
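Since I don’t know the actual equation, the following is only a sketch of the general principle: the energy of an ordering is taken to be the sum of Euclidean distances between consecutive vectors, and a better ordering is one with lower energy. The function name and the toy vectors are assumptions.

```python
import math

def ordering_energy(vectors, order):
    """Sum of Euclidean distances between consecutive genes in an ordering."""
    return sum(math.dist(vectors[a], vectors[b]) for a, b in zip(order, order[1:]))

# Hypothetical unit-scale expression vectors
vectors = {
    "g1": (1.0, 0.0, 0.0),
    "g2": (0.9, 0.1, 0.0),
    "g3": (0.0, 0.0, 1.0),
    "g4": (0.0, 0.1, 0.9),
}

grouped = ["g1", "g2", "g4", "g3"]       # similar profiles kept adjacent
interleaved = ["g1", "g3", "g2", "g4"]   # similar profiles split apart

print(ordering_energy(vectors, grouped))
print(ordering_energy(vectors, interleaved))
```

An optimizer would search over orderings (or group assignments) for the lowest energy; only the scoring step is shown here.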
Food for thought:
Grouping by chromosome position, or promoter?
Not possible, not enough sequence and 3D structural information out there yet! Biology isn’t explained all by sequence information. Grouping the different genes by chromosome position and by their unique cell cycle fluctuation time period, as in figure 3 of Cho et al., reveals nothing significant to my eye, although the authors claim that more than 25% of the genes displaying periodic transcript levels were positioned directly adjacent to another gene induced in the identical cell cycle phase.
Please see:
http://www.molecule.org/cgi/content/full/2/1/65/F3
The dream of finding hotbeds of transcriptional activity localized to specific chromosomes may be possible in other biological processes in the future, but so far it hasn’t looked very impressive for cell-cycle (Eisen, personal communication).
Grouping by promoter element is at a very elementary stage due to insufficient knowledge of regulatory elements. Only correlations can be made, such as, in the sporulation paper, a gene’s degree of similarity to the consensus DNA binding site of a particular transcription factor. It is also limited by the fact that some genes with similar function, clustered together by the average linkage algorithm, might not share any promoter element and so wouldn’t appear as part of the set, although their expression patterns are identical.
CLASSIFICATION ALGORITHMS
The classification algorithm called CART (8), an acronym for Classification and Regression Trees, is a decision-tree based procedure that might be used in the future to describe consistent patterns of expression with a high degree of confidence. It essentially subjects each data point to a tree-shaped visual flow chart of binary questions which direct it into the appropriate class.
This algorithm can classify a set of points into classes a priori, but it needs a large data set for training before it classifies the points into groups which make sense. In comparison to neural nets, this algorithm is much quicker, and the simple table format makes it easier for users to see the hierarchical interactions of the variables.
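A single "binary question" of such a tree can be sketched as below. This is only one split chosen by Gini impurity, a common CART criterion; a full tree applies the same search recursively to each branch and usually prunes afterwards. The training rows, labels and function names are invented for illustration.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Return (impurity, feature index, threshold) of the best binary question
    of the form 'is feature <= threshold?', or None if no split is possible."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [lab for r, lab in zip(rows, labels) if r[f] <= t]
            right = [lab for r, lab in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

# Hypothetical training set: log ratios at an early and a late time point
rows = [(2.0, 0.1), (1.8, 0.3), (0.2, 2.1), (0.1, 1.9)]
labels = ["early", "early", "late", "late"]

print(best_split(rows, labels))
```

With only four training genes the split is trivially perfect, which is exactly why a large data set is needed before the questions the tree learns mean anything.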
As the number of data sets grows for cDNA microarray analysis, this algorithm might become more amenable to classifying a gold standard signature of a particular process, such as the transcriptional targets of a signalling pathway like MAPK, which activates transcription factors through phosphorylation.
The algorithm might be trained on yeast data (using dominant-negative mutants), and validated with C. elegans, Xenopus and human data.
Some disadvantages of this algorithm include the lack of biological bias and necessity of a large data set to teach the tree how to subdivide the data.
KMEANS Clustering Algorithm:
It is an alternative to the pairwise average-linkage cluster analysis used by Botstein’s lab.
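For orientation, the textbook k-means procedure can be sketched as below. This is the generic algorithm, not the Trent lab’s implementation: the choice of k, the Euclidean distance, the evenly spaced initial centers and the toy profiles are all assumptions on my part.

```python
import math

def kmeans(points, k, n_iter=100):
    """Cluster points into k groups; returns (labels, centers)."""
    n = len(points)
    # Simplifying assumption: start from k evenly spaced data points
    centers = [list(points[i * n // k]) for i in range(k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each profile joins its nearest center
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical log-ratio profiles over four time points
profiles = [
    (2.0, 1.0, 0.0, 0.0), (1.8, 1.2, 0.1, 0.0), (2.2, 0.9, 0.0, 0.1),  # early
    (0.0, 0.0, 1.0, 2.0), (0.1, 0.0, 1.2, 1.9), (0.0, 0.2, 0.9, 2.1),  # late
]
labels, centers = kmeans(profiles, k=2)
print(labels)
```

Unlike the hierarchical average-linkage method, the number of groups has to be chosen in advance, and the result can depend on the starting centers.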
It is used by the Trent lab at NHGRI. I am unsure of the parameters of the algorithm, since no publications have reported successful clustering based on it. I have only seen one figure of the results of this algorithm, in a Nature Genetics editorial (9). The approach displayed a 3 dimensional chart (x: the genes, y: R/G ratio, z: days of expression over time). The idea is to cluster genes according to the similarity of their expression profiles over time. Whether this approach will be used to draw novel conclusions remains to be seen. I mention this merely as another statistical/mathematical algorithm to analyze gene expression patterns in a sensible way for the biologist to interpret.
Bibliography:
Manual Classification:
(2) Heller et al. Discovery and analysis of inflammatory disease-related genes using cDNA microarrays PNAS Vol 94 pp. 2150-2155. 1997
(3) Cho et al. A Genome-Wide Transcriptional Analysis of the Mitotic Cell Cycle. Molecular Cell Vol 2 pp. 65-73. 1998
Clustering Algorithms:
(4) Eisen, M et al. Cluster Analysis and Display of Genome-wide Expression Patterns. PNAS 1999 (in submission)
(5) Spellman, P et al. Comprehensive Identification of Cell Cycle-regulated Genes of Yeast Saccharomyces cerevisiae by Microarray Hybridization. Molecular Biology of the Cell Vol.9 pp.3273-3297 1998
(6) Chu et al. The Transcriptional Program of Sporulation in Budding Yeast. Science Vol 282 pp. 699-705. 1998
Neural Nets:
(7) Kohonen,T Self Organizing maps c1995
Other:
(8) Breiman et al. Classification And Regression Trees c1984.
(9) Getting hip to the chip Nature Genetics Vol 18 No.3 p.195 1998
(10) Abbott, A. DNA chips intensify the sequence search. Nature Vol 379 p. 392. 1996