We considered three candidate measures of gene similarity: Pearson
correlation, Euclidean distance and Spearman correlation. Pearson correlation measures the relative
shape of gene regulation across experiments rather than the absolute levels. It is a natural choice because it is
widely used to measure gene correlations.
Euclidean distance measures differences in the
absolute level of gene regulation, and is therefore not appropriate for this
analysis. For example, two genes whose
expression levels were perfectly parallel to one another across the database
could still be far apart in Euclidean space if the absolute levels in each
experiment were different. Euclidean
distance can also make uncorrelated genes appear close
together: if two genes both had expression levels close to 0 across the
database but were otherwise uncorrelated with each other, they would still appear close
in Euclidean space.
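
The two failure modes of Euclidean distance described above can be illustrated with a minimal sketch. The profiles below are made-up values chosen for illustration, not data from this study, and the helper names `pearson` and `euclidean` are hypothetical rather than the code used in the analysis.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def euclidean(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical profiles across four experiments (illustrative values only).
gene_a = [1.0, 2.0, 0.5, 3.0]
gene_b = [v + 4.0 for v in gene_a]   # parallel to gene_a, shifted upward

# Two profiles that both stay near 0 but share no pattern.
gene_c = [0.1, -0.1, 0.1, -0.1]
gene_d = [0.1, 0.1, -0.1, -0.1]

# Parallel genes: perfectly correlated, yet far apart in Euclidean space.
print(pearson(gene_a, gene_b), euclidean(gene_a, gene_b))

# Near-zero genes: uncorrelated, yet close together in Euclidean space.
print(pearson(gene_c, gene_d), euclidean(gene_c, gene_d))
```

In this toy case the parallel pair has a Pearson correlation of 1 but a Euclidean distance of 8, while the near-zero pair has a Pearson correlation of 0 and a Euclidean distance below 0.3.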
The Spearman correlation uses ranks
rather than raw expression levels, which makes it less sensitive to extreme
values in the data. The possibility
existed that extreme values in the dataset might significantly influence the
Pearson calculation and thereby enable a small number of microarray experiments
to have a disproportionately large effect on our gene similarity measure. We checked for any such outlier effects
before using the Pearson correlation.
We computed Spearman correlations between every pair of genes and
compared the Spearman correlation with the Pearson correlation for every
pair. We found that there was good
overall agreement between the two measures in all four organisms (Fig.
S3). Since the two measures are similar,
we decided to use the Pearson correlation because it is more sensitive.
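
As a rough sketch of the comparison described above, the Spearman correlation can be computed as the Pearson correlation of the ranked profiles and tabulated next to the Pearson correlation for every gene pair. The gene names and expression values here are hypothetical, and the functions are illustrative helpers, not the implementation used for the analysis.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical expression profiles across five experiments (made-up values).
profiles = {
    "gene1": [0.1, 0.4, 0.2, 0.3, 0.5],
    "gene2": [0.2, 0.5, 0.1, 0.4, 0.6],
    "gene3": [0.5, 0.1, 0.4, 0.2, 9.0],  # one extreme experiment
}

# (Pearson, Spearman) for every pair of genes.
comparison = {
    (g, h): (pearson(profiles[g], profiles[h]), spearman(profiles[g], profiles[h]))
    for g, h in combinations(profiles, 2)
}

for pair, (r, rho) in comparison.items():
    print(pair, round(r, 3), round(rho, 3))
```

In this toy example the single extreme value in `gene3` gives the `gene1`/`gene3` pair a Pearson correlation well above its Spearman correlation, which is the kind of outlier-driven disagreement the check is meant to detect; pairs where the two measures agree show no such gap.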