Choice of correlation measure

We considered three measures of gene correlation: Pearson correlation, Euclidean distance, and Spearman correlation.  Pearson correlation compares the relative shape of two genes' expression profiles rather than their absolute levels.  It is a natural choice because it is widely used to measure gene correlations.

Euclidean distance reflects the absolute level of gene regulation and would not be appropriate for this analysis.  For example, two genes whose expression levels were perfectly parallel to one another across the database could still be far apart in Euclidean space if their absolute levels differed in each experiment.  Euclidean distance can also make uncorrelated genes appear close together: if two genes had expression levels close to 0 across the database but were otherwise uncorrelated, they would still appear close in Euclidean space.
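
The two failure modes described above can be reproduced on synthetic data. The following is a minimal sketch and not the code used for the analysis: the profiles are simulated, and the names `gene_a` to `gene_d`, the number of experiments, and the noise levels are illustrative assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def euclidean(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


random.seed(0)          # fixed seed so the simulated profiles are reproducible
n_experiments = 200     # assumed size of the toy "database"

# Case 1: perfectly parallel profiles that differ by a constant offset.
gene_a = [random.gauss(0.0, 1.0) for _ in range(n_experiments)]
gene_b = [level + 3.0 for level in gene_a]

# Case 2: independent profiles that both stay close to 0.
gene_c = [random.gauss(0.0, 0.05) for _ in range(n_experiments)]
gene_d = [random.gauss(0.0, 0.05) for _ in range(n_experiments)]

print("parallel: r = %.3f, distance = %.2f"
      % (pearson(gene_a, gene_b), euclidean(gene_a, gene_b)))
print("near zero: r = %.3f, distance = %.2f"
      % (pearson(gene_c, gene_d), euclidean(gene_c, gene_d)))
```

In this toy setting the parallel pair has a Pearson correlation of essentially 1 but a large Euclidean distance, whereas the independent near-zero pair has a small Euclidean distance despite a Pearson correlation near 0.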

The Spearman correlation uses ranks rather than raw expression levels, which makes it less sensitive to extreme values in the data.  Extreme values in the dataset could in principle dominate the Pearson calculation and thereby allow a small number of microarray experiments to have a disproportionately large effect on our gene similarity measure.  We checked for such outlier effects before adopting the Pearson correlation: we computed both the Spearman and the Pearson correlation for every pair of genes and compared the two.  There was good overall agreement between the two measures in all four organisms (Fig. S3).  Since the two measures are similar, we decided to use the Pearson correlation because it is more sensitive.
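
The kind of check described above could be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: Spearman correlation is computed as the Pearson correlation of average ranks, the helper `compare_all_pairs` and the layout of `expression` (one list of levels per gene) are hypothetical, and the example profiles are invented to show how a single extreme experiment can separate the two measures.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def compare_all_pairs(expression):
    """Return (pearson, spearman) for every pair of genes (hypothetical helper)."""
    return [(pearson(expression[i], expression[j]),
             spearman(expression[i], expression[j]))
            for i, j in combinations(range(len(expression)), 2)]


# Invented profiles: the genes are anti-correlated in five experiments,
# but one extreme experiment (value 100 in both) dominates the Pearson value.
gene_x = [1, 2, 3, 4, 5, 100]
gene_y = [5, 4, 3, 2, 1, 100]
gene_z = [2, 1, 4, 3, 6, 5]

for r, rho in compare_all_pairs([gene_x, gene_y, gene_z]):
    print("Pearson = %.3f, Spearman = %.3f" % (r, rho))
```

For `gene_x` and `gene_y` the Pearson correlation is close to 1 while the Spearman correlation is slightly negative. Plotting the paired values for all gene pairs against each other is one way to see whether such disagreement is common in a given dataset.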