Meta-genes: Identification of orthologous groups

 

We considered four methods for generating sets of orthologous genes: (1) clusters of orthologous groups (COGs) defined by NCBI, (2) eukaryotic gene orthologs (EGO) defined by TIGR, (3) reciprocal best BLAST hits, and (4) transitive reciprocal best BLAST hits.  Transitivity requires that if human gene A is a reciprocal best BLAST hit of both worm gene B and fly gene C, then worm gene B and fly gene C must also be reciprocal best BLAST hits of each other for genes A, B and C to be grouped as orthologs.
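The two BLAST-based definitions can be illustrated with a minimal Python sketch.  It is not the pipeline used in this work: the `best_hit` table, the `(organism, name)` gene identifiers and both function names are assumptions made for illustration, and the best hits are taken as already extracted from the BLAST output.

```python
from itertools import combinations

def reciprocal_best_hits(best_hit):
    """Return the unordered gene pairs that are each other's best hit.

    best_hit maps (gene, target_organism) -> best-scoring gene in that
    organism. Genes are (organism, name) tuples (an assumed convention).
    """
    pairs = set()
    for (gene, _target_org), hit in best_hit.items():
        # The hit must point back to `gene` when searched against gene's organism.
        if best_hit.get((hit, gene[0])) == gene:
            pairs.add(frozenset((gene, hit)))
    return pairs

def is_transitive(group, rbh_pairs):
    """True if every pair of genes in the group is a reciprocal best hit."""
    return all(frozenset(p) in rbh_pairs for p in combinations(group, 2))

# Toy data (hypothetical): A is reciprocal best hit of B and C,
# but fly gene C prefers worm gene D over worm gene B.
A, B, C, D = ("human", "A"), ("worm", "B"), ("fly", "C"), ("worm", "D")
hits = {
    (A, "worm"): B, (B, "human"): A,
    (A, "fly"): C, (C, "human"): A,
    (B, "fly"): C, (C, "worm"): D,
}
pairs = reciprocal_best_hits(hits)
group_is_transitive = is_transitive([A, B, C], pairs)
```

In this toy case genes A, B and C would still be linked through A under the plain reciprocal-best-hit definition, but the group would be rejected under the transitive definition because B and C are not reciprocal best hits.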

 

We could not use the definition of orthologs found in the COG database because some orthologous groups contained a large number of genes from a single organism.  For example, in some cases over 100 genes from C. elegans were grouped together.  Having a large number of genes from a single organism would complicate the gene correlations; a single human gene would have a separate Pearson correlation with each of the 100 worm genes in the same orthologous group.
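As a rough illustration of this problem, the sketch below flags orthologous groups in which one organism contributes more than a chosen number of genes.  The group layout, the function names and the `limit` parameter are hypothetical and are not part of the COG database or of the analysis described here.

```python
from collections import Counter

def max_genes_per_organism(genes):
    """Largest number of genes that any single organism contributes to a group."""
    counts = Counter(organism for organism, _name in genes)
    return max(counts.values())

def expanded_groups(groups, limit=1):
    """IDs of groups in which some organism contributes more than `limit` genes."""
    return [gid for gid, genes in groups.items()
            if max_genes_per_organism(genes) > limit]

# Hypothetical groups: g2 contains 100 worm genes and one human gene.
groups = {
    "g1": [("human", "h1"), ("worm", "w1"), ("fly", "f1")],
    "g2": [("human", "h2")] + [("worm", f"w{i}") for i in range(100)],
}
flagged = expanded_groups(groups)

# In g2 the single human gene would be paired with every worm gene.
n_worm_pairs = sum(1 for organism, _name in groups["g2"] if organism == "worm")
```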

 

We could not use the EGO database because the same gene was sometimes assigned to separate orthologous groups.  For example, tentative orthologs 336024, 350993 and 402694 each contain the same yeast gene encoding nuclear transport factor 2.  Having the same gene in multiple orthologous groups would complicate the gene correlations, since that gene would have a separate set of Pearson correlations for each group.
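A check for this kind of redundancy could look like the following sketch.  The three group identifiers are the ones quoted above; every other member of each group, the gene label used for the yeast gene and the function name are placeholders introduced only for illustration.

```python
from collections import defaultdict

def genes_in_multiple_groups(groups):
    """Map each gene that appears in more than one group to the sorted group IDs."""
    membership = defaultdict(set)
    for group_id, genes in groups.items():
        for gene in genes:
            membership[gene].add(group_id)
    return {gene: sorted(ids) for gene, ids in membership.items() if len(ids) > 1}

# Placeholder contents: only the shared yeast gene reflects the text above.
yeast_ntf2 = ("yeast", "nuclear transport factor 2")
groups = {
    336024: [yeast_ntf2, ("human", "placeholder_1")],
    350993: [yeast_ntf2, ("worm", "placeholder_2")],
    402694: [yeast_ntf2, ("fly", "placeholder_3")],
}
shared = genes_in_multiple_groups(groups)
```

Here `shared` contains a single entry, the yeast gene, mapped to all three group identifiers.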

 

We used reciprocal best BLAST hits to define orthologous genes (Fig. S4 A), and then compared the results of this approach with those of the other three approaches.  First, we found that most meta-genes contained a single gene from each organism (Fig. S4 B), so this approach avoids the problem found with the COG database.  Second, we found that over 78% of the meta-genes exhibited transitive relationships.  This result indicates that this method and the approach requiring transitivity would generate similar sets of orthologous genes, although the transitive method would be somewhat more restrictive and hence generate a smaller number of meta-genes.  Third, we compared the meta-genes to the sets of orthologs defined by the EGO database.  We examined a random set of 47 meta-genes, which contain a total of 201 pairs of orthologous genes.  Of these 201 ortholog pairs, 184 (91.5%) were also linked together in the EGO database.  Hence, the method used in this paper and the method used by the EGO database generate the same orthology relationships in most cases.  Fourth, it could be that some of the links do not identify true orthologs but rather close homologs; this might be the case for the roughly 22% of meta-genes that do not exhibit transitivity.  To determine whether using a close homolog rather than an ortholog would significantly affect the network relationships, we calculated the Pearson correlations between close homologs and found these correlations to be significantly high.  This result indicates that the meta-gene network would yield similar results using close homologs rather than true orthologs.
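The two quantitative checks in this comparison can be sketched as follows.  This is an illustrative outline under assumptions, not the code used for the analysis: the function names and the toy inputs are hypothetical, and only the counts 184 and 201 come from the text.

```python
import math

def pair_agreement(pairs, reference_pairs):
    """Fraction of ortholog pairs that are also linked in a reference set."""
    shared = sum(1 for pair in pairs if pair in reference_pairs)
    return shared / len(pairs)

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Reported counts: 184 of 201 ortholog pairs were also linked in EGO.
percent_in_ego = round(100 * 184 / 201, 1)

# Toy inputs (hypothetical) showing how the helpers would be called.
ours = [frozenset(("h1", "w1")), frozenset(("h1", "f1")), frozenset(("h2", "w2"))]
ego = {frozenset(("h1", "w1")), frozenset(("h1", "f1"))}
toy_agreement = pair_agreement(ours, ego)
toy_r = pearson([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 8.0])
```

Representing each pair as a `frozenset` makes the comparison independent of the order in which the two genes are listed.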