Supplemental Figure 6. Commitment Algorithm: Comparison of scoring methods
Our commitment algorithm predicts the level of commitment to gene expression of each cell during the development of the worm. We make these predictions by combining the gene expression profiles of the annotated cells in the L1 worm with the cell lineage. Our approach finds the commitment values at the internal nodes (the common ancestors of the cells annotated in the L1) that minimize the overall change in expression throughout development.
The overall change in gene expression can be scored in multiple ways. In the main text, we chose to minimize the linear sum of all changes along the tree. To determine whether the gene expression commitment algorithm is sensitive to the scoring method, we tried two methods: linear scoring and the sum of squared changes. As described below, both methods yielded similar results, and so we chose to use linear scoring. Linear scoring is the more intuitive solution, and it also offers the flexibility to apply different penalties to an increase or a decrease in gene commitment in the future.
To confirm that the method yields similar results under a different scoring function, we tested the sum of squared changes (SSC) method. We are presented with 363 annotated gene expression values for each of the 363 terminal cells in the L1 lineage that were scored. As before, we want to minimize the overall change in gene expression needed during embryogenesis to generate the final pattern of gene expression. Here, however, we alter the scoring function so that we minimize the sum of the squared changes, rather than the linear sum of changes, between each parent cell and its daughter cells.
As before, at any leaf node, we set the commitment value equal to the observed expression at the cell represented by that leaf node.
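As an illustration only, the two scoring functions described above can be compared on a toy lineage. The sketch below is not the implementation used for the figures: the tree, the node names and the expression values are hypothetical, the SSC optimum is found by repeated neighbor averaging, and the linear optimum is found with a Farris-style interval downpass for binary trees, which is an assumption here and not necessarily how the linear solution in the main text was computed.

```python
# Hypothetical toy lineage: each internal node has exactly two daughters.
CHILDREN = {"root": ("A", "B"), "A": ("a1", "a2"), "B": ("b1", "b2")}
# Made-up observed expression at the annotated terminal cells (leaves).
LEAVES = {"a1": 0.0, "a2": 0.0, "b1": 1.0, "b2": 1.0}


def score(values, power):
    """Sum of |parent - daughter|**power over all branches (1 = linear, 2 = SSC)."""
    return sum(abs(values[p] - values[c]) ** power
               for p, kids in CHILDREN.items() for c in kids)


def minimize_ssc(sweeps=200):
    """Minimize the sum of squared changes with the leaves held fixed.

    At the SSC optimum every internal node equals the mean of its
    neighbors (its daughters and, if it has one, its parent), so
    sweeping that update repeatedly converges to the optimum.
    """
    parent = {c: p for p, kids in CHILDREN.items() for c in kids}
    values = dict(LEAVES)
    values.update({node: 0.0 for node in CHILDREN})
    for _ in range(sweeps):
        for node, kids in CHILDREN.items():
            neighbors = [values[k] for k in kids]
            if node in parent:
                neighbors.append(values[parent[node]])
            values[node] = sum(neighbors) / len(neighbors)
    return values


def minimal_linear_cost(node="root"):
    """Return ((low, high), cost) for the subtree below node.

    (low, high) is the range of values at node that achieve the minimal
    linear sum of changes below it; cost is that minimal sum.
    """
    if node in LEAVES:
        v = LEAVES[node]
        return (v, v), 0.0
    (lo1, hi1), cost1 = minimal_linear_cost(CHILDREN[node][0])
    (lo2, hi2), cost2 = minimal_linear_cost(CHILDREN[node][1])
    lo, hi = max(lo1, lo2), min(hi1, hi2)
    if lo <= hi:
        # Daughter ranges overlap: no additional change is required.
        return (lo, hi), cost1 + cost2
    # Daughter ranges are disjoint: the gap between them is the added change.
    return (hi, lo), cost1 + cost2 + (lo - hi)


ssc_values = minimize_ssc()
print({node: round(ssc_values[node], 3) for node in CHILDREN})
print("SSC score:", round(score(ssc_values, 2), 3))
print("linear optimum at root (range, cost):", minimal_linear_cost())
```

On this toy tree the SSC solution spreads the change over several branches (the root sits midway between the two sublineages), whereas linear scoring is satisfied by a single change placed anywhere between the two daughters of the root. Both scores agree on which sublineage is high and which is low, which is the kind of agreement the comparison in this section is looking for.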
To compare the results from the method described in the main text to the SSC approach described above, we analyzed the activity map generated by each. First, we compared the gene expression commitment predicted by the two scoring methods for 9 individual genes, and found by visual inspection that the two methods gave similar results. Supplemental Figure 6A, second row, shows an example for C08B11.3. Second, we combined the results from all 93 genes to look at the overall change in molecular signature in the embryonic cell lineage. By visual inspection, we again find that the molecular divergence map for the SSC method (Supplemental Figure 6B, second row) strongly resembles the map for linear scoring. As a result, we are confident in our predictions, and we report the results from the most intuitive formulation of the problem (linear scoring).
Commitment Algorithm: Sensitivity to Un-annotated Cells
The commitment algorithm predicts the changes in gene expression commitment based on a subset of annotated cells in the L1, and as a result we use a modified cell lineage in the method (described fully in the methods section). As in evolutionary biology, we make our predictions based only on observed data. Therefore, our commitment map and, subsequently, the molecular divergence map examine differences between common ancestors of observed cells only.
However, unlike in evolutionary biology, where the total number of species present in Earth’s history is unknown, the complete cell lineage of C. elegans is known. This gives us the opportunity to examine what effect the un-annotated cells might have on the predictions.
We considered assigning all possible gene expression values to each of the un-annotated cells, but this is not computationally feasible. We also considered randomly assigning gene expression values to the set of un-annotated cells based on a distribution derived from the observed expression values in the annotated cells. This is biologically unsound because such a distribution would assume that gene expression values in cells with similar fates are not correlated. That is, any statistical analysis would assume that the expression levels of all cells are independent and identically distributed.
Instead, we analyze the system at two of the highest perturbation levels. We allow all of the un-annotated cells to receive either the highest or the lowest gene expression value. While other possible gene expression patterns would cause a higher level of perturbation, the results would be difficult to interpret because many such patterns would need to be analyzed and, as already stated, the resulting distribution has little meaning.
For each gene, we solved the gene expression commitment map twice, assigning either the maximum or the minimum expression value to the un-annotated cells. In comparing these commitment maps, we found that the shared branches showed similar predictions. Supplemental Figure 6A, third and fourth rows, shows an example for C08B11.3.
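The perturbation test described above can be sketched for a single gene as follows. This is a hedged illustration rather than the analysis code: the toy lineage, the node names and the expression values are hypothetical, the extreme values are taken to be the maximum and minimum of the observed values for that gene (an assumption), and the commitment values are solved with squared-change averaging for simplicity instead of the linear scoring reported in the main text.

```python
# Hypothetical toy lineage: sublineage S is annotated, sublineage U is not.
CHILDREN = {"root": ("S", "U"), "S": ("s1", "s2"), "U": ("u1", "u2")}
# Made-up expression values for one gene; None marks an un-annotated cell.
OBSERVED = {"s1": 0.0, "s2": 1.0, "u1": None, "u2": None}
# Branches present in both the annotated-only tree and the full tree.
SHARED_BRANCHES = [("S", "s1"), ("S", "s2")]


def fill(observed, mode):
    """Give every un-annotated leaf the max or min observed value (assumption)."""
    known = [v for v in observed.values() if v is not None]
    extreme = max(known) if mode == "max" else min(known)
    return {cell: (extreme if v is None else v) for cell, v in observed.items()}


def solve(leaves, sweeps=200):
    """Squared-change optimum by neighbor averaging (a simplification)."""
    parent = {c: p for p, kids in CHILDREN.items() for c in kids}
    values = dict(leaves)
    values.update({node: 0.0 for node in CHILDREN})
    for _ in range(sweeps):
        for node, kids in CHILDREN.items():
            neighbors = [values[k] for k in kids]
            if node in parent:
                neighbors.append(values[parent[node]])
            values[node] = sum(neighbors) / len(neighbors)
    return values


def branch_changes(values):
    """Change in commitment (daughter minus parent) along each shared branch."""
    return {(p, c): values[c] - values[p] for p, c in SHARED_BRANCHES}


high = branch_changes(solve(fill(OBSERVED, "max")))
low = branch_changes(solve(fill(OBSERVED, "min")))
for branch in SHARED_BRANCHES:
    print(branch, round(high[branch], 3), round(low[branch], 3))
print("largest shift on a shared branch:",
      round(max(abs(high[b] - low[b]) for b in SHARED_BRANCHES), 3))
```

In this toy case the sign of the change on each shared branch is the same under both extreme fills, while the value at the node ancestral to the un-annotated sublineage shifts between them.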
We then combined the results from all 93 genes to form the molecular divergence map. We compared the original map, which uses data only from annotated cells, to the two maps modeled using the maximum and minimum expression data for the un-annotated cells. We find that the vast majority of the shared branches between the molecular divergence maps remain consistent, indicating that the tree shown in Figure 5 in the main text is robust to very large changes in expression in the un-annotated cells (Supplemental Figure 6B, third and fourth rows). This is largely because the un-annotated cells are often segregated into entire subtrees. However, the un-annotated cells may affect certain nodes in the tree, particularly those that are ancestral to an un-annotated sublineage.