Sequential t-SNE profiling
Supplemental Fig. 1 summarizes our previous work [13] regarding the extent to which t-SNE-aided clustering of transcripts from 15 pathways with established roles in cancer [14, 16, 17, 18] can be used to predict long-term survival differences across all 34 cancer types representing 10,227 individual tumors from TCGA [9, 12, 13]. As an example, the analysis of 514 kidney clear cell carcinomas (KIRC) with the 23 transcripts comprising the Pyrimidine Biosynthesis Pathway generated two distinct t-SNE clusters containing nearly identical tumor numbers and associated with highly significant median survival differences (2090 days vs. > 4500 days, P = 5.6 × 10^-7, Fig. 1a and b and ref. [13]). A similar analysis performed on the same tumors with the 30 transcripts comprising the Notch Pathway also generated two distinct t-SNE clusters associated with significant survival differences (1912 days vs. 3554 days, P = 7.0 × 10^-4, Fig. 1c and d).
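To make the kind of two-group survival comparison reported above concrete, the following is a minimal sketch of a log-rank test written with the Python standard library only. It is offered under a stated assumption: the text does not name the test used to obtain the P values, and the log-rank test is simply a common choice for comparing two survival curves. The function name `logrank_test` and the toy follow-up times are illustrative and are not taken from the study.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    times_*  : follow-up times in days
    events_* : 1 if death was observed, 0 if the individual was censored
    ASSUMPTION: the study's actual test is not stated; this is illustrative.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        at_risk = [(u, e, g) for u, e, g in data if u >= t]
        n = len(at_risk)                                  # all still at risk
        n_a = sum(1 for _, _, g in at_risk if g == 0)     # at risk in group A
        d = sum(1 for u, e, _ in at_risk if u == t and e) # deaths at time t
        d_a = sum(1 for u, e, g in at_risk if u == t and e and g == 0)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # hypergeometric variance of the group A death count at time t
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy data (invented): group A dies early, group B dies late.
chi2, p = logrank_test([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
print(f"chi-square = {chi2:.2f}, P = {p:.4f}")
```

Each cluster-versus-cluster comparison would call such a function once, passing the follow-up times and censoring flags of the tumors assigned to each t-SNE cluster.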
The fact that the above groups remained heterogeneous following t-SNE-based evaluation suggested that sequential analysis with transcripts from a second pathway might further delineate the groups. We therefore re-analyzed tumors from the two Notch Pathway t-SNE clusters shown in Fig. 1c and d with transcripts from the Pyrimidine Biosynthesis Pathway. These results (Fig. 1e and f) showed that each Notch Pathway t-SNE cluster could be further divided into distinct Pyrimidine Biosynthesis Pathway t-SNE clusters. Specifically, the original favorable survival Notch Pathway Cluster 1 (median = 3554 days, Fig. 1d) was now shown to comprise an even more favorable group with median survival > 4500 days and a significantly more unfavorable group (median survival = 2386 days, P = 5.9 × 10^-5, Fig. 1e). This latter group was comparable in its survival to each of the short-term survival groups initially delineated with a single t-SNE analysis (P > 0.05 in each case). Similarly, analysis of the original unfavorable survival Notch Pathway Cluster 2 (median = 1912 days, Fig. 1d) also identified two clusters with significant survival differences (median = 3615 days vs. 1661 days, P = 0.023, Fig. 1f).
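The two-stage logic described above (cluster all tumors with one pathway, then re-cluster each resulting group separately with a second pathway) can be sketched as follows. This is a schematic under explicit assumptions, not the pipeline used in the study: the clustering step is left pluggable, and `median_split` is a hypothetical stand-in that simply splits samples at the median of their mean pathway expression. In a real analysis that step would be a t-SNE embedding followed by cluster assignment. All sample names and expression values are invented.

```python
from collections import defaultdict


def sequential_profile(samples, first_step, second_step):
    """Cluster all samples with `first_step`, then re-cluster each
    first-stage cluster separately with `second_step`.

    samples     : dict of sample id -> {pathway name: list of expression values}
    first_step  : callable(ids, samples) -> dict of sample id -> cluster label
    second_step : same signature, applied within each first-stage cluster
    Returns a dict of (first label, second label) -> sorted list of sample ids.
    """
    first = first_step(sorted(samples), samples)
    by_first = defaultdict(list)
    for sid, label in first.items():
        by_first[label].append(sid)

    result = defaultdict(list)
    for label, ids in by_first.items():
        second = second_step(sorted(ids), samples)  # only this subgroup
        for sid, sub_label in second.items():
            result[(label, sub_label)].append(sid)
    return {key: sorted(ids) for key, ids in result.items()}


def median_split(pathway):
    """HYPOTHETICAL stand-in for 't-SNE plus cluster assignment':
    label 1 below the median of mean pathway expression, label 2 otherwise."""
    def step(ids, samples):
        scores = {s: sum(samples[s][pathway]) / len(samples[s][pathway]) for s in ids}
        cutoff = sorted(scores.values())[len(scores) // 2]
        return {s: 1 if scores[s] < cutoff else 2 for s in ids}
    return step


# Invented toy expression values for two pathways.
toy_samples = {
    "T1": {"notch": [1, 1], "pyrimidine": [1, 1]},
    "T2": {"notch": [1, 2], "pyrimidine": [5, 5]},
    "T3": {"notch": [2, 1], "pyrimidine": [1, 2]},
    "T4": {"notch": [2, 2], "pyrimidine": [6, 5]},
    "T5": {"notch": [8, 8], "pyrimidine": [1, 1]},
    "T6": {"notch": [8, 9], "pyrimidine": [5, 6]},
    "T7": {"notch": [9, 8], "pyrimidine": [2, 1]},
    "T8": {"notch": [9, 9], "pyrimidine": [6, 6]},
}

groups = sequential_profile(
    toy_samples, median_split("notch"), median_split("pyrimidine")
)
for key in sorted(groups):
    print(key, groups[key])
```

The point of the sketch is that the second clustering is fitted only to the members of each first-stage cluster, so the resulting sub-clusters are specific to that subgroup rather than inherited from a clustering of the whole cohort.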
We next analyzed 511 low-grade gliomas (LGGs) using a similar sequential approach. Initial t-SNE profiling with transcripts from the Notch Pathway identified two distinct Clusters with significant median long-term survival differences (3978 days vs. 1891 days, Fig. 1g and h, P = 3.0 × 10^-4). Analysis of the same tumors using the 25 transcripts from the Wnt Pathway produced four distinct t-SNE clusters (Fig. 1i). Of these, Cluster 1 individuals survived longer relative to Clusters 2 and 3 (median survival = 3200 days vs. 1915 days and 2433 days, respectively, P = 2.0 × 10^-4 in each case).
Clusters 1 and 2 each contained a sufficiently large tumor population to allow a meaningful second analysis to be performed with transcripts from the Notch Pathway. In the case of Wnt Cluster 1, this produced the expected two t-SNE Clusters similar to those seen in Fig. 1g (not shown) with significant differences in their median long-term survival (4695 days vs. 1933 days, P = 4.1 × 10^-5, Fig. 1k). A similar sequential analysis of the unfavorable-survival Wnt Pathway Cluster 2 from Fig. 1i also produced two Notch Pathway t-SNE clusters with significantly different long-term survival of 4084 days and 1547 days (Fig. 1l, P = 0.01). A comparison of each of the favorable and unfavorable populations from Fig. 1k and l indicated significant differences in median survival (4695 days vs. 4084 days, P = 0.0034 and 1933 days vs. 1547 days, P = 0.008) as well as significant differences in survival when compared to the most and least favorable survival obtained using only single t-SNE analyses (e.g., 4695 days vs. 3978 days [Fig. 1h], P = 0.01 and 1547 days vs. 1891 days [Fig. 1h], P = 0.04). Thus, unlike KIRCs, where a second t-SNE analysis further subdivided groups into additional favorable or unfavorable long-term survival cohorts, neither of which survived significantly longer or shorter than those identified by a single t-SNE analysis, the sequential t-SNE profiling of LGGs did identify patient subsets whose particularly favorable or unfavorable survival was well in excess of that predicted from the individual t-SNE analyses.
To generalize these findings, we performed similar sequential t-SNE profiling on sarcomas (SARC) and kidney renal papillary cell carcinoma (KIRP) (Fig. 2). In the first case, 259 sarcomas were analyzed by t-SNE for differential expression patterns of transcripts comprising the Myc and TGF-β Pathways. Profiling of the Myc Pathway identified two t-SNE clusters with highly significant differences in median survival (1536 days [Cluster 1] vs. 2599 days [Cluster 2], P = 0.0038, Fig. 2a and b). Profiling of the TGF-β Pathway also identified two clusters with differing median survival (1649 days [Cluster 1] vs. > 4500 days [Cluster 2], P = 0.047, Fig. 2c and d). Sequential t-SNE profiling of the TGF-β Pathway’s inferior survival cluster with Myc Pathway transcripts allowed it to be subdivided into two groups with median survival of 1262 days and 2464 days (P = 0.005, Fig. 2e). Similarly, the TGF-β Pathway t-SNE Cluster 2, comprising 83 individuals with superior median survival (> 4250 days, Fig. 2d), could also be divided into two groups. However, most likely because this group lacked a sufficiently large number of tumors, the two survival curves were not determined to be significantly different despite a clear trend in that direction (median survival 2324 days vs. > 4500 days).
Analogously, t-SNE profiling of the 288 KIRPs using the 15 transcripts comprising the Cell Cycle Pathway [13] also generated two major clusters comprised of nearly identical numbers of tumors. A third t-SNE cluster comprised of only seven tumors was not analyzed further (Fig. 2g). Highly significant survival differences were observed between the first two groups (median survival > 3900 days [Cluster 1] vs. 2624 days [Cluster 2], P = 3.39 × 10^-5). t-SNE profiling of this same tumor population using the 11 transcripts comprising the Pentose Phosphate Pathway [13] also generated two distinct clusters (Fig. 2i) with borderline significant long-term survival differences (each > 3900 days, P = 0.048, Fig. 2j).
As before, significant improvements in survival prediction were achieved when the above tumor samples were subjected to sequential t-SNE analysis. Thus, when the inferior survival Pentose Phosphate Pathway Cluster 1 (Fig. 2i and j) was analyzed for the expression patterns of Cell Cycle Pathway transcripts, two t-SNE clusters with significantly different long-term median survival were obtained (1498 days vs. > 3000 days, P = 2.2 × 10^-5, Fig. 2k). The less favorable group’s 1498-day median survival time was significantly shorter than that of either of the less favorable groups from Cell Cycle Pathway and Pentose Phosphate Pathway t-SNE clusters [1498 days vs. 2624 days, P = 0.05 (Fig. 2h) and > 5900 days, P = 0.01 (Fig. 2j)]. Sequential t-SNE profiling of the favorable survival Pentose Phosphate Pathway Cluster 2 (Fig. 2i and j) with Cell Cycle Pathway transcripts did not demonstrate significant differences in the median survival times between the two resulting groups, most likely due to sample number limitations and/or survival time constraints. Nevertheless, a clear trend was observed with 87% of the “favorable group” individuals (n = 88) remaining alive at ~ 3000 days versus only 55% of the “unfavorable group” individuals (n = 70) (Fig. 2l).
Finally, we undertook a third analysis of ovarian (OV) and uterine corpus endometrial cancers (UCEC) whose t-SNE profiles were somewhat more complex and showed less pronounced inter-Cluster survival differences when interrogated with the transcripts of single pathways. For example, ovarian cancers generated four t-SNE clusters with Pyrimidine Biosynthesis Pathway transcripts (Fig. 3a). Of these, only Clusters 1 and 4 showed even borderline significant differences in their median long-term survival (1492 days vs. 1336 days, P = 0.05, Fig. 3b). Analysis of the same tumors using Cell Cycle Pathway transcripts also generated four distinct t-SNE clusters (Fig. 3c), with only Clusters 2 and 4 demonstrating modestly significant differences in median survival (1484 days vs. 1187 days, P = 0.034).
We sequentially profiled Cell Cycle Pathway Cluster 3 (median survival 1341 days, Fig. 3c) with Pyrimidine Biosynthesis Pathway transcripts. Due to the small size of the original Cell Cycle Pathway Cluster 3 (37 tumors) and the fact that its secondary analysis yielded four Pyrimidine Biosynthesis Pathway clusters, it was difficult to achieve a high degree of statistical significance among the four groups. Nevertheless, Clusters 2 and 4 showed significant differences in median survival (1946 days vs. 1341 days, respectively, P = 0.02, Fig. 3e). The much larger, 161-member Cell Cycle Cluster 2 (median survival 1484 days, Fig. 3d) could also be further subdivided into four Pyrimidine Biosynthesis Pathway Clusters with significant median survival differences between some groups (Fig. 3f). For example, Cluster 1 (median survival 1736 days) showed significantly longer survival relative to both Cluster 3 (1336 days, P = 0.017) and Cluster 4 (1213 days, P = 0.004).
t-SNE profiling of Myc Pathway transcripts applied to 547 UCECs generated three distinct clusters (Fig. 3g) with Cluster 3 demonstrating a clear inferior median survival (3112 days) relative to the other two Clusters [each > 4000 days, P = 9.0 × 10^-4 (Cluster 1) and 1.0 × 10^-4 (Cluster 2), respectively]. Profiling with Wnt Pathway transcripts generated four clusters (Fig. 3i), with Cluster 1 having inferior median survival (3423 days) relative to Clusters 2 and 3 (> 3000 and > 3900 days, P = 0.043 and P = 0.008, respectively, Fig. 3j) and Cluster 3 showing a longer survival relative to Cluster 4 (P = 0.04).
Despite the favorable 82% long-term survival of Wnt Pathway Cluster 2 individuals (Fig. 3j), they could be further stratified into the expected three clusters following sequential analysis with Myc Pathway transcripts (not shown). Although the median survival of these clusters could not be determined, Cluster 3, which contained approximately one-fifth of the individuals, showed significantly inferior survival relative to the other two Clusters (P = 0.007, Fig. 3k). Similarly, the subdivision of poor survival Wnt t-SNE Cluster 1 (Fig. 3j) using Myc Pathway transcripts identified one subgroup (Cluster 3, Fig. 3l) with particularly poor median survival (3112 days) relative to the other two Clusters (P = 0.034 and P = 0.037, respectively).
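A median survival that “could not be determined”, or that is reported only as a lower bound such as “> 4500 days”, typically arises when the estimated survival curve never falls to 50% within the available follow-up. The sketch below illustrates this with a standard Kaplan-Meier estimator written in plain Python. It assumes, since the text does not say so explicitly, that median survival was read from Kaplan-Meier curves; the function names and the toy follow-up data are invented for illustration.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier curve as a list of (time, survival probability).

    times  : follow-up times in days
    events : 1 if death was observed, 0 if the individual was censored
    """
    curve, surv = [], 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve


def median_survival(times, events):
    """First time at which the curve is at or below 0.5.

    Returns None when the curve never reaches 0.5, i.e. the median
    is not reached within the available follow-up.
    """
    for t, surv in kaplan_meier(times, events):
        if surv <= 0.5:
            return t
    return None


# Invented favorable group: two deaths, six individuals censored while alive.
fav_times = [500, 1200, 2000, 3000, 3500, 4000, 4200, 4500]
fav_events = [1, 0, 1, 0, 0, 0, 0, 0]

# Invented unfavorable group: every individual died during follow-up.
unfav_times = [100, 200, 300, 400, 500]
unfav_events = [1, 1, 1, 1, 1]

print(median_survival(fav_times, fav_events))      # None: median not reached
print(median_survival(unfav_times, unfav_events))  # 300
```

In the favorable toy group the estimated survival stays above 50% through the longest follow-up (4500 days), so the median can only be stated as exceeding that time, even though the curves of two such groups can still differ significantly.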
In summary, the serial use of t-SNE to sub-classify expression patterns of transcripts from select cancer-related pathways made it possible to analyze multiple tumor types so as to achieve a higher degree of survival stratification than could be achieved with only a single t-SNE analysis. Even after initial single-pathway analyses, tumor cohorts remained heterogeneous with regard to their cumulative long-term survival.
Sequential hierarchical clustering/t-SNE profiling
Numerous studies have indicated that otherwise histologically similar tumors may nonetheless display distinct differences in their transcriptomes that correlate with survival and/or other behaviors [19,20,21,22,23,24]. We recently showed for some cancers that the ability to predict survival using this approach could be improved when combined with t-SNE profiling [13]. We decided to extend these findings by including a more comprehensive evaluation of all cancers in TCGA for which whole transcriptome profiling was available.
Hierarchical clustering of the previously described LGG transcriptomes allowed the tumors to be divided into four groups [19], termed “Dendros 1–4” or “D1-D4” (Fig. 4a), with individuals in D2 having a particularly poor long-term survival relative to the others (P < 3.1 × 10^-8). None of the remaining three Dendros showed any significant differences in survival (Fig. 4b).
Profiling the entire LGG group with 93 transcripts from four cancer-related pathways (Pyrimidine Biosynthesis, Hippo, PI3-kinase signaling and Wnt signaling) produced four t-SNE clusters in each case (not shown but see ref. [13]). When these Clusters were matched to the individual tumors in each of the Dendros, several non-random associations were seen. For example, t-SNE Cluster 1 of the Hippo Pathway contributed disproportionately to the Dendro 3 subset (P = 1.03 × 10^-15), whereas t-SNE Cluster 3 of the Hippo Pathway and t-SNE Cluster 3 of the PI3-kinase family of transcripts contributed disproportionately to the Dendro 2 group (P = 1.4 × 10^-10 and P = 2.95 × 10^-7, respectively) (Fig. 4a).
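One way to quantify whether a t-SNE Cluster contributes “disproportionately” to a Dendro is a one-sided hypergeometric (Fisher-type) enrichment test, sketched below. This is an assumption made for illustration: the text does not state which test produced the association P values. The function name `enrichment_p` and the counts in the example are hypothetical and do not correspond to any cohort in the study.

```python
from math import comb


def enrichment_p(n_total, n_in_dendro, n_in_cluster, n_overlap):
    """One-sided hypergeometric P value for over-representation.

    Probability of seeing at least `n_overlap` tumors of a given t-SNE
    cluster inside a Dendro of size `n_in_dendro`, if the Dendro were a
    random draw from all `n_total` tumors.
    ASSUMPTION: illustrative choice of test, not stated in the source.
    """
    upper = min(n_in_dendro, n_in_cluster)
    tail = sum(
        comb(n_in_cluster, k) * comb(n_total - n_in_cluster, n_in_dendro - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_total, n_in_dendro)


# Hypothetical counts: 100 tumors, a Dendro of 25, a cluster of 20,
# and 15 tumors belonging to both.
print(enrichment_p(100, 25, 20, 15))
```

Under random assignment only about 5 of the 20 cluster members would be expected in a Dendro holding a quarter of the tumors, so an overlap of 15 yields a very small P value.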
We next compared the survival of individuals in each t-SNE Cluster, either collectively or within the context of individual Dendro groups. In the first case, we found all tumors associated with Pyrimidine Biosynthesis Pathway t-SNE Cluster 1 to be associated with significantly shorter survival relative to the other t-SNE Clusters (P = 9.14 × 10^-6 to 6.25 × 10^-9, Fig. 4c). This was consistent with the disproportionate representation of these Cluster 1 tumors within Dendro 2 (P = 3.16 × 10^-22). Indeed, the only remaining Cluster 1 tumors were associated with Dendro 4 and, while few in number (n = 7), the individuals in this group had a particularly short survival relative to those with tumors in the other t-SNE Clusters comprising this Dendro (P = 0.0012 to 4.3 × 10^-7, Fig. 4d).
Hippo Pathway Cluster 4 tumors also contributed disproportionately to Dendro 2 (P = 1.44 × 10^-12). Consistent with this, Cluster 4, both overall and in its Dendro 2 context, was associated with the shortest survival relative to the other t-SNE Clusters (P = 0.023 to 6.8 × 10^-11, Fig. 4e and f). The only remaining Hippo Pathway Cluster 4 tumors were associated with Dendro 4. While associated with extremely short survival, they were too few in number (n = 2) to allow a reliable statement concerning the significance of this. However, individuals with tumors in Dendro 4 (median survival = 2433 d) could be further distinguished by a long-term survival t-SNE Cluster 1 (median survival = 3470 d) and a shorter-term survival t-SNE Cluster 3 (median survival = 1547 d) (Fig. 4g).
Similar associations could be made in the case of PI3-kinase Pathway transcripts where, across all tumors, t-SNE Cluster 2 individuals had longer survival than either Cluster 1 or Cluster 3 individuals (P = 7.0 × 10^-4 and 7.1 × 10^-6, respectively, Fig. 4h). Additionally, t-SNE Clusters 1 and 2 clearly could be used to further delineate survival within the Dendro 4 cohort (median survival = 1891 d vs. 3200 d, respectively, P = 0.03, Fig. 4i).
Finally, the four t-SNE Clusters generated from Wnt Signaling Pathway transcripts were associated with significant differences in survival across all tumors (Fig. 4j). Among the most significant of these were the inferior survival of individuals with tumors in Cluster 1 vs. Cluster 2 and Cluster 1 vs. Cluster 3 (P = 2.0 × 10^-4 in each case). Furthermore, the survival difference between Clusters 1 and 3 could be utilized in an analysis of the Dendro 4 cohort to improve overall survival prediction within this group (median survival 3200 d vs. 2235 d, P = 0.05, Fig. 4k).
Another example in which the tandem sequential hierarchical clustering/t-SNE approach was found to be particularly useful in allowing more refined stratification of patient survival was seen in the case of 374 hepatocellular carcinomas (HCCs). For these tumors, hierarchical clustering generated six Dendros which showed only relatively modest survival differences (Dendro 1 vs. Dendro 4, P = 0.021, Fig. 5a and b). t-SNE profiling with four pathways (Purine Biosynthesis, Pyrimidine Biosynthesis, PI3-kinase signaling and TGF-β signaling), performed either alone or sequentially on each Dendro, was far more useful in identifying subsets of patients with particularly favorable or unfavorable long-term survival. For example, t-SNE profiling alone of all tumors with Purine Biosynthesis Pathway transcripts identified three Clusters with significant differences between Clusters 1 and 2 (median survival = 1229 d vs. 2116 d, respectively, P = 0.01 and ref. [13]) and Clusters 2 and 3 (median survival = 2116 days vs. 1694 days, respectively, P = 0.035) (Fig. 5c). When t-SNE profiling with Purine Biosynthesis Pathway transcripts was applied to Dendro 3, however, much more substantive differences in survival were observed, with Clusters 1 and 2 showing median survivals of 643 days and > 3500 days (P = 0.007) and Clusters 2 and 3 demonstrating median survivals of > 3500 days and 837 days (P = 0.01) (Fig. 5d).
t-SNE profiling of all HCCs with transcripts from the Pyrimidine Biosynthesis Pathway generated two Clusters with significant survival differences (2131 days vs. 1397 days, P = 0.04, Fig. 5e). However, the 734-day difference in these median survivals was substantially extended to 1283 days when the Dendro 6 cohort of patients was divided according to t-SNE cluster, where median survivals of 2131 days and 848 days were obtained (P = 0.008) (Fig. 5f).
Additional t-SNE profiling of PI3-kinase Pathway signaling transcripts was also found to be useful when used to evaluate all HCCs. Three clusters were identified, with significant survival differences observed between Clusters 1 and 3 (1397 days vs. 2456 days, P = 0.014) and between Clusters 2 and 3 (1490 days vs. 2456 days, P = 0.011) (Fig. 5g). As before, increased survival stratification was achieved when t-SNE profiling was applied to Dendro 2, where Clusters 1 and 3 showed median survivals of 425 days vs. > 3500 days (P = 0.034) (Fig. 5h). When applied to Dendro 3, Clusters 1 and 3 showed similarly large disparities in median survival (802 days vs. > 3500 days, respectively, P = 0.034) (Fig. 5i).
Lastly, the three t-SNE Clusters of TGF-β Pathway transcripts were associated with differential survival among all individuals with HCC (Fig. 5j and ref. [13]). Significant differences in median survival were observed for Clusters 1 vs. 2 (1397 days and 2131 days, respectively, P = 0.016) and for Clusters 2 and 3 (2131 days vs. 1423 days, P = 0.025). However, when applied only to the Dendro 4 group, t-SNE profiling of TGF-β Pathway transcripts was able to discern highly significant survival differences between Clusters 1 and 2 (median survival = 1271 days vs. > 3500 days, P = 0.009, Fig. 5k).
A comprehensive, interactive collection of human cancers amenable to sequential analysis
Given the ability of sequential profiling to improve survival stratification, we constructed an interactive website (https://chpupsom19.shinyapps.io/Survival_Analysis_tsne_umap_TCGA and https://github.com/RavulaPitt/Sequential-t-SNE/) that allows the transcriptional profiles of > 10,000 specimens from 34 different human cancers in TCGA to be queried using either of the approaches described above. In addition to the limited number of examples shown here (Figs. 1, 2, and 3), this website allows for the sequential t-SNE analysis of all tumor groups in TCGA using any of the pathways that revealed survival differences among t-SNE clusters (Suppl. Fig. 1 and ref. [13]). An additional section of the website permits tumors whose whole transcriptome profiles correlate with survival differences to be secondarily analyzed by t-SNE (Figs. 4 and 5). This is particularly useful for some of the larger TCGA cancer cohorts such as KIRC, breast cancer and non-small cell lung cancer, where well over 500 well-curated samples in each group are available. Factors other than the total sample size that can limit the robustness of these types of analyses include the number of Dendros and t-SNE Clusters.
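The effect of the number of Dendros and t-SNE Clusters on robustness can be illustrated with a back-of-the-envelope calculation of the average subgroup size left after both subdivisions. The sketch below uses cohort sizes and group counts quoted in the text (511 LGGs with four Dendros and four t-SNE Clusters; 374 HCCs with six Dendros and three t-SNE Clusters). The helper name `mean_subgroup_size` is hypothetical, and the assumption of evenly sized subgroups is a simplification: actual subgroups are uneven, with some containing only a handful of tumors.

```python
def mean_subgroup_size(n_tumors, n_dendros, n_clusters):
    """Average number of tumors per (Dendro, t-SNE Cluster) combination,
    ASSUMING tumors were spread evenly (a simplification for illustration)."""
    return n_tumors / (n_dendros * n_clusters)


# Cohort sizes and group counts as quoted in the text.
print(mean_subgroup_size(511, 4, 4))            # LGG: 31.9375
print(round(mean_subgroup_size(374, 6, 3), 1))  # HCC: 20.8
```

Because the average subgroup shrinks with the product of the two group counts, a cohort of several hundred tumors can quickly yield subgroups too small for well-powered survival comparisons.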