Improved Personalized Survival Prediction of Patients With Diffuse Large B-cell Lymphoma Using Gene Expression Proling

Background 30-40% of patients with Diffuse Large B-cell Lymphoma (DLBCL) have an adverse clinical evolution. The increased understanding of DLBCL biology has shed light on the clinical evolution of this pathology, leading to the discovery of prognostic factors based on gene expression data, genomic rearrangements and mutational subgroups. Nevertheless, additional efforts are needed in order to enable survival predictions at the patient level. This study investigated new machine learning models of survival based on transcriptomic and clinical data. in different publicly available retrospective cohorts were Cox regression and clustering were in to identify associated with overall on was to compare the was Results were in independent


Methods
Gene expression pro ling (GEP) in 2 different publicly available retrospective cohorts were analyzed. Cox regression and unsupervised clustering were performed in order to identify probes associated with overall survival on the largest cohort. Random forests were created to model survival using combinations of GEP data, COO classi cation and clinical information. Cross-validation was used to compare model results in the training set, and Harrel's concordance index (c-index) was used to assess model's predictability.
Results were validated in an independent test set.
Results 233 and 64 patients were included in the training and test set, respectively. Initially we derived and validated a 4-gene expression clusterization that was independently associated with lower survival in 20% of patients. These genes were TNFRSF9, BIRC3, BCL2L1 and G3BP2. Thereafter, we applied machine-learning models to predict survival. A set of 102 genes was highly predictive of disease outcome, outperforming available clinical information and COO classi cation. The nal best model integrated clinical information, COO classi cation, 4-gene-based clusterization and 50 gene expression data (training set c-index, 0.8404, test set c-index, 0.7942).

Conclusion
This study indicates that modelling DLBCL survival with transcriptomic-based machine learning algorithms can largely outperform other important prognostic variables such as disease stage and COO.

Background
Diffuse Large B-cell Lymphoma (DLBCL) is the most frequent type of lymphoma, accounting for 25% of all cases of non-Hodgkin lymphoma (NHL). DLBCL has an estimated incidence in the United States of 6.9 new cases per 100,000 people/year [1]. Despite its aggressivity, 60-70% of patients achieve curation after rst-line immunochemotherapy with R-CHOP (rituximab, cyclophosphamide, doxorubicin, vincristine, prednisone) [2]. Nevertheless, the remaining 30-40% of cases exhibit relapsed or refractory disease which frequently precludes a dismal prognosis [3].
Improved biological characterization of DLBCL has led to the identi cation of new disease subtypes with prognostic implications. DLBCL cases with dual rearrangement of MYC and BCL2 and/or BCL6, frequently named "double-hit" lymphomas, are associated with signi cantly shorter survival and have been reclassi ed as a new group of lymphomas by the World Health Organization [4,5]. Similarly, using gene expression pro ling (GEP), DLBCL can be classi ed in two broad groups by their cell-of-origin (COO) status, namely germinal center B-cell (GCB)-like and activated B-cell (ABC)-like. Those among the latter show an adverse prognosis with respect to the GCB-like DLBCLs [6]. More recently, different groups reported the identi cation of new DLBCL subgroups based on co-occurrent genomic alterations [7,8], paving the path towards a more individualized approach to this disease.
In the meantime, the emergence of arti cial intelligence has brought new expectations to the eld of medicine, particularly for disease diagnosis and prognostication. Classical models such as cox proportional hazard model and the log-rank test assume that patient outcome consists of a linear combination of covariates, and do not provide decision rules for prediction in the real-world [9]. On the contrary, machine learning is a eld of arti cial intelligence that performs outcome prediction based on complex interactions between multiple variables. Machine learning makes little assumptions about the relationship between the dependent and independent variables [10]. In machine learning, a model is trained with examples and not programmed with human-made rules [11]. In the case of survival data, machine learning needs to take into account the time to event and censoring of the data.
Machine learning (ML) has been applied to predict survival in different clinical scenarios with encouraging results. The implementation of ML-based survival models is increasingly popular in order to provide patient-centered risk information that can assist both the clinician and the patient. Kim et al. (2019) recently published a deep-learning model that uses clinical parameters to predict survival of oral cancer patients with high concordance with reality [12]. Similarly, random forest-based models have been created to predict 30-day mortality of spontaneous intracerebral hemorrhage [13] and overall mortality of patients with acute kidney injury or in renal transplant recipients [14,15].
In this study, we used gene expression data from DLBCL cases in order to create new models for survival prediction based on retrospective data. Initially, we sought to identify transcripts and gene expression patterns associated with prognosis. Afterwards, we used this information to t a forest model that predicts overall survival with high-concordance. Comparisons with clinical data and COO classi cation are provided. We believe that our results will facilitate the establishment of individualized survival predictions in DLBCL.

Data origin and normalization
The gene expression database GSE10846 was used for training and the gene expression database GSE23501 was used as an independent test set ( Table 1). GSE10846 contains gene expression data from whole-tissue biopsies of 420 patients diagnosed with DLBCL according to World Health Organization (WHO) 2008 criteria [16], of which we selected 233 cases treated with R-CHOP-like regimens in the rst line. GSE23501 contains 69 DLBCL whole-tissue biopsies of patients treated with R-CHOP-like regimens as a rst line [17]. Both studies used Affymetrix HG U133 plus 2.0 arrays for gene expression quanti cation. As the data from GSE23501 depends from British Columbia biobanks and part of the data from GSE10846 also originated from the same location, we used Spearman correlation to rule out duplicate samples. Indeed we detected 4 samples with almost perfect correlation (>0.99) which we treated as duplicates and were removed from downstream analysis. A case treated with rituximab, doxorubicin, bleomycin, vinblastine and dacarbazine was also discarded, making a nal validation set of 64 cases. No other pairs of samples were strongly correlated at the gene expression level (>0.9). COO classi cation was originally deposited with gene expression data, and in both cases this classi cation was inferred exclusively from gene expression data. Log2-transformed expression data for both cohorts were obtained from the Gene Expression Omnibus (GEO) database [18]. Rank normalization was applied to the data in order to make the results comparable.

Clusterization
The Mclust algorithm [19] was used in order to detect the 2 most likely patient clusters according to the expression of each probe (Mclust function, parameter G = 2). Brie y, the Mclust algorithm determines the most likely set of clusters according to geometric properties (distribution, volume, and shape). An expectation-maximization algorithm is used for maximum likelihood estimation, and the best model is selected according to Bayes information criteria. The association of each of these probe-level clusters with overall survival was calculated using cox regression. Thereafter, those probes whose clusterization was signi cantly associated with survival (Bonferroni adjusted p-value <0.05) were selected for multivariate clusterization using the same Mclust algorithm. Cluster prediction was performed on the test set using parameters estimated in the training cohort, and cox regression was used to verify the association of this clusterization with overall survival. The Shoenfeld's test was used to assess the proportional hazards assumption.

Random Forest Survival Analysis
We initially tested the association of each probe with overall survival in the training set using multivariate cox regression. Schoenfeld's model was used to assess the proportional hazards assumption. Those probes which violated this assumption (p-value < 0.05) were discarded from further analysis.
Random forest survival models were created with the rfsrc function implemented in the randomForestSRC package in R [20]. We decided to use this type of model because, in contrast with deep networks, random forest can quantify the relative importance of each variable, and thus enable the ltering of low-importance variables for model reduction and performance improvement. Parameter tuning was performed using the tune.rfscr function, which optimizes the mtry and nnodes variables. Random forests were implemented on survival data of the training cohort. Bootstrapping without replacement was performed with the default by.node protocol. Continuous rank probability score (CRPS) was calculated as the integrated Brier score divided by time. Survival prediction on the test cohort was performed using the predict.rfsrc function with default parameters. Harrel's concordance index (c-index) was used to assess model discriminative power on the bootstrapped training set and on the test set. Cindex re ects to what extent a model predicts the order of events (e.g., deaths) in a cohort [21]. C-indexes below 0.5 indicate poor prediction accuracy, c-indexes near 0.5 indicate random guessing and c-indexes of 1 represent perfect predictions.
Variable reduction was performed by iteratively removing those variables with low importance. Variable importance was calculated with the vimp function, and we iteratively removed those samples with negative or low weight (importance < 1 x 10 -4 ). The number of random splits to consider for each candidate splitting variable ("nsplit") was optimized by testing the performance of the algorithm in the training set with values in the range of 1 to 50 splits. Finally, we used the best model in terms of c-index for replication in the validation set.

Gene expression-based clusterization
Single probe clusterization revealed the existence of four probes strongly associated with overall survival (Bonferroni p-value <0.05). These probes corresponded to the following genes: TNFRSF9, BIRC3, BCL2L1 and G3BP2. Two of these genes were signi cantly associated with survival in the test set, namely TNFRSF9 (p-value 0.04) and BCL2L1 (p-value 8.59 x 10 -3 ).
Multivariate clusterization using the 4 genes identi ed a cluster of 21.46% of patients signi cantly associated with worse prognosis (p-value 1.95 x 10 -6 , Hazard Ratio (HR) 3.53, 95% con dence interval (CI) HR 2.01-5.93; Figure 1A, Figure 2A). Furthermore, multivariate association evidenced a signi cant effect independently of patient sex, age, Ann Arbor stage (I-IV) and COO classi cation (p-value 2.06 x 10-9, HR 6.93, 95% CI HR 3.68-13.06). Cluster prediction on the independent test set classi ed a group of 20.31% of the patients in this cluster, and multivariate regression con rmed an independently signi cant adverse outcome for these (p-value 5.43 x 10 -3 , HR 6.80, 95% CI HR 1.76-26.26, Figure 1B, Figure 2B). Patient characteristics for botch clusters in the two cohorts can be consulted in Table 2.

Survival Prediction Using Random Forests
Clinical and molecular biology parameters were used to predict survival using random forests survival models. Initially, we tested the accuracy of the model using clinical data (patient sex, age and Ann Arbor stage), rendering C-indexes of 0.6340 and 0.6202 in the training and test cohorts, respectively (Table 3). Adding COO classi cation to the model improved concordance moderately (training c-index=0.6761, test c-index=0.6837), and including the previously described 4-gene expression-based clusterization increased discrimination capacity furhter (training c-index, 0.7059; test c-index, 0.7221).
Afterwards, we studied survival predictability using expression data of those genes associated with overall survival (Supplementary Table 1). We initially analyzed different sets of genes in order to select the best combination. Survival prediction with those genes associated with survival at 3 different signi cance thresholds were selected: univariate cox q-value below 0.01 (GEP_0.01), 0.05 (GEP_0.05) and 0.1 (GEP_0.1). GEP_0.01 (3 genes) performed poorly (training c-index=0.5934, test c-index=0.6301). GEP_0.05 (12 genes) improved predictability (training c-index 0.7530, test c-index 0.6649). Notwhistandintly, the best prediction accuracy was achieved using GEP_0.1 (102 genes, Supplementary Table 2). This model achieved a high concordance with survival in the bootstrapped training cohort (cindex 0.7783) and in the test cohort (0.7415). Interestingly, only 6 of the genes included in this pattern match that of the Nanostring COO assay [22].
Finally, we tested several combinations of GEP-based variables and clinical information ( Table 3). The best model included clinical data, GEP_0.1, 4-gene expression clusterization and COO classi cation (cindexes of 0.8051 and 0.7615 after parameter optimization in the training and test sets, respectively). By iteratively removing variables with negative or low importance values (< 1 x 10 -4 ) and tuning the "nsplit" parameter in the training cohort, an improved model was constructed based on 54 items (Supplementary Table 3), achieving concordance measures in the training set of 0.8404, and in the test set of 0.7942.
Predicted individual survival curves according to this model for patients in both cohorts are represented in Figure 3. Training set out-of-bag CRPS is represented in Supplementary Figure 1. Notably, the importance of MS4A4A expression (probe id: 1555728_s_at) was the highest of all variables, followed by that of 4gene expression clusterization. The expression of SLIT2 (probe id: 230130_at), NEAT1 (probe id: 220983_s_at), CPT1A (probe id: 203633_at), IGSF9 (probe id: 229276_at) and CD302 (probe id: 205668_at) were superior to that of COO classi cation.

Discussion
In this study we present a random forest-based model to predict survival in DLBCL based on clinical and gene expression data. Using cox regression and unsupervised clustering we identi ed a set of transcripts and a 4-gene expression cluster associated with overall survival. This information was used to t predictive models of survival using random forests. The best model outperformed some of the most important prognostic factors known in the eld of DLBCL. Moreover, its combination with clinical information and COO classi cation rendered survival predictions that show high concordance with reality.
The importance of gene expression biomarkers in DLBCL has been known for a long time. The COO classi cation was described almost two decades ago, linking DLBCL cellular ontogeny with clinical outcome [23]. Similarly, the prognostic role of double-expressor DLBCLs (DLBCLs with high expression of MYC and BCL2 or BCL6 but no accompanying genomic rearrangement) was described several years ago [24]. Recent studies have recently reported interesting prognostic patterns using GEP in the eld.  [27]. Our results add to the growing evidence that improved transcriptome-based risk strati cation beyond classical biomarkers is possible. Importantly, the 4-gene expression clusterization described here includes important driver genes of lymphomagenesis, such as TNFRSF9 [28], BIRC3 [29] and BCL2L1 [30].
Other interesting studies have reported important advances in DLBCL risk strati cation. Reddy et al (2017) used exome-sequencing data to create a genomic pro le that improved state-of-the-art prognostic models [31]. Nevertheless, their study was centered in prognostic groups rather than individualized predictions. In the same line, the accuracy of gene expression classi ers [25][26][27] for making personalized predictions was not tested. Recently, machine learning techniques were used by Biccler et al. (2018) for individualized survival prediction in DLBCL. They reported a stacking approach that incorporated clinical and analytical variables to predict survival in DLBCL patients from Denmark and Sweden, achieving high performance (training cohort cross-validated c-index, 0.76; test cohort c-index, 0.74) [32]. In comparison, the results of our GEP-based random forest model show a superior performance in terms of concordance. Surprisingly, we observed that transcriptomic data alone outperforms the combination of COO classi cation and limited clinical data. Another advantage of random forests is the quanti cation of variable importance. In this case, it is notable that variable importance for 6 transcripts was superior to that of COO classi cation.
To our knowledge this is the rst approach to combine GEP with arti cial intelligence for the survival prediction of DLBCL patients. Machine learning models come along with substantial bene ts in the area of survival prediction. Firstly, there is no prior assumption about data distribution, and complex interactions between the variables can be modelled. Secondly, they do not simply rely on pre-de ned assumptions about the pathology (for example, COO status). Finally, gathered information is used to directly predict patient outcome, and individualized survival curves are obtained. These personalized approaches overcome the imperfect patient subgrouping derived from classical studies, and thus they are more useful in clinical practice. Our results might be particularly useful in order to select high-risk patients for inclusion in clinical trials.
This study, like many others in the eld of disease prognostication, has some limitations. Firstly, some important prognostic features were not available for this study, such as fragility scores, International Prognostic Index (IPI), NCCN-IPI and "double-hit" status. Although the IPI has proven to improve prognostic strati cation of gene expression arrays [16], there is still room for improvement of its predictive accuracy. In this line, the suboptimal performance of IPI and NCCN-IPI must be highlighted (cindexes of 0.66 and 0.68 for IPI and NCCN-IPI, respectively; Biccler et al. 2018 [32]). Furthermore, comorbidities and cause of death were not reported in any of the two studies. Finally, competing variables such as the type of salvage therapy and/or having undergone an autologous stem cell transplantation were unknown. Additionally, some heterogeneity related to the inclusion of different high grade lymphoma subtypes (for example, double and triple-hit lymphomas) and the variability of techniques for COO classi cation used should be considered as potential limitations. Therefore, it is tempting to speculate that the combination of GEP with improved histopathological and clinical pro les will provide even better predictive models of DLBCL survival.

Conclusion
This study presents a machine learning-based model for survival prediction based on GEP data and limited clinical information of two retrospective cohorts. We demonstrate that this model outperforms classical prognostic variables such as disease stage and COO, as well as other state-of-the-art machine learning approaches. We anticipate that these individualized predictions will be useful in clinical practice and might prompt the development of novel rst-line therapeutic interventions for selected patients. This study is based on publicly available data and no ethics approval or consent to participate was needed.

Availability of data and materials
All data is available in the properly referenced data repositories.

Competing interests
The authors declare no competing interests.

Funding
The publication costs are partially funded with a grant from the Fundación Galega de Hematoloxía e Hemoterapia (FGHH).
Authors' contributions AMO designed the study and performed the research.
AMO, JADA, MCL, APR and BAR analyzed the results and wrote the paper. CAS, NAV, ABL, AAB, LBP, MSGP, MMPE, MFFR and JLBL critically evaluated the paper, made suggestions and gave nal consent for publication.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download. SupplementaryTables.xlsx