Novel bioinformatic classification system for genetic signatures identification in diffuse large B-cell lymphoma

Diffuse large B-cell lymphoma (DLBCL) is a disease spectrum comprising more than 30% of non-Hodgkin lymphomas. Although studies have identified several molecular subgroups, the heterogeneous genetic background of DLBCL remains unclear. In this study we aimed to develop a novel approach and to provide a distinctive classification system to unravel its molecular features. A cohort of 342 patient samples diagnosed with DLBCL in our hospital was retrospectively enrolled. A total of 46 genes were included in the next-generation sequencing panel. Non-mutually exclusive genetic signatures for the factorization of complex genomic patterns were generated by a random forest algorithm. Four non-mutually exclusive signatures were generated: the MYC-translocation (MYC-trans) (n = 62), BCL2-translocation (BCL2-trans) (n = 69), BCL6-translocation (BCL6-trans) (n = 108), and MYD88 and/or CD79B mutation (MC) (n = 115) signatures. Comparison between our model and the traditional mutually exclusive model of Schmitz et al. demonstrated a consistent classification pattern, and prognostic heterogeneity existed within the EZB subgroup of de novo DLBCL patients. As for prognostic impact, the MYC-trans signature was an independent unfavorable prognostic factor. Furthermore, tumors carrying three different signature markers exhibited significantly inferior prognoses compared with their counterparts with no genetic signature. Compared with traditional mutually exclusive molecular sub-classification, the non-mutually exclusive genetic fingerprint model generated in our study provides novel insight into not only the complex genetic features but also the prognostic heterogeneity of DLBCL patients.

Understanding the molecular basis of this heterogeneity may facilitate individualized management strategies.
Researchers have focused on developing a robust algorithm to discover distinct subsets and subclassify DLBCLs. In 2000, based on gene expression profiling results, Alizadeh et al. identified two molecularly distinct forms of DLBCL, germinal center B-cell-like (GCB) and activated B-cell-like (ABC), representing different stages of B-cell differentiation [6]. Nevertheless, according to the cell-of-origin (COO) classification, approximately 10-20% of DLBCLs remain unclassified, and the molecular pathogenesis of DLBCL remains obscure. The rapid development of next-generation sequencing (NGS) technology has revealed an accumulating number of recurrent genetic alterations, which has improved our understanding of the genetic landscape of DLBCL. However, these genomic studies largely focused on single types of alterations. In 2018, Schmitz et al. studied tumor specimens from 574 patients with DLBCL [4]. By developing a subclassification algorithm based on coding region mutations, copy number (CN) variations, and structural variations (SVs), they identified four distinct genetic subtypes of DLBCL: BN2 (based on BCL6 fusions and NOTCH2 mutations), N1 (based on NOTCH1 mutations), MCD (based on the co-occurrence of MYD88 L265P and CD79B mutations), and EZB (based on EZH2 mutations and BCL2 translocations). Chapuy et al. also carried out a comprehensive genetic analysis of 304 primary DLBCLs and identified five subgroups of DLBCLs with prominent genetic features (C1-C5) [5]. These studies provided a novel roadmap toward an actionable DLBCL classification for precision-medicine-based strategies.
Although several specific subgroups, such as primary DLBCL of the central nervous system (PCNSL), primary mediastinal (thymic) large B-cell lymphoma (PMBL), primary cutaneous DLBCL, leg type (PCDLBCL-LT), high-grade B-cell lymphoma, not otherwise specified (HGBL, NOS), and HGBL with MYC and BCL2 and/or BCL6 translocations (HGBL-DH/TH), are all well-defined entities, they share clinicopathologic features and genetic alterations with DLBCL, NOS according to previous studies [7][8][9][10]. In fact, they belong to a disease spectrum rather than forming discrete entities. In 2018, Scott et al. defined a clinically and biologically distinct subgroup of tumors within GCB DLBCL characterized by a gene expression signature of HGBL-DH/TH-BCL2 [11]. Simultaneously, Westhead et al. defined a molecular high-grade (MHG) group by applying a gene expression-based classifier [12]. Chapuy et al. also mentioned in their study that molecular heterogeneity existed within the C3 subgroup [5]. These studies suggest that, beyond the current definitions of double- and triple-hit DLBCLs, more specific molecular subgroups of DLBCL require further exploration.
In this study, instead of mutually exclusive subclassification, we determined several non-mutually exclusive genetic signatures for the factorization of complex genomic patterns in a continuous spectrum of B-cell lymphomas based on a random forest algorithm. Using this model, we also present a single-center, primary site-relevant mutational pattern based on a Chinese population. This study aims to develop a novel approach to understanding the molecular features, to provide a distinctive nosological insight, and to orient targeted therapeutic strategies and prognostic evaluation across the spectrum of large B-cell lymphomas.

Patients and samples
This study was approved by the institutional review board of Tongji Hospital. From June 2008 to September 2018, a cohort of 342 patient samples diagnosed with DLBCL in our hospital was retrospectively enrolled in this study. All cases were reviewed by at least three experienced hematopathologists. Most samples were pretreatment biopsies from de novo cases (n = 298), including DLBCL, NOS (n = 239), PMBL (n = 13), PCNSL (n = 7), PCDLBCL-LT (n = 6), HGBL, NOS (n = 7), and HGBL-DH/TH (n = 26). The remainder consisted of relapsed DLBCL, NOS (n = 25) after R-CHOP or CHOP-like chemotherapy and transformed follicular lymphoma (tFL) samples (n = 19). All samples were obtained from formalin-fixed paraffin-embedded (FFPE) tissue. Tumor content was estimated to be at least 30% in all subjects. Genomic DNA was extracted with the GeneRead™ DNA FFPE Kit (Qiagen) according to the manufacturer's instructions.
DLBCL samples were classified into GCB and non-GCB subtypes by the immunohistochemistry (IHC)-based Hans algorithm [13,14]. Translocations in MYC, BCL2, and BCL6 and copy number (CN) aberrations in TP53 were examined by fluorescence in situ hybridization (FISH). Cases were excluded if COO subtype or FISH findings were not available. Clinical data, including age group at onset, International Prognostic Index (IPI), primary site of lymphomagenesis, chemotherapy regimen, initial response to therapy, overall survival (OS), and progression-free survival (PFS), were collected (Table 1) (Supplementary Table S1). We restricted inclusion to individuals in whom the biopsied tissue was evaluated as the primary site of lymphomagenesis, according to integrated judgment based on PET-CT scanning, pathological findings, and clinical manifestation [15]. Response assessment criteria and the definitions of OS and PFS followed the Lugano Classification [16]. Only patients receiving R-CHOP or R-CHOP-like regimens were included in the prognostic analyses. Multivariable Cox proportional hazard regression models were used to evaluate proposed prognostic factors.
Targeted high-throughput sequencing
A total of 46 genes were selected in this study (Supplementary Table S2). Most genes were frequently altered in DLBCL according to data from several previously published large-scale DLBCL cohort studies [3][4][5]. Several infrequently mutated genes were also included because they are specifically related to particular DLBCL subtypes or related lymphomas. In detail, ID3, TCF3, and DDX3X are associated with double-hit lymphoma [17]/Burkitt lymphoma [18]; XPO1 is associated with primary mediastinal large B-cell lymphoma [8]; RRAGC and POU2AF1 are associated with follicular lymphoma [19]; and KLF2 is associated with marginal zone lymphoma [20]. In addition, the panel included genes functioning in signaling pathways that are crucial for DLBCL pathogenesis (e.g. FBXW7 for the NOTCH pathway; NRAS, KRAS, and MAP2K1 for the MAPK pathway). Using genome build hg19/GRCh37 as a reference, a sequencing panel covering the coding sequences (CDS) and 5 flanking intronic base pairs around exons in the 46 genes was designed online (DesignStudio Sequencing, Illumina, San Diego, USA). Sequencing libraries were prepared with AmpliSeq™ Library PLUS for Illumina, using 20 ng of input genomic DNA per sample. Library sequencing was performed to 2000× coverage on a NextSeq™ 550 system using an Illumina NextSeq™ 500/550 High Output v2 Kit (Illumina, San Diego, USA). Alignment and variant calling were performed using the DNA Amplicon workflow with default parameters on BaseSpace Sequence Hub (Illumina). The resulting variants were further annotated using ANNOVAR [21].
Variant filtering was performed by the following cascade of steps: 1) select exonic nonsynonymous or splice donor/acceptor site variants; 2) exclude variants with population frequency > 0.0001 in the gnomAD database unless the variant is included as a somatic variant of lymphoid neoplasm in the COSMIC database; 3) exclude variants present in an in-house curated blacklist (Supplementary Table S7); 4) exclude variants found in regions with poor coverage; and 5) exclude variants with quality less than 30 or read depth less than 20. The blacklist was built on the idea described previously by Schmitz et al. [4]: these false-positive variants were presumed to be artifacts generated either by the high-throughput sequencing platform itself or by errors in alignment or annotation of the sequencing reads by the analytical pipeline. Typically, such variants were abnormally prevalent, were identified exclusively on a specific sequencing platform, and were not recurrent variants included in the major public cancer somatic mutation database (COSMIC database, https://cancer.sanger.ac.uk/cosmic). Because such variants were unique to our center and there were no universal criteria for their identification, the blacklist was built for rapid and accurate variant screening in the future. For activation-induced cytidine deaminase (AID) somatic hypermutation (SHM) analysis, we additionally selected synonymous variants and variants in intron/UTR regions, and each variant also needed to fulfill the aforementioned criteria from step 2 to step 5 [22].
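The filtering cascade described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the pipeline actually used: the `Variant` record, its field names, and the consequence labels are hypothetical, and only the thresholds (gnomAD frequency > 0.0001, quality < 30, read depth < 20) are taken from the text.

```python
from dataclasses import dataclass

# Hypothetical consequence labels; real ANNOVAR annotations may be named differently.
CODING_CONSEQUENCES = {"nonsynonymous", "splice_donor", "splice_acceptor"}
SHM_EXTRA_CONSEQUENCES = {"synonymous", "intronic", "UTR"}


@dataclass(frozen=True)
class Variant:
    key: str               # hypothetical identifier, e.g. "GENE:c.123A>G"
    consequence: str       # one of the labels above
    gnomad_af: float       # population allele frequency (0.0 if absent)
    cosmic_lymphoid: bool  # listed in COSMIC as somatic in lymphoid neoplasm
    poor_coverage: bool    # located in a poorly covered region
    quality: float         # variant quality score
    depth: int             # read depth at the variant position


def passes_filters(v, blacklist, shm_mode=False):
    """Apply steps 1-5; in shm_mode, step 1 also admits synonymous/intron/UTR variants."""
    allowed = CODING_CONSEQUENCES | (SHM_EXTRA_CONSEQUENCES if shm_mode else set())
    if v.consequence not in allowed:
        return False  # step 1: consequence type
    if v.gnomad_af > 0.0001 and not v.cosmic_lymphoid:
        return False  # step 2: common in the population and not rescued by COSMIC
    if v.key in blacklist:
        return False  # step 3: in-house blacklist of presumed artifacts
    if v.poor_coverage:
        return False  # step 4: poorly covered region
    if v.quality < 30 or v.depth < 20:
        return False  # step 5: low quality or low depth
    return True
```

Under these assumptions, a call such as `[v for v in variants if passes_filters(v, blacklist)]` would keep the variants used for signature analysis, while `shm_mode=True` would give the broader set used for SHM analysis.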
Where matched normal DNA was available, Sanger sequencing was performed for each missense mutation that passed all preceding filters and met the following conditions: 1) variant allele frequency (VAF) more than 0.40 and less than 0.60; 2) variant not included in the COSMIC database; 3) variant included in the gnomAD database. Confirmed germline mutations were excluded from further analyses.
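The selection rule for germline confirmation can be written as a single predicate. The sketch below assumes that all three listed conditions must hold together, which the text implies but does not state explicitly; the function name and argument names are hypothetical.

```python
def needs_germline_check(variant_type, vaf, in_cosmic, in_gnomad, has_matched_normal):
    """Return True if a filtered variant should be Sanger-sequenced in normal DNA."""
    return (
        has_matched_normal               # matched normal DNA must be available
        and variant_type == "missense"   # only missense mutations are considered
        and 0.40 < vaf < 0.60            # condition 1: VAF strictly between 0.40 and 0.60
        and not in_cosmic                # condition 2: absent from COSMIC
        and in_gnomad                    # condition 3: present in gnomAD
    )
```

A VAF close to 0.5 is what a heterozygous germline variant would typically show, which is presumably why this window was chosen.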

Fluorescent in-situ hybridization analysis
Interphase fluorescence in situ hybridization (FISH) studies were performed using commercially available probes (Abbott Molecular, Downers Grove, IL, USA). The LSI IGH/IGHV (14q32), and LSI MYC (8q24) Dual Color, Break Apart Rearrangement Probes were used to detect the rearrangement of BCL2, BCL6 and MYC, respectively. A 17p13.1 (P53) probe (Vysis, Downers Grove, IL) was used to detect 17p deletion. Sample preparation and hybridization were conducted following the manufacturer's recommendations, and 200 cells were analyzed for each probe.

Bioinformatic algorithm
Artificial intelligence (AI) is the intelligence manifested by a human-made machine. It usually refers to the ability of a computer to simulate human thought processes in order to mimic human abilities or behaviors. AI not only deals with problems under pre-set rules but also develops the capability to make judgements in new situations through feature identification. The performance of these feature-driven algorithms can improve as they are exposed to more data over time, which is similar to the human learning process. Such algorithms are therefore termed machine learning. Among existing learning methods, random forests are an ensemble learning method for classification or regression that operates by constructing a multitude of decision trees at training time and outputting the class chosen by most trees (classification) or the mean prediction of the individual trees (regression). In this study, our model was trained using the R package 'randomForest'. The number of trees was set to 100; all other hyperparameters were left at their default values. Detailed information on the algorithm is given in the Supplementary Appendix.
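The aggregation step of a random forest can be illustrated in a few lines of Python. This sketch only shows how the outputs of already-trained trees are combined; it is not the R 'randomForest' implementation used in the study, and the function names are hypothetical.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_votes):
    """Classification: return the class predicted by the largest number of trees."""
    return Counter(tree_votes).most_common(1)[0][0]


def forest_regress(tree_predictions):
    """Regression: return the average of the individual tree predictions."""
    return mean(tree_predictions)
```

With 100 trees, as configured here, `tree_votes` would hold 100 class labels for a given case, and the case would receive whichever label the majority of trees assigned.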

Statistical analysis
All statistical analyses were performed in R v3.5.1. Differences in categorical variables were analyzed using Fisher's exact test. The significance of co-occurrence or mutual exclusivity was calculated using Fisher's exact test; for multiple tests, P values were adjusted using the Bonferroni method. The Kaplan-Meier method and log-rank test were used for survival analysis. Unless otherwise specified, a two-sided P value < 0.05 was considered statistically significant for all analyses.
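For readers who want to see the mechanics of the co-occurrence test, the following Python sketch computes a two-sided Fisher exact P value for a 2×2 table and applies a Bonferroni adjustment. It is an illustration only: the study itself used R, and the function names here are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def bonferroni(p_values):
    """Multiply each P value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

In a co-occurrence analysis, `a` would count cases carrying both alterations, `b` and `c` cases carrying only one of them, and `d` cases carrying neither.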
In addition, in this study we specifically focused on the analysis of the CD79B mutation pattern. In general, the variant frequency of CD79B was relatively higher than in recent related studies [3-5, 23, 24]. In detail, for the typical hotspot variant CD79B Y196, the top co-occurring mutation was MYD88 L265P (n = 37) (adjusted P value 1.09 × 10⁻⁹) in our de novo DLBCL, NOS cases, and it was mainly identified in the non-GCB subtype (4.5% of GCB cases vs. 22.6% of non-GCB cases, P = 5.87 × 10⁻⁷) (Fig. 1b). Meanwhile, we found that CD79B non-Y196 codon mutations accounted for 41% (33/80) of all CD79B mutations.
Fig. 1 Overview of the genetic features in 342 cases. a Frequency of genetic alterations that distinguish the GCB and non-GCB subtypes of 342 cases, sorted by log10 P value for the difference between the two subgroups. b The correlation among different types of MYD88 and CD79B mutations. c Circos plot depicting the correlation among different types of CD79B mutations (Y196 missense, truncating, and non-Y196 missense mutations). d Positions and types of somatic mutations encoded in CCND3 (NP_001751.1) and CD79B (NP_000617.1). e The sequence of CD79B (chr:62007140-62,006,802, GRCh37/hg19). The black arrow denotes the splice acceptor site mutation c.550-1G>A (NM_000626.4). The red arrow denotes two exposed potential splice acceptor sites. Coding sequences are highlighted by black frames. f Genetic alterations that are most related to the MYC-trans signature, BCL2-trans signature, BCL6-trans signature, and MC signature. Recurrently altered genes in GCB and non-GCB cases without our set of genetic signatures are also shown. g Venn diagrams describing the difference between cases exhibiting the initially defined signatures and cases exhibiting the extended genetic signatures obtained at convergence of an iterative random forest algorithm.
Moreover, through validation Sanger sequencing of tumor and paired normal tissue DNA, we determined several novel hotspot intron splice site mutations, including c.550-1G>A, c.550-1G>C, c.550-3_552del, c.549+1G>A, c.549+1G>C, and c.540_549+1del (Fig. 1d). Focusing on the molecular impact of the c.550-1G>A mutation, we subsequently performed RNA sequencing and revealed that this mutation exposed two novel potential splice acceptor sites, thereby producing two truncated proteins (Fig. 1e). Furthermore, we found that CD79B truncating mutations were mutually exclusive with CD79B Y196 (adjusted P value 0.011). A similar tendency was observed between CD79B truncating mutations and MYD88 L265P, although it did not reach statistical significance (adjusted P value 0.14) (Fig. 1b, Fig. 1c).

Identification of genetic signatures via iterative random forest (RF) algorithm
In this study, based on targeted sequencing results and FISH findings, we attempted to identify several non-mutually exclusive representative genetic signatures instead of categorizing subjects into several mutually exclusive distinct subgroups. We therefore seeded our analysis from five genetic alterations that participate in the most important cellular signaling pathways in DLBCL pathogenesis, i.e. cellular proliferation (MYC translocation), apoptosis resistance (BCL2 translocation), disruption of immune cell differentiation (BCL6 translocation) and activation of inflammatory pathways (CD79B Y196 and MYD88 L265P). Moreover, all five genetic alterations were specifically enriched in either GCB or non-GCB subtype DLBCL patients (> 20% positive in GCB or non-GCB DLBCL patients). In addition, these alterations exhibited the most distinctive frequencies between GCB and non-GCB DLBCL subtypes by Fisher's exact test (Fig. 1a). Thus, using the five main features above, we initially defined four non-mutually exclusive genetic signatures: 1) the MYC-trans signature, with MYC translocation (n = 54); 2) the BCL2-trans signature, with BCL2 translocation (n = 59); 3) the BCL6-trans signature, with BCL6 translocation (n = 91); and 4) the MC signature, with MYD88 L265P and/or CD79B Y196 mutations (n = 72) (Fig. 1f).
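The initial, rule-based definition of the four signatures can be expressed compactly. In the Python sketch below, the dictionary keys are hypothetical feature names chosen for illustration; the point is that a case receives a set of signatures, which may be empty or contain several entries, rather than a single exclusive label.

```python
def initial_signatures(case):
    """Return the set of seed signatures for one case (possibly empty or multiple)."""
    signatures = set()
    if case.get("MYC_translocation"):
        signatures.add("MYC-trans")
    if case.get("BCL2_translocation"):
        signatures.add("BCL2-trans")
    if case.get("BCL6_translocation"):
        signatures.add("BCL6-trans")
    # The MC signature requires MYD88 L265P and/or CD79B Y196.
    if case.get("MYD88_L265P") or case.get("CD79B_Y196"):
        signatures.add("MC")
    return signatures
```

For example, a case with both MYC and BCL2 translocations would carry the MYC-trans and BCL2-trans signatures simultaneously, which a mutually exclusive scheme could not represent.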
Among the above-mentioned four signatures, the MC signature combined the CD79B Y196 and MYD88 L265P variants because they not only presented as hotspot mutations in DLBCL patients but also exhibited a statistically significant tendency for co-occurrence (adjusted P value 1.09 × 10⁻⁹). In addition, previous research has revealed that both variants result in constitutive activation of the NF-κB signaling pathway [5]. Inspired by the study conducted by Schmitz et al., we aimed to evolve and maximize each genetic signature with our set of genetic features while appropriately maintaining the pattern suggested by the initial genetic signature. To address this semi-supervised problem, we developed an iterative random forest (RF) algorithm (Supplementary Appendix). The label of each genetic signature gradually propagated among cases until convergence (Supplementary Table S4; Fig. 1g). An additional 8 (14.8%), 10 (16.9%), 17 (18.7%), and 43 (59.7%) cases were predicted to exhibit the MYC-trans, BCL2-trans, BCL6-trans, and MC signatures, respectively (percentages relative to the initially defined cases), suggesting that the initial definition of the MC signature might be conservative. As a result, 252 out of 342 cases (73.7%) were finally confirmed to be associated with at least one genetic signature.
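The idea of iterative label propagation can be illustrated with a generic self-training loop. The exact procedure is given in the Supplementary Appendix and is not reproduced here; the Python sketch below is a simplified stand-in that replaces the random forest with a bagged ensemble of one-feature decision stumps on binary features, and the function names, the 0.5 vote threshold, and the iteration cap are all assumptions made for illustration.

```python
import random


def train_stump(X, y, rng):
    """Pick, on a bootstrap sample, the binary feature whose presence best predicts y."""
    n = len(X)
    sample = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of case indices
    best_errors, best_feature = None, None
    for j in range(len(X[0])):
        # The stump predicts "positive" exactly when feature j is present.
        errors = sum((X[i][j] == 1) != y[i] for i in sample)
        if best_errors is None or errors < best_errors:
            best_errors, best_feature = errors, j
    return best_feature


def vote_fraction(stumps, x):
    """Fraction of stumps that vote 'positive' for case x."""
    return sum(x[j] == 1 for j in stumps) / len(stumps)


def propagate_labels(X, seed_labels, n_trees=100, threshold=0.5, max_iter=20, seed=0):
    """Grow the positive set until no unlabeled case is newly predicted positive."""
    rng = random.Random(seed)
    labels = list(seed_labels)
    for _ in range(max_iter):
        stumps = [train_stump(X, labels, rng) for _ in range(n_trees)]
        # Seed and previously propagated labels are kept; others may be added.
        updated = [lab or vote_fraction(stumps, x) > threshold
                   for lab, x in zip(labels, X)]
        if updated == labels:  # convergence: no label changed
            break
        labels = updated
    return labels
```

In this toy setting, a case that lacks the defining alteration but shares the mutational pattern of labeled cases is absorbed into the signature, which mirrors how the MC signature grew beyond its initial definition.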
Next, we investigated other genetic mutations statistically associated with one of these genetic signatures. As illustrated in Fig. 2, the genetic mutations of each case were combined and clustered within different genetic signatures and are shown in a factorized mutational heatmap. Firstly, MYC and ID3 mutations were associated with the MYC-trans signature (P < 0.001), and 40% (8/20) of cases with isolated MYC-trans signatures harbored mutations in the ID3-TCF3-CCND3 pathway. We also recognized that all MYC hypermutations were identified in cases with MYC-trans signatures (20/20, 100%), while MYC non-hypermutations were common in cases with either MYC-trans signatures (10/25, 40%) or BCL6-trans signatures (15/25, 60%). Secondly, BCL2, EZH2, CREBBP, STAT6, and KMT2D mutations were significantly related to the BCL2-trans signature (P < 0.001). Although the BCL2 mutation was associated with the BCL2-trans signature, cases harboring BCL2 hypermutation usually had a combined MYC-trans and BCL2-trans signature (6/6, 100%). For chromatin modification-associated genes such as KMT2D and CREBBP, co-occurring mutations in KMT2D and CREBBP generally indicated a BCL2-trans signature (21/24, 87.5%). Thirdly, for the BCL6-trans signature, the CD70, KLF2, NOTCH2, and RRAGC mutations were specifically identified (P < 0.001). Although the CCND3 mutation was more specifically associated with the MYC-trans signature (P = 0.001), it was also frequent in cases with the BCL6-trans signature (16/108, 14.8%). A vast majority of KLF2 zinc finger mutations were identified in cases with BCL6 translocation (15/22, 68.2%) or the BCL6-trans signature (21/22, 95.5%), which had not been previously reported. Finally, for the MC signature, in addition to the CD79B Y196 and MYD88 L265P mutations, mutations in other genes, such as PIM1 and PRDM1, were also significantly related (P < 0.001). XPO1 E571K, a hotspot mutation in chronic lymphocytic leukemia (CLL) and PMBL, was also frequently identified in cases with MC signatures and was usually accompanied by the BCL6-trans signature [25,26].

Model comparison with classical DLBCL subtype classifier and its prognostic significance
In order to validate our genetic classification algorithm, we compared our model with the classical DLBCL genetic classifier built by Schmitz et al. [4] for 239 de novo DLBCL, NOS cases in our study cohort. As illustrated in Fig. 3a, 65% of all cases (n = 155) were successfully classified into four genetic subtypes (MCD n = 66, BN2 n = 55, EZB n = 30, N1 n = 4). In comparison, 75% of all cases (n = 175) could be classified into at least one signature subtype. COO classification also demonstrated a similar type distribution (GCB and non-GCB) between the two models (Fig. 3b). As for each genetic subtype of Schmitz et al. (Fig. 3c), the majority of cases within the MCD subtype could be grouped into the MC signature (63 of 66, 95.4%), and consistent results were seen for the BCL6-trans signature within the BN2 subtype (54 of 55, 98.2%) and the BCL2-trans signature within the EZB subtype (29 of 30, 96.7%). However, in addition to this consistency between the two models, we found that a portion of the DLBCL cases within each subtype of Schmitz's model carried two or more signatures.
Fig. 2 Schematic of the association between genetic alterations and genetic signatures. All 342 cases were clustered and arranged according to the absent/present status of four genetic signatures. We determined the prevalence of each genetic alteration in the following six subsets: 1) cases presenting isolated MYC-trans signatures, 2) cases presenting isolated BCL2-trans signatures, 3) cases presenting isolated BCL6-trans signatures, 4) cases presenting isolated MC signatures, 5) GCB cases without any genetic signatures, and 6) non-GCB cases without any genetic signatures. Genetic alterations were thus clustered into six corresponding classes depending on their maximum prevalence among the six subsets. Genetic alterations in the same cluster were ranked by the significance between cases with isolated corresponding genetic signatures and cases without corresponding genetic signatures (or "other GCB"/"non-GCB" vs. the remaining), with log10 P value depicted to the right of the factorized heatmap. Color code of genetic alteration types: missense mutation or in-frame deletion/insertion (blue), truncating mutation, splice donor/acceptor site mutation, or copy number loss in TP53 (red), SHM (yellow), translocation (orange), and non-detected (gray). COO classification is also indicated above the factorized heatmap.
To evaluate the prognostic value of our genetic signature model, we selected all de novo patients with large B-cell lymphoma who received R-CHOP or R-CHOP-like chemotherapy (n = 280, maximum follow-up 60 months, median follow-up 26 months). We next constructed a multivariate Cox proportional hazard regression model considering both genetic signatures and IPI scores as variables. The MYC-trans signature was the most unfavorable genetic signature, with a hazard ratio (HR) of 2.00 relative to cases without the MYC-trans signature (OS: P = 0.006) (Supplementary Table S5). Patients with a BCL2-trans signature had a relatively favorable 5-year PFS, with borderline significance (P = 0.087). Given the non-mutually exclusive nature of our four genetic signatures and several recent reports [3,5,11,12,27], we aimed to explore the differences in prognostic impact for de novo DLBCL cases with different numbers of genetic signatures. Firstly, in order to exclude the potential influence of confounding factors, especially IPI score, we examined the statistical differences in IPI score group distribution (low 0-1, intermediate 2-3, high 4-5) between groups of patients with varying numbers of genetic signatures (0-sig, 1-sig, 2-sig, 3-sig). No statistical differences in IPI level distribution were identified between the 0-sig, 1-sig, 2-sig and 3-sig patient groups (P > 0.05, chi-square and Fisher exact tests with Bonferroni adjustment). As reflected by the 5-year OS and PFS (Fig. 4a-b), individuals carrying three signatures had significantly worse OS than individuals without any genetic signature (OS: P = 0.0084), although the difference in PFS was not significant (PFS: P = 0.3274), while patients with only one genetic signature exhibited no significant difference in prognosis compared with those without any signature (Fig. 4c-d). In addition, further subgroup survival analysis indicated that within the EZB subtype of the Schmitz model, patients carrying BCL2-trans plus BCL6-trans or MC signatures exhibited significantly inferior prognosis compared with patients carrying the BCL2-trans signature only (OS: P = 0.002; PFS: P = 0.039) (Fig. 4e-f). However, no prognostic differences were identified between patients carrying different numbers of signatures within the MCD, BN2 or N1 subgroups.
The above findings provided evidence that these non-mutually exclusive genetic signatures exhibited cumulative prognostic influences, and patient heterogeneity still existed in traditional mutually exclusive classification model for DLBCL patients in our cohort, which requires further confirmation in larger multi-center cohort studies.
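The survival comparisons in this section rest on Kaplan-Meier estimates of OS and PFS. As a rough illustration of how such a curve is computed, the Python sketch below implements the product-limit estimator; it is not the R code used in the study, and the function name and input layout are assumptions.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  : follow-up time of each patient (e.g. in months)
    events : True if the event was observed, False if the patient was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = n_censored = 0
        # Count events and censorings that share the same time point.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                n_events += 1
            else:
                n_censored += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_events + n_censored
    return curve
```

Computing one curve per signature-count group (0-sig, 1-sig, 2-sig, 3-sig) and comparing them with a log-rank test would correspond to the kind of analysis summarized in Fig. 4.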

Discussion
In this study we retrospectively analyzed NGS results of a DLBCL cohort. Among all panel genes sequenced, we focused on the variant pattern of CD79B. In general, the variant frequency of CD79B was relatively higher than in reports from other centers, which could possibly be explained by ethnic differences and the limited sample size. As described in previous reports, the majority of CD79B variants, including the hotspot Y196, cluster in the ITAM domain and are related to NF-κB pathway activation. Our results were consistent with these findings, as the majority of cases bearing CD79B mutations were classified into the non-GCB subtype. Moreover, this study is the first to report that a series of intronic splice-site variants are recurrent in DLBCL patients (e.g. c.550-1G>A, c.550-1G>C, c.550-3_552del, etc.). Our results further indicated that such splice-site variants probably result in truncation of the CD79B protein, causing CD79B dysfunction in a way distinct from the CD79B Y196 variant. Furthermore, co-mutation analysis indicated a significant difference from the classical CD79B Y196 variant in terms of co-occurrence with MYD88 L265P. COO classification also differed between DLBCL patients carrying CD79B splice-site variants and those carrying the CD79B Y196 variant.
In summary, our results provided evidence that the distinctly different mutation pattern of CD79B splice-site variants suggests a differential impact on DLBCL pathogenesis. However, the exact impact of truncated CD79B protein on the physiological signal transduction of the NF-κB pathway and on DLBCL pathogenesis calls for further functional study.
Comprehensive studies to date have revealed that the genetic landscape of DLBCL is heterogeneous, which aids our understanding of oncogenic mechanisms and provides novel insight for exploring better treatment strategies. Schmitz et al. and Chapuy et al. showed that most DLBCL cases could be subcategorized into several distinct subsets, each of which had unique clinical, molecular and transcriptional characteristics. However, there is still inevitable heterogeneity within each group in their models. Moreover, if all expanding factors were taken into consideration, the system would gradually become too complicated to apply in routine clinical scenarios. Therefore, in this study, instead of mutually exclusive classification, we aimed to define several non-mutually exclusive genetic signatures to describe and understand the complex molecular features of DLBCL, based on molecular information that is feasible to obtain, including MYC/BCL2/BCL6 translocations as well as mutation data from a limited panel of genes. Building on previous research methods and results, we preliminarily determined four genetic signatures using a machine learning-based algorithm. In addition, through analysis of prognostic data based on signature types, we demonstrated a cumulative prognostic impact based on the number of signatures each patient carries. This model may therefore be applicable to target-oriented therapeutic decision making and prognostic evaluation.
Notably, in this study, some cases with a single genetic signature also carried mutations commonly identified in other B-cell malignancies. Among those carrying a single MYC-trans signature, 8 of 20 cases were affected by ID3-TCF3-CCND3 pathway mutations, which are prevalent in Burkitt lymphoma (BL) [27][28][29]. Among cases with a single BCL2-trans signature, a significant proportion (21/23, 91.3%) harbored mutations in chromatin modification genes, including KMT2D, CREBBP, EP300, and EZH2, and in several other signaling pathways (STAT6, SOCS1, TNFRSF14), similar to the genetic features described in follicular lymphoma [30]. In addition, in cases with a single BCL6-trans signature, mutations were found in several genes that are also frequently mutated in marginal zone lymphoma [31][32][33].
It should be noted that, limited by the gene panel and sample size, the genetic signatures of certain cases might have been mislabeled by the RF prediction algorithm, and some other important genetic signatures might remain undiscovered in our study. Several other genetic alterations have already been revealed to be of potential importance in understanding the mechanism of pathogenesis, classification, therapeutic guidance, and prognosis evaluation but were not included in our set of genetic features (Supplementary Table S6). Additionally, owing to the single-center nature of our high-throughput sequencing study, the current study lacks external data to further support our findings. We aim to validate our model in an expanded multi-center study in the future. Considering the genetic diversity of DLBCL among different populations, we believe that future research including multiple populations will provide more consolidated evidence. Our future work will focus on undertaking a multiplatform analysis of genetic features in an expanded cohort, so as to improve the description of the lymphoma signature landscape and to facilitate the precise determination of genetic signatures.

Conclusion
Unlike mutually exclusive molecular sub-classification, our observations provide novel insight into understanding complex genetic features by identifying the status of several non-mutually exclusive clustered genetic fingerprints. The identification of genetic signatures is not only helpful for disease classification but is also expected to reveal actionable targets for novel therapy development and precise prognostic evaluation.

Authors' contributions
... comparison analysis in this study. YQG collected clinical samples and related data. KFS performed high-throughput sequencing experiments. MLZ performed high-throughput sequencing experiments. HDC performed high-throughput sequencing experiments. JCW collected clinical samples and related data. YW performed FISH experiments and gathered related data. LW performed statistical analysis. YC performed statistical analysis. NW conducted clinical data analysis. XHT conducted clinical data analysis. KHY oversaw the whole study and provided professional consultation. MX designed the study and oversaw the study progress. JFZ oversaw the study and provided professional consultation. All authors have read and approved the manuscript submission.

Funding
This work was funded by grants from the National Science Foundation of China (No.81770211, No. 81700145). Both grants provided fundamental support for the bioinformatic data analysis, statistical clinical data investigation and manuscript preparations.

Availability of data and materials
The clinical and genetic variants data analyzed in this study are provided in this article and in supplementary files. The raw datasets analyzed during the current study are available from the corresponding author on reasonable request.

Ethics approval and consent to participate
This study was approved by the Ethics Committee of Tongji Hospital. All patients were informed of the details of the project and signed an informed consent form.

Consent for publication
Not applicable.