Discovering the molecular differences between right- and left-sided colon cancer using machine learning methods

Background In recent years, the differences between left-sided colon cancer (LCC) and right-sided colon cancer (RCC) have received increasing attention due to the clinicopathological variation between them. However, some of these differences have remained unclear and conflicting results have been reported. Methods From The Cancer Genome Atlas (TCGA), we obtained RNA sequencing data and gene mutation data on 323 and 283 colon cancer patients, respectively. Differential analysis was firstly done on gene expression data and mutation data between LCC and RCC, separately. Machine learning (ML) methods were then used to select key genes or mutations as features to construct models to classify LCC and RCC patients. Finally, we conducted correlation analysis to identify the correlations between differentially expressed genes (DEGs) and mutations using logistic regression (LR) models. Results We found distinct gene mutation and expression patterns between LCC and RCC patients and further selected the 30 most important mutations and 17 most important gene expression features using ML methods. The classification models created using these features classified LCC and RCC patients with high accuracy (areas under the curve (AUC) of 0.8 and 0.96 for mutation and gene expression data, respectively). The expression of PRAC1 and BRAF V600E mutation (rs113488022) were the most important feature for each model. Correlations of mutations and gene expression data were also identified using LR models. Among them, rs113488022 was found to have significance relevance to the expression of four genes, and thus should be focused on in further study. Conclusions On the basis of ML methods, we found some key molecular differences between LCC and RCC, which could differentiate these two groups of patients with high accuracy. These differences might be key factors behind the variation in clinical features between LCC and RCC and thus help to improve treatment, such as determining the appropriate therapy for patients.


Background
Colorectal cancer (CRC) is a common and lethal disease. Although its mortality has been declining since 1990, its mortality rate is currently approximately 1.7-1.9% [1]. CRC remains the third most common cancer according to the World Health Organization. This disease can be characterized based on the embryological origin [2]. Right-sided colon cancer (RCC) originates from the midgut, including the cecum, ascending colon, and hepatic flexure. In contrast, left-sided colon cancer (LCC) originates from the hindgut, including the splenic flexure, descending colon, and sigmoid colon [3]. Over the past few years, the differences between LCC and RCC have received increasing attention due to their different prognoses, outcomes, and clinical responses to chemotherapy. In many publications, it has been reported that there are significant differences regarding the mutations, epidemiology, survival, pathology, and clinical presentation between RCC and LCC [4,5]. Compared with LCC, RCC was reported to occur more in older patients and females, having a poorer prognosis [6]. RCC tumors were also reported to be poorly differentiated, and to be larger and have more advanced stages [4,7,8]. However, some conflicting results concerning the differences between RCC and LCC were reported, and it remains a topic of considerable debate whether tumor location itself has a significant impact on prognosis [7]. Furthermore, the differences in molecular features between LCC and RCC remain unclear [9]. Studies have found that BRAF was preferentially mutated in RCC, while epidermal growth factor receptor (EGFR) was generally amplified in LCC [10]. Several studies have also reported that mutations and protein expression of p53 differed significantly between LCC and RCC [11][12][13]. However, another study showed that p53 protein expression had no significant difference between LCC and RCC [14].
In light of this background, there is a need to comprehensively survey the differences of gene mutations and expression levels between LCC and RCC. Knowledge of the differences at the molecular level would help us to obtain an in-depth understanding of LCC and RCC and further improve their diagnostic and treatment strategies in clinical practice. The rapid development of high-throughput sequencing technologies has provided us with opportunities to characterize the diverse array of genomic changes found within each cancer type. Projects like The Cancer Genome Atlas (TCGA) have compiled mutation, gene expression, Fig. 1 Feature selection and classification model based on mutations between LCC and RCC. a AUC scores of 100 times 10-fold cross-validation using different feature numbers. b The ROC curve of the classification model on the test dataset. c The importance score of the 30 mutation features methylation, and copy number data across cancer types [15].
As demonstrated by many researchers, machine learning (ML) is becoming increasingly important in cancer prognosis and prediction. In this context, we here established a study to use ML methods to explore gene mutation and expression data from TCGA to infer the molecular differences between LCC and RCC.

Data collection
We initially downloaded gene mutation data of 283 colon cancer patients (from tumor tissue) from TCGA data portal. Next, we added clinical information of each patient to the mutation file using the unique patient ID. Among these patients, 112 had LCC and 171 had RCC. All level 3 mRNA expression (FPKM) and raw count  data of 323 colon cancer patients (from tumor tissue) were also obtained from TCGA, of whom 189 had RCC and 134 had LCC. The descending colon, sigmoid colon, and splenic flexure of colon were classified as LCC and the ascending colon, cecum, and hepatic flexure of colon were classified as RCC [3].

Differential analysis and annotation
Analysis of the differential expression of genes was implemented using the R package DEseq2 using raw count data [16]. Genes with adjusted P-values of less than 0.01 and absolute values of log2 fold change (log 2 FC) above 1 were considered to be differentially expressed genes (DEGs). Fisher's exact test was used to calculate the significance of differences in the frequency of each mutation between LCC and RCC samples. Annotation of the mutations was conducted by ANNOVAR [17]. The pathway annotation of DEGs was performed using the R package clusterProfiler with adjusted P-values less than 0.05 [18]. Mann-Whitney test was used to compare the difference of mutation number between LCC and RCC. To compare the mean mutation number of each sample among LCC and RCC groups, we divided the total number of mutations in each group by sample size.

Machine learning methods
Extreme Gradient Boosting (XGBoost) is a boosted tree method that is often used for supervised learning problems [19]. It has excellent scalability and performance and has become an outstanding machine learning method in many fields of study. In this study, we attempted to use it to classify LCC and RCC patients based on mutation data and gene expression data (FPKM). The mutations existing in at least two samples and DEGs were separately used as raw features for XGBoost to do the following feature selection. To find the most appropriate feature number, for each iteration, we fed XGBoost with different number of features and evaluated its performance using the mean area under the curve (AUC) score of 100 times 10-fold crossvalidation. The selected features were further used to construct the final classifier model. Owing to the relatively small sample sizes, we controlled the complexity of the models to avoid overfitting; an L2 regularization term was applied and the maximum depth of each tree was set to 3. Other hyper-parameters in the XGBoost models were assigned the default settings. For each type of data, 65% of samples were randomly selected as a training dataset, and the remaining 35% of samples were selected as a testing dataset using a stratified sampling method. AUC was used to evaluate the models. All of the functions were accomplished using the Python package Scikit-learn (sklearn). The significance of AUC was estimated by a permutation test using the R package sigr.

Network construction
We used STRING to identify the correlations between all DEGs and genes with selected mutations by XGBoost, as shown in Cytoscape [20,21]. Owing to the fact that mutation features were discrete variables and gene expression data were continuous variables, we analyzed the correlations of mutations with DEGs using logistic regression (LR), the type of colon cancer (LCC or RCC) was used as a confounder. If the false discovery rate (FDR) of the coefficient for each gene (x) to a mutation (y) was below 0.05, it was considered that a significant correlation existed between the gene and mutation. The idea behind using LR to calculate correlations between the two types of variable is that, if there is a relationship between continuous and discrete variables, an accurate predictor of the discrete variable would be constructed using the continuous variable. If the coefficient of the variable in the model is significant, we can conclude that the two variables have a relationship and are indeed correlated. The LR models were constructed using the R function glm.

The 30 most important mutations classifying LCC and RCC
Among the 283 samples, there were 169,298 mutations in total. The mean number of mutations in each sample of RCC group was 2.86 times that in LCC (P-value< 0.001). We initially used Fisher's exact test to calculate the significance of the difference between LCC and RCC samples in the frequency each mutation. The results are Fig. 4 Annotation results of DEGs in KEGG (a) and GO analyses (b). Only genes that were upregulated in RCC compared with the level in LCC were enriched in KEGG pathways and GO analyses (adjusted P-value< 0.05) shown in Fig. S1. Among the mutations, the most significant ones were rs113488022 (BRAF, V600E, P-value, 4.72e-15) and rs112445441 (KRAS, P-value, 8.74e-11). The rs113488022 mutation was only found in 2 (1.8%) LCC patients, but in 35 (20.5%) RCC patients. The BRAF V600E mutation was reportedly found in 8-10% of colorectal tumors and was associated with a more aggressive tumor phenotype, lymph node metastasis, and high microsatellite instability (MSI) [20]. It was also found to be associated with less benefit from treatment [22]. In our study, the rs112445441 mutation in the KRAS gene was present in 8 LCC (7.1%) patients and 26 RCC (15.2%) patients.
To select the most appropriate feature number to construct the model that could classify LCC and RCC patients, the AUC scores of 100 times 10-fold crossvalidation for models with different feature numbers were obtained, as shown in Fig. 1a. Finally, we chose 30 features to construct the final model; its AUC score in the test dataset was 0.8 (p < 1e-05). Receiver Operating Characteristic (ROC) curves and the importance score of the 30 features are shown in Fig. 1b and c. The BRAF V600E mutation was the most important feature for the classifier model. Detailed information and the heatmap of the 30 mutation features are presented in Table 1, Supplementary File 1 and Fig. 2a.

The 17 most important DEGs classifying LCC and RCC
Overall, 144 genes were upregulated and 60 were downregulated in RCC compared with the levels in LCC. A volcano plot and MA plot of DEGs are shown in Fig. 3. The genes with a higher expression level in RCC were particularly associated with the vitamin digestion and absorption pathway, cholesterol metabolism pathway, and Staphylococcus aureus infection pathway (Fig. 4a). The Gene Ontology (GO) annotation results for the main category of biological process (BP) are shown in Fig. 4b. Among these results, some immunity-related processes such as positive regulation of T-cell chemotaxis (GO:0010820), which could increase the rate, frequency, or extent of T-cell chemotaxis; lymphocyte chemotaxis (GO:0048247), which could direct the movement of a lymphocyte in response to an external stimulus; and granulocyte chemotaxis (GO:0071621), which could induce the movement of a granulocyte in response to an external stimulus, were particularly associated with the upregulated DEGs.
On the basis of these DEGs, we further selected 17 features to construct a model that could accurately classify LCC and RCC samples. Figure 5a shows the AUC scores of 100 times 10-fold cross-validation, with the feature number varying from 2 to 30. For the final model, the AUC score in the test dataset (35%) was 0.96 (Fig. 5c, p < 1e-5). Among the 17 features, PRCA1 obtained the highest score for the model (Fig. 5b); PRCA1 is a novel small nuclear protein that is specifically expressed in the human prostate and colon [23]. The relative expression value (Fragments Per Kilobase Million, FPKM) of the top 4 genes with the highest scores from the classifier model among LCC and RCC groups is shown in Fig. 5d. The FPKM of all 17 genes in the LCC and RCC groups are shown in Fig. S2. A heatmap of the expression of the 17 genes in LCC and RCC is shown in Fig. 2b.

Analysis of the correlations of DEGs with mutations
A network of the correlations of all DEGs with mutant genes (genes containing the 17 selected mutations) was constructed using STRING [20] (Fig. 6). TP53 and KRAS are hub genes that were found to be connected to many DEGs such as WIF1 and KRT17.
The correlations of 30 mutations with 17 DEGs selected by ML were determined by logistic regression (LR) and visualized using Cytoscape [21] (Fig. 7). Five relationships were found (two mutations, four DEGs, FDR < 0.05). Among them, the BRAF V600E mutation was correlated with four DEGs (ULBP2, CA8, HOXC6, AFAP1-AS1). The coefficients and FDR values of the correlations for BRAF V600E mutation to the four genes are listed in Table 2.

Discussion
It has been hypothesized that there are significant differences between RCC and LCC in terms of the molecular features, which might be the cause of clinicopathological differences [24]. However, the differences of molecular features between RCC and LCC patients have remained unclear. Using~300 LCC and RCC samples from TCGA, we attempted to unearth more valuable Fig. 6 Network of all of the DEGs and genes with the selected 30 mutations (Produced by Cytoscape Version 3.7.1). Circle nodes represent DEGs, while triangles represent mutated genes. Nodes with a light yellow color represent genes with mutations, dark turquoise represents downregulated DEGs, while dark orange represents upregulated DEGs. The line color represents the score of the connection between two nodes, ranging from 0.4 to 0.99. Node size represents the degree of the node: the larger the node size, the higher the degree of the node information on the differences between LCC and RCC by applying ML methods. It has been reported that RCC has a higher incidence of KRAS mutation than LCC (57.3% vs. 40.4%; P-value < 0.0001) [25], and a higher frequency of BRAF mutation (18.4-22.4% vs. 1.3-7.8%) [26]. However, other studies found no significant differences in BRAF and KRAS mutation rates [27]. In our study, RCC was also found to have higher incidences of KRAS mutation (49.7% vs. 33.0%, P-value = 0.007) and BRAF mutation (23.4% vs. 3.6%, P-value = 2.8e-6) than LCC. However, no significant difference was found in the expression of BRAF (FDR = 1, log 2 FC = 0.1) and KRAS (FDR = 0.92, log 2 FC = − 0.04) in our study, which implies that the mutations may have no impact on the transcription of KRAS and BRAF. The top four genes with the highest mutation rates in LCC were APC (84.8%), TP53 (68.8%), TTN (54.5%), and KRAS (33%). In RCC, the top four again included APC as the most common (63.2%), followed by TTN (63.2%), KRAS (49.7%), and TP53 (49.7%).
Using ML methods, we selected 30 mutations to build an XGBoost classifier with AUC of 0.80 in the test dataset. The feature with the highest score in the model was rs113488022 in BRAF. The top seven mutations scored by XGBoost were in the BRAF, KRAS, APC, and TP53 genes. APC and TP53 are tumor suppressor genes, while KRAS and BRAF are oncogenes. The differences in the frequencies of these mutations may be the reason for the clinicopathological differences between the two types of colon cancer.
The genes that were upregulated in RCC compared with the levels in LCC were particularly associated with some immunity-related processes. Using DEGs, we constructed a model with AUC of 0.96 in the test set using only 17 features, implying large differences between LCC and RCC at the level of gene expression. Among these features, small nuclear protein PRAC1 (FDR < 0.001, log 2 FC = − 4.1) was the most important, which was highly expressed in LCC. The higher expression of PRAC1 in LCC than RCC was also identified in other studies [5]. However, the function of PRAC1 in colon cancers remains elusive. Mutations in this gene have been found to be associated with a predisposition to prostate cancer and it is a candidate for the hereditary prostate cancer 1 (HPC1) allele. The second most important feature in the model, HOXC6 (FDR < 0.001, log 2 FC = 1.03), was highly expressed in RCC; it belongs to the homeoprotein family of transcription factors, members of which play important roles in morphogenesis and cellular differentiation during embryonic development [28]. The higher expression of HOXC6 in RCC than LCC was also described in another study [5]. Furthermore, the overexpression of HOXC6 has been detected in several human carcinomas, including breast, gastrointestinal, and lung cancers, as well as leukemia [29]. High expression levels of HOXC6 have also been found to be associated with lymph node metastasis [30]. The differential expression of PRAC1 and HOXC6 and other genes may be the reason for the different characteristics between LCC and RCC, which warrants more attention in further study.
In the correlation network, it was shown that some of the mutant genes, such as TP53 and KRAS, were the hub nodes. The LR analysis also showed a close relationship between BRAF V600E mutation and the expression of four genes such as HOXC6 and CA8. These findings suggest that the differences of gene mutations and expression, and the associations between them may be the key reasons for the differences in clinical features between LCC and RCC.

Conclusions
In this study, we used ML methods to clarify some of the key molecular differences between LCC and RCC. Fig. 7 The correlation network of 30 mutations and 17 DEGs calculated by logistic regression model (FDR < 0.05, Produced by Cytoscape Version 3.7.1). Nodes with red color represent mutations, mutation1 represents rs113488022 (BRAF, V600E mutation), mutation2 represents mutation in the 3′-UTR of ELFN1(chr7_1747914_1747915_TA_T-represents the chromosome, the position, and the mutated base). Nodes with a blue color represent DEGs Two classification models were constructed using the selected 17 DEGs and 30 mutations separately with good performance in the prediction of the two types of colon cancers. The expression of PRCA1 and the BRAF V600E mutation were the most important features for the two classifier models. Furthermore, BRAF V600E mutation was found to correlate with four genes among the 17 DEGs which should be paid more attention in further studies about colon cancer. Overall, the classifier models and the identified different mutations and genes in LCC and RCC might help us to obtain an in-depth understanding and further improve the diagnostic and therapy for patients.