A glycolysis-based three-gene signature predicts survival in patients with lung squamous cell carcinoma

Background Lung cancer is one of the most lethal and most prevalent malignant tumors worldwide, and lung squamous cell carcinoma (LUSC) is one of the major histological subtypes. Although numerous biomarkers have been found to be associated with prognosis in LUSC, the prediction effect of a single gene biomarker is insufficient, especially for glycolysis-related genes. Therefore, we aimed to develop a novel glycolysis-related gene signature to predict survival in patients with LUSC. Methods The mRNA expression files and LUSC clinical information were obtained from The Cancer Genome Atlas (TCGA) dataset. Results Based on Gene Set Enrichment Analysis (GSEA), we found 5 glycolysis-related gene sets that were significantly enriched in LUSC tissues. Univariate and multivariate Cox proportional regression models were performed to choose prognostic-related gene signatures. Based on a Cox proportional regression model, a risk score for a three-gene signature (HKDC1, ALDH7A1, and MDH1) was established to divide patients into high-risk and low-risk subgroups. Multivariate Cox regression analysis indicated that the risk score for this three-gene signature can be used as an independent prognostic indicator in LUSC. Additionally, based on the cBioPortal database, the rate of genomic alterations in the HKDC1, ALDH7A1, and MDH1 genes were 1.9, 1.1, and 5% in LUSC patients, respectively. Conclusion A glycolysis-based three-gene signature could serve as a novel biomarker in predicting the prognosis of patients with LUSC and it also provides additional gene targets that can be used to cure LUSC patients. Supplementary Information The online version contains supplementary material available at 10.1186/s12885-021-08360-z.


Background
Lung cancer is the leading cause of cancer-related mortality worldwide. There are two clinical subtypes for lung cancer: non-small cell lung cancer (NSCLC) (approximately 85% occurrence), and small cell lung cancer (SCLC) (approximately 15% occurrence) [1]. Based on pathological and molecular features, NSCLC is divided into the following major subtypes: lung squamous cell carcinoma (LUSC), lung adenocarcinoma (LUAD), and large cell lung cancer [2]. Recent advances in targeted treatments, such as epidermal growth factor receptor (EGFR) kinase inhibitors, have increased the overall survival (OS) of patients with LUAD [3]. However, no specific biomarkers or relatively optimal targeted therapies have been identified for LUSC patients, and the 5-year survival rate of LUSC is less than 20% [4]. Therefore, it is necessary to explore specific diagnostic and prognostic biomarkers for LUSC.
Energy metabolism reprogramming is a process that promotes cancer cell growth and proliferation via adjustment of energy metabolism, and it has been regarded as an emerging hallmark of cancer [5]. Under aerobic conditions, normal cells obtain energy through mitochondrial oxidative phosphorylation. Under anaerobic conditions, the cells obtain energy via glycolysis instead of oxygen-consuming mitochondrial metabolism [6]. Glycolysis, also known as the Warburg effect, is often observed in human cancer cells, in which the cancer cells favor glucose metabolism via glycolysis even in the presence of oxygen [7]. This phenomenon is a unique energy metabolism that exists in cancer cells. In recent years, many biomarkers for LUSC have been discovered, including glycolysis-associated genes such as kininogen 1 (KNG1) [8] and tripartite motif-containing protein 59 (TRIM59) [9].
With the development of high-throughput sequencing, various patient genome databases have been constructed, which enables us to acquire a deep understanding of genomic changes. Based on database mining, an increasing number of biomarkers have been identified that are related to the survival of patients with cancer [10,11]. However, a single gene cannot be used to obtain satisfactory predictive effects. A multigene prognostic model from an original tumor biopsy can guide clinicians to choose more effective treatment strategies. Thus, a signature based on multigene expression associated with glycolysis should be established to predict the prognosis of LUSC patients.
In the present study, we used a genome-wide analysis of LUSC patient mRNA expression profiles from The Cancer Genome Atlas (TCGA) to construct a glycolysisrelated gene signature that could effectively predict the prognosis in LUSC patients.

Patient dataset
The mRNA expression profiles of LUSC patients and their corresponding clinical information were obtained from the TCGA database (https://portal.gdc.cancer.gov/ ). A total of 502 LUSC samples and 49 tumor-adjacent normal samples were downloaded from TCGA. Each tumor specimen was approximately 1 cm 3 in size and weighed between 100 mg and 200 mg, in general [12]. Then, we downloaded the clinical data for LUSC (including 504 patients) from TCGA. There were 501 matched LUSC patients between the mRNA expression files and the clinical information. Therefore, a total of 501 tumor samples and 49 tumor-adjacent normal samples were included in our study. Clinical information, including age, sex, American Joint Committee on Cancer (AJCC) stage, T, M, N, survival time, and survival status were included in the present study (Table 1). Additional information regarding surgically extracted LUSC can be seen in the TCGA collection protocols [12,13].
Gene set enrichment analysis (GSEA) GSEA was performed to determine whether there were significant differences in the identified gene sets between the LUSC and normal groups. The expression levels of 443 glycolysis-related genes were analyzed in LUSC samples and in adjacent non-cancerous tissues. A normalized p value less than 0.05 was considered statistically significant.

Prognostic analysis
We conducted univariate Cox proportional hazard regression analysis to determine the relationship between glycolysis-related genes and OS in LUSC patients. If p < 0.01, the corresponding glycolysis-related genes were retained and regarded as candidate prognostic genes for LUSC. Then, a multivariate Cox proportional hazards regression analysis was performed among the pooled candidate prognostic glycolysis-related genes to establish the prognostic model. These analyses were performed with the use of the R package for survival.

Statistical analysis
The selected mRNAs were divided into the risky [hazard ratio (HR) > 1] and protective (0 < HR < 1) types. Based on a linear combination of the expression level of filtered mRNAs weighted by the regression coefficient (β), the formula for the risk score is illustrated as follows: Risk score = expression of gene 1 × β1+ expression gene 2 × β2 + … + expression of gene n × βn. β represents the regression coefficient of the corresponding gene obtained from the multivariate Cox regression model. According to the median value of the risk score, patients were divided into high-risk or low-risk groups. Kaplan-Meier curves and log-rank tests were utilized to validate the prognostic significance of the risk score. Student's t-test or the Mann-Whitney U-test was conducted to explore the differential expression of selected genes in LUSC tissues and adjacent normal tissue. If the expression data of selected genes followed a normal distribution, Student's t-test was used to analyze differences between LUSC tissues and adjacent normal tissue; otherwise, the Mann-Whitney U-test was utilized. Filtered gene alterations in LUSC were explored using the cBio-Portal database (http://www.cbioportal.org/). All statistical analyses were performed using the R language and environment for statistical computing (R version 3.6.3). Visualization of results was performed using R software.

Initial screening of genes by GSEA
The mRNA expression data set and clinical information for 501 patients with LUSC were obtained from the TCGA database (Fig. 1). We found five glycolysis-related gene sets in the Molecular Signatures Database v7.0, including the (1) BIOCARTA_GLYCOLYSIS_ PATH WAY, (2) GO_GLYCOLYTIC_PROCESS, (3) HALLMA RK_GLYCOLYSIS, (4) KEGG_GLYCOLYSIS_GLUCO-NEOGENESIS, (5) REACTOME_ GLYCOLYSIS. We performed GSEA to explore whether there were significant differences between LUSC and normal tissues in the identified gene sets. We found that these 5 gene sets were significantly enriched ( Fig. 2 and Table 2). Then, we collected 443 genes from 5 gene sets for further analysis.

Identification of glycolysis-related genes associated with patient survival
First, univariate Cox proportional hazard regression analysis was conducted for 443 genes that were significantly enriched in LUSC samples from the GSEA. A total of 4 genes were obtained that were significantly correlated with patient survival (p < 0.01). Next, we performed multivariate Cox regression analysis to further explore the association between the 4 mRNA expression profiles and the OS of patients.
Finally, 3 genes, hexokinase domain-containing protein 1 (HKDC1), aldehyde dehydrogenase 7A1 (ALDH7A1), and malate dehydrogenase 1 (MDH1), were included to construct a prognostic model. As shown in Table 3, two of the three genes were verified as independent prognostic markers in LUSC. Among the three genes, one gene (MDH1) was considered as a protective factor according to 0 < HR < 1, whereas the remaining two genes (HKDC1 and ALDH7A1) might be prognostic risk factors with their HR > 1.
Subsequently, we explored the alterations of three selected genes in 501 LUSC samples using the cBioPortal database. The results showed that the rates of genomic alterations in the HKDC1, ALDH7A1, and MDH1 genes were 1.9, 1.1, and 5%, respectively ( Supplementary Figure 1).
The expression level of the three genes was measured between adjacent normal tissues and LUSC tissues. We found that all three genes were upregulated in LUSC tissues compared with normal tissues (Fig. 3).
Construction of the three-gene signature to predict patient prognosis To predict patient prognosis using glycolysis-related gene expression, a prognostic risk model was developed based on the regression coefficients of the multivariate Cox regression model to weight the expression level of   Table 1.
According to the risk score formula, patients were classified into the high-risk (n = 247) or the low-risk group (n = 248) with a median value of risk score as a cut-off (Fig. 4A). The survival time for each patient is shown in Fig. 4B. As shown in Fig. 4D, patients in the high-risk group had shorter survival as compared to the low-risk group (p < 0.001). The 3-year and 5-year survival rates of patients in the high-risk group were 45.4 and 35.0%, respectively. However, the 3-year and 5-year survival rates of patients in the low-risk group were 71.9 and 58.1%, respectively. Additionally, a heatmap presents the expression profiles of three mRNAs (Fig. 4C). As the risk score increased in patients with LUSC, the mRNA expression of HKDC1 and ALDH7A1 was obviously upregulated; in contrast, the mRNA expression of MDH1 was downregulated. The area under the receiver operating characteristic (ROC) curve (AUC) for the risk score at 1-, 3-, and 5-year OS was 0.629, 0.665, and 0.636, respectively (Fig. 5).
Risk score from the three-gene signature is an independent prognostic indicator Univariate and multivariate Cox regression analysis were performed to evaluate the independent risk factors in patients with LUSC. Several clinicopathological parameters, including age, sex, AJCC stage, T, N, M, as well as risk score were included. The results showed that only the risk score was associated with prognosis in the univariate Cox analysis (HR = 2.553, 95% CI: 1.710-3.811, p < 0.0001) ( Table 4). In the following multivariate Cox analysis, it was determined that age and risk score were independent prognostic indicators ( Table 4). These results indicated that the risk score was reliable in predicting the prognosis of patients with LUSC.

Validation of the three-gene signature for survival prediction by Kaplan-Meier curve analysis
To further verify the prognostic value of the risk score of the three-gene signature associated with glycolysis, patients with LUSC were stratified by age (≤65 or > 65), sex (female or male), AJCC stage (I + II or III + IV), T (T1 + T2 or T3 + T4), N (N0 or N1 + N2 + N3), and M (M0 or M1) (Fig. 6). We found no significant difference between high-risk and low-risk in patients with remote tumor metastasis (Fig. 6J). However, in the subgroup of patients without remote tumor metastasis, the risk score for the three-gene signature was still an independent prognostic indicator (Fig. 6I). Additionally, regardless of the age, sex, AJCC stage, T or N, patients in the highrisk group based on the risk score had a poor prognosis when compared to patients in the low-risk group. These findings demonstrated that the three-gene signature effectively predicts the survival of LUSC patients.

Discussion
Recently, numerous genes have been considered as biomarkers for cancer prognosis, and the clinical significance of the biomarkers has been explored. For example, a study by Tang and his colleagues found that the  overexpression of dipeptidyl peptidase 9 (DPP9) was a significant independent factor for poor prognosis in patients with NSCLC [14]. Similarly, Feng et al. [15] reported that high expression of forkhead box Q1 (FoxQ1) was associated with poor prognosis in patients with NSCLC. However, the expression level of single genes can be influenced by multiple factors, and thus, these biomarkers can be unreliable for independent prognosis indications. Therefore, a statistical model based on a combination of multiple genes was used to improve the prediction of prognosis in cancer patients. Studies have shown that a pool of multiple genes was more accurate than a single gene in predicting the prognosis of patients with cancer [16,17].
In the present study, we obtained mRNA expression profiles for 501 LUSC patients from the TCGA database. We found that 5 glycolysis-related gene sets were significantly enriched in LUSC samples using GSEA. Univariate and multivariate Cox regression analyses were carried out to identify the risk score for the three-gene signature with prognostic value for patients with LUSC. Kaplan-Meier curve analysis indicated that patients with a high risk score had a poor prognosis when compared to patients with a low risk score. Additionally, in the stratified analysis, the risk score for the three-gene signature effectively predicted the prognosis of LUSC patients in all subgroups except for the subgroup of patients with remote tumor metastasis. The reason for this discrepancy might be that the number of patients with remote tumor metastasis was too small (n = 7). These results demonstrated that the risk score for the three-gene signature could be used as an independent prognostic indicator for LUSC patients. Moreover, measuring the patient risk score might assist clinicians in choosing optimal therapy methods.
The metabolism of tumor cells is more active than that of normal cells, and therefore, tumor cells require a greater amount of energy to maintain their higher proliferation [18]. Glycolysis and oxidative phosphorylation are the two important metabolic pathways related to energy supply. Glycolysis is a relatively low-energyproviding pathway compared with oxidative phosphorylation. In the 1920s, Warburg found that cancer cells are very active in glycolysis and require a large amount of glucose to obtain ATP for metabolic activities [19]. This aberrant phenomenon of glucose metabolism was called aerobic glycolysis, and is also known as the Warburg effect [19,20]. There was further research on the main genes and enzymes related to glycolysis to gain understanding of their functions in the metabolism of tumor cells.
In recent years, studies have shown that aerobic glycolysis plays a significant role in tumorigenesis, tumor progression, and metastasis. For example, enolase 1 (ENO1) was proved to promote cell glycolysis, growth, migration, and invasion in NSCLC [21]. Glucose transporter 1 (GLUT1) facilitated increased transport of glucose into cancer cells to maintain an elevated rate of glycolysis under aerobic conditions [22]. A high expression of GLUT1 was significantly associated with a poor prognosis in lung cancer patients [23]. However, no set of glycolysis-related genes for predicting LUSC prognosis has been established.
HKDC1, a recently identified fifth hexokinase, plays an important role in cellular glucose metabolism [24]. Aberrational expression of HKDC1 is associated with various cancers, including colorectal cancer [25], liver cancer [26], and breast cancer [27]. Additionally, Wang and his colleagues reported that HKDC1 was overexpressed in LUAD tissues, and high expression of HKDC1 promoted proliferation, migration, and invasion in LUAD [28]. Thus, HKDC1 can serve as a prognostic biomarker for LUAD patients [28]. The aldehyde dehydrogenase (ALDH) superfamily comprises 19 enzymes that play a vital role in maintaining epithelial homeostasis. ALDH activity has been implicated in detoxification, cell proliferation, differentiation, drug resistance, and response to oxidative stress [29,30]. Thus, deregulation of these enzymes could result in various cancers, including esophageal squamous cell carcinoma [31] and breast cancer [32]. Giacalone et al. [33] reported that ALDH7A1, one of the ALDH superfamily members, was correlated with OS and recurrence in patients with surgically resected stage I NSCLC. MDH1, an NAD(H)-dependent enzyme, is an important part in the malate/aspartate shuttle (MAS) [34]. This metabolic cycle contributes to maintaining intracellular NAD(H) redox homeostasis as it transfers the reducing equivalent NAD(H) across the mitochondrial membrane [34]. It has been reported that abnormal expression of MDH1 is related to tumor occurrence and progression [35]. For example, MDH1 promoted pancreatic ductal adenocarcinoma cell proliferation and metabolism through NAD production to support glycolysis [35,36]. Zhang et al. [37] reported that MDH1 expression was elevated in NSCLC tissue compared with normal lung tissue. However, there were no combinations of these three glycolysis-related genes (HKDC1, ALDH7A1, and MDH1) to predict the prognosis of LUSC.
This study is the first to report that a glycolysis-based three-gene signature can serve as a prognostic indicator for patients with LUSC. A higher risk score indicates a worse prognosis. Of course, some study limitations remain. First, the risk score model was constructed using the TCGA database and should be verified in other cohorts in future studies. Second, studies on the three predicted genes should be performed to explore concrete mechanisms in the occurrence and development of LUSC.

Conclusions
This study suggested that the three-gene signature associated with glycolysis might not only help to predict prognosis of LUSC patients, but also can provide additional gene targets that can be potentially used to cure LUSC patients.