Identification of a four-gene panel predicting overall survival for lung adenocarcinoma

Object Lung cancer is the most frequently diagnosed carcinoma and the leading cause of cancer-related mortality. Although molecular targeted therapy and immunotherapy have made great progress, the overall survival (OS) is still poor due to a lack of accurate and available prognostic biomarkers. Therefore, in this study we aimed to establish a multiple-gene panel predicting OS for lung adenocarcinoma. We obtained the mRNA expression and clinical data of lung adenocarcinoma (LUAD) from TCGA database for further integrated bioinformatic analysis. Lasso regression and Cox regression were performed to establish a prognosis model based on a multi-gene panel. Nomogram was built based on the model. The receiver operating characteristic (ROC) curve and the Kaplan–Meier curve were used to assess the predicted capacity of the model. The prognosis value of the multi-gene panel was further validated using a dataset from GEO. Gene set enrichment analysis (GSEA) was performed to explore potential biological mechanisms of a novel prognostic gene signature. expression in

enrichment pathways of GNG7 were cell cycle, RNA degradation, DNA replication, et al.

Conclusion
Our study proposed a novel four-gene panel to predict the OS of LUAD, which may contribute to predicting prognosis accurately and making the clinical decisions of individual therapy for LUAD patients. GNG7 might act a crucial role in genesis and progression of LUAD.

Background
Lung cancer is the most frequently diagnosed carcinoma and the leading cause of cancerrelated mortality worldwide, with 2.1 million new lung cancer cases and 1.8 million deaths predicted in 2018 [1]. More than 80% of lung cancers are non-small cell lung cancer, mainly lung adenocarcinoma and lung squamous cell carcinoma [2]. Among them, lung adenocarcinoma is on the rise and occupies the main part gradually [3][4][5]. Traditional treatments for NSCLC included surgery, chemotherapy, and radiotherapy. Although molecular targeted therapy and immunotherapy for NSCLC (especially lung adenocarcinoma) have made great progress in recent years, the OS of NSCLC is still poor, with a 5-year OS of less than 18% [6]. Hence, the identification of accurate prognostic biomarkers and novel and effective therapeutic targets remains particularly urgent for improving the poor survival of NSCLC patients.
Recent advances in genome-wide technologies have promoted the development of tumor biomarkers studies. Large numbers of biomarkers related to diagnosis, prognosis, and drug resistance of cancers have been discovered. However, many studies were limited to a single biomarker or a small sample cohort, which made the accuracy and availability of biomarkers insufficient. Therefore, the combination of multiple biomarkers and large sample analysis is more promising. For example, Liu et al. established a six-gene signature prognostic model (including CSE1L, CSTB, MTHFR, DAGLA, MMP10, and GYS2) using data from The Cancer Genome Atlas-Liver Hepatocellular Carcinoma Dataset (TCGA-LIHC) [7]. The mining of novel and reliable gene prognostic markers contribute to guiding the prognosis risk stratification and precision therapy of cancer patients.
In the present study, we performed lasso regression, univariate Cox regression, and multivariate Cox regression analysis to screened novel prognostic biomarkers and established a multi-gene panel as a prognostic indicator using data from TCGA-LUAD. ROC curve and Kaplan-Meier curve were used to evaluate the prognostic performance of the multi-gene panel. Then, the prognosis value of the multi-gene panel was further validated using a dataset from GEO database. Furthermore, we further investigated the clinical significance and possible biological functions of one of the key gene signatures. Overall, our results indicated that the four-gene panel might contribute to predicting OS of LUAD patients effectively and might become a novel target for precision therapy.

Materials And Methods
Identification of differentially expressed mRNA in LUAD The mRNA expression and clinical data were downloaded from the TCGA database (LUAD mRNA expression (IlluminaHiseq), containing 497 LUAD samples and 54 normal samples).
The raw expression data underwent a log2 transformation. Differential expression genes (DEGs) were screened via using limma package and edgeR package in version 3.5.2 R [8].

Establishment of the prognostic gene panel
The genes associated with OS for LUAD patients were identified using Univariate Cox regression analysis, with a cut-off of P < 0.001 being considered significant. Lasso penalized regression analysis was used to further narrow the range of prognostic genes [9]. Then a prognostic risk model of gene panel was established based on a linear combination of the multivariate Cox regression model coefficients (β) multiplied with its mRNA expression value. The risk score = (βmRNA1 * expression value of mRNA1) + (βmRNA2 * expression value of mRNA2) + (βmRNA3 * expression value of mRNA3) + ⋯ + (βmRNAn * expression value of mRNAn). Each patient was calculated a risk score according to this model. Then we divided these patients into a high-risk group and a lowrisk group according to a cut-off value calculated via the R package "survminer" and "survival" and two-sided log-rank test. The predictive performance of the gene panel for OS was estimated using a time-dependent ROC curve by the "survivalROC" package in R language [10]. The Kaplan-Meier survival curve was executed to compare the survival difference in the high-and low-risk cohort by the "survival" package in R.

Validation of the prognostic gene panel
To further validate the prognostic value of the gene panel, GSE42127 data from the GEO database were downloaded [11]. The gene expression of GSE42127 data and TCGA-LUAD data were uniformly corrected using the R package "sva" to make them comparable. The risk score was calculated with the gene-panel model for each included patient. The Kaplan-Meier curve and ROC curve were performed to validate the predictive capacity of the prognostic gene panel.
The four-gene panel is an independent prognostic factor for LUAD Univariate and multivariate Cox regression analyses with forwarding stepwise procedure were performed to investigate whether the four-gene panel could be an independent prognostic factor for LUAD patients. Clinical parameters included including gender, age, TNM stage.

Establishment of a predictive nomogram
Nomogram, a simple data evaluation model for the probability of an event, is often used to predict tumor prognosis [12]. The clinical parameters and risk scores from TCGA-LUAD patients were used to build a nomogram in the R package "rms" to detect the predictive probability of 1-year, 3-year, and 5-year OS for LUAD. The discrimination of the nomogram was assessed by using the concordance index (C-index) with a bootstrap method. The calibration curve was performed to validate the prediction efficiency of the nomogram.

MethHC database
MethHC (A database of DNA Methylation and gene expression in Human Cancer) is an online analysis web based on TCGA database resource focused on the DNA methylation of human diseases. We explored DNA methylation level and mRNA expression of GNG7 using MethHC (http://methhc.mbc.nctu.edu.tw/) [13].

GSEA
To explore potential biological mechanisms of prognostic gene signature expression on LUAD prognosis, GSEA was used to investigate enrichment of a priori defined set of genes between the high-and low-expression groups [14]. Gene sets enriched significantly were screened according to the criterion: a normal P-value < 0.05 and false discovery rate (FDR) < 0.1.

Statistical analysis
Univariate Cox regression, lasso regression, multivariate Cox regression analysis, Kaplan-Meier curve, the ROC curve, and log-rank test were used in the present study. All statistical analyses were operated by R software version 3.5.2. The statistical significance was set at P < 0.05.

Identification of DEGs in LUAD
A flowchart for our study was shown in

Establishment of a four-gene panel as a prognostic indicator
Univariate Cox regression analysis was performed for identifying the DEGs associated with OS using the "survival" package of R language. Of the 3,581 DEGs, 523 genes were identified as being associated with OS for LUAD patients (p < 0.01, Additional file 4). Then, lasso regression analysis was implemented to further obtain a stable set of genes . Each patient from the TCGA-LUAD database was awarded a risk score based on the Cox regression model composed of the four genes. Then, these patients were divided into a low-risk and high-risk group according to a cut off value obtained by the R package "survminer" and "survival". The predictive performance of the four-gene panel for OS was estimated using the time-dependent ROC curve by R package "timeROC" and "survival". The Area under the ROC curve (AUC) of this four-gene panel as a prognostic indicator was 0.740 and was superior to other clinical indicators used for prognostic classification. The OS in the high-and low-risk groups were compared using the Kaplan-Meier survival curve via the R package "survival". The results indicated that high-risk group had a worse prognosis compared with the low-risk group (Fig. 3A).

Validation of the four-gene panel as a prognostic indicator
To further validate the accuracy of the four-gene panel as a prognostic indicator, we computed the risk score of each patient in GSE42127 using the same model. Consistent with previous results, a significantly worse OS was observed in the high-risk group compared with the low-risk group. ROC curve showed that the AUC for OS was 0.752, indicating a better predictive performance compared with other clinical indicators used for prognostic classification (Fig. 3B).

Independent prognostic value of the four-gene panel
To determine whether the four-gene panel is an independent prognostic indicator for LUAD patients, 344 patients from the TCGA-LUAD database with complete clinical information including age, gender, and TNM stage were included for further analysis. Multivariate Cox regression analysis suggested that only the risk score calculated from the four-gene panel was independent prognostic factors for OS (Fig. 4A, 4B). The consistent result was obtained in the patients from GSE42127 (Fig. 4C, 4D).

Establishment of predictive nomogram
We established a nomogram to predict 1-year, 3-year, and 5-year OS in 460 patients with complete clinical information from the TCGA-LUAD database using five factors including risk score, age, sex, pharmaceutical, and pathologic stage (Fig. 5A). The C-index for the nomogram model was 0.710 (95% CI 0.624-0.796). Calibration curve showed that the nomogram had the superior prediction efficiency (Fig. 5B). These results indicated that the nomogram might be to serve as a prognostic model used for clinical management of LUAD patients.

The expression of GNG7 in LUAD
We further investigated the role of GNG7 in lung adenocarcinoma using the expression and clinical data from the TCGA-LUAD database, including patients' gender, age, TNM classification, survival status, and TNM stage. The level of GNG7 expression in LUAD tissues was decreased compared with that in noncancerous tissues (Fig. 6A). A paired comparison of LUAD tissues and tissues adjacent to carcinoma from the same patients also showed the same results (Fig. 6B). To further verify GNG7 expression in LUAD tissues, we selected two independent datasets (GSE18842 [15], GSE75037 [16]) for differential expression analysis, and the results showed that GNG7 expression was higher in LUAD tissues compared with noncancerous tissues (Fig. 6C, 6D). In addition, differences in GNG7 expression were also found according to gender, TNM classification, and TNM stage ( Fig. 6E-I).

DNA methylation level and mRNA expression of GNG7
We further explored the relationship between DNA methylation level and mRNA expression of GNG7 using MethHC (A database of DNA Methylation and gene expression in Human Cancer) database. The results showed that the DNA methylation levels of GNG7 were significantly higher in 18 kinds of cancerous tissues than adjacent noncancerous tissues ( Fig. 7). Furthermore, methylation level of the promoter and CpG Island region was negatively correlated with mRNA expression of GNG7 (Fig. 8).
GSEA identifies GNG7-related signaling pathways GSEA analysis was used to identify signaling pathways enriched in low and high GNG7 expression. The results revealed that 17 signaling pathways significantly enriched in GNG7 high expression group, and 27 signaling pathways in GNG7 low expression group (Table 1, 2). Genes involved in cell cycle, ubiquitin mediated proteolysis, RNA degradation, aminoacyl tRNA biosynthesis, DNA replication, proteasome, small cell lung cancer, and P53 signaling pathway were enriched in GNG7 low expression patients. The results suggested that the absence of GNG7 expression may be significantly related to the genesis and progression of LUAD, especially in the regulation of cell cycle (Fig. 9C, 9D).  DKK1, also named as DKK-1(dickkopf WNT signaling pathway inhibitor 1), has been proved to be differential expression in various tumors and participate in the regulation of growth, invasion, angiogenesis and metastasis of tumor [19][20][21][22]. In NSCLC, DKK1 be considered to be involved in tumor cell migration, invasion, and EMT processes, and could be used as an effective diagnostic and prognostic indicator and a potential therapeutic target [23][24][25].
LDHA(lactate dehydrogenase A), as a crucial enzyme of energy metabolism, is elevated in various cancers compared with normal tissues. Previous studies showed that LDHA could promote tumor cells proliferation, invasion, migration, tumor progression, and metastasis, and might be a potential therapeutic target [26][27][28][29][30]. MELTF, also known as MAP97, MTf (Melanotransferrin), or MTF1 (metal regulatory transcription factor 1), as an iron (Fe) binding transferrin homolog, is mainly expressed in melanoma and is low expression in noncancerous tissues. Previous studies indicated that MTf plays a crucial role in cell invasion and migration [31,32]. Subsequent studies reported that it could promote carcinoma cell invasion, migration, proliferation, and EMT progression and be an attractive target [33][34][35][36][37].
GNG7 (G protein subunit gamma 7), a novel possible tumor suppressor gene, is proved to be down-regulated in various carcinoma, including head and neck squamous cell carcinoma, clear cell carcinoma of kidney, pancreatic cancer, oesophageal cancer, lung adenocarcinoma et al. [38][39][40][41][42]. However, the mechanism of its role in tumorigenesis and progression is still little known. In our current study, GNG7 was down-regulated in LUAD patients from the TCGA database compared with noncancerous lung tissues, and the consistent result was validated in two independent datasets (GSE18842, GSE75037). Also, high expression of GNG7 was an independent favorable prognosis factor for OS with LUAD patients. DNA methylation is one of the common mechanisms of regulating genes expression. Our results showed that DNA methylation levels of the GNG7 were significantly higher in multiple tumors than in normal tissues. Furthermore, methylation level of the promoter and CpG Island region was negatively correlated with mRNA expression of GNG7.
It indicated DNA methylation of GNG7 may be involved in regulation of its expression.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and materials
The datasets supporting the conclusions of this article are available in the TCGA-GDC (https://portal.gdc.cancer.gov/) repository.

Competing interests
The authors declare that they have no competing interests.

Funding
This project was supported by grants from National Natural Science Foundation of China (81660018).
XZ and CL designed this study; CL performed most data collection and analysis; QL, DZ, JL helped to perform analysis and collected the data; CL and XZ wrote and revised the manuscript. All authors read and approved the final manuscript.

Supplementary Information
Additional file 1: List of differentially expressed genes.
Additional file 2: Figure S1. Heatmap of differentially expressed genes Additional file 3: Figure S2. Volcanic map of differentially expressed genes Additional file 4: Differentially expressed genes associated with OS for LUAD.
Additional file 5: Figure S3. LASSO Cox regression model.    Forrest plot of the univariate and multivariate Cox regression analysis in LUAD.  The differential expression analysis of GNG7 in LUAD. A The differential expression of GNG7 in LUAD tissues and noncancerous tissues in TCGA. B The paired differential expression of GNG7 in LUAD tissues and noncancerous tissues in TCGA. C The differential expression of GNG7 in GSE18842. D The differential expression of GNG7 in GSE75037. E The differential expression of GNG7 in different gender. F The differential expression of GNG7 in different "N" stage. G The differential expression of GNG7 in different "M" stage. H The differential expression of GNG7 in different "T" stage. I The differential expression of GNG7 in different TNM stage.   Supplementary Files