A prognostic model for predicting recurrence-free survival in hepatocellular carcinoma patients

Hepatocellular carcinoma (HCC) remains the most frequent liver cancer, accounting for approximately 90% of primary liver cancers worldwide. The recurrence-free survival (RFS) of HCC patients is a critical factor in devising a personal treatment plan. Thus, it is necessary to accurately forecast the prognosis of HCC patients in clinical practice. Using the The Cancer Genome Atlas (cid:0) TCGA (cid:0) dataset, we identied genes that are associated with RFS. A robust likelihood-based survival modeling approach was used to select the best genes for the prognostic model. Then, the GSE76427 dataset was used to evaluate the prognostic model’s effectiveness. We identied 1331 differentially expressed genes associated with RFS. Seven of these genes were selected to generate the prognostic model. Validation in both the TCGA cohort and the GEO cohort demonstrated that the 7-gene prognostic model has the capability of predicting the RFS of HCC patients. Meanwhile, the result of multivariate Cox regression showed that the 7-gene prognostic model could work as an independent prognostic factor. In addition, according to the time dependent ROC curve, the 7-gene prognostic model performed better in predicting the RFS of the training set and the external validation dataset than the classical TNM staging and BCLC. What’s more, these seven genes were found to be related to the occurrence and development of liver cancer by exploring other three databases. Our study identied a seven-gene signature for HCC RFS prediction that is a novel and convenient prognostic tool. The seven genes might provide potential target genes for metabolic therapy and the treatment of HCC.


Background
In 2018, liver cancer remained among the top six prevalent carcinomas. There were 841,080 new patients, and 781,631 patients died of liver cancer, according to the Global Cancer Statistics (1) (2). Hepatocellular carcinoma (HCC) is the most frequent liver cancer, accounting for approximately 90% of primary liver cancers (3). Despite the continuous development of medical technology, the outcome of many patients who receive treatment remains poor, and the prognosis of liver cancer remains poor, with a 2-year recurrence rate of 76.9% (4). The recurrence-free survival (RFS) of HCC patients is a critical factor in devising a personal treatment plan (5). Thus, it is necessary to accurately forecast HCC prognosis to improve the prognosis of HCC. Most previous studies constructed prognostic models using the TNM (tumor-node-metastasis) staging system to assess the prognosis of HCC patients (6). However, the TNM staging system does not predict the prognosis of HCC. Therefore, it is important to develop a reliable tool for clinicians to predict the prognosis of patients with HCC.
Given the remarkable development of high-throughput technologies, the development of The Cancer Genome Atlas (TCGA) and the intergovernmental Gene Expression Omnibus (GEO) database provides an abundance of high-quality information for HCC (7). Hence, it is urgent to develop methods to identify reliable therapeutic gene targets that could enable earlier prognostic evaluation and better therapeutic strategies (8). Therefore, we considered whether we could build a gene-based risk score model (9). Our goal is to generate simple and effective prognostic tools based on several genes and other factors that may affect RFS (10). Using the TCGA dataset, we selected 7 genes by robust likelihood-based survival modeling and built a risk score system (11). We used an independent dataset (GSE76427) to validate the effectiveness of the risk score system, demonstrating that its clinical value for predicting RFS in HCC patients is better than that of the TNM staging system.

Data collection and survival analyses
First, we downloaded the gene expression pro les and clinical information from The Cancer Genome Atlas-liver hepatocellular carcinoma (TCGA-LIHC)(12) dataset, which included 335 HCC samples. We used GSE76427 as a validation group, which contained 115 HCC samples, including gene expression and clinical information. Samples in TCGA-LIHC and GSE76427 that met the following inclusion criteria were included in this study: all samples had mRNA sequencing data and clinical information on RFS(13).

Identi cation of genes associated with RFS
The raw count data were normalized with a log(a+1) transformation. Then, using the "surv t" function in the "survival" package, we plotted the Kaplan-Meier curves for the high and low expression groups of each gene. When the P-value of the log rank test is less than 0.05, it was considered statistically signi cant (14).

Enrichment analysis of GO functions and KEGG pathways
For the selected genes, we used WebGestalt (http://bioinfo.vanderbilt.edu/webgestalt) based on GO (Gene Ontology) functions and the Kyoto Encyclopedia of Genes and Genomes (KEGG) to understand the biological signi cance of the identi ed genes(15).

Identi cation of the best genes for modeling
A robust likelihood-based survival approach was used to identify the best genes for modeling after determining the genes associated with RFS (16). We used the "rbsurv" package in R to complete this modeling process. The algorithm for modeling is summarized as follows: 1. We randomly divided the experimental group into two sets: a training set with N * (1-p) samples and a validation set with N * p samples (p = 1/3). Next, we adjusted a gene in the training set to obtain estimates of the genetic parameters. Log-likelihoods were evaluated using parameter estimates and a set of validation samples. This evaluation was performed in all genes.
2. We repeated the above process 10 times and obtained 10 log-likelihoods for each gene. We selected the best genes with the largest mean log-likelihoods. All of the best genes related to overall survival were selected using a robust likelihood approach.
3. We calculated every two-gene model and selected the one with the largest mean log-likelihood by repeating the two steps above. 4. We continued this process of gene selection. As a result, a series of models was generated using the Akaike Information Criterion (AIC) for all the candidate models and choosing the best model with the smallest AIC.

Construction and validation of the risk score system
Multivariate Cox regression analysis and "rbsurv" analysis were performed to identify the genes related to recurrence-free survival and to construct the prognostic gene signature. The "survivalROC" package in R was used to investigate the time-dependent prognostic value. Optimal cut-off values based on ROC curves were obtained to classify patients into low-risk groups and high-risk groups. The Calibration curve and concordance index (C-index) were used to evaluate the risk score system.

External validation of the risk score system
We calculated the risk score in the GSE76427 dataset. Then, the AUCs for 12-month, 15-month, and 18month RFS and the Kaplan-Meier curves were used to verify the risk score system. The calibration curve was used to validate the risk score system.
In addition, the prognosis-related genes in the risk score system were veri ed at the protein level by using The Human Protein Atlas database. CBioPortal for cancer genomics was used to study genetic alterations and in the risk score system (17).

Statistical analysis
Statistical tests were performed using R software and SPSS. Univariate and multivariate Cox regression were performed with a forward stepwise procedure. When the P-value was less than 0.05, it was considered statistically signi cant (17).

Acquisition of Gene Expression and Clinical Data
We downloaded the TCGA-LIHC dataset from The Cancer Genome Atlas (http://portal.gdc.cancer.gov/). The TCGA-LIHC dataset included 334 samples, and all samples had data on the recurrence-free survival time and censoring status. The GSE76427 dataset was downloaded from the Gene Expression Omnibus database (http://www.ncbi.nlm.nih.gov/gov/). The GSE76427 dataset included 115 samples from HCC patients, but 7 patients had missing information on recurrence-free survival time and censoring status.
Thus, 108 samples were included in this study. The median RFS times of the TCGA and GSE76427 series were 390 and 252 days, respectively, and the two datasets contained clinical information, such as sex, age, and TNM stage.

Genes associated with recurrence-free survival
We used the "surv t" function from the "survival" package and found 1331 genes associated with RFS. To explore the genetic biological implications, we next analyzed the 1331 genes through Gene Ontology (GO) functional and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses. As shown in Figure 1, through KEGG analysis, we found that these genes are enriched in signaling pathways such as the cell cycle, homologous recombination, DNA replication, the Fanconi anemia pathway, complement and coagulation cascades, and the T cell receptor signaling pathway.

Construction of the prognostic model in TCGA-LIHC
Then, "rbsurv" identi ed seven genes to construct the risk score system. The seven genes included in the system were TTK protein kinase (TTK), chromosome 16 open reading frame 54 (C16orf54), phosphoribosyl pyrophosphate amidotransferase (PPAT), CD3e molecule associated protein (CD3EAP), Solute carrier organic anion transporter family member 2A1 (SLCO2A1), acetyl-CoA acetyltransferase 1 (ACAT1), and Growth-arrest speci c 2 like 3 (GAS2L3) (Table ). A total of 334 patients were divided into two groups (134 high-risk patients and 200 low-risk patients) with a cut-off of 4.97976 for the risk score. Furthermore, the survival curve revealed that the recurrencefree survival was signi cantly poorer in the high-risk group than in the low-risk group (p<0.0001; Figure 2).

Validation of the prognostic model in GSE76427
We validated the risk score system in the GSE76427 cohort. A total of 108 patients were divided into two groups (45 high-risk patients and 63 low-risk patients) with a cut-off of 3.4144 for the risk score. Furthermore, the survival curve revealed that the recurrence-free survival was signi cantly poorer in the high-risk group than in the low-risk group (p=0.011; Figure 3).
In summary, these results indicate that the prognostic model has moderate sensitivity and speci city.

Association between the prognostic model and the clinical characteristics of patients
In clarifying the correlation between the seven-gene signature and the clinical characteristics of HCC patients, we found that a high risk score was signi cantly correlated with TNM stage (p<0.001), grade (p=0.001), and AFP (p=0.014) but was not signi cantly associated with the sex, age, BMI, or Child-Pugh score of patients with HCC (Table ).
In GSE76427, the results showed that the 7-gene signature was not signi cantly associated with sex, age, BCLC (Barcelona Clinic Liver Cancer) or TNM stage (Table ). 3.6 Independent prognostic role of the prognostic gene signature Moreover, the results of univariate Cox regression showed that the TNM stage (HR=1.680, p<0.001) and our prognostic model (HR=3.607, p<0.001) were both independent factors for recurrence-free survival among the 334 TCGA-LIHC patients. However, among the 108 patients in the GSE76427 cohort, TNM stage was not an independent prognostic factor for recurrence-free survival (18). The prognostic model (HR=2.407, p=0.014) was also an independent factor for recurrence-free survival. (Figure 4).  (Table ). Overall, our prognostic model showed bene t for predicting recurrence-free survival, which might help doctors with targeted treatment (Figure 6).

Development of the calibration curve
We calculated the C-index and drew the calibration curves for the 12-, 15-and 18-month survival predictions to evaluate the calibration in the TCGA-LIHC dataset and the GSE76427 dataset. The C-index in the TCGA-LIHC dataset is 0.717 and in the GSE76427 dataset is 0.647, as shown in Figures 7 and 8.

External validation in an online database
The representative protein expression of SLCO2A1, PPAT, GAS2L3, CD3EAP, ACAT1 were explored in the Human Protein Pro les. Then we explored the TTK, C16orf54, PPAT, CD3EAP, SLCO2A1, ACAT1, and GAS2L3 genes in the CBioPortal for cancer genomics. TTK exhibited the most frequent genetic alterations (3%), of which deep deletion was the most frequent alteration. The second most altered gene was CD3EAP (1.3%), and the most frequent alterations were ampli cation mutations (Figure 9). The expression of the seven genes in different cancers were shown in Figure 10. In summary, the aberrant expression of these seven genes may explain some of the abnormal expression of these genes.

Discussion
In this study, we developed a risk score of seven genes that had the ability to predict the probability of recurrence-free survival in HCC patients and is more accurate than clinical indicators. Using this model, we can predict which patients with HCC have a higher risk of recurrence, indicating that they need more attention. In the TCGA-LIHC dataset, a total of 1331 genes were found to be associated with the recurrence-free survival of HCC patients. Through KEGG analysis, we found that the 1331 genes were enriched in signaling pathways such as the cell cycle, homologous recombination, DNA replication, the Fanconi anemia pathway, complement and coagulation cascades, and the T cell receptor signaling pathway. This nding suggests that the 7-gene signature might affect the RFS of HCC through those pathways. Then, we selected the best 7 genes to develop the risk score model: TTK, C16orf105, PPAT, CD3EAP, SLCO2A1, ACAT1, and GAS2L3. Additionally, our study also showed that the TNM staging system was not an accurate indicator for predicting RFS in HCC patients, which was consistent with the results of other studies. According to the prognostic model, we divided the patients into low-and high-risk groups, which had signi cant differences in recurrence-free survival. This result indicated that the prognostic model could be used as a conventional tool in predicting the RFS of HCC patients.
The prognostic model was validated using another independent dataset, GSE76427. The area under the curve showed the ability of the prognostic model to differentiate the patients' prognoses; the survival curve represents the survival of the high-risk group, which had a worse prognosis, compared with that of the low-risk group. These ndings demonstrate that the prognostic model has the ability to forecast RFS in HCC patients.
Most of the seven genes in our prognostic model have been reported to be involved in cancer. TTK protein levels are different in human liver cancer between liver cancer cells and adjacent noncancerous liver cells (19). The study also tested the utility of TTK-targeted inhibition and demonstrated therapeutic potential in an experimental model of liver cancer in vivo. Furthermore, our study demonstrated its effectiveness and incorporated it into the prognostic model. PPAT, a member of the purine/pyrimidine phosphoribosyl transferase family, regulates pyruvate kinase activity and cell proliferation and invasion and is a biomarker for lung adenocarcinoma. Acetyl-CoA acetyltransferase (ACAT) was recently reported to be elevated in human cancer cell lines (20). ACAT1 has acetyltransferase activity, and it can acetylate pyruvate dehydrogenase (PDH), which affects tumor growth (21).
In other scholars' prognostic analysis of HCC, CD3EAP is also a predictor, suggesting that CD3EAP is an important predictor of HCC prognosis, but the function of CD3EAP is not completely clear (22). The function of GAS2L3 is still unknown; it may be involved in mediating the absorption and clearance of prostaglandins, but it has not reported in liver cancer (13). Moreover, SLCO2A1 and C16orf105 have not been reported in previous HCC studies, indicating that these genes may be potential factors in the treatment of HCC. Understanding the function of these genes may promote the development of HCC treatment.
However, despite the potential substantial clinical signi cance of our results, this study still has some limitations. One was that, although the calibration curve performance and the AUC value were excellent in the validation group, multicenter clinical application is also needed to evaluate the external utility of the prognostic model (23). Second, only 1331 genes were de ned as genes associated with RFS and evaluated for the prognostic model construction. Some important genes could have been excluded before building the prognostic model (24). In addition, signaling pathways are urgently needed to reveal the functions of these genes in HCC.

Conclusions
In conclusion, we developed and validated a prognostic model for predicting the recurrence-free survival probability of HCC patients. The simple prognostic model had the ability to predict RFS, and it could be a useful tool for doctors conducting an evaluation of HCC and selecting treatment plans for HCC patients.

Declarations Con ict of Interest
The authors declare no competing interests.
Author Contributions Statement WW, LW and YY conceived and designed the study. WW analyzed the data. XX and YL performed literature searches. WW and LW wrote the paper. QL reviewed and edited the manuscript. All authors read and approved the manuscript.