Development and validation of a fourteen- innate immunity-related gene pairs signature for predicting prognosis head and neck squamous cell carcinoma

Background Immune-related genes is closely related to the occurrence and prognosis of head and neck squamous cell carcinoma (HNSCC). At the same time, immune-related genes have great potential as prognostic markers in many types of cancer. The prognosis of HNSCC is still poor currently, and it may be effective to predict the clinical outcome of HNSCC by immunogenic analysis. Methods RNASeq and clinical follow-up information were downloaded from The Cancer Genome Atlas (TCGA), the MINiML format GSE65858 chip expression data was downloaded from NCBI, and immune-related genes was downloaded from the InnateDB database. Immune-related genes in 519 HNSC patients were integrated from TCGA dataset. By using multivariate COX analysis and Lasso regression, robust immune-related gene pairs (IRGPs) that predict clinical outcomes of HNSCC were identified. Finally, a risk prognostic model related to immune gene pair was established and verified by clinical features, test sets and GEO external validation set. Results A total of 699 IRGPs were significantly correlated with the prognosis of HNSCC patients. Fourteen robust IRGPs were finally obtained by Lasso regression and a prognostic risk prediction model was constructed. Risk score of each sample were calculated based on Risk models and divided into the high-risk group (Risk-H) and low Risk group (Risk-L). Risk models were able to stratify the risk in patients with TNM Stage, Age, gender, and smoking history, and the AUC > 0.65 in training set and test set, shows that 14-IRGPs signature in patients with HNSCC has excellent classification performance. In addition, 14-IRGPs had the highest average C index compared with the prognostic characteristics and T, N, and Age of the 3 previously reported HNSCC. Conclusion This study constructed 14-IRGPs as a novel prognostic marker for predicting survival in HNSCC patients.


Background
Human head and neck squamous cell carcinoma (HNSC C) is one of the most common tumors today, with approximately 550,000 people worldwide suffering from this disease each year, and approximately 300,000 patients die [1]. Long-term repeated inflammatory stimuli are considered to be one of the main causes of the disease, including smoking, drinking, repeated trauma, and human papillomavirus (HPV) infection [2]. HNSCC is characterized by high proliferative, regional lymph node metastasis and poor prognosis [3]. It is urgent to investigate the development of novel and sensitive HNSCC tumor prognostic markers to reduce the number of HNSCC patients not diagnosed prior to the onset of invasive disease.
Cancer immunotherapy aims to enhance the activity of the immune system to fight cancer, has always been the main driving force of personalized medicine [4,5]. In recent decades, immunotherapy has developed rapidly and has become a treatment for many cancers [6]. The expression of PD-L1 is usually higher in HNSCC tumors with a positive rate of 46 to 100% in several studies [7]. Tadalafil and anti-tumor vaccinemediated immune rejection reversal also lead to upregulation of PDL1 in recurrent HNSCC, suggesting that immunological checkpoint treatment may be effective in patients with HNSCC [8]. In 2016, the US food and drug administration (FDA) approved the first immunotherapy treatments-nivolumab and pembrolizumab for patients with recurrent (HNSCC with platinum-based regimens that are difficult to treat) [9]. Although these findings support the importance of immunology in HNSCC, the molecular mechanisms remain unclear, especially for immune-related genomic effects. With the advent of public large-scale gene expression data sets, cancer researchers have been able to accurately identify tumor-related prognostic biomarkers [10]. Li et al analyzed the prognostic value of IRGPs to develop individualized immune features that improve prognosis in patients with non-squamous non-small cell lung cancer [11]. However, the clinical relevance and prognostic significance of IRGPs in HNSCC have not been studied in depth.
In this study, we integrated immune-related genes in 519 HNSCC patients based on the TCGA dataset. Multivariate COX analysis and Lasso regression were used to identify robust IRGPs that predicted HNSCC clinical outcomes and establish a risk prognosis model related to immune gene pairs. IRGPs was found to be a strong prognostic biomarker and predictor of HNSCC.

Data collection and processing
In April 30, 2019, RNA-seq data and the latest clinical follow-up information were downloaded from TCGA using GDC API, including 612 RNA-seq data samples. Similarly, a set of chip data set GSE65858 in MINiML format were downloaded from NCBI, including the expression profile data and clinical follow-up information of 270 HNSCC sample. All patients underwent surgery with a negative surgical margin, receive no adjuvant or neoadjuvant therapy. A total of 1039 immune-related genes (removing the name-repeated gene) were downloaded from the InnateDB database (https://www.innatedb.com/).
For the TCGA RNAseq data, we screened 517 tumor samples with follow-up information and OS > 0, extracted the expression profile of the immune-related gene set and removed the gene with 0 expression level in 50% of the samples. For chip data sets, we screened samples with follow-up information and OS > 0, R package GEOquery was used to map the chip probes to Gen-eSymbol, the probes was mapped to multiple genes were removed, multiple probes were mapped to a single gene to take the median, gene expression profile were obtained, and the expression profile of the immune gene set were extracted. The clinical information of TCGA and GEO patients is shown in Table 1. The workflow is shown in Fig. 1.

Sample grouping
For better model building and validation, we randomly divided the TCGA data set into two groups, one as a training set (N = 260), one as an internal validation set (N = 259), and the GSE65858 data set as an independent external validation set. During the random grouping process of TCGA, we kept the two groups of samples similar in age distribution, clinical stage, follow-up time, and proportion of patient deaths, while the number of samples after clustering the gene expression profiles of the two groups was close to each other, and the statistical characteristics of the two samples are shown in Table 2.

Construction of IRGPs
A total of 539,241 gene pairs were obtained by randomly permutation and combination of 1039 immune genes. For arbitrary gene i (IRG i ) and gene j (IRG j ), IRGP ij ,were calculated. The IRGP values were defined as follows: Where IRG indicates the amount of gene expression, we calculated all IRGPs values for all samples and further filtered IRGPs with a standard deviation of 0, a total of 18,182 IRGPs were obtained.

Univariate cox survival analysis
Univariate Cox proportional hazard regression analysis was performed on each IRGPs as in Jin-Cheng et al. [12] to screen for those genes that were significantly associated with OS in the training data set, with p < 0.05 as the threshold.

Screening for robust immune-related prognostic features
LASSO is a popular method for regression modeling with a large number of potential prognostic features, because it can perform automatic feature selection in a manner that results in signatures with generally good prognostic performance [13]. The LASSO method has been extended to the Cox model for survival analysis and has been successfully applied for the purpose of building sparse signatures for survival prognosis in many application areas including oncology [14][15][16], First, we used training set samples to conduct univariate Cox proportional risk regression analysis for each IRGPs, with log rank p < 0.05 as the threshold, 669 IRGPs with significantly correlated prognoses were identified. Furthermore, R software package glmnet [17] was used to screen robust prognostic immune-related gene pairs, and 3-fold cross validation was used to evaluate the optimal characteristics. The degree of LASSO regression complexity adjustment is controlled by the parameter λ, where the larger λ is, the greater the penalty for a linear model with more variables, so that a model with fewer variables is eventually obtained. In this study, the optimal model is obtained when λ = 0.1218186, and we choose the features incorporated in the model at this time as the optimal combination of features, i.e., 14-IRGPs. Multivariate Cox regression analysis was conducted using the stepwise regression method to determine the coefficient of each IRGPs in the 14-IRGPs, and the following risk score model was constructed: Exp k Ãe HR k Where N is the number of prognostic IRGPs, Exp k is the IRGP value of prognostic IRGPs, and e HR k is the estimated regression coefficient of IRGPs in the multivariate Cox regression analysis.

Validation and assessment of the IRGPs signature
To validate the IRGPs signature, patients in test datasets were divided into low risk and high risk group according to the median value of the risk score, which calculated according to the prognostic signature. The log-rank test and Cox regression analysis were conducted to evaluate overall survival difference between the low risk and high risk groups. Receiver operating characteristic curve (ROC) curve was used to assess the categorization of IRGPs signature. The IRGPs signature was also compared with the published signature by KM survival curve, ROC curve, and C-index.

RiskScore and clinical characteristics
In order to observe the relationship between riskScore and clinical phenotype, the samples were divided into two groups based on the riskScore median of the samples, and the prognosis differences between high risk-Score and low riskScore were compared respectively. Similarly, the relationship Grade, Age and Stage in High and Low TMEScore was analyzed.

Functional enrichment analysis
We used R package clusterProfiler, v3.8 [18] for GO and KEGG enrichment analysis with a p value of less than 0.05 as the threshold. GSEA [19] was performed by R package GSVA using the MSigDB [20]. Gene sets with a false discovery rate (FDR) value less than 0.05 after performing 1000 permutations were considered to be significantly enriched.

Statistical analysis
The Kaplan-Meier (KM) curve was plotted when the median risk score in each data set was used as a cutoff to compare the risk of survival between the high risk group and the low risk group. Multivariate Cox regression analysis was performed to test whether gene markers are independent prognostic factors. Significance was defined as P < 0.05. AUC analysis was performed using the R package pROC. All analyses use default parameters except for special instructions, which are performed in R software version 3.4.3.

Identification of IRGPs in patients with HNSCC
For the TCGA training set samples, we used a univariate Cox proportional hazard regression model to establish the relationship between patient overall survival and immune-related gene expression, and obtained 164 prognostic genes. According to the calculation rule of the IRGPs value, a total of 7374 IRGPs are obtained. The univariate Cox proportional hazards regression model was used to establish the relationship between IRGPs and overall patient survival. Finally, we obtained 699 IRGPs with significant prognostic differences (Fig. 2a). In order to screen robust immune-related prognostic gene pairs, we used lasso regression to perform dimensionality reduction analysis on these 699 IRGPs. The results show that as the lambda increases, the number of independent coefficients tends to 0 (Fig. 2b), 3-fold cross-validation was used to build the model, and the model is optimal when lambda = 0.1218186 (Fig.  2c). We select the model when lambda = 0.1218186 as the final model, which contains a total of 14 IRGPs, 19 genes ( Table 3). The risk scores of these 14 IRGPs in each sample are shown in Fig. 2d. Furthermore, we calculated the Risk Score of each sample based on the Risk model, and the formula is as follows:

14-IRGPs signature could be used as a prognostic marker
Multivariate regression analysis was used to establish a risk model for 14 IRGPs in the training set, validation set, TCGA dataset and independent test set data (GSE65858 dataset) for 1, 3, and 5 years. The results suggest that the average AUC of the training set is 0.758, the average AUC of the validation set is 0.659, the average AUC of the TCGA dataset is 0.709, and the average AUC of the independent test set data is 0.685 (Fig. 3a-d). With the median risk score as the threshold, the training set samples were divided into risk-H and risk-L, and the KM survival curves of 14-IRGPs in training set, validation set, all data sets of TCGA and one independent GEO testsets (GSE65858 dataset) were drawn. The results showed that the prognosis of the risk-L group of all data sets was significantly better than that of the risk-H group ( Fig.  3e-h). In summary, IRGPs have great potential as prognostic markers.

Predictive power of risk models in different clinical samples
In order to observe the robustness of risk models in different clinical characteristics, we observed the predictive power of risk models in different TNMstages, Age, gender and smoking history. We found that the 14-IRGPs signature model can be significantly distinguished into high-risk group and low-risk group not only in early patients and late-stage patients (Fig. 4a, b) (log rank p = 0.00023, log rank p < 0.0001), but also in young data sets and elderly data sets (Fig. 4c, d) (logrank p < 0.0001, log rank p < 0.0001), and in female data sets and male data sets (Fig. 4e, f) (log rank p = 0.01, log rank p < 0.0001). Finally, our analysis of the samples with and without smoking history shows that 14-IRGPs signature can also significantly distinguish the high-risk group from the low-risk group (Fig. 4g, h) (log rank p = 0.002, log rank p < 0.0001), those results indicated that our model has a very stable predictive power in patients of different ages, stages and genders.
Univariate and multivariate analysis of 14-IRGPs signature In order to identify the independence of 14-IRGPs signature model in clinical application, we used univariate and multivariate COX regression analysis to analyze relevant HR, 95%CI of HR, p value in TCGA training set, TCGA verification data set and all data of TCGA. We systematically analyzed clinical information recorded by TCGA patients, including age, T, N, AJCC Stage, Grade, Smoking, and our 14-IRGPs signature grouping information (Table 4).
In the training set of TCGA, univariate COX regression analysis found that Risk score, AJCC Stage and Smoking were significantly correlated with survival, but the corresponding multi-factor COX regression analysis found that Risk score (HR = 2.53, 95%CI = 1.54-4.13, p = 0.0002), T Stage and AJCC Stage were significantly correlated with survival.
In the verification set of TCGA, univariate COX regression analysis found that Risk score and T staging were significantly correlated with survival, but the corresponding multivariate COX regression analysis found that Risk score (HR = 1.72, 95%CI = 1.12-2.62, p = 0.0123) and T staging were significantly correlated with survival.
In all data sets of TCGA, univariate COX regression analysis found that Risk score, age, gender and AJCC stage were significantly correlated with survival, but the corresponding multivariate COX regression analysis found that Risk score (HR = 2.01, 95%CI = 1.41-2.84, p < 0.0001), age and AJCC stage staging were significantly correlated with survival.
Finally, in the GEO external data set, univariate COX regression analysis found that Risk score, age, T stage, N stage and AJCC stage were significantly correlated with survival, but the corresponding multivariate COX regression analysis found that Risk score (HR = 1.90, 95%CI = 1.23-2.92, p = 0.0035), age and T stage were significantly correlated with survival.
The above conditions indicate that our model 14-IRGPs signature has a good predictive performance in terms of clinical application value in TCGA data set, and our model may be a prognostic indicator independent of other clinical factors and has an independent predictive performance in terms of clinical application value.

Functional analysis and immune analysis of IRGPs
In order to further analyze the functions of 14-IRGPs, we first used clusterProfiler to conduct GO and KEGG enrichment analysis on 19 genes, and finally retained the results of p < 0.05. The results showed that these pathways were enriched to 356 GO BP, which were mainly T cell receptor signaling pathway, stress-activated MAPK cascade and other biological processes, and we show the most significant top 20 (Fig. 5a), Furthermore, we found that these 19 genes were significantly enriched in 39 GO CCs and 73 GO MFs, the most prominent of which were the top 20 (Fig. 5b, c).  ssGSEA was used to analyze the enrichment scores of each sample in each pathway in the TCGA data set, calculate the correlation between these pathways and risk scores, and 26 pathways with correlation > 0.25 were selected (Fig. 5d), we found that most of the samples with risk score present negative correlation, a small number of positively related with risk score. Cluster analysis was conducted according to the 26 KEGG pathway enrichment scores (Fig. 5e), it can be seen that among the 26 pathways, B CELL SIGNALING PATHWAY, PRIMARY IMMUNODEFICIENCY and other pathways increase with the increase of RiskScore, and FOCAL ADHESION, GALACTOSE_METABOLISM and other metabolismrelated pathways decrease with the increase of RiskScore. Results suggested that the imbalance of these pathways is closely related to the development of tumors.
In addition, we obtained the signature genes of three immune cells, Activated B cell, Activated CD4 T cell, and Activated CD8 T cell, from a previous study [21], and calculated enrichment scores in each sample using the method of ssGSEA to assess the sample's corresponding Immune cell scores. The differences in these three immune cell scores in the high and low risk groups of patients were analyzed and observed that B cell and T cell activation scores were all significantly lower in high risk patients with poor prognosis (Fig. 5f). Current immunotherapy-related datasets are rare. We found a cohort of PD-L1-treated patients with metastatic uroepithelial carcinoma shared by Sanjeev Mariathasan et al [22] and analyzed the differential expression of 19 genes in 14 IRGPs in patients with different response states after PD-L1 treatment. We observed a significant differential expression of 14 (73.6%) genes (Fig. 5g), suggesting that these genes are associated with immunotherapy.

Establishment and evaluation of nomogram model
In addition to 14-IRGPs, clinical features Stage and Age are also independent prognostic factors, indicating that The KEGG pathway with a correlation with risk scores greater than 0.25 has a relationship with the ssGSEA score in each sample as the risk score increases. The horizontal axis represents the sample, and the risk score from left to right increases in turn. f: The difference of B cell and T cell activation scores between high and low risk groups. g: Expression differences of 19 genes in 14-IRGPs in patients with different response states after PD-L1 treatment; CR: complete response, PD: progressive disease, SD: stable disease, PR: partial response they have complementary values. In order to further improve the accuracy of prediction, a new nomogram was established by integrating Stage, Age and 14-IRGPs using Cox model. According to this model, 14-IRGPs contribute the most to OS, followed by Age and Stage (Fig. 6a). By calculating the total score, oncologists could easily obtain the OS probability predicted by the nomogram of an individual patient. Furthermore, we used the calibration curve to evaluate the prediction accuracy of the model (Fig. 6b), results show that the predicted calibration curves of the three calibration points in 1, 3, and 5 years were close to the standard curve, which indicated that the model has good prediction performance. In addition, we also used DCA (Decision curve) to evaluate the reliability of the model (Fig. 6c). It was observed that RiskScore (14-IRGPs) and nomogram benefit significantly higher than the extreme curve, and nomogram is higher than RiskScore, and Age and Stage are close to the extreme curve. This suggests that RiskScore (14-IRGPs) and nomogram have good reliability.

14-IRGPs was compared with other signatures and clinical features
In order to observe the performance of 14-IRGPs, the prognostic signature of three head and neck cancers reported in the past (3-gene signature of Cui L et al [23], 6-gene signature of Weidong Zhang et al [24] and 3gene signature of Hongbo Zhou et al [25]) and four  clinical features of T, N, age and Stage were selected. In order to make the model comparable, we calculated the Risk score of each head and neck cancer sample in the TCGA training set with the same method according to the corresponding genes in the 3 models, evaluated the ROC of each model and C-index (Table 5). We observe that the 6-gene signature model has the highest AUC above among the three models, while the average AUC for 1, 3 and 5 years is 0.63. The 1, 3, and 5 years AUC of 14-IRGPs signature were all above 0.73. In addition, in the C-index of all models, 14-IRGPs was significantly higher than other clinical features and models, indicating that our model has good application value.

Discussion
Due to the heterogeneity of HNSCC, patients are still at great risk of recurrence and death even after complete surgical resection. The management of adjuvant chemotherapy for early HNSCC remains controversial. Therefore, it is important to develop a personalized management approach for HNSCC. Reliable prognostic biomarkers can identify patients with poor prognosis, and predictive biomarkers can inform patients who may benefit from additional systemic therapy, regardless of treatment, and therefore have more direct clinical relevance. In this study, we developed immune-related genes for signature prediction of HNSCC prognosis. Their potential for molecular stratification of HNSCC suggests different immune characteristics at different stages of the tumor. In the past decade, important studies based on prognostic signals of immune gene expression have shown that immune genes have a strong prognostic ability. Several gene expression scores have been proposed for predicting the risk of recurrence, and both Tadalafil and anti-tumor vaccine-mediated immune rejection reversals also lead to up-regulation of PDL1 in recurrent HNSCC, suggesting that immune checkpoint therapy may be effective in patients with HNSCC [8]. Immune-related gene signature reflecting immune infiltration can predict the prognosis of colorectal cancer [26]. AP001056.1 is a key immunerelated ceRNA in SCCHN, and ICOSLG encodes immune checkpoint protein as its regulatory target, which can be used as a prognostic molecule of HNSCC [27]. The 14-IRGPs we developed could be risk stratified in four data sets, with AUC higher than 0.659. The KM curve of the risk score in the four data sets indicates that high risk predicts poor prognosis, and those results indicated that the immune-related gene can be used as a factor for stratifying the prognosis risk of HNSCC.
In order to observe whether 14-IRGPs signature is dependent on TP53 and EGFR mutation characteristics, we first compared the relationship among 14-IRGPs signature, TP53 and EGFR mutation using single-factor and multifactor analysis ( Figure S1A-B). The results showed that 14-IRGPs signature had significant difference in prognosis, suggesting that 14-IRGPs signature is an independent factor. Furthermore, we compared the ROC analysis of 14-IRGPs signature in mutant and non-mutant samples and, considering the small number of EGFR mutations, only TP53 mutations were analyzed here ( Figure S1C-D). We observed that the 14-IRGPs signature had higher AUC in both TP53 mutant and non-mutant samples. We also observed the lowest AUC at 1 year in TP53 mutant samples and the lowest AUC at 5 years in non-mutated samples, suggesting that the 14-IRGPs signature has better predictive performance for long-term survival in TP53 mutant samples and for short-term survival in non-mutated samples. We downloaded exon datasets of TCGA samples and extracted mutation data from HNSCC samples, in which a total of 508 patients were tested. Nineteen genes in the 14-IRGPs signature were analyzed for their mutation frequencies in these patients ( Figure S1E), which had the highest frequency of CDKN2A mutations, especially in high-risk patients, mainly Nonsense_Mutation.
Go and KEGG analysis were conducted to identify the functions of the 19 genes involved in HNSCC. T cell receptor signaling pathway, stress-activated MAPK cascade, B cell signaling pathway and primary immunodeficiency were enriched in TCGA samples. These immune-related pathways are involved in various biological processes, such as differentiation, growth, and apoptosis, and promote cell interaction and migration [28,29]. Taken together, those pathways may facilitate the metastasis of HNSCC.
Comprehensive analysis shows that risk score is a prognostic biomarker for HNSCC and can be used to molecularly stratify prognosis. Clinical features Age, Stage and Grade are key prognostic factors in head and neck squamous cell carcinoma. Factor [30], As expected, there is a significant association between the risk score and Age, Stage, and Grade, and found that the combination of risk score and Age has a superior prognostic effect.
Although we identify potential candidate IRGPs involved in tumorigenesis in large samples by bioinformatics techniques, some limitations of this study should be noted. First, the sample lacks some clinical follow-up information, so we did not consider factors such as the presence of other health status of the patient to distinguish prognostic biomarkers. Second, the results obtained only through bioinformatics analysis are inadequate and experimental validation is needed to confirm these results. Therefore, further genetic and experimental studies of larger sample sizes and experimental validation are needed.

Conclusions
In conclusion, we studied the immunological characteristics of HNSCC and systematically studied the expression profile of immune genes. We found immune-related gene pair