Establishment of an immune-related gene pair model to predict colon adenocarcinoma prognosis

Background Colon cancer is the most common type of gastrointestinal cancer and has high morbidity and mortality. Colon adenocarcinoma (COAD) is the main pathological type of colon cancer, and much evidence has supported the correlation between the prognosis of COAD and the immune system. The current study aimed to develop a robust prognostic immune-related gene pair (IRGP) model to estimate the overall survival of patients with COAD. Methods The gene expression profiles and clinical information of patients with colon adenocarcinoma were obtained from the TCGA and GEO databases and were divided into training and validation cohorts. Immune genes were selected that showed a significant association with prognosis. Results Among 1647 immune genes, a model with 17 IRGPs was built that was significantly associated with OS in the training cohort. In the training and validation datasets, the IRGP model divided patients into the high-risk group and low-risk group, and the prognosis of the high-risk group was significantly worse (P<0.001). Univariate and multivariate Cox proportional hazard analyses confirmed the feasibility of this model. Functional analysis confirmed that multiple tumor progression and stem cell growth-related pathways were upregulated in the high-risk groups. Regulatory T cells and macrophages M0 were significantly highly expressed in the high-risk group. Conclusion We successfully constructed an IRGP model that can predict the prognosis of COAD, providing new insights into the treatment strategy of COAD. Supplementary information Supplementary information accompanies this paper at 10.1186/s12885-020-07532-7.

With the development of high-throughput omics, various omics techniques, such as whole-genome sequencing, epigenomics, and proteomics, have been applied to study COAD [7][8][9][10]. Increasing evidence has shown that COAD is not a consistent disease type but a molecularly heterogeneous disease comprising a series of genetic changes [11]. Tumor heterogeneity can alter the tumor growth rate, invasive ability, sensitivity to drugs, prognosis and other aspects, making it one of the main obstacles affecting tumor treatment [12,13]. Therefore, dividing patients with COAD into different risk groups based on gene expression profiles helps to predict the risk of tumor progression or metastasis and recurrence and is a necessary prerequisite for proper individualized treatment [14][15][16].
There is increasing evidence that the immune system plays an important role in the occurrence and development of cancer [17][18][19]. For example, Salem M [20] found that disrupting the cell surface receptor glycoprotein-A repetitions predominant (GARP) on activated regulatory T (Treg) cells reduces immune tolerance and the development of colon cancer. In recent years, a method based on the relative ranking of gene expression levels was proposed to eliminate the shortcomings of data standardization and scaling in gene expression data processing, achieving reliable results in various studies [21,22]. The present study selected immune genes that are significantly associated with the prognosis of COAD. Next, we integrated these genes to construct an immune-related gene pair (IRGP) risk model and verified its feasibility as a prognostic marker for COAD.

Data processing
The human General Transfer Format (hunman.gtf) from Ensemble (https://www.ensembl.org/index.html) [26] was downloaded, and the TCGA data were annotated using Perl language [27]. The chip data file (GSE39582 and GSE17538) was preprocessed using Perl language through the annotation file of the GPL570 platform. Using the above operations, all the gene probe IDs were converted to corresponding gene symbols. To analyze the correlation between the IRGP signature and prognosis in COAD, only patient data containing complete overall survival (OS) were selected.
Establishment of the prognostic immune-related gene pair (IRGP) model We downloaded a list of immune-related genes (IRGs) from IMMPORT (https://www.immport.org/) [28], a website with open access to immunoassay data for translation and clinical research. Next, the R language [29] limma package (version 3.42.2) was used to control the list to screen out the IRGs in the downloaded TCGA transcriptome data. To further select valuable IRGPs, we measured and stored IRGs with a relatively high variation on all the platforms in this study (as determined by the median absolute deviation (MAD) > 0.5) [30]. The expression levels of IRGs in each sample in the transcriptome, GSE39582 and GSE17538 were compared in pairs to form each IRGP according to a previously validated method [22]. Specifically, in the pairwise comparison of each sample, if the expression level of the first gene is greater than that of the second gene, the output is 1; otherwise, it is 0. Samples with a ratio of 0 and 1 less than 20% were deleted to retain gene pairs that may be related to survival. These IRGPs were merged with the survival time of the clinical data downloaded by the corresponding platform to evaluate the correlation between each IRGP in the training dataset and overall survival rate of the patient. Based on previous reports [31,32], we used the R language survival software package (version 3.1-11) to perform univariate Cox regression analysis and P < 0.001 to screen the effective IRGPs in the TCGA data. From these IRGPs, we used R language for Lasso Cox proportional hazards regression (glmnet software package, version 3.0-2) to construct the risk score, and the final prognostic model was defined using 17 gene pairs. Finally, in the training cohort, we set the overall survival to 5 years and constructed the time-dependent receiver operating characteristic (ROC) curve (survivalROC, version 1.0.3) to determine the best cutoff value for the risk score and divide patients into low-risk and high-risk groups accordingly.

Further validation of the model
Using the R package survival and survminer (version 0.4.6), Kaplan-Meier plots were applied to establish survival curves for the high-risk and low-risk groups in the training and verification cohorts. The differences in the survival curves were analyzed using the log-rank test. Cox proportional hazards analysis was used for univariate and multivariate analyses to assess the effect of the risk score and other clinical factors.

Gene expression profiles (GEPs) of immune cell infiltration in tumors
We used CIBERSORT [33] to infer the relative abundance of tumor-infiltrating immune cells in different risk groups. CIBERSORT estimated the putative proportion of infiltrating immune cells using a reference set with 22 sorted immune cell subtypes for each sample in the training cohort and validation cohorts. Monte Carlo sampling was used in CIBERSORT to calculate the P-value of the deconvolution of each sample to provide the estimated confidence. The permutation is set to greater than 100, and the corresponding P-value is generated.

Gene set enrichment analysis (GSEA)
The chemical and genetic perturbation analysis-related documents involved in the study were downloaded from the Molecular Signature Database (MSigDB C2, version 7.1) (https://www.gsea-msigdb.org/gsea/datasets.jsp). GSEA [34] was performed using the R package fgsea (version 1.12.0) with default parameters. A log 2-fold change was made between GEPs in the high-risk vs low-risk groups. The difference in the gene sets between the high-and low-risk groups was compared. Differences with an FDR-adjusted P < 0.05 were defined as significant.

Statistical analysis
For all the above tests, a P-value less than 0.05 denoted the presence of a statistically significant difference. Statistical significance was indicated as follows: *P < 0.05, **P < 0.01, ***P < 0.001.

Construction of the prognostic IRGP model
The TCGA transcriptome data were used as a training cohort. From the list of immune-related genes (IRGs) obtained by IMMPORT, the genes in the transcriptome data were searched in turn, and 1647 IRGs were identified. To ensure relatively high variation in the genes of the two platforms, we retained 325 IRGs with a median absolute deviation (MAD) > 0.5. In total, 40,375 pairs were deleted with a ratio of 0 and 1 less than 20%. Next, 12,275 immune-related gene pairs (IRGPs) were built based on these 325 IRGs. After univariate Cox regression analysis of these IRGPs in the training group, 28 potential prognostic IRGPs remained. Using Lasso Cox proportional hazards regression to define the model on the training set, 17 IRGPs were retained to form the final  (Table 1). Next, the risk score for each patient in the TCGA dataset was calculated based on the model. Finally, we used a time-dependent ROC curve analysis to classify patients into high-or low-immune risk groups. The optimal cutoff value for the risk score was set to − 0.576 (Fig. 1). This value successfully stratified the patients in the training cohort into high-and low-risk groups. In other words, the overall survival (OS) of the low-risk group was significantly higher than that of the high-risk group (Fig. 2a). We further performed univariate and multivariate Cox proportional hazards analyses to test whether the IRGP model predicted survival independently of other prognostic factors in the TCGA cohort. Among these analyses, the risk score of the model can be used as an independent prognostic factor (Fig. 3a, b).

Verification of the feasibility of the IRGP model to predict survival
To determine whether the model had consistent prognostic value in different risk groups, we applied the model to GSE39582 and GSE17538 as external validation. The patients in the verification cohort were divided into two groups according to the risk score. The OS of subgroups in the low-risk group increased significantly (Fig. 2b, Figure S1A). After performing univariate and multivariate Cox proportional hazards analyses in the validation group, we found that the results were similar to those of the training group, and the high risk score of this model suggests a poor prognostic factor (Fig. 3c, d, Figure S1B and Figure S1C).

Immune cell infiltration in different risk groups
Previous studies have revealed that tumor-infiltrating immune cells are related to prognosis [35]. To determine the infiltration of specific tumor immune cell subsets, we used CIBERSORT to estimate the relative proportion of 22 different immune cells per patient in different risk groups. Three radar charts depict a comparative summary of various immune cells in these two risk groups (Fig. 4, Figure S2 and Figure S6). In the training cohort, we found that activated dendritic cells, resting dendritic cells, eosinophils, M0 macrophages, monocytes, resting CD4 memory T cells and regulatory T cells (Tregs) were enriched in different risk groups. Among them, regulatory T cells (Tregs) and M0 macrophages were significantly and highly expressed in the high-risk group, and the rest were highly expressed in the low-risk group (Fig. 5). The high-risk group of GSE39582 highly expressed M0 macrophages, M1 macrophages, monocytes, neutrophils, CD8 T cells and follicular helper T cells ( Figure S3). The highrisk population in GSE17538 also highly expressed monocytes and Tregs ( Figure S7).

Functional evaluation of the IRGP model
To investigate the expression signatures of genetic perturbations that were significantly altered by the IRGP model, GSEA was performed in the high-risk and low-risk groups in the TCGA cohort. The bubble chart revealed that genes in the high-risk populations were enriched in stem cells and various advanced tumors (Fig. 6). The top five genetic perturbations in the high-risk group were enriched stem cells, increased breast cancer ductal invasion, a multicancer invasiveness signature, increased advanced vs early-stage gastric cancer and enriched mammary stem cells (Fig. 7). We also obtained similar results when performing the above analysis on GSE39582 and GSE17538 ( Figure S4, Figure S8). The high-risk group genes in GSE39582 were significantly enriched in breast cancer ductal invasion and stem cells ( Figure S5). The GSEA results obtained in GSE17538 also showed that the high-risk group genes are enriched in tumor cell growth and invasion ( Figure S9).

Discussion
Colon cancer is the most common type of gastrointestinal cancer and has high morbidity and mortality.  [36][37][38][39]. Patients with MSI-H have a better prognosis than those with microsatellite stability (MSS). However, the MSI-H population accounts for only approximately 10% of COAD. Most patients still face the dilemma of not having an effective prognostic indicator. Thus, the determination of new prognostic biomarkers is urgent to predict the survival of colon adenocarcinoma patients.
To obtain the robustness of the prognosis prediction in this study, we adopted a method for data analysis without considering the technical deviation of different platforms. The newly established prognostic model is based on the ranking and pairing comparison of relative gene expression values; thus, data preprocessing, such as scaling and normalization, is not required. This method has reliable results in many studies [40,41].
In this study, we identified an immune-related gene pair model to predict the overall survival for colon adenocarcinoma. The prognostic model comprises 17 immune-related gene pairs containing 26 unique immune-related genes. Most genes in this immune model are cytokine receptors and cytokines, which play a vital role in the adaptive immune response.
Among these IRGs, no evidence supports that the overexpression of IL17RB can enhance the invasion and metastasis of thyroid cancer cells [42]. STC2 overexpression is associated with a poor prognosis in patients with nasopharyngeal carcinoma (NPC) and can be used as a predictor of NPC responses to radiation [43]. The increase in IL-7 in colorectal cancer (CRC) is related to metastatic disease and tumor location [44]. Decreased CXCL14 expression indicates a poor prognosis and causes metastasis in colon cancer [45]. GRP signaling alters the invasion of colon cancer through heterochromatin protein 1 Hsβ and can improve the prognosis of patients with colon cancer [46]. Moreover, regulatory T cells (Tregs) and M0 macrophages are related to the poor clinical prognosis of many patients with cancer [47,48]. Dendritic cells are associated with cancer immunity and a favorable prognosis [49]. At the same time, the immune cell types M0 macrophages, M1 macrophages, monocytes, neutrophils, CD8 T cells and follicular helper T cells in the high-risk group of GSE39582 are all related to tumor progression and poor prognosis [50][51][52][53]. These findings are consistent with our results. In this study, we also found that several expression characteristics of genetic perturbations, such as increased stem cells, increased breast cancer ductal invasion, a multicancer invasiveness signature, increased advanced vs early gastric cancer and increased mammary stem cells, were related to the IRGP model. These results were verified by corresponding experiments [54][55][56][57][58], confirming their importance in tumor development and cell growth. These findings indicate that the IRGP Fig. 5 The abundance distribution of specific immune cells' within different risk groups. T cells regulatory and Macrophage M0 were significantly highly expressed in the high-risk group, while the rest were significantly higher in the low-risk group model may play an essential role in tumor invasiveness and progression in COAD.
The difference between this study and previously published studies [59] is that the IRGP model was established based on the TCGA database. Second, our strategy to establish a prognostic model was different. To screen out immune-related gene pairs that are significantly related to OS in patients with colon cancer, we used univariate Cox regression analysis before determining the final model using Lasso regression analysis. Finally, we conducted GSEA in the training and validation cohorts to further analyze the specific differences between the high-and low-risk groups. We found that the high-risk group genes were significantly enriched in tumor cell invasion and growth.
Similar to all RNA-seq and microarray analyses, our study had limitations. First, the training dataset to build the immune model was obtained from a retrospective study, which included fresh frozen samples; the stability and efficiency of formalinfixed and paraffin-embedded (FFPE) samples remain questionable. Therefore, it may be necessary to add more datasets with different sample attributes for more extensive verification. Second, because the prognostic model was based on TCGA and other databases, it required proficiency in bioinformatics. Additionally, the gene expression profiles produced by RNA-seq or microarray platforms require high prices and long conversion cycles. Therefore, this method is challenging to popularize in daily clinical applications.

Conclusions
In summary, our immune-related gene pair model can provide an evaluation reference for the prognostic risk of patients with colon adenocarcinoma. The immunerelated model was associated with the prognosis of patients with COAD. The tumor-infiltrating immune cells and genetic perturbations distinguished by this model in the high-and low-risk groups can further elucidate the role of our prognostic model in the development of colon adenocarcinoma. Therefore, the risk model will be a useful tool to better evaluate patients who may benefit from immunotherapy.
Additional file 1: Figure S1. Additional verification using GSE17538. Patients were stratified by immune-related gene pairs model(A). (B) represent the univariate analysis result and (C) represent the multivariate analyses result.
Additional file 3: Figure S3. The abundance distribution of specific immune cells' within different risk groups in GSE39582.
Additional file 4: Figure S4. The expression characteristics of genetic perturbations significantly changed by the IRGPs model in GSE39582.
Additional file 5: Figure S5. The top 5 results of GSEA in GSE39582.
Additional file 7: Figure S7. The abundance distribution of specific immune cells' within different risk groups in GSE17538.
Additional file 8: Figure S8. The expression characteristics of genetic perturbations significantly changed by the IRGPs model in GSE17538.