Identification of the potential novel biomarkers as susceptibility gene for Wilms tumor

Wilms tumor (WT) is the most common malignant renal tumor in children. The aim of this study was to identify potential susceptibility gene of WT for better prognosis. Weighted gene coexpression network analysis is used for the detection of clinically important biomarkers associated with WT. In the study, 59 tissue samples from National Cancer Institute were pretreated for constructing gene co-expression network, while 224 samples also downloaded from National Cancer Institute were used for hub gene validation and module preservation analysis. Three modules were found to be highly correlated with WT, and 44 top hub genes were identified in these key modules eventually. In addition, both the module preservation analysis and gene validation showed ideal results based on other dataset with 224 samples. Meanwhile, Functional enrichment analysis showed that genes in module were enriched to sister chromatid cohesion, cell cycle, oocyte meiosis. In summary, we established a gene co-expression network to identify 44 hub genes are closely to recurrence and staging of WT, and 6 of these hub genes was closely related to the poor prognosis of patients. Our findings revealed that those hub genes may be used as potential susceptibility gene for clinical diagnosis and prognosis of this tumor.


Background
Wilms tumor (WT) is common abdominal malignancy in children and accounts for up to 95% of renal tumors in children [1]. This tumor is an embryonal childhood tumor of metanephric origin, as it is histologically similar to the early stages of nephrogenesis, and many of the genetic changes that support the disease occur in genes associated with fetal kidney genes [2]. With the development of modern multimodality therapy, favorable histology WT survival has been achieved in more than 90% [3][4][5]. Of course, the described above does not include patients with advanced disease, bilateral WT patients and patients with recurrence. For approximately 10% of patients with high-risk subtypes of WT, the treatment outcome is not optimistic. Meanwhile, a challenge for all WT patients is recurrence. Approximately 15% of favorable histology WT patients will experience recurrence, and the overall survival of patients with recurrence can drop to 40 to 80% [6,7]. However, although short-term survival is high in patients with WT, longterm survival is reduced in patients with WT due to adverse therapeutic effects of cancer treatment, such as renal insufficiency, secondary malignancies, and heart failure. In addition to technical improvements, it is also important to clarify the stage of WT and avoid overtreatment. And, the high recurrence rate of WT with favorable histology indicates that new therapies are needed to improve the prognosis. Therefore, it's urgent to detect potential susceptibility gene associated with WT. The samples data for this study came from the National Cancer Institute (NCI) with large, global database of cancer-related genetic variations. Samples were constructed co-expression network according to weighted gene co-expression network analysis (WGCNA), which was performed using the WGCNA package for R [8]. WGCNA, which is a systems biology method for the analysis of microarray data, is widely used for the detection of potential biomarkers. The application of WGCNA is based on a scale-free network distribution, and its advantage over other research methods may be that it can maximize the use of effective data information. When transforming the correlation matrix into the adjacency matrix, we choose β to weight the correlation coefficient, "polarize" the correlation coefficient and make the correlation and noncorrelation more significant. WGCNA is widely used as a powerful data-driven tool to study the expression of cancer genes and to help understand the development mechanisms of various cancers, such as clear cell renal cell carcinoma (ccRCC), hepatocellular carcinoma (HCC), lung cancer.
In the study, a weighted gene co-expression network was performed to identify modules that are significantly related to WT. Then the top hub genes are screened out in the module. These genes might as potential susceptibility gene to reduce recurrence for patients with WT, avoid overtreatment by scientific pathological staging, and thus minimize toxicity treatment with other conditions unchanged, leading to better prognosis and better long-term survival. This study may have important reference value for potential susceptibility gene of this disease.

Data collection and data preprocessing
Raw gene expression profiles were downloaded from NCI (https://ocg.cancer.gov/programs/target/datamatrix). All tissue samples were from the platform named GPL96 [HG-U133A] Affymetrix Human Genome U133A Array. After the raw expression data were corrected by the robust multi-array averaging (RMA) algorithm [9], the nsFilter algorithm was used to filter the data for the next analysis. There were 6201 probes and 59 primary tumor tissue samples through filtering. The samples were named by the provider that indicated gender, age at diagnosis, overall survival time, event (recurrent) free survival time and disease stage.
Weighted genes co-expression network analysis WGCNA was carried out to screen for significantly stable modules that are related to WT. The key to constructing the gene co-expression network by the WGCNA method [10] is "weighted"; specifically, the selection of soft threshold β is the key link for subsequent analysis. By using the scale-free topological criterion to select a soft threshold [11], the co-expression network has a high biological signal and is closer to the scale-free network distribution. Generally, when R 2 > 0.8, the network was considered approximately to the scale-free network distribution. In short, the correlation matrix was converted into the adjacency matrix based on the soft threshold β, and the specific calculation is a ij= |cor (x i , x j )| β , X i and x j are the nodes i and j of the network. Then, the adjacency matrix was transformed into a topological overlap matrix (TOM) after a series of complex calculations. TOM provides a simplified diagram of the network, allowing the visualization of the network and facilitating the identification of network modules. Then the TOM graph is analyzed by average linkage hierarchical clustering based on the phase dissimilarity (1-TOM). Modules, clusters of highly interconnected genes in coexpression network, were identified after performing cluster analysis and dynamic tree shearing. The height cut-off value of dynamic tree shearing was guided by TOM. In general, the number of genes in each module is 30 and above. Meanwhile, each module was marked with a different color, and all non-characteristic genes were assigned to gray module, and the grey module was not involved in subsequent analysis [11].

Module preservation analysis and identification of clinically significant modules
Module preservation analysis was carried out to measure the stability of each module defined. The stability of the module is measured by the Z summary score (Z-score) and medianRank [12]. The higher the Z-score is, the better the module is preserved, and the more reliable the subsequent analysis will be. In general, if the Z-score is greater than 10, it is considered that the module is wellpreserved. Moreover, the medianRank of the modules close to zero indicates the high degree of module preservation.
The gene modules most relevant to clinical features were selected for subsequent analysis. In addition to qualitative analysis, quantitative calculations involving modules eigengene (ME), gene significance (GS) and module significance (MS) were performed. ME is the first principal component in the gene module and may represent the gene expression profile of the whole module. GS can be understood as the correlation between genes and traits. The higher the GS of a given gene is, the more closely the gene is related to clinical characteristics, and the more significant the biological significance is [8]. MS represents the average value of GS in the module.

Identification and validation of hub genes and functional annotation
Select hub genes in two ways: those that have been shown in previous studies to be strongly associated with disease and those that are highly connected in the gene co-expression network [13]. The study focused on the latter approach. Hub genes or key genes are those that are significantly related to clinical characteristics and are highly connected with other genes in a gene module [8]. The former can be measured by geneTraitSignificance (geneTraitSignificance> 0.20), and the latter can be measured by geneModuleMembership (geneModuleMem-bership> 0.80). In addition, protein-protein interaction (PPI) network was visualized by Cytoscape (http://www. cytoscape.org/) package. Proteins that are highly connected in the PPI network are more important than those that are not, and the same is true for the corresponding genes [14]. After the co-expression network, the analysis of PPI network is equivalent to the scoring of genes from the perspective of proteins, which is conducive to the further identification and verification of hub genes. In PPI networks, we associate the first two values to identify some more biologically significant genes as hub genes for subsequent analysis. At the same time, to better understand the mechanism of gene action and to facilitate subsequent analysis, pathway enrichment analysis of hub genes in selected modules needed to be conducted. We uploaded genes to Enrichr for enrichment analysis and functional annotation [15].
Furthermore, 224 samples from NCI to test whether the hub genes were significantly expressed in the sample based on One-way analysis of variance (one-way ANOVA) To verify the hub genes, specifically, is to verify the differential expression of the hub gene in other independent data sets is statistically significant.

Hub gene evaluation
To assess the relationship between hub genes and WT patients, the "survival" package of R 3.5.2 software was used for the log-rank test and Kaplan-Meier survival analysis. The Kaplan-Meier method is a nonparametric survival analysis method, which is widely used in survival analysis of cancer research as a smart statistical treatment method of survival time [16]. In the section, patients are usually divided into two groups according to each gene expression (high vs. low). The corresponding Kaplan-Meier estimation value was calculated, and the Kaplan-Meier survival curve was determined. Furthermore, log-rank was used to test whether the difference in survival time between the two groups was statistically significant. Furthermore, time-dependent receiver operating characteristic (ROC) analysis was used to evaluate diagnosis value of hub genes.

Weighted co-expression network construction
After a series of data preprocessing including filtering, 6201 probes and 59 samples were obtained. When the correlation matrix was transformed into an adjacency matrix, as shown in Additional file 1, β = 11 (scale free R 2 = 0.88) was selected as the weighted coefficient value. The scale free is closest and higher than 0.85 for the first time. Two samples with Z. K value <− 2.5 (TARGET.50. PAJMJK and TARGET.50. PAJNAV) were filtered out ( Fig. 1). Finally, 13 modules were identified (Additional file 2).

Module preservation analysis and identification of clinically significant modules
The study conducted module preservation analysis use "modulePreservation" and determined whether the modules are preserved according to two main parameters: the Z-score and medianRank. As shown in Fig. 2, the Zscores of the green, pink, red, brown, black, blue module and turquoise module are all above 20, and the turquoise module is highest. The high Z-scores of these modules indicate that these modules are well preserved, but since Z-scores are highly dependent on the size of modules, we also need to analyze the medianRank. Although the Z-score of turquoise module indicates that it is well preserved, the medianRank of turquoise module is not optimistic, indicating that the module may be unstable. However, the two values of green, pink and blue modules indicate that these modules have good stability.

Identification and validation of hub genes and functional annotation
The genes with geneTraitSignificance> 0.20 and gene-ModulesMemership> 0.80 were defined as candidate genes, and the hub gene was not only the candidate gene but also the gene ranked in the top 10% of the module. Candidate genes in blue and pink modules were constructed a PPI network (Fig. 4) for they associated with the same clinical characteristics. Due to the superior PPI network connectivity of the blue module gene, the PPI network was also used in the selection of the hub genes in the blue module gene. Therefore, the study selected 15 genes from the green module, 18 genes from the blue module, and 11 genes from the pink module as the hub genes in Table 1. Finally, one-way ANOVA was used to verify the hub genes that finally defined. The verification results of the green module, blue module and pink module are shown in Figs. 5, 6 and 7, respectively. The differential expressions of 44 genes were statistically significant (P < 0.05). The candidate genes in each module were uploaded into the Enrichr database for Gene Ontology (GO) analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis. The candidate genes in the blue and pink modules were functionally annotated and enriched together. The GO functional annotation indicated that the blue and pink modules were enriched in sister chromatid cohesion, condensed chromosome kinetochore and spindle midzone, etc. (Table 2), and KEGG enrichment analysis showed that the blue and pink modules were enriched in the cell cycle, oocyte meiosis, etc. (Table 3). GO and KEGG  The protein-protein network of the candidate genes in blue module and pink module. Note: Each graph represents a gene, and the size of the graph is proportional to the degree of connectivity, and the higher the degree of connectivity between the orange and red genes. The arrow represents the candidate genes in the network indicated that the expression mechanism of the green module was related to the excitatory postsynaptic potential, excitatory synapse (Table 4) and the ras signaling pathway (Table 5).

Hub gene evaluation
All patients were divided into two groups according to the median expression value of each hub genes in the three modules. The corresponding Kaplan-Meier estimate value was calculated, and the Kaplan-Meier survival curve was plotted. A log-rank test showed that the difference in survival time between the two groups corresponding to 6 hub genes (P < 0.05) in the green module was statistically significant (Fig. 8). The reason for this result may be that among three modules, the green module is most relevant to the overall survival time (Fig.  3). The image of the high-expression group of 6 hub genes was steeper than that of the low-expression group, indicating that the high expression of these hub genes was closely related to the poor prognosis of patients. In addition, we drew ROC curves for the six hub genes, as shown in Fig. 9 and Table 6, and the six hub genes had moderate diagnostic value.

Discussion
WT is one of the most common tumors in children, and although modern multimodality therapy can improve the survival rate of WT patients, not all patients are spared. At present, the stage of WT can refer to the degree of resection, perioperative rupture of the tumor capsule, lymphatic diffusion and distant metastasis [1]. The chromosome 11p contains two biomarkers of concern: 11p13 (WT1) and 11p15.5 (WT2). WT1 mutation occurs in 15% of sporadic patients, loss of heterozygosity (LOH) at chromosome 11p15.5 accounts for 70% of WT patients [17]. Loss of the entire longarm of chromosome 11 was associated with higher rates of relapse and death [18]. Chromosomes 1p and 16q are also regions of concern for genetic changes. Multiple studies have suggested that LOH at 1p and/or 16q associates with relapse and over all poor prognosis [19,20]. So far, combined LOH at 1p/16q is the only molecular marker used for risk stratification. Although LOH at 1p and 16q was sensitive in predicting recurrence, this combination was present in only 9.4% of recurrent tumors [1]. Therefore, there is still some room to learn about WT. In this study, a dataset including 59 WT samples were used to construct the co-expression network, and 13 modules were identified, among which the green module was most related to stage, while the blue module and pink module were tightly related to event-free (recurrence) survival time. After further analysis, we identified 11 hub genes in the pink module, 15 hub genes in the green module, and 18 hub genes in the blue module for a total of 44 hub genes. It is worth mentioning that differential expression of 44 hub gene in other independent data sets is statistically significant. Gene enrichment pathways in the blue and pink modules were mainly focused on those related to sister chromatid cohesion, cell division and proliferation. Sister chromatid cohesion is an important pathway mediated by cohesive proteins to ensure normal chromosome segregation in cells [21].
We have not retrieved studies on the mechanism of sister chromatid cohesion and WT, but sister chromatid   cohesion has been reported to be frequently amplified in liver cancer cells, which might a driver gene promoting the proliferation of liver cancer cells and related to poor prognosis of liver cancer. Survival rates in breast cancer patients were associated with sister chromatid cohesion [22]. In addition, sister chromatid cohesion has been implicated in studies of pancreatic, bladder, and colorectal cancers. One report suggests that identifying cancer cells with sister chromatid cohesion mutations may be a new therapeutic opportunity [23]. Genes in green module enrichment into the Ras signaling pathway that play an important role in tumor progression through proliferation, survival, invasion and immune escape [24]. The Ras signaling pathway is mutated and highly activated in Note: GO Gene Ontology thyroid cancer, melanoma and many other cancers. The gene in the green module has also been enriched to excitatory postsynaptic potential and excitatory synapse, and relatively little research has been performed on these pathways and cancers. CDK1 is a prominent member of the cell cycle and is the most connected gene in the PPI network of blue and pink modules. CDK1 is a member of cyclin-dependent protein kinases (CDKs), which was mentioned in the pathogenesis and recurrence mechanism of various malignant tumors. CDKs and the cell cycle protein, as important proteins, are essential to the control and express the cycle [25]. CDK1 is mainly responsible for the G1/S and G2/M cell-cycle transitions. Many studies have shown that increased expression of CDKs or endogenous CDK regulator/inhibitor levels drop in various cancers, hematology tumor and sarcomas are visible, and because CDKs are natural targets for the treatment of cancer, many studies have shown that CDK inhibitors (e.g., AT7519) can inhibit cancer progression [26]. In the case  Note: GO Gene Ontology of CDK1, its down-regulation may induce mitotic mutations leading to apoptosis of WT cells [27]. In addition, CDK1 is associated with poor prognosis in human pharyngeal squamous cell carcinoma [28], prostate cancer [29], ovarian cancer [30], oral squamous cell carcinoma [31], pancreatic ductal adenocarcinoma [32] and other cancers. The recognized cancer genes CCNB1 and CCNB2 were also identified as WT-related genes. The hub gene in the pink module, TRIM28, is one of the susceptibility genes of WT. TRIM28 cells showed positive immunohistochemical staining in WT cells but negative staining in other tissues and WT epithelial components [33]. This gene was identified as the new susceptibility gene of wilms tumor in a study based on 890 wilms tumor patients with lymphocyte DNA exome sequencing [34], acting as a tumor suppressor gene by LOH [35].
Considering the above reasons, some scholars suggested that patients with epithelial wilms tumor should undergo TRIM28 gene detection [34]. Although there has been less research on the hub genes in the blue and pink modules of WT, there has been even less research on the hub genes in the green modules and WT compared to the blue and pink modules. Research on these hub genes is imperative to fully elucidate how alterations in cell differentiation relate to WT. The 6 hub genes of the green module (ELAVL3, DLG4, CES1P1, MYL10, CACNA1 and PTAF R) that were shown to have statistically significant differences in survival analysis and moderate diagnostic value in ROC curve were the focus of our attention. PTAFR have been reported as biomarkers for breast cancer [36]. PTAFR is a platelet activating factor receptor associated with many characteristic and inflammatory diseases, but it has been less frequently reported in cancer [37,38]. CES1P1 are associated with the progression of gastric cancer. DLG4 is associated with poor prognosis for prostate cancer and colorectal cancer. Compared with the genes in the blue and pink modules, the genes in the green module have been poorly studied, not only in relation to WT but also to other cancers. Therefore, further investigating the genes in the green module may be the direction of our future research.
Compared with a study using the WGCNA method to identify hub genes associated with WT prognosis [39], our study samples were all from NCI. Of the 44 hub genes we identified, CDK1 and CDCA8 are also hub  Fig. 8 Overall survival analyses on hub genes. Note: The red lines represent high expression of hub genes, while blue lines represent low expression of hub genes genes in that study, which also verified these two genes.
Although there are similarities, we found many new genes that are closely related to WT pathological staging and poor prognosis. Many of these new genes are associated with the relapse of WT, which is currently one of the leading causes of death in WT patients. To the best of our knowledge, our study is the first to use the WGCNA method to identify genes associated with WT recurrence. In a study on WT using the MTT assay and clonal survival assay, mir-1180 was up-regulated in WT and may be a therapeutic target for WT in the future [40]. Another study showed that the hypomethylation level of long interspersed element-1 in WT could be used as a marker of recurrence after chemotherapy [41]. Glypican-3, which is specifically expressed in cancers including WT, is being considered as a biomarker for predicting tumor recurrence [42]. In a study of constructing a DEGs-Transcription Factors-miRNA network to explore WT-related biomolecular markers, and TFs and miRNA were mainly studied, but this study focused on hub genes [43]. Our study identified new biomarkers associated with WT recurrence, which may be a new research direction in the future. However, our study has some limitations: (1) Due to the small number of samples and possible bias, we need to conduct considerable research in the future to verify our results. (2) The original NCI data did not provide the data of progressionfree survival, which prevented this study from evaluating this important clinical outcome.

Conclusions
In summary, with the help of WGCNA, PPI network model construction, GO analysis, KEGG analysis and survival analysis, we identify hub genes are closely to