Novel significant stage-specific differentially expressed genes in hepatocellular carcinoma
BMC Cancer volume 19, Article number: 663 (2019)
Liver cancer is among top deadly cancers worldwide with a very poor prognosis, and the liver is a vulnerable site for metastases of other cancers. Early diagnosis is crucial for treatment of the predominant liver cancers, namely hepatocellular carcinoma (HCC). Here we developed a novel computational framework for the stage-specific analysis of HCC.
Using publicly available clinical and RNA-Seq data of cancer samples and controls and the AJCC staging system, we performed a linear modelling analysis of gene expression across all stages and found significant genome-wide changes in the log fold-change of gene expression in cancer samples relative to control. To identify genes that were stage-specific controlling for confounding differential expression in other stages, we developed a set of six pairwise contrasts between the stages and enforced a p-value threshold (< 0.05) for each such contrast. Genes were specific for a stage if they passed all the significance filters for that stage. The monotonicity of gene expression with cancer progression was analyzed with a linear model using the cancer stage as a numeric variable.
Our analysis yielded two stage-I specific genes (CA9, WNT7B), two stage-II specific genes (APOBEC3B, FAM186A), ten stage-III specific genes including DLG5, PARI, NCAPG2, GNMT and XRCC2, and 35 stage-IV specific genes including GABRD, PGAM2, PECAM1 and CXCR2P1. Overexpression of DLG5 was found to be tumor-promoting contrary to the cancer literature on this gene. Further, GABRD was found to be signifincantly monotonically upregulated across stages. Our work has revealed 1977 genes with significant monotonic patterns of expression across cancer stages. NDUFA4L2, CRHBP and PIGU were top genes with monotonic changes of expression across cancer stages that could represent promising targets for therapy. Comparison with gene signatures from the BCLC staging system identified two genes, HSP90AB1 and ARHGAP42. Gene set enrichment analysis indicated overrepresented pathways specific to each stage, notably viral infection pathways in HCC initiation.
Our study identified novel significant stage-specific differentially expressed genes which could enhance our understanding of the molecular determinants of hepatocellular carcinoma progression. Our findings could serve as biomarkers that potentially underpin diagnosis as well as pinpoint therapeutic targets.
Liver cancer is the second most deadly cancer in terms of mortality rate, with a very poor prognosis . It accounted for 9.1% of all cancer deaths, and 83% of the annual new estimated 782,000 liver cancer cases worldwide occur in developing countries . Liver cancer showed the greatest increase in mortality in the last decade for both males (53%) and females (59%) . Liver hepatocellular carcinoma (LIHC) or simply hepatocellular carcinoma (HCC) is the most common type of liver cancer, accounting for nearly 85% of liver cancers. 78% of all reported cases of HCC were due to viral infections (53% Hepatitis B virus and 25% Hepatitis C virus) . There are several non-viral causes of HCC as well, mainly aflatoxins and alcohol . As shown in Fig. 1, all the factors converge to a common mechanism of genetic alterations that lead to the acquisition of cancer hallmarks  and the eventual emergence of a cancer cell . Genetic alterations constitute the heart of the problem, and studying changes due to these genetic alterations is paramount to understand HCC. Earlier gene expression studies using EST data detected differential expression in cancer tissue compared to non-cancerous liver and proposed the existence of genetic aberrations and changes in transcriptional regulation in HCC . The Cancer Genome Atlas (TCGA) research network  have subtyped and identified many potential targets for HCC based on a comprehensive multi-omics analysis. An independent analysis of TCGA RNA-Seq data encompassing 12 cancer tissues has uncovered liver cancer-specific genes . Zhang et al.  have performed mutation analysis of HCC, and Yang et al.  combined TCGA expression data and natural language processing techniques to identify cancer-specific markers.
The burden of disease and mortality rate are both inversely correlated with the cancer stage. The response rate to therapy is also inversely correlated with stage. To the best of our knowledge, there are no reported research in the literature that have dissected the stage-specific features of HCC. The cancer staging system is based on gross features of cancer anatomical penetration, and one such standard is the American Joint Committee on Cancer (AJCC) Tumor-Node-Metastasis (TNM) staging . It is reasonable to hypothesize that the stage-specific gross changes are associated with signature molecular events, and try to probe such molecular bases of stage-wise progression of cancer. We had earlier published on stage-specific “hub driver” genes in colorectal cancer . A stage-focussed analysis of colorectal cancer transcriptome data yielded negative results vis-a-vis the AJCC staging system .
Normalized and log2-transformed Illumina HiSeq RNA-Seq gene expression data processed by the RSEM pipeline  were obtained from TCGA via the firebrowse.org portal . The patient barcode (uuid) of each sample encoded in the variable called ‘Hybridization REF’ was parsed and used to annotate the controls and cancer samples (Fig. 2). To annotate the stage information of the cancer samples, we obtained the clinical information dataset for HCC from firebrowse.org (LIHC.Merge_Clinical.Level_1.2016012800.0.0.tar.gz) and merged the clinical data with the expression data by matching the “Hybridization REF” in the expression data with the aliquot barcode identifier in the clinical data. The stage information of each patient was encoded in the clinical variable “pathologic stage”. The pathologic stage is essentially the surgical stage, prior to any treatment received, determined with the tissue obtained at the time of surgery. This interpretation is reinforced in the TCGA HCC sample inclusion criteria as follows: “Surgical resection of biopsy biospecimens were collected from patients diagnosed with hepatocellular carcinoma (HCC), and had not received prior treatment for their disease (ablation, chemotherapy, or radiotherapy)” (The TCGA ). The availability of this unequivocal information enables the analysis of cancer stages. The substages (A,B,C) were collapsed into the parent stage, resulting in four stages of interest (I, II, III, IV). We retained a handful of other clinical variables pertaining to demographic features, namely age, sex, height, weight, and vital status. With this merged dataset, we filtered out genes that showed little change in expression across all samples (defined as σ < 1). Finally, we removed cancer samples from our analysis that were missing stage annotation (value ‘NA’ in the “pathologic stage”). The data pre-processing was done using R (www.r-project.org).
Linear modelling of expression across cancer stages relative to the baseline expression (i.e, in normal tissue controls) was performed for each gene using the R limma package . The following linear model was fit for each gene’s expression based on the design matrix shown in Fig. 3a:
where the independent variables are indicator variables of the sample’s stage, the intercept α is the baseline expression estimated from the controls, and βi are the estimated stagewise log fold-change (lfc) coefficients relative to controls. The linear model was subjected to empirical Bayes adjustment to obtain moderated t-statistics . To account for multiple hypothesis testing and the false discovery rate, the p-values of the F-statistic of the linear fit were adjusted using the method od Hochberg and Benjamini . The linear trend across cancer stages for the top significant genes were visualized using boxplots to ascertain the regulation status of the gene relative to the control.
Monotonic mean expression
The linear model in eqn. (1) would not be sufficient to identify genes with an ordered monotonic trend of expression across cancer stages. Addressing this question would also help assess whether monotonic changes of gene expression were observed with disease progression. Towards this end, we designed a model of gene expression where the cancer stage was treated as a numeric variable:
where X takes a value in [0,1,2,3,4] corresponding to the sample stage: [control, I, II, III, IV], respectively. It was noted the mean expression of a gene could show the following monotonic patterns across cancer stages:
monotonic upregulation, where mean expression follows: control < I < II < III < IV.
monotonic downregulation, where mean expression follows: control > I > II > III > IV.
The sets of genes conforming to either (i) or (ii) were identified to yield monotonically upregulated and monotonically downregulated genes. These two sets were merged, and the final set of genes with monotonic changes of expression with cancer progression was obtained. This final set was ranked by the adj. p-values from the model estimated by eqn. (2).
To perform contrasts, a slightly modified design matrix shown in Fig. 3b was used, which would give rise to the following linear model of expression for each gene:
where the controls themselves are one of the indicator variables, and the βi are all coefficients estimated only from the corresponding samples. Our first contrast of interest, between each stage and the control, was achieved using the contrast matrix shown in Table 1. Four contrasts were obtained, one for each stage vs control. A threshold of |lfc| > 2 was applied to each such contrast to identify differentially expressed genes (with respect to the control). We used the absolute value of the lfc, since driver genes could be either upregulated or downregulated. Genes could be differentially expressed in any combination of the stages or no stage at all. To analyze the pattern of differential expression (with respect to the control), we constructed a four-bit binary string for each gene, where each bit signified whether the gene was differentially expressed in the corresponding stage. For example, the string ‘1100’ indicates that the gene was differentially expressed in the first and second stages. There are 24 = 16 possible outcomes of the four-bit string for a given gene corresponding to the combination of stages in which it is differentially expressed. This is illustrated in set-theoretic terms in Fig. 4. In our first elimination, we removed genes whose |lfc| < 2 for all stages. For each remaining gene, we identified the stage that showed the highest |lfc| and assigned the gene as specific to that stage for the rest of our analysis.
We applied a four-pronged criteria to establish the significance of the stage-specific differentially expressed genes.
Adj. p-value of the contrast with respect to the control < 0.001. The expression profile of a driver gene in cancer samples would markedly depart from that for the controls, which motivates the use of a stringent threshold here.
(ii)-(iv) P-value of the contrast with respect to other stages < 0.05. The use of a more relaxed cutoff would improve the sensitivity of stage-specific detection.
To obtain the above p-values (ii) - (iv), we used the contrast matrix shown in Table 2, which was then used an an argument to the contrastsFit function in limma.
Principal component analysis (PCA) were performed using prcomp in R. To choose 100 random genes, we used the rand function. Gene set enrichment analysis were performed on KEGG (https://www.genome.jp/kegg/) and Gene Ontology  using kegga and goana in limma, respectively. In order to visualize outlier genes that are significant with a large effect size, volcano plots could be obtained by plotting the -log10 transformed p-value vs. the log fold-change of gene expression. Heat maps of significant stage-specific differentially expressed genes were visualized using heatmap and clustered using hclust. Novelty of the identified stage-specific genes was ascertained by screening against the Cancer Gene Census v84 .
The TCGA expression data consisted of expression values of 20,532 genes in 423 samples. After the completion of data pre-processing, we obtained a final dataset of expression data for 18,590 genes across 399 samples annotated with the corresponding sample stage (available in Supplementary File S1). The stagewise distribution of TCGA samples along with the corresponding AJCC staging is shown in Table 3. A statistical summary of demographic details including age, sex, height, weight, and vital status is shown in Table 4. The body mass index (BMI) distribution was derived from patient clinical data that had both height and weight (i.e, neither was ‘NA’). The average age of onset of HCC was around 60 years, and the average BMI was about 26, indicating a possible link with ageing-associated pathology and obesity.
The dataset was processed through voom in limma to prepare for linear modelling . At a p-value cutoff of 0.05, 14,843 genes were significant for the linear model given by eqn. (1). Even raising the bar to 1E-5, 9618 genes remained significant in the linear modelling, thus implying a strong linear trend in their expression across cancer stages relative to control. This was not entirely surprising since one of the hallmarks of cancer phenotype is genome-wide instability . The linear modelling highlighted top ranked genes, some upregulated in HCC (GABRD, PLVAP, CDH13) and some downregulated (CLEC4M, CLEC1B, CLEC4G). The lfc for each stage with respect to control of top ten genes (ranked by adjusted p-value) are shown in Table 5, along with their inferred regulation status. Boxplots of the expression of the top 9 genes (Fig. 5) indicated elevated expression across cancer stages relative to control for up-regulated genes, while depressed expression across cancer stages relative to control was indicative of downregulated genes. (Boxplots of all other genes in the top 200 are provided in the Supplementary Fig. S1) It is worthwhile to note that a given gene might have maximal differential expression in any stage (not necessarily stage 4), and the linear trend does not suggest the order of expression across stages (Fig. 6).
A PCA of the top 100 genes from the linear model was visualized using the top two principal components (Fig. 7a). A clear separation of the controls and the cancer samples could be seen, suggesting the extent of differential expression of these genes in cancer samples. Hence linear modelling yields cancer-specific genes versus normal controls, and the results for the all the genes, including the top 100, are provided in order in Supplementary File S2. For comparison, a PCA plot of 100 randomly sampled genes (Fig. 7b) failed to show any separation of the cancer and control samples.
To ascertain an ordered trend of expression across cancer stages, the linear model given by eqn. (2) was fit. At a p-value of 0.05, 14,127 genes were significant, and raising the bar to 1E-5 still left 8032 genes significant. A goodness of fit with eqn. (2) does not equate with a monotonic trend of expression; i.e., a a gene with a significant linear fit is not required to follow a monotonic trend of mean expression with cancer stage. Using the definition of monotonicity given in the Methods section, we found 2109 genes showing strictly monotonic expression with the cancer stage and reaching maximum absolute mean expression in stage IV. Each such gene was annotated and ranked with the p-value from eqn. (2). This yielded 1977 genes with significant (i.e, p-val < 0.05) monotonic trends of mean expression across cancer stages, with 1602 upregulated and 375 downregulated. The top 20 such genes are presented in Table 6.
The results from the linear modelling were in contrast with those obtained by Huo et al.  and were most likely driven by an improved design and the inclusion of 51 controls in our study. These positive results provided the impetus to pursue stage-driven analysis. Given the conventional AJCC staging, gene expression differences would play a major role in driving the cancer progression. To identify the stage-specific differentially expressed genes, we applied the first contrast matrix (Table 2) and constructed the four-bit stage string of each gene. Based on the stage strings, we binned all the genes, and the string-specific gene lists corresponding to all the partitions in the Venn diagram (Fig. 4) is made available in Supplementary File S3. The size of each such partition is illustrated in Fig. 8. We eliminated the 16,135 genes corresponding to the stage string ‘0000’ (|lfc| < 2 in all stages). To establish the significance of the remaining genes, we applied the second contrast (Table 3) and passed each gene through the four filter criteria. The gradual reduction in candidate stage-specific genes as each criterion was applied, is shown in Table 7. Only genes that passed all criteria were retained as significant stage-specific differentially expressed genes. We obtained 2 stage-I specific, 2 stage-II specific, 10 stage-III specific and 35 stage-IV specific genes (Table 8). Figure 9 shows the volcano plot of these 49 stage-specific genes.
In view of the limited sample size for stage-IV and consequent low power for rejecting false-positives, we stipulated that each stage-IV specific gene would display a smooth increasing or decreasing expression trend through cancer progression culminating in maximum differential expression in stage-IV. On this basis, we pruned the 35 stage-IV specific genes to just the top ten by significance in the linear modelling. This yielded a total of 24 stage-specifc genes of interest.
A heatmap of the lfc expression of these stage-specific genes across the stages was generated (Fig. 10a) and revealed a systematic gradient in expression relative to control, involving both downregulation and overexpression. The map was clustered on the basis of differential expression (i.e, |lfc|) both across stages and across features (i.e, genes) (Fig. 10b). It was seen that stage I genes clustered together, stage II genes co-clustered with NCAPG2 and DLG5 from stage-III, all the other stage-III genes clustered together, while the stage-IV genes formed two separate clusters. It was interesting to note that GABRD emerged as an outgroup to all the clusters, demonstrating its uniqueness.
To identify the biological processes specific to each stage, we used the genes with maximal |lfc| in each stage and performed a stagewise gene set enrichment analysis on two ontologies, the GO and KEGG pathways. Salient results with respect to KEGG pathways are presented below (Table 9) and the complete KEGG and GO results are available in Supplementary Tables S1 and S2, respectively. In stage I, we found the significant enrichment of cell-cycle signaling pathways (Hippo, Wnt, HIF-1), and viral infection-related pathways (cytokine-cytokine receptor interaction, human papillomavirus infection, HTLV-I infection). In stage II, key signalling pathways (Ras, MAPK) were aberrant. Two liver-specific pathways, alcoholism and cytochrome P450 mediated metabolism of xenobiotics were enriched, as well as standard cancer pathways of bladder, brain, stomach, and skin that might involve generic genetic alterations necessary for cancer cell growth. In stage III, we noticed the significant enrichment of Metabolic pathways that summarize cellular metabolism. This might indicate the metabolic shift needed by the cancer to grow and invade neighboring tissues. Other salient significantly enriched pathways pertained to increased cell cycle progression, DNA replication, chemical carcinogenesis, p53 signaling pathway and cellular senescence, all hallmark processes critical to cancer progression. Stage IV gene set was significantly enriched for bile-related processes (bile secretion, primary bile acid biosynthesis), and ABC transporters (possibly conferring a drug-resistant advanced cancer phenotype). A signaling pathway related to diabetic complications was enriched as well, indicating the role of co-morbidities in driving liver cancer progression. The enrichment analysis of the top 100 genes of the linear model is included in the Supplementary Table S3.
When differentially expressed genes are identified in a two-class cancer vs control manner, the information about stage-specificity of differential expression is lost. By applying our protocol, this information is recovered and available for dissection. The top linear model genes and all the stage-specific differentially expressed genes (Table 10) were analyzed with respect to the existing literature.
Top genes of linear models
Three C-type lectin domain proteins (CLEC4M, CLEC1B, CLEC4G) were detected in the top ten genes of linear model given by eqn. (1). Interestingly, this identical cluster of three genes was detected as the most significantly downregulated liver cancer-specific genes in a qPCR study of an independent cohort of 65 tumor-normal matched cases . On screening the top 200 linear model (1) genes against cancer driver genes in the Cancer Gene Census, only four genes were found, namely BUB1B, CDKN2A, EZH2, and RECQL4. The top 200 genes of the linear model given by eqn. (2) overlapped with 111 genes of linear model (1) and yielded six genes from the Cancer Gene Census, namely BUB1B, EZH2, CDKN2C, CANT1, POLD1, and STIL. Both CDKN2A and CDKN2C are cyclin-dependent kinase inhibitors. CDKN2A was a member of the gene signatures for HCC prognosis independently proposed by Gillet et al.  and Yang et al. . It was remarkable that GABRD stood out as the top gene in both the linear models, and with a monotonic order of expression with the cancer stage. GABRD is discussed further in the section on Stage-IV specific genes. A gene with a monotonicity of expression may be increasingly upregulated as the cancer initiates, progresses and metastasizes, signalling its oncogenic progression; or conversely, it may be increasingly downregulated with the cancer stages, signalling the loss of tumor suppressor activity. Screening the top 200 genes with monotonic expression against the Cancer Gene Census yielded a completely different set of six genes: HSP90AB1, ALDH2, ESR1, PPP2R1A, HIST1H4I, SEPT5. HSP90AB1, a heat shock protein and molecular chaperone, was a key result of Xu et al.  where it played a dual role, one in the set of 50 hub genes correlated with Barcelona Clinic Liver Cancer (BCLC) staging of HCC patients, and another, in the set of 13 hub genes correlated with overall survival of HCC patients. HSP90AB1 might have a significant role in the aetiology of HCC, given that its expression is known to be upregulated by hepatitis B virus encoded X protein . The monotonic changes in HSP90AB1 might further facilitate its known roles in angiogenesis . The top 200 genes with monotonic expression had 15 genes in common with the top 200 of linear model (1) and 16 genes in common with the top 200 of linear model (2). However, only six genes were common to the top 200 of all three (namely GABRD, PIGU, NDUFA4L2, CRHBP, FLVCR1, TTC13; Fig. 11). NDUFA4L2 has been identified as a target gene of HIF-1 (hypoxia-inducible transcription factor-1), and a key factor driving the metabolic reprogramming in hypoxic micro-environments . Our findings established that not only was NDUFA4L2 significantly overexpressed in HCC (as noted in ), but its overexpression follows a significant monotonic pattern across cancer stages, a much stronger statement that would support the role of NDUFA4L2 in driving HCC progression. Similarly, the expression of CRHBP has been recently shown to be negatively associated with the tumor size in HCC . Our study provides a more quantitative account of the significant monotonic downregulation of CRHBP with the HCC stage. Two proteins of the glycosylphosphatidylinositol (GPI) anchoring system, PIGU and PIGC, were top genes with respect to significant monotonic expression (Table 6); of these, PIGU is a known bladder cancer oncogene .
Stage-I specific DEGs (Fig. 12)
CA9 is a member of carbonic anhydrases, which are a large family of zinc metalloenzymes that catalyse the reversible hydration of carbon dioxide. Its expression in clear cell Renal carcinoma, but not in functional kidney cells has gained attention for its use as a pre-operative biomarker . The WNT7B protein is part of the Wnt family, a family of secreted signalling proteins. Elevated WNT7B in pancreatic adenocarcinoma has been found to mediate anchorage independent growth . Surprisingly, both CA9 and WNT7B are downregulated in HCC, most so in stage-I, contrary to their role in other cancers. A concrete interpretation of the role of these genes in HCC awaits appropriately designed experimental studies.
It is pertinent to ask the following question here: which genes are essential for the initiation of HCC? Clearly these genes would be differentially expressed in stage I relative to control. All significantly differentially expressed genes with maximal |lfc| in stage-I would be the best candidates for genes involved in the initiation of HCC. These 122 genes are provided in the Supplementary File S3.
Stage-II specific DEGs (Fig. 13)
APOBEC3B, a DNA cytidine deaminase, is a known cancer driver gene in the Cancer Gene Census, but there are no literature reports of its stage-specificity in any cancer. It is known to account for half the mutational load in breast carcinoma, and its target sequence context was found to be highly mutated in Bladder, lung, cervix, neck, and head cancers as well . Further studies have attributed specific hypermutation signatures across all cancers to the APOBEC family, including APOBEC3B . Here APOBEC3B is upregulated, increasing its capacity to inflict the hypermutator phenotype, and highlighting an intriguing stage-specificity in its action. FAM186A polymorphisms have been reported in GWAS and SNP studies on colorectal cancer patients and shown to have a significant odds ratio in risk heritability . FAM163A was a component of the 8-gene signature used for the risk stratification of HCC patients .
Stage-III specific DEGs (Fig. 14)
C12orf48, also known as PARI, participates in the homologous recombination pathway of DNA repair, and its overexpression has been reported in pancreatic cancer. Further PARI was recently identified as a transcriptional target of FOXM1 , which is a well-validated upregulated gene in HCC . DLG5 is a cell polarity gene and its downregulation has been implicated in the malignancy of breast , prostate  and bladder cancers . It has been recently found that lower DLG5 expression is correlated with advanced stages of HCC and essential for invadopodium formation, an event critical to cancer metastasis . It is surprising that our study has identified a stage-III specific upregulation in DLG5. Interestingly, evidence is emerging to lend support to our finding that DLG5 might be tumor-promoting. In a very recent review, Saito et al.  reinterpreted published results on cell polarity and cancer, and advanced an alternative perspective on the role of polarity regulators in cancer biology. They argued that both cellular and subcellular polarity would be regulated by DLG5 and related polarity proteins. Subcellular polarity might improve the cellular fitness for proliferation and stemness, thereby causing tumor promotion. Hence cell polarity regulation is anti-tumorigenic and subcellular polarity regulation is pro-tumorigenic, and our analysis has uncovered the pro-tumorigenic upregulated activity of DLG5. ECT2 encodes a guanine nucleotide exchange factor that remains elevated during the G2 and M phase in cellular mitosis. ECT2 is found to be upregulated in lung adenocarcinoma and lung squamous cell carcinoma , as well as in invasive breast cancer . NCAPG2 is a component of the condensing II complex and involved in chromosome segregation during mitosis. NCAPG2 level were found to be increased in non-small cell lung cancer, and its over-expression was found to be correlated with lymph node metastasis, thus enabling the use of NCAPG2 as a poor prognostic biomarker in lung adenocarcinoma . GNMT is a methyltransferase that catalyses conversion of S-adenosine methionine to s-adenosyl cysteine. In the absence of GNMT, S-adenosine methionine causes hypermethylation of DNA, which represses GNMT levels and is found in HCC samples . This is an epigenetic mechanism for loss of function of tumor suppressors and our study here confirmed the downregulation of GNMT expression. PRR11 is found to be over-expressed in lungs, and its silencing using siRNA resulted in cell cycle arrest and apoptotic cell death, followed by decreased cell growth and viability . A similar knock out experiment of PRR11 in hilar cholangiocarcinoma cell lines resulted in decreased cellular proliferation, migration, and tumor growth . WDHD1 is a key post-transcriptional regulator of centromeric, and consequently genomic, integrity  and its overexpression has been identified as biomarker of acute myeloid leukemia , and lung and esophageal carcinomas . C15orf42 has been implicated in nasopharyngeal carcinoma . ORC6L overexpression has been identified as a prognostic biomarker of colorectal cancer possibly by enhancing chromosomal instability . XRCC2 was found to increase locally advanced rectal cancer radioresistance by repairing DNA double-strand breaks and preventing cancer cell apoptosis . XRCC2 was also highlighted in the gene signature for HCC prognosis advanced by Gillet et al. .
Stage-IV specific DEGs (Fig. 15)
GABRD, which was the top gene in the linear models as well, encodes for the delta subunit of the gamma-amino butyric acid receptor. The GABA receptor family was found to be frequently downregulated in cancers, except for GABRD, which was found to be up-regulated. Gross et al.  proposed that the GABA receptor gene family might play a role in the proliferation independent differentiation of cancer cells. GBX2 is part of the GBX gene family, which are homeobox containing DNA binding transcription factors. GBX2 is overexpressed in prostate cancer and studies show that expression of GBX2 is required for malignant growth of human prostate cancer . PECAM1 overexpression has been linked to peritoneal recurrence of stage II/III gastric cancer patients . CEND1 has been identified as a cell-cycle protein . PGAM2 is a glycolytic enzyme whose upregulation is essential for tumor cell proliferation . NR1I2 downregulation has been used in constructing a prognostic 9-genes expression signature of gastric cancer . GDF5 has been shown to be a downstream target of the TGF-beta signaling pathway , stimulating angiogenesis required for the growth and spread of the cancer. GPR1 has been reported to be involved in promoting cutaneous squamous cell carcinoma migration . Two other stage-IV specific genes, namely the downregulated CXCR2P1, which is a C-X-C motif chemokine receptor 2 pseudogene 1, and LOC25845, are minimally documented in the literature in the context of HCC, other cancers or any other condition. It is worth mentioning however that CXCR2, a member of the GPCR protein family binding the interleukin IL8, has been reported as an effective non-invasive blood based biomarker for HCC . It is notable that ARHGAP42, a Rho GTPase activating protein, was another key result of Xu et al. , finding a place both in their set of 50 hub genes correlated with the BCLC staging of HCC patients, and in the set of 13 hub genes correlated with overall survival of HCC patients.Most of the stage-IV specific genes show contra-regulation (i.e, no clear trend) across cancer stages, and only 15 of the 35 genes revealed a monotonic pattern of expression (highlighted in Table 8). The other 20 genes could be unique to the hallmarks of stage-IV cancer, e.g., processes related to lymph node involvement and/or metastasis.
We have developed an original protocol for the stagewise dissection of the HCC transcriptome. We were able to successfully fit a linear model across cancer stages and detected genes with a strong linear expression trend in the cancer phenotype. These genes were found to effectively separate the control and cancer samples. We were able to assign 2455 differentially expressed genes into one of four stages and visualized their stage specific expression using boxplots. Using a multi-layered approach, we were able to assess the significance of each stage-specific DEG and narrowed down to a handful of candidate significant stage-specific DEG’s. Our analysis yielded two stage-I specific genes (CA9, WNT7B), two stage-II specific genes (APOBEC3B, FAM186A), ten stage-III specific genes (including DLG5, NCAPG2, GNMT and XRCC2) and 35 stage-IV specific genes (including GABRD and CXCR2P1). Though most of these genes constituted novel findings in the context of HCC, a comprehensive literature search indicated connections with other cancer conditions. The analysis of monotonicity of expression has uncovered two genes with documented HCC connection, namely NDUFA4L2 and CRHBP. Correlation of our analysis with gene signatures based on the BCLC staging system revealed two common genes, namely HSP90AB1 and ARHGAP42. Our study might deepen our understanding of the mechanistic basis of HCC progression, and lay the foundation for the development of HCC diagnosis and treatment strategies. Translational research could transform our results into a panel of biomarkers for early clinical decision-making and rational drug development. It is straightforward to extend our computational methodology to the stage-based analysis of other cancers to obtain a fuller view of disease initiation, progression, and metastasis.
Availability of data and materials
All data and material are available as supplementary information (https://doi.org/10.6084/m9.figshare.6455024).
American Joint Committee on Cancer
Barcelona Clinic Liver Cancer
Body mass index
Differentially expressed genes
Expressed sequence tag
Gamma Amino Butyric Acid
Kyoto Encyclopaedia of Genes and Genomes
Liver hepatocellular carcinoma
Principal Components Analysis
quantitative Polymerase Chain Reaction
RNA-Seq by Expectation Maximization
The Cancer Genome Atlas
Tumor, Node, Metastasis
Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SA, Behjati S, Biankin AV, Bignell GR, Bolli N, Borg A, Børresen-Dale AL, Boyault S, Burkhardt B, Butler AP, Caldas C, Davies HR, Desmedt C, Eils R, Eyfjörd JE, Foekens JA, Greaves M, Hosoda F, Hutter B, Ilicic T, Imbeaud S, Imielinski M, Jäger N, Jones DT, Jones D, Knappskog S, Kool M, Lakhani SR, López-Otín C, Martin S, Munshi NC, Nakamura H, Northcott PA, Pajic M, Papaemmanuil E, Paradiso A, Pearson JV, Puente XS, Raine K, Ramakrishna M, Richardson AL, Richter J, Rosenstiel P, Schlesner M, Schumacher TN, Span PN, Teague JW, Totoki Y, Tutt AN, Valdés-Mas R, van Buuren MM, van ‘t Veer L, Vincent-Salomon A, Waddell N, Yates LR, Australian Pancreatic Cancer Genome Initiative; ICGC Breast Cancer Consortium; ICGC MMML-Seq Consortium; ICGC PedBrain, Zucman-Rossi J, Futreal PA, McDermott U, Lichter P, Meyerson M, Grimmond SM, Siebert R, Campo E, Shibata T, Pfister SM, Campbell PJ, Stratton MR. Signatures of mutational processes in human cancer. Nature. 2013;500(7463):415–21. https://doi.org/10.1038/nature12477.
Amin MB, Greene FL, Edge SB, Compton CC, Gershenwald JE, Brookland RK, Meyer L, Gress DM, Byrd DR, Winchester DP. The eighth edition AJCC Cancer staging manual: continuing to build a bridge from a population-based to a more “personalized” approach to cancer staging. CA Cancer J Clin. 2017;67(2):93–9.
An F, Zhang Z, Xia M. Functional analysis of the nasopharyngeal carcinoma primary tumor-associated gene interaction network. Mol Med Rep. 2015;12(4):4975–80. https://doi.org/10.3892/mmr.2015.4090.
Arensman MD, Kovochich AN, Kulikauskas RM, Lay AR, Yang PT, Li X, Donahue T, Major MB, Moon RT, Chien AJ, Dawson DW. WNT7B mediates autocrine WNT/β-catenin signaling and anchorage-independent growth in pancreatic adenocarcinoma. Oncogene. 2014;33(7):899.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25(1):25.
Broad Institute TCGA Genome Data Analysis Center. Analysis-ready standardized TCGA data from broad GDAC firehose 2016_01_28 run. Broad institute of MIT and Harvard. Dataset; 2016. https://doi.org/10.7908/C11G0KM9.
Burns MB, Temiz NA, Harris RS. Evidence for APOBEC3B mutagenesis in multiple human cancers. Nat Genet. 2013;45(9):977.
Cancer Research UK, http://www.cancerresearchuk.org/health-professional/cancer-statistics/worldwide-cancer, Accessed 05 May 2018.
Chen Y, Cha Z, Fang W, Qian B, Yu W, Li W, Yu G, Gao Y. The prognostic potential and oncogenic effects of PRR11 expression in hilarcholangiocarcinoma. Oncotarget. 2015;6(24):20419.
Chuang SC, La Vecchia C, Boffetta P. Liver cancer: descriptive epidemiology and risk factors other than HBV and HCV infection. Cancer Lett. 2009;286(1):9–14.
Farazi PA, DePinho RA. Hepatocellular carcinoma pathogenesis: from genes to environment. Nat Rev Cancer. 2006;6(9):674.
Farsam V, et al. Senescent fibroblast-derived Chemerin promotes squamous cell carcinoma migration. Oncotarget. 2016;7(50):83554–69. https://doi.org/10.18632/oncotarget.13446.
Ferlay J, Soerjomataram I, Dikshit R, Eser S, Mathers C, Rebelo M, Parkin DM, Forman D, Bray F. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. Int J Cancer. 2015;136:E359–86.
Futreal PA, Coin L, Marshall M, Down T, Hubbard T, et al. A census of human cancer genes. Nat Rev Cancer. 2004;4:177–83.
Gao AC, Lou W, Isaacs JT. Down-regulation of homeobox gene GBX2 expression inhibits human prostate cancer clonogenic ability and tumorigenicity. Cancer Res. 1998;58(7):1391–4.
Gillet JP, Andersen JB, Madigan JP, Varma S, Bagni RK, Powell K, et al. A gene expression signature associated with overall survival in patients with hepatocellular carcinoma suggests a new treatment strategy. Mol Pharmacol. 2016;89(2):263–72.
Gross AM, Kreisberg JF, Ideker T. Analysis of matched tumor and normal profiles reveals common transcriptional and epigenetic signals shared across cancer types. PLoS One. 2015;10(11):e0142618.
Guo Z, Linn JF, Wu G, Anzick SL, Eisenberger CF, Halachmi S, Cohen Y, Fomenkov A, Hoque MO, Okami K, Steiner G, Engles JM, Osada M, Moon C, Ratovitski E, Trent JM, Meltzer PS, Westra WH, Kiemeney LA, Schoenberg MP, Sidransky D, Trink B. CDC91L1 (PIG-U) is a newly discovered oncogene in human bladder cancer. Nat Med. 2004;10(4):374–81.
Haase M, Fitze G. HSP90AB1: helping the good and the bad. Gene. 2016;575:171–86.
Hanahan D, Weinberg RA. Hallmarks of cancer: the next generation. Cell. 2011;144(5):646–74.
Ho DWH, Kai AKL, Ng IOL. TCGA whole-transcriptome sequencing data reveals significantly dysregulated genes and signaling pathways in hepatocellular carcinoma. Front Med. 2015;9(3):322–30.
Hochberg Y, Benjamini Y. More powerful procedures for multiple significance testing. Stat Med. 1990;9:811–8.
Hsieh CL, et al. WDHD1 modulates the post-transcriptional step of the centromeric silencing pathway. Nucleic Acids Res. 2011;39(10):4048–62. https://doi.org/10.1093/nar/gkq1338.
Huidobro C, Toraño EG, Fernández AF, Urdinguio RG, Rodríguez RM, Ferrero C, Martínez-Camblor P, Boix L, Bruix J, García-Rodríguez JL, Varela-Rey M. A DNA methylation signature associated with the epigenetic repression of glycine N-methyltransferase in human hepatocellular carcinoma. J Mol Med. 2013;91(8):939–50.
Huo T, Canepa R, Sura A, Modave F, Gong Y. Colorectal cancer stages transcriptome analysis. PLoS One. 2017;12(11):e0188697.
Ke Y, et al. Discs large homolog 5 decreases formation and function of invadopodia in human hepatocellular carcinoma via Girdin and Tks5. Int J Cancer. 2017;141(2):364–76. https://doi.org/10.1002/ijc.30730.
Lai RK, Xu IM, Chiu DK, Tse AP, Wei LL, Law CT, Lee D, Wong CM, Wong MP, Ng IO, Wong CC. NDUFA4L2 fine-tunes oxidative stress in hepatocellular carcinoma. Clin Cancer Res. 2016;22(12):3105–17. https://doi.org/10.1158/1078-0432.CCR-15-1987.
Law CW, Chen Y, Shi W, Smyth GK. Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014;15(2):R29.
Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12(1):323.
Li G, Feng G, Zhao A, Péoc’h M, Cottier M, Mottet N. CA9 as a biomarker in preoperative biopsy of small solid renal masses for diagnosis of clear cell renal cell carcinoma. Biomarkers. 2017;22(2):123–6.
Li WH, Miao XH, Qi ZT, Ni W, Zhu SY, Fang F. Proteomic analysis of differently expressed proteins in human hepatocellular carcinoma cell lines HepG2 with transfecting hepatitis B virus X gene. Chin Med J. 2009;5:15–23.
Liu J, et al. Loss of DLG5 promotes breast cancer malignancy by inhibiting the hippo signaling pathway. Sci Rep. 2017;7:42125. https://doi.org/10.1038/srep42125.
Margheri F, et al. GDF5 regulates TGFß-dependent angiogenesis in breast carcinoma MCF-7 cells: in vitro and in vivo control by anti-TGFß peptides. PLoS One. 2012;7(11):e50342. https://doi.org/10.1371/journal.pone.0050342.
McCarthy DJ, Smyth GK. Testing significance relative to a fold-change threshold is a TREAT. Bioinformatics. 2009;25:765–71.
O'Connor KW, et al. PARI overexpression promotes genomic instability and pancreatic tumorigenesis. Cancer Res. 2013;73(8):2529–39. https://doi.org/10.1158/0008-5472.CAN-12-3313.
Palaniappan A, Ramar K, Ramalingam S. Computational identification of novel stage-specific biomarkers in colorectal cancer progression. PLoS One. 2016;11(5):e0156665.
Peng L, Bian XW, Li DK, Xu C, Wang GM, Xia QY, Xiong Q. Large-scale RNA-Seq transcriptome analysis of 4043 cancers and 548 Normal tissue controls across 12 TCGA Cancer types. Sci Rep. 2015;5:13413. https://doi.org/10.1038/srep13413.
Perz JF, Armstrong GL, Farrington LA, Hutin YJ, Bell BP. The contributions of hepatitis B virus and hepatitis C virus infections to cirrhosis and primary liver cancer worldwide. J Hepatol. 2006;45(4):529–38.
Qiao GJ, Chen L, Wu JC, Li ZR. Identification of an eight-gene signature for survival prediction for patients with hepatocellular carcinoma based on integrated bioinformatics analysis. PeerJ. 2019;7:e6548.
Qin CJ, Song XM, Chen ZH, Ren XQ, Xu KW, Jing H, He YL. XRCC2 as a predictive biomarker for radioresistance in locally advanced rectal cancer patients undergoing preoperative radiotherapy. Oncotarget. 2015;6(31):32193.
Research Network TCGA. Comprehensive and integrative genomic characterization of hepatocellular carcinoma. Cell. 2017;169(7):1327–41.
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. Limma powers differential expression analyses forRNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47. https://doi.org/10.1093/nar/gkv007.
Saito Y, Desai RR, Muthuswamy SK. Reinterpreting polarity and cancer: the changing landscape from tumor suppression to tumor promotion. Biochim Biophys Acta. 2018;1869(2):103–16. https://doi.org/10.1016/j.bbcan.2017.12.001.
Sato N, et al. Activation of WD repeat and high-mobility group box DNA binding protein 1 in pulmonary and esophageal carcinogenesis. Clin Cancer Res. 2010;16(1):226–39. https://doi.org/10.1158/1078-0432.CCR-09-1405.
Shi M, Chen MS, Sekar K, Tan CK, Ooi LL, Hui KM. A blood-based three-gene signature for the non-invasive detection of early human hepatocellular carcinoma. Eur J Cancer. 2014;50(5):928–36.
Tello D, Balsa E, Acosta-Iborra B, Fuertes-Yebra E, Elorza A, Ordóñez Á, Corral-Escariz M, Soro I, López-Bernardo E, Perales-Clemente E, Martínez-Ruiz A, Enríquez JA, Aragonés J, Cadenas S, Landázuri MO. Induction of the mitochondrial NDUFA4L2 protein by HIF-1α decreases oxygen consumption by inhibiting complex I activity. Cell Metab. 2011;14(6):768–79. https://doi.org/10.1016/j.cmet.2011.10.008.
Terashima M, et al. TOP2A, GGH, and PECAM1 are associated with hematogenous, lymph node, and peritoneal recurrence in stage II/III gastric cancer patients enrolled in the ACTS-GC study. Oncotarget. 2017;8(34):57574–82. https://doi.org/10.18632/oncotarget.15895.
Timofeeva MN, et al. Recurrent coding sequence variation explains only a small fraction of the genetic architecture of. Colorectal Cancer Sci Rep. 2015;5:16286. https://doi.org/10.1038/srep16286.
Tomiyama L, Sezaki T, Matsuo M, Ueda K, Kioka N. Loss of Dlg5 expression promotes the migration and invasion of prostate cancer cells via Girdin phosphorylation. Oncogene. 2015;34(9):1141–9. https://doi.org/10.1038/onc.2014.31.
Tsioras K, Papastefanaki F, Politis PK, Matsas R, Gaitanou M. Functional interactions between BM88/Cend1, ran-binding protein M and Dyrk1B kinase affect cyclin D1 levels and cell cycle progression/exit in mouse neuroblastoma cells. PLoS One. 2013;8(11):e82172. https://doi.org/10.1371/journal.pone.0082172.
Wang Z, Chen G, Wang Q, Lu W, Xu M. Identification and validation of a prognostic 9-genes expression signature for gastric cancer. Oncotarget. 2017;8(43):73826–36. https://doi.org/10.18632/oncotarget.17764.
Wang HK, Liang JF, Zheng HX, Xiao H. Expression and prognostic significance of ECT2 in invasive breast cancer. J Clin Pathol. 2018;71(5):442–5. https://doi.org/10.1136/jclinpath-2017-204569.
Wermke M, et al. RNAi profiling of primary human AML cells identifies ROCK1 as a therapeutic target and nominates fasudil as an antileukemic drug. Blood. 2015;125(24):3760–8. https://doi.org/10.1182/blood-2014-07-590646.
Xi Y, Formentini A, Nakajima G, Kornmann M, Ju J. Validation of biomarkers associated with 5-fluorouracil and thymidylate synthase in colorectal cancer. Oncol Rep. 2008;19(1), 257–62.
Xia HB, Wang HJ, Fu LQ, Wang SB, Li L, Ru GQ, He XL, Tong XM, Mou XZ, Huang DS. Decreased CRHBP expression is predictive of poor prognosis in patients with hepatocellular carcinoma. Oncol Lett. 2018;16(3):3681–9. https://doi.org/10.3892/ol.2018.9073.
Xu W, Rao Q, An Y, Li M, Zhang Z. Identification of biomarkers for Barcelona clinic liver Cancer staging and overall survival of patients with hepatocellular carcinoma. PLoS One. 2018;13(8):e0202763.
Xu Y, et al. Oxidative stress activates SIRT2 to deacetylate and stimulate phosphoglycerate mutase. Cancer Res. 2014;74(13):3630–42. https://doi.org/10.1158/0008-5472.CAN-13-3615.
Xu XR, Huang J, Xu ZG, Qian BZ, Zhu ZD, Yan Q, Cai T, Zhang X, Xiao HS, Qu J, Liu F. Insight into hepatocellular carcinogenesis at transcriptome level by comparing gene expression profiles of hepatocellular carcinoma with those of corresponding noncancerous liver. Proc Natl Acad Sci. 2001;98(26):15089–94.
Yang H, Zhang X, Cai XY, Wen DY, Ye ZH, Liang L, Zhang L, Wang HL, Chen G, Feng ZB. From big data to diagnosis and prognosis: gene expression signatures in liver hepatocellular carcinoma. PeerJ. 2017;5:e3089.
Yang JD, Roberts LR. Hepatocellular carcinoma: a global view. Nat Rev Gastroenterol Hepatol. 2010;7:448–58. https://doi.org/10.1038/nrgastro.2010.100.
Zhan P, Xi GM, Zhang B, Wu Y, Liu HB, Liu YF, Xu WJ, Zhu Q, Cai F, Zhou ZJ, Miu YY. NCAPG2 promotes tumour proliferation by regulating G2/M phase and associates with poor prognosis in lung adenocarcinoma. J Cell Mol Med. 2017;21(4):665–76.
Zhang Y, et al. PARI functions as a new transcriptional target of FOXM1 involved in gastric cancer development. Int J Biol Sci. 2018;14(5):531–41. https://doi.org/10.7150/ijbs.23945.
Zhang Y, Qiu Z, Wei L, Tang R, Lian B, Zhao Y, He X, Xie L. Integrated analysis of mutation data from various sources identifies key genes and signaling pathways in hepatocellular carcinoma. PLoS One. 2014;9(7):e100854.
Zhao Q. RNAi-mediated silencing of praline-rich gene causes growth reduction in human lung cancer cells. Int J Clin Exp Pathol. 2015;8(2):1760.
Zhou Z, et al. Methylation-mediated silencing of Dlg5 facilitates bladder cancer metastasis. Exp Cell Res. 2015;331(2):399–407. https://doi.org/10.1016/j.yexcr.2014.11.015.
Zhou S, Wang P, Su X, Chen J, Chen H, Yang H, Fang A, Xie L, Yao Y, Yang J. High ECT2 expression is an independent prognostic factor for poor overall survival and recurrence-free survival in non-small cell lung adenocarcinoma. PLoS One. 2017;12(10):e0187356.
We would like to thank the peer reviewers for improving an earlier version of the manuscript. We would like to thank the departments of Bioinformatics and Bioengineering, SASTRA deemed University, for infrastructure and computing support. A.P. acknowledges support from DST-SERB grant no. EMR/2017/000470, Government of India.
This work was supported in part by DST-SERB grant no. EMR/2017/000470, Government of India.
Ethics approval and consent to participate
All data used in the study is available in the public domain, constituting anonymized patient data provided by the TCGA consortium.
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Sarathi, A., Palaniappan, A. Novel significant stage-specific differentially expressed genes in hepatocellular carcinoma. BMC Cancer 19, 663 (2019). https://doi.org/10.1186/s12885-019-5838-3