Skip to main content

Novel significant stage-specific differentially expressed genes in hepatocellular carcinoma



Liver cancer is among top deadly cancers worldwide with a very poor prognosis, and the liver is a vulnerable site for metastases of other cancers. Early diagnosis is crucial for treatment of the predominant liver cancers, namely hepatocellular carcinoma (HCC). Here we developed a novel computational framework for the stage-specific analysis of HCC.


Using publicly available clinical and RNA-Seq data of cancer samples and controls and the AJCC staging system, we performed a linear modelling analysis of gene expression across all stages and found significant genome-wide changes in the log fold-change of gene expression in cancer samples relative to control. To identify genes that were stage-specific controlling for confounding differential expression in other stages, we developed a set of six pairwise contrasts between the stages and enforced a p-value threshold (< 0.05) for each such contrast. Genes were specific for a stage if they passed all the significance filters for that stage. The monotonicity of gene expression with cancer progression was analyzed with a linear model using the cancer stage as a numeric variable.


Our analysis yielded two stage-I specific genes (CA9, WNT7B), two stage-II specific genes (APOBEC3B, FAM186A), ten stage-III specific genes including DLG5, PARI, NCAPG2, GNMT and XRCC2, and 35 stage-IV specific genes including GABRD, PGAM2, PECAM1 and CXCR2P1. Overexpression of DLG5 was found to be tumor-promoting contrary to the cancer literature on this gene. Further, GABRD was found to be signifincantly monotonically upregulated across stages. Our work has revealed 1977 genes with significant monotonic patterns of expression across cancer stages. NDUFA4L2, CRHBP and PIGU were top genes with monotonic changes of expression across cancer stages that could represent promising targets for therapy. Comparison with gene signatures from the BCLC staging system identified two genes, HSP90AB1 and ARHGAP42. Gene set enrichment analysis indicated overrepresented pathways specific to each stage, notably viral infection pathways in HCC initiation.


Our study identified novel significant stage-specific differentially expressed genes which could enhance our understanding of the molecular determinants of hepatocellular carcinoma progression. Our findings could serve as biomarkers that potentially underpin diagnosis as well as pinpoint therapeutic targets.

Peer Review reports


Liver cancer is the second most deadly cancer in terms of mortality rate, with a very poor prognosis [60]. It accounted for 9.1% of all cancer deaths, and 83% of the annual new estimated 782,000 liver cancer cases worldwide occur in developing countries [13]. Liver cancer showed the greatest increase in mortality in the last decade for both males (53%) and females (59%) [8]. Liver hepatocellular carcinoma (LIHC) or simply hepatocellular carcinoma (HCC) is the most common type of liver cancer, accounting for nearly 85% of liver cancers. 78% of all reported cases of HCC were due to viral infections (53% Hepatitis B virus and 25% Hepatitis C virus) [38]. There are several non-viral causes of HCC as well, mainly aflatoxins and alcohol [10]. As shown in Fig. 1, all the factors converge to a common mechanism of genetic alterations that lead to the acquisition of cancer hallmarks [20] and the eventual emergence of a cancer cell [11]. Genetic alterations constitute the heart of the problem, and studying changes due to these genetic alterations is paramount to understand HCC. Earlier gene expression studies using EST data detected differential expression in cancer tissue compared to non-cancerous liver and proposed the existence of genetic aberrations and changes in transcriptional regulation in HCC [58]. The Cancer Genome Atlas (TCGA) research network [41] have subtyped and identified many potential targets for HCC based on a comprehensive multi-omics analysis. An independent analysis of TCGA RNA-Seq data encompassing 12 cancer tissues has uncovered liver cancer-specific genes [37]. Zhang et al. [63] have performed mutation analysis of HCC, and Yang et al. [59] combined TCGA expression data and natural language processing techniques to identify cancer-specific markers.

Fig. 1

Major causative pathways of hepatocarcinogenesis. All pathways converge to progressive genomic alterations, leading a normal cell to acquire the hallmarks of cancer

The burden of disease and mortality rate are both inversely correlated with the cancer stage. The response rate to therapy is also inversely correlated with stage. To the best of our knowledge, there are no reported research in the literature that have dissected the stage-specific features of HCC. The cancer staging system is based on gross features of cancer anatomical penetration, and one such standard is the American Joint Committee on Cancer (AJCC) Tumor-Node-Metastasis (TNM) staging [2]. It is reasonable to hypothesize that the stage-specific gross changes are associated with signature molecular events, and try to probe such molecular bases of stage-wise progression of cancer. We had earlier published on stage-specific “hub driver” genes in colorectal cancer [36]. A stage-focussed analysis of colorectal cancer transcriptome data yielded negative results vis-a-vis the AJCC staging system [25].


Data preprocessing

Normalized and log2-transformed Illumina HiSeq RNA-Seq gene expression data processed by the RSEM pipeline [29] were obtained from TCGA via the portal [6]. The patient barcode (uuid) of each sample encoded in the variable called ‘Hybridization REF’ was parsed and used to annotate the controls and cancer samples (Fig. 2). To annotate the stage information of the cancer samples, we obtained the clinical information dataset for HCC from (LIHC.Merge_Clinical.Level_1.2016012800.0.0.tar.gz) and merged the clinical data with the expression data by matching the “Hybridization REF” in the expression data with the aliquot barcode identifier in the clinical data. The stage information of each patient was encoded in the clinical variable “pathologic stage”. The pathologic stage is essentially the surgical stage, prior to any treatment received, determined with the tissue obtained at the time of surgery. This interpretation is reinforced in the TCGA HCC sample inclusion criteria as follows: “Surgical resection of biopsy biospecimens were collected from patients diagnosed with hepatocellular carcinoma (HCC), and had not received prior treatment for their disease (ablation, chemotherapy, or radiotherapy)” (The TCGA [41]). The availability of this unequivocal information enables the analysis of cancer stages. The substages (A,B,C) were collapsed into the parent stage, resulting in four stages of interest (I, II, III, IV). We retained a handful of other clinical variables pertaining to demographic features, namely age, sex, height, weight, and vital status. With this merged dataset, we filtered out genes that showed little change in expression across all samples (defined as σ < 1). Finally, we removed cancer samples from our analysis that were missing stage annotation (value ‘NA’ in the “pathologic stage”). The data pre-processing was done using R (

Fig. 2

TCGA ‘Hybridization REF’ Barcode. The first 10 characters constitute an anonymized unique patient identifier and the following two characters denote whether the sample is tumor or normal

Linear modelling

Linear modelling of expression across cancer stages relative to the baseline expression (i.e, in normal tissue controls) was performed for each gene using the R limma package [42]. The following linear model was fit for each gene’s expression based on the design matrix shown in Fig. 3a:

$$ \mathrm{y}=\upalpha +{\upbeta}_1{\mathrm{x}}_1+{\upbeta}_2{\mathrm{x}}_2+{\upbeta}_3{\mathrm{x}}_3+{\upbeta}_4{\mathrm{x}}_4 $$
Fig. 3

Design matrices. a In the linear modeling, the control samples served as the baseline expression (intercept) of each gene against which the stage-specific expression was estimated. b the design matrix for the contrasts analysis

where the independent variables are indicator variables of the sample’s stage, the intercept α is the baseline expression estimated from the controls, and βi are the estimated stagewise log fold-change (lfc) coefficients relative to controls. The linear model was subjected to empirical Bayes adjustment to obtain moderated t-statistics [34]. To account for multiple hypothesis testing and the false discovery rate, the p-values of the F-statistic of the linear fit were adjusted using the method od Hochberg and Benjamini [22]. The linear trend across cancer stages for the top significant genes were visualized using boxplots to ascertain the regulation status of the gene relative to the control.

Monotonic mean expression

The linear model in eqn. (1) would not be sufficient to identify genes with an ordered monotonic trend of expression across cancer stages. Addressing this question would also help assess whether monotonic changes of gene expression were observed with disease progression. Towards this end, we designed a model of gene expression where the cancer stage was treated as a numeric variable:

$$ \mathrm{y}=\mathrm{aX}+\mathrm{b} $$

where X takes a value in [0,1,2,3,4] corresponding to the sample stage: [control, I, II, III, IV], respectively. It was noted the mean expression of a gene could show the following monotonic patterns across cancer stages:

  1. (i).

    monotonic upregulation, where mean expression follows: control < I < II < III < IV.

  2. (ii).

    monotonic downregulation, where mean expression follows: control > I > II > III > IV.

The sets of genes conforming to either (i) or (ii) were identified to yield monotonically upregulated and monotonically downregulated genes. These two sets were merged, and the final set of genes with monotonic changes of expression with cancer progression was obtained. This final set was ranked by the adj. p-values from the model estimated by eqn. (2).

Pairwise contrasts

To perform contrasts, a slightly modified design matrix shown in Fig. 3b was used, which would give rise to the following linear model of expression for each gene:

$$ \mathrm{y}={\upbeta}_0{\mathrm{x}}_0+{\upbeta}_1{\mathrm{x}}_1+{\upbeta}_2{\mathrm{x}}_2+{\upbeta}_3{\mathrm{x}}_3+{\upbeta}_4{\mathrm{x}}_4 $$

where the controls themselves are one of the indicator variables, and the βi are all coefficients estimated only from the corresponding samples. Our first contrast of interest, between each stage and the control, was achieved using the contrast matrix shown in Table 1. Four contrasts were obtained, one for each stage vs control. A threshold of |lfc| > 2 was applied to each such contrast to identify differentially expressed genes (with respect to the control). We used the absolute value of the lfc, since driver genes could be either upregulated or downregulated. Genes could be differentially expressed in any combination of the stages or no stage at all. To analyze the pattern of differential expression (with respect to the control), we constructed a four-bit binary string for each gene, where each bit signified whether the gene was differentially expressed in the corresponding stage. For example, the string ‘1100’ indicates that the gene was differentially expressed in the first and second stages. There are 24 = 16 possible outcomes of the four-bit string for a given gene corresponding to the combination of stages in which it is differentially expressed. This is illustrated in set-theoretic terms in Fig. 4. In our first elimination, we removed genes whose |lfc| < 2 for all stages. For each remaining gene, we identified the stage that showed the highest |lfc| and assigned the gene as specific to that stage for the rest of our analysis.

Table 1 Contrast matrix with control. Each stage (indicated by '1') is contrasted against the control (indicated by '-1') in turn
Fig. 4

A Venn representation of the pairwise stages contrasts. A gene could be differentially expressed in any combination of the four stages and this could be represented by a 4-bit string, one bit for each stage. For e.g., ‘1111’ at the overlap of all four stages would be assigned to genes that are differentially expressed in all four stages

Significance analysis

We applied a four-pronged criteria to establish the significance of the stage-specific differentially expressed genes.

  1. (i).

    Adj. p-value of the contrast with respect to the control < 0.001. The expression profile of a driver gene in cancer samples would markedly depart from that for the controls, which motivates the use of a stringent threshold here.

  2. (ii).

    (ii)-(iv) P-value of the contrast with respect to other stages < 0.05. The use of a more relaxed cutoff would improve the sensitivity of stage-specific detection.

To obtain the above p-values (ii) - (iv), we used the contrast matrix shown in Table 2, which was then used an an argument to the contrastsFit function in limma.

Table 2 Contrast matrix for inter-stage contrasts. There are six possible pairwise contrasts between the stages that are essential to identifying stage-specific genes

Further analyses

Principal component analysis (PCA) were performed using prcomp in R. To choose 100 random genes, we used the rand function. Gene set enrichment analysis were performed on KEGG ( and Gene Ontology [5] using kegga and goana in limma, respectively. In order to visualize outlier genes that are significant with a large effect size, volcano plots could be obtained by plotting the -log10 transformed p-value vs. the log fold-change of gene expression. Heat maps of significant stage-specific differentially expressed genes were visualized using heatmap and clustered using hclust. Novelty of the identified stage-specific genes was ascertained by screening against the Cancer Gene Census v84 [14].


The TCGA expression data consisted of expression values of 20,532 genes in 423 samples. After the completion of data pre-processing, we obtained a final dataset of expression data for 18,590 genes across 399 samples annotated with the corresponding sample stage (available in Supplementary File S1). The stagewise distribution of TCGA samples along with the corresponding AJCC staging is shown in Table 3. A statistical summary of demographic details including age, sex, height, weight, and vital status is shown in Table 4. The body mass index (BMI) distribution was derived from patient clinical data that had both height and weight (i.e, neither was ‘NA’). The average age of onset of HCC was around 60 years, and the average BMI was about 26, indicating a possible link with ageing-associated pathology and obesity.

Table 3 AJCC Cancer staging. The correspondence between the AJCC staging and the TCGA staging for LIHC is noted, along with the number of LIHC cases in each stage in the TCGA dataset. Control indicates the number of normal tissue control samples, and NA denotes cases where the stage information is unavailable
Table 4 Summary of key demographic features of the dataset. For continuous variables (age,height, weight and BMI), the mean ± standard deviation is given. BMI is calculated only for patients with both height and weight data

The dataset was processed through voom in limma to prepare for linear modelling [28]. At a p-value cutoff of 0.05, 14,843 genes were significant for the linear model given by eqn. (1). Even raising the bar to 1E-5, 9618 genes remained significant in the linear modelling, thus implying a strong linear trend in their expression across cancer stages relative to control. This was not entirely surprising since one of the hallmarks of cancer phenotype is genome-wide instability [20]. The linear modelling highlighted top ranked genes, some upregulated in HCC (GABRD, PLVAP, CDH13) and some downregulated (CLEC4M, CLEC1B, CLEC4G). The lfc for each stage with respect to control of top ten genes (ranked by adjusted p-value) are shown in Table 5, along with their inferred regulation status. Boxplots of the expression of the top 9 genes (Fig. 5) indicated elevated expression across cancer stages relative to control for up-regulated genes, while depressed expression across cancer stages relative to control was indicative of downregulated genes. (Boxplots of all other genes in the top 200 are provided in the Supplementary Fig. S1) It is worthwhile to note that a given gene might have maximal differential expression in any stage (not necessarily stage 4), and the linear trend does not suggest the order of expression across stages (Fig. 6).

Table 5 Top 10 genes of the linear model. The log-fold change expression of the gene in each stage relative to the controls are given, followed by p-value adjusted for the false discovery rate, and the regulation status of the gene in the cancer stages with respect to the control
Fig. 5

Boxplots of top 9 linear model genes. For each gene, notice that the trend in expression could be either overexpression or downregulation relative to the control. For e.g., GABRD, PLVAP, CXorf36, CDH13 and UBE2T are overexpressed, while CLEC4M, CLEC1B, BMP10, and CLEC4G are downregulated. It could be seen that a linear trend does not imply maximal |lfc| in stage 4, as illustrated most clearly in the case of UBE2T

Fig. 6

Boxplots illustrating stage-specificity of differentially expressed genes. Extremal expression in a stage could be either maximal expression or minimal expression relative to the control and all other stages, and could be termed maximal differential expression. Here we show genes with maximal differential expression in stage-I (WDR72; minimum expression), stage-II (GLI4, maximum expression; COLEC11, minimum expression), stage-III (CKAP2; maximum expression), and stage-IV (MAPK11; maximum expression)

A PCA of the top 100 genes from the linear model was visualized using the top two principal components (Fig. 7a). A clear separation of the controls and the cancer samples could be seen, suggesting the extent of differential expression of these genes in cancer samples. Hence linear modelling yields cancer-specific genes versus normal controls, and the results for the all the genes, including the top 100, are provided in order in Supplementary File S2. For comparison, a PCA plot of 100 randomly sampled genes (Fig. 7b) failed to show any separation of the cancer and control samples.

Fig. 7

Principal components analysis of cancer vs control. a The first two principal components of the top 100 genes from linear modeling are plotted. It could be seen that control samples (red) clustered independent of the cancer samples (colored by stage). b The same analysis repeated with 100 random genes failed to effect a clustering of the control samples relative to the cancer samples

To ascertain an ordered trend of expression across cancer stages, the linear model given by eqn. (2) was fit. At a p-value of 0.05, 14,127 genes were significant, and raising the bar to 1E-5 still left 8032 genes significant. A goodness of fit with eqn. (2) does not equate with a monotonic trend of expression; i.e., a a gene with a significant linear fit is not required to follow a monotonic trend of mean expression with cancer stage. Using the definition of monotonicity given in the Methods section, we found 2109 genes showing strictly monotonic expression with the cancer stage and reaching maximum absolute mean expression in stage IV. Each such gene was annotated and ranked with the p-value from eqn. (2). This yielded 1977 genes with significant (i.e, p-val < 0.05) monotonic trends of mean expression across cancer stages, with 1602 upregulated and 375 downregulated. The top 20 such genes are presented in Table 6.

Table 6 Top 20 genes with significant monotonic patterns of expression. Intercept, Coefficient and Adj. p-value are from the linear model given by eqn.(2). Status indicates monotonic upregulation (UP) or monotonic downregulation (DOWN). The genes are sorted by significance (adj.p-value)

The results from the linear modelling were in contrast with those obtained by Huo et al. [25] and were most likely driven by an improved design and the inclusion of 51 controls in our study. These positive results provided the impetus to pursue stage-driven analysis. Given the conventional AJCC staging, gene expression differences would play a major role in driving the cancer progression. To identify the stage-specific differentially expressed genes, we applied the first contrast matrix (Table 2) and constructed the four-bit stage string of each gene. Based on the stage strings, we binned all the genes, and the string-specific gene lists corresponding to all the partitions in the Venn diagram (Fig. 4) is made available in Supplementary File S3. The size of each such partition is illustrated in Fig. 8. We eliminated the 16,135 genes corresponding to the stage string ‘0000’ (|lfc| < 2 in all stages). To establish the significance of the remaining genes, we applied the second contrast (Table 3) and passed each gene through the four filter criteria. The gradual reduction in candidate stage-specific genes as each criterion was applied, is shown in Table 7. Only genes that passed all criteria were retained as significant stage-specific differentially expressed genes. We obtained 2 stage-I specific, 2 stage-II specific, 10 stage-III specific and 35 stage-IV specific genes (Table 8). Figure 9 shows the volcano plot of these 49 stage-specific genes.

Fig. 8

Venn illustration of the size of each 4-bit string. The numbers of genes with each pattern of differential expression are shown

Table 7 Number of genes in each step of the significance analysis. Differential expression is defined with respect to a threshold |logFC| = 2. Significance analysis proceeds first by significance (i.e, p-value) with respect to control, followed by p-value in each possible pairwise contrast between the different stages. Exclusive DE genes refer to genes differentially expressed in only one of the four stages (corresponding to the bit strings '1000', '0100', '0010' and '0001')
Table 8 Final set of highlighted genes in each stage. The genes in each stage are ordered by increasing adjusted p-values of the linear modelling analysis. Stage-IV specific genes with monotonic changes of expression correlating with disease progression are highlighted
Fig. 9

Volcano plot of the 49 significant stage-specific differentially expressed genes. Stage 1 genes, red; Stage 2, blue; Stage 3, green; and Stage 4, orange. The genes are seen to orient away from the origin and the axes, indicating significance and effect size

In view of the limited sample size for stage-IV and consequent low power for rejecting false-positives, we stipulated that each stage-IV specific gene would display a smooth increasing or decreasing expression trend through cancer progression culminating in maximum differential expression in stage-IV. On this basis, we pruned the 35 stage-IV specific genes to just the top ten by significance in the linear modelling. This yielded a total of 24 stage-specifc genes of interest.

A heatmap of the lfc expression of these stage-specific genes across the stages was generated (Fig. 10a) and revealed a systematic gradient in expression relative to control, involving both downregulation and overexpression. The map was clustered on the basis of differential expression (i.e, |lfc|) both across stages and across features (i.e, genes) (Fig. 10b). It was seen that stage I genes clustered together, stage II genes co-clustered with NCAPG2 and DLG5 from stage-III, all the other stage-III genes clustered together, while the stage-IV genes formed two separate clusters. It was interesting to note that GABRD emerged as an outgroup to all the clusters, demonstrating its uniqueness.

Fig. 10

Heatmap plots of the final 24 stage-specific genes. a heatmap generated from the lfc values of all the stage-specific genes (arranged stagewise). The color gradient spans the spectrum from downregulation (blue) to overexpression (red). Log fold changes upto sixfold are seen, indicating 64 times differential expression with respect to control. b Representation of the stagewise gene expression based on clustering of differential expression profiles

To identify the biological processes specific to each stage, we used the genes with maximal |lfc| in each stage and performed a stagewise gene set enrichment analysis on two ontologies, the GO and KEGG pathways. Salient results with respect to KEGG pathways are presented below (Table 9) and the complete KEGG and GO results are available in Supplementary Tables S1 and S2, respectively. In stage I, we found the significant enrichment of cell-cycle signaling pathways (Hippo, Wnt, HIF-1), and viral infection-related pathways (cytokine-cytokine receptor interaction, human papillomavirus infection, HTLV-I infection). In stage II, key signalling pathways (Ras, MAPK) were aberrant. Two liver-specific pathways, alcoholism and cytochrome P450 mediated metabolism of xenobiotics were enriched, as well as standard cancer pathways of bladder, brain, stomach, and skin that might involve generic genetic alterations necessary for cancer cell growth. In stage III, we noticed the significant enrichment of Metabolic pathways that summarize cellular metabolism. This might indicate the metabolic shift needed by the cancer to grow and invade neighboring tissues. Other salient significantly enriched pathways pertained to increased cell cycle progression, DNA replication, chemical carcinogenesis, p53 signaling pathway and cellular senescence, all hallmark processes critical to cancer progression. Stage IV gene set was significantly enriched for bile-related processes (bile secretion, primary bile acid biosynthesis), and ABC transporters (possibly conferring a drug-resistant advanced cancer phenotype). A signaling pathway related to diabetic complications was enriched as well, indicating the role of co-morbidities in driving liver cancer progression. The enrichment analysis of the top 100 genes of the linear model is included in the Supplementary Table S3.

Table 9 Gene set enrichment analysis. Stage-specific gene sets (all the differentially expressed genes, corresponding to row 'DE genes' in Table 6) were analyzed for significant enrichment with respect to KEGG Pathways. Significance was based on p-value <0.05


When differentially expressed genes are identified in a two-class cancer vs control manner, the information about stage-specificity of differential expression is lost. By applying our protocol, this information is recovered and available for dissection. The top linear model genes and all the stage-specific differentially expressed genes (Table 10) were analyzed with respect to the existing literature.

Table 10 Stagewise effect sizes and significance of stage specific genes. The stagewise log foldchanges of differential expression of each candidate stage-specific gene in tumor samples relative to normal control samples are shown, along with significance values, and its inferred regulation status. In stage-IV, only the top 10 genes are shown. The stage-specificity of the genes are emphasized

Top genes of linear models

Three C-type lectin domain proteins (CLEC4M, CLEC1B, CLEC4G) were detected in the top ten genes of linear model given by eqn. (1). Interestingly, this identical cluster of three genes was detected as the most significantly downregulated liver cancer-specific genes in a qPCR study of an independent cohort of 65 tumor-normal matched cases [21]. On screening the top 200 linear model (1) genes against cancer driver genes in the Cancer Gene Census, only four genes were found, namely BUB1B, CDKN2A, EZH2, and RECQL4. The top 200 genes of the linear model given by eqn. (2) overlapped with 111 genes of linear model (1) and yielded six genes from the Cancer Gene Census, namely BUB1B, EZH2, CDKN2C, CANT1, POLD1, and STIL. Both CDKN2A and CDKN2C are cyclin-dependent kinase inhibitors. CDKN2A was a member of the gene signatures for HCC prognosis independently proposed by Gillet et al. [16] and Yang et al. [59]. It was remarkable that GABRD stood out as the top gene in both the linear models, and with a monotonic order of expression with the cancer stage. GABRD is discussed further in the section on Stage-IV specific genes. A gene with a monotonicity of expression may be increasingly upregulated as the cancer initiates, progresses and metastasizes, signalling its oncogenic progression; or conversely, it may be increasingly downregulated with the cancer stages, signalling the loss of tumor suppressor activity. Screening the top 200 genes with monotonic expression against the Cancer Gene Census yielded a completely different set of six genes: HSP90AB1, ALDH2, ESR1, PPP2R1A, HIST1H4I, SEPT5. HSP90AB1, a heat shock protein and molecular chaperone, was a key result of Xu et al. [56] where it played a dual role, one in the set of 50 hub genes correlated with Barcelona Clinic Liver Cancer (BCLC) staging of HCC patients, and another, in the set of 13 hub genes correlated with overall survival of HCC patients. HSP90AB1 might have a significant role in the aetiology of HCC, given that its expression is known to be upregulated by hepatitis B virus encoded X protein [31]. The monotonic changes in HSP90AB1 might further facilitate its known roles in angiogenesis [19]. The top 200 genes with monotonic expression had 15 genes in common with the top 200 of linear model (1) and 16 genes in common with the top 200 of linear model (2). However, only six genes were common to the top 200 of all three (namely GABRD, PIGU, NDUFA4L2, CRHBP, FLVCR1, TTC13; Fig. 11). NDUFA4L2 has been identified as a target gene of HIF-1 (hypoxia-inducible transcription factor-1), and a key factor driving the metabolic reprogramming in hypoxic micro-environments [46]. Our findings established that not only was NDUFA4L2 significantly overexpressed in HCC (as noted in [27]), but its overexpression follows a significant monotonic pattern across cancer stages, a much stronger statement that would support the role of NDUFA4L2 in driving HCC progression. Similarly, the expression of CRHBP has been recently shown to be negatively associated with the tumor size in HCC [55]. Our study provides a more quantitative account of the significant monotonic downregulation of CRHBP with the HCC stage. Two proteins of the glycosylphosphatidylinositol (GPI) anchoring system, PIGU and PIGC, were top genes with respect to significant monotonic expression (Table 6); of these, PIGU is a known bladder cancer oncogene [18].

Fig. 11

Boxplot of top genes with monotonic expression. These six genes (GABRD, PIGU, NDUFA4L2, CRHBP, FLVCR1, TTC13) showed monotonic trends of expression across the cancer stages, and were topranked in both the linear models given by eqns. (1) and (2)

Stage-I specific DEGs (Fig. 12)

CA9 is a member of carbonic anhydrases, which are a large family of zinc metalloenzymes that catalyse the reversible hydration of carbon dioxide. Its expression in clear cell Renal carcinoma, but not in functional kidney cells has gained attention for its use as a pre-operative biomarker [30]. The WNT7B protein is part of the Wnt family, a family of secreted signalling proteins. Elevated WNT7B in pancreatic adenocarcinoma has been found to mediate anchorage independent growth [4]. Surprisingly, both CA9 and WNT7B are downregulated in HCC, most so in stage-I, contrary to their role in other cancers. A concrete interpretation of the role of these genes in HCC awaits appropriately designed experimental studies.

Fig. 12

Boxplot of stage-I specific genes. It is seen that CA9 and WNT7B are both maximally downregulated in stage-I

It is pertinent to ask the following question here: which genes are essential for the initiation of HCC? Clearly these genes would be differentially expressed in stage I relative to control. All significantly differentially expressed genes with maximal |lfc| in stage-I would be the best candidates for genes involved in the initiation of HCC. These 122 genes are provided in the Supplementary File S3.

Stage-II specific DEGs (Fig. 13)

APOBEC3B, a DNA cytidine deaminase, is a known cancer driver gene in the Cancer Gene Census, but there are no literature reports of its stage-specificity in any cancer. It is known to account for half the mutational load in breast carcinoma, and its target sequence context was found to be highly mutated in Bladder, lung, cervix, neck, and head cancers as well [7]. Further studies have attributed specific hypermutation signatures across all cancers to the APOBEC family, including APOBEC3B [1]. Here APOBEC3B is upregulated, increasing its capacity to inflict the hypermutator phenotype, and highlighting an intriguing stage-specificity in its action. FAM186A polymorphisms have been reported in GWAS and SNP studies on colorectal cancer patients and shown to have a significant odds ratio in risk heritability [48]. FAM163A was a component of the 8-gene signature used for the risk stratification of HCC patients [39].

Fig. 13

Boxplot of stage-II specific genes. It is seen that both APOBEC3B and FAM186A are maximally overexpressed in stage-II, the trend following an inverted U-shape

Stage-III specific DEGs (Fig. 14)

C12orf48, also known as PARI, participates in the homologous recombination pathway of DNA repair, and its overexpression has been reported in pancreatic cancer[35]. Further PARI was recently identified as a transcriptional target of FOXM1 [62], which is a well-validated upregulated gene in HCC [21]. DLG5 is a cell polarity gene and its downregulation has been implicated in the malignancy of breast [32], prostate [49] and bladder cancers [65]. It has been recently found that lower DLG5 expression is correlated with advanced stages of HCC and essential for invadopodium formation, an event critical to cancer metastasis [26]. It is surprising that our study has identified a stage-III specific upregulation in DLG5. Interestingly, evidence is emerging to lend support to our finding that DLG5 might be tumor-promoting. In a very recent review, Saito et al. [43] reinterpreted published results on cell polarity and cancer, and advanced an alternative perspective on the role of polarity regulators in cancer biology. They argued that both cellular and subcellular polarity would be regulated by DLG5 and related polarity proteins. Subcellular polarity might improve the cellular fitness for proliferation and stemness, thereby causing tumor promotion. Hence cell polarity regulation is anti-tumorigenic and subcellular polarity regulation is pro-tumorigenic, and our analysis has uncovered the pro-tumorigenic upregulated activity of DLG5. ECT2 encodes a guanine nucleotide exchange factor that remains elevated during the G2 and M phase in cellular mitosis. ECT2 is found to be upregulated in lung adenocarcinoma and lung squamous cell carcinoma [66], as well as in invasive breast cancer [52]. NCAPG2 is a component of the condensing II complex and involved in chromosome segregation during mitosis. NCAPG2 level were found to be increased in non-small cell lung cancer, and its over-expression was found to be correlated with lymph node metastasis, thus enabling the use of NCAPG2 as a poor prognostic biomarker in lung adenocarcinoma [61]. GNMT is a methyltransferase that catalyses conversion of S-adenosine methionine to s-adenosyl cysteine. In the absence of GNMT, S-adenosine methionine causes hypermethylation of DNA, which represses GNMT levels and is found in HCC samples [24]. This is an epigenetic mechanism for loss of function of tumor suppressors and our study here confirmed the downregulation of GNMT expression. PRR11 is found to be over-expressed in lungs, and its silencing using siRNA resulted in cell cycle arrest and apoptotic cell death, followed by decreased cell growth and viability [64]. A similar knock out experiment of PRR11 in hilar cholangiocarcinoma cell lines resulted in decreased cellular proliferation, migration, and tumor growth [9]. WDHD1 is a key post-transcriptional regulator of centromeric, and consequently genomic, integrity [23] and its overexpression has been identified as biomarker of acute myeloid leukemia [53], and lung and esophageal carcinomas [44]. C15orf42 has been implicated in nasopharyngeal carcinoma [3]. ORC6L overexpression has been identified as a prognostic biomarker of colorectal cancer possibly by enhancing chromosomal instability [54]. XRCC2 was found to increase locally advanced rectal cancer radioresistance by repairing DNA double-strand breaks and preventing cancer cell apoptosis [40]. XRCC2 was also highlighted in the gene signature for HCC prognosis advanced by Gillet et al. [16].

Fig. 14

Boxplot of stage-III specific genes. Except for GNMT, the expression of stage-III specific genes show a peak in stage-III, with the expression trend following an inverted U-shape across the stages. The expression trend is convex and reversed for the downregulated GNMT, with minimum expression in stage-III

Stage-IV specific DEGs (Fig. 15)

GABRD, which was the top gene in the linear models as well, encodes for the delta subunit of the gamma-amino butyric acid receptor. The GABA receptor family was found to be frequently downregulated in cancers, except for GABRD, which was found to be up-regulated. Gross et al. [17] proposed that the GABA receptor gene family might play a role in the proliferation independent differentiation of cancer cells. GBX2 is part of the GBX gene family, which are homeobox containing DNA binding transcription factors. GBX2 is overexpressed in prostate cancer and studies show that expression of GBX2 is required for malignant growth of human prostate cancer [15]. PECAM1 overexpression has been linked to peritoneal recurrence of stage II/III gastric cancer patients [47]. CEND1 has been identified as a cell-cycle protein [50]. PGAM2 is a glycolytic enzyme whose upregulation is essential for tumor cell proliferation [57]. NR1I2 downregulation has been used in constructing a prognostic 9-genes expression signature of gastric cancer [51]. GDF5 has been shown to be a downstream target of the TGF-beta signaling pathway [33], stimulating angiogenesis required for the growth and spread of the cancer. GPR1 has been reported to be involved in promoting cutaneous squamous cell carcinoma migration [12]. Two other stage-IV specific genes, namely the downregulated CXCR2P1, which is a C-X-C motif chemokine receptor 2 pseudogene 1, and LOC25845, are minimally documented in the literature in the context of HCC, other cancers or any other condition. It is worth mentioning however that CXCR2, a member of the GPCR protein family binding the interleukin IL8, has been reported as an effective non-invasive blood based biomarker for HCC [45]. It is notable that ARHGAP42, a Rho GTPase activating protein, was another key result of Xu et al. [56], finding a place both in their set of 50 hub genes correlated with the BCLC staging of HCC patients, and in the set of 13 hub genes correlated with overall survival of HCC patients.Most of the stage-IV specific genes show contra-regulation (i.e, no clear trend) across cancer stages, and only 15 of the 35 genes revealed a monotonic pattern of expression (highlighted in Table 8). The other 20 genes could be unique to the hallmarks of stage-IV cancer, e.g., processes related to lymph node involvement and/or metastasis.

Fig. 15

Boxplot of top 10 stage-IV specific genes. All genes, except NR1I2 and CXCR2P1, show a smooth increasing expression trend reaching peak expression in stage-IV. In the case of NR1I2 and CXC2RP1, the trend is reversed, with the expression decreasing smoothly to touch the minimum in stage-IV


We have developed an original protocol for the stagewise dissection of the HCC transcriptome. We were able to successfully fit a linear model across cancer stages and detected genes with a strong linear expression trend in the cancer phenotype. These genes were found to effectively separate the control and cancer samples. We were able to assign 2455 differentially expressed genes into one of four stages and visualized their stage specific expression using boxplots. Using a multi-layered approach, we were able to assess the significance of each stage-specific DEG and narrowed down to a handful of candidate significant stage-specific DEG’s. Our analysis yielded two stage-I specific genes (CA9, WNT7B), two stage-II specific genes (APOBEC3B, FAM186A), ten stage-III specific genes (including DLG5, NCAPG2, GNMT and XRCC2) and 35 stage-IV specific genes (including GABRD and CXCR2P1). Though most of these genes constituted novel findings in the context of HCC, a comprehensive literature search indicated connections with other cancer conditions. The analysis of monotonicity of expression has uncovered two genes with documented HCC connection, namely NDUFA4L2 and CRHBP. Correlation of our analysis with gene signatures based on the BCLC staging system revealed two common genes, namely HSP90AB1 and ARHGAP42. Our study might deepen our understanding of the mechanistic basis of HCC progression, and lay the foundation for the development of HCC diagnosis and treatment strategies. Translational research could transform our results into a panel of biomarkers for early clinical decision-making and rational drug development. It is straightforward to extend our computational methodology to the stage-based analysis of other cancers to obtain a fuller view of disease initiation, progression, and metastasis.

Availability of data and materials

All data and material are available as supplementary information (



American Joint Committee on Cancer


Barcelona Clinic Liver Cancer


Body mass index


Differentially expressed genes


Expressed sequence tag


Gamma Amino Butyric Acid


Gene Ontology


Hepatocellular carcinoma


Kyoto Encyclopaedia of Genes and Genomes


log fold-change


Liver hepatocellular carcinoma


Principal Components Analysis


quantitative Polymerase Chain Reaction


RNA-Seq by Expectation Maximization


The Cancer Genome Atlas


Tumor, Node, Metastasis


  1. 1.

    Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SA, Behjati S, Biankin AV, Bignell GR, Bolli N, Borg A, Børresen-Dale AL, Boyault S, Burkhardt B, Butler AP, Caldas C, Davies HR, Desmedt C, Eils R, Eyfjörd JE, Foekens JA, Greaves M, Hosoda F, Hutter B, Ilicic T, Imbeaud S, Imielinski M, Jäger N, Jones DT, Jones D, Knappskog S, Kool M, Lakhani SR, López-Otín C, Martin S, Munshi NC, Nakamura H, Northcott PA, Pajic M, Papaemmanuil E, Paradiso A, Pearson JV, Puente XS, Raine K, Ramakrishna M, Richardson AL, Richter J, Rosenstiel P, Schlesner M, Schumacher TN, Span PN, Teague JW, Totoki Y, Tutt AN, Valdés-Mas R, van Buuren MM, van ‘t Veer L, Vincent-Salomon A, Waddell N, Yates LR, Australian Pancreatic Cancer Genome Initiative; ICGC Breast Cancer Consortium; ICGC MMML-Seq Consortium; ICGC PedBrain, Zucman-Rossi J, Futreal PA, McDermott U, Lichter P, Meyerson M, Grimmond SM, Siebert R, Campo E, Shibata T, Pfister SM, Campbell PJ, Stratton MR. Signatures of mutational processes in human cancer. Nature. 2013;500(7463):415–21.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  2. 2.

    Amin MB, Greene FL, Edge SB, Compton CC, Gershenwald JE, Brookland RK, Meyer L, Gress DM, Byrd DR, Winchester DP. The eighth edition AJCC Cancer staging manual: continuing to build a bridge from a population-based to a more “personalized” approach to cancer staging. CA Cancer J Clin. 2017;67(2):93–9.

    Article  Google Scholar 

  3. 3.

    An F, Zhang Z, Xia M. Functional analysis of the nasopharyngeal carcinoma primary tumor-associated gene interaction network. Mol Med Rep. 2015;12(4):4975–80.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  4. 4.

    Arensman MD, Kovochich AN, Kulikauskas RM, Lay AR, Yang PT, Li X, Donahue T, Major MB, Moon RT, Chien AJ, Dawson DW. WNT7B mediates autocrine WNT/β-catenin signaling and anchorage-independent growth in pancreatic adenocarcinoma. Oncogene. 2014;33(7):899.

    CAS  Article  Google Scholar 

  5. 5.

    Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25(1):25.

    CAS  Article  Google Scholar 

  6. 6.

    Broad Institute TCGA Genome Data Analysis Center. Analysis-ready standardized TCGA data from broad GDAC firehose 2016_01_28 run. Broad institute of MIT and Harvard. Dataset; 2016.

    Google Scholar 

  7. 7.

    Burns MB, Temiz NA, Harris RS. Evidence for APOBEC3B mutagenesis in multiple human cancers. Nat Genet. 2013;45(9):977.

    CAS  Article  Google Scholar 

  8. 8.

    Cancer Research UK,, Accessed 05 May 2018.

  9. 9.

    Chen Y, Cha Z, Fang W, Qian B, Yu W, Li W, Yu G, Gao Y. The prognostic potential and oncogenic effects of PRR11 expression in hilarcholangiocarcinoma. Oncotarget. 2015;6(24):20419.

    PubMed  PubMed Central  Google Scholar 

  10. 10.

    Chuang SC, La Vecchia C, Boffetta P. Liver cancer: descriptive epidemiology and risk factors other than HBV and HCV infection. Cancer Lett. 2009;286(1):9–14.

    CAS  Article  Google Scholar 

  11. 11.

    Farazi PA, DePinho RA. Hepatocellular carcinoma pathogenesis: from genes to environment. Nat Rev Cancer. 2006;6(9):674.

    CAS  Article  Google Scholar 

  12. 12.

    Farsam V, et al. Senescent fibroblast-derived Chemerin promotes squamous cell carcinoma migration. Oncotarget. 2016;7(50):83554–69.

    Article  PubMed  PubMed Central  Google Scholar 

  13. 13.

    Ferlay J, Soerjomataram I, Dikshit R, Eser S, Mathers C, Rebelo M, Parkin DM, Forman D, Bray F. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. Int J Cancer. 2015;136:E359–86.

    CAS  Article  Google Scholar 

  14. 14.

    Futreal PA, Coin L, Marshall M, Down T, Hubbard T, et al. A census of human cancer genes. Nat Rev Cancer. 2004;4:177–83.

    CAS  Article  Google Scholar 

  15. 15.

    Gao AC, Lou W, Isaacs JT. Down-regulation of homeobox gene GBX2 expression inhibits human prostate cancer clonogenic ability and tumorigenicity. Cancer Res. 1998;58(7):1391–4.

    CAS  PubMed  Google Scholar 

  16. 16.

    Gillet JP, Andersen JB, Madigan JP, Varma S, Bagni RK, Powell K, et al. A gene expression signature associated with overall survival in patients with hepatocellular carcinoma suggests a new treatment strategy. Mol Pharmacol. 2016;89(2):263–72.

    CAS  Article  Google Scholar 

  17. 17.

    Gross AM, Kreisberg JF, Ideker T. Analysis of matched tumor and normal profiles reveals common transcriptional and epigenetic signals shared across cancer types. PLoS One. 2015;10(11):e0142618.

    Article  Google Scholar 

  18. 18.

    Guo Z, Linn JF, Wu G, Anzick SL, Eisenberger CF, Halachmi S, Cohen Y, Fomenkov A, Hoque MO, Okami K, Steiner G, Engles JM, Osada M, Moon C, Ratovitski E, Trent JM, Meltzer PS, Westra WH, Kiemeney LA, Schoenberg MP, Sidransky D, Trink B. CDC91L1 (PIG-U) is a newly discovered oncogene in human bladder cancer. Nat Med. 2004;10(4):374–81.

    CAS  Article  Google Scholar 

  19. 19.

    Haase M, Fitze G. HSP90AB1: helping the good and the bad. Gene. 2016;575:171–86.

    CAS  Article  Google Scholar 

  20. 20.

    Hanahan D, Weinberg RA. Hallmarks of cancer: the next generation. Cell. 2011;144(5):646–74.

    CAS  Article  Google Scholar 

  21. 21.

    Ho DWH, Kai AKL, Ng IOL. TCGA whole-transcriptome sequencing data reveals significantly dysregulated genes and signaling pathways in hepatocellular carcinoma. Front Med. 2015;9(3):322–30.

    Article  Google Scholar 

  22. 22.

    Hochberg Y, Benjamini Y. More powerful procedures for multiple significance testing. Stat Med. 1990;9:811–8.

    CAS  Article  Google Scholar 

  23. 23.

    Hsieh CL, et al. WDHD1 modulates the post-transcriptional step of the centromeric silencing pathway. Nucleic Acids Res. 2011;39(10):4048–62.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  24. 24.

    Huidobro C, Toraño EG, Fernández AF, Urdinguio RG, Rodríguez RM, Ferrero C, Martínez-Camblor P, Boix L, Bruix J, García-Rodríguez JL, Varela-Rey M. A DNA methylation signature associated with the epigenetic repression of glycine N-methyltransferase in human hepatocellular carcinoma. J Mol Med. 2013;91(8):939–50.

    CAS  Article  Google Scholar 

  25. 25.

    Huo T, Canepa R, Sura A, Modave F, Gong Y. Colorectal cancer stages transcriptome analysis. PLoS One. 2017;12(11):e0188697.

    Article  Google Scholar 

  26. 26.

    Ke Y, et al. Discs large homolog 5 decreases formation and function of invadopodia in human hepatocellular carcinoma via Girdin and Tks5. Int J Cancer. 2017;141(2):364–76.

    CAS  Article  PubMed  Google Scholar 

  27. 27.

    Lai RK, Xu IM, Chiu DK, Tse AP, Wei LL, Law CT, Lee D, Wong CM, Wong MP, Ng IO, Wong CC. NDUFA4L2 fine-tunes oxidative stress in hepatocellular carcinoma. Clin Cancer Res. 2016;22(12):3105–17.

    CAS  Article  PubMed  Google Scholar 

  28. 28.

    Law CW, Chen Y, Shi W, Smyth GK. Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014;15(2):R29.

    Article  Google Scholar 

  29. 29.

    Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12(1):323.

    CAS  Article  Google Scholar 

  30. 30.

    Li G, Feng G, Zhao A, Péoc’h M, Cottier M, Mottet N. CA9 as a biomarker in preoperative biopsy of small solid renal masses for diagnosis of clear cell renal cell carcinoma. Biomarkers. 2017;22(2):123–6.

    CAS  Article  Google Scholar 

  31. 31.

    Li WH, Miao XH, Qi ZT, Ni W, Zhu SY, Fang F. Proteomic analysis of differently expressed proteins in human hepatocellular carcinoma cell lines HepG2 with transfecting hepatitis B virus X gene. Chin Med J. 2009;5:15–23.

    CAS  Google Scholar 

  32. 32.

    Liu J, et al. Loss of DLG5 promotes breast cancer malignancy by inhibiting the hippo signaling pathway. Sci Rep. 2017;7:42125.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  33. 33.

    Margheri F, et al. GDF5 regulates TGFß-dependent angiogenesis in breast carcinoma MCF-7 cells: in vitro and in vivo control by anti-TGFß peptides. PLoS One. 2012;7(11):e50342.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  34. 34.

    McCarthy DJ, Smyth GK. Testing significance relative to a fold-change threshold is a TREAT. Bioinformatics. 2009;25:765–71.

    CAS  Article  Google Scholar 

  35. 35.

    O'Connor KW, et al. PARI overexpression promotes genomic instability and pancreatic tumorigenesis. Cancer Res. 2013;73(8):2529–39.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  36. 36.

    Palaniappan A, Ramar K, Ramalingam S. Computational identification of novel stage-specific biomarkers in colorectal cancer progression. PLoS One. 2016;11(5):e0156665.

    Article  Google Scholar 

  37. 37.

    Peng L, Bian XW, Li DK, Xu C, Wang GM, Xia QY, Xiong Q. Large-scale RNA-Seq transcriptome analysis of 4043 cancers and 548 Normal tissue controls across 12 TCGA Cancer types. Sci Rep. 2015;5:13413.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  38. 38.

    Perz JF, Armstrong GL, Farrington LA, Hutin YJ, Bell BP. The contributions of hepatitis B virus and hepatitis C virus infections to cirrhosis and primary liver cancer worldwide. J Hepatol. 2006;45(4):529–38.

    Article  Google Scholar 

  39. 39.

    Qiao GJ, Chen L, Wu JC, Li ZR. Identification of an eight-gene signature for survival prediction for patients with hepatocellular carcinoma based on integrated bioinformatics analysis. PeerJ. 2019;7:e6548.

    Article  Google Scholar 

  40. 40.

    Qin CJ, Song XM, Chen ZH, Ren XQ, Xu KW, Jing H, He YL. XRCC2 as a predictive biomarker for radioresistance in locally advanced rectal cancer patients undergoing preoperative radiotherapy. Oncotarget. 2015;6(31):32193.

    Article  Google Scholar 

  41. 41.

    Research Network TCGA. Comprehensive and integrative genomic characterization of hepatocellular carcinoma. Cell. 2017;169(7):1327–41.

    Article  Google Scholar 

  42. 42.

    Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. Limma powers differential expression analyses forRNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  43. 43.

    Saito Y, Desai RR, Muthuswamy SK. Reinterpreting polarity and cancer: the changing landscape from tumor suppression to tumor promotion. Biochim Biophys Acta. 2018;1869(2):103–16.

    CAS  Article  Google Scholar 

  44. 44.

    Sato N, et al. Activation of WD repeat and high-mobility group box DNA binding protein 1 in pulmonary and esophageal carcinogenesis. Clin Cancer Res. 2010;16(1):226–39.

    CAS  Article  PubMed  Google Scholar 

  45. 45.

    Shi M, Chen MS, Sekar K, Tan CK, Ooi LL, Hui KM. A blood-based three-gene signature for the non-invasive detection of early human hepatocellular carcinoma. Eur J Cancer. 2014;50(5):928–36.

    CAS  Article  Google Scholar 

  46. 46.

    Tello D, Balsa E, Acosta-Iborra B, Fuertes-Yebra E, Elorza A, Ordóñez Á, Corral-Escariz M, Soro I, López-Bernardo E, Perales-Clemente E, Martínez-Ruiz A, Enríquez JA, Aragonés J, Cadenas S, Landázuri MO. Induction of the mitochondrial NDUFA4L2 protein by HIF-1α decreases oxygen consumption by inhibiting complex I activity. Cell Metab. 2011;14(6):768–79.

    CAS  Article  PubMed  Google Scholar 

  47. 47.

    Terashima M, et al. TOP2A, GGH, and PECAM1 are associated with hematogenous, lymph node, and peritoneal recurrence in stage II/III gastric cancer patients enrolled in the ACTS-GC study. Oncotarget. 2017;8(34):57574–82.

    Article  PubMed  PubMed Central  Google Scholar 

  48. 48.

    Timofeeva MN, et al. Recurrent coding sequence variation explains only a small fraction of the genetic architecture of. Colorectal Cancer Sci Rep. 2015;5:16286.

    CAS  Article  PubMed  Google Scholar 

  49. 49.

    Tomiyama L, Sezaki T, Matsuo M, Ueda K, Kioka N. Loss of Dlg5 expression promotes the migration and invasion of prostate cancer cells via Girdin phosphorylation. Oncogene. 2015;34(9):1141–9.

    CAS  Article  PubMed  Google Scholar 

  50. 50.

    Tsioras K, Papastefanaki F, Politis PK, Matsas R, Gaitanou M. Functional interactions between BM88/Cend1, ran-binding protein M and Dyrk1B kinase affect cyclin D1 levels and cell cycle progression/exit in mouse neuroblastoma cells. PLoS One. 2013;8(11):e82172.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  51. 51.

    Wang Z, Chen G, Wang Q, Lu W, Xu M. Identification and validation of a prognostic 9-genes expression signature for gastric cancer. Oncotarget. 2017;8(43):73826–36.

  52. 52.

    Wang HK, Liang JF, Zheng HX, Xiao H. Expression and prognostic significance of ECT2 in invasive breast cancer. J Clin Pathol. 2018;71(5):442–5.

    CAS  Article  PubMed  Google Scholar 

  53. 53.

    Wermke M, et al. RNAi profiling of primary human AML cells identifies ROCK1 as a therapeutic target and nominates fasudil as an antileukemic drug. Blood. 2015;125(24):3760–8.

    CAS  Article  PubMed  Google Scholar 

  54. 54.

    Xi Y, Formentini A, Nakajima G, Kornmann M, Ju J. Validation of biomarkers associated with 5-fluorouracil and thymidylate synthase in colorectal cancer. Oncol Rep. 2008;19(1), 257–62.

  55. 55.

    Xia HB, Wang HJ, Fu LQ, Wang SB, Li L, Ru GQ, He XL, Tong XM, Mou XZ, Huang DS. Decreased CRHBP expression is predictive of poor prognosis in patients with hepatocellular carcinoma. Oncol Lett. 2018;16(3):3681–9.

    Article  PubMed  PubMed Central  Google Scholar 

  56. 56.

    Xu W, Rao Q, An Y, Li M, Zhang Z. Identification of biomarkers for Barcelona clinic liver Cancer staging and overall survival of patients with hepatocellular carcinoma. PLoS One. 2018;13(8):e0202763.

    Article  Google Scholar 

  57. 57.

    Xu Y, et al. Oxidative stress activates SIRT2 to deacetylate and stimulate phosphoglycerate mutase. Cancer Res. 2014;74(13):3630–42.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  58. 58.

    Xu XR, Huang J, Xu ZG, Qian BZ, Zhu ZD, Yan Q, Cai T, Zhang X, Xiao HS, Qu J, Liu F. Insight into hepatocellular carcinogenesis at transcriptome level by comparing gene expression profiles of hepatocellular carcinoma with those of corresponding noncancerous liver. Proc Natl Acad Sci. 2001;98(26):15089–94.

    CAS  Article  Google Scholar 

  59. 59.

    Yang H, Zhang X, Cai XY, Wen DY, Ye ZH, Liang L, Zhang L, Wang HL, Chen G, Feng ZB. From big data to diagnosis and prognosis: gene expression signatures in liver hepatocellular carcinoma. PeerJ. 2017;5:e3089.

    Article  Google Scholar 

  60. 60.

    Yang JD, Roberts LR. Hepatocellular carcinoma: a global view. Nat Rev Gastroenterol Hepatol. 2010;7:448–58.

    Article  PubMed  PubMed Central  Google Scholar 

  61. 61.

    Zhan P, Xi GM, Zhang B, Wu Y, Liu HB, Liu YF, Xu WJ, Zhu Q, Cai F, Zhou ZJ, Miu YY. NCAPG2 promotes tumour proliferation by regulating G2/M phase and associates with poor prognosis in lung adenocarcinoma. J Cell Mol Med. 2017;21(4):665–76.

    CAS  Article  Google Scholar 

  62. 62.

    Zhang Y, et al. PARI functions as a new transcriptional target of FOXM1 involved in gastric cancer development. Int J Biol Sci. 2018;14(5):531–41.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  63. 63.

    Zhang Y, Qiu Z, Wei L, Tang R, Lian B, Zhao Y, He X, Xie L. Integrated analysis of mutation data from various sources identifies key genes and signaling pathways in hepatocellular carcinoma. PLoS One. 2014;9(7):e100854.

    Article  Google Scholar 

  64. 64.

    Zhao Q. RNAi-mediated silencing of praline-rich gene causes growth reduction in human lung cancer cells. Int J Clin Exp Pathol. 2015;8(2):1760.

    CAS  PubMed  PubMed Central  Google Scholar 

  65. 65.

    Zhou Z, et al. Methylation-mediated silencing of Dlg5 facilitates bladder cancer metastasis. Exp Cell Res. 2015;331(2):399–407.

    CAS  Article  PubMed  Google Scholar 

  66. 66.

    Zhou S, Wang P, Su X, Chen J, Chen H, Yang H, Fang A, Xie L, Yao Y, Yang J. High ECT2 expression is an independent prognostic factor for poor overall survival and recurrence-free survival in non-small cell lung adenocarcinoma. PLoS One. 2017;12(10):e0187356.

    Article  Google Scholar 

Download references


We would like to thank the peer reviewers for improving an earlier version of the manuscript. We would like to thank the departments of Bioinformatics and Bioengineering, SASTRA deemed University, for infrastructure and computing support. A.P. acknowledges support from DST-SERB grant no. EMR/2017/000470, Government of India.


This work was supported in part by DST-SERB grant no. EMR/2017/000470, Government of India.

Author information




AP conceived and designed the study. AS and AP performed the experiments and analyzed the data. AP and AsS wrote the manuscript. All authors approved the final manuscript.

Corresponding author

Correspondence to Ashok Palaniappan.

Ethics declarations

Ethics approval and consent to participate

All data used in the study is available in the public domain, constituting anonymized patient data provided by the TCGA consortium.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Sarathi, A., Palaniappan, A. Novel significant stage-specific differentially expressed genes in hepatocellular carcinoma. BMC Cancer 19, 663 (2019).

Download citation


  • LIHC transcriptomics
  • HCC stages
  • Stage-specific biomarkers
  • Differentially expressed genes
  • Pairwise contrasts
  • Significance analysis
  • Linear modelling
  • Tumorigenesis
  • Cancer progression
  • Metastasis
  • Monotonic expression