Novel Significant Stage-Specific Differentially Expressed Genes in Liver Hepatocellular Carcinoma

Liver cancer is among the top deadly cancers worldwide with a very poor prognosis, and the liver is a particularly vulnerable site for metastasis of other cancers. In this study, we developed a novel computational framework for the stage-specific analysis of hepatocellular carcinoma initiation and progression. Using publicly available clinical and RNA-Seq data of cancer samples and controls, we annotated the gene expression matrix with sample stages. We performed a linear modelling analysis of gene expression across all stages and found significant genome-wide changes in gene expression in cancer samples relative to control. Using a contrast against the control, we were able to identify differentially expressed genes (log fold change >2) that were significant at an adjusted p-value < 10E-3. In order to identify genes that were specific to each stage without confounding differential expression in other stages, we developed a full set of pairwise stage contrasts and enforced a p-value threshold (<0.05) for each such contrast. Genes were specific for a stage if they passed all the significance filters for that stage. Our analysis yielded two stage-I specific genes (CA9, WNT7B), two stage-II specific genes (APOBEC3B, FAM186A), ten stage-III specific genes including DLG5, PARI and GNMT, and ten stage-IV specific genes including GABRD, PGAM2 and PECAM1. Of these, only APOBEC3B is an established cancer driver gene. DLG5 was found to be tumor-promoting contrary to the cancer literature on this gene. Further, GABRD, well studied in literature on other cancers, emerged as a stage-IV specific gene. Our findings could be validated using multiple sources of omics data as well as experimentally. The biomarkers identified herein could potentially underpin diagnosis as well as pinpoint drug targets.


INTRODUCTION
Liver cancer is the second most deadly cancer in terms of mortality rate, with a very poor prognosis (Yang et al 2010). It accounted for 9.1% of all cancer deaths, and 83% of the annual new estimated 782,000 liver cancer cases worldwide occur in developing countries (Ferlay et al., 2015). Liver cancer showed the greatest increase in mortality in the last decade for both males (53%) and females (59%) (Cancer Research UK, 2018). Liver hepatocellular carcinoma (LIHC; HCC) is the most common type of liver cancer. 78% of all reported cases of LIHC were due to viral infections (53% Hepatitis B virus and 25% Hepatitis C virus) (Perz et al., 2006). There are several non-viral causes of LIHC as well, mainly aflatoxins and alcohol (Chuang et al., 2009). As shown in Fig. 1, all the factors converge to a common mechanism of genetic alterations that lead to the acquisition of cancer hallmarks (Hanahan and Weinberg, 2011) and the eventual emergence of a cancer cell (Farazi et al., 2006). Genetic alterations constitute the heart of the problem, and studying changes due to these genetic alterations is paramount to understand LIHC. Early gene expression studies using EST data detected differential expression in cancer tissue compared to non-cancerous liver and proposed the existence of genetic aberrations and changes in transcriptional regulation in LIHC (Xu et al., 2001). The Cancer Genome Atlas (TCGA) research network (2017) have subtyped and identified many potential targets for LIHC based on a comprehensive multi-omics analysis. An independent analysis of TCGA RNA-Seq data encompassing 12 cancer tissues has uncovered liver cancer-specific genes (Peng et al., 2015). Zhang et al. (2015) have performed mutation analysis of LIHC, and Yang et al. (2017) combined TCGA expression data and natural language processing techniques to identify cancer-specific markers.
The burden of disease and mortality rate are both inversely correlated with the cancer stage. The response rate to therapy is also inversely correlated with stage. To the best of our knowledge, there are no reported research in the literature that have dissected the stage-specific features of LIHC. The cancer staging system is based on gross features of cancer anatomical penetration, and one such standard is the American Joint Committee on Cancer (AJCC) Tumor-Node-Metastasis (TNM) staging (Amin et al., 2017). It is reasonable to hypothesize that the stage-specific gross changes are associated with signature molecular events, and try to probe such molecular bases of stage-wise progression of cancer. We had earlier published on stage-specific "hub driver" genes in colorectal cancer (Palaniappan et al., 2016). A stage-focussed analysis of colorectal cancer transcriptome data yielded negative results vis-a-vis the AJCC staging system (Huo et al., 2017).

METHODS
DATA PREPROCESSING. Normalized and log2-transformed Illumina HiSeq RNA-Seq gene expression data processed by the RSEM pipeline (Li and Dewey, 2011) were obtained from TCGA via the firebrowse.org portal (Broad Institute TCGA Genome Data Analysis Center, 2016). The patient barcode (uuid) of each sample encoded in the variable called 'Hybridization REF' was parsed and used to annotate the controls and cancer samples (Fig. 2). To annotate the stage information of the cancer samples, we obtained the clinical information dataset for LIHC from firebrowse.org (LIHC.Merge_Clinical.Level_1.2016012800.0.0.tar.gz) and merged the clinical data with the expression data by matching the "Hybridization REF" in the expression data with the aliquot barcode identifier in the clinical data. The stage information of each patient was encoded in the clinical variable "pathologic stage". The substages (A,B,C) were collapsed into the parent stage, resulting in four stages of interest (I, II, III, IV). We retained a handful of other clinical variables pertaining to demographic features, namely age, sex, height, weight, and vital status. With this merged dataset, we filtered out genes that showed little change in expression across all samples (defined as σ < 1). Finally, we removed cancer samples from our analysis that were missing stage annotation (value 'NA' in the "pathologic stage"). The data pre-processing was done using R (www.r-project.org).
LINEAR MODELLING. Linear modelling of expression across cancer stages relative to the baseline expression (i.e, in normal tissue controls) was performed for each gene using the R limma package (Ritchie et al., 2015). The following linear model was fit for each gene's expression based on the design matrix shown in Fig. 3A: where the independent variables are indicator variables of the sample's stage, the intercept α is the baseline expression estimated from the controls, and βi are the estimated stagewise log fold-change (lfc) coefficients relative to controls. The linear model was subjected to empirical Bayes adjustment to obtain moderated t-statistics (McCarthy and Smyth, 2009). To account for multiple hypothesis testing and the false discovery rate, the p-values of the F-statistic of the linear fit were adjusted using the method od Hochberg and Benjamini (1990). The linear trend across cancer stages for the top significant genes were visualized using boxplots to ascertain the regulation status of the gene relative to the control.
PAIRWISE CONTRASTS. To perform contrasts, a slightly modified design matrix shown in Fig.  3B was used, which would give rise to the following linear model of expression for each gene: where the controls themselves are one of the indicator variables, and the βi are all coefficients estimated only from the corresponding samples. Our first contrast of interest, between each stage and the control, was achieved using the contrast matrix shown in Table 1. Four contrasts were obtained, one for each stage vs control. A threshold of |lfc| > 2 was applied to each such contrast to identify differentially expressed genes (with respect to the control). We used the absolute value of the lfc, since driver genes could be either upregulated or downregulated. Genes could be differentially expressed in any combination of the stages or no stage at all. To analyze the pattern of differential expression (with respect to the control), we constructed a four-bit binary string for each gene, where each bit signified whether the gene was differentially expressed in the corresponding stage. For example, the string '1100' indicates that the gene was differentially expressed in the first and second stages. There are 2 4 =16 possible outcomes of the four-bit string for a given gene corresponding to the combination of stages in which it is differentially expressed. This is illustrated in set-theoretic terms in Fig. 4. In our first elimination, we removed genes whose |lfc| < 2 for all stages. For each remaining gene, we identified the stage that showed the highest |lfc| and assigned the gene as specific to that stage for the rest of our analysis. SIGNIFICANCE ANALYSIS. We applied a four-pronged criteria to establish the significance of the stage-specific differentially expressed genes. (i) Adj. p-value of the contrast with respect to the control < 0.001 (ii)-(iv) P-value of the contrast with respect to other stages < 0.05 To obtain the above p-values (ii) -(iv), we used the contrast matrix shown in Table 2, which was then used an an argument to the contrastsFit function in limma.
FURTHER ANALYSES. Principal component analysis (PCA) were performed using prcomp in R. To choose 100 random genes, we used the rand function. Gene set enrichment analysis were performed on KEGG (www.kegg.ac.jp) and Gene Ontology (Ashburner et al., 2000) using kegga and goana in limma, respectively. In order to visualize outlier genes that are significant with a large effect size, volcano plots could be obtained by plotting the -log10 transformed p-value vs. the log fold-change of gene expression. Heat maps of significant stage-specific differentially expressed genes were visualized using heatmap and clustered using hclust. Novelty of the identified stage-specific genes was ascertained by screening against the Cancer Gene Census v84 (Futreal et al., 2004).

RESULTS
The TCGA expression data consisted of expression values of 20,532 genes in 423 samples. After the completion of data pre-processing, we obtained a final dataset of expression data for 18,590 genes across 399 samples annoatated with the corresponding sample stage (available in Supplementary File S1). The stagewise distribution of TCGA samples along with the corresponding AJCC staging is shown in Table 3. A statistical summary of demographic details including age, sex, height, weight, and vital status is shown in Table 4. The body mass index (BMI) distribution was derived from patient clinical data that had both height and weight (i.e, neither was 'NA'). The average age of onset of LIHC was around 60 years, and the average BMI was about 26, indicating a possible link with ageing and obesity.
The dataset was processed through voom in limma to prepare for linear modelling (Law et al., 2014). At a p-value cutoff of 10E-5, 9618 genes were significant in the linear modelling, implying a strong linear trend in their expression across cancer stages. This was not entirely surprising since one of the hallmarks of cancer phenotype is genome-wide instability (Hanahan and Weinberg, 2011). The linear modelling highlighted top ranked genes, some upregulated in LIHC (GABRD, PLVAP, CDH13) and some downregulated (CLEC4M, CLEC1B, CLEC4G). The lfc for each stage with respect to control of top ten genes (ranked by adjusted p-value) are shown in Table 5, along with their inferred regulation status. Boxplots of the expression of the top 9 genes ( Fig 5) indicated a progressive net increase in expression across cancer stages relative to control for up-regulated genes, while depressed expression across cancer stages relative to control was indicative of downregulated genes. (Boxplots of all other genes in the top 200 are provided in the Supplementary  Fig. S1) It is worthwhile to note that a given gene might have maximal differential expression in any stage (not necessarily stage 4), and the linear trend does not suggest the order of expression across stages (Fig. 6).
A PCA of the top 100 genes from the linear model was visualized using the top two principal components (Fig. 7A). A clear separation of the controls and the cancer samples could be seen, suggesting the extent of differential expression of these genes in cancer samples. Hence linear modelling yields cancer-specific genes versus normal controls, and the results for the all the genes, including the top 100, are provided in order in Supplementary File S2. For comparison, a PCA plot of 100 randomly sampled genes ( Fig. 7B) failed to show any separation of the cancer and control samples.
The results from the linear modelling were in contrast with those obtained by Huo et al. (2017) and were most likely driven by the inclusion of 51 controls in our study. These positive results provided the impetus to pursue stage-driven analysis. Given the conventional AJCC staging, gene expression differences would play a major role in driving the cancer progression. To identify the stage-specific differentially expressed genes, we applied the first contrast matrix ( Table 2) and constructed the four-bit stage string of each gene. Based on the stage strings, we binned all the genes, and the string-specific gene lists corresponding to all the partitions in the Venn diagram ( Fig. 4) is made available in Supplementary File S3. The size of each such partition is illustrated in Fig. 8. We eliminated the 16,135 genes corresponding to the stage string '0000' ( |lfc|<2 in all stages). To establish the significance of the remaining genes, we applied the second contrast (Table 3) and passed each gene through the four filter criteria. The gradual reduction in candidate stage-specific genes as each criterion was applied, is shown in Table 6. Only genes that passed all criteria were retained as significant stage-specific differentially expressed genes. We obtained 2 stage-I specific, 2 stage-II specific, 10 stage-III specific and 35 stage-IV specific genes (Table 7). Fig. 9 shows the volcano plot of these 49 stage-specific genes.
In view of the limited sample size for stage-IV and consequent low power for rejecting falsepositives, we stipulated that each stage-IV specific gene would display a smooth increasing or decreasing expression trend through cancer progression culminating in maximum differential expression in stage-IV. On this basis, we pruned the 35 stage-IV specific genes to just ten topranked by significance in the linear modelling.
A heatmap of the lfc expression of stage-specific genes across different stages was visualized (Fig.  10A) and revealed systematic variation in expression relative to control on a gradient from blue (downregulated) to red (overexpressed). The map was clustered on the basis of differential expression (i.e, |lfc|) both across stages and across features (i.e, genes) (Fig. 10B). Stage I genes clustered together, stage II genes co-clustered with NCAPG2 and DLG5 from stage-III, all the other stage-III genes clustered together, while the stage-IV genes formed two separate clusters. It was interesting to note that GABRD emerged as an outgroup to all the clusters, demonstrating its uniqueness.

DISCUSSION
When differentially expressed genes are identified in a two-class cancer vs control manner, the information about stage-specificity of differential expression is lost. By applying our protocol, this information is recovered and available for dissection.
To identify the biological processes specific to each stage, we used the genes with maximal |logFC| in each stage and performed a stagewise gene set enrichment analysis on two ontologies, the GO and KEGG pathways. Salient results with respect to KEGG pathways are presented below (Table 8) and the complete KEGG and GO results are available in Supplementary Tables S1 and S2, respectively. In stage I, we found the significant enrichment of cell-cycle signaling pathways (Hippo, Wnt, HIF-1), and viral infection-related pathways (cytokine-cytokine receptor interaction, human papillomavirus infection, HTLV-I infection). In stage II, key signalling pathways (Ras, MAPK) were aberrant. Two liver-specific pathways, alcoholism and cytochrome P450 mediated metabolism of xenobiotics were enriched, as well as standard cancer pathways of bladder, brain, stomach, and skin that might involve generic genetic alterations necessary for cancer cell growth. In stage III, we noticed the significant enrichment of Metabolic pathways that summarize cellular metabolism. This might indicate the metabolic shift needed by the cancer to grow and invade neighboring tissues. Other salient significantly enriched pathways pertained to increased cell cycle progression, DNA replication, chemical carcinogenesis, p53 signaling pathway and cellular senescence, all hallmark processes critical to cancer progression. Stage IV gene set was significantly enriched for bile-related processes (bile secretion, primary bile acid biosynthesis), and ABC transporters (possibly conferring a drug-resistant advanced cancer phenotype). A signaling pathway related to diabetic complications was enriched as well, indicating the role of co-morbidities in driving liver cancer progression. The enrichment analysis of the top 100 genes of the linear model is included in the Supplementary Table S3.
The top ten linear model genes (Table 5) and all the stage-specific differentially expressed genes (Table 10) were analyzed with respect to the existing literature. Three C-type lectin domain proteins (CLEC4M, CLEC1B, CLEC4G) were detected in the top ten genes of linear modelling across stages. Interestingly, this identical cluster of three genes was detected as the most significantly downregulated liver cancer-specific genes in a qPCR study of an independent cohort of 65 tumornormal matched cases (Ho et al., 2015). On screening the top 200 linear model genes against cancer driver genes in the Cancer top 200 Gene Census, only four genes were found, namely BUB1B, CDKN2A, EZH2, and RECQL4.
Stage-I specific DEGs (Fig. 11). CA9 is a member of carbonic anhydrases, which are a large family of zinc metalloenzymes that catalyse the reversible hydration of carbon dioxide. Its expression in clear cell Renal carcinoma, but not in functional kidney cells has gained attention for its use as a pre-operative biomarker (Li et al., 2017). The WNT7B protein is part of the Wnt family, a family of secreted signalling proteins. Elevated WNT7B in pancreatic adenocarcinoma has been found to mediate anchorage independent growth (Arensman et al., 2014). Surprisingly, both CA9 and WNT7B are downregulated in LIHC, most so in stage-I, contrary to their role in other cancers. A concrete interpretation of the role of these genes in LIHC awaits appropriately designed experimental studies.
It is pertinent to ask the following question here: which genes are essential for the initiation of LIHC? Clearly these genes would be differentially expressed in stage I relative to control. All significantly differentially expressed genes with maximal |lfc| in stage-I would be the best candidates for genes involved in the initiation of LIHC. These 122 genes are provided in the Supplementary File S3.
Stage-II specific DEGs (Fig. 12). APOBEC3B, a DNA cytidine deaminase, is a known cancer driver gene in the Cancer Gene Census, but there are no literature reports of its stage-specificity in any cancer. It is known to accountfor half the mutational load in breast carcinoma, and its target sequence was found to be highly mutated in Bladder, lung, cervix, neck, and head cancers as well (Burns et al., 2013). Here APOBEC3B is upregulated possibly conferring a gain-of-function comparable to that achieved by a mutation mechanism. FAM186A polymorphisms have been reported in GWAS and SNP studies on colorectal cancer patients and shown to have a significant odds ratio in risk heritability (Timofeeva et al., 2015).
Stage-III specific DEGs (Fig. 13). C12orf48, also known as PARI, participates in the homologous recombination pathway of DNA repair, and its overexpression has been reported in pancreatic cancer . It has been recently found that lower DLG5 expression is correlated with advanced stages of HCC and essential for invadopodium formation, an event critical to cancer metastasis (Ke et al., 2017). It is surprising that our study has identified a stage-III specific upregulation in DLG5. Interestingly, evidence is emerging to lend support to our finding that DLG5 might be tumor-promoting. In a very recent review, Saito et al. (2018) reinterpreted published results on cell polarity and cancer, and advanced an alternative perspective on the role of polarity regulators in cancer biology. They argued that both cellular and subcellular polarity would be regulated by DLG5 and related polarity proteins. Subcellular polarity might improve the cellular fitness for proliferation and stemness, thereby causing tumor promotion. Hence cell polarity regulation is anti-tumorigenic and subcellular polarity regulation is pro-tumorigenic, and our analysis has uncovered the pro-tumorigenic upregulated activity of DLG5. . ECT2 encodes a guanine nucleotide exchange factor that remains elevated during the G2 and M phase in cellular mitosis. ECT2 is found to be upregulated in lung adenocarcinoma and lung squamous cell carcinoma , as well as in invasive breast cancer . NCAPG2 is a component of the condensing II complex and involved in chromosome segregation during mitosis. NCAPG2 level were found to be increased in non-small cell lung cancer, and its over-expression was found to be correlated with lymph node metastasis, thus enabling the use of NCAPG2 as a poor prognostic biomarker in lung adenocarcinoma (Zhan et al., 2017). GNMT is a methyltransferase that catalyses conversion of S-adenosine methionine to sadenosyl cysteine. In the absence of GNMT, S-adenosine methionine causes hypermethylation of DNA, which represses GNMT levels and is found in HCC samples (Huidobro et al., 2013). This is an epigenetic mechanism for loss of function of tumor suppressors and our study here confirmed the downregulation of GNMT expression. PRR11 is found to be over-expressed in lungs, and its silencing using siRNA resulted in cell cycle arrest and apoptotic cell death, followed by decreased cell growth and viability (Zhao 2015). A similar knock out experiment of PRR11 in hilar cholangiocarcinoma cell lines resulted in decreased cellular proliferation, migration, and tumor growth . WDHD1 is a key post-transcriptional regulator of centromeric, and consequently genomic, integrity (Hsieh et al., 2011) and its overexpression has been identified as biomarker of acute myeloid leukemia (Wermke et al., 2015), and lung and esophageal carcinomas (Sato et al., 2010). C15orf42 has been implicated in nasopharyngeal carcinoma . ORC6L overexpression has been identified as a prognostic biomarker of colorectal cancer possibly by enhancing chromosomal instability (Xi et al., 2008). XRCC2 was found to increase locally advanced rectal cancer radioresistance by repairing DNA double-strand breaks and preventing cancer cell apoptosis (Qin et al 2015).
Stage-IV specific DEGs (Fig. 14). GABRD, which is the top gene in the linear model as well, encodes for the delta subunit of the gamma-amino butyric acid receptor. The GABA receptor family was found to be frequently downregulated in cancers, except for GABRD, which was found to be up-regulated. Gross et al. (2015) proposed that the GABA receptor gene family might play a role in the proliferation independent differentiation of cancer cells . GBX2 is part of the GBX gene family, which are homeobox containing DNA binding transcription factors. GBX2 is overexpressed in prostate cancer and studies show that expression of GBX2 is required for malignant growth of human prostate cancer (Gao et al., 1998). PECAM1 overexpression has been linked to peritoneal recurrence of stage II/III gastric cancer patients (Terashima et al., 2017). CEND1 has been identified as a cell-cycle protein (Tsioras et al., 2013). PGAM2 is a glycolytic enzyme whose upregulation is essential for tumor cell proliferation (Xu et al., 2014). NR1I2 downregulation has been used in constructing a prognostic 9-genes expression signature of gastric cancer . GDF5 has been shown to be a downstream target of the TGF-beta signaling pathway (Margheri et al., 2012), stimulating angiogenesis required for the growth and spread of the cancer. GPR1 has been reported to be involved in promoting cutaneous squamous cell carcinoma migration (Farsam et al., 2016). Two more stage-IV specific genes, namely CXCR2P1, which is a C-X-C motif chemokine receptor 2 pseudogene 1, and LOC25845, are undocumented in the literature in the context of LIHC, other cancers or any other condition.

CONCLUSION
We have developed an original protocol for the stagewise dissection of the LIHC transcriptome. We were able to successfully fit a linear model across cancer stages and detected genes with a strong linear expression trend in the cancer phenotype. These genes were found to effectively separate the control and cancer samples. We were able to assign 2455 differentially expressed genes into one of four stages and visualized their stage specific expression using boxplots. Using a multi-layered approach, we were able to assess the significance of each stage-specific DEG and narrowed down to a handful of candidate significant stage-specific DEG's. Our analysis yielded two stage-I specific genes (CA9, WNT7B), two stage-II specific genes (APOBEC3B, FAM186A), ten stage-III specific genes and ten stage-IV specific genes. Though all these genes except APOBEC3B are novel, a literature search indicated that most of the genes have a cancer connection (albeit not with LIHC). Experimental validation would be useful to translate these results into a panel of biomarkers for clinical use and rational drug development. It is straightforward to extend our computational methodology to the stage-based analysis of other cancers to obtain a fuller view of disease initiation, progression, and metastasis.

ACKNOWLEDGMENTS:
We would like to thank SASTRA deemed University, for infrastructure and computing support.

SUPPLEMENTARY INFORMATION:
All data and results are provided in the supplementary information (10.6084/m9.figshare.6455024). Figure 1 Major causative pathways of hepatocarcinogenesis. All pathways converge to progressive genomic alterations, leading a normal cell to acquire the hallmarks of cancer.         Heatmaps of final stage-specific 24 genes. A, heatmap generated from the lfc values of all the stage-specific genes (arranged stagewise). Log fold changes upto sixfold are seen, indicating 64 times differential expression with respect to control. B, Clustered representation of the stagewise gene expression based on differential expression (corresponding to |lfc|). Figure 11 Boxplot of stage-I specific genes. It is seen that CA9 and WNT7B are both maximally downregulated in stage-I. Figure 12 Boxplot of stage-II specific genes. It is seen that both APOBEC3B and FAM186A expression are maximum in stage-II, following an inverted U-shape.         C12orf48  C15orf42  ORC6L  ECT2  WDHD1  DLG5  XRCC2  NCAPG2  GNMT  PRR11   GABRD  PECAM1  LOC25845  CEND1  GBX2  PGAM2  NR1I2  GDF5  CXCR2P1  GPR1  MUSTN1  EHD2  LOC143188  HIST3H2BB  CA12  CDX1  MYO16  CPE  LPPR3  ZMYND12  KCNF1  GPR126  MCCD1  GABRB2  SNCB  TRIM50  MT3  KCNQ2  DUXA  C14orf72  ECEL  FOXE  MYH13  ARHGAP42 BMP7   Table 9 Stage specific genes and parameters. The log-expressions (β's) of each gene from our analysis are shown, along with adjusted p-values with respect to control and the linear model, and the inferred regulation status of the gene in LIHC. The stage-specificity of the genes are emphasized.