Analysis Using Random Forest and Data Integration Methods of Non-promoter Methylation and Expression Data Identifies Dysregulated Genes in Central Carbon Metabolism Pathway in Oral Cancer

Background: Studies of epigenomic alterations associated with diseases primarily focus on methylation profiles of promoter regions of genes, but not of other genomic regions. In our past work (Das et al. 2019) on patients suffering from gingivo-buccal oral cancer – the most prevalent form of cancer among males in India – we have also focused on promoter methylation changes and resultant impact on transcription profiles. Here, we have investigated alterations in non-promoter (gene-body) methylation profiles and have carried out an integrative analysis of gene-body methylation and transcriptomic data of oral cancer patients. Methods: Tumor and adjacent normal tissue samples were collected from 40 patients. Data on methylation in the non-promoter (gene-body) regions of genes and transcriptome profiles were generated and analyzed. Because of high dimensionality and highly correlated nature of these data, we have used Random Forest (RF) and other data-analytical methods. Results: Our integrative analysis of non-promoter methylation and transcriptome data has revealed significant methylation-driven alterations in some genes that also significantly impact on their transcription levels. These changes result in enrichment of the Central Carbon Metabolism (CCM) pathway, primarily by dysregulation (overexpression) of (a) NTRK3 , which plays a dual role as an oncogene and a tumor suppressor; (b) SLC7A5 ( LAT1 ) which is a transporter dedicated to essential amino acids, and is overexpressed in cancer cells to meet the increased demand for nutrients that include glucose and essential amino acids; and, (c) EGFR which has been earlier implicated in progression, recurrence, and stemness of oral cancer, but we provide evidence of epigenetic impact on overexpression of this gene for the first time. Conclusions : reprogramming from normal cells enable analysis, we have identified that among oral cancer patients, genes the CCM pathway – that plays a fundamental role in metabolic reprogramming – are significantly dysregulated because of perturbation of methylation in non-promoter regions of the genome. This result compliments our previous result that perturbation of promoter methylation results in significant changes in key genes that regulate the feedback process of DNA methylation for the maintenance of normal cell division.


Background
For various cancers both DNA methylation and gene expression data have been analyzed separately and alterations have been found to be associated with susceptibility and outcome [1,2]. It is well known that DNA methylation impacts on gene expression. Therefore, attempts have been made to perform integrative analyses of these two types of data to draw robust inferences [3]. Various methods of data integration have been used [4,5]. Methylation and expression data are high volume, highly correlated data. Further, the number of genes or DNA regions/sites on which data are collected are orders of magnitude higher than the number of patients and controls (the so-called "large p, small n" problem in statistics). Therefore, no method of analysis has been universally accepted that takes into account data volume, correlation, and interaction. Random forest (RF) is a machine learning inferential method that is dataadaptive and tree-based. It handles correlated and large data sets very efficiently and is, therefore particularly appealing for analysis of high-dimensional genome data.
Normally, only a small portion of a high-dimensional data is associated with a phenotype. A regression framework does not apply to this scenario. The highly correlated nature of genomic data also makes the application of standard statistical models inappropriate. RF is a non-parametric tree-based approach that is particularly suited for such data-analysis problems. RF can also be used to select and rank variables by taking advantage of variable importance measures. A good review of RF in genomic data analysis can be found in [6].
We have used RF methodology to identify gene-body methylation differences between tumor and adjacent normal tissues in patients with oral squamous cell carcinoma of the gingivo-buccal region (OSCC-GB), the most common form of oral cancer in India [7,8].
We then integrated the knowledge thus obtained with data on levels of transcription of genes, which we use as a proxy for gene-expression levels, to discover methylationdriven alterations in the gene-body regions of the genome that significantly associate with dysregulation of genes in oral cancer.

5
DNA methylation occurs predominantly on cytosines followed by guanine residues (CpG). This type of methylation is referred to as CpG methylation. We had earlier analyzed data on methylation in CpG sites in the known promoter regions of all genes, but we had ignored gene-body CpG sites; sites that are on the coding regions of genes [4]. By application of modern data-adaptive method (RF) on gene-body methylation data and subsequent integration with gene expression data, we have identified some dysregulated genes and a pathway that were not identified in our earlier [4] analysis of promoter methylation and expression.

DNA Methylation
Methylation data from paired tumour and adjacent normal tissue samples of 40 OSCC-GB patients were generated using the Illumina Infinium MethylationEPIC BeadChip [4].
Using the R package minfi, we estimated for each CpG site, the CpG-specific methylation level (β-value) as the ratio of the intensity of methylated (M) to the combined intensities of both methylated (M) and unmethylated (U) alleles : where M * and U * denote signal intensities of M and U alleles, respectively, and the constant C set at 100 (as recommended by the BeadChip manufacturer) [4,9,10]. The β-value ranges from 0 (unmethylated) to 1 (methylated). The sites that had a detection p-value ≥ 0.01 and those that mapped to X or Y chromosomes were removed. We further removed from final analysis data on (a) SNP associated probes with minor allele frequency (MAF) > 0.01, (b) probes that mapped to non-coding regions of the genome that are devoid of protein-coding genes, (c) multi-mapped probes, (d) probes that did not map to annotated genes [4,9,11,12], and (e) probes that mapped to 3'UTR region of the genome.

Random Forest Classifier
To analyze the difference between Tumor and Normal samples, a Random Forest (RF) method was used on Methylation data as implemented in the randomForest package in R [9,[11][12][13][14][15][16]. The random forest algorithm is an ensemble classifier similar to Classification and Regression Tree (CART) [14]. Each tree in an RF is built by choosing a bootstrap sample of two-third of the total number of individuals; the remaining onethird (Out-Of-Bag [OOB] sample) is utilised for validation. For each node in a tree, a binary splitting rule is used on a sample of CpG sites from the bootstrap sample to find the best split. The variable with the maximum information gain [17] is selected. A parameter mtry defines the number of variables randomly selected for each node in a tree, and another parameter ntree specifies the number of trees to be built in a forest.
Normally, the value of mtry is taken to be the square root of the number of variables; this is also the default value in the R package. The output of randomForest provides an aggregated misclassification error (OOB error rate), which is estimated from predictions made on the OOB samples, and variable importance, which measures the weighted mean of the improvement in individual trees by each variable [12][13][14]18]. The most reliable variable importance method is "permutation accuracy importance" or "Mean Decrease Accuracy" (MDA) [18,19]. MDA permutes the data of i th variable in the OOB sample and records the permuted OOB error rate. The difference of the original and permuted OOB error rate averaged over the number the trees gives the importance 7 score for i th variable (VIi ) in the random forest [16,[18][19][20]. A high value of MDA implies greater importance of the variable [18,19].  [4]. Before implementing the random forest (RF) classifier, ntree and mtry parameters were tuned to generate an accuracy rate [9,13]. The best performing combination of parameters were those for which the OOB error rate stabilised and reached a minimum; i.e., the combination of parameters with the highest accuracy rate.

Classification of samples
Once the optimum set of parameters was determined, "randomForest" was executed 50 times on the methylation data of 40 paired samples. In each iteration we selected only variables (probes) with MDA-score > 0 [15]. The selected probes were then mapped to their respective genes. A gene was considered for further analyses if it satisfied the following conditions: (a) there were at least two probes in the non-promoter region of the gene, (b) methylation status of all probes in the non-promoter region were unidirectional; either hypermethylated or hypomethylated, and (c) had no probes in the promoter region. The stringency of criteria (a) and (b) were adopted to minimize the chance of false-positive discovery, and the criterion (c) was adopted to make discoveries attributable to gene-body methylation only.

RNA Sequencing
RNA was extracted and RNA sequencing was performed to obtain levels of transcription of genes, on the same set of 40 paired samples. Paired-end libraries were constructed and sequenced using Illumina HiSeq2500 [4,21]. The quality of the RNA-Seq reads was checked by FastQC. TopHat2 [4,[21][22][23] was then used to align these reads to a hg19 reference transcriptome or genome. Multi-mapped reads and non-concordant reads were filtered out using SAMtools [4] and duplicate reads were removed using MarkDuplicates from PICARD [4]. Cufflinks [4,[21][22][23] was then used to assemble and reconstruct the transcriptome. Finally, using Cuffnorm, normalised FPKM values for each gene were estimated [4]. Only those genes that had non-zero levels of transcription levels in all samples were considered for further analysis. We have used the level of transcription of a gene as a proxy for the level of expression of the gene, and have used transcription and expression levels interchangeably in this report.

Integration of Methylation and Transcription data
Those genes for which there was no promoter probe and with multiple probes in the non-promoter region that were uniformly hyper-or hypo-methylated, and for which the level of transcription/expression change between tumour and normal tissues, averaged over the 40 pairs of samples, was higher than two-fold, were identified to be dysregulated by methylation in non-promoter regions [4]. The genes that had 1st exon and exon boundary probes were removed. Finally, we considered only those genes for mapping on pathways that satisfied the known biological directionality of control; genes with hypermethylation (hypomethylation) in the gene-body region in the tumour tissue should have a significantly higher (lower) level of expression in the tumor tissue [24,25].

Enrichment analysis of pathways
Genes that were so identified by the integration of both methylation and expression data were analyzed for enrichment of biological pathways. We considered pathways in KEGG for this analysis. ClueGo and CluePedia plug-ins of Cytoscape were used. To identify whether a pathway in KEGG was significantly enriched, a right-sided test based on hypergeometric distribution was used. Benjamini-Hochberg correction method was used to correct the p-values for multiple testing [4,26].

Integration of Methylation and Gene-Expression
In paired tissues collected from the 40 OSCC-GB patients, non-zero levels of transcription/expression were found for 17,464 genes. Considering the 666 genes that exhibited significant and unidirectional methylation, we found that 132 of these genes showed at least two-fold difference in the level of expression between tumour and normal tissues, averaged over the 40 patients. Of these 132 genes, 8 genes were removed as they had 1st exon and exon boundary probes. However, of these 124 only for 67 (54% ) genes, the direction of change of expression level was consistent with that of methylation change [Additional File 3]. That is, genes with hypermethylation (hypomethylation) in the tumour tissue had significantly higher (lower) levels of expression in the tumor tissue.
Enriched pathway 10 The pathway enrichment analysis using the 67 genes dysregulated by methylation alteration in the gene-body region between tumour and normal tissues, identified enrichment of one significant (corrected p-value = 0.0012) KEGG pathway. This was Central Carbon metabolism in Cancer with three associated genes EGFR, NTRK3, SLC7A5. It has been reported, based on cell line studies, that overexpression of EGFR can impact on the development of solid tumors, including oral cancer [27]. We have found EGFR overexpressed and global hypermethylated.

Discussion
By applying the novel Random Forest data-adaptive method to high-dimensional data (about 500,000 data points per individual) to identify significant alterations in gene-body methylation in gingivo-buccal oral tumor tissue compared to adjacent normal tissue, and subsequent integration with gene expression data we have detected some genes and pathways not earlier inferred to be involved in OSCC-GB only through cell-line studies.
The significantly enriched pathway that has been identified using this data-adaptive and data-integrative approach is the Central Carbon Metabolism (CCM) pathway, which is  [28]. It is noteworthy that enrichment of the CCM pathway in OSCC-GB takes place by gene-body methylation mediated dysregulation of three key genes, EGFR, NTRK3, SLC7A5.
Significant downregulation of NTRK3 mediated by promoter methylation was noted in our earlier study [4]. NTRK3 is a neurotrophin receptor. It behaves as an oncogene in breast cancer [29,30] and possibly also in hepatocellular carcinoma [31].
However, it also plays a dual function. It acts as a tumor suppressor in colorectal cancer in which it is epigenetically inactivated [32]. In OSCC-GB also, NTRK3 is epigenetically dysregulated and appears to behave as a tumor suppressor.
SLC7A5earlier known as LAT1is a transporter dedicated to essential amino acids. Cancer cells have an increased demand for nutrients that include glucose and essential amino acids; the so-called "Warburg effect." Overexpression of SLC7A5, as we have observed here, is explained in part by the presence, in its promoter, of a canonical binding site for the proto-oncogene c-Myc [33] that is known to regulate glucose metabolism [34]. Overexpression of SLC7A5 is also controlled by methylation in the promoter [4] and non-promoter regions (this study).
EGFR has been earlier implicated in progression, recurrence, and stemness of oral cancer [35,36]. EGFR is inappropriately activated in cancer mainly because of amplification and point mutations. Transcriptional upregulation of EGFR due to autocrine/paracrine mechanisms has also been described [37]. Here, we have, for the first time, shown that dysregulation of EGFR takes place by epigenetic mechanisms in oral cancer.

Conclusions
Three key genes NTRK3, SLC7A5 (LAT1) and EGFR were overexpressed in the CCM pathway. Of these, NTRK3 [4,38] and SLC7A5 [4] were earlier identified to be associated with oral cancer. However, we provide the first evidence of epigenetic impact on overexpression of EGFR in oral cancer. To enable enhanced proliferation of cells in a cancer tissue, metabolic reprogramming from normal cells usually takes place. In the present analysis, we have identified that among oral cancer patients, genes in the CCM pathwaythat plays a fundamental role in metabolic reprogrammingare significantly