Statistical and Bioinformatic Analysis of Hemimethylation Patterns in Lung Cancer

Background: DNA methylation is an epigenetic event involving the addition of a methyl group to the cytosine of a cytosine-guanine dinucleotide (i.e., a CpG site). It is associated with different cancers. Our research focuses on studying lung cancer hemimethylation, which refers to methylation occurring on only one of the two DNA strands. Many studies assume that methylation occurs on both DNA strands at a CpG site. However, recent publications show the existence of hemimethylation and its impact. It is important to identify cancer hemimethylation patterns.
Methods: In this paper, we use the Wilcoxon signed rank test to identify hemimethylated CpG sites based on publicly available lung cancer methylation sequencing data. We then identify two types of hemimethylated CpG clusters, regular and polarity clusters, and genes with large numbers of hemimethylated sites. Highly hemimethylated genes are then studied for their biological interactions using available bioinformatics tools.
Results: In this paper, we have conducted the first-ever in-depth investigation of hemimethylation for lung cancer. Our results show that hemimethylation does exist in lung cells either as singletons or clusters. Most clusters contain only 2 or 3 CpG sites. Polarity clusters are much shorter than regular clusters and appear less frequently. The majority of clusters found in tumor samples have no overlap with clusters found in normal samples, and vice versa. Several genes that are known to be associated with cancer are hemimethylated differently between the cancerous and normal samples. Furthermore, highly hemimethylated genes exhibit many different interactions with other genes that may be associated with cancer. Hemimethylation has diverse patterns and frequencies that are comparable between normal and tumorous cells. Therefore, hemimethylation may be related to both normal and tumor cell development.
Conclusions: Our research has identified CpG clusters and genes that are hemimethylated in normal and lung tumor samples. Due to the potential impact of hemimethylation on gene expression and cell function, these clusters and genes may be important to advance our understanding of the development and progression of lung cancer.


Background
Lung cancer is a leading cause of death in the United States; more patients die of lung cancer than of breast, prostate, and colon cancers combined. The American Cancer Society predicts that in 2020 alone there will be 228,820 new cases of lung cancer diagnosed and 135,720 deaths in the United States [1].
The five-year survival rate of lung cancer is much lower than that of many other prominent cancers such as breast, colorectal, and prostate cancer, as only 19.4% of patients survive five years or more with the disease. The rate of survival can be as high as 57.4% when the cancer is still localized. However, the majority (57%) of patients are diagnosed in late stages, and the five-year survival rate of these patients is only 5.2% [2].
In order to diagnose and treat lung cancer, it is often important to identify genetic and epigenetic biomarkers. One important epigenetic biomarker or event is DNA methylation. It is the covalent bonding of a methyl group (-CH3) to a CpG site in a mammalian cell; this is an epigenetic alteration to the DNA, meaning the DNA sequence does not change. A CpG site is the nucleotide base cytosine bonded to the base guanine by a phosphate (5'-CpG-3') [3]. The correlation between methylation patterns on specific CpG sites and gene expression has been studied, as methylation enhances or mutes particular genes [4].
DNA methylation patterns are maintained and changed mainly through two dynamic processes: methylation maintenance and de novo methylation [5,6]. Methylation maintenance allows for preservation of methylation patterns across replication generations, maintaining valuable methylation levels. De novo methylation occurs on symmetrically unmethylated CpG sites and increases methylation levels over cell generations [5]. Demethylation is the action of a methyl group being removed from a CpG site, and it can be observed in two forms: passive and active [6]. Passive demethylation is an error during maintenance methylation, resulting in bare or hypomethylated CpG sites on the nascent strand whilst the parent strand is methylated. In contrast, active demethylation is the deliberate removal of a methyl group from a CpG site [7]. Both demethylation and de novo methylation can contribute to the development of hemimethylation, which means that methylation occurs on only one DNA strand of a CpG site and not the other, see Fig. 1. The exact process by which the methyl groups are lost or gained is not well understood but may one day provide new insight into the development of cancerous cells [5].
Hemimethylation is a particular kind of methylation pattern. If a CpG is methylated on the forward strand but not on the reverse strand, it is defined as an "MU" hemimethylation site. If a CpG is methylated on the reverse strand but not on the forward strand, it is defined as a "UM" hemimethylation site. If a CpG exhibits no significant hemimethylation, it is defined as an "NS" site. If no data is available to be analyzed at a CpG, that site is defined as an "NA" site. Hemimethylation occurs not only at solitary CpG sites, but also in consecutive ones, known as hemimethylation clusters. Such clusters can manifest in one of two distinct patterns: regular or polar [5,7-9] (see Fig. 1). A regular cluster can be observed when sequential CpG sites are methylated on the same strand and unmethylated on the other. A polar or polarity cluster occurs when consecutive CpG sites are methylated on opposite strands. Ehrlich and Lacey found that studying hemimethylation provided valuable insight into cancer-associated active demethylation [5].
Exactly how the different hemimethylation patterns affect gene expression is not yet well documented, except for a recent finding that correlates gene body hemimethylation with an increase in transcription [10].
Hemimethylation may be closely related to hypermethylation and hypomethylation patterns found in the genome. When hypermethylation occurs in CpG islands proximal to the promoter regions of a gene, it is more likely to repress the gene [11]. In contrast, DNA hypomethylation is observed when CpG sites lose the methyl group and become less methylated or bare. Hypomethylation or unmethylation may be associated with increased gene expression [4].
Abnormal hypermethylation and hypomethylation are commonly known characteristics of cancerous cells. Identification of these different states of methylation can assist in the detection of cancerous cells long before they would appear in clinical examinations or before symptoms are apparent. This identification is useful because many patients are not diagnosed until the tumor is at an incurable stage [12]. In addition, DNA methylation patterns and levels can vary as cancer progresses [7]. For example, hemimethylation as a transitional state or indicator of hypomethylation and hypermethylation can help medical researchers to monitor how far the disease has progressed. Knowledge of such indicators, including hemimethylation, could allow for better comprehension of certain cancers and provide better insight toward the development of treatment methods. Furthermore, the possibility exists for future manipulation of methylation levels to interact with epigenetic changes associated with cancer and imprinting disorders [13].
Identifying the methylation pattern difference between normal and tumorous cells may indicate a given set of genes influencing carcinogenic development. Hemimethylation, the transitional state of DNA methylation, may be correlated with gene expression levels of certain genes. The hemimethylation of carcinogenic genes may be related to cancer development and progression. Therefore, it is important to investigate cancer hemimethylation. Hemimethylation has been researched previously for breast cancer cell lines [8], but not yet for lung cancer. Lung cancer is a strong candidate for this investigation as it is challenging to detect. Its symptoms are often obscure or easily mistaken for the consequences of previous smoking habits. Utilizing hemimethylation markers to identify carcinogenic development may be beneficial in lung cancer diagnosis.
The purpose of this research study is to identify hemimethylation patterns in both normal and tumorous lung cells using publicly available methylation sequencing data. In the Methods section, the statistical and computational analysis will be described, as well as the hemimethylation terminologies used throughout the paper. In the Results section, the outcomes of a variety of comparisons are evaluated and displayed, followed by biological mapping of highly hemimethylated genes and analysis of their relationships to cellular functions. All of these analyses guide towards the discussion of hemimethylation patterns in carcinogenic development and the conclusion of this study.

Methods
In this study, samples are obtained from lung tumor and adjacent normal tissues of 18 male non-small cell lung cancer patients in their fifties to seventies. The reduced representation bisulfite sequencing datasets of these patients are publicly available [14]. Sequencing reads are aligned to the hg38 reference genome and methylation signals are obtained using the BRAT-bw software package [15]. All methylation datasets consist of the methylation level of each CpG site, based on the number of sequencing reads that are methylated and unmethylated. Further analysis is then performed on CpG sites that exhibit at least 4 samples with non-NA methylation levels for both strands.
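The coverage filter described above can be illustrated with a minimal Python sketch. It is not the authors' code (their R files are not shown here). It assumes that the methylation level is the fraction of methylated reads, that a site with no reads on a strand is NA (represented as `None`), and that "at least 4 samples" means at least 4 samples in which both strands have a non-NA level. All function names are hypothetical.

```python
from typing import List, Optional, Tuple


def methylation_level(methylated: int, unmethylated: int) -> Optional[float]:
    """Fraction of reads that are methylated; None (NA) when no reads cover the site."""
    total = methylated + unmethylated
    return methylated / total if total > 0 else None


def paired_levels(forward: List[Optional[float]],
                  reverse: List[Optional[float]]) -> List[Tuple[float, float]]:
    """Keep only the samples in which both strands have a non-NA level."""
    return [(f, r) for f, r in zip(forward, reverse)
            if f is not None and r is not None]


def has_enough_samples(forward, reverse, min_samples: int = 4) -> bool:
    """Assumed reading of the filter: >= min_samples samples with both strands covered."""
    return len(paired_levels(forward, reverse)) >= min_samples
```

For example, a site whose forward and reverse strands are both covered in only three of five samples would be excluded under the default threshold of 4.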
The Wilcoxon signed rank test is used to investigate whether hemimethylation exists at each CpG site. This test is selected instead of a regular t-test because the independence and normality assumptions are not satisfied. The null hypothesis is that there is no difference in methylation level between the forward and reverse strands at a given CpG site.
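To make the per-site test concrete, the following is a minimal sketch of an exact two-sided Wilcoxon signed rank test written with the Python standard library only. It is an illustration under stated assumptions, not the procedure used in the paper: zero differences are dropped, tied absolute differences receive midranks, and the null distribution is enumerated exactly. The function name is hypothetical.

```python
def wilcoxon_signed_rank(forward, reverse):
    """Exact two-sided Wilcoxon signed rank test for paired samples.

    Returns (W_plus, p_value). Zero differences are dropped and tied
    absolute differences receive midranks (assumptions of this sketch).
    """
    diffs = [f - r for f, r in zip(forward, reverse) if f != r]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Midranks of |d|, stored doubled so that they remain integers.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # twice the average of ranks i+1..j+1
        i = j + 1

    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    total2 = sum(ranks2)

    # Null distribution of doubled W+: each rank is positive with probability 1/2.
    counts = [0] * (total2 + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total2, r - 1, -1):
            counts[s] += counts[s - r]

    # Two-sided p-value: outcomes at least as far from the null mean as observed.
    deviation = abs(2 * w_plus2 - total2)
    extreme = sum(c for s, c in enumerate(counts)
                  if abs(2 * s - total2) >= deviation)
    return w_plus2 / 2, extreme / 2 ** n
```

In this exact formulation the smallest attainable two-sided p-value is 2/2^n, so at least six non-zero paired differences are needed for a site to fall below 0.05. Statistical packages may instead switch to a normal approximation, particularly when ties are present, so p-values from this sketch need not match theirs exactly.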
The test results are filtered based on two metrics: the mean difference between forward and reverse strand methylation levels and the Wilcoxon test p-value. CpG sites with a large mean difference in methylation levels and a p-value that is less than 0.05 are identified as hemimethylated CpG sites. Significant CpG sites are annotated to show which gene promoter region or gene body they belong to. Additionally, clusters of CpG sites are identified as either regular or polar patterns. For example, MM-UU and MU-UM are regular and polar clusters, respectively. MM-UU means that at two consecutive CpG sites, methylation occurs on the positive strand (i.e., MM) but not on the reverse strand (i.e., UU). MU-UM means that at two consecutive CpG sites, on the positive strand they are methylated (M) and unmethylated (U) respectively (i.e., MU), and on the reverse strand they are unmethylated (U) and methylated (M) respectively (i.e., UM). The lengths of all clusters in tumor and normal cells are shown in histograms. The percentages of CpG sites in regular clusters, polar clusters, and singleton points are summarized. For sites in clusters, the tumor and normal clusters are compared and overlapping clusters are identified. The key results are presented in the Results section.
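The filtering and cluster labelling described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation. It assumes that a cluster is a run of two or more consecutive hemimethylated CpG sites on one chromosome, that any NS or NA site ends a run, that a run with a single label is regular and any mixed run is polar, and that cluster length is the last position minus the first. The 0.4 cutoff is one of the cutoffs reported in the paper; the function names are hypothetical.

```python
def classify_site(forward, reverse, p_value, min_diff=0.4, alpha=0.05):
    """Label one CpG site as 'MU', 'UM', 'NS', or 'NA'."""
    pairs = [(f, r) for f, r in zip(forward, reverse)
             if f is not None and r is not None]
    if not pairs or p_value is None:
        return "NA"
    mean_diff = sum(f - r for f, r in pairs) / len(pairs)
    if p_value < alpha and abs(mean_diff) >= min_diff:
        return "MU" if mean_diff > 0 else "UM"
    return "NS"


def find_clusters(sites):
    """Find runs of >= 2 consecutive hemimethylated sites.

    `sites` is a list of (position, label) tuples sorted by position
    along one chromosome.
    """
    clusters, run = [], []
    for pos, label in list(sites) + [(None, "NS")]:  # sentinel flushes the last run
        if label in ("MU", "UM"):
            run.append((pos, label))
            continue
        if len(run) >= 2:
            labels = [lab for _, lab in run]
            forward = "".join(lab[0] for lab in labels)
            reverse = "".join(lab[1] for lab in labels)
            clusters.append({
                "start": run[0][0],
                "end": run[-1][0],
                "length": run[-1][0] - run[0][0],
                "pattern": f"{forward}-{reverse}",
                "kind": "regular" if len(set(labels)) == 1 else "polar",
            })
        run = []
    return clusters
```

With this convention, two adjacent MU sites yield the pattern MM-UU (regular), while an MU site followed by a UM site yields MU-UM (polar), matching the notation used in the text.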

Results
Hemimethylated CpG sites for both normal and tumor cells are identified using the Wilcoxon tests. Table 1 describes the proportions of hemimethylation sites that are in clusters depending on the p-value (p < 0.05) and three mean difference cutoff values. The CpG sites that are not in clusters are called singletons. There are similar numbers of hemimethylation sites in tumor and normal samples, but the proportion in clusters is slightly higher in normal samples. For the rest of this paper, our analysis will focus on the hemimethylation sites identified based on the p-value of 0.05 and the absolute mean difference greater than or equal to 0.4.
Tumor and normal samples' hemimethylation clusters are compared in Table 3. This table shows that most clusters only have two or three CpG sites and cluster frequency decreases with increased cluster length, meaning large congregations of hemimethylation are infrequent.
Table 3 Normal and tumor hemimethylation cluster patterns. The first column is the cluster pattern, separating forward and reverse strands by "-". The second and third columns are the counts of such patterns in normal and tumor samples respectively.
The length of a cluster is defined as the total number of base pairs between the first and the last CpG sites in the cluster. Figure 2 shows 4 histograms of cluster lengths. These histograms display the length distributions of polarity patterns in tumor, polarity patterns in normal, regular patterns in tumor, and regular patterns in normal samples. Regular and polarity patterns are analyzed separately because polarity clusters tend to be much shorter. In fact, many of the polarity clusters are less than 40 base pairs long and a majority of them are less than 10 base pairs long (see peaks in the top panels of Fig. 2). Many of the regular clusters are relatively short, i.e., less than 60 base pairs long, but a small number of them are longer than that, with a maximum length of around 100 to 120 base pairs.
For the two main hemimethylation cluster patterns, regular cluster and polarity cluster, we summarize them in detail in Table 4 and Table 5. Table 4 describes the proportions of different regular clusters in normal and tumor DNA. Table 5 describes the proportions of different polarity patterns in normal and tumor DNA. Polarity clusters appear less frequently than regular patterns, as seen by the difference in the number of sites between Tables 4 and 5. For example, tumor samples have a total of 477 regular clusters and only 36 polar clusters.
One way to detect which clusters may be related to cancer is to compare the cluster locations between tumor DNA and normal DNA. Some clusters may appear in the same sites in both tumor and normal samples, but others may be found only in tumor or only in normal. We compare the 513 tumor clusters with the 583 normal clusters and summarize the results in Table 6. This table shows that multiple kinds of overlaps can be found between tumor and normal. Hemimethylation clusters that occur only in tumor or normal samples are shown in Column B. 695 (313 tumor only and 382 normal only) clusters fall into these categories, and these are the clusters or regions that may be associated with cancer. Column C counts the number of clusters that are exactly the same for normal and tumor. Column D indicates the situations in which a tumor cluster begins and ends within a normal cluster (i.e., tumor cluster contained within the bounds of a normal cluster), and vice versa as shown in Column E. For example, a tumor cluster's start and end positions on a chromosome are 150 and 170 base pairs. It is located within a normal cluster that has the start and end positions of 120 and 190 base pairs. Column D, which represents tumor clusters that are embedded in normal clusters, shows different counts for normal and tumor samples because there are two instances of multiple normal clusters located in one tumor cluster.
Similarly, Column E, which represents normal clusters that are embedded in tumor clusters, shows different counts because there are three tumor clusters that are located in one normal cluster. Column F represents all other kinds of overlap. For example, there are two normal clusters that have some overlap with the same tumor cluster.
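The overlap categories of Table 6 can be expressed as a small interval comparison. The sketch below is an assumption-laden illustration, not the authors' code: clusters are treated as closed (start, end) intervals on the same chromosome, and the category names ("only", "identical", "within", "contains", "partial") are hypothetical labels for Columns B to F. When a cluster overlaps several clusters in the other group, the first matching category in that order is reported.

```python
from collections import Counter


def overlap_category(cluster, others):
    """Classify one (start, end) cluster against the clusters of the other group."""
    start, end = cluster
    hits = [(s, e) for s, e in others if s <= end and start <= e]
    if not hits:
        return "only"        # Column B: no overlap with the other group
    if (start, end) in hits:
        return "identical"   # Column C: same start and end
    if any(s <= start and end <= e for s, e in hits):
        return "within"      # embedded in a longer cluster of the other group
    if any(start <= s and e <= end for s, e in hits):
        return "contains"    # covers a shorter cluster of the other group
    return "partial"         # Column F: any other kind of overlap


def summarize_overlaps(clusters, others):
    """Count how many clusters fall into each overlap category."""
    return Counter(overlap_category(c, others) for c in clusters)
```

Using the example from the text, a tumor cluster spanning positions 150 to 170 compared against a normal cluster spanning 120 to 190 is classified as "within", and the same normal cluster compared against the tumor cluster is classified as "contains".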
The second row of Table 6 shows that among the 513 tumor clusters, 313 of them belong to tumor only; 140 clusters also show up in normal samples; 25 tumor clusters are short ones and they are located within long normal clusters; 23 tumor clusters are long ones in which short normal clusters are located; and 12 tumor clusters are partially overlapped with normal clusters. The third row of Table 6 shows that among the 583 normal clusters, 382 of them belong to normal only; 140 clusters also show up in tumor samples; 23 normal clusters are long ones and they cover short tumor clusters; 25 normal clusters are short ones and they are located within long tumor clusters; and 13 normal clusters are partially overlapped with tumor clusters.
After identifying hemimethylated CpG sites, we may also map them back to genes. That is, we annotate each hemimethylated CpG site with the name of the gene in whose gene body or promoter region the site is located. We call this analysis gene annotation, and summarizing it gives the number of hemimethylated CpG sites that each gene has. This annotation analysis is important because highly hemimethylated genes may play an important role. Table 7 shows the frequency of hemimethylated CpG sites in gene bodies. Each column shows how many genes have n hemimethylated CpG sites in their gene bodies, where n is given in the first row. The second row describes the distribution for tumor genes and the third row describes the distribution for normal genes. Similarly, Table 8 describes the frequency of hemimethylated CpG sites in promoter regions. Table 7 shows that the large majority of gene bodies have at most 3 hemimethylated CpG sites in both tumor and normal samples, but a few have more than 10. For promoter regions, Table 8 shows that none have 10 or more and the large majority of genes have 1 or 2 hemimethylated CpG sites.
With the gene annotation analysis, we can identify genes that have relatively more hemimethylation sites. In particular, we select the genes that have at least 5 hemimethylation sites in tumor only, in normal only, and in both normal and tumor samples. These genes are summarized in Tables 9, 10, and 11 respectively. In each of these tables, the first column is the gene name, the second column is the number of hemimethylation sites belonging to this gene, and the third column is the description of this gene. Terms in the tables that are followed by * are gene families, e.g., transcription factor and oncogene families. Otherwise they are general gene descriptions. The description and gene family of each gene are summarized based on the Molecular Signature Database [16] and the GeneCards database [17].
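The counting step behind the gene frequency tables and the selection of highly hemimethylated genes can be sketched in a few lines. This is only an illustration under the assumption that each hemimethylated CpG site has already been annotated with one gene name. The function name and the placeholder gene names in the usage note are hypothetical.

```python
from collections import Counter


def hemimethylation_by_gene(site_genes, min_sites=5):
    """Summarize hemimethylated sites per gene.

    `site_genes` holds one gene name per annotated hemimethylated CpG site.
    Returns (sites per gene, number of genes having n sites, genes with
    at least `min_sites` sites in alphabetical order).
    """
    per_gene = Counter(site_genes)
    frequency = Counter(per_gene.values())
    highly = sorted(g for g, n in per_gene.items() if n >= min_sites)
    return per_gene, frequency, highly
```

For instance, calling the function on six sites annotated to a placeholder `GENE_A`, two to `GENE_B`, and one to `GENE_C` reports one gene each with 6, 2, and 1 sites, and only `GENE_A` passes the threshold of 5.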
All three gene tables have some transcription factor genes, which may affect the gene expression of other cancer-related genes that are not found to be hemimethylated.
Table 9 Genes with ≥ 5 hemimethylation sites in tumor samples. The "*" beside certain genes indicates a specific gene family (e.g., transcription factor or oncogene family) that a gene belongs to.
Table 10 Genes with ≥ 5 hemimethylation sites in normal samples. The "*" beside certain genes indicates a specific gene family (e.g., the transcription factor family) that a gene belongs to.
Table 11 Genes with ≥ 5 hemimethylation sites identical in both tumor and normal samples. The "*" beside certain genes indicates a specific gene family (e.g., transcription factor or oncogene family) that a gene belongs to.
Figure 3 describes the different types of biological relationships between genes based on the CPDB software. A gene with a black label is known to be hemimethylated (i.e., identified by our analysis) and a gene with a purple label is a gene that is not in our hemimethylation gene list but interacts with one of the known genes. Figure 3 is the legend for Figs. 4, 5, and 6. This legend figure summarizes the relationships for gene lists in Tables 9, 10, and 11 as shown in Figs. 4, 5, and 6, respectively. These figures show the extent to which these highly hemimethylated genes interact and possibly affect the cell function of related genes.
Figure 4 shows genetic interactions between genes with the most hemimethylation in tumor samples, and these genes are recorded in Table 9. The gene network in Fig. 4 contains a number of hub genes with complex interactions. These hub genes include GNAS, NFATC1, NOTCH1, MAPK1, HDAC4, TP73, and EGR1. We can see that if a hub gene like MAPK1 is hemimethylated, it may interact with dozens of other genes. Some of these genes, e.g., EGR1 [31-34] and UNC5B [35-38], are known to be associated with cancer.
Figure 5 shows genetic interactions between genes with the most hemimethylation in normal DNA, and these genes are recorded in Table 10. In this figure, we can see that GNAS is a hub gene interacting with many other genes that may not be hemimethylated themselves. GNAS is observed in both tumor and normal samples, as well as in the hemimethylation study for breast cancer cell lines [8]. MEIS1 is also a hub gene that interacts with genes like KMT2A [39] and TK1 [40]. While these genes are not hemimethylated in our samples, they are known to be associated with cancer. MEIS1 plays a crucial role in normal development [17] and it is also reported as an important gene related to leukemia [41-43]. Therefore, it is possible that the hemimethylation of hub genes like MEIS1 affects protein, biochemical, or regulatory functions of genes that are associated with cancer.
Figure 6 shows genetic interactions between genes with the most hemimethylation on identical locations in tumor and normal samples. These genes are recorded in Table 11. This means that the hemimethylation of CpG sites in this network is unchanged or unaffected by the formation of cancer. The HNRNPL gene is a major hub in this gene network. While we do not detect any hemimethylation in this gene, it directly interacts with 10 genes that we know to be hemimethylated. Some of these genes, like PTPRN2 and MAD1L1, can also be found in the tumor gene network (see Fig. 4). There appear to be no genes in common between Fig. 5 (hemimethylated genes in normal samples) and Fig. 6 (hemimethylated genes in both tumor and normal samples). Therefore, genes that have a large number of hemimethylated CpG sites found only in normal DNA seem to have very few CpG sites that remain the same when cancer forms.

Discussion
The original sequencing datasets are generated via the reduced representation bisulfite sequencing protocol. Because this sequencing method covers only a small percentage of the whole genome, there are many NA entries as shown in Table 2. A more thorough sequencing method like whole-genome bisulfite sequencing, which can provide methylation signals on all CpG sites in a genome, will help us see a clear picture of hemimethylation patterns in an entire genome.
The p-value used in these results is 0.05 and the mean difference cutoff values, 0.4, 0.6 and 0.8, are predetermined based on our previous research [8]. Results are narrowed down to the 0.4 cutoff level to allow more results to be viewed, as the higher cutoff values restrict the number of hemimethylated CpG sites that can be identified. The number of both tumor and normal clusters detected decreases rapidly as we increase the mean difference cutoff value at each CpG site, as shown in Table 2. With stricter criteria, a larger methylation difference between the two DNA strands is required before a CpG site is considered hemimethylated. This rapid decrease may indicate certain hemimethylation heterogeneity in lung cancer, as cancer methylation patterns are generally heterogeneous among multiple patients or cell lines [44].
Of the 41 most hemimethylated genes in lung cancer tumors, 7 are also highly hemimethylated in breast cancer cell lines, as reported by Sun et al. [8]. These seven genes are PRDM16, GNAS, PTPRN2, MAD1L1, HDAC4, NOTCH1, and CACNA1H. The remaining 34 highly hemimethylated genes in the lung tumor samples are not highly hemimethylated in breast cancer cell lines. It is possible that these genes are unique to lung cancer and as a result would be helpful when diagnosing patients with lung cancer specifically, but further research needs to be done.
In hemimethylation research on breast cancer cell lines, the frequency of polarity clusters is much higher than the one in this paper. The results of the breast cancer hemimethylation analysis indicate that polarity clusters are more frequently found than regular clusters [8]. However, the lung cancer analysis reflects contrasting results; polarity clusters are less frequently found than regular clusters for both tumor and normal samples, as shown in Fig. 2, Table 4, and Table 5. The source of data may have some influence on this. The breast cancer study is performed using breast cancer cell lines, which are tumors grown in labs over a long period of time, whereas this study uses primary tissues directly from lung cancer patients. Another factor could be the type of cancer, as hemimethylation pattern frequency may be tissue specific. Future research into the biological impact of hemimethylation on different kinds of cells may allow for more insight into these differences and their effects.

Conclusion
In this paper, we have conducted the first-ever in-depth investigation of hemimethylation for lung cancer. In particular, we have conducted statistical analyses to identify hemimethylation patterns for lung cancer patients. We have identified both singleton hemimethylation sites and different clusters in normal and tumor cells. We have also conducted bioinformatic analysis on the genes that have relatively more hemimethylated sites in tumor, normal, and both tumor and normal cells to see the biological interactions of these genes. Our results show that hemimethylation not only exists in lung cells, but also has diverse patterns and frequencies that are comparable between normal and tumorous cells. We conclude that hemimethylation is related to both normal and tumor cell development. This is also seen by its existence in the same genes in normal and lung tumor cells. However, there are certain genes that only have hemimethylated sites for one type of cell, normal or tumor, but not both. Some of these genes are previously known to be associated with carcinogenesis and are hemimethylated in one cell type and not the other. Hemimethylation existing in this way may imply epigenetic changes in certain genes associated with lung cancer. The development and progression of lung cancer may be tracked by the analysis of epigenetic change (i.e., hemimethylation and methylation) in these regions.

Abbreviations
CpG
The shorthand notation for 5'-cytosine-phosphate-guanine-3'.

MU
When it is for one CpG site on two DNA strands, it refers to a hemimethylated CpG site with methylation (M) on the positive strand and unmethylation (U) on the reverse strand. When it refers to two consecutive CpG sites on one DNA strand, it means that methylation occurs on the first CpG site (i.e., M), but not on the second one (i.e., U).

UM
When it is for one CpG site on two DNA strands, it refers to a hemimethylated CpG site with unmethylation (U) on the positive strand and methylation (M) on the reverse strand. When it refers to two consecutive CpG sites on one DNA strand, it means that methylation does not occur on the first CpG site (i.e., U), but occurs on the second one (i.e., M).

NS
A CpG site identified as not significantly (NS) hemimethylated.

Declarations
Ethics approval and consent to participate
No ethics approval is required for the study.

Consent for publication
Not applicable.

Availability of data and materials
Datasets analyzed for this study are publicly available (SRP125064) and can be downloaded from this web page: https://www.ncbi.nlm.nih.gov/sra/SRP125064. R code files are available upon request.
Competing interests