Knowledge-based genetic association study of hepatitis B virus related hepatocellular carcinoma

Background: Recent genome-wide association studies (GWASs) have suggested several susceptibility loci of hepatitis B virus (HBV)-related hepatocellular carcinoma (HCC) by statistical analysis at individual single-nucleotide polymorphisms (SNPs). However, these loci explain only a small fraction of HBV-related HCC heritability. In the present study, we aimed to identify additional susceptibility loci of HBV-related HCC using advanced gene- and gene-set-based association tests. Methods: We performed a meta-analysis of two existing GWASs of HBV-related HCC, based on which a series of association analyses at genes and multiple gene sets curated according to current knowledge were carried out to prioritize potential risk genes. A series of prioritized SNPs were selected to replicate genetic associations in an independent sample of 965 cases and 923 controls. Results: The gene-based association analysis suggested that five genes are significantly associated with HBV-related HCC risk: RNY4, GOLGA8M, LINC01207, WHAMMP2 and SLC39A8. Through gene-set-based association analysis, we found that the genes in the systemic lupus erythematosus pathway may be relevant to the development of HBV-related HCC. Three previously reported genes, NAT2, GSTA1 and GSTA2, were also highlighted as susceptibility genes of HBV-related HCC when genes were stratified in a liver-specific expression set. However, probably due to the small sample size, none of the genes prioritized by knowledge-based association analyses was successfully replicated in an independent sample. Conclusions: This comprehensive knowledge-based association mining study suggested several promising genes significantly associated with HBV-related HCC risk. More experiments or larger samples are needed to validate their contribution to the pathogenic mechanism of HCC.


Background
Hepatocellular carcinoma (HCC) is one of the most common cancers worldwide. With 750,000 new HCC cases diagnosed each year, it is the third leading cause of cancer mortality.(1) As many as 30% of patients diagnosed with hepatitis, fibrosis or cirrhosis ultimately develop HCC. In high endemic areas such as Africa and Asia, at least 60% of HCC is associated with hepatitis B virus (HBV).(2) However, only a minority of HBV carriers develop HCC. HBV carriers with a family history of HCC were estimated to have over twofold risk for HCC compared with those without a family history of HCC.(3) Furthermore, genetic complex segregation analysis suggested that major genes may be involved in the genetic predisposition to develop HCC at an earlier age.(4)
Genome-wide association study (GWAS) is a widely used strategy for identifying risk loci of complex diseases. Recently, several GWASs on risk of HBV-related HCC were conducted using single-nucleotide polymorphism (SNP)-based statistical association tests. Multiple susceptibility loci were identified, including rs17401966 in intron 24 of KIF1B at 1p36.22, rs7574865 in intron 3 of STAT4 at 2q32.2-32.3, rs9275319 between HLA-DQB1 and HLA-DQA2 at 6p21.3, rs9272105 between HLA-DQA1 and HLA-DRB1 at 6p21.3, and rs455804 in intron 1 of GRIK1 at 21q21.3.(5-7) However, these susceptibility loci account for only a small fraction of the contribution of genetics to HBV-related HCC. Identifying additional genetic alterations associated with HBV-related HCC may be difficult due to the relatively weak effects of many individual risk SNPs, which may be unidentifiable with the currently available, relatively small sample sizes.(8) SNP-based statistical association tests alone in GWAS do not have enough power to discover most risk loci for human complex diseases.
Gene- and biological pathway-based association analysis has been proposed to have superior statistical power compared with conventional statistical tests, as it reduces the multiple-testing burden and enriches signals.(9) Moreover, gene- and biological pathway-based analysis also lends itself to introducing more disease-specific knowledge into the analysis.
In the present study, we performed a gene-based association analysis with meta-analysis p-values from two independent HBV-related HCC GWASs. The gene-based p-values were further evaluated within multiple gene-sets defined according to knowledge of HCC. SNPs within prioritized genes were selected for replication in two independent HBV-related HCC case/control populations.

Methods
Two existing GWASs on HBV-related HCC
The association p-values were obtained from two previous GWASs on HBV-related HCC in Chinese populations for meta-analysis and knowledge-based association analysis. One study (7)  Chinese descent and have participated in previously published studies. (7,11) The study was performed in accordance with guidelines approved by the local ethical committees from all participating centers involved in both the GWAS stage and the replication stage. Informed consent to participate in the study was obtained from each subject in accordance with the principles of the Declaration of Helsinki. All study participants approved the storage of their frozen DNA specimens, for research purposes, in our laboratory.

Genotyping and quality control in replication
Genomic DNA from the peripheral blood of all participants in replication was extracted using the QIAamp DNA Blood Mini Kit (QIAGEN GmbH, Hilden, Germany). Genotyping analyses for replication samples were conducted using the Sequenom MassArray system (Sequenom) according to the manufacturer's instructions. Genotyping quality was examined by a detailed QC procedure consisting of a 95% successful call rate, duplicate calling of genotypes, and the inclusion of internal positive control samples and two water samples (PCR negative controls) in each 96-well plate. Genotype analysis was performed by technicians in a blind fashion.

Meta-analysis of variants
The association p-values of untyped SNPs were imputed directly by the tool FAPI (http://grass.cgs.hku.hk/limx/fapi/)(12) with default settings. The p-values of the two GWASs were then combined, also in FAPI, by Stouffer's Z-score method for meta-analysis, in which $N$ is the number of GWASs, $z_i$ is the individual z-score of the $i$th GWAS, and $n_i$ is the sample size of the $i$th study.
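As an illustration of the combination step, a commonly used sample-size-weighted form of Stouffer's method is $Z = \sum_{i=1}^{N} \sqrt{n_i}\, z_i \,/\, \sqrt{\sum_{i=1}^{N} n_i}$. The weights $\sqrt{n_i}$ are an assumption here; the exact weighting implemented in FAPI is not stated in the text. The function names below are hypothetical, and the snippet is a minimal sketch rather than the FAPI implementation.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_weighted(z_scores, sample_sizes):
    """Combine per-study z-scores using sqrt(sample size) weights.

    The sqrt(n_i) weighting is an assumption for illustration; it is
    not confirmed to be the weighting used by FAPI.
    """
    if not z_scores or len(z_scores) != len(sample_sizes):
        raise ValueError("need one sample size per z-score")
    numerator = sum(sqrt(n) * z for z, n in zip(z_scores, sample_sizes))
    denominator = sqrt(sum(sample_sizes))
    return numerator / denominator


def two_sided_p(z):
    """Two-sided p-value of a standard normal z-score."""
    return 2.0 * NormalDist().cdf(-abs(z))


# Two hypothetical studies of equal size with the same z-score
combined_z = stouffer_weighted([2.0, 2.0], [100, 100])
combined_p = two_sided_p(combined_z)
```

With equal sample sizes the combined statistic reduces to the sum of the z-scores divided by $\sqrt{N}$, so two concordant studies reinforce each other.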

Gene-based and gene-set-based analysis
The knowledge-based secondary analysis platform KGG (http://grass.cgs.hku.hk/limx/kgg/) was used to map the SNPs onto reference genes (UCSC RefGene hg19), and to perform gene-based and gene-set-based association analysis with default settings. The phased genotypes of East Asian samples in the 1000 Genomes Project(13) were used to account for linkage disequilibrium of SNPs through KGG. The Benjamini-Hochberg approach was used to control the false discovery rate (FDR) across genome-wide genes; it is a more powerful multiple-testing approach than Bonferroni correction when there are multiple susceptibility genes.
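The Benjamini-Hochberg adjustment can be sketched as follows. This is a generic illustration of the procedure, not the KGG implementation; the function name and the input p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical gene-based p-values; genes whose adjusted value is below
# the chosen FDR level would be declared significant.
gene_p = [0.01, 0.04, 0.03, 0.005]
gene_q = benjamini_hochberg(gene_p)
```

A Bonferroni correction would instead multiply every p-value by the number of genes, which is never less than the Benjamini-Hochberg adjusted value and is therefore more conservative.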

Results
We first combined the association p-values of variants by meta-analysis from two independent GWASs. Association analyses at genes and multiple knowledge-based gene sets were carried out to prioritize potential HBV-related HCC susceptibility genes. A series of prioritized variants were selected to replicate their genetic associations in a group of independent case-control samples. The overall workflow is shown in Figure 1. According to the gene-based p-values, three genes, RNY4, GOLGA8M and LINC01207, passed the multiple testing correction at an FDR of 0.05 (Table 1). Interestingly, RNY4 and LINC01207 are non-coding RNA genes, which have not been well studied previously. In addition, two genes, WHAMMP2 and SLC39A8, were nearly genome-wide significant (corrected p = 0.054; Table 1). We further annotated the two RNA genes (RNY4 and LINC01207) and the pseudogene WHAMMP2 with known regulatory elements and epigenomic markers using the UCSC genome browser (http://genome.ucsc.edu). While the  Table 2 and Supplementary Table 1).
We also examined the association of recurrent integrated genes by HBV reported in previous studies,(24-27) the genes reported to be genetically associated with HBV-related HCC risk in previous studies, and HCC risk genes defined by COSMIC database

Replication study in independent samples
We attempted to replicate genetic associations at genes prioritized by the above gene-based and gene-set-based analyses in a group of independent HBV-related HCC case-control samples.
In total, 21 SNPs of the prioritized genes were selected according to the stability of their allele sequences in the ancestry-matched reference panel of the 1000 Genomes Project and/or their predicted functional importance by RegulomeDB (http://regulomedb.org/) with regulatory elements. After the genotype quality assessment, two SNPs were excluded because they failed to pass the Hardy-Weinberg equilibrium test (p<0.001).
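The Hardy-Weinberg filter can be illustrated with a one-degree-of-freedom chi-square goodness-of-fit test. The text does not state which Hardy-Weinberg test was used (an exact test is a common alternative), so the choice of test, the function name and the genotype counts below are assumptions for illustration only.

```python
from math import erfc, sqrt


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg equilibrium.

    Takes counts of the three genotypes and returns (chi2, p_value).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: a SNP with no heterozygotes deviates strongly from HWE
chi2_bad, p_bad = hwe_chi_square(50, 0, 50)
exclude_snp = p_bad < 0.001
```

Under this sketch, a SNP would be dropped when its p-value falls below the 0.001 threshold quoted above.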
Three genetic models (additive, dominant and recessive) were considered under a logistic regression framework in which the HCC status was adjusted for sex and age. Overall, the independent sample failed to replicate any significant association from the discovery sample after correction for multiple testing. Only two SNPs, rs389883 and rs17343667, had an association p-value below 0.05. The SNP rs389883, which lies in an intronic region of STK19, had p-values of 0.026 and 0.032 for HCC association under the additive and recessive models, respectively, with a protective effect at the minor allele G. However, in the original Qidong GWAS sample and Hong Kong GWAS sample, G was estimated to have a risk effect. The other SNP, rs17343667, which is located in the first intron of EIF2AK1, had an association p-value of 0.02 under the additive model with an odds ratio of 1.27 for the minor allele, which was found to have a risk effect in both the original Qidong and Hong Kong GWAS samples (Table 3). In addition, the regulatory potential of rs17343667 was supported by expression quantitative trait locus (eQTL) and TF binding/DNase peak evidence (scored 1f) in RegulomeDB (see details in Supplementary Figure 6).
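The three genetic models differ only in how a genotype is turned into the predictor entered into the logistic regression. The conventional codings can be sketched as below; the exact coding used in the study is not given in the text, so the function name and scheme are assumptions, and fitting the regression with sex and age covariates is not shown.

```python
def code_genotype(minor_allele_count, model):
    """Encode a genotype (0, 1 or 2 minor alleles) under a genetic model."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor allele count must be 0, 1 or 2")
    if model == "additive":
        # Risk changes with each additional copy of the minor allele
        return minor_allele_count
    if model == "dominant":
        # One copy is enough to carry the effect
        return int(minor_allele_count >= 1)
    if model == "recessive":
        # Two copies are required to carry the effect
        return int(minor_allele_count == 2)
    raise ValueError("unknown model: " + model)


# Coding of a heterozygote under each of the three models
heterozygote = {m: code_genotype(1, m) for m in ("additive", "dominant", "recessive")}
```

In a logistic model, the exponential of the fitted coefficient for this predictor gives the odds ratio, per copy of the minor allele in the additive case.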

Discussion
This study utilized knowledge-based approaches to mine new susceptibility loci of HBV-related HCC in existing HBV-related HCC GWAS data sets. The gene-based association analysis suggested five statistically significant genes: RNY4, GOLGA8M, LINC01207, WHAMMP2 and SLC39A8. The gene-set-based association analysis implied that genes in the SLE pathway may be relevant to the development of HCC. In addition, three genes, NAT2, GSTA1 and GSTA2, were also highlighted when genes were stratified in some functional sets. Furthermore, our analysis also suggested that the germline susceptibility loci of HBV-related HCC are unlikely to be enriched in recurrent targeted genes of HBV infection, or HCC risk genes with many somatic mutations. However, probably due to the small size of our replication samples, none of the associations prioritized by the knowledge-based association analysis was successfully replicated in an independent sample. The SNP rs17343667 of EIF2AK1 is the only one with suggestive significance.
Our study is the first to indicate that these five genes (RNY4, GOLGA8M, LINC01207, WHAMMP2 and SLC39A8, which were discovered by gene-based association analysis) are relevant to the development of HBV-related HCC. For RNY4, GOLGA8M and WHAMMP2, there are, to our knowledge, no previous publications about their roles in risk of HCC or other cancers. LINC01207 has been implicated as a biomarker for survival in colorectal adenocarcinoma (28) and as a promoter of proliferation in lung adenocarcinoma. (29) SLC39A8 has been reported to regulate IFN-γ level in T cells (30) and influence trace element homeostasis in liver, (31,32) which may be relevant to the development of HCC.
Functional studies are warranted to explore the mechanisms of the potential roles of these genes in risk of HBV-related HCC.
Interestingly, our finding that the SLE pathway-related genes may be relevant to the development of HBV-related HCC is supported by a recent meta-analysis involving 59,662 SLE patients, which suggested that SLE had a relative risk of 3.21 (95% CI, 1.70-6.05) for liver cancer. (33) In addition, studies have found that a number of risk genes are shared by SLE and HBV-related HCC, such as STAT4 and genes in the HLA region. (7,34) Our results may further explain the comorbidity of the two diseases from a genetics aspect.
The three genes, NAT2, GSTA1 and GSTA2, which are highly expressed in liver, have been previously suggested to be relevant to HCC risk. Both Gelatti et al. (35) and Yu et al. (36) observed a significant association between NAT2 genetic polymorphisms and HCC susceptibility among chronic HBV carriers who were smokers. Huang et al. (37) found that the NAT2 gene polymorphisms may confer different susceptibilities to the effect of red meat intake on HCC. GSTA1 polymorphism was suggested to be associated with an increased risk of occurrence of HCC, and decreased expression of GSTA1 was considered as a marker of advanced and highly aggressive HCC.(38) GSTA1 polymorphism was also reported to correlate with both GSTA1 and GSTA2 expression in the liver, which is expected to be of significance for individual risk of cancer or individual response to chemotherapeutic agents. (39)
The negative findings in all curated gene sets were unexpected. Particularly, three gene sets (recurrent targeted genes of HBV infection, HCC risk genes with many somatic mutations and genes highly and specifically expressed in human liver) appeared to be very biologically relevant to the development of HCC. In the analyses, there were no trends that genes with smaller HCC association p-values were enriched in the gene sets.
These results suggest that the biological context or connection of underlying susceptibility genes is elusive, and that it is difficult to use our current knowledge to identify the unknown susceptibility genes of HCC. Using larger sample sizes for hypothesis-free GWASs is likely the only reliable way for identification of HCC risk genes at present.
The SNP rs17343667 in EIF2AK1 is a promising candidate susceptibility variant, although it has only a suggestively significant p-value in the small replication samples. In RegulomeDB, this SNP is a cis-eQTL in lymphoblastoid cells and is located within DNase peaks and histone modifications of multiple tissues and cell types. In the HaploReg (v4.1) database, this SNP is located within multiple regulatory elements, such as histone marks, DNase sites and transcription factor motifs. EIF2AK1 encodes a kinase that acts on translation initiation to downregulate protein synthesis in response to stress. Previous studies suggested that EIF2AK1 mRNA and protein were overexpressed and the kinase activity was enhanced in HCC. (40,41)

Conclusion
We performed the first systematic gene- and gene-set-based association study of HCC. Our study suggested several promising genes significantly associated with HCC risk, which may shed light on the pathogenic mechanisms of this fatal disorder. However, the negative associations in multiple curated gene sets also imply that it is difficult to infer gene associations using our current biological knowledge. More hypothesis-free genetic studies with larger sample sizes are needed to elucidate the susceptibility genes and mechanisms of HCC.