Discovery of breast cancer risk genes and establishment of a prediction model based on estrogen metabolism regulation

Background Multiple common variants identified by genome-wide association studies have shown limited evidence of the risk of breast cancer in Chinese individuals. In this study, we aimed to uncover the relationship between estrogen levels and the genetic polymorphism of estrogen metabolism-related enzymes in breast cancer (BC) and establish a risk prediction model composed of estrogen-metabolizing enzyme genes and GWAS-identified breast cancer-related genes based on a polygenic risk score.
Methods Unrelated BC patients and healthy subjects were recruited for analysis of estrogen levels and single nucleotide polymorphisms (SNPs) in genes encoding estrogen metabolism-related enzymes. The polygenic risk score (PRS) was used to explore the combined effect of multiple genes, which was calculated using a Bayesian approach. An independent sample t-test was used to evaluate the differences between PRS scores of BC and healthy subjects. The discriminatory accuracy of the models was compared using the area under the receiver operating characteristic (ROC) curve.
Results The estrogen homeostasis profile was disturbed in BC patients, with parent estrogens (E1, E2) and carcinogenic catechol estrogens (2/4-OHE1, 2-OHE2, 4-OHE2) significantly accumulating in the serum of BC patients. We then established a PRS model to evaluate the role of SNPs in multiple genes. PRS model 1 (M1) was established from SNPs in 6 GWAS-identified high risk genes. On the basis of M1, we added SNPs from 7 estrogen metabolism enzyme genes to establish PRS model 2 (M2). The independent sample t-test results showed that there was no difference between BC and healthy subjects in M1 (P = 0.17); however, there was a significant difference between BC and healthy subjects in M2 (P = 4.9 × 10⁻⁵). The ROC curve results showed that the accuracy of M2 (AUC = 62.18%) in breast cancer risk identification was better than that of M1 (AUC = 54.56%).
Conclusion Estrogen and related metabolic enzyme gene polymorphisms are closely related to BC. The model constructed by adding estrogen metabolic enzyme gene SNPs has a good predictive ability for breast cancer risk, and the accuracy is greatly improved compared with that of the PRS model that only includes GWAS-identified gene SNPs.


Keywords: Breast cancer, Risk prediction, Estrogens, Estrogen-metabolizing enzyme, Gene polymorphism, Polygenic risk score

Background
Breast cancer is the most common malignant disease among women worldwide, accounting for 24% of new cancer cases and 15% of cancer deaths in 2018, and incident cases are expected to increase by more than 46% by 2040, according to the GLOBOCAN Cancer Tomorrow prediction tool, which will seriously endanger women's lives and health [1]. At present, people's understanding of breast cancer is deepening substantially, and new treatment strategies for tumors, including breast cancer, are continually emerging [2,3]. With continuous improvements in diagnosis and treatment methods, the survival rate of breast cancer patients has been greatly improved. Early prediction, early detection, and early treatment of high-risk groups are the key issues that urgently need to be solved in the clinic.
The occurrence and development of breast cancer are closely related to genetic and environmental factors. In 1989, Gail proposed a breast cancer risk prediction model that included factors such as age at evaluation, age at menarche, age at first live birth, race, number of previous breast biopsies, and family history of breast cancer [4,5]. Some subsequent prediction models also incorporated BRCA1/2 status, estrogen replacement therapy, number of mammography screenings, and genetic polymorphisms. Rare high-risk mutations, particularly in the BRCA1 and BRCA2 genes, explain less than 20% of the twofold familial relative risk (FRR) and account for a small proportion of breast cancer cases in the general population. Low-frequency variants conferring intermediate risk, such as those in CHEK2, ATM, and PALB2, explain 2 to 5% of the FRR [6]. Genome-wide association studies (GWASs) have led to the discovery of multiple common, low-risk variants (single nucleotide polymorphisms [SNPs]) associated with breast cancer risk [7]. Recently, genetic risk factors were reported to account for 31% of breast cancer risk [8], which indicates that breast cancer is a multifactorial disease and that genetic factors are important etiological factors in its occurrence and development. At present, an increasing number of researchers are developing comprehensive genetic risk scores to evaluate the polygenic effects of SNPs identified by GWASs [9][10][11]. For example, Mavaddat et al. used 77 GWAS-selected SNPs to construct a polygenic risk score (PRS) for BC; compared with women in the middle quintile of the score, those in the highest 1% had a threefold higher risk [9].
GWASs also have their own limitations. First, a major limitation of genome-wide approaches is the need to adopt a stringent significance threshold to account for multiple testing. Second, GWAS-identified variants explain only a modest fraction of heritability, leaving much of it missing [12].
Estrogen is an important risk factor for breast cancer. With long-term exposure, supraphysiological concentrations of estrogen can bind to estrogen receptors, mediate the overexpression of various growth factors, and promote the growth and proliferation of cells. In addition, various metabolites of estrogen can form adducts with DNA, induce genetic mutations and produce direct genotoxicity [13]. Thus, the abnormal accumulation of estrogen and its toxic metabolites in breast tissue is an important risk factor for breast cancer development. Estrogen homeostasis is regulated by estrogen-related metabolic enzymes. Endogenous estrogens are metabolized to 2-, 4- and 16α-hydroxy estrogens in reactions catalyzed by the phase I metabolizing enzymes cytochrome P450 CYP1A1, CYP1B1 and CYP3A4, respectively [14][15][16]. Hydroxyestrogens are detoxified by conjugation reactions catalyzed by phase II metabolizing enzymes such as COMT, UGTs and SULTs. Thus, the levels of estrogen and its toxic metabolites can, to a certain extent, be considered a comprehensive reflection of the activity of these estrogen metabolic enzymes. Polymorphisms in genes encoding these estrogen-related metabolic enzymes are reported to be closely related to differences in enzyme activities and to alter the levels of DNA-damaging species, thereby influencing an individual's susceptibility to breast cancer [14,17,18]. Genetic epidemiological studies have suggested that there is a correlation between polymorphisms in estrogen metabolism genes and breast cancer risk; however, these results are not consistent [18][19][20].
An important reason for this inconsistency is that existing studies examined the correlation between gene polymorphisms of individual estrogen metabolic enzymes and breast cancer in isolation. Currently, breast cancer risk gene prediction models have not taken estrogen metabolic enzyme genes into consideration; therefore, further optimization is needed from the perspective of overall estrogen metabolism levels.
Based on the above analysis, our research aims to reveal the form of estrogen homeostasis disorders in breast cancer and explore the association between metabolic enzyme gene polymorphisms and breast cancer occurrence from the overall level of estrogen metabolism. Furthermore, we developed a risk score comprising GWAS-selected SNPs and estrogen metabolic enzyme gene SNPs to optimize the breast cancer risk prediction model.

Chemicals
The standards and other chemical reagents were described in our previously published study [21].

Clinical sample collection
Serum samples were collected during the follicular and luteal phases of 64 premenopausal women (mean age: 45.5 ± 5.04 years) first diagnosed with BC and 49 matched healthy women (mean age: 43.7 ± 8.80 years) to detect the level of estrogens. Blood samples were also collected from 140 premenopausal women (mean age: 43.3 ± 6.24 years) first diagnosed with BC and 140 matched healthy women (mean age: 40.2 ± 3.52 years) to extract DNA and analyze SNP genotypes. All samples and related data were obtained from the Affiliated Hospital of Xuzhou Medical University, Xuzhou, China, from June 2017 to May 2019. Patients with BC were enrolled from the Department of Nail Surgery, whereas healthy subjects were enrolled from the physical examination center. Blood samples were collected before any therapy.
The enrollment criteria were as follows: no history of smoking; BMI ranging from 19 to 26; and no history of chemotherapy, radiotherapy, or estrogen-related endocrine therapy during blood sample collection. The characteristics of the patients at baseline can be seen in Table 1. This protocol was approved by the Ethics Committee of the Affiliated Hospital of Xuzhou Medical University. Written informed consent was obtained from each subject before the study.

Quantification of estrogens using the LC-MS/MS method
The LC-MS/MS method was performed according to our previously published method [21].

Statistical analysis
SPSS 22.0 software was used to perform statistical analysis. All estrogen data are expressed as the mean ± SEM, and Student's t-test was used to test differences between the two groups. Multivariate analysis was performed using SIMCA 14.0 software. Hardy-Weinberg equilibrium (HWE) was examined among controls using a goodness-of-fit chi-squared test. The odds ratio (OR) and 95% confidence interval (CI) were calculated using a logistic regression model to assess the association between the SNPs and the risk of breast cancer.
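As an illustration of the OR and 95% CI reported for each SNP, the following minimal sketch computes both from a 2x2 table of carrier counts. It is not the SPSS procedure used in the study: it relies on the fact that, for a single binary predictor, the logistic regression OR equals the cross-product ratio and its Wald interval coincides with the Woolf interval. The function name and the counts are hypothetical.

```python
import math


def odds_ratio_ci(case_exposed, case_unexposed, ctrl_exposed, ctrl_unexposed, z=1.96):
    """Return (OR, lower, upper) for a 2x2 table using the Wald (Woolf) interval."""
    a, b, c, d = case_exposed, case_unexposed, ctrl_exposed, ctrl_unexposed
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts (not data from this study):
# 40 of 140 cases and 20 of 140 controls carry the variant allele.
or_value, ci_low, ci_high = odds_ratio_ci(40, 100, 20, 120)
print(f"OR = {or_value:.2f}, 95% CI: {ci_low:.2f}-{ci_high:.2f}")
```

An interval that excludes 1 corresponds to a nominally significant association at the 5% level; covariate-adjusted or per-allele estimates would require fitting the logistic model itself.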
We established a PRS to estimate the multigene contribution of estrogen-metabolic enzyme gene loci for breast cancer susceptibility, which was created using marginally significant SNPs associated with breast cancer risk based on the per-allele models. For SNPs in strong linkage disequilibrium located on the same gene or chromosome, we chose the one variant with the lowest P value in the per-allele model as a candidate. The basic formula is as follows:

$$\mathrm{PRS} = \sum_{k=1}^{n} \beta_k x_k$$

where βk is the per-allele OR for breast cancer associated with the minor allele for SNP k, xk is the number of minor alleles for the same SNP (0, 1, or 2), and n is the total number of SNPs included in the model.
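The score described above is a weighted sum of minor-allele counts, one term per SNP. The sketch below shows that calculation for a single individual. The SNP labels, weights, and genotype are placeholders rather than values from this study, and the function name is hypothetical.

```python
def polygenic_risk_score(weights, genotype):
    """Sum of weight * minor-allele count over every SNP in the model.

    weights:  dict mapping SNP id -> per-allele weight (beta_k)
    genotype: dict mapping SNP id -> minor-allele count (0, 1, or 2)
    """
    missing = set(weights) - set(genotype)
    if missing:
        raise KeyError(f"genotype missing for: {sorted(missing)}")
    score = 0.0
    for snp, beta in weights.items():
        count = genotype[snp]
        if count not in (0, 1, 2):
            raise ValueError(f"{snp}: allele count must be 0, 1, or 2")
        score += beta * count
    return score


# Placeholder weights and genotype (illustrative only)
example_weights = {"snp_a": 0.5, "snp_b": 0.25, "snp_c": -0.5}
example_genotype = {"snp_a": 2, "snp_b": 1, "snp_c": 0}
example_score = polygenic_risk_score(example_weights, example_genotype)
print(example_score)
```

In many published PRS implementations the weight is the natural logarithm of the per-allele OR, so that protective alleles contribute negative terms; whether that transformation applies here is an assumption that should be checked against the study's own tables.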

Disorders of estrogen expression in breast cancer patients
Using LC-MS/MS quantitative analysis, we measured the levels of 11 serum estrogens and metabolites in 64 patients with premenopausal BC (BC group; mean age: 45.5 ± 5.04 years) and 49 matched healthy controls (NC group; mean age: 43.7 ± 8.80 years). There was no significant difference in age between the BC group and NC group. As shown in Fig. 1a, compared with the NC group, the BC group exhibited significantly increased estrogen levels, especially E1, E2, 2-OHE2, 4-OHE2 (P < 0.01) and 2/4-OHE1 (P < 0.05). An OPLS-DA model, a supervised multivariate method, was constructed to identify potential changes in estrogen homeostasis between the two groups. As shown in Fig. 1b, the metabolic profile of the NC group was clearly separated from that of the BC group, indicating a considerable difference in metabolites between the two groups. The potential biomarkers with VIP values higher than 1.0 in the OPLS-DA model were E1, E2, 2-OHE2, 4-OHE2 and 2/4-OHE1 in the serum of BC patients (Fig. 1c). Overall, these results supported the view that the disorder of estrogen homeostasis was closely related to increased risk of BC.

Cohort description and Hardy-Weinberg equilibrium testing
We enrolled 140 patients first diagnosed with breast cancer and 140 corresponding healthy women in this study. The mean age at diagnosis (for patients with cancer) was 43.3 ± 6.24 years, and the mean age of healthy women at enrollment was 40.2 ± 3.52 years. Blood samples were collected from these participants to extract DNA and analyze the SNP genotype. There was no significant difference in age between the BC group and NC group. The chi-square test was used to test for HWE, and P > 0.05 was taken to indicate that the genotype distribution did not deviate from equilibrium and that the enrolled samples were representative of the population. As seen in Table 2, all polymorphisms were found to be in genetic equilibrium, which indicated that the observed genotype frequencies of the case and control groups were representative.
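The HWE check can be illustrated with a short goodness-of-fit calculation. This is a minimal sketch, assuming a biallelic SNP and the usual one degree of freedom; the genotype counts are hypothetical and the function name is not from the study.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for Hardy-Weinberg equilibrium (1 df).

    Arguments are observed counts of the two homozygotes and the heterozygote.
    Returns (chi2, p_value).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical control genotype counts (not study data)
chi2_ok, p_ok = hwe_chi_square(49, 42, 9)     # matches HWE expectations
chi2_bad, p_bad = hwe_chi_square(60, 20, 20)  # heterozygote deficit
print(f"{chi2_ok:.3f} {p_ok:.3f} | {chi2_bad:.2f} {p_bad:.2e}")
```

A P value above 0.05 means the test gives no evidence of departure from equilibrium; it does not by itself establish that the sample is representative.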

PRS breast cancer risk prediction model establishment and evaluation
The binary logistic regression method was used to calculate the OR of the per-allele model, and the detailed results are shown in Table 4. We used SNPs ... (rs7439366) to create PRS model 2 (M2). For SNPs in strong linkage disequilibrium located on the same gene or chromosome, we chose the one variant (rs1048943) with the lowest P value in CYP1A1. Because rs1056836 is a protective locus, we chose the risk variant rs1056827 in CYP1B1. The PRS scores are expressed as the means ± SEM. Under M1 and M2, the PRS data of the two groups followed a normal distribution; therefore, we used an independent sample t-test to evaluate the difference between the two groups. As shown in Table 5 and Fig. 2, the PRS scores in the NC group were significantly lower than those in the BC group in M2 (P = 4.9 × 10⁻⁵); however, there was no significant difference between NC and BC in M1 (P = 0.17). Finally, the ROC curve was calculated to evaluate how well the risk models discriminated between women with and without breast cancer (Fig. 3). The area under the ROC curve (AUC) for M2 was 62.18% (95% confidence interval [CI]: 56%-69%), whereas that for M1 was only 54.56% (95% CI: 48%-61%). Therefore, the accuracy of M2 in breast cancer risk identification was better than that of M1.
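The two evaluation steps used here, an independent sample t-test on PRS values and the area under the ROC curve, can be sketched as follows. The scores are invented for illustration, the function names are hypothetical, and the AUC is computed directly as the proportion of case-control pairs in which the case has the higher score (ties counted as one half), which is equivalent to the area under the empirical ROC curve.

```python
import math
from statistics import mean, variance


def student_t(group_a, group_b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


def auc(case_scores, control_scores):
    """Fraction of case-control pairs ranked correctly (ties count as 0.5)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Invented PRS values (not study data)
bc_scores = [1.2, 0.9, 1.5, 1.1]
nc_scores = [0.8, 1.0, 0.7, 1.1]

t_stat, dof = student_t(bc_scores, nc_scores)
model_auc = auc(bc_scores, nc_scores)
print(f"t = {t_stat:.2f} (df = {dof}), AUC = {model_auc:.3f}")
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation. Converting the t statistic to a P value requires the t distribution, which the Python standard library does not provide, so that step is left to a statistics package.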

Discussion
Breast cancer (BC) is an estrogen-dependent tumor, and the occurrence of BC is closely related to the imbalance of estrogen homeostasis. The accumulation of estrogen and its toxic metabolites in vivo is a significant risk factor for BC development. Different types of estrogens have different physiological and pathological activities and can play an important role in the process of cancer development through different mechanisms. Parent estrogens are postulated to promote tumorigenesis directly through the stimulation of the estrogen receptor (ER) [23]. The endogenous conversion of estrogen to genotoxic metabolites has been reported as an alternative, potentially ER-independent mechanism for estrogen-dependent breast tumorigenesis [24]. Catechol estrogens can form adducts with DNA, causing gene mutations and producing direct genotoxicity [13]. Methoxyestrogens, including 2-methoxyestradiol, have been shown to inhibit carcinogenesis by suppressing cell proliferation and estrogen oxidation due to their effects on microtubule stabilization [25].
In this study, the LC-MS/MS quantitative analysis method was used to determine the serum estrogens in the BC group and NC group. Comparing the levels of serum estrogens in the follicular phase and luteal phase of premenopausal breast cancer patients with healthy female volunteers, we found that the levels of parent and hydroxylated estrogen in the BC group were significantly higher than those in the NC group, which indicated that estrogen metabolism disorder is closely related to the occurrence and development of breast cancer. Using OPLS-DA, we also noticed that E1, E2, 4-OHE2, 2-   [27,28]. However, high-risk genes such as BRCA1 and BRCA2 account for less than 15% of breast cancer cases [29,30], which suggests that numerous breast cancer-related risk genes have not been discovered, and these gene polymorphisms influence susceptibility to breast cancer.
Estrogen is an important risk factor for breast cancer. However, no research has incorporated estrogens into the breast cancer risk prediction model. A possible major reason is that there is no clinically effective estrogen evaluation method because the steady state of estrogen is affected by various physiological and pathological factors, such as menstrual cycle fluctuations. However, estrogen homeostasis is regulated by various metabolic enzymes. Therefore, we believe that estrogen metabolic enzyme gene polymorphisms are closely related to estrogen homeostasis and the occurrence and development of breast cancer. In this study, univariate logistic regression analysis showed that CYP1A1, CYP1B1, and SULT1A1 gene polymorphisms are closely related to the occurrence of breast cancer. It is worth noting that these gene polymorphisms are also associated with other estrogen-dependent tumors such as endometrial cancer and ovarian cancer. Hirata et al. found that SULT1A1 rs9282861 (rs1042028) was related to endometrial cancer [31]. A meta-analysis of the association between CYP1A1 gene polymorphisms and ovarian cancer risk showed that the Ile/Val polymorphism (rs1048943) was significantly associated with ovarian cancer, with homozygous carriers at increased risk (Val/Val vs. Ile/Ile: OR = 2.64; 95% CI: 1.63-4.28) [32].
CYP1A1 and CYP1B1 are the major phase I drug-metabolizing enzymes that catalyze the hydroxylation of estrogens. The increasing polarity of estrogens may be related to the risk of breast cancer [14], and our results are consistent with this view. In this study, we found that the variant allele of CYP1B1 rs1056836 was associated with a reduced risk of breast cancer, although the exact mechanism of this protection is not clear [33]. We speculate that the heterozygous genotype of CYP1B1 rs1056836 (GC vs. GG: OR = 0.37, 95% CI: 0.21-0.67, P = 0.001) may result in decreased function of the CYP1B1 enzyme, reducing the production of 4-hydroxy estrogen and, in turn, of catechol estrogen-3,4-quinone (CE-3,4-Q), which forms adducts with DNA. At the same time, this study also showed that the variant alleles of CYP1A1 rs1048943 (TC vs. TT: OR = 2.37, 95% CI: 1.27-4.43, P = 0.003) and CYP1B1 rs1056827 (AA vs. CC: OR = 6.90, 95% CI: 1.50-31.76, P = 0.001) are closely related to the risk of breast cancer, which is consistent with most research [34,35]. A possible reason is that these variants increase the activity of the CYP1A1 and CYP1B1 enzymes, raising the production of hydroxylated estrogens or increasing the individual's susceptibility to estrogen.
SULTs catalyze the sulfate conjugation of a broad range of substrates and play an important role in the metabolism of endogenous and exogenous compounds, including thyroid and steroid hormones, neurotransmitters, drugs and procarcinogens [36]. SULTs catalyze the sulfation of estrogens (E1 and E2) and their metabolites (such as catechol estrogens) and eliminate estrogenic activity by forming sulfated estrogens that cannot bind to estrogen receptors (ERs). At the same time, sulfation promotes the rapid excretion of sulfated estrogens from cells, which can reduce the level of estrogen exposure in the circulation and target tissues. SULT1A1 rs1042028 is the most widely studied polymorphism of this gene. Its variant allele can reduce enzyme activity and thermal stability, resulting in increased estrogen accumulation and increased individual susceptibility to breast cancer [37]. In this study, carriers of the heterozygous genotype of rs1042028 had a 2.21-fold higher risk of breast cancer than carriers of the wild-type genotype. This is consistent with the results of multiple studies [38,39].
Previous studies investigated associations between the PRS of multiple SNPs and breast cancer risk to study the cumulative effect of genes on the disease. Mavaddat et al. constructed a 77-SNP PRS for breast cancer and found a threefold increase in risk when comparing the polygenic scores of the highest 1% and the middle quintiles [9]. Harlid et al. investigated the combined effect of ten low-penetrance SNPs on breast cancer and found that the cumulative effect was strongly correlated with breast cancer [40]. However, most PRS research is based on samples from Caucasian populations. Although Sueta, Chan and others have conducted similar studies in Asian populations, the evidence is still limited [7,41].
To date, there have been no relevant reports on the establishment of a breast cancer PRS risk prediction model from the perspective of estrogen-metabolizing enzymes. Thus, a multigene PRS model including estrogen metabolic enzyme gene SNPs and GWAS-selected SNPs was constructed in this study to evaluate the comprehensive effects of multiple estrogen metabolic enzyme SNPs on breast cancer.
In this study, we evaluated possible relationships between breast cancer risk and both estrogen metabolic enzyme gene SNPs and GWAS-identified gene SNPs in an Asian population. Among them, the GWAS-identified SNPs were not associated with breast cancer risk in the per-allele model or dominant model in our study. This finding was inconsistent with a previous study [23]. Further, we established PRS model 1 (M1), including only GWAS-identified SNPs, and PRS model 2 (M2), which added estrogen metabolic enzyme gene SNPs to M1. By calculating the PRS of each individual under M1 and M2 and performing a t-test on the PRS of the BC and NC groups, we found that the P value for M2 (4.9 × 10⁻⁵) was far lower than that for M1 (0.17). Moreover, the AUC of the M2 model (62.18%) was higher than that of the M1 model (54.56%). Therefore, adding estrogen metabolic enzyme gene SNPs improved the ability of the model to predict breast cancer risk.
There are several limitations of this study that should be noted. First, the sample size was relatively small. In this study, only 140 premenopausal women first diagnosed with BC and 140 matched healthy women were recruited based on our criteria; thus, we did not have enough statistical power to detect the effect of the genetic variants on some of the parameters. Second, because funding was limited, the study did not include a comprehensive set of metabolic enzymes or an adequate number of breast cancer risk gene loci. For these reasons, the AUC was small and the model has not been validated. In the future, we will study additional estrogen-metabolizing enzyme genes and other breast cancer risk genes. At the same time, we will also include recognized breast cancer risk factors such as age at evaluation, age at menarche, age at first live birth, race, number of previous breast biopsies, and family history of breast cancer and construct a breast cancer risk prediction model composed of phenotype and genotype to obtain a higher AUC. In addition, the sample size needs to be further expanded, and it would be preferable to include data from different ethnic groups.

Conclusion
Estrogens and related metabolic enzyme gene polymorphisms are closely related to BC. The model constructed by adding estrogen metabolic enzyme gene SNPs has good predictive ability for breast cancer risk, and the accuracy is greatly improved compared with that of the PRS model that only includes GWAS-identified gene SNPs.