SNP-SNP interactions in breast cancer susceptibility

Background Breast cancer predisposition genes identified to date (e.g., BRCA1 and BRCA2) are responsible for less than 5% of all breast cancer cases. Many studies have shown that the cancer risks associated with individual commonly occurring single nucleotide polymorphisms (SNPs) are incremental. However, polygenic models suggest that multiple commonly occurring low to modestly penetrant SNPs of cancer related genes might have a greater effect on a disease when considered in combination. Methods In an attempt to identify the breast cancer risk conferred by SNP interactions, we have studied 19 SNPs from genes involved in major cancer related pathways. All SNPs were genotyped by TaqMan 5'nuclease assay. The association between the case-control status and each individual SNP, measured by the odds ratio and its corresponding 95% confidence interval, was estimated using unconditional logistic regression models. At the second stage, two-way interactions were investigated using multivariate logistic models. The robustness of the interactions, which were observed among SNPs with stronger functional evidence, was assessed using a bootstrap approach, and correction for multiple testing based on the false discovery rate (FDR) principle. Results None of these SNPs contributed to breast cancer risk individually. However, we have demonstrated evidence for gene-gene (SNP-SNP) interaction among these SNPs, which were associated with increased breast cancer risk. Our study suggests cross talk between the SNPs of the DNA repair and immune system (XPD-[Lys751Gln] and IL10-[G(-1082)A]), cell cycle and estrogen metabolism (CCND1-[Pro241Pro] and COMT-[Met108/158Val]), cell cycle and DNA repair (BARD1-[Pro24Ser] and XPD-[Lys751Gln]), and within carcinogen metabolism (GSTP1-[Ile105Val] and COMT-[Met108/158Val]) pathways. 
Conclusion The importance of these pathways and their communication in breast cancer predisposition has been emphasized previously, but their biological interactions through SNPs have not been described. The strategy used here has the potential to identify complex biological links among breast cancer genes and processes. This will provide novel biological information, which will ultimately improve breast cancer risk management.


Background
The existence of dominant predisposition alleles/mutations, conferring a high breast cancer risk, has been confirmed with the discovery of BRCA1 and BRCA2 [1,2]. Functionally defective mutations of BRCA1 and BRCA2 are strongly associated with dramatically increased breast cancer risk; however, such mutations are rare (<5%) in unselected breast cancer cases [1][2][3]. Besides these rare mutations, commonly occurring single nucleotide polymorphisms (SNPs) have also been shown to contribute incrementally to breast cancer risk, although their individual contributions are relatively small [4][5][6].
SNPs have historically been classified as commonly occurring (>1%) genetic variation in the general population, whereas rare variants with obvious functional consequences on the protein have been classified as mutations. Compared to mutations, SNPs have been perceived as functionally insignificant; however, current evidence emphasizes that a considerable fraction of them affects the intrinsic properties and the function of the proteins to a variable degree [7][8][9]. Although the effect of an individual SNP is generally small, combinations of functionally relevant SNPs may additively or synergistically contribute to increased breast cancer risk. Epistasis or gene-gene interaction is likely to be a ubiquitous component of the genetic architecture of common diseases, such as breast cancer. The effects of epistasis could dictate functional outcomes over the independent effects of any one susceptibility gene [10]. Polygenic models have also been proposed to explain the joint effect of many susceptibility alleles on breast cancer, but without specifically considering their possible interactions [11][12][13][14].

Subject population
A case-control study was conducted using biospecimens and data from the Ontario Familial Breast Cancer Registry (OFBCR), a participating site in the NIH-funded Breast Cancer Family Registry [42]. Written informed consent was obtained from all subjects, and the study protocol was approved by the Mount Sinai Hospital Research Ethics Board.
Cases of invasive breast cancer, pathologically confirmed and diagnosed between 1996 and 1998 in the province of Ontario, were identified from the population-based Ontario Cancer Registry. All female cases under 55, a random sample (35%) of female cases aged 55 to 69, and all male cases under age 80 were identified. Physician permission to contact patients was granted for 91% of cases (7668 of 8453). Patients were then mailed a cancer family history questionnaire and 65% (4957) completed it. All respondents who met a defined set of genetic risk criteria (i.e., Ashkenazi Jewish; diagnosed before age 36 years; previous ovarian or breast diagnosis; one or more first- or two or more second-degree relatives with breast or ovarian cancer; one or more second- or third-degree relatives with either breast cancer diagnosed before age 36 years, ovarian cancer diagnosed before age 61 years, multiple breast or breast and ovarian primaries, or male breast cancer; three or more first-degree relatives with any combination of breast, ovarian, colon, prostate, or pancreatic cancer or sarcoma, with at least one diagnosis before age 51 years) were included in the study [43] and a random sample of 25% of those not meeting criteria was selected to continue to participate in the OFBCR (n = 2580). This participation included providing a blood sample (provided by 62% of all eligible, n = 1601). For the current study, we restricted the sample to women who identified themselves as Caucasian and were less than 55 years old. As we had randomly sampled 25% of those who did not meet genetic risk criteria, we also randomly sampled 25% of those who did meet genetic risk criteria in order to create a more representative sample of cases. Therefore, the cases should better represent all cases without enrichment for genetic risk criteria such as family history.
In Table 2, 21.6% of cases in the present study had a first-degree family history of breast cancer, which is consistent with the 17 to 22% frequency reported in cases in a number of large case-control studies [44][45][46]. Of 459 Caucasian breast cancer cases with blood available, 398 were successfully genotyped and included in the study.
Controls were identified by calling randomly selected residential telephone numbers from across the province of Ontario and were frequency-matched to all female OFBCR cases by 5-year age group. The number of telephone numbers was 14,653, but 1101 (8%) were invalid and no contact could be made for 841 (6%). Of the 12,711 households contacted, 7829 (62%) did not have an eligible individual. No information on eligibility was provided for 2194 (17%) households. Of the 2688 eligible individuals identified on the telephone, 1726 (64%) completed the mailed risk factor questionnaire and 75% of these agreed to be contacted about providing a blood sample. The 676 women under age 55 who had agreed to be approached about blood sampling were asked to provide a blood sample and 419 (62%) did so. Individuals who were not Caucasian were excluded from the analysis, as were those with insufficient DNA or those subsequently found to be ineligible because of age. The remaining 372 population controls were successfully genotyped in this study.

Molecular genotyping
All SNPs were analyzed by TaqMan 5'nuclease assay [47] using the ABI PRISM 7900 HT Sequence Detection System (version 2.0). Oligonucleotide primers and the dual labeled allele specific probes were designed using PrimerExpress version 2.0 (PE Biosystems). Positions of the primers and probes within their corresponding accession numbers are given in Additional file 1.
A panel of DNA samples was sequenced for each SNP region initially, in order to identify control genotypes to be used in each experiment. PCRs were performed in 96 well plates (AXYGEN) with each plate containing four control samples for each possible genotype. Genomic DNA (10 ng) was amplified in a total volume of 10 µl in the presence of 100 µM of each of the dNTPs, 3 pmoles of each of the appropriate primers, 2 pmoles of each of the corresponding dual labelled probes, and 0.025 units of Platinum Taq DNA Polymerase (Invitrogen). PCR cycling conditions consisted of 40 cycles of 94°C for 15 sec, 55-60°C for 15 sec and 72°C for 15 sec. The optimal MgCl2 concentrations and annealing temperatures for each SNP are given in Additional file 2. The reliability of the results was determined by re-genotyping a randomly selected 10% portion of the total study population.

Statistical analyses
We sought evidence of association between each of the 19 SNPs and breast cancer risk in a multi-step process. At the first stage, we calculated crude allele and genotype frequencies for each individual polymorphism and evaluated Hardy-Weinberg equilibrium using a one-degree-of-freedom goodness-of-fit test among controls [48]. The association between the case-control status and each individual SNP, measured by the odds ratio (OR) and its corresponding 95% confidence interval, was estimated using unconditional logistic regression after adjustment for age. Several epidemiological risk factors were also assessed for association with breast cancer including age, BMI, education status, smoking status, family history, menopausal status, age at menarche, age at menopause, parity and age at first live birth (Table 2). Some of our analyses were also carried out adjusting the SNP main effect for the statistically significant epidemiological risk factors.
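As an illustration of the first-stage check, the following is a minimal sketch of a one-degree-of-freedom Hardy-Weinberg goodness-of-fit test computed from genotype counts. The function name `hwe_chi_square` and the example counts are hypothetical and are not taken from the study data; the exact procedure used in [48] may differ in detail (for example, in how rare genotypes are handled).

```python
import math

def hwe_chi_square(n_common_hom, n_het, n_rare_hom):
    """Return (chi2, p_value) for a 1-df Hardy-Weinberg goodness-of-fit test.

    Hypothetical helper for illustration; inputs are genotype counts
    observed among controls.
    """
    n = n_common_hom + n_het + n_rare_hom
    # Allele frequency of the common allele estimated from genotype counts.
    p = (2 * n_common_hom + n_het) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        raise ValueError("monomorphic locus: HWE test is undefined")
    # Expected genotype counts under Hardy-Weinberg proportions.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_common_hom, n_het, n_rare_hom)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Made-up counts that sit exactly at Hardy-Weinberg proportions.
chi2, p_value = hwe_chi_square(25, 50, 25)
```

A large P-value, as in this made-up example, gives no evidence of deviation from Hardy-Weinberg equilibrium.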
All analyses were performed assuming a dominant, recessive and co-dominant effect for each polymorphism. In the dominant model, both the heterozygous variant and the rare homozygous variant were combined. In the recessive model, the variant was defined as only the rare homozygous genotype and in the co-dominant model both rare homozygous and heterozygous variant effects were estimated using two dummy variables. In all analyses, the common homozygote genotype in the control population was defined as the reference category. Age was considered as a continuous variable. The likelihood ratio test was used to test the effect of each SNP at the nominal 5% significance level. Akaike's information criterion [49] was also used to select the best genetic effect for each SNP.
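The three genetic models can be sketched as different codings of the same genotype. The snippet below is an illustrative sketch only: it assumes each genotype is stored as the number of rare alleles (0, 1 or 2), and the names `encode_genotype` and `aic` are hypothetical. The AIC is written in its usual form, 2k - 2 log L, where k is the number of estimated parameters; the model with the smallest AIC is preferred.

```python
def encode_genotype(rare_allele_count, model):
    """Code a genotype (0, 1 or 2 rare alleles) under a genetic model.

    The common homozygote (0 rare alleles) is the reference category
    and is therefore coded as all zeros in every model.
    """
    if rare_allele_count not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2 rare alleles")
    if model == "dominant":
        # Heterozygotes and rare homozygotes are combined.
        return (int(rare_allele_count >= 1),)
    if model == "recessive":
        # Only the rare homozygote counts as the variant.
        return (int(rare_allele_count == 2),)
    if model == "codominant":
        # Two dummy variables: heterozygote, rare homozygote.
        return (int(rare_allele_count == 1), int(rare_allele_count == 2))
    raise ValueError("unknown model: " + model)

def aic(log_likelihood, n_params):
    """Akaike's information criterion: 2k - 2 log L."""
    return 2 * n_params - 2 * log_likelihood

# A heterozygote under each of the three codings.
codings = {m: encode_genotype(1, m) for m in ("dominant", "recessive", "codominant")}
```

In this coding the co-dominant model spends two parameters per SNP, while the dominant and recessive models spend one each, which is what the AIC comparison penalizes.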
At the second stage, two-way interactions were investigated using multivariate logistic models. More specifically, we tested all SNP-SNP interactions. We assumed a multiplicative interaction effect on the logit scale. Statistically significant interactions were selected using a forward stepwise selection procedure to evaluate evidence that specific interactions were independently associated with breast cancer. The initial model included all SNPs and age as main effects, and the procedure then searched for the most significant candidate interactions to enter into the model based on the score statistics at the 5% level. Backward elimination of variables was then performed using the likelihood ratio test (LRT), also at the 5% level. The forward stepwise selection procedure has proven to be more efficient than backward elimination in assessing interaction effects when testing multiple interactions. First, it is more time efficient and second, when using backward elimination, a relatively large number of predictor variables may increase the risk of complete separation of the two outcome groups, which would result in numerical problems in estimating the model parameters [50]. Since the genetic risk model is uncertain for most of the SNPs considered, we performed these tests on the co-dominant models only. Therefore, tests for SNP-SNP interactions have four degrees of freedom. All these analyses were also performed adjusting the interaction effects for the risk factors found to be associated with breast cancer risk at a significance level of 5% (BMI and family history).
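The four degrees of freedom follow from the coding: each SNP contributes two dummy variables under the co-dominant model, so the product terms for a pair number 2 x 2 = 4. The sketch below illustrates this, and also counts the pairs that can be formed from 19 SNPs. The SNP labels and the function `interaction_terms` are hypothetical placeholders, not the study's variable names.

```python
import math
from itertools import combinations

def codominant_dummies(rare_allele_count):
    """Return (heterozygote, rare homozygote) indicator variables."""
    return (int(rare_allele_count == 1), int(rare_allele_count == 2))

def interaction_terms(genotype_1, genotype_2):
    """Product terms for a SNP-SNP interaction under co-dominant coding.

    Order: het1*het2, het1*hom2, hom1*het2, hom1*hom2.
    Each term gets its own coefficient, hence four degrees of freedom.
    """
    het1, hom1 = codominant_dummies(genotype_1)
    het2, hom2 = codominant_dummies(genotype_2)
    return (het1 * het2, het1 * hom2, hom1 * het2, hom1 * hom2)

# Placeholder labels for the 19 SNPs.
snps = ["SNP%d" % i for i in range(1, 20)]
pairs = list(combinations(snps, 2))

n_pairs = len(pairs)                   # equals math.comb(19, 2)
example = interaction_terms(1, 2)      # heterozygote x rare homozygote
```

Only one of the four product terms is non-zero for any given individual, and all four are zero when either genotype is the reference (common homozygote) category.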
We have also estimated the amount of linkage disequilibrium (LD) between the two ESR1 SNPs separated by about 140 kb on chromosome 6 and investigated their haplotype effect on breast cancer using the software "Unphased" from Dudbridge [51].
The large number of interactions (n = 171) analyzed could lead to false positive results. We therefore adopted two different strategies to avoid this problem: the first approach assessed the selection procedure using bootstrapping, and the second adjusted for multiple testing using the false discovery rate (FDR). The bootstrap approach selects random samples of size n (n1 cases + n2 controls) with replacement from the original data [52]. Repeating the sampling procedure a large number of times provides information on the variability and validity of the parameter estimate and model selection. We repeated the selection procedure on 1,000 random samples (each random sample comprising 398 cases and 372 controls) generated from the original sample, and the number of times a particular interaction was selected was reported. The achieved significance level (ASL) from the bootstrap test of hypotheses was also computed. Following Efron and Tibshirani [52], the ASL was obtained by comparing the observed LRT statistic for a specific interaction to its null distribution, evaluated by randomly assigning the case-control status in 1,000 bootstrap samples. The second approach tries to correct formally for the multiple testing problem using the FDR principle [53]. This procedure does not control the experiment-wise error rate like the Bonferroni-type correction (which is known to be conservative) but estimates the proportion of errors among the rejected null hypotheses. FDR was applied to both main effect models and interaction models using bootstrap P-values. For these latter models, bootstrap P-values and FDR-adjusted P-values correspond to interaction effects in multivariate logistic models that include all main effects and only the interaction of interest. This is equivalent to the test performed at the first step of the forward stepwise regression.
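The two strategies can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the FDR adjustment is assumed to be the Benjamini-Hochberg step-up procedure (the text cites [53] without naming the variant), and the ASL is taken as the simple proportion of null statistics at least as large as the observed one.

```python
import random

def bootstrap_sample(cases, controls, rng):
    """Resample cases and controls separately, with replacement,
    keeping the original numbers of cases and controls."""
    new_cases = [rng.choice(cases) for _ in cases]
    new_controls = [rng.choice(controls) for _ in controls]
    return new_cases, new_controls

def achieved_significance_level(observed_stat, null_stats):
    """Proportion of null test statistics >= the observed statistic."""
    exceed = sum(1 for s in null_stats if s >= observed_stat)
    return exceed / len(null_stats)

def benjamini_hochberg(p_values):
    """FDR-adjusted P-values (Benjamini-Hochberg step-up), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

rng = random.Random(0)
cases, controls = list(range(398)), list(range(372))   # placeholder subject IDs
boot_cases, boot_controls = bootstrap_sample(cases, controls, rng)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

In the selection-frequency analysis, the stepwise procedure would be rerun on each such resample and the number of times each interaction enters the model would be tallied.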
We also computed the probability of no true association between an interaction and the disease status given a statistically significant result (i.e. the false positive report probability, FPRP) proposed by Wacholder et al. [54]. This statistic depends on the observed P-value but also on both the prior probability that the association between the SNP-SNP interaction and the disease is real and the statistical power of the test. The power of the test was determined by computing the expected value of the likelihood ratio test statistic, assuming our data were analyzed by the unconditional age-adjusted logistic model and with the specific coding of the SNP-SNP interactions using four dummy variables (see above). This computation was implemented in an R program, following the method described by Gauderman [55,56]. We used informative prior probabilities, drawing on the functional studies presented in Table 1 to classify the importance of each SNP. Following Wacholder's recommendations [54], the probability assigned to each SNP was 0.10, 0.01 and 0.001 for the high-, medium-, and low-ranked SNPs respectively. The prior joint probability for each pair of SNPs was the product of the individual SNP probabilities. Finally, we used the bootstrap P-values in the FPRP computation.
Table 1 gives the minor allele frequencies of 19 SNPs estimated in our control population. None of the SNP distributions showed deviation from Hardy-Weinberg equilibrium in this sample. The distribution of selected epidemiologic risk factors in cases and controls is shown in Table 2. Cases and controls were similar with respect to the distribution of smoking status, menopausal status, age at menarche, age at menopause, parity, and age at first birth. Controls tended to have a higher BMI (p = 0.05) and level of education (p = 0.06) than cases. Cases were also more likely to have a positive family history of breast cancer than controls, and this difference was highly significant (p < 10^-5).
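For the FPRP computation described in the statistical analyses, a minimal sketch may help. It uses the standard form FPRP = alpha(1 - pi) / (alpha(1 - pi) + (1 - beta)pi), where alpha is the P-value threshold (here the observed P-value), 1 - beta is the power and pi is the prior probability of a true association. The function names and the numerical inputs below are hypothetical; the power value in particular is a made-up placeholder, not a result from this study.

```python
def joint_prior(prior_snp_1, prior_snp_2):
    """Prior probability for a SNP pair: product of the two SNP priors."""
    return prior_snp_1 * prior_snp_2

def fprp(p_value, power, prior):
    """False positive report probability for a significant finding.

    p_value : observed P-value, used as the significance threshold alpha
    power   : statistical power (1 - beta) of the test
    prior   : prior probability that the association is real
    """
    false_positive = p_value * (1.0 - prior)
    true_positive = power * prior
    return false_positive / (false_positive + true_positive)

# Two high-ranked SNPs, each with prior 0.10, give a joint prior of 0.01.
prior = joint_prior(0.10, 0.10)
# Illustrative inputs only: P = 0.05 and an assumed power of 0.8.
value = fprp(0.05, 0.8, prior)
```

With these made-up inputs the FPRP is roughly 0.86, which shows how strongly a small prior inflates the chance that a nominally significant interaction is a false positive.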
Logistic regression analysis was performed for all SNPs in the context of recessive, dominant and co-dominant models after adjustment for age. The estimated ORs and 95% CIs for all SNPs under codominant models are shown in Table 3.
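For readers less familiar with the effect measure, the sketch below computes a crude odds ratio and a Wald 95% confidence interval from a 2 x 2 table. This is a simplification for illustration: it is unadjusted, whereas the estimates reported here come from age-adjusted unconditional logistic regression. The function name and the counts are hypothetical.

```python
import math

def crude_odds_ratio(exposed_cases, unexposed_cases,
                     exposed_controls, unexposed_controls, z=1.96):
    """Crude OR with a Wald confidence interval on the log-odds scale.

    'Exposed' means carrying the variant genotype; 'unexposed' means
    the reference (common homozygote) genotype. z = 1.96 gives a 95% CI.
    """
    a, b = exposed_cases, unexposed_cases
    c, d = exposed_controls, unexposed_controls
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Made-up counts, not study data.
odds_ratio, lower, upper = crude_odds_ratio(20, 10, 10, 20)
```

An interval that excludes 1 corresponds to a nominally significant association at the 5% level, before any correction for multiple testing.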

Results
Among the 19 SNPs studied, XPD-[Lys751Gln] was the only one showing a significant main effect in our sample based on the crude P-value. However, after correction for multiple testing using FDR, the effect was not significant. Our results remained unchanged when the models were also adjusted for BMI and family history. Results of two-way interaction analyses are shown in Table 4. Since the genetic risk models are uncertain for most of the SNPs considered, we performed these tests on the co-dominant models only. A total of nine SNP-SNP interactions were consistently selected in at least 30% of the random samples by the stepwise procedure. Interactions were observed more frequently for XPD- Report Probability (FPRP) approach, computed using the functional importance of each SNP, we found that three interactions were noteworthy at the 0.2 FPRP level (Table 4) were all significant (respectively, P = 0.014, P = 0.020, P = 0.022 and P = 0.020), however the significance of the tests decreased due to the high proportion of individuals missing BMI information. Therefore, only the analyses adjusted for age are presented in Table 4.
The amount of LD between the two ESR1 SNPs was relatively small, with a D' [57] of 0.07 in cases and 0.15 in controls, and none of the four haplotypes was significantly associated with breast cancer. Therefore, only the interaction effect between the two SNPs is presented.
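The D' measure can be sketched as follows, assuming the haplotype and allele frequencies are already known. In the study itself these quantities were estimated from unphased genotypes with the "Unphased" software, a step this sketch does not reproduce. The function name and the example frequencies are hypothetical, and the value returned is the absolute value of D'.

```python
def d_prime(freq_ab, freq_a, freq_b):
    """Normalized linkage disequilibrium |D'| between two biallelic loci.

    freq_ab : frequency of the haplotype carrying allele A and allele B
    freq_a  : frequency of allele A at the first locus
    freq_b  : frequency of allele B at the second locus
    """
    d = freq_ab - freq_a * freq_b
    if d >= 0:
        d_max = min(freq_a * (1 - freq_b), (1 - freq_a) * freq_b)
    else:
        d_max = min(freq_a * freq_b, (1 - freq_a) * (1 - freq_b))
    if d_max == 0:
        raise ValueError("D' is undefined when a locus is monomorphic")
    return abs(d) / d_max

# Made-up frequencies: independence gives 0, complete LD gives 1.
independent = d_prime(0.25, 0.5, 0.5)
complete = d_prime(0.5, 0.5, 0.5)
```

Values close to 0, such as those reported above for the two ESR1 SNPs, indicate that the alleles at the two sites are inherited almost independently.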

Discussion
In this study, we have analysed the contribution of 19 SNPs from 18 cancer-related genes to breast cancer risk in a case-control study of 398 breast cancer cases and 372 population controls, sampled from the population-based OFBCR. All cases and controls were Caucasian women under age 55. We found that among the 19 SNPs, the XPD-[Lys751Gln] substitution was the only one showing a significant association with breast cancer risk. However, after correction for multiple testing, the effect was no longer significant, suggesting that this finding might be due to chance. Overall we found little evidence of breast cancer risk conferred by individual commonly occurring SNPs in this dataset.
Our main focus was to understand the contribution to breast cancer risk of functionally relevant SNP-SNP interactions within and between different cancer pathways.
Recently, there has been increasing evidence regarding the joint effect of commonly occurring SNPs on cancer risk, supported by polygenic breast cancer models [11][12][13][14]. Although this model was originally based on the additive effects of multiple risk alleles [14,58] (each with a low to moderate risk) it can be generalized by considering interaction effects between the susceptibility alleles.
In this study, we have shown several statistically significant interactions between SNP pairs. Despite the low false discovery rates observed for certain interactions and the small unadjusted P-values, caution should be used when interpreting these results. First, the statistical modeling of interactions through a product term might not correspond to or reflect any biological interaction such as synergy or antagonism [59,60]. Second, as in any association study, epidemiologic limitations such as selection bias or confounding have the potential to lead to false-positive results. Cases were selected from a population-based cancer registry and although selection may have occurred, there was no evidence of selection related to family history of breast cancer [43,61]. The analysis was also restricted to Caucasians, who had the highest response rates; this restriction also minimizes the potential for population stratification.
To assess the robustness of interaction models, we used two different strategies: an internal validation procedure based on bootstrap re-sampling methods [52] and a correction for multiple testing using the FDR principle [53].
The first approach allowed us to prioritize nine "candidate" SNP-SNP interactions that were consistently selected by the stepwise procedure across the bootstrap samples (i.e. in more than 30% of the 1,000 random samples) and that were significant based on the age adjusted bootstrap P-values. This method can be used to assess the variability of our model selection but does not control formally for the multiple testing problem. Although correction for multiple testing can be performed under the bootstrap framework [62] or using the familiar Bonferroni correction, these procedures can be very conservative. We therefore used the FDR principle that does not control the experiment-wise error rate but estimates the proportion of errors among the rejected null hypotheses. Using this correction, none of the SNP main effects were significant, but four 2-way interactions had adjusted P-values lower than 5%. The stepwise selection procedure applied to our original data set without validation or correction would have detected 14 significant SNP-SNP interactions at the 5% level. Based on our validation procedures, ten of these are likely to be false positive results. This shows the importance of model validation in studies of gene interactions [63]. Although this validation was internal (i.e. using the same data set), we intend to replicate our results using an external data set.
To interpret our positive results, we should also remember that our SNPs were selected from genes involved in cancer, and enriched by SNPs that are likely to affect the function of the encoded protein. Using a Bayesian approach, Wacholder [54] recently showed how the probability of no true association between a genetic variant and disease given a statistically significant result (i.e. the false positive report probability, FPRP) depends on the prior probability that the association is real and also on the statistical power of the test. Although the determination of a prior probability is quite challenging [64], selecting SNPs based on their functions clearly reduces the FPRP. Based on this approach, we found that three out of the four significant interactions were noteworthy at the 0.2 FPRP level. More complex approaches to account for the prior knowledge of the functional importance of each SNP have also been proposed [65][66][67][68]. We intend to use some of these methods to confirm our results in future analyses. Another problem to consider is the chance for false negative results. SNPs that were considered as negative after our conservative multiple comparison adjustment might still be worthy of investigation in other data sets. Therefore, some of our results must be viewed as hypothesis generating. Validation of our results (positive and negative) in an independent data set will provide further insight into the role of these SNPs in breast cancer etiology.

Table 3: Analysis of individual SNP effects on breast cancer. Estimated odds-ratios (OR) and 95% confidence intervals (CI) associated with each of the 19 polymorphisms selected under co-dominant age-adjusted main effect models (the significant associations are in bold). (Continued)
*Age-adjusted odds ratios from unconditional logistic regression analyses.
The novelty of our study is the demonstration of statistically significant interactions between SNPs that did not have an effect on breast cancer risk individually. Most studies have investigated the main effects of commonly occurring SNPs and categorized them as "not associated", and thus not important in breast cancer risk. However, our study suggests that SNPs without main effects, or with main effects too small to detect, may interact with others and confer an increased risk for breast cancer. Larger studies will allow a better application of our model, in which more complex interactions could be investigated. SNP-SNP interactions in breast cancer development have also been reported in other studies, which targeted the SNPs of the carcinogen metabolism genes, including GSTM1, GSTT1, GSTP1, GSTM3 and CYPs [69][70][71][72][73]. These reports support our findings regarding SNP-SNP interactions on breast cancer risk, although they were limited to the interactions of SNPs within a single cancer pathway.
In the context of breast cancer predisposition, our study suggests there is cross talk between the alleles of proteins of different cancer pathways including DNA repair and the immune system (XPD-[Lys751Gln] and IL10-[G(-1082)A]), cell cycle and estrogen metabolism (CCND1-[Pro241Pro] and COMT-[Met108/158Val]), cell cycle and DNA repair (BARD1-[Pro24Ser] and XPD-[Lys751Gln]), and within carcinogen metabolism (GSTP1-[Ile105Val] and COMT-[Met108/158Val]). We have further investigated the biological relationships between XPD and COMT interactions with other SNPs using protein-protein interaction databases and a literature search. As seen in Figure 1, the statistical interaction of XPD-[Lys751Gln] with IL10-[G(-1082)A] and BARD1-[Pro24Ser] revolves around the protein-protein interaction of BRCA1 and TP53. As shown previously, mutations in these two proteins lead to predisposition to hereditary breast cancer [1,78]. This supports the observation that SNPs of XPD (ERCC2), BARD1 and IL-10 may be good candidates for breast cancer predisposition, which may also modify the effect of BRCA1 in carriers. On the other hand, the statistical interaction of COMT-[Met108/158Val] with GSTP1-[Ile105Val] and CCND1-[Pro241Pro] revolves around estrogen metabolism and cell proliferation (Figure 2). Estrogen is an important risk factor for breast cancer. Estrogen is broken down into reactive species by phase I enzymes, which are then inactivated by phase II enzymes, for example through the methylation of catechol estrogens by COMT or the conjugation of oxidized estrogen-quinones by GSTP1. The relation between COMT and GSTP1 presumably depends on reduced inactivation of the aforementioned reactive estrogen intermediates, because of decreased activities of both enzymes [4,22,23]. The regulated level of estrogen might in turn influence cell proliferation through CCND1 transcription [4][24][25][26][27][79].
The biological pathways investigated in this study have been previously implicated in breast cancer development; however, their genetic interactions, detected through variant alleles (SNPs), have not been previously described. These data and the statistical approaches applied to them have the potential to assist in the identification of complex biological relationships among cancer processes during the development of breast cancer. When moving into the era of "genetic dissection of complex traits" [80], we will need to abandon the concept of single genetic determinants to favor the idea of a "web of causation" [81] involving multiple and complex pathways, which in turn could implicate many genes and environmental factors. This study provides a possible framework for a functional SNP-SNP interaction-based model for breast cancer risk.

Conclusion
Our focus in this study has been to uncover SNP-SNP interactions, which additively or synergistically contribute to breast cancer risk. From our small pool of SNPs, we have shown significant statistical interactions suggesting biological cross talk among genes/SNPs from DNA repair, cell cycle, immune system and carcinogen metabolism pathways. Our immediate task is to apply this strategy to a larger sample, with the aim of replicating our findings and investigating more complex interactions (involving three or more SNPs). This line of research has the potential to identify important cross talk between members of the cancer pathways in the disease state. This study not only provides insight into the analysis of the multi-genic nature of breast cancer, but also provides important information regarding how cell function relates to breast cancer development. We believe that these and other interactions in breast cancer will one day be identified and used in clinics to identify individuals at increased risk of breast cancer and develop preventive strategies.