Common variation in EMSY and risk of breast and ovarian cancer: a case-control study using HapMap tagging SNPs

Background EMSY could be involved in low-level susceptibility to breast and ovarian cancer. Gene amplification is seen in a proportion of breast and ovarian tumours and correlates with poor prognosis in breast cancer patients. Furthermore, the EMSY protein silences a transcription activation domain in BRCA2 exon 3. Methods We used a genetic association study design to determine if common genetic variation (frequency ≥ 5%) in EMSY was associated with breast or ovarian cancer risk in the British population. Haplotype tagging single-nucleotide polymorphisms (htSNPs) were selected from the HapMap database and genotyped using Taqman® in two large study sets of white British women (n [breast set] = 2343 cases and 2284 controls, n [ovarian set] = 864 cases and 864 controls). HapMap data might be insufficient to tag genetic variation in EMSY comprehensively. We therefore screened the gene promoter and coding sequences with denaturing high performance liquid chromatography in order to identify additional SNPs that are most likely to be functional. Results HapMap data on 22 SNPs show that 4 htSNPs tag 4 common haplotypes: rs2282611 (5'up t>g), rs4245443 (IVS7 g>a), rs2513511 (IVS16 a>g), rs2155220 (3'down c>t). We observed no association between any of the genotypes or associated haplotypes and breast or ovarian cancer risk. Seventeen out of the 18 remaining HapMap polymorphisms (94%) were well tagged by the 4 selected htSNPs (r2s > 0.8). Genotype frequencies for two further SNPs identified by screening and located near exon-intron boundaries, rs2508740 (IVS9 a>g) and rs11600501 (IVS10 c>t), were also similar in cases and controls. In order to simulate unidentified SNPs, we performed the leave-one-out cross-validation procedure on the HapMap data; over 95% of the common genetic variation was well represented by tagging polymorphisms. We are therefore likely to have tagged any common, functional variants present in our population. 
Conclusion We found no association between common genetic variation in EMSY and risk of breast or ovarian cancer in two large study sets of white British women.


Background
Breast and ovarian cancer are two of the most common cancers in women in the United Kingdom (Office for National Statistics). Together, they account for about a third of all new cancer cases and a quarter of cancer deaths. Positive family history is a well-established risk factor for both diseases: the risk to first-degree relatives of a case is about twice the population risk [1][2][3]. Most of the excess familial risk associated with breast and ovarian cancer is likely to be genetic in origin [3,4]. However, only a small proportion of this risk is accounted for by the known highly predisposing genes, BRCA1 and BRCA2, while the remainder might be explained by a combination of weakly predisposing alleles [5][6][7][8][9][10]. EMSY (C11orf30) is a novel gene that could be involved in low-level predisposition to breast and ovarian cancer [11]. Its recent discovery generated widespread interest [12,13]. The gene maps to chromosome 11q13, spans 103.3 kilobases and comprises 20 coding exons. EMSY is amplified in 12% of breast cancers and 17% of high-grade ovarian cancers and its amplification has been associated with an increased risk of relapse as well as decreased survival in breast cancer patients [11,14]. Furthermore, the EMSY protein silences the transcriptional activation potential of BRCA2 exon 3, a region deleted in a Swedish breast and ovarian cancer family [11,15].
The case-control study design is well suited to the identification of small-effect genes that are likely to underlie common, complex diseases such as breast or ovarian cancer: a difference in allele frequency is sought between affected individuals and unrelated controls [16]. Two approaches have been proposed. The traditional, hypothesis-driven approach is to investigate single-nucleotide polymorphisms (SNPs) in coding regions, since they are more likely to have a functional role and to directly influence the traits under study [17]. The alternative, indirect approach is to select a set of haplotype-tagging SNPs (htSNPs); htSNPs are informative polymorphisms that best characterize haplotype diversity and therefore genetic variation within the gene [18,19]. They serve as markers to detect associations between a particular region and diseases, whether or not the SNPs themselves have a functional effect [20,21]. It is not necessary to genotype all possible polymorphisms because the alleles of SNPs that are physically close to each other tend to be correlated with each other: they are in linkage disequilibrium (LD) [22]. The HapMap online database allows the indirect approach to be applied readily to many genes or regions [23]. By December 2004, the database held the genotypes of 90 individuals with northern and western European ancestry for over 850,000 SNPs.
We used a genetic association study design to determine if variation in EMSY was associated with breast or ovarian cancer risk. In order to have good power to detect small relative risks, we restricted our attention to common SNPs and haplotypes (frequency ≥ 5%). We first selected htSNPs using HapMap data. We also screened the gene promoter and coding regions in order to identify additional polymorphisms that are likely to be functional, as HapMap might be insufficient to tag genetic variation comprehensively. The selected SNPs were then genotyped in two large case-control sets (one breast cancer set and one ovarian cancer set) of white British women.

Patients and controls
Cases were drawn from the breast and ovarian arms of the SEARCH Cancer Study, an ongoing population-based study with cases ascertained through the East Anglia and West Midlands cancer registries in the United Kingdom [5,24]. All women diagnosed after 1990 with invasive breast cancer under the age of 70 years, or epithelial ovarian cancer under the age of 75 years, were eligible for inclusion. Approximately 65% of eligible breast cancer patients and 60% of ovarian cancer patients have enrolled in the study. Women taking part were asked to provide a 20-ml blood sample for DNA analysis and to complete a comprehensive epidemiological questionnaire. We carried out genotyping on sub-sets consisting of the first 2343 (breast cancer) and 864 (ovarian cancer) cases. Median age at diagnosis was 51 years for breast cancer cases (age range 25 to 69) and 55 years for ovarian cancer cases (age range 16 to 74). Controls (2284 for the breast study and 864 for the ovarian study) were randomly drawn from the Norfolk component of the European Prospective Investigation of Cancer (EPIC) [25]. The EPIC-Norfolk cohort comprises 25,000 individuals resident in Norfolk (East Anglia), aged 45-74 years. The ethnic background of cases and controls is similar, with over 98% being white Europeans. Ethical approval was obtained from the Anglia and Oxford Multicentre Research Committee and the Norwich Local Research Ethics Committee and informed consent was obtained from each patient.

SNP identification and selection
We selected htSNPs from the HapMap database (http://www.hapmap.org, public releases up to September 2004) with the TagSNPs program [26], including 5 kilobases upstream and downstream of the gene and aiming for a minimum r²h of 0.8. Here, r²h is a measure of correlation between haplotypes defined by all SNPs and haplotypes defined by the selected htSNPs. At the time of selection, genotypes were only available for the Centre d'Etude du Polymorphisme Humain (CEPH) samples: these were collected in 1980 from people living in Utah with ancestry from northern and western Europe.
In order to screen the gene promoter and coding regions for polymorphisms, we performed denaturing high performance liquid chromatography (DHPLC) using the Wavemaker detection system (version 4.1, Transgenomics, Crewe, United Kingdom), followed by sequencing (3100 Genetic Analyser, Applied Biosystems, Warrington, United Kingdom) on genomic DNA from 48 random breast cancer cases. A 600-base pair putative promoter starting 500 base pairs upstream of the gene was identified with gene2promoter (http://www.genomatix.de), a program that allows for automated extraction of groups of promoters from a list of accession numbers or gene IDs.

Genotyping
Genotyping was carried out using Taqman® (Applied Biosystems) according to the manufacturer's instructions. Primers and probes were supplied directly by Applied Biosystems except those for IVS9 a>g, which were designed using Primer Express Oligo Design Software v2.0 (Applied Biosystems). Sequences are available on request. Reactions were carried out at 60°C in 384-well plates with cases and controls plated together. Each plate included 2 negative controls with no DNA and 12 samples duplicated on a separate quality control plate. Plates were read on the ABI Prism 7900 using the Sequence Detection Software (Applied Biosystems). Failed genotypes were not repeated.

Statistical methods
For each SNP, deviation of genotype frequencies in controls from the Hardy-Weinberg equilibrium was assessed by a χ² test with one degree of freedom (df). Genotype frequencies in cases and controls were compared by a χ² test for heterogeneity (2 df). Genotype-specific risks were estimated as odds ratios (OR) using the standard cross-product ratio, and confidence intervals were calculated using the variance of log(OR), which was estimated by the standard Taylor expansion. A comparison of haplotype frequencies between cases and controls was carried out using the haplo.score routine implemented in S-plus [27]. Haplotypes with an estimated frequency of less than 5% were pooled. Haplo.score uses a likelihood that depends on estimated haplotype frequencies to test the statistical association between haplotypes and phenotype. It is based on score statistics, which provide both global tests and haplotype-specific tests [27]. The LDA program [28] was used to calculate pairwise LD for SNP pairs in the breast cancer study set. LDA is a Java-based program implementing the EM algorithm for pairwise LD analysis [28].
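The single-SNP tests described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the function names are hypothetical, the genotype counts are assumed to be non-zero, and the confidence interval uses the usual first-order (Taylor expansion) approximation to the variance of log(OR).

```python
import math

def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 df) for deviation from Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele a
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def heterogeneity_chi2(cases, controls):
    """Pearson chi-square for a 2 x k table (2 df when k = 3 genotypes)."""
    n_cases, n_controls = sum(cases), sum(controls)
    total = n_cases + n_controls
    stat = 0.0
    for a, b in zip(cases, controls):
        column = a + b
        for observed, row in ((a, n_cases), (b, n_controls)):
            expected = row * column / total
            stat += (observed - expected) ** 2 / expected
    return stat

def odds_ratio_ci(case_exp, case_ref, ctrl_exp, ctrl_ref, z=1.96):
    """Cross-product odds ratio with a CI based on the variance of log(OR)."""
    odds_ratio = (case_exp * ctrl_ref) / (case_ref * ctrl_exp)
    # First-order Taylor approximation: var(log OR) = sum of reciprocal counts.
    se = math.sqrt(1 / case_exp + 1 / case_ref + 1 / ctrl_exp + 1 / ctrl_ref)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper
```

For a genotype-specific risk, `case_exp` and `ctrl_exp` would be the counts of the genotype of interest (for example heterozygotes) and `case_ref` and `ctrl_ref` the counts of the reference genotype (for example common homozygotes).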
Power was determined using standard statistical methods [29]. We have over 90% power at the 1% significance level to detect a dominant allele with a frequency of 0.2 that confers a relative risk of breast cancer of 1.3 or a relative risk of ovarian cancer of 1.6. Power to detect recessive alleles at the 1% significance level is more limited: 59% for an allele with a frequency of 0.2 that confers a relative risk of breast cancer of 1.5 or 51% for an allele with a frequency of 0.3 that confers a relative risk of ovarian cancer of 1.5.
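The order of magnitude of these power figures can be checked with a rough normal-approximation sketch. This is an illustration under stated assumptions, not the method of reference [29]: it compares the proportion of carriers (dominant model) or rare homozygotes (recessive model) between cases and controls, treats the relative risk as an odds ratio, assumes Hardy-Weinberg proportions in controls, and uses a two-sided 1% significance level. The function names are hypothetical.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def exposed_frequency(allele_freq, model):
    """Control frequency of the at-risk genotype group under Hardy-Weinberg."""
    if model == "dominant":
        return 1.0 - (1.0 - allele_freq) ** 2
    if model == "recessive":
        return allele_freq ** 2
    raise ValueError("model must be 'dominant' or 'recessive'")

def power_case_control(n_cases, n_controls, allele_freq, odds_ratio,
                       model, z_alpha=2.576):
    """Approximate power of a two-sided test comparing two proportions.

    z_alpha = 2.576 corresponds to a two-sided 1% significance level.
    """
    p0 = exposed_frequency(allele_freq, model)
    # Frequency of the at-risk group in cases implied by the odds ratio.
    p1 = odds_ratio * p0 / (1.0 + (odds_ratio - 1.0) * p0)
    se = math.sqrt(p0 * (1 - p0) / n_controls + p1 * (1 - p1) / n_cases)
    return normal_cdf(abs(p1 - p0) / se - z_alpha)

# Study sizes from the text; allele frequency and effect size as quoted above.
breast_dominant = power_case_control(2343, 2284, 0.2, 1.3, "dominant")
ovarian_dominant = power_case_control(864, 864, 0.2, 1.6, "dominant")
```

Under these assumptions both dominant-model scenarios give power above 90%, in line with the text. The recessive-model values from this approximation are lower, as expected, but need not match the quoted percentages exactly because the underlying method may differ.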

Results
Genotypes for 22 common EMSY SNPs were available in HapMap; none of the SNPs was in a coding region. The working density was therefore one SNP per 5 kilobases. The gene consisted of only one LD block [21]. There were 5 common haplotypes which constituted 92% of all the observed haplotypes and were tagged by 5 htSNPs: rs2282611 (5'up t>g), rs4245443 (IVS7 g>a), rs2513511 (IVS16 a>g), rs2155220 (3'down c>t) and rs7106446 (table 1). Taqman® assays were successfully designed for the first four, but an assay could not be designed for rs7106446. There were no alternative SNPs with similar tagging properties. We were thus left with 4 htSNPs tagging 4 common haplotypes.
Genotyping success rate was over 92%. None of the genotype distributions in the controls differed significantly from those expected under Hardy-Weinberg equilibrium. There was no evidence that any of the SNPs is associated with breast (table 2) or ovarian cancer (table 3); genotype-specific ORs were all close to unity with narrow confidence intervals. There was no association of genotype with age in controls and, as expected, age-adjusted risks were close to the unadjusted risks (data not shown). The 4 htSNPs generated 5 common haplotypes in our population; the global test of association was not significant for breast cancer (P = 0.27) or for ovarian cancer (P = 0.93), nor were there any differences between cases and controls for the individual haplotype frequencies (Additional file 1). The number of common haplotypes tagged by the 4 selected htSNPs differed between HapMap (n = 4) and our study (n = 5) because two rare HapMap haplotypes tagged by SNP rs1939468 were grouped into our fifth common haplotype (Additional file 1).
Screening of the promoter and coding regions revealed two further SNPs located near exon-intron boundaries, rs2508740 (IVS9 a>g) located 4 base pairs upstream of exon 10 and rs11600501 (IVS10 c>t) located 14 base pairs upstream of exon 11; neither was associated with breast or ovarian cancer risk (tables 2 and 3). At the time of study, there were four putative, non-validated coding SNPs mentioned in the dbSNP database (http://www.ncbi.nlm.nih.gov/SNP/): rs1954782, rs11822571, rs3753051 and rs1047196. We did not detect any of them. LD was strong (D' > 0.7) across pairs involving all SNPs except IVS10 c>t, while IVS7 g>a and IVS9 a>g were in nearly perfect LD (r² = 0.94) (figure 1).
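The two LD measures quoted above, D' and r², can behave differently for the same pair of SNPs. The following minimal Python sketch shows how both are computed from haplotype and allele frequencies. It assumes the two-locus haplotype frequency is already known and that both SNPs are polymorphic; the LDA program estimates haplotype frequencies from unphased genotypes with the EM algorithm, a step not reproduced here. The function name and the frequencies in the example are hypothetical.

```python
def pairwise_ld(p_ab, p_a, p_b):
    """Return (D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele a at locus 1 and b at locus 2
    p_a, p_b: frequencies of alleles a and b (each strictly between 0 and 1)
    """
    d = p_ab - p_a * p_b
    # Largest magnitude D can take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r2

# Illustrative frequencies only: allele b always occurs on an a-bearing haplotype,
# but the allele frequencies differ, so D' is at its maximum while r^2 is low.
d_prime, r2 = pairwise_ld(p_ab=0.2, p_a=0.5, p_b=0.2)
```

This is why a pair of SNPs can show strong LD by D' yet be a poor proxy for one another by r²; only r² relates directly to the loss of power when one SNP is used as a marker for the other.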

Discussion
This is the first association study reporting results on EMSY, a gene of importance through its interaction with BRCA2 and its amplification status in tumours. We found no association between any of the EMSY genotypes or associated haplotypes and risk of breast or ovarian cancer in a white British population. We could have failed to observe a true association because of a Type II statistical error, but the large size of our study gives us high statistical power and strongly reduces the likelihood that our results are influenced by chance fluctuations in the case or control genotype frequency [30].
An alternative reason for failure to observe a true association could be that our set of tagging SNPs is a poor marker of a true causal variant, which would either be one of the known SNPs in the gene or an as yet unidentified variant. In HapMap, common EMSY haplotypes were tagged by 5 SNPs. However, an assay for rs7106446 could not be designed and thus our htSNP set was suboptimal. Where a tagging SNP is used as a marker for a true disease-predisposing SNP, the effective sample size is proportional to the bivariate correlation coefficient (r²) between the marker and causal SNPs [31]. r²s is the squared correlation coefficient between multi-locus haplotypes and individual SNPs and is analogous to r². In order to establish how well we had tagged the known SNPs with our set of tagging SNPs, we calculated r²s [26] between the 4 selected htSNPs and the 18 remaining HapMap polymorphisms. Seventeen (94%) SNPs were tagged with r²s > 0.8 but 1 SNP, rs7106446, was tagged with r²s = 0.46. Loss of power was therefore marginal for all HapMap SNPs except rs7106446; for a SNP tagged with r²s = 0.85, we had 89% power at the 1% significance level to detect a dominant allele with a frequency of 0.3 that confers a relative risk of ovarian cancer of 1.5.
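The relationship between tagging quality and effective sample size can be made concrete with a short arithmetic sketch. It simply applies the proportionality described above to the ovarian case set; the function name is hypothetical and the calculation is illustrative only.

```python
def effective_sample_size(n, r2):
    """Sample size at the causal SNP equivalent to n samples typed at a marker."""
    return n * r2

# 864 ovarian cancer cases, with the two r2s values quoted in the text.
well_tagged = round(effective_sample_size(864, 0.85))    # about 734
poorly_tagged = round(effective_sample_size(864, 0.46))  # about 397
```

A SNP tagged with r²s = 0.85 therefore costs roughly 15% of the sample, whereas a SNP tagged with r²s = 0.46 leaves less than half of it.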
It is also possible that we have not adequately tagged an unidentified, disease-predisposing SNP. Whole-gene resequencing across a sample population would be required to identify all existing polymorphisms and allow investigators to select htSNPs that tag all genetic variants. The HapMap project does not re-sequence the genome; it validates SNPs from the dbSNP public database, aiming for a density of polymorphisms that cover the whole of genetic variation across the human genome. Comprehensive tagging requires a genotyping density of around 1 SNP per 2.5 kb [32]. The 1 SNP per 5 kb density available for EMSY in HapMap might therefore be insufficient. In order to identify additional SNPs that are most likely to be functional [17], we screened the gene promoter and coding sequence with DHPLC, a technique with an estimated sensitivity of 94% [33]. Two SNPs near exon-intron boundaries were identified but neither was associated with breast or ovarian cancer.
We also assessed how well a set of htSNPs would have tagged any unidentified SNPs using a leave-one-out cross-validation procedure on the HapMap data: each of the 22 known SNPs was dropped in turn and htSNPs were selected from the remaining 21, thus simulating unidentified polymorphisms [32]. The ability of htSNPs to tag the dropped SNP was then evaluated by calculating r²s [26]. Mean r²s was 0.94. Twenty-one (95.4%) out of 22 dropped SNPs were tagged with r²s > 0.8, and only 1 (4.6%) was tagged with 0.4 < r²s < 0.8. Over 95% of the common genetic variation in EMSY should therefore be well represented by tagging polymorphisms. We are therefore likely to have tagged any common, functional variants present in our population. After this study was completed and the first version of the manuscript submitted, genotyping data in a white American population for rs3753051, a synonymous coding SNP in exon 19, were released in dbSNP. We were able to assess how this polymorphism was tagged by our set of SNPs as genotypes from the same individuals were also available for 5'up t>g,
This study design cannot exclude the involvement of a rare allele in predisposition to breast or ovarian cancer; for example, CHEK2*1100delC has a frequency of around 1% and was recently shown to confer a two-fold increased risk of breast cancer [34]. Our study set would be too small to detect the effect of such an allele if it doubled the risk of ovarian cancer. Some authors have advocated the use of histopathologic or demographic data that subclassify individuals in order to identify homogeneous subsets for analysis [35]. In the absence of any main effect or strong biological rationale, we have not carried out subgroup analyses, as much larger sample sizes would be required to obtain reliable results. The number of possible post-hoc subgroup analyses is large and there is a strong possibility that one or more tests will be statistically significant simply by chance [36].
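The leave-one-out procedure described earlier in this section can be sketched in Python as follows. This is a simplified stand-in, not the TagSNPs program: it selects tags greedily by pairwise r² and scores each dropped SNP by its best pairwise r² with any tag, whereas the study used the multi-locus measure r²s. Haplotypes are assumed to be phased and coded 0/1, and all function names and the toy data are hypothetical.

```python
def r2(haps, i, j):
    """Pairwise r^2 between SNP columns i and j of phased 0/1 haplotypes."""
    n = len(haps)
    p_i = sum(h[i] for h in haps) / n
    p_j = sum(h[j] for h in haps) / n
    p_ij = sum(h[i] * h[j] for h in haps) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    return 0.0 if denom == 0 else (p_ij - p_i * p_j) ** 2 / denom

def greedy_tags(haps, snps, threshold=0.8):
    """Pick tags until every SNP in `snps` is a tag or has r^2 >= threshold with one."""
    untagged = set(snps)
    tags = []
    while untagged:
        # Choose the SNP that covers the most still-untagged SNPs.
        best = max(sorted(untagged),
                   key=lambda s: sum(r2(haps, s, t) >= threshold for t in untagged))
        tags.append(best)
        covered = {t for t in untagged if r2(haps, best, t) >= threshold}
        untagged -= covered | {best}
    return tags

def leave_one_out(haps, threshold=0.8):
    """Drop each SNP in turn, re-select tags, and score the dropped SNP."""
    n_snps = len(haps[0])
    scores = {}
    for dropped in range(n_snps):
        remaining = [s for s in range(n_snps) if s != dropped]
        tags = greedy_tags(haps, remaining, threshold)
        scores[dropped] = max(r2(haps, dropped, t) for t in tags)
    return scores

# Toy data: SNPs 0 and 1 are perfectly correlated, as are SNPs 2 and 3.
toy_haps = [(0, 0, 0, 0), (0, 0, 1, 1), (1, 1, 0, 0), (1, 1, 1, 1)]
toy_scores = leave_one_out(toy_haps)
```

In the toy data every dropped SNP has a perfectly correlated partner among the remaining SNPs, so each is recovered by the tags. A SNP with no correlated neighbour would score poorly, which is the situation the cross-validation is designed to reveal.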
We are reporting results for a set of htSNPs selected from HapMap. We used genotypes for the CEPH samples to choose htSNPs. According to the HapMap website, it is unclear how accurately the CEPH samples reflect the patterns of genetic variation in people with northern and western European ancestry. Our results suggest that they correctly predict genetic variation in our white British population: allele frequencies in the breast study controls were similar to those obtained from HapMap (P = 0.57, 0.99, 0.88 and 0.85 for 5'up t>g, IVS7 g>a, IVS16 a>g and 3'down c>t, respectively), thus strengthening the argument for widespread use of the database for htSNP selection. A predisposing SNP might have a differential effect in another ethnic group via gene-gene or gene-environment interactions, although in a recent study of the genetic effects for 43 validated gene-disease associations across 697 study populations of various descents, Ioannidis et al. concluded that, even if frequencies of polymorphisms varied among populations, their biological impact on the risk for common diseases should be consistent across traditional ethnic boundaries [37,38].

Conclusion
We saw no association between common SNPs in EMSY or their associated haplotypes with risk of breast or ovarian cancer in two large study sets of white British women.
Figure 1 Linkage disequilibrium (LD). Pairwise LD measures of D' (bottom left half) and r² (top right half) for the six single-nucleotide polymorphisms (SNPs) genotyped in the breast and ovarian cancer study sets.