Single nucleotide polymorphism analysis in breast cancer using MassARRAY from Jammu and Kashmir, India


 BackgroundBreast Cancer (BC) is associated with inherited gene mutations. High throughput genotyping of BC samples has led to the identification and characterization of biomarkers for diagnosis of BC. The most common genetic variants studied are SNPs (Single Nucleotide Polymorphisms) that determine susceptibility to an array of diseases thus serving as a potential tool for identifying the underlying causes of breast carcinogenesis. MethodsSNP genotyping employing the Agena MassARRAY offers a robust, sensitive, cost-effective method to assess multiple SNPs and samples simultaneously. In this present study, we analyzed 15 SNPs of 14 genes in 550 samples (150 cases and 400 controls). We identified four SNPs of genes TCF21, SLC19A1, DCC, and ERCC1 showing significant association with BC in the population under study. ResultsThe SNPs were rs12190287 (TCF21) having OR 1.713 (1.08-2.716 at 95% CI) p-value 0.022 (dominant), rs1051266 (SLC19A1) having OR 3.461 (2.136-5.609 at 95% CI) p-value 0.000000466 (dominant), rs2229080 (DCC) having OR 0.6867 (0.5123 -0.9205 at 95% CI) p-value 0.0116 (allelic) and rs2298881 (ERCC1) having OR 0.669 (0.46-0.973 at 95% CI), p-value 0.035 (additive) respectively. The in-silico analysis was further used to fortify the above findings. ConclusionIt is further anticipated that the variants should be evaluated in other population groups that may aid in understanding the genetic complexity and bridge the missing heritability.


Conclusion
It is further anticipated that the variants should be evaluated in other population groups that may aid in understanding the genetic complexity and bridge the missing heritability.

Background
Cancer is a neoplastic disease consisting of cancer cells that harbor numerous biological capabilities that occur due to the accumulation of various genetic aberrations and genomic alterations (1). In 2018, about 18 million new cases of cancer were recorded and about 9.5 million deaths occurred with lung and BC; being the leading cause of mortality in men and women, respectively (2). BC is the leading cancer of women in India, with 27.7% of all the cancers in women (3). A study conducted in the Kashmir province of Jammu and Kashmir highlighted BC to be the second most common cancer in women with 16.1% of all the cancers, closely following colorectal cancer (16.8%) (4). Despite the exacerbating rate of BC in J&K, there is a high susceptibility of post-menopausal women to develop breast cancer, especially with numerous factors underplay (5,6). Early menarche and late menopause ensure a longer exposure to the hormone estrogen thus increasing the risk of BC (7). Various studies have identi ed the variations in high penetrant genes like BRCA1, BRCA2, PTEN, TP53, CDH1, and STK11 along with moderate penetrant genes such as CHEK2, BRIP1, ATM and PALB2 (8,9) and their association with BC. There are about 182 loci that have been identi ed and are susceptible to BC(10) which accounts for 30% of the genetic heritability of BC (11). Keeping in view the missing heritability and scanty literature of BC in studied population group, we investigated cancer-associated variants (12,13) to get an insight into their role in BC development among the population of Jammu and Kashmir. With an effort to bridge this gap, we conducted a casecontrol association study in North Indian post-menopausal women of Jammu and Kashmir.

Sample collection and DNA isolation
Sampling was carried out from January 2015 to December 2018. A total of 550 age-matched samples (150 cases and 400 controls) were recruited for the study. The cases were histo-pathologically con rmed with no evident history of any other cancer. The Institutional Ethical Review Board (IERB) (SMVDU/IERB/18/70) of Shri Mata Vaishno Devi University approved this study. After written informed consent, blood samples, about 2 ml of venous blood was collected into EDTA vacutainers from patients and healthy controls (age and ethnicity matched). Qiagen DNA isolation kit (Catalogue number 51206) was employed for the genomic DNA isolation from a whole blood sample, using the manufacturer's protocol. Quality check was performed using 0.8% gel electrophoresis and was quanti ed using the Eppendorf India Pvt. Ltd. Bio-spectrophotometer™.

SNP Genotyping
Agena Bioscience MassARRAY iPLEX GOLD chemistry technology was used to perform genotyping.
Primers anking the gene region of the SNPs were designed in the AgenaCx platform (GRCh38/hg38) (https://agenacx.com/). DNA sample with a concentration of 10 ng/μl was used with volume 1μl for the PCR reaction. The rst PCR was performed using forward and reverse primer pool and shrimp alkaline phosphatase was then used to neutralize unincorporated dNTPs from the rst PCR reaction. iPLEX extension reaction was performed using pool single base extension primers. Desalting of the ampli ed products was performed using SpectroCLEAN resin following the protocol (14,15). RS100 nanodispenser was used to transfer the cleaned extension products from 384 muti-titration PCR plate to 384 SpectroCHIP and the chip was placed in Agena MassARRAY compact mass spectrometer. A spectrum of the different products was acquired by SpectroAcquire software. Data in the form of genotypes and call clusters were obtained using Agena MassARRAY Typer V4.0.5. The representative work ow of genotyping has been summarized in Figure S1

Candidate Gene and SNP Selection
The genes selected for the present study are involved in following cancer-related pathways: Tumor suppressor, DNA damage and repair, regulation of cell proliferation, telomere maintenance, cell cycle regulation, and apoptosis. Variations with a potential functional effect like causing amino acid alteration, present in the promoter and UTR region, having an effect on splicing site and transcription binding sites were also selected for the study. The potential genetic variants were retrieved from the database. Variations having an allelic frequency greater than 5% in Gujarati Indians from Houston (GIH) and Western European Ancestry (CEU) as it has been observed that genetic makeup of the studied population is similar to Europeans(16). The studied genes and variants with their putative role and annotations are summarized in the Table S3.

Statistical analysis
Chi-square analysis was performed by using the Plink V.1.09 to observe the allelic association between cases and controls (17). Further, the association analysis was performed with different genetic models, and the association was calculated in terms of Odds Ratio (OR) at 95% con dence interval (CI) using Statistical Package for the Social Sciences (SPSS) version 20. The corrected OR was obtained using unconditional logistic regression analysis with age and, BMI as confounding factors. Further, the prediction of mRNA secondary structure of the variants have been illustrated by RNA fold (18), and minimum free energy (MFE) calculated from RNA fold was used to measure the stability of the secondary structures of mRNAs. The gene-gene interaction was performed using Cytoscape version 3.8.0 (19) by installing the GeneMANIA plugin (20). The allele frequencies were compared with the global database 1000 genome (21) using a web-based application LD link (22) to observe the genetic heterogeneity in allele frequencies among different population groups.

Genetic variants associated with the increased risk of BC in the population under study
Out of the fteen variants shortlisted four variants namely rs1051266, rs12190287, rs2229080, and rs2298881 in SLC19A1, TCF21, DCC, and ERCC1 genes respectively, were found to be signi cantly associated with BC in the studied cohort. The variant rs1051266 is located in the second exon of the gene SLC19A1 and is a missense variant. The variant shows a signi cant association with BC with OR 1.745 (1.321-2.304 at 95% CI) and has a p value=7.98E-05 (Allelic). A signi cant association of the variant was also observed in the dominant model with OR 3.461 (2.136-5.609 at 95% CI) and p value=0.000000466 ( Table 1). The variant was providing risk for BC in the studied cohort. The variant rs12190287 is a 3'UTR variant located in the third exon of the gene TCF21. The allelic association of the variant showed a weak association with BC and OR observed was 1.306 (0.995-1.713 at 95% CI) having p-value=0.0491 (Allelic) to observe the maximum effect of allele C, the dominant model was evaluated. Interestingly, the OR observed was 1.713 (1.08-2.716 at 95% CI) and p value=0.022 (Table S2). The variant was providing risk for BC in the dominant model in the studied cohort. The variant rs2298881 is located at the intron of the ERCC1 gene. Variant rs2298881 was found signi cantly associated with BC and the OR observed was 0.6981 (0.36-0.71 at 95% CI) and p value=0.01169 (Allelic). A signi cant association was observed with the additive model with OR 0.669 (0.46-0.973 at 95% CI), p-value=0.035 (Table S2). The variant rs2229080 is a missense variant, located on the third exon of the DCC gene. The variant rs2229080 showed protection against BC having OR 0.6867 (0.5123-0.9205 at 95% CI) and p value=0.011 (Allelic).
However non-signi cant association was observed for rs2229080 using the dominant model: OR 0.797 (0.517-1.229 at 95% CI), p-value=0.305, recessive model: OR 0.505 (0.252-1.014 at 95% CI), p value=0.05 and Additive model: OR 0.758 (0.551-1.042 at 95% p-value = 0.088 (Table S2).  correlates with increased stability in the structure. The observed decrease in MEF of the wild type allele for the variants rs1051266 and the MEF of the centroid structure for the variant rs12190287 corresponds with the increased stability of the wild type allele in both the cases. The altered allele of both the variants has been observed to risk causing. The decrease in the stability of the altered structure might be a potent factor posing the risk threat. The altered alleles of the variants rs2298881 and rs2229080 have a low MEF than the wild allele and have been shown for protecting the BC. The low MEF of the altered alleles confers greater stability to these structures than the wild alleles with a higher MEF value. The free energy values have been summarized in the Table S4.

Network Analysis
The potential involvement of DCC1, ERCC1, SLC19A1, and TCF21 relevant genes by querying the genes in the GeneMANIA(20) ( gure 2). This showed that the expression of the genes is correlated with that of DCC. Further network analysis revealed that the DCC gene displays protein-protein interaction with CASP9 that is associated with multiple cancer risks (23). An interaction, however meek, is seen among the DCC interacting proteins and TCF21. No associated network of interacting proteins was found interacting with SLC19A1.

Discussion
A replicative case-control study was done in about 550 samples to analyze the variants using Agena MassARRAY genotyping for the population of Jammu & Kashmir. Here, we investigate various BC loci in cases and controls. We investigated using the Agena massARRAY platform and identi ed numerous SNPs that were found signi cantly associated with BC genome-wide and independent of each other. Our study demonstrated 4 genome-wide loci which have been associated with BC development in the population under study. The rsIDs rs12190287 and rs1051266 associated with the genes TCF21 and SCL19A1 are causing risk in our population group. We also found six variants following HWE however not showing signi cance with BC development. The allele frequency of all the variants is shown in Figures  S2-S11.
The variant rs1051266 is located on the SLC19A1 gene. SLC19A1 or Solute Carrier family member protein is a gene implicated in placental carcinomas and pediatrics osteosarcomas. Studies have shown the SLC19A1 gene variants to be associate with BC risk in worldwide populations (24) including African American women (25). Our data revealed the variant rs1051266 to be signi cantly associated with BC risk in the population under study. Further, the bioinformatic analysis revealed that the associated variants are conserved in primates including humans and have been located in the conserved domain region ( Figure  S13). We also studied the genotype tissue expression of the variants (GTEx) with their NES (Normalized Effect Size) values, which have been shown in Table S3. The GTEx with NES (Normalized Effect Size) was used to study the correlation between the genetic variation and gene expression in the human tissues. The variant 1051266 (SLC19A1) was signi cantly showing expression in breast tissue with an NES value of -0.4333 and a p-value of 2.4e-6 (<0.05).
TCF21 or Transcription factor gene is a tumor suppressor gene and is associated with Uterine Corpus carcinoma and Pericoronitis. TCF21 is found mutated in several types of cancers (26) Studies have shown a lower expression of TCF21 in breast tumor tissues corresponding to enhanced tumor size and increased lymph node metastasis (27). We analyzed the variant rs12190287 G>C of the TCF21 gene and found it to be signi cantly associated with BC in the studied population group. The variant was found causing risk for BC in the population. The variant rs12190287 (TCF21) showed signi cant expression in breast tissue, with an NES value of 0.210 and a p-value of 3.3e-5. The positive NES value indicated the up-regulated expression in the breast tissue.
ERCC1 gene or Excision Repair Cross-Complementing Rodent Repair gene which harbored the rs2298881 variant, functions in a nucleotide excision repair pathway (28). ERCC1 is found to be associated with multiple cancers. ERCC1 variants have also been linked to an increased risk of BC (29) in women. The variant rs2298881 C>A was found signi cantly associated with breast cancer. The variant was found to be conferring protection for our BC in the studied population group. The variant rs2298881 (ERCC1) showed signi cant expression in the breast with an NES value of -0.260 and a p-value of 3.8e-9.
DCC or Deleted in Colorectal Cancer is a gene encoding the netrin1 receptor. Netrin1 receptor is a transmembrane receptor belonging to the immunoglobulin superfamily. DCC gene is a tumor suppressor gene and is frequently mutated in colorectal carcinomas. DCC is abundantly expressed by neurons and stimulates cell survival and axon regeneration. Apart from mutations in colorectal cancers, studies have highlighted the role of DCC in BC. A variant of the DCC gene, rs2229080, has been found associated with increased BC risk (13). Our study revealed that rs2229080 G>C was signi cantly associated with breast cancer and the altered allele C was causing protection in the studied population group. Though for the variant rs2229080 ( The RNA fold analysis revealed the MEF and structural differences in the wild and the altered allele. We also studied the difference in the secondary structures and the MEF values of the wild type allele and the variant allele. There was a decreased MEF in the case of the wild type allele of the variants rs12190287 and rs1051266 providing them an enhanced stable structure than the altered allele. Whereas, the rsIDs rs2229080 and rs2298881 associated with the genes DCC and ERCC1 were found to be causing protection to BC. The MEF values of these variants were lower for the altered allele thus suggesting a more stable structure of these allele variants. These stable structures were further associated with the conferring of protection against BC. Further analysis of the second structure of the genes with the variants highlighted a substantial difference in the MFE and MFE of centroid secondary structure. The differences in the MFE values of the variants have been summarized in the Table S4. The differences in the secondary structures of the alleles have been shown in gure 1. On comparing the allele frequencies of the associated allele with 1000genome data, we found substantial differences in the allele frequencies. The differences in the allele frequency of the associated alleles have been depicted in the Figure S12. An intermediary value of allele frequency for the variant rs1051266 was observed. The allele frequency in the Indian subcontinent comprising of the PJL (Punjabi's from Lahore, Pakistan), ITU (Indian Telugu from the UK) and STU (Sri Lankan Tamil from the UK) was intermediary, around 0.4 in a range of 0 (low) to 1 (high). Similar allele frequencies were observed in the GIH (Gujaratis Indian from Houston,

Conclusion
This is the rst study that provided the preliminary data for the J&K population highlighting the role of 15 variants with BC development. Among the studied variants, four variants namely rs1051266, rs12190287, rs22290,80 and rs2298881 showed signi cant association with BC. The logistic regression with age and BMI revealed the best-t model to be dominant in the case of rs1051266 (SLC19A1) and rs12190287 (TCF21), while allelic in case of rs2229080 (DCC) and additive in the variant rs2298881 (ERCC1). These variants displayed maximum effect in the populations when incorporated in these models. The RNA fold analysis revealed the structure variations in wild and altered allele along with the differences in the free energy of the secondary structures. In silico analysis of these variants showed the variants to be evolutionarily conserved thus harboring minimum alterations in them over generations, which enables them to maintain their putative structure and function stability e ciently. Three variants rs1051266, rs12190287 and rs2298881 showed signi cant eQTL effect. The network analysis gave a deeper insight into the nuanced interactions of the candidate genes with other common proteins, alleging a common pathway of function. The associated loci may affect the development of BC in the women of Jammu and Kashmir. It might be plausible that there are other variants, which have not been studied and have an eloquent association with BC. Our results provide a clue for further functional validation to reveal underlying genetic mechanisms in BC. The predicted MEF and MEF centroid secondary structures of RNA and energies that were calculated by using RNA fold.

Figure 2
A network analysis of the associated genes using GeneMANIA database.