ELLIPSE study subjects and genotype
Genotype data from the Elucidating Loci Involved in Prostate Cancer Susceptibility (ELLIPSE) consortium prostate cancer meta-analysis (dbGaP Study Accession: phs001120.v1.p1) were accessed to analyze Affymetrix genotype calls for 91,644 male prostate cancer (PrCa) cases and controls.
Quality assurance
PLINK (RRID:SCR_001757) genotype files consisting of 505,219 calls were compiled from the following consent groups: c1-c3, c6, c8, c10-c18, c20, c23, c25, c27-c28. Pre-imputation processing of autosomal and X chromosome genotypes followed the steps below:
1. Duplicated variants were removed.
2. Heterozygous haploid SNPs were set to missing.
3. SNPs with call rate < 90% were removed.
4. SNPs with minor allele frequency (MAF) < 1% were removed.
5. Individuals with genotype coverage < 90% were removed.
6. Non-ACGT variants were removed.
Strand flips were reversed using snpflip. After preprocessing, the remaining 410,116 SNPs and 91,644 individuals were submitted to the secure Michigan Imputation Server (RRID:SCR_017579) [26]. Whole-genome SNPs were imputed with Minimac4 (RRID:SCR_009292) using the ancestry-matched 1000 Genomes Project Phase 3 version 5 reference panel (RRID:SCR_008801). Finally, post-imputation duplicated SNPs and SNPs with MAF < 1% were removed.
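As an illustration only, the call-rate, MAF and genotype-coverage filters (steps 3 to 5) can be sketched in plain Python. This is not the study pipeline, which used PLINK; the function names and the in-memory matrix layout are hypothetical, and the MAF calculation assumes diploid (autosomal) dosages.

```python
from typing import List, Optional, Tuple

# Dosage of the alternate allele (0, 1 or 2); None marks a missing call.
Genotype = Optional[int]


def call_rate(calls: List[Genotype]) -> float:
    """Fraction of non-missing calls (0.0 for an empty list)."""
    if not calls:
        return 0.0
    return sum(g is not None for g in calls) / len(calls)


def minor_allele_frequency(calls: List[Genotype]) -> float:
    """MAF from diploid dosages, ignoring missing calls."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1 - alt_freq)


def filter_genotypes(
    matrix: List[List[Genotype]],
    min_call_rate: float = 0.90,
    min_maf: float = 0.01,
) -> Tuple[List[int], List[int]]:
    """matrix[i][j] is the genotype of individual i at SNP j.

    SNP filters (steps 3 and 4) are applied first; the individual
    filter (step 5) is then applied to the retained SNPs.
    Returns (kept_individual_indices, kept_snp_indices).
    """
    if not matrix:
        return [], []
    kept_snps = []
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix]
        if (call_rate(column) >= min_call_rate
                and minor_allele_frequency(column) >= min_maf):
            kept_snps.append(j)
    if not kept_snps:
        return [], []
    kept_individuals = [
        i for i, row in enumerate(matrix)
        if call_rate([row[j] for j in kept_snps]) >= min_call_rate
    ]
    return kept_individuals, kept_snps
```

Applying the SNP filters before the individual filter mirrors the order of the list; a different order can retain a slightly different set of samples.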
Ancestry likelihood calculation (FROG-kb)
For ancestry group calculations, we opted for an ancestry prediction tool that, unlike principal component analysis (PCA), does not depend on relationships with other individuals in the sample. FROG-kb [27] uses the Kidd AISNP panel (55 SNPs) to predict likelihood ratios for world geographic regions. Likelihood ratios for 160 populations were calculated and averaged. European ancestry likelihood ratios were determined from populations in the “Europe” region and African ancestry likelihood ratios were determined from populations in the “African” region. For the final European ancestry group, we used a European log likelihood > -10, resulting in 5567 individuals. For the final African ancestry group, we used a European log likelihood < -15 and an African log likelihood > -15, resulting in 4533 individuals.
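The grouping thresholds can be written as a small decision rule. The sketch below is illustrative: the function and argument names are hypothetical, and it assumes the two region-level log likelihoods have already been computed and averaged on the scale used for the thresholds above.

```python
from typing import Optional


def assign_ancestry_group(european_ll: float, african_ll: float) -> Optional[str]:
    """Apply the stated log-likelihood thresholds.

    Returns "European", "African", or None when neither rule is met.
    """
    if european_ll > -10:
        return "European"
    if european_ll < -15 and african_ll > -15:
        return "African"
    return None
```

Individuals with a European log likelihood between -15 and -10, or with both values below -15, satisfy neither rule and fall outside both groups.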
Genome-wide association analyses (GWAS)
The PLINK (RRID:SCR_001757) GLM method [28] was used to test association with PrCa case/control status in the European and African ancestry groups. All associations were adjusted for age and the first 10 principal components (PCA with the 55-SNP Kidd panel).
Polygenic risk score calculation
Association analyses were conducted within European, African or mixed ancestry training sets. Significant variants were identified through PLINK (RRID:SCR_001757) linkage-based clumping using a p1 threshold of 5e-08, a p2 threshold of 1e-05, an r2 threshold of 0.1 and a distance threshold of 1000 kb. Ten, seven and fourteen significant variants were identified through African, European and trans-ancestry analysis, respectively. For polygenic risk score (PRS) construction, variants were weighted by the log (base 10) odds ratio from the training set association statistics, oriented to the PrCa risk allele, and summed. Receiver operating characteristic area under the curve (ROC-AUC) evaluation across folds was conducted using polygenic scores as predictions.
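The weighting scheme can be illustrated with a short sketch. It assumes one plausible reading of "oriented to the PrCa risk allele": when the reported odds ratio for the counted allele is below 1, the odds ratio is inverted and the other allele is counted instead. The function name and variant identifiers are hypothetical.

```python
import math
from typing import Dict


def polygenic_score(dosages: Dict[str, float], odds_ratios: Dict[str, float]) -> float:
    """Sum of risk-allele dosages weighted by log10 odds ratios.

    dosages[v] is the dosage (0 to 2) of the allele that odds_ratios[v]
    refers to. Variants with an odds ratio below 1 are flipped so that
    every weight is non-negative and every dosage counts the risk allele.
    """
    score = 0.0
    for variant, odds_ratio in odds_ratios.items():
        dosage = dosages[variant]
        if odds_ratio < 1.0:
            odds_ratio = 1.0 / odds_ratio  # orient to the risk allele
            dosage = 2.0 - dosage          # count the other allele
        score += math.log10(odds_ratio) * dosage
    return score
```

Because log10 and the natural log differ only by a constant factor, the choice of base rescales the score but does not change the ranking of individuals, and therefore does not change the AUC.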
For mismatched ancestry group analysis, European ancestry training sets were used for prediction on African ancestry test sets. For trans-ancestry group analysis, European and African ancestry training sets were combined and tested on African ancestry test sets. All three ancestry group PRSs were evaluated using tenfold cross-validation. Per-fold and overall AUCs are reported. Confidence intervals were calculated using the pROC R package.
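A minimal sketch of fold assignment and AUC computation follows. It is not the pROC implementation and does not compute confidence intervals; the function names are hypothetical, the AUC is the pairwise (Mann-Whitney) estimate with ties counted as one half, and the shuffled round-robin split is an assumed fold scheme, since the text does not state how folds were drawn.

```python
import random
from typing import List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random case outscores a random control."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("AUC needs at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle indices 0..n-1 and deal them into k near-equal test folds."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]
```

For each fold, the remaining nine folds would serve as the training set for the association statistics, and `roc_auc` would be applied to the scores of the held-out fold. The pairwise loop is quadratic in cohort size, so a rank-based formulation would be preferable for large cohorts.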
To contextualize our results in relation to other genetic risk models, we compared our PRState score (composed of 10 African ancestry-specific variants) to variants published recently in a large meta-analysis of prostate cancer by Conti et al. [11]. Half (5) of the variants used in the PRState score were in high linkage disequilibrium (r2 > 0.3) with Conti et al. variants [11]. We excluded these variants and constructed a Conti PRS using the reported odds ratios for the African group. To evaluate the PRState and Conti polygenic risk scores, we used both scores as features in a logistic regression model with default parameters. Predicted probabilities were used in ROC evaluation. For the Million Veteran Program (MVP), genotype dosages were extracted for the 10 PRState variants across participants and weighted by the log (base 10) odds ratio from the best-performing fold in ELLIPSE. Performance was evaluated using ROC-AUC analysis in the European and African ancestry groups.
Million Veteran Program study subjects and genotype
Individual ancestry groups in the Million Veteran Program were characterized through Harmonized Ancestry and Race/Ethnicity (HARE) grouping [29]. HARE was developed specifically to categorize MVP individuals based on self-reported and genetic ancestry. It uses a support vector machine to output probabilities of an individual’s ancestry group from self-identified and genetic ancestry.
PrCa, metastatic PrCa and fatal PrCa status were determined from ICD-9/10 diagnosis and procedure codes, CPT and HCPCS procedure codes, laboratory values, medications and clinical notes from inpatient, outpatient and fee-based care in the VA healthcare system. Family history information was available for only 55,610 of 121,964 African ancestry individuals and 322,706 of 461,627 European ancestry individuals in MVP. Education and income variables were available for 412,174 individuals and were used to stratify individuals by socioeconomic status as follows: high socioeconomic status was defined as income > $50,000 and at least a bachelor’s degree, and low socioeconomic status was defined as income < $50,000 and no bachelor’s degree.
Analyses of genetic information alone used the full population; additional multivariable association analyses used the subset with family history information available. To evaluate the role of socioeconomic factors in prostate cancer prediction, we conducted logistic regression analysis with PRState, family history, age, education level, income level, and the top 10 principal components. To evaluate predictive value, we conducted PRState ROC-AUC analysis separately in European and African ancestry individuals. For socioeconomic analysis, we conducted PRState ROC-AUC analysis separately in African ancestry men of high and low socioeconomic status.
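The socioeconomic strata can be expressed as a simple rule. The sketch below is illustrative only: the function and argument names are hypothetical, and it assumes income is available as a numeric value, which the text does not state.

```python
from typing import Optional


def socioeconomic_stratum(income: float, has_bachelors: bool) -> Optional[str]:
    """Return "high", "low", or None for combinations matching neither definition."""
    if income > 50_000 and has_bachelors:
        return "high"
    if income < 50_000 and not has_bachelors:
        return "low"
    return None
```

Under these definitions, individuals with discordant income and education (for example, income above $50,000 without a bachelor’s degree) belong to neither stratum.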