Cholesterol Synthesis Pathway Genes in Prostate Cancer are consistently downregulated when tissue confounding is minimized

The relationship between cholesterol and prostate cancer has been extensively studied for decades, where high levels of cellular cholesterol are generally associated with cancer progression and less favorable outcomes. However, the role of in vivo cellular cholesterol synthesis in this process is unclear, and data on the transcriptional activity of cholesterol synthesis pathway genes in tissue from prostate cancer patients are inconsistent. A common problem with cancer tissue data from patient cohorts is the presence of heterogeneous tissue which confounds molecular analysis of the samples. In this study we present a method to minimize systematic confounding from stroma tissue in seven patient cohorts consisting of 1713 prostate cancer and 230 normal tissue samples. When confounding was minimized, differential gene expression analysis over all cohorts showed robust and consistent downregulation of nearly all genes in the cholesterol synthesis pathway. Additional analysis also identified cholesterol synthesis as the most significantly altered metabolic pathway in prostate cancer. This surprising observation is important for our understanding of how prostate cancer cells regulate cholesterol levels in vivo. Moreover, we show that tissue heterogeneity explains the lack of consistency in previous expression analysis of cholesterol synthesis genes in prostate cancer.

The relationship between cholesterol and prostate cancer has been extensively studied for 3 decades, where high levels of cellular cholesterol are generally associated with cancer 4 progression and less favorable outcomes. However, the role of in vivo cellular cholesterol 5 synthesis in this process is unclear, and data on the transcriptional activity of cholesterol 6 synthesis pathway genes in tissue from prostate cancer patients are inconsistent. A common 7 problem with cancer tissue data from patient cohorts is the presence of heterogeneous tissue 8 which confounds molecular analysis of the samples. In this study we present a method to 9 minimize systematic confounding from stroma tissue in seven patient cohorts consisting of 10 1713 prostate cancer and 230 normal tissue samples. When confounding was minimized, 11 differential gene expression analysis over all cohorts showed robust and consistent 12 downregulation of nearly all genes in the cholesterol synthesis pathway. Additional analysis 13 also identified cholesterol synthesis as the most significantly altered pathway in prostate 14 cancer. This surprising observation is important for our understanding of how prostate cancer 15 cells regulate cholesterol levels in vivo. Moreover, we show that tissue heterogeneity explains 16 the lack of consistency in previous expression analysis of cholesterol synthesis genes in 17 prostate cancer .  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  Introduction   1 Increased cholesterol levels in enlarged prostates and prostate cancer have been observed for 2 decades 1-3 , and extensive research has suggested that cholesterol have a role in prostate cancer 3 growth and progression 3-5 . Cholesterol homeostasis is important for cell viability, and is 4 dynamically regulated by a balance between synthesis, uptake, efflux and storage of 5 cholesterol 4,6-9 . For cellular cholesterol synthesis, the conversion of 3-hydroxy-3-6 methylglutaryl coenzyme A (HMG CoA) to mevalonate is the first rate limiting step, which is 7 followed by over 20 flux controlling enzymatic reactions before cholesterol is synthesized as 8 the final product. In prostate cancer cell-lines, elevated activity of the cholesterol synthesis 9 pathway supports cancer growth and aggressiveness [10][11][12][13][14][15][16] . This has led to the general view that 10 increased cholesterol synthesis in prostate cancer cells contributes to cellular accumulation of 11 cholesterol and prostate cancer growth. A diet high in fat and cholesterol increase the risk of 12 prostate cancer, while statins directly targeting the cholesterol synthesis pathway are 13 associated with improved clinical outcome (reviewed in 17 ). This is generally taken as support 14 for the relevance of increased cholesterol synthesis in vivo. This notion was also in line with a 15 recent study showing increased activity of the cholesterol synthesis enzyme squalene 16 monooxygenase (SQLE) in lethal prostate cancer 18 . Accordingly, one would expect that genes 17 in the cholesterol synthesis pathway are upregulated when prostate cancer is compared to 18 normal tissue. However, transcriptional changes in cholesterol genes are rarely highlighted 19 when such comparisons are performed in large patient cohorts. 20 21 We hypothesized that this is due to influence of confounding tissue components present in the 22 samples. Gene expression analysis in human tissue is challenged by the highly heterogeneous 23 tissue composition in each sample 19,20 . The standard way to account for such heterogeneity is 24 to incorporate tissue type percentages from histopathology during the analysis. Although 25 confounding due to tissue composition is generally acknowledged, data from histopathology 1 are missing in most publicly available patient cohorts, which may bias the molecular analyses. 2 In prostate cancer, the presence of stroma tissue is shown to hide underlying molecular 3 features in a differential analysis 21,22 . Prostate tissues are usually histopathologically divided 4 into benign epithelium, stroma tissue and prostate cancer. It is previously shown that the 5 different number of tissue types present in prostate cancer (three tissue types) and in normal 6 samples (two tissue types) leads to a systematic sampling bias of increased stroma content in 7 the normal samples 23,24 . This confounds differential analysis when cancer and normal samples 8 are compared, and controlling for these biases will potentiate the discovery of molecular 9 pathways and features otherwise hidden in the data. 10 11 To address this challenge we utilized two independent patient cohorts where the tissue 12 composition of prostate cancer and normal samples has been thoroughly assessed by 13 histopathology. Based on the gene expression analysis of stroma-enriched genes in these two 14 cohorts, we used Gene Set Enrichment Analysis (GSEA) 25  We used the two stroma gene-sets identified independently from the histopathology cohorts 4 (Bertilsson and Chen) to calculate GSEA scores for all 1943 samples (1713 cancer and 230 5 normal) from all the seven patient cohorts. In the five-study cohort containing 1117 samples 6 (887 cancer and 230 normal), the calculated GSEA scores showed a systematic bias of 7 increased average stroma content in normal samples (Figure 2d), which should support the 8 separation of each cohort into balanced and unbalanced datasets. Balanced and unbalanced 9 datasets were therefore created independently using the two available stroma gene-set, 10 resulting in two independent balanced and unbalanced datasets for each cohort. This 11 stratification equalized the average stroma content in the balanced datasets, and enhanced the 12 difference in average stroma in the unbalanced datasets ( Figure 2d). Differential expression 13 analyses were performed independently for each dataset, and differentially expressed genes 14 were ranked in each dataset according to their p-value. In addition, combined rank-based 15 meta-analysis over the five-study-cohorts and seven-study-cohorts were performed (Methods). 16 The balanced and unbalanced datasets from the two meta-cohorts contained 558/115 and 17 971/115 prostate cancer/normal samples each, respectively.  Table 2, SFig4 in Supplementary File S1 online). In a 24 meta-analysis of the five-study-cohort, 21 of the 25 genes assessed were downregulated. 25 These included key genes of cholesterol synthesis such as HMGCR and SQLE (rate limiting 1 enzymes), FDFT1, LSS (catalyzes first step), DHCR7 (catalyzes last step) in addition to 2 NSDHL, MDMO1, EBP, IDI1, CYP51A1, HMGCS1 and SC5D. All these genes had p-values 3 to the power of -10 or less. The same trend was observed in a meta-analysis over the seven-4 study-cohort (SFig4 in Supplementary File S1 online). In addition, four cohorts in the five-5 study-cohort individually showed highly significant downregulation of cholesterol genes, 6 though the most highlighted genes varied somewhat between the cohorts (SFig4 in 7 Supplementary File S1 online). In the five-study-cohort, 10 central cholesterol genes 8 (HMGCS1, HMGCR, IDI1, FDFT1, SQLE, CYP51A1, MSMO1, NSDHL, EBP and SC5D) 9 ranked among the top 150 most differentially expressed genes in the balanced dataset 10 (average rank of 76) (Supplementary File S2 online). This is in contrast to the unstratified and 11 unbalanced datasets, where the average ranks of the same ten genes were 9195 and 14860, 12 respectively. The unbalanced analysis also shows that upregulation of cholesterol genes is 13 mostly due to differences between cancer tissue and stroma (Figure 3a).
Cholesterol 14 synthesis was a highly important gene ontology term in the balanced dataset, and a clustered 15 set of related terms containing steroid, sterol and cholesterol biosynthesis were among the top 16 three most significant gene ontologies when the 500 most significant genes from the five-17 study-cohort were analyzed by DAVID (Table 3). In summary, the balanced data prove a 18 characteristic transcriptional downregulation of the cholesterol synthesis pathway in primary 19 prostate cancer. All p-values presented in this and the following sections, as well as The pronounced discrepancies between the balanced and unbalanced datasets serve as an 25 illustration of how cholesterol pathway genes are confounded by stroma tissue during 26 differential analysis. Using HMGCR (the rate-limiting enzyme of cholesterol synthesis) as an 1 example, this gene is strongly significant in both datasets. However, it is downregulated when 2 cancer is compared to normal epithelium in the balanced dataset, and upregulated when 3 cancer is compared to stroma in the unbalanced dataset. This typical pattern occurs when a 4 gene highly expressed in the normal epithelium has an intermediate expression in cancer and 5 is weakly expressed in stroma. Significant expression differences in these situations can only 6 be revealed when the confounding effects of stroma is accounted for. Since this pattern is 7 prevalent throughout the entire cholesterol pathway, we hypothesize that stroma confounding 8 is the main reason that this pathway has not been identified in previous analysis of prostate 9 cancer patient cohorts. The only cohort that did not highlight cholesterol synthesis was the 10 cohort from Chen, which showed a consistent absence of significant cholesterol genes in the 11 balanced dataset (SFig4 in Supplementary File S1 online). However, the cholesterol gene 12 expression pattern from the unbalanced dataset in Chen was similar to the other cohorts. 13

14
The selection of stroma genes does not cause bias on differential expression of 15 cholesterol genes. 16 It is important to establish that gene-sets representing stroma content do not impose unwanted 17 biases with respect to the differential expression of cholesterol genes in additional cohorts. 18 Here we present three arguments why this is unlikely for the cholesterol pathway genes in this 19 study. 1) The stroma gene-sets were generated from two independent sources, but produced 20 similar and stable results. 2) Cholesterol genes were either absent or ranked low in the stroma 21 gene-sets. Nevertheless, all genes involved in the cholesterol pathway and regulation were 22 excluded from any stroma gene-set during analysis to ensure unbiased sample stratification. 23 Moreover, re-introduction of these cholesterol genes into the stroma gene-sets did not affect 24 the stratification of samples into balanced and unbalanced datasets in any of the seven 25 cohorts, showing that cholesterol-genes had no impact on the sample stratification. 3) The 1 histopathology cohort from Chen was the only cohort that did not highlight cholesterol 2 pathway genes as significant in the balanced dataset. Yet, all the balanced datasets in the 3 other six cohorts still highlighted cholesterol genes as highly significant when the stroma 4 gene-set derived from the Chen was used to balance the samples. Likewise, cholesterol 5 pathway genes were not highlighted as significant in the balanced dataset from Chen when 6 the stroma gene-set from Bertilsson was used to balance the samples in this cohort. This 7 shows that both the Chen and the Bertilsson stroma gene-sets maintained the divergent 8 balanced expression patterns for cholesterol genes when used in these two cohorts. 9 We also investigated two additional studies from the literature which could complement the 10 findings in our study. One study emphasized cholesterol biosynthesis as a significant pathway 11 in prostate tissue samples using gene ontology analysis 27  the liver leading to reduced circulating levels of cholesterol 4 . This may limit the cholesterol 1 available for cellular uptake, with activation of the cholesterol synthesis pathway and delayed 2 cancer growth as a result. What contradicts this hypothesis is that not only HMGCR is 3 downregulated in prostate cancer, but also LDLR. However, mechanisms alternative to LDLR 4 for cholesterol and sterol uptake and efflux have been suggested, including changed activity 5 of SLCO transporters 34 (for example SLCO2B1 is strongly upregulated in the five-study-6 cohort, Figure 4a) and modulation to cell-membrane structures like lipid rafts 3,7,35 . Recently, 7 cholesteryl esters in lipid droplets in prostate cancer PC3 cells were shown to originate from 8 uptake rather than synthesis 36 , supporting an increased attention to the role of cholesterol 9 uptake in prostate cancer. Alternatively, statins may upregulate HMGCR in prostate cancer 10 directly through feedback mechanisms 37 , again with a possible cancer-preventive effect. with other in vivo reports, which associates low levels of ABCA1 and low levels of circulating 24 HDL with prostate cancer 43,44 . We finally emphasize that the discussion on how our data 25 relate cholesterol metabolism and homeostasis are circumstantial, and that more detailed 1 analysis involving proper model system is needed to elucidate these mechanisms further. influenced the molecular makeup of the tumor at the time of surgery. We were not able to 7 obtain sufficient data to conclude on this issue. Nevertheless, we here discuss the limited data 8 and information we were able to find. Information on statin use prior to surgery were 9 available only for the Bertilsson cohort, were a total of 26 samples (18 cancer and 8 normal) 10 were affected. Re-analysis of the Bertilsson cohort did not change the pattern of consistent 11 downregulation of cholesterol pathway genes (SFig6 in Supplementary File S1 online). There 12 is one report on the in vivo effect of statin on HMGCR levels in breast cancer 37 . This report 13 demonstrated that statins do not necessarily downregulate HMGCR, and that the effect of 14 statin use was highly heterogeneous among patients. Currently we find it unlikely that statin 15 use has a major impact on the highly significant and consistent results observed in our study, 16 though we acknowledge that the information we have on this issue is too limited to conclude. 17 18

Limitations to the histopathological tissue classification. 19
In this study we have used a simplistic tissue classification which divide prostate cancer tissue 20 into three tissue types; cancer, stroma and normal epithelium. However, this classification 21 does not completely account for all tissue characteristics observed in prostate cancer, which 22 can be heterogeneous with respect to all three tissue types. Cancer tissue from the prostate can 23 be further classified into histological grades by Gleason score 45 . Gleason grading of samples 24 was provided for six of the seven cohorts, and did not show any bias with respect to balanced 25 and unbalanced dataset (STab2 in Supplementary File S1 online). We thus conclude that 1 Gleason grade is not a confounding factor in our analysis. Several studies have shown that 2 normal stroma can transform into reactive stroma when located adjacent to cancer tissue 46 . 3 Thus the balanced analysis may also highlight genes resulting from differences between 4 reactive and normal stroma. The strength of these differences will depend on the fraction of 5 reactive stroma compared to normal stroma in the cancer samples. Histopathological 6 differences between normal and reactive stroma were not assessed in the cohorts used in this 7 study, and thus represents a limitation. Finally, normal epithelium from the prostate can 8 display various precancerous aberrations with distinct molecular profiles 47 . We acknowledge 9 that these are limitations of the current classification, and that further research and data 10 generation in this field should focus on delineating additional molecular tissue profiles as 11 well. measurements from prostate cancer and normal tissue samples from seven publicly available 20 patient cohorts. An overview of data from the seven patient cohorts is given in Table 1. 21 Cancer samples for all cohorts were from radical prostatectomy specimens, except for the 22 Sboner cohort which was from a watchful waiting cohort. Normal samples were adjacent 23 normal prostate tissue from prostatectomy specimens, except for four normal prostate samples 24 in the Chen cohort which were autopsy samples from subjects without prostate cancer. Gene 25 expression measurements from each patient cohort were downloaded and processed in the 1 following way: Gene expression and metadata from Bertilsson were created by our group and 2 processed as previously described 52 . Data are available at Array Express with accession E-3 MTAB-1041. The best probe for each gene was selected as the one with the highest average 4 rank by p-value in differential expression analysis (average over unstratified, balanced and 5 unbalanced comparison, see below or main text for explanation). Gene expression and 6 metadata from Chen 21,53 were downloaded from Gene Expression Omnibus (GEO) accession 7 GSE8218. Probes were matched to gene names by the hg133a.db reference using limma in R. 8 The best probe for each gene was selected as the one with the highest average rank by p-value 9 in differential expression analysis. Gene expression and metadata from Taylor 54 were 10 downloaded from GEO accession GSE21034. Probes were matched to genes using the 11 GPL10264 reference available at GEO. Probes with no matching gene were removed from 12 further analysis. The best probe for each gene was selected as the one with the highest rank in 13 a differential expression analysis between prostate cancer and normal samples. In the Taylor  to gene names using the GPL5474 reference available at GEO. Only four genes in the Sboner 24 cohort had more than one probe. For these genes, the probes with the highest overall 25 expression value were selected as the best probe. Quantile normalized exon expression data 1 and metadata from the Erho cohort 61 were downloaded from GEO with accession GSE46691. 2 Exons identifiers were matched to gene names using the GPL5188 reference available at 3 GEO. The total expression for each gene was calculated as the average expression over all 4 exons for that gene. Differential expression of genes from Bertilsson, Chen and Taylor were 5 identified using the limma package in R as described previously 52 , while voom on raw RNA-6 Seq read counts was used for differential expression of genes from TCGA and Prensner. In

Identification of gene-sets for assessment of stroma content in prostate tissue samples 1
Since the procedure for identification of gene-sets requires histopathology on both prostate 2 cancer and normal samples in the same cohort, only the Bertilsson and Chen could be utilized 3 for this purpose. Both these cohorts include detailed histopathological evaluation on the 4 percentage tissue composition of prostate cancer, benign epithelium and stroma in both 5 prostate cancer and normal samples. Stroma gene-sets were created independently from each 6 of the two cohorts by the exact same procedure. The difference in average tissue composition 7 between the balanced and unbalanced datasets (described in the previous section) facilitates 8 the identification of genes specifically up-or downregulated in stroma compared to benign 9 epithelium and cancer tissue by comparing p-values between the two datasets. Specifically, 10 genes which display up or downregulation characteristic for stroma tissue will have lower p-11 values in the unbalanced compared to the balanced dataset. We thus used the following 12 formula to rank all genes according to their suitability for creating stroma gene-sets: 13 14 = 2 (1) 15

16
The squared term in the denominator was included to reflect that more pronounced 17 differences in p-values are necessary for highly significant genes to be regarded as stroma 18 genes. (Compare a gene with p-value 1e-5 in the unbalanced which is not significant in the 19 balanced dataset, to a gene with p-value 1e-20 in the unbalanced and 1e-15 in the balanced. 20 The former is more likely to be a valid stroma marker than the latter, even though the p-value 21 ratio is the same). The stroma gene-sets included were based on the top 1000 ranked 22 upregulated and top 1000 ranked downregulated genes. To avoid any bias from the 23 cholesterol pathway genes, any genes from Table 2 were removed from all stroma gene-sets 24 during analysis. 25 1

Validation of stroma gene-sets in the histopathology cohorts 2
The independent stroma gene-sets created from the Bertilsson and Chen cohorts were 3 validated by assessing the percentage of shared genes between the two gene-sets (Figure 2a). 4 This percentage was compared to the percentage of shared genes expected by chance in 50 5 randomly generated gene sets of same size. The number of shared genes was also compared in 6 gene sets created using a naïve approach of Pearson correlation to histopathological stroma 7 content, and four previously published gene sets related to the content of stroma in prostate 8 cancer tissue samples from Wang et al. 21 (Figure 2b). 9

10
The assessment of the stroma content in any single sample from any cohort was performed 11 using Gene Set Enrichment Analysis (GSEA) 54 . Two measures will influence the GSEA 12 scores; the number of genes in the applied gene-set, and the total number of genes used for the 13 calculation. We calculated 10 GSEA scores for each sample using varying numbers of the top 14 scoring stroma genes (top 100, 150, 200, 250, 300, 350, 400, 450, 500 and 1000 genes), and 15 normalized the scores in each of the 10 calculations to a 0-100 range over all samples to make 16 them comparable. To enable comparisons between datasets, we only used genes shared by all 17 datasets in each GSEA calculation. Two total gene selections were made, one containing 9527 18 genes shared by the five-study-cohort, and one with 4804 genes shared for the seven-study-19 cohort. Averaging over the 10 GSEA scores in each selection produced a total of four GSEA 20 scores for each sample in Bertilsson, Chen, Taylor, TCGA and Prensner (using gene-sets from 21 Bertilsson and Chen for the five-study-cohort and the seven-study-cohort respectively), and 22 two GSEA scores for each sample in Sboner and Erho (Bertilsson and Chen gene-set for the 23 seven-study-cohort). The main reason for the lower number of shared genes in the seven-24 study-cohort is the relatively few genes measured in Sboner (6100 unique genes). For 25 Bertilsson and Chen, GSEA scores for each sample were converted to predicted stroma 1 percentages using a linear least squares fit, and compared to the stroma percentages obtained 2 from histopathology. Predicted stroma percentages by the fit model based on the Bertilsson 3 and Chen stroma gene-sets respectively in each of the two cohorts were also compared. 4 Finally, GSEA score correlations when using the Bertilsson and the Chen stroma gene-sets 5 were compared for all cohorts. 6 7 Defining balanced and unbalanced data-sets in cohorts lacking histopathology 8 The GSEA scores representing the content of stroma in each sample were used to separate 9 samples in each new patient cohort into balanced and unbalanced datasets as described 10 above. This was done independently for each patient cohort. Differentially expressed genes 11 for each cohort were calculated and corrected for multiple testing by Benjamini Hochberg 12 false discovery rate (FDR) separately in each cohort, based on the total number of analyzed 13 genes in each cohort. For the Sboner and Erho cohorts, only the cancer samples were 14 separated into datasets with high and low stroma content, and no differential analysis was 15 performed. 16 17

Rank based meta-analysis for combined cohorts 18
To identify differentially expressed genes in a meta-analysis over the five-study-cohort the 19 following procedure was used: 1) Each gene was sorted according to its expression value over 20 all samples independently in each cohort, and rank-normalized to a score-value between 0 and 21 100, where 0 is the rank based expression value for the sample with the lowest expression 22 value of the gene, and 100 is rank-based expression value the sample with the highest 23 expression. 2) Rank-normalized values were mean centered independently in each cohort, 24 where the mean centering was weighted by the relative number of prostate cancer and normal 25 samples in the cohort. This was to avoid mean-value biases due to the huge relative difference 1 between cancer and normal samples in each cohort. 3) Samples were separated into one 2 balanced and one unbalanced meta-dataset using their previous assignment to balanced and 3 unbalanced datasets in each individual cohort. Differential analysis based on two 4 classifications were performed, one based on the stroma gene-set from Bertilsson and one on 5 the stroma gene-set from Chen. 4) Differentially expressed genes for the unstratified, 6 balanced and unbalanced datasets were calculated for weighted mean centered rank-7 normalized values between prostate cancer and normal samples combined for all five cohorts 8 using the Mann-Whitney-Wilcoxon test 64 for rank-based differential expression. P-values of 9 differentially expressed genes were corrected for multiple testing using the Benjamini-10 Hochberg FDR for the total number of genes analyzed (25 964 unique gene identifiers). If a 11 gene was not present in all datasets, only the datasets that contained that gene were used for 12 differential expression. The seven-study cohort was analyzed in the same way, but with mean 13 centering rather than weighted mean centering used to adjust gene-ranks between cohorts. 14 This was due to the lack of normal samples in the Sboner and Erho cohorts.    Balanced and unbalanced datasets from the five-study-cohort are merged into one meta-20 analysis for differential expression. Balanced and unbalanced datasets from the five-study-21 cohort, as well as high and low stroma datasets from the Sboner and Erho cohorts are merged 22 into one meta-analysis of the seven-study-cohort. Pearson correlation (c) between stroma percentage predicted by gene-sets from Bertilsson and 8 Chen in each of their respective cohorts (bottom). e) Bias towards higher GSEA stroma scores 9 in normal compared to cancer samples present in all unstratified cohorts from the five-study-10 cohort. Dividing samples into balanced and unbalanced datasets minimizes and maximizes, 11 respectively, the stroma bias between cancer and normal samples. A difference in the average 12 overall GSEA score between the cohorts is also evident in the figure.    Footnote: The analysis was performed using the top 500 ranked genes from the balanced 3 analysis in the five-study-cohort as input to DAVID. Only the top terms are listed. All terms 4 are from the category "SP_PIR_KEYWORDS". The top categories were the same when the 5 top 1000 ranked genes were used. All terms related to steroid, sterol and cholesterol synthesis 6 were part of the same functional cluster in DAVID. 7 8 Bertilsson cohort