Intrinsic bias in breast cancer gene expression data sets
© Mosley and Keri. 2009
Received: 27 February 2009
Accepted: 29 June 2009
Published: 29 June 2009
Skip to main content
© Mosley and Keri. 2009
Received: 27 February 2009
Accepted: 29 June 2009
Published: 29 June 2009
While global breast cancer gene expression data sets have considerable commonality in terms of their data content, the populations that they represent and the data collection methods utilized can be quite disparate. We sought to assess the extent and consequence of these systematic differences with respect to identifying clinically significant prognostic groups.
We ascertained how effectively unsupervised clustering employing randomly generated sets of genes could segregate tumors into prognostic groups using four well-characterized breast cancer data sets.
Using a common set of 5,000 randomly generated lists (70 genes/list), the percentages of clusters with significant differences in metastasis latencies (HR p-value < 0.01) was 62%, 15%, 21% and 0% in the NKI2 (Netherlands Cancer Institute), Wang, TRANSBIG and KJX64/KJ125 data sets, respectively. Among ER positive tumors, the percentages were 38%, 11%, 4% and 0%, respectively. Few random lists were predictive among ER negative tumors in any data set. Clustering was associated with ER status and, after globally adjusting for the effects of ER-α gene expression, the percentages were 25%, 33%, 1% and 0%, respectively. The impact of adjusting for ER status depended on the extent of confounding between ER-α gene expression and markers of proliferation.
It is highly probable to identify a statistically significant association between a given gene list and prognosis in the NKI2 dataset due to its large sample size and the interrelationship between ER-α expression and markers of proliferation. In most respects, the TRANSBIG data set generated similar outcomes as the NKI2 data set, although its smaller sample size led to fewer statistically significant results.
Over the past decade, a large number of global gene expression data sets of human breast cancers have become publicly available [1–6]. These data sets have provided a wealth of information for the generation and testing of biological and clinical hypotheses . Clinical and pathological factors with relevance to breast cancer are extensively characterized, and the prognostic significance of these factors is reflected in these publicly available data sets. These factors include tumor grade, Her2 and estrogen receptor (ER) expression . Whether gene expression data contributes additional prognostic information beyond what is offered by these clinical factors is debated [9, 10]. Gene expression profiles associated with individual clinical hallmarks have also been described. For instance, a large set of genes is associated with ER gene expression [5, 11]. In addition, the molecular basis of grade has also been examined with results showing a strong relationship between histological grade and tumor proliferation [6, 12]. Indeed, the consistent prognostic efficacy of a proliferation signature is now well established [13, 14].
The strong inter-relationships between clinical features, gene expression patterns and prognosis has led to the postulate that, depending upon the underlying relationship between the clinical and prognostic factors in a given data set, prognostic gene expression signatures may simply function as a proxy measure for these established clinical variables. For example, Gruvberger et. al. showed that a gene expression signature derived from the van't Veer data set , for which ER status was a strong predictor of the incidence of metastasis, was not predictive of metastasis in a data set for which this relationship did not exist . This observation led the authors to propose that derivation of future prognostic gene signatures stratify analyses by ER status in order to adjust for the known association between gene expression and ER status. This suggestion has been variably implemented, but is often ignored. Another consequence of the association between prognosis and large sets of correlated genes is that a large number of predictive gene lists can be derived by selecting different members of predictive clusters of correlated genes. This phenomenon can occur even when gene selection adheres to a standardized protocol due to variations such as the specific tumors used in training sets (subsets of tumors used to derive prognostic lists) [10, 17].
While breast cancer gene expression data sets have considerable commonality in terms of their data content, the populations that they represent and the data collection and analysis methods can be quite disparate. The advantage of this heterogeneity is that it provides an opportunity to test the robustness of a hypothesis across multiple populations represented in these data sets . However, a potential disadvantage of this heterogeneity is that systematic differences between data sets, unrelated to analytical approaches, may create sources of bias that impact their intrinsic likelihood of confirming a given hypothesis.
To gain insight into the extent of potential systematic differences in obtaining statistically significant results across breast cancer gene expression data sets, we ascertained whether there were intrinsic differences in the likelihood of observing an association between the expression of a selected set of genes and metastasis using four well-characterized data sets. To minimize bias in our approach, an unsupervised clustering algorithm was used to segregate tumors into one of two clusters based on the expression levels of randomly selected sets of genes (30–400 genes/set). These clusters were then compared for differences in metastasis latencies. We found that one data set, the Netherlands Cancer Institute (NKI2) set of young women with early stage disease, was considerably more likely to give a significant finding when examining either all tumors or only ER positive tumors, as compared to the other data sets examined. Factors that contributed to an increased likelihood of a random gene list being predictive of metastasis within a given data set were 1) the number of tumors analyzed and 2) the inter-relationships between ER expression, proliferation and metastases. We suggest that these intrinsic differences between the data sets should be considered in the design and analysis of future studies incorporating gene expression data.
Global gene expression and clinical data (including estrogen receptor status and metastasis recurrence latencies) were analyzed in four independent, publicly available breast cancer gene expression data sets. The Netherlands Clinical Institute (NKI2) data set contains data on 295 women with early stage breast cancer (downloaded from http://www.rii.com/publications/default.htm) [4, 19]. The Wang data set contains gene expression data on 296 women with lymph node negative disease  (GEO series GSE2034). The KJX64 and KJ125 data sets contain data on 189 women, 64 of which were treated with tamoxifen, with primary operable invasive breast cancer (GEO series GSE2990) . The TRANSBIG data set contains data for 183 untreated women from the Bordet Institute (GEO series GSE7390) [20, 21].
All probes from each data set were used in the analyses except in the NKI2 data set where only the probes where there was complete data on more than 291 of the 295 subjects were used (n = 24,023 probes). Missing values for genes in the NKI2 set were imputed using the "impute" option in the FastClus procedure. Gene expression data for the NKI2 data set were given as log10 expression ratios, while data from the TRANSBIG and KJX64/KJ125 sets were log2 expression values. Expression values were log2 transformed in the Wang gene expression data set. Each gene probe in each data set was mapped to a unigene cluster ID using the SOURCE database (source.stanford.edu).
For each gene expression data set, a new data set containing a single set of expression data for each unique unigene cluster ID was created. In the Wang, Miller and KJX64/KJ125 data sets, the expression values for each probe was set to have a median value of 0 and standard deviation of 1. In instances where there were multiple gene probes with a common unigene cluster ID, the median expression value of all the common probes was used. There were 11,318 genes with unique unigene identifiers that were common to all four data sets.
Random lists of genes of various sizes (30 to 400 genes per list) were generated by simple random sampling (Surveyselect procedure). For each randomly-generated gene list, tumors were separated into two groups using an unsupervised hierarchical clustering procedure that was based on a correlation matrix derived from standardized centroid components for cluster assignment (Varclus procedure). The two groups represent the first bifurcation of the clustering hierarchy. Random gene lists were also generated using subsets of genes that were identified as being associated with overall metastases latencies. The subsets of genes utilized in this analysis were those that had a relatively modest proportional hazards p-value of less than 0.1 in a Cox regression analysis.
All survival analyses were based on 5-year metastatic recurrence latencies. All subjects not experiencing metastasis within 5 years were censored at that time point. Cox proportional hazards regression models were used to ascertain differences in latencies between groups assigned by hierarchical clustering (PHReg procedure). Univariate proportional hazard ratios and p-values are reported for all analyses and represent the differences in risk between the relatively "poor" prognosis group as compared to a relatively "good" prognosis group. In univariate analyses examining the association between ER status and latencies, hazard ratios represent the change in risk for ER positive tumors, as compared to ER negative tumors. Multivariable models adjusted for ER-α expression used the ESR1 probe in the NKI2 data set and the "205225_at" gene probe in other data sets.
For each pair of clusters created by the hierarchical clustering procedure, the percentage of all ER negative and ER positive tumors contained in each cluster was computed. The maximum value of the ratio of the ER negative percentage to the ER positive percentage among the two clusters was then computed. If ER positive and negative tumors were assigned to clusters in equal proportions, this ratio would be approximately 1. Otherwise, this ratio would be greater than 1, indicating that the clusters contained a relatively disproportionate number of ER negative tumors, as compared to ER positive tumors.
To eliminate correlations between all genes in the NKI2 data set and either ER-α or proliferation genes, the data were globally adjusted by fitting a least-squares regression line to each probe in the dataset and then computing the residuals (GLM procedure) . Each probe was the dependent variable and either ER-α or a proliferation gene was the continuous independent variable. The residuals (adjusted values) for each probe represent the new expression values for that probe. This new adjusted value is no longer linearly correlated with the independent variable. The gene selected to adjust each data set for proliferation was based on a previously published analysis and represents the gene probe that was the most predictive member of a set of correlated proliferation-associated genes linked with an increased risk of developing metastases . The specific genes used in each data set were UBE2C (NKI2 and TRANSBIG), HMMR (KJX64/KJ125) and RACGAP1 (Wang).
where Var(EP) is the percent change in variance after sequentially adjusting a data set for ER-α and proliferation and Var(E) and Var(P) are the percent changes in variance after individually adjusting for either the ER-α or proliferation genes, respectively.
All calculations were performed using SAS version 9.1 (SAS Institute, Cary, NC). All statistical tests were two-sided, and a p-value less than 0.01 was considered statistically significant.
We next assessed whether the differences among the data sets was the result of differences in the content of the genes represented on the microarrays used in each study. We performed clustering analysis using a set of 5,000 random lists (70 genes/list) that were comprised of genes that were common to all four data sets (figure 1B). For this analysis, each data set was modified so that it included only 1 instance of a given gene. Consistent with our initial observations, there was a significantly higher likelihood of obtaining a significant result when examining recurrence latencies in the NKI2 data set (61.6%), as compared to the other data sets. For the Wang  and TRANSBIG  data sets, use of a data set with only a single instance of each gene resulted in an increase in the proportion of predictive gene lists to 15% and 21%, respectively. This discrepancy is explained by the fact that decreasing the number of redundant probes in these data sets increased the relative proportion of genes predictive of outcome. Thus, these data sets likely contain multiple probes for many genes that are not associated with metastases.
ER status is predictive of metastasis in the NKI2 and TRANSBIG data sets.
(ER+ vs ER-)
Gruvberger et. al. has suggested that prognostic gene lists be tested independently on ER positive and negative tumors in order to control for the effects of ER-α gene expression on tumor assignment . We re-examined our prognostic lists separately in ER positive and negative tumors. While none of the lists were predictive amongst ER negative tumors, the NKI2 data set continued to show a significantly higher likelihood of giving a positive finding, as compared to the other data sets (figure 1C). Specifically, 38% of the lists were predictive in the NKI2 set versus 11% and 4% in the Wang and NKI2 data sets, respectively. Again, no random lists were predictive in the KJX64/KJ125 data set.
In summary, these results indicate that there is a large disparity in the likelihood of observing significant differences in metastasis risk among breast cancer data sets using arbitrary gene expression levels as a classifier. In particular, it was 4–5 times more likely to obtain a significantly predictive gene list in the NKI2 data set than the other data sets, even in stratified analyses examining ER positive tumors.
While clustering is associated with ER status, it is not clear whether this association is due to the specific contribution of ER-α correlated genes, as has been suggested , or whether other gene clusters associated with ER status may be driving this association. To directly ascertain the contribution of ER-α correlated genes to cluster assignment, the gene expression data were globally adjusted to eliminate the correlations between all genes in a data set and the ER-α gene [14, 23]. To accomplish this, we fit a least-squares regression line to each probe in the dataset, using ER-α as the independent variable, and then computed the residual. This residual represented the new expression value for that probe. After adjustment, the tumors were reclustered using the previously generated 5,000 lists of 70-gene lists and reanalyzed. Global adjustment eliminated the propensity for clusters to be associated with ER status in all data sets except the KJX64/KJ125 data set, in which there was only a modest attenuation (figure 2B). This anomaly may be due to the fact that we observed that ER-α gene expression was poorly correlated with ER status in the KJX64/KJ125 data set (data not shown), as compared to the other data sets. ER-α adjustment of the NKI2 and TRANSBIG sets showed a similar attenuation in the proportion of predictive gene lists observed using multivariable adjustment for ER-α. This indicates that ER-α correlated genes make a significant contribution to the efficacy of the predictive gene lists in these data sets (figure 2C). In contrast, ER-α adjustment substantially increased the proportion of lists that were predictive in the Wang data set from 15% to 34%. This observation is likely explained by the fact that ER status is not associated with outcome in this data set. Hence, removing the constraint on the clustering algorithm to assign tumors to clusters associated with ER status increased the influence of other sets of correlated genes associated with prognosis on the clustering program. Adjustment did not have any impact on the KJX64/KJ125 data set.
In summary, these data show that ER-α correlated genes can serve to either increase or decrease the likelihood that a random set of genes will cluster tumors into groups with significant differences in latencies. The differences in the direction of the effect are related to the underlying association between ER status and the risk of metastasis in each data set.
It has been previously shown that genes associated with cellular proliferation are strong predictors of metastases [13, 14, 24–26]. To ascertain the impact of adjusting for proliferation-correlated genes on cluster assignment and outcome, each data set was globally adjusted for proliferation. Adjustment almost completely eliminated the prognostic abilities of all of the random gene lists in the NKI2, Wang and TRANSBIG data sets (figure 2C), confirming the prognostic efficacy of the proliferation gene cluster. Interestingly, while adjustment for proliferation decreased the rate of disproportional cluster assignment associated with ER status, especially in the TRANSBIG data set, it did not eliminate it (figure 2D). Thus, the prognostic ability of the gene lists can be eliminated while maintaining a predilection for cluster assignment associated with ER status. These results suggest that the prognostic contribution of ER-α correlated genes can be explained by their correlation with proliferation genes.
It is important to note that the variation explained by the ER-α and proliferation-correlated genes is not independent. Thus, in contrast to the above similarities between the data sets, the proportion of the proliferation-associated variance that is explained by ER-α differs substantially (figure 3B). In the NKI2 and TRANSBIG data sets, ER-α accounted for approximately 35–40% of the proliferation-associated variance. This relatively high proportion likely explains the significant association between ER status and prognosis and the impact of adjusting for ER-α on the prognostic abilities of the random gene lists in the NKI2 and TRANSBIG data sets. In the Wang data set, ER-α expression only accounted for 12% of the proliferative variance, hence explaining the weak association between ER status and prognosis.
In the present study, we show that there are marked differences in the likelihood of observing a positive association between the expression patterns of a set of genes and metastasis across four breast cancer global gene expression data sets. Most striking was the observation that over 60% of randomly selected lists of 70 genes each could segregate tumors into two groups with significant differences in outcome in the NKI2 data set. This finding is particularly salient in light of the fact that this data set has been frequently used to both generate and verify prognostic lists of genes [4, 6, 19, 27–29]. The ability of the majority of random gene lists to predict outcome calls into question the biological validity of obtaining a positive finding in this data set, as a vast number of combinations of likely biologically unrelated genes are predictive of metastasis.
Gruvberger et. al., noting a strong association between ER status and outcome in the van't Veer data set , suggested that analyses of prognostic gene lists be stratified by ER status to ascertain whether the lists were simply functioning as proxies of ER-α gene expression . ER status was associated with the risk of metastases in two of the data sets that we analyzed, the NKI2 (a partial superset of the van't Veer data set) and TRANSBIG data sets. In three of the four data sets, we found that tumors tended to cluster by ER status. This was directly attributable to genes correlated with ER-α expression, as elimination of the variance in gene expression associated with ER-α expression completely attenuated disproportionate clustering by ER status. Adjusting for the expression of ER-α decreased the proportion of predictive gene lists in the NKI2 and TRANSBIG data sets, as would be expected. It also substantially increased the proportion of predictive gene lists from approximately 15% to 35% in the Wang data set where ER status is not an independent predictor of metastasis. Thus, a gene list which gives disparate results in the Wang versus NKI2 or TRANSBIG data sets may, indeed, be functioning as a proxy for ER expression.
While ER-α expression may be an important contributor to cluster assignment, it is not the primary determinant for obtaining a significant prognostic gene list. When each of the data sets was globally adjusted for proliferation, there were almost no significantly predictive gene lists, even though tumors tended to cluster by ER status. Genes associated with proliferation have been shown to be the essential contributors to an effective prognostic gene list in gene expression data sets of combined ER positive and negative tumors or just ER positive tumors [13, 14, 26]. We show that the extent to which adjusting for ER expression impacts a prognostic gene list is related to the extent to which the expression the ER-α gene and proliferation genes covary. Hence, ER status is predictive of prognosis to the extent to which it functions as a confounder to the relationship between the expression of proliferation genes and the risk of metastasis.
Stratifying our analyses by ER status had differential effects on the data sets. In the NKI2 data set, almost 40% of gene lists were predictive in ER positive tumors, while less than 5% were predictive in the TRANSBIG data set. This result is interesting since both data sets had a similar proportion of events occurring among ER positive and negative tumors and ER status was associated with outcome in both data sets. When we examined the distribution of the hazard ratio estimates, which show the magnitude of the differences in rates of metastases of the tumors in the cluster pairs, the hazard ratio estimates tended to be larger in the TRANSBIG data set. This analysis suggests that clusters generated with the random gene lists tend to have larger differences in the metastasis latencies in the TRANSBIG data. Thus, the most likely explanation for the difference in the proportion of significant findings between these data sets is the larger sample size of the NKI2 data set. Other than sample size variations, these data sets were similar in many regards. The similarities may reflect commonalities in their patient populations, both of which were derived from young women in northern European countries.
In contrast to the ER positive tumors, virtually none of the random gene lists was prognostic in ER negative tumors in stratified analyses. This observation may suggest that, among these data sets, there is not a large, dominant cluster of correlated genes associated with prognosis among these tumors. Another possible explanation is that there is more heterogeneity among ER negative tumors, which can be comprised on Her2 amplified tumors, basal as well as familial forms of breast cancer [30–32]. We would note, however, that gene-expression based prognostic signatures for ER negative tumors have been described [33, 34], suggesting that robust common prognostic features likely exist amongst these tumors.
We found that, in general, random gene lists were less predictive of recurrence in the Wang data set. In analyses of all tumors, this effect could be partially attributed to the lack of an association between ER status and outcome in this data set, as described above. The random gene lists were also far less predictive among ER positive tumors in this data set, as compared to the NKI2 data set. This difference cannot be attributed to sample size differences as these data sets are approximately the same size. Like the TRANSBIG data set, the Wang data set is comprised of lymph node negative patients. While this difference could account for disparities between the Wang and NKI2 data sets (which utilized tumors from patients with node positive and negative disease), it would not account for differences between it and the TRANSBIG data set. Thus, the relevance of the difference in tumor composition with respect to this clinical variable is not clear. Another notable difference between the data sets is the older age representation among participants in the Wang study which could contribute additional heterogeneity amongst these tumors. It is also interesting to note that random gene lists containing genes that were independently predictive of metastases were only predictive approximately 55% percent of the time in this data set, versus 90% or more of the time in the other data sets (see Additional File 1). This might suggest that there are more independent clusters of genes predictive of prognosis in this data set. Consistent with this hypothesis, we have previously found a higher degree of correlation among independently predictive genes in NKI2 data set versus the Wang data set . Michiels et. al. showed in the van't Veer data set that the set of genes selected for a prognostic classifier is highly variable depending upon the training set used to identify prognostic genes . Based on our findings, we would anticipate an even greater degree of variability in the genes identified if their analysis was repeated using the Wang data set.
The KJX64/KJ125 data set produced considerably different results than the other data sets. Virtually none of the random gene lists were prognostic in this data set. It is important to note that it is possible to generate a large set of prognostic genes lists in this data set, as we found that 97% of random gene list derived from genes independently associated with metastasis were indeed predictive (see Additional File 1). In addition, this data set has also been used to generate other validated gene lists . We found anomalies in this data set that may account for differences in its behavior, as compared to the other data sets, such as the fact that the ER-α gene probe levels were poorly correlated with ER status. This might suggest that there are issues with scaling of the raw data or that there are a large proportion of gene probes which did not perform optimally. However, regardless of how we rescaled the data, we obtained nearly identical results. Furthermore, this data set was derived using the same gene expression technology (Affymetrix) as the TRANSBIG and Wang data sets. Hence, differences in behavior cannot be attributed to differences in technology.
In summary, these analyses demonstrate that there are systematic differences among the four breast cancer gene expression data sets examined herein. Most notable is the fact that the NKI2 data set provides a very high likelihood of obtaining a significant association between expression and metastases for a given set of genes. This disparity persists when examining all tumors in the data set or when looking only at ER positive tumors. The TRANSBIG data set behaved similarly to the NKI2 data set and the lower likelihood of obtaining a positive result in this data set is likely attributable to its smaller sample size. Fewer random gene lists are predictive in the Wang data set due to the lack of an association between ER status and the risk of metastasis. These differences in data sets should be considered when using them for developing new prognostic classifiers or assessing the robustness of classifiers that were generated with other data sets.
This work was supported by grants from the National Institutes of Health [RO1-CA090398 (to RAK) and training grant T32-GM07250 (MSTP/CASE) (to JDM)] and an Ohio Innovation Incentive Fellowship (to JDM).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.