Robust assignment of cancer subtypes from expression data using a uni-variate gene expression average as classifier

Background Genome wide gene expression data is a rich source for the identification of gene signatures suitable for clinical purposes and a number of statistical algorithms have been described for both identification and evaluation of such signatures. Some employed algorithms are fairly complex and hence sensitive to over-fitting whereas others are more simple and straight forward. Here we present a new type of simple algorithm based on ROC analysis and the use of metagenes that we believe will be a good complement to existing algorithms. Results The basis for the proposed approach is the use of metagenes, instead of collections of individual genes, and a feature selection using AUC values obtained by ROC analysis. Each gene in a data set is assigned an AUC value relative to the tumor class under investigation and the genes are ranked according to these values. Metagenes are then formed by calculating the mean expression level for an increasing number of ranked genes, and the metagene expression value that optimally discriminates tumor classes in the training set is used for classification of new samples. The performance of the metagene is then evaluated using LOOCV and balanced accuracies. Conclusions We show that the simple uni-variate gene expression average algorithm performs as well as several alternative algorithms such as discriminant analysis and the more complex approaches such as SVM and neural networks. The R package rocc is freely available at http://cran.r-project.org/web/packages/rocc/index.html.


Background
One of the most promising clinical applications of genome wide expression studies is the construction of robust and reliable disease classifiers. Correct identification and sub-classification of diseases such as cancer is a prerequisite for proper and efficient treatment. To date a large number of different algorithms for disease classification have been described. They range in complexity from neural network approaches [1] to the simpler nearest-neighbor classification algorithms [2]. Even though some of the more complex approaches such as neural networks and self organized maps (SOM) [3] have proved to be very efficient, these methods often rely on the tuning of several parameters and hence are liable for over-fitting. Furthermore, simple classifiers seem to perform remarkably well when compared to more sophisticated ones [4]. In the present investigation our aim has been to design a simple predictor system useful for cancer subtype classification. Features to be included in the predictor signatures are selected based on their classification capacity as determined by a receiver operating characteristic (ROC) analysis and area under the curve (AUC) estimates [5,6]. After selection of the appropriate number of genes in the predictor signature, the mean expression level of all genes included is calculated, transforming the ensemble of genes into one vector and used as a uni-variate gene expression average, or a metagene, as classifier. Two features of gene expression are exploited by the merging of genes, genes are often co-regulated and hence correlated, and by using the expression level of the metagene, effects by random noise from single genes are minimized. Most of the commonly used algorithms such as SVM [7] and PAM [8] apply specifications such as support vectors and weights to the individual features included in the predictor gene signatures which potentially complicate their application to independent data [9]. Hence in this investigation we use an alternative way to evaluate the results by using the obtained training set gene signature genes only and then establish new parameters in the validation set to evaluate the performance of the classifier. We show that the proposed metagene classifier produces excellent accuracies, similar to what is obtained with a SVM approach, in several types of cancer data sets using a variety of tumor classification criteria.

Data sets
To establish the classifier we used bladder cancer datasets produced by Sanchez-Carbayo et al. [10] (Supplementary Table 10 in [10]) "SanchezC", Stransky et al. [11] (ArrayExpress: E-TABM-147) "Stransky"; and Blaveri et al. [12] (Supplementary Table 4 in [12]) "Blaveri". The remaining datasets were obtained from Gene Expression Omnibus (GEO) [13], except for the vandeVijver breast cancer dataset [14]. The following datasets were downloaded from GEO; for breast GSE2034 (WangY), GSE2990 (Sotiriou), for neuroblastoma GSE3960 (WangQ), GSE12460 (JanoueixL), GSE19274 (Attiyeh), for lung GSE8569 (Angulo), GSE11969 (Takeuchi). For a detailed description of the datasets see Additional file 1. Normal urothelium samples, recurring tumors from the same patient, cell lines, and technical replicates were not included in the final bladder cancer data sets. The San-chezC dataset was quantile-normalized using the normal-izeBetweenArrays function of the R package limma [15]. Robust Multi-array Average (RMA) was performed separately for two samples sets of the Stransky dataset (on U95A and U95Av2 respectively) using the affy package [16]. Obtained RMA expression values were de-logged, the samples sets combined, and quantile normalized using limma. The SanchezC and Stransky datasets were both transformed to log2 scale. To obtain gene-centered values the gene expression values were subtracted by the mean expression of the gene in each dataset separately. The Blaveri dataset was imputed for missing values using k-nearest neighbors (k = 10) for genes that had no more than 20% missing data, and genes with >20% missing data were omitted [17]. The HGNC GeneSymbols were updated in all datasets with the official HGNC GeneSymbols from the HGNC webpage [18]. The expression values of GeneSymbols with multiple reporters were merged by taking the median expression value. All reporters in the datasets without a GeneSymbol were discarded. The final SanchezC dataset contained 90 patients and 12761 genes, the Stransky dataset 56 patients and 8955 genes, and the Blaveri dataset 74 patients and 4430 genes. The SanchezC and Stransky datasets share a total of 8518 GeneSymbols and were used to explore the AUC characteristics. For classification, Ta and T1 cases were considered non-muscle invasive (NMI), and ≥T2 cases as muscle-invasive (MI). Grade is discriminated between Grade 2 and 3 in SanchezC, and between Grade1+2 and Grade 3 in Sanchez. Randomized versions of the datasets were generated using the mean and standard deviation of the original datasets. Non-bladder cancer Affymetrix datasets not already normalized were downloaded as cel files and normalized using RMA. All other datasets were downloaded as normalized 'series matrix files'. In the case of missing values, k-nearest neighbor imputation was performed (k = 10). Gene Symbols were updated using the official HGNC nomenclature file, and then expression values for reporters with the same GeneSymbols were merged. The data was mean-centered, except for two-color array data, as this data comes already in ratios. Reporters with no GeneSymbols were excluded from the final data. The vandeVijver data was imputed for missing values by k-nearest neighbor, transformed to log2 ratios and GeneSymbols were updated and merged.

ROC analyses
The receiver operating characteristic (ROC) curve is the plot of sensitivity (true positive rate) vs. 1-specificity (false positive rate), for predicting a binary classification variable z using some covariate x. That is, if x >t for some threshold t then z is predicted as 1, otherwise as 0. As the threshold t ranges from +N to -N the fraction of true positive and false positive predictions will both increase from 0 to 1, yielding the ROC curve. The area under this curve (AUC) is an overall measure of the predictor's performance. An ideal predictor obtains true positive rate 1 and false positive rate 0 for some threshold t, and then AUC = 1. Ignoring covariate information and guessing randomly by predicting z = 1 with some probability q yields AUC = 1/2 by letting q range from 0 to 1, so 1/2 is a worst-case AUC. If AUC < 1/2 the covariate x is however negatively correlated with z, and replacing x by -x turns AUC into 1-AUC (which will be >1/2). For a given covariate x we can thus view max(AUC,1-AUC) as its performance. For a finite sample of x's and z's, AUC can be computed as a function of the Mann-Whitney (or two-sample Wilcoxon) statistic for comparing the x's associated z = 0 and z = 1 respectively.
Supervised classification using the uni-variate gene expression average classifier Genes to be included in the gene signatures are selected in order of their ranked max(AUC,1-AUC) values. To merge the gene expression of a given gene signature to a single metagene expression value, an arithmetic mean is computed by summing up the expression values after multiplying expression values for genes negatively associated with the feature (AUC < 0.5) by -1 (Additional file 2). The resulting metagene expression values are then used in ROC analyses, i.e. by ranking the samples according to their metagene expression values. The optimal split of positive (i.e., 1) and negative (i.e., 0) samples is determined as the metagene expression threshold which produces the highest accuracy i.e., correct class assignments in respect to the real class, in the training set. More precisely, the threshold is computed as the mean metagene expression value of the two samples that constitute the border of the split. A new sample to be classified has its metagene expression value determined with the same genes to be multiplied by -1. The new sample is classified according to which side of the threshold the sample falls in, with a sample having higher metagene expression being classified as positive (i.e., 1) and with lower expression as negative (i.e., 0). The split yielding optimal accuracy in the ROC curve is determined using the R package ROCR [19]. The outlined approach is to some respect similar to the approach described by Rosenwald et al [20] except that we have simplified the use of metagenes further by using one single metagene and thus do not have to assign any specific weights to individual metagenes. In addition we optimize the threshold for each dataset.

Additional classification algorithms
We compared our classifier initially to a Support Vector Machine (SVM) with a linear kernel. SVMs have been shown to perform considerably well in microarray data [21] and SVMs with a linear kernel has been suggested to perform better in gene expression data than more complex SVM versions (Manual BRB Array Tools, [22]), and additionally, no parameter tuning is necessary for linear SVMs. Briefly, Support Vector Machines (SVMs) identify the maximum margin hyperplane that optimally separates the training samples (based on support vectors) and then classify unseen samples according to the side of the hyperplane they fall into [23]. We used the 'svm' function of the R package e1071 [24]. Additional classification algorithms were implemented using the R package MLInterfaces [25] using the default settings and included SVM with radial kernel (SVMradial), SVM with polynomial kernel (SVMpoly), k-nearest neighbor (knn, k = 3), random forest (rforest), recursive partitioning trees (rpart), bagging (bagging), linear discriminant analysis (lda), diagonal linear discriminant analysis (dlda), stabilized linear discriminant analysis (slda), neural network (neural net, with 3 hidden layers), except for the nearest centroid classifier (ncc) that was implemented using the pamr package [8].

Performance Evaluation
As the accuracies of prediction are dependent on the prior distribution of samples, we used balanced accuracies computed by (sensitivity + specificity)/2. Balanced accuracies are independent of the prior distributions [26]. Unbiased accuracies were obtained by leave-one-out-cross-validation (LOOCV). Feature selection was repeated in each loop of LOOCV. When testing the performance of a gene signature in independent validation data only the genes were used in the validation data, i.e. not the information on AUC and 1-AUC or the classification threshold. In a clinical validation platform, as qPCR and IHC, we assume that the classifier specification might not be taken over. We applied a LOOCV loop to determine max(AUC,1-AUC) and optimal classification threshold for the signature metagene in each loop. Accuracies obtained from cross-validation loops are an estimate for the accuracy obtained from the whole dataset, i.e. all samples.

R package rocc
Briefly, the package includes the functions tr.rocc, p. rocc, and o.rocc. The function tr.rocc uses a training set with a given phenotype to determine the metagene threshold. The function p.rocc predicts the class of a new sample using the classifier specification of the tr. rocc output. The function o.rocc performs a LOOCV loop using a metagene of given size, e.g., top 200 genes, with feature selection in each loop separately.
tr.rocc (data,out,xgenes = 200) p.rocc (tr.rocc.object,newsample) o.rocc (data,out,xgenes = 200) data = dataset as a matrix file with samples as columns and genes as rows out = phenotype as factor with levels 0 and 1 xgenes = number of genes that constitute the metagene; can be a numeric vector. newsample = sample to be classified using a classifier obtained from tr.rocc()

Feature selection
Two bladder cancer datasets, SanchezC and Stransky, were selected to explore the efficiency of predictor gene signatures based on ROC statistics. For each gene the association with muscle-invasiveness (MI) and tumor grade (G) as feature variables was estimated by calculating the AUC value in each dataset. The obtained AUC values ranged between 0.06 and 0.95, and 0.07 and 0.94 for MI, and between 0.14 and 0.85, and 0.10 and 0.89 for high grade (G3) in the SanchezC and Stransky data, respectively. The distribution of the AUC values deviated from the normal distribution and showed heavy tails (Figure 1). As a comparison, AUC values were also estimated in randomized versions of the San-chezC and Stransky datasets. In the randomized data 99% of the obtained AUC values were approximately between 0.7 and 0.3 for MI and tumor grade. A total of 31% and 23%, and 11% and 14% of genes fall outside these 99% borders for MI and Grade variables in the SanchezC and Stransky dataset, respectively. In contrast, 1% of genes are expected to fall outside for a randomized variable. The genes in excess to this 1% of false discovery genes are informative, and hence may be considered as the maximum size of a gene signature. Hence, a large proportion of genes show informative AUC values. Furthermore, as more genes are associated with MI than with grade, the major difference in bladder cancer phenotype seems to be associated with stage.
We then investigated the robustness of obtained AUC values by comparing AUC values from two different datasets. In Figure 2A we have plotted the AUC values for NMI/MI status and in Figure 2B the equivalent data for grade, for the SanchezC and Stransky data respectively. The correlations between the two datasets were Very few genes showed AUC values outside the 95% confidence interval of the randomized data in both datasets ( Figure 2).
We then investigated the possibility to identify robust predictor genes i.e., genes with informative AUC values in more than one dataset, by using more information during the feature selection process. We investigated deviations from a normal distribution as a possible additional criterion. In an optimal scenario an informative gene should show high expression in one and low expression in a second group in a two class situation and hence produce a bimodal distribution, or a distribution with a heavy tail. We used the Shapiro test for normality to identify genes with skewed or otherwise distorted distributions; the analysis revealed no differences between the two groups ( Figure 3A and 3B). We investigated the variance of the informative genes as genes with large variance are expected to produce more robust results and did note a significant shift to larger variance for genes with AUC values >0.7. This shift was however too small to be of any practical use ( Figure 3C and 3D). Similar results were obtained for the Stransky data (Additional file 3). From this we conclude that deviation from the normal distribution or large variance cannot be used to preselect genes with robust AUC values and hence no such functions were added to the software.
Gene expression is inherently noisy and random noise is expected to reduce the performance of predictors based on single genes. We therefore reasoned that the random noise effect could be counteracted by using the mean expression level for more than one gene. We consequently calculated the AUC values for all genes with respect to MI and grade in the Sanchez data. Before ranking, AUC values for genes with negative correlation, and hence showing AUC values <0.5, were inverted to values >0.5 by assigning AUC = 1-AUC -. We then designed gene signatures with increasing sizes by adding genes in order of their rank with steps of 1 at a time. For each signature the average expression level was calculated. In cases when genes showed negative association with MI or grade, the gene expression levels were inverted. The AUC values were then estimated using the expression levels of the created metagenes. Figure  4A and 4B show that the use of uni-variate gene expression average classifier results in considerably higher AUC values than those obtained for single and top ranking AUC genes. Furthermore, the AUC values turn stable when the metagenes become large enough.

Classification performance
We next compared the prediction performance of AUCbased metagenes with the standard and frequently used SVM-method. To accomplish this we constructed AUC metagenes with increasing number of genes in steps of 5 up to 500 genes. For comparison, we applied the most differentially expressed genes as determined by a t-test or AUC values to SVM, also in increasing steps of 5 genes. The accuracies were estimated using LOOCV in both cases. In Figure 4C and 4D we tested the NMI/MI classification in the SanchezC and in the Stransky data, respectively. As can be seen the AUC metagene approach is as efficient as the SVM approach using both t-test and AUC as feature selection criteria; only slightly poorer in the SanchezC data, with accuracies close to 0.9 in both cases, but slightly better in the Stransky data, with accuracies close to 0.85 for the AUC metagene and 0.80 for the SVM approach. Hence, the simple AUC metagene approach seems to be as efficient as the more sophisticated SVM.
The robustness of the AUC metagene approach was then tested in independent data. For these purposes, we designed AUC metagene signatures with increasing sizes in one of the three datasets, SanchezC, Stransky, and  Figure  2A. B) Deviation from normal distribution of genes grouped by their association to grade in Figure 2B. C) Box plot for standard deviation of genes grouped by association to MI. D) Box plot for standard deviation of genes grouped by association to grade. -log10 p-value = logarithmic p-value of Shapiro test for normality.
Blaveri. The accuracy for each AUC metagene was first determined in the respective training set using LOOCV ( Figure 5). Genes were then ranked according to AUC values using all cases in the training set and used as a source for producing gene signatures of increasing size applied to the independent data sets. These metagenes were then used to establish thresholds for maximum accuracies in the validation data sets within LOOCV loops. Hence, the signature genes used to produce the metagene were derived from the training set but the actual thresholds were derived from the validation sets. To estimate classification performance in the validation data sets we again performed LOOCV. In Figure 5A we have used metagenes obtained from the SanchezC dataset to predict NMI/MI in the Blaveri and Stransky datasets. As can be seen, when applied to the training set, i.e. the SanchezC dataset itself, accuracies close to 0.9 are obtained whereas in the Blaveri and Stransky accuracies slightly above 0.9 are obtained. Similar results were obtained for the other combinations of training and validation data, and hence, the AUC metagene approach is robust to independent data.
A close inspection of Figure 5A-C seems to indicate that signatures for MI in bladder cancer smaller than 100 genes are less robust i.e. show a large variance in accuracy when applied to validation data. To explore this further, we calculated the change in accuracies between each step of signature size for signatures derived from the same dataset. These values were then plotted ( Figure 5D). It is obvious form Figure 5D that signatures smaller than 100 are sensitive to the composition of the signature, whereas signatures above 200 show robust performance, seen as small or no changes in accuracies for different AUC metagenes. From this The metagene classifier approach was then evaluated further using two approaches. First, we evaluated the algorithm performance regarding different endpoints. For this purpose we used three breast cancer datasets (vandeVijver, WangY, Sotiriou), three neuroblastoma datasets (WangQ, JanoueixL, Attiyeh) and two lung cancer datasets (Angulo, Takeuchi), and six different endpoints; ER status, tumor grade, tumor size, tumor stage, MYCN amplification status, and histological subtype. Second, we compared the uni-variate gene expression average classifier with 12 alternative classification algorithms using two feature selection procedures.
In these comparisons we predetermined the number of genes to be included in the metagene to 50, as this is close to the median number of genes included in published gene signatures. For each endpoint the selected features (genes) were determined in one dataset and then applied to the remaining ones (Table 1). For classification of e.g. ER status in breast cancer the gene list was derived from one dataset and then applied to the two remaining ones. This procedure was repeated using each dataset for feature selection resulting in a total of six tests. Overall the accuracies were high, ranging from 0.77 to 0.94 and similar to those obtained by SVM (range 0.71 to 0.95) when applied to validation data. Instances resulting in lower range accuracies using the metagene predictor where also low when using the SVM approach, indicating that the obtained lower accuracies were dependent on dataset and not on the algorithm used. Tumor grade was only available for two breast cancer datasets, vandeVijver and Sotiriou. Grade was predicted in validation data with accuracies ranging from 0.82 to 0.85 using the metagene predictor, similar to what was obtained by SVM ( Table 1). The uni-variate gene exression average classifier could also faithfully predict MYCN status in neuroblastoma and histopathological subtype in lung cancers. Tumor stage in  neuroblastoma and tumor grade in lung cancers were predicted with lower but significant accuracies. As can be seen from the computed average accuracies for each phenotype, the simple metagene predictor on average performs just as well, or better, than the more complex SVM approach. All predictor evaluations were repeated using 10 and 200 member metagenes, the obtained accuracies were however not influenced by the size of the metagenes (Additional file 4).
For the extensive comparison with other classification algorithms we limited the analysis to ER status in breast cancer, MYCN status in neuroblastoma, and adenocarcinoma/squamous cell carcinoma status in lung cancer using a t-test as feature selection criteria ( Table 2). For comparison we also used AUC as feature selection criteria (Additional file 5). As can be seen in Table 2, all classification algorithms show similar performance. The performance is more dependent on the endpoint/dataset than the actual algorithms used. Furthermore, simple classifiers such as the discriminant analysis methods (lda, dlda, slda) perform just as well as the more sophisticated ones, such as the neural network algorithm neuralnet. Our uni-variate gene expression average classifier shows good performance, ranking first, third, and fifth in the lung cancer, breast cancer and neuroblastoma datasets, respectively. Almost identical results were obtained when using AUC as feature selection criteria (Additional file 5).

Conclusions
We have developed a new algorithm for tumor classification based on the formation of gene expression metagenes and of feature selection using AUC values obtained by ROC analysis. This simple classification algorithm shows good performance and is robust in independent validation data. The described approach has the potential to be a valuable complement to algorithms based on alternative principles.