Robust multi-tissue gene panel for cancer detection

Background We have identified a set of genes whose relative mRNA expression levels in various solid tumors can be used to robustly distinguish cancer from matching normal tissue. Our current feature set consists of 113 gene probes for 104 unique genes, originally identified as differentially expressed in solid primary tumors in microarray data on the Affymetrix HG-U133A platform in five tissue types: breast, colon, lung, prostate and ovary. For each dataset, we first identified a set of genes significantly differentially expressed in tumor vs. normal tissue at a p-value threshold of 0.05 using an experimentally derived error model. Our common cancer gene panel is the intersection of these sets of significantly dysregulated genes and can distinguish tumors from normal tissue in all five of these tissue types.
Methods Frozen tumor specimens were obtained from two commercial vendors, Clinomics (Pittsfield, MA) and Asterand (Detroit, MI). Biotinylated targets were prepared using published methods (Affymetrix, CA) and hybridized to Affymetrix U133A GeneChips (Affymetrix, CA). Expression values for each gene were calculated using the Affymetrix GeneChip analysis software MAS 5.0. We then used a software package called Genes@Work for differential expression discovery, and SVM light with a linear kernel for building classification models.
Results We validated the predictive power of this gene list on several publicly available data sets generated on the same platform. Of note, when analysing the lung cancer data set of Spira et al using an SVM linear kernel classifier, our gene panel had 94.7% leave-one-out accuracy compared to 87.8% using the gene panel in the original paper. In addition, we performed high-throughput validation on the Dana Farber Cancer Institute GCOD database and several GEO datasets.
Conclusions Our results show the potential of this panel as a robust classification tool for multiple tumor types on the Affymetrix platform, as well as on other whole genome arrays. Apart from possible use in the diagnosis of early tumorigenesis, other potential uses of our methodology and gene panel include assisting pathologists in the diagnosis of pre-cancerous lesions, determining tumor boundaries, assessing levels of contamination in cell populations in vitro and identifying transformations in cell cultures after multiple passages. Moreover, based on the robustness of this gene panel in identifying normal vs. tumor, mislabelled or misinterpreted samples can be pinpointed with high confidence.


Background
Rapid and accurate classification of cancerous tissue samples is an unmet scientific and clinical need. Standard clinical practice in identifying cancer relies on pathological examination of biopsy specimens, radiological images and histology. However, these diagnoses can be incorrect because of atypical morphologies or poorly extracted biopsies. An error by the pathologist in determining whether a surgically resected tumor has sufficient normal cells in its margins could have significant consequences for the patient. A corroboratory analysis may also benefit laboratory experiments on cell lines or tissue samples which might be labelled as cancerous, but might in fact be significantly or wholly contaminated by surrounding or externally derived noncancerous tissue.
Several previous studies have attempted to find a common gene signature in multiple neoplasms. One such group at the NIH has established a gene panel capable of distinguishing benign from malignant tumors in four different tissue types [1]. For distinguishing cancer from normal tissue specifically, two groups from Johns Hopkins [2,3] have used different methods to analyse the data collected by ONCOMINE (http://www.oncomine.com) to establish a multi-tissue cancer signature, and have reported success in classifying cancer from normal tissue. The main difference between these two approaches is the algorithm used for feature extraction. Xu et al [3] used a method called top-scoring pair of groups (TSPG) to select informative genes, which relies on a random subsampling of genes. Rhodes et al [2] used a more classical approach to determine the most significantly differentially expressed genes, which treats each gene as an independent feature in the dataset. We also use the t-statistic to determine differential expression, similar to Rhodes et al [2], but do not assume an underlying normal distribution. Instead, we used an experimentally derived error model for Affymetrix chips incorporated in the Genes@Work software suite from IBM Research, which is freely available at: http://www.research.ibm.com/FunGen/FGDownloads.htm. The experimental model used in Genes@Work determines p-values based on a multi-tissue model derived from replicate measurements on Affymetrix chips to assess stochastic and systematic (handling) errors in microarray data analysis.
Our training set consists of a proprietary sample set for normal and cancerous tissue from breast, colon, lung, prostate and ovary. A detailed description of this data is available in the methods section. Using this high quality multi-tissue data set, we applied an integrated informatics strategy which combined targeted bioinformatics and analytical approaches to identify and validate a panel of genes to distinguish normal from cancer tissue. We also demonstrated that an accurate diagnosis of cancer tissue is possible using modern gene expression arrays.

Training set sample and microarray data generation
Frozen tumor specimens were obtained from two commercial vendors, Clinomics (Pittsfield, MA) and Asterand (Detroit, MI). The data were obtained from five tissue types (breast, colon, lung, prostate and ovary). Biotinylated targets were prepared using published methods [4] and hybridized to Affymetrix U133A GeneChips (Affymetrix, CA). Expression values for each gene were calculated using the Affymetrix GeneChip analysis software MAS 5.0. Chips were rejected if the average intensity was < 40 or if the background signal was > 100. For normalization, all probe sets were scaled to a target intensity of 600 and scale mask files were not selected.
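The chip rejection rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the chip identifiers and the tuple layout `(chip_id, average_intensity, background)` are hypothetical, and only the two thresholds (40 and 100) come from the text.

```python
# Thresholds taken from the text: a chip is rejected if its average
# intensity is < 40 or if its background signal is > 100.
MIN_AVERAGE_INTENSITY = 40.0
MAX_BACKGROUND = 100.0


def chip_passes_qc(average_intensity, background):
    """Return True when a chip satisfies both quality thresholds."""
    return (average_intensity >= MIN_AVERAGE_INTENSITY
            and background <= MAX_BACKGROUND)


def filter_chips(chips):
    """Keep the IDs of chips that pass QC.

    chips: iterable of (chip_id, average_intensity, background) tuples.
    This tuple layout is an assumption made for illustration only.
    """
    return [chip_id for chip_id, avg, bg in chips if chip_passes_qc(avg, bg)]


# Hypothetical example values, not data from the study.
example_chips = [
    ("chip_01", 55.0, 80.0),   # passes both thresholds
    ("chip_02", 35.0, 60.0),   # average intensity too low
    ("chip_03", 70.0, 120.0),  # background too high
]
print(filter_chips(example_chips))  # ['chip_01']
```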

Mapping Affymetrix probes to Agilent probes
Affymetrix probes were first mapped to UniGene IDs using the publicly available Affymetrix annotation table and then mapped to Agilent probes using the corresponding Agilent annotation table. Since many of the genes in our panel are represented by multiple probes, the average expression of each probe was measured across all samples, and the probe with the highest overall signal for each gene in our panel was selected as the equivalent diagnostic feature.
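The probe selection step described above (one probe per gene, chosen by highest average signal across samples) might look like the following sketch. The data structures, probe IDs and gene names are hypothetical placeholders; the real mapping would come from the Affymetrix and Agilent annotation tables.

```python
from statistics import mean


def select_representative_probes(expression, probe_to_gene):
    """Pick, for each gene, the probe with the highest mean signal.

    expression:    dict mapping probe ID -> list of values across samples
    probe_to_gene: dict mapping probe ID -> gene identifier
    Returns a dict mapping gene identifier -> selected probe ID.
    Probes without a gene annotation or without values are skipped.
    """
    best = {}  # gene -> (probe_id, mean_signal)
    for probe_id, values in expression.items():
        gene = probe_to_gene.get(probe_id)
        if gene is None or not values:
            continue
        avg = mean(values)
        if gene not in best or avg > best[gene][1]:
            best[gene] = (probe_id, avg)
    return {gene: probe_id for gene, (probe_id, _) in best.items()}


# Hypothetical example: two probes for GENE_A, one for GENE_B,
# and one probe with no gene annotation.
expression = {
    "probe_a1": [100.0, 120.0, 110.0],
    "probe_a2": [300.0, 280.0, 320.0],
    "probe_b1": [50.0, 55.0, 45.0],
    "probe_x": [10.0, 12.0, 11.0],
}
probe_to_gene = {
    "probe_a1": "GENE_A",
    "probe_a2": "GENE_A",
    "probe_b1": "GENE_B",
}
print(select_representative_probes(expression, probe_to_gene))
```

Selecting by mean signal favours the probe least affected by low-intensity noise, which is one plausible reading of "highest overall signal"; other summaries (for example the median) would need only a one-line change.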

Description of the Genes@Work Software
We used Genes@Work [5], a software package created by IBM Research, to determine differentially expressed genes in each tissue type. This software uses an experimentally validated non-linear error model for gene expression measurement error, derived from replicate measurements, which de-emphasizes the significance of variations in genes that have low expression and emphasizes the significance of variations in genes that have high expression.

SVM Classifiers
All classifiers were built using SVM light (http://svmlight.joachims.org/) with the linear kernel option, and complete leave-one-out estimations were calculated for each experiment. M-fold cross validation was also performed with m = 5 and 300 samplings with replacement (Table 2). Normalized data for each of the tissue types were separated into normal and cancer classes, which were designated as the positive and negative classes respectively. Classifiers were built for each tissue type separately as well as globally and saved as SVM model files with support vector information for future classification.
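As an illustration of how normalized expression data could be prepared for SVM light, the sketch below writes samples in the SVM light input format (`<target> <index>:<value> ...`, with feature indices starting at 1 and in increasing order). The label convention follows the text (normal = +1, cancer = -1); the function names, the sample values and the number formatting are assumptions made for this example.

```python
# Label convention from the text: normal is the positive class,
# cancer is the negative class.
CLASS_LABELS = {"normal": 1, "cancer": -1}


def to_svmlight_line(label, values):
    """Format one sample as an SVM light input line.

    label:  +1 or -1
    values: expression values ordered by panel probe; feature
            indices are assigned 1, 2, 3, ... in that order.
    """
    if label not in (1, -1):
        raise ValueError("label must be +1 or -1")
    features = " ".join(
        f"{index}:{value:.6g}" for index, value in enumerate(values, start=1)
    )
    return f"{label:+d} {features}"


def to_svmlight_text(samples):
    """samples: list of (class_name, values) with class_name in CLASS_LABELS."""
    lines = [to_svmlight_line(CLASS_LABELS[name], values)
             for name, values in samples]
    return "\n".join(lines) + "\n"


# Hypothetical two-probe example.
example_samples = [("normal", [7.5, 10.0]), ("cancer", [9.25, 6.0])]
print(to_svmlight_text(example_samples), end="")
```

A file written this way would be passed to SVM light's `svm_learn`; in the versions we are aware of, `-t 0` selects the linear kernel and `-x 1` requests leave-one-out estimates, but the options should be checked against the documentation of the installed release.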

Generating a common gene panel for cancer classification
In order to explore the possibility of a common gene panel that can reliably detect cancer originating from multiple tissue types, we created a compendium of 5 microarray datasets from prostate, lung, ovarian, colon, and breast tissue, each with cancer and normal samples from multiple subjects (Table 1). Primary tumor samples and normal samples were collected and processed on the Affymetrix HG-U133A GeneChip and subsequently RMA normalized and uploaded to an internal database. The raw intensity data were also log2 transformed, normalized and then input to Genes@Work. This software uses an error model based on replicate Affymetrix chip measurements to determine the true error bounds and p-values and was therefore an ideal choice for this type of analysis. Figure 1 shows an example of the program output. Points outside of the lines correspond to genes that are differentially expressed with a p-value < 0.05. We generated one set of significantly dysregulated genes for each tissue type by comparing normal samples to tumor samples in that tissue type. Table 1 shows the distribution of samples and the corresponding differentially expressed genes.
Next, we identified the intersection of all genes differentially expressed in all five tissue types, resulting in our common cancer gene panel. Using leave-one-out (LOO) cross validation as well as m-fold cross validation, we verified that our common "tumor discriminating" gene panel was robust and could separate cancerous tissue from normal tissue with accuracies exceeding 90% when the tissue of origin was known (see Table 2). Figure 2a shows the relative expression of each gene in our common feature set in cancer and corresponding normal samples for each tissue type. This figure also shows the expression trends of each of the genes in our panel. A binary table representing this information, including all probe IDs used, is available in Table 3. Figure 2b depicts the entire data set grouped by phenotype (cancer versus normal) and demonstrates the striking distinction in gene expression of the panel genes between cancer and normal tissue. To validate that our common cancer signature can correctly and robustly classify tumors, we applied our panel to data from several published studies on tumors originating from different tissue types. The results are summarized in Table 4, and described in detail below.
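The panel construction step (intersecting the per-tissue sets of differentially expressed probes) reduces to a set intersection. The sketch below assumes the per-tissue lists have already been produced by the differential expression analysis; the probe names are hypothetical placeholders.

```python
def common_panel(de_probes_by_tissue):
    """Return the sorted probes that are differentially expressed in
    every tissue type.

    de_probes_by_tissue: dict mapping tissue name -> iterable of probe IDs
    found significant in that tissue's tumor vs. normal comparison.
    Direction of change is not considered here, only membership.
    """
    probe_sets = [set(probes) for probes in de_probes_by_tissue.values()]
    if not probe_sets:
        return []
    return sorted(set.intersection(*probe_sets))


# Hypothetical per-tissue results (placeholder probe names).
de_probes = {
    "breast": ["p1", "p2", "p3", "p7"],
    "colon": ["p1", "p2", "p4"],
    "lung": ["p1", "p2", "p5"],
    "prostate": ["p1", "p2", "p6"],
    "ovary": ["p1", "p2", "p3"],
}
print(common_panel(de_probes))  # ['p1', 'p2']
```

Because the intersection ignores the direction of change, a probe that is upregulated in most tissues but downregulated in one would still be retained by this sketch.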

Validation set #1: Lung Cancer
Spira et al identified an 80-gene panel which, together with cytopathology, was able to distinguish smokers with and without stage 1 lung cancer using bronchial epithelial brushing samples [6]. Using a weighted voting algorithm, they were able to achieve 80% accuracy, which improved to 95% when these predictions were combined with cytopathology. In comparison, an SVM classifier trained on their data using our gene panel (without using cytopathology) was able to achieve 94.75% leave-one-out accuracy using the microarray data alone. To make a more direct comparison, we also built SVM classifiers based on Spira's 80 gene probes, as well as on the top 104 probes selected with the same feature extraction methodology that they used. Figure 3 shows the distributions of accuracies for classifiers based on a random choice of genes compared with our panel as well as Spira's original panel described in the publication.
Although the data of Spira et al had samples from bronchial airway epithelial cells and our lung data was for primary lung cancer, the tumor samples in the datasets had similar signatures with respect to the genes in our panel. This was verified by checking that both SVM classifiers using our gene panel, built on the Spira data and tested on primary lung cancer data or vice versa, had classification accuracy > 90%.
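The comparison against randomly chosen gene lists can be sketched as a leave-one-out loop repeated over random feature subsets. To keep the example dependency-free, a nearest-centroid classifier is used here as a stand-in; it is not the linear SVM used in the study, and all function names, data values and the random seed are assumptions for illustration.

```python
import random


def nearest_centroid_predict(train_x, train_y, sample):
    """Stand-in classifier (NOT an SVM): predict the label whose class
    centroid is closest to the sample in squared Euclidean distance."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    # sorted() makes tie-breaking deterministic.
    return min(sorted(centroids), key=lambda lab: sq_dist(sample, centroids[lab]))


def loo_accuracy(x, y, feature_idx):
    """Leave-one-out accuracy using only the columns in feature_idx."""
    correct = 0
    for i in range(len(x)):
        train_x = [[row[j] for j in feature_idx]
                   for k, row in enumerate(x) if k != i]
        train_y = [lab for k, lab in enumerate(y) if k != i]
        held_out = [x[i][j] for j in feature_idx]
        if nearest_centroid_predict(train_x, train_y, held_out) == y[i]:
            correct += 1
    return correct / len(x)


def random_panel_accuracies(x, y, panel_size, n_panels, seed=0):
    """LOO accuracies for n_panels randomly drawn feature subsets."""
    rng = random.Random(seed)
    n_features = len(x[0])
    return [loo_accuracy(x, y, rng.sample(range(n_features), panel_size))
            for _ in range(n_panels)]


# Hypothetical toy data: column 0 separates the classes, columns 1-2 do not.
toy_x = [[1.0, 5.0, 2.0], [1.2, 1.0, 2.1], [0.9, 3.0, 1.9],
         [5.0, 4.9, 2.0], [5.2, 1.1, 2.05], [4.8, 3.1, 1.95]]
toy_y = [1, 1, 1, -1, -1, -1]
print(loo_accuracy(toy_x, toy_y, [0]))
print(random_panel_accuracies(toy_x, toy_y, panel_size=1, n_panels=5))
```

The accuracy of a candidate panel can then be placed against the distribution returned by `random_panel_accuracies`, which is the kind of comparison Figure 3 reports for the real data.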

Validation set #2: Breast Cancer
Wang et al published a study with a 286-sample data set of breast cancer samples, 180 from patients who eventually developed distant metastasis within 5 years and 106 from patients who did not [7]. Using both a breast-specific and the multi-tissue SVM classifier, we classified all 286 of these samples as cancerous. Further, although we did not expect a positive result, as an exploratory measure we built a classifier that attempted to distinguish between metastatic tumors and non-metastatic lesions. Our classifier achieved 88.8% LOO accuracy, suggesting that although the panel was developed to distinguish tumors from normal tissue, some of these genes also seem to carry information about the metastatic properties of these tumors.

Validation set #3: Ovarian Cancer
Data from the Dana Farber Cancer Institute GCOD dataset repository [8] were selected for high-throughput validation of our cancer signatures. We selected data from ovarian and colon tissue which had data points for all 113 of our probes. The array-normalized data were then tested using SVM classifiers trained on our data set. Summaries of the results of this analysis are included in Table 4 and show that our panel is highly accurate. Study 1 [9]: 103 samples comprising mucinous, clear cell, serous and endometrioid ovarian carcinomas. Using both an ovarian-specific classifier and a multi-tissue classifier, we obtained 100% accuracy.
Using a colon specific classifier & a multi-tissue classifier we were able to correctly label all cancer samples.
To further test the significance of our gene panel, we used an internal data warehouse called tranSMART (private communication, J. Smart, Centocor R&D, Inc) that takes a gene signature as input and searches all deposited internal and public clinical and non-clinical microarray analyses for similar signatures. When we input our multi-tissue average signature, the top significant results all referred to normal vs. cancer comparisons. Our global signature hit 47% of the cancer vs. normal comparisons, which is significant given that such comparisons comprise only 5% of the database. We then tested the top publicly available hits to see how accurately we could predict the two classes. Specifically, we tested two datasets containing tissue types outside the scope of our training set.

Validation set #5: Bladder cancer and Melanoma
The first dataset analysed was a bladder cancer dataset from Dyrskjot et al [13] containing 60 samples, 9 of which were normal and the rest were superficial transitional cell carcinoma with and without surrounding carcinoma in situ. The second [14] had 7 normal skin samples and 45 melanoma samples. An interesting visualization of this dataset is available in Additional File 1. Note that our gene panel demonstrates an incomplete transformation signature of the benign nevi in this data set. An analysis combining these two data sets using our SVM classifier yielded 95.2% LOO accuracy for distinguishing tumor from normal (Table 4).

Validation set #6: Clear Cell Renal Cell Carcinoma (ccRCC)
To establish how well our panel performed for ccRCC/normal-kidney discrimination, we used a previously published dataset which was also used by two other groups [2,3] in testing their gene lists to distinguish ccRCC from normal kidney. We downloaded the GSE781 dataset from GEO (http://www.ncbi.nlm.nih.gov/gds/?term=GSE781) and applied our SVM classifier to these data to distinguish ccRCC from normal kidney. The results are shown in Table 5. At first pass, we noticed that our gene panel performed poorly on this dataset. TMEV [15] visualization of this dataset using our gene panel revealed that three of the samples with sample headings listed as cancerous very closely resembled normal samples (Additional File 2). Furthermore, a careful analysis of the tumor and normal labels in the original data showed that these three samples were indeed normal. Apparently, these three samples were misinterpreted as tumors in the analyses of Rhodes et al [2] and Xu et al [3] (we describe this issue further in the discussion section). After properly labelling these samples as normal, our SVM had 100% testing accuracy in distinguishing ccRCC from normal kidney.
To further test our accuracy on ccRCC, we also analyzed a microarray dataset from an on-going collaboration at the University of North Carolina Medical School consisting of 52 ccRCC samples and 18 patient-matched normal kidney samples from two studies [16][17][18]. The data were collected on Agilent G4112F whole genome arrays and are posted on GEO (GSE16449). After mapping our Affymetrix probe features to genes, we created a dataset where the level of each gene was represented as the expression level of the probe (for that gene) with the highest average expression across samples (see methods). This reduced the dataset to 100 genes, on which we applied our SVM analysis methodology to distinguish ccRCC from normal kidney. The SVM also had 100% leave-one-out cross validation accuracy on these samples, which suggests that our panel is not only robust across tissue types but also appears to be platform independent. A TMEV [15] visualization of this dataset is also available in Additional File 3.

Discussion
For the public datasets we used, the pathological stage is given in the papers cited above. For the proprietary datasets, Additional File 4 details all the staging information we have on the tumor samples. We have identified a robust panel of commonly differentially expressed genes from 77 normal tissues and 125 cancer samples and demonstrated that these features capture the neoplastic elements of the cancer samples in sufficient detail across many tumor types to provide an accurate diagnostic for cancer detection. Some of the genes in our panel have been previously implicated in cancer association studies. For example, KIAA0101 (p15PAF) is a PCNA-associated factor which has been shown to be upregulated in tumor samples [19]. This gene, as well as two others, NME1 and CKS2, is also present in both the Rhodes and Xu signatures [2,3]. However, there are some unexpected discoveries. For example, we found GAPDH, generally believed to be a housekeeping gene, upregulated in tumors vs. normal in all tissues with the exception of prostate tissue, where it had the opposite signature. Our results are in agreement with a recent observation [20] that several housekeeping genes, including GAPDH, have a differential signature in tumors versus normal tissue.
When the gene signature is analysed for its functional classifications, it becomes clear that it is enriched in categories important for cell growth. The highly enriched categories include actin cytoskeleton organization, DNA packaging, nucleosome and chromatin assembly, and regulation of blood vessel size.
While the Rhodes and Xu gene panels have demonstrated similar success in diagnostic capabilities, our gene panel is more robust both in design and performance (Table 5). Our panel outperformed the panels by Rhodes et al [2] and Xu et al [3] in 4 out of 6 tumor types while performing equally well for the remaining two. Taken as a whole, our panel achieved 96% accuracy. It should be noted, however, that three samples interpreted as ccRCC in the Rhodes et al [2] and Xu et al [3] studies were actually normal samples in the original dataset and thus the sample headings were misleading. For instance, one of the samples interpreted as "tumor" in the Rhodes et al [2] and Xu et al [3] studies had the description: "N4 Renal Cell Carcinoma" in GEO. However, the detailed description in the original data read: "Trizol isolation of total RNA from normal tissue adjacent to Renal Cell Carcinoma", which meant that this sample should have been classified as Normal Kidney. In the analyses done by Rhodes et al [2] and Xu et al [3], these errors were not corrected, which may partly explain their poor results on this dataset. The three samples which are misinterpreted in the GEO database as tumor have sample IDs GSM12283, GSM12300 and GSM12444. After properly labelling these samples as normal, our SVM had 100% testing accuracy in distinguishing ccRCC from normal kidney.
Figure 3 Distribution of LOO accuracies. Distribution of LOO accuracies using randomly selected gene lists to classify lung cancer from the dataset of Spira et al [6], overlaid with accuracies using the dataset from [6] compared to our gene panel.
The reason for the robustness of the panel is that we use a methodology (Genes@Work) where the error model for expression levels (noise) is derived from actual replicate measurements [5] rather than from assuming that the underlying error is Gaussian, which is implicit in the error models used in other panels. On a practical level, our panel performs remarkably well in unseen testing datasets, many of which were on tissue types which were outside the tissue sets in the training data. Due to the robustness of our gene panel, it should be particularly useful in accurate identification of mislabelled or misinterpreted samples.

Conclusions
Our feature set has several potential uses. A common problem of cancer sample collection is that surrounding normal tissue can severely contaminate the sample. Since the tissue of origin is usually known in biopsies, our panel can be used as a tool for rapid determination of the presence or absence of cancer cells as an aid to pathologists. It can also be used to determine contamination in in-vitro experiments.
Expression-based diagnosis and risk assessment are rapidly gaining popularity in the clinic [21], with diagnostic panels using measurements of 10 to 35 genes via qRT-PCR experiments. A possible follow-up to our study would be to identify and validate a minimal subset of probes which retain sufficient predictive power and which can be measured using RT-PCR from FFPE sections.

Additional material
Additional File 1 Melanoma data set (GSE 3189) visualization. Melanoma data set (GSE 3189) gene expression sorted by the ratio of gene expression in cancer vs. normal. The middle portion contains nevus samples, which are considered benign. Interestingly, they appear to have a mixed signature that is an incomplete transformation from normal to cancer.
Additional File 2 The kidney cancer validation sets. Samples on the left are normal, those on the right are cancer. The first image represents the Lenburg et al. [19] data set. The mislabelled samples in question are the rightmost 3 samples in the normal subgroup.