- Research article
- Open Access
- Open Peer Review
Classification between normal and tumor tissues based on the pair-wise gene expression ratio
BMC Cancervolume 4, Article number: 72 (2004)
Precise classification of cancer types is critically important for early cancer diagnosis and treatment. Numerous efforts have been made to use gene expression profiles to improve precision of tumor classification. However, reliable cancer-related signals are generally lacking.
Using recent datasets on colon and prostate cancer, a data transformation procedure from single gene expression to pair-wise gene expression ratio is proposed. Making use of the internal consistency of each expression profiling dataset this transformation improves the signal to noise ratio of the dataset and uncovers new relevant cancer-related signals (features). The efficiency in using the transformed dataset to perform normal/tumor classification was investigated using feature partitioning with informative features (gene annotation) as discriminating axes (single gene expression or pair-wise gene expression ratio). Classification results were compared to the original datasets for up to 10-feature model classifiers.
82 and 262 genes that have high correlation to tissue phenotype were selected from the colon and prostate datasets respectively. Remarkably, data transformation of the highly noisy expression data successfully led to lower the coefficient of variation (CV) for the within-class samples as well as improved the correlation with tissue phenotypes. The transformed dataset exhibited lower CV when compared to that of single gene expression. In the colon cancer set, the minimum CV decreased from 45.3% to 16.5%. In prostate cancer, comparable CV was achieved with and without transformation. This improvement in CV, coupled with the improved correlation between the pair-wise gene expression ratio and tissue phenotypes, yielded higher classification efficiency, especially with the colon dataset – from 87.1% to 93.5%. Over 90% of the top ten discriminating axes in both datasets showed significant improvement after data transformation. The high classification efficiency achieved suggested that there exist some cancer-related signals in the form of pair-wise gene expression ratio.
The results from this study indicated that: 1) in the case when the pair-wise expression ratio transformation achieves lower CV and higher correlation to tissue phenotypes, a better classification of tissue type will follow. 2) the comparable classification accuracy achieved after data transformation suggested that pair-wise gene expression ratio between some pairs of genes can identify reliable markers for cancer.
Tumor development is a process in which gene expression is modified, causing abnormal cell behaviour . Many techniques have been developed to identify abnormalities of gene expression, as reflected by abundance of mRNA transcripts between normal and tumor. The completion of the Human Genome Project and advances in DNA-array technology have allowed highly parallel genetic analyses to take place on a genome-wide scale. They have revolutionized the way tumors are studied, and promised to provide a better and more thorough understanding of the underlying mechanisms for tumorigenesis. Eventually, they will lead to more comprehensive diagnosis/prognosis of tumor with more effective therapeutic interventions.
Despite its advantages, the DNA-array technology poses three major challenges that render the interpretation of expression data less efficient than expected. Firstly, the gene expression data is inherently variable due to various factors that either depend on biological factors that remain difficult to control (cross-contaminated samples of tumor and normal cells), or depend on difficulties in setting up of the experiment (RNA extraction) . These drawbacks interfere with the subsequent array analysis aimed to identify reliable markers that best correlate with the tissue phenotypes. Efforts have been devoted to address these drawbacks by incorporating various raw data scaling, data filtering, normalization and improvement of the classifier algorithm . Promising results have been reported claiming near-perfect classification accuracy . However, the usually small number of samples per class in most studies and the highly biased cross validation procedures cast doubt on the classification accuracy in terms of their statistical significance . This statistical constraint creates a further challenge for DNA-array technology where the number of features in arrays is in thousands while tissue samples are available in limited number. This causes high probability for any classification to be correct by chance alone. Thirdly, although it has been recently established that genes segregate into clusters of interacting networks  instead of acting as one single entity, most cancer DNA-array studies have only investigated single gene aberration (up/down-regulated) when comparing tumor expression profiles to their corresponding normal tissue controls. In an interesting study, B∅ and Jonassen tried to circumvent some of these difficulties by investigating genes in pairs. They demonstrated that gene pairs can be used to improve discrimination between different tissue classes . This idea of studying genes in pairs, or even in higher order clusters, should be explored further to reveal new features of complex expression profiling datasets.
In this study, we introduced a novel data transformation meant to investigate relationships between pair-wise gene expression ratios and tissue phenotype within a given experiment. With this procedure, we aimed to discover strong cancer-related signals (features) that exist in the form of pair-wise ratios (or higher order relationship when we extend to N-feature model classifier for N>2) in a given sample, while improving the signal to noise ratio of the dataset by minimizing its coefficient of variation (CV). The underlying concept for adopting pair-wise gene expression ratios as the discriminating axes for tissue type classification is that an experiment is self-consistent (in terms of factors affected either by the biology of the phenomenon of interest, or of the experimental setting, or both). With this approach we could "subtract" correlated variations by considering the sample as a whole, without making inferences such as those needed for normalization. Basically, we avoided studying gene expression in an absolute term because this requires robust normalization method to account for arrays from different experiments, different platforms and different profiling technologies. By resorting to analyze features in the form of ratios, we attempted to minimize the effect of normalization and look for co-varying signals in each experiment.
Colon and prostate cancer datasets
The 62 colon cancer sample dataset is composed of measurements for 1,988 gene probes, of which 40 were labelled as tumor and 22 were labelled as normal. The samples were collected from patients, their RNAs were extracted and hybridised to Affymetrix Hum6000 arrays. Please refer to paper . The normalized dataset can be downloaded at http://microarray.princeton.edu/oncology/-affydata/index.html.
The 102 prostate cancer sample dataset is composed of measurements for 12,600 gene probes, of which 52 were labelled as tumor and 50 were labelled as normal. The samples were collected from patients, their RNAs were extracted and hybridised to Affymetrix U95Av2 arrays. Please refer to paper . The normalised dataset can be downloaded at http://www-genome.wi.mit.edu/MPR/Prostate.
Both datasets were pre-processed to eliminate those probe pairs that showed significant fluctuation in their hybridisation signals (those greater than 3 standard deviation away from the mean for their ESTs, and the probes pairs that showed an overall higher intensity in their mismatch probe cells (MM) than their corresponding perfect match probe cells (PM); these probe pairs indicate non-specific hybridisation by background RNAs). Both datasets used average intensity as quantitative measurements of the level of gene expression. Base-10 logarithmic transformations were performed for each dataset.
Initial gene selection
For downstream classification analysis, we extracted only the genes whose expression pattern correlated strongly to the tissue phenotype. To achieve this, we first calculated the correlation coefficient r i (Equation 1) for each gene i using the full dataset, and ranked the genes according to their correlation coefficient r i. For the calculation of r, we assigned a number to each tissue phenotype: 1 for normal tissue and 10 for cancer tissue. After obtaining the correlation coefficients for all genes, we used a simple threshold value (|r|>0.4) to select the set of cancer-related genes. There were two reasons for set the threshold value at 0.4. When lower thresholds were used, we incorporated many genes that were not known to be cancer-related (data not shown). Furthermore, too many genes will later cause computer tractability problem when we calculate their pair-wise gene expression ratio for each tissue sample and later the N-feature model classifier. At |r|>0.4, we were able to account for most of previously known cancer related genes.
where V 1 is a vector representing the gene expression pattern for gene #1; V sample is the dichotomous representation of tissues; S V1 and S sample standard deviation of V 1 , V sample ; , are the mean of V 1 , V sample .
Transforming the gene expression data to investigate the expression equilibrium between genes pairs
The raw expression data within a sample tissue was transformed into measurement of the pair-wise gene expression ratio for any combinatorial pairs of genes. For the 1,988 gene expression intensities for each sample (e 1 , e 2 ...e 1988 ), there are 1988C2 combinations (e 1 /e 2 , e 1 /e 3 ...) of pair-wise gene expression ratios (Figure 1). This transformed matrix is referred to as M. Each row/column corresponds to a specific gene and the entry at the intersection of row X and column Y corresponds to the expression equilibrium between gene X and gene Y. Such matrix has a diagonal entry of value 1 because e1/e1 equals to unity.
Feature partitioning method  for classification of normal/tumor tissues using single gene expression
Regarding the Feature Partitioning Method (FPM), in order to discriminate between the normal/tumor tissues based on specific feature i (single gene expression), the first step is to determine the threshold value, T i , that can optimally splits all the tissue samples into tumor and normal tissue. The FPM algorithm has a recursive version , in which a decision tree depicting the classification rules for tissue samples was generated recursively. Both methods differ in the way T i s are derived. Nonetheless, they are very intuitive and non-parametric in nature. Also, they restrict no priori distribution patterns for features used. We adopted the simple FPM for tissue classification where each feature was treated individually. There are two criteria for deriving a valid threshold value T i for each feature. First, it has to delineate correctly (discriminating efficiency = 100%) the one-dimensional region (Rfeature_i) for either all the normal/tumor tissues using all tissue samples. Secondly, it has to minimize the percentage of false prediction for the other tissue type. Take gene #1659 for example. To fulfill the two aforementioned criteria, it was determined that the region greater than 63.7 (R#1659) incorporates all the tumor samples (Figure 2). It classified correctly all tumors (discriminating efficiency = 100%) with an overall false prediction of 13.9% in the normal set. This was performed repeatedly for all features until all the threshold values (T i...all features) were determined.
Now, to classify an unknown sample using 2-feature model classifier, a combination of any two features and their corresponding pre-determined threshold values T i s (selected from T i...allfeatures for each dataset) were recruited. The outcome of the tissue class will be determined depending on whether one/both the expression values of the unknown sample fall completely in either the normal/cancer region (Rfeature_i). This is to say that if any of the two features from the unknown sample meets the criteria (Rfeature_i) to be either normal/tumor tissue type (based on our definition, Rfeature_i is a region with 100% discriminating efficiency for a specific tissue type), the unknown sample will be assigned to be normal/tumor respectively. This is repeated exhaustively for all possible combinations constituting of any two features. The procedure will be repeated for all tissue samples to evaluate the overall classification accuracy for 2-feature model classifier. In total, we evaluated the classification of tissue samples based on different combinations of N genes and investigated the classifiers up to 10-feature model classifier.
Classification of normal/tumor tissues using transformed datasets
The classification procedures and the two criteria for determining the threshold value were the same as explained in previous paragraph. The only difference here is that the definition of "feature" refers to pair-wise gene expression ratio derived from lower/upper triangular matrix of M. Take the ratio #1537/#1831 for example. To fulfill the two aforementioned criteria, it was determined that the region greater than 0.755 (R#1537/#1831) incorporates all the tumor tissue samples (Figure 2). It classifies correctly all tumor tissue samples with a false prediction of 6.4%. This is performed repeatedly for all entries in M until all the threshold values are determined.
Now, to classify an unknown sample using 2-feature model classifier, a combination of any two features (pair-wise gene expression ratio) and their corresponding pre-determined threshold values T i s (selected from T i...allfeatures for each dataset) were recruited. The outcome of the tissue class will be determined depending on whether one/both the expression values of the unknown sample fall completely in either the normal or cancer region (Rfeature_i). This is to say that if any of the two features (pair-wise gene expression ratio) from the unknown sample meets the criteria (Rfeature_i) to be either normal/tumor (based on our definition, Rfeature_i is a region with 100% discriminating efficiency for a specific tissue type), the unknown sample will be assigned to be normal/tumor respectively. This is repeated exhaustively for all possible combinations constituting of two features. The procedure will be repeated for all tissue samples to evaluate the overall classification accuracy for 2-feature model classifier. In total, we evaluated the classification of tissue samples based on different combinations of N genes and investigated the classifiers up to 10-feature model classifier.
Constructing the relationship tree for the top 25 genes
We calculated the cross correlation coefficient r (Equation 1) for all pair combinations of the top 25 genes listed in Table 6 and Table 7. Prior to the construction of a relationship tree for the top 25 genes for colon and prostate cancer, the cross-correlation coefficient was used to construct the pair-wise distance matrix D. Each entry in the pair-wise distance matrix was measured by the value of (1-r). Each row/column corresponds to a specific gene and an entry at the intersection of row X and column Y corresponds to the distance of gene expression between gene #X and gene #Y. Such matrix has a diagonal entry of value 0. Only the lower/upper triangular matrix of D is required to construct the relationship tree. After obtaining lower/upper triangular matrix of D, the neighbor-joining method (NJ) algorithm was used to construct the relationship tree .
Computer hardware and software
A Sun Fire 6800 Server http://www.bioinfo.hku.hk with 24 CPUs (each running with a clock speed of 900 MHz) was employed throughout this study. The computation of correlation coefficient and classification procedures were implemented using the Matlab Technical Programming language (Matlab programs can be downloaded at http://web.hku.hk/~daniely/microarray.
After initial gene selection, respectively 82 and 262 genes (|r|>0.4) were selected from the colon and prostate dataset for downstream analysis (Table 1 and Table 2). Topping the list in both tables were genes that have been found to be either over-expressed/under-expressed in tumors . The first three genes most correlated to cancer in the colon dataset were heavy chain of non-muscle myosin, human monocyte-derived neutrophil-activating protein (MONAP) and human desmin genes. This agrees with the findings from [12, 13] that used other statistical tests (z-score, t-test) in a comparable analysis. The heavy chain of non-muscle myosin, denoted as the embryonic smooth muscle myosin heavy chain (SMemb), was found to be down-regulated in cancer. It was also determined experimentally to be a target for the protein encoded by the metastasis-related mts-1 gene . Furthermore, it was demonstrated recently by 5'RACE analysis that heavy chain of non-muscle myosin interacts with ALK genes that have tyrosine kinase activity and oncogenic properties . The human monocyte-derived neutrophil-activating protein (MONAP, interleukin-8), was second on the list. It was significantly up-regulated in the tumor compared to the normal samples. This protein has been linked to the progression of several human cancer types . It was believed that over-expression of MONAP plays an important role in tumor angiogenesis and tumor aggression. The human desmin gene is the third on the list, and it was found to be down-regulated in tumor. Interestingly, this gene also showed significantly reduced expression in other cancer types such as the melanoma cell line .
From the prostate dataset, the most cancer-correlated gene is the human hepatoma gene coding for serine protease hepsin. Brief literature search in PubMed showed that hepsin is a well-characterized transmembrane protease that is expressed at high level in tumor. Three separate studies identified hepsin as a significant cancer biomarker that can be used for cancer diagnosis . The second gene on the list was the human mitochondrial matrix protein P1. This gene has been correlated to different cancer types with consistent up-regulation in tumor . The third gene is the carcinoma-associated antigen GA733-2, which was among the 216 cancer markers identified by Ernst's group in Germany .
Effect of data transformation on coefficient of variation
To date, reliable markers with low coefficient of variation (CV) are generally lacking. Discovering robust cancer marker is crucial for the purpose of successful cancer diagnosis. We investigated the CV between samples after data transformation: the lowest CVs decreased to 16.5% in the colon dataset while it increased to 25.8% for the prostate dataset (Table 3 and Table 4). Topping the list for both dataset were the pair-wise gene expression ratio for genes #119/#54 (elongation factor 1-delta and 40S ribosomal protein S24) and #10614/#5871 (zq58b03.r1 Homo sapiens cDNA and nuclear matrix protein NXP2), which revealed informative pair-wise gene interaction in relation with their corresponding tissue phenotypes. They reflected how cell adjusts to their pair-wise product in response to physiological changes. Based on these observations, we found that the relative abundance between the numerator and denominator exhibited a strong mutual dependency, and had strong correlation to tissue phenotype. For pair-wise gene expression ratio #119/#54, the elongation factor 1-delta is involved in a sequence of events during the decoding of mRNA on the ribosome . For the ratio of #10614/#5871, it corresponds to novel genes that do not yet have known function. A search in the DNA non-redundant (nr) database for gene #10614 yielded 83% DNA identity to a segment on chromosome 9. On the other hand, a search in non-redundant (nr) database for #5871 revealed 72.3% DNA identity to the cDNA of mouse that incorporates proteins involved in chromosome partitioning and cell decision .
Prior to data transformation the lowest coefficients of variations for single gene expression were 45.3% and 24.5% for colon and prostate datasets respectively. When using the data transformation we proposed, significant improvement was achieved in the colon dataset. Interestingly, this was followed by an improved data correlation to the tissue phenotype as well as to the classification efficiency. We did not observe a similar improvement of the CV, data correlation to tissue classes or classification efficiency in the prostate dataset.
Correlations of the single gene expression and pair-wise gene expression ratio
The distribution of correlation coefficients between genes and tissue phenotypes for the colon and prostate datasets is shown in Figure 3. The distributions are positively and negatively skewed for both datasets. The two red lines separate genes with |r| >0.4 from the bulk (Table 1 and 2). They retained respectively 82 and 262 genes from the colon and prostate datasets. To study the possible interaction between pair-wise genes, we estimated the statistical correlation of gene expressions. Both the distributions for the correlation coefficient and the extreme cases are shown in Figures 4 and 5. Both figures emphasize the true nature of gene-gene co-regulations – a complex biological mechanism, that most often has been over-simplified when we treat the gene expression as an independent variables . For example, Figure 4 and Figure 5 suggested that the expressions of genes belonging to a common subset are most likely correlated to each other (e.g.: Gene #31 vs #119 in colon cancer (r = 0.95306) and gene #7775 vs #10749 in prostate cancer (r = 0.92922)). It should be pointed out that the two humps in the probability density function are not zero-centered, but concentrated at non-zero correlation r. For colon dataset, positive correlation was the dominant type. For prostate dataset, a balanced distribution in their gene correlation was observed. We determined that some improvement in tissue classification is achieved when pair-wise gene expression ratio was used as discriminating axes instead of using a single gene expression (Figure 2). The reason is that pair-wise gene expression ratio has higher correlation to tissue phenotype with lower CV (Table 5).
Gene expression and tissue type correlation
Several previous studies have already endeavored to identify correlations between specific gene expression and cancerous transformation [4, 13, 23]. In the present study, we identified several novel target genes that clearly distinguish the two different tissue phenotypes with high discriminating efficiency (>74%) (Table 6 and Table 8). Some of those have previously been documented in studies that did not involve expression profiling as cancer related genes (Human monocyte-derived neutrophil-activating protein (MONAP) and Human hepatoma mRNA for serine protease hepsin), others (Human gene for heterogeneous nuclear ribonucleoprotein (hnRNP), P24480 CALGIZZARIN, Human mitochondrial matrix protein P1, Human mRNA for aldose reductase and human adipsin) have not been identified from in-silico studies of tissue DNA-array expression data. The cancer related genes for colon and prostate cancer were ranked according to their discriminating predictive power. The list should provide hints for researchers during selection of molecular target for diagnostic, prognostic or attempts to cure the disease. Overall classification results and accuracies for each N-feature model classifier across two datasets were reported in Table 6, 7 and 8. In the following section, we will discuss a few important genes or pair-wise gene expression ratios from Table 6 and Table 7 that resulted in the optimum classification accuracy (Table 8B). They are the most efficient combination of discriminating axes for classifying tissue types because they delineate correctly all the normal/tumor tissues with the lowest percentage of false prediction.
For the sake of brevity, we will discuss three single gene expressions and two pair-wise gene expression ratios from colon cancer. For prostate cancer, two single gene expressions and two pair-wise gene expression ratios will be discussed.
For colon cancer single gene expression, three axes for discriminating tissue types are: 1) Human monocyte-derived neutrophil-activating protein (MONAP); 2) Human desmin gene and 3) Human cysteine-rich protein (CRP) gene. Their threshold values were determined to be 62.73, 2787.0 and 749.4 respectively.
For colon cancer pair-wise gene expression ratio, the two axes for discriminating tissue types are: 1) #1831/#1537 and 2) #753/#768. Their threshold values were reported to be 1.32 and 1.85 respectively.
For prostate cancer individual gene expression, the two axes for discriminating tissue types are: 1) Human hepatoma mRNA for serine protease hepsin and 2) Human adipsin. Their threshold values were reported to be 115.0 and 182.0 respectively.
For prostate cancer pair-wise gene expression ratio, the two axes for discriminating tissue types are: 1) #6185/#5840 and 2) #6185/#6749. Their threshold values were reported to be 2.69 and 2.55 respectively.
To illustrate graphically the result of tissue classification, two examples, each based on three genes or pair-wise gene expression ratios that altogether yielded the optimum classification efficiency for the prostate cancer are shown (Figure 6, Figure 7).
Constructing the relationship tree for top 25 gene for colon and prostate cancer
The relationship tree for top 25 genes listed in Table 6 and Table 7 were constructed based on the cross-correlation between gene expressions (Figure 8). We employed the established 'neighbor-joining' clustering method  to group different genes based on their correlated expression patterns across all tissue samples (meaning that genes expression that are correlated will appear in the same branch of the clustering tree), using a novel distance measurement to quantify how change in the expression for one gene interfered with that of another gene. The principle of this method is to cluster pairs of operational taxonomic units (OTUs [=neighbors of similar gene expression]) that minimize the total branch length at each stage of clustering of OTUs starting with a star-like tree. Figure 8 revealed two major clusters of genes. The first cluster corresponded to down-regulated genes, the second cluster represented up-regulated genes. Also, the most efficient discriminating axes (feature genes) reside at the basal position for each cluster. In bacteria many genes are co-expressed as single transcription units. This was used as a control study to validate the methodology of grouping genes, we implemented this distance measurement on bacteria gene arrays (B. subtilis and E. coli) and successfully determined the co-regulated operon gene structures (supplementary file #1).
Data transformation to investigate pair-wise gene expression ratios
As the expression profiling technologies mature, the identification of significant cancer-related signals from noisy datasets (characterized by a high CV) remains a major challenge. In particular, a robust normalization method is critical to ascertain that arrays from two experiments are comparable with minimum noise prior downstream analysis. However, the existing normalization methods pose limitations due to the lack of good models to account for sources of experimental and biological variations . Hoffmann et al.  employed different normalization methods to analyse the same dataset, and demonstrated that the numbers of genes detected as differentially expressed differed by a huge factor depending on which normalization methods used. The problem is exacerbated further by the presence of different array formats, experimental designs and methods.
Here, instead of resolving to single gene expression, that depends heavily on normalization, for tissue classification, we presented a transformation method that uses pair-wise gene expression ratios within the same experiment as the discriminating axes. By doing so, we aimed to minimize the influence of different normalization methods considering that an experiment is self-consistent with the same factors affecting all genes in the same fashion. The rationale is that even when the normalization methods differ between two array experiments, their pair-wise gene expression ratios within the same experiment will remain relatively stable. If reliable cancer-related signal, exist in the form of pair-wise gene expression ratio, were indeed discovered successfully, they will be relatively independent from the normalization method used on a dataset.
The improvement in CV (Table 3) and overall classification accuracy (Table 7) for colon dataset after introduction of data transformation signifies two implications: First, the transformation is able to increase the signal to noise ratio (SNR) of the cancer related signal because the resulted pair-wise gene expression ratios correlate stronger to tissue phenotype. Second, because the pair-wise gene expression ratios are less dispersed than single gene expression, using the pair-wise gene expression ratios to classify tissue types will be much more reliable and accurate (Table 8). Despite the benefits mentioned, this data transformation introduced a computational limitation due to the enormous amount of feature combinations to be processed, especially when N-feature model classifiers for N>4 are considered (If 100 features are selected, and 10-feature model classifier is investigated, the search space will be 100C10= 1.731030945644000 × 1013 different combination of features). As a result, more computation time will be required to search all possibilities. As an example, the discriminating axes that accounted for the optimum accuracy in 1 to 3-feature model classifier are reported in Table 9.
Regarding the high classification accuracy reported in Table 8, it should be stressed that this was achieved by involving all tissue samples during the derivation of the threshold value, T i , in the feature selection procedure. In other word, instead of adopting the more conservative classification accuracy test where only a subset of tissue samples are used to derive a set of classification criteria (threshold values), we adjusted our methodology to use all tissue samples so that our results are unbiased (when comparing the outcome from single gene and pair-wise gene ratio) and in-line with our objective that is to compare the classification efficiency between single gene and pair-wise gene ratio. Admittedly, we have a noisy dataset whereby selecting a subset of tissue samples that are a representable population for the entire dataset remains a challenge  (given that we have a small and unbalanced dataset, particularly the colon dataset). Eventually, we might run into ambiguous/contradicting results using a different population subset of tissue samples. Furthermore, we might miss important features (single gene expression/ pair-wise gene expression ratio) because of the biased training dataset. By including all tissue samples for both studies (single gene and pair-wise gene ratio), we aimed to derive the most reliable threshold values and classified tissue samples based on them. Since the same methodology was applied for both studies, the comparison of classification efficiency is valid and will reflect how well each feature (single gene and pair-wise gene ratio) can be used to delineate tissue samples.
The implication derived from the classification results
For colon dataset, three axes for discriminating tissues are: 1) Human monocyte-derived neutrophil-activating protein (MONAP); 2) Human desmin gene and 3) Human cysteine-rich protein (CRP) gene. The association of the first two genes and cancer biology had been discussed earlier. We will discuss the Human cysteine-rich protein gene. The expression and induction of this protein has been associated with protection against DNA damage, oxidative stress and apoptosis . In the colon dataset, we observed down-regulation of this protein in tumor. This suggested lack of protection against DNA damage.
For colon cancer pair-wise gene expression ratio, the two axes for discriminating tissues are: 1) #1831/#1537 and 2) #753/#768. Using these two axes, 98.4% of the tissue samples can be classified correctly. The expression ratio between #1831 (gelsolin precursor) and #1537 (vascular endothelial growth factor) was able to discriminate 93.6% of the total tissue data. The vascular endothelial growth factor was determined recently to be a plausible biomarker for colon cancer . Gelsolin had been found to suppress tumorigenicity in different cancer samples, including lung, bladder and breast . When they were used individually as a discriminating axis, they were only able to classify correctly 66.1% and 67.7% of all tissue samples. Furthermore, the expression ratio between #753 (Human cysteine-rich protein) and #768 (the macrophage migration inhibitory factor) was able to discriminate 90.3% of total tissue type. The human cysteine-rich protein was discussed in the previous section. The macrophage migration inhibitory factor (MIF) functions as a pluripotent cytokine involved in broad-spectrum pathophysiological events in association with inflammation and immune responses. Several reports, including ours, have suggested that MIF is also involved in tumorigenesis . When they were used individually as single discriminating axis, they were only able to classify correctly 83.9% and 66.1% of all tissues.
For prostate cancer single gene expression, the two axes for discriminating tissues are: 1) Human hepatoma mRNA for serine protease hepsin, and 2) Human adipsin. The first gene was discussed in the previous paragraph. For the second gene, adipsin had also been suggested by Chow et al.  as a good cancer marker for studying the basic biology of cancer.
For prostate cancer pair-wise gene expression ratio, the two axes for discriminating tissues are: 1) #6185/#5840 and 2) #6185/#6749. Using these two axes, all tissue samples can be classified correctly. The expression ratio between #6185 (Human hepatoma mRNA for serine protease hepsin) and #5840 (Homo sapiens mRNA for KIAA1109 protein) was able to discriminate 92.2% of total tissues. The human hepatoma mRNA for serine protease hepsin had been determined to be an important marker for cancer cell development [11, 18]. The KIAA1109 protein is an unknown protein in human chromosome four . A homology search against the non-redundant databases yielded no significant hit to known genes. When they were used individually as a discriminating axis, they were only able to classify correctly 86.3% and 61.8% of all tissues. On the other hand, the expression ratio between #6185 (Human hepatoma mRNA for serine protease hepsin) and #6749 (Homo sapiens mRNA for KIAA1055 protein) was able to discriminate 90.10% of total tissues. The human hepatoma mRNA for serine protease hepsin was discussed in the previous section. The KIAA1055 protein is an unknown protein in human chromosome 15 [21, 31]. A homology search against the non-redundant databases yielded 40.7% DNA identity to a novel human cDNA that had been found to function as a cancer inhibiting protein . When they were used individually as a discriminating axis, they were only able to classify correctly 86.3% and 62.8% of all tissues.
By comparing the tissue classification methods based on the single gene expression and the pair-wise gene expression ratio in two microarray datasets, we reached the following conclusions:
1. The minimum coefficient of variation decreased from 45.33% to 16.53% for colon dataset but increased marginally from 24.54% to 25.78% in prostate dataset.
2. The correlation coefficient, r, of the discriminating axis that correlates maximally to the tissue phenotype improves from 0.63 to 0.79 and 0.71 to 0.75 in colon and prostate dataset respectively.
3. The optimum accuracy for 1-feature model classifier (using single gene or pair-wise gene expression ratio as discriminating axis) improved from 87.1% to 93.55% in colon dataset. In prostate dataset, nine out of the top 10 discriminating axes showed significant improvement. The mean accuracy for 1-gene classifier improved from 76.8% to 91.2% and 75.8% to 81.9% in both datasets.
4. The comparable classification accuracy achieved after data transformation suggested that there exist some cancer-related signals in the form of pair-wise gene expression ratio, especially prominent in the colon dataset.
5. Through the single gene analysis, we identified key biomarkers that agree with the findings by other researchers. In addition, study on gene-to-gene correlation and the classification outcome based on the pair-wise gene expression ratio suggested that genetic network within a cluster of cancer-related genes should be explored further.
Hanahan D, Weinberg RA: The hallmarks of cancer. Cell. 2000, 100: 57-70. 10.1016/S0092-8674(00)81683-9.
Simon R, Radmacher MD, Dobbin K, McShane LM: Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst. 2003, 95: 14-18. 10.1093/jnci/95.1.14.
Krajewski P, Bocianowski J: Statistical methods for microarray assays. J Appl Genet. 2002, 43: 269-278.
Zhang H, Yu CY, Singer B, Xiong M: Recursive partitioning for tumor classification with gene expression microarray data. Proc Natl Acad Sci U S A. 2001, 98: 6730-6735. 10.1073/pnas.111153698.
Ambroise C, McLachlan GJ: Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci U S A. 2002, 99: 6562-6566. 10.1073/pnas.102102699.
Gardner TS, di Bernardo D, Lorenz D, Collins JJ: Inferring genetic networks and identifying compound mode of action via expression profiling. Science. 2003, 301: 102-105. 10.1126/science.1081900.
Bo T, Jonassen I: New feature subset selection procedures for classification of expression profiles. Genome Biol. 2002, 3: RESEARCH0017-
Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci U S A. 1999, 96: 6745-6750. 10.1073/pnas.96.12.6745.
Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D'Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR: Gene expression correlates of clinical prostate cancer behavior. Cancer Cell. 2002, 1: 203-209. 10.1016/S1535-6108(02)00030-2.
Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4: 406-425.
Dhanasekaran SM, Barrette TR, Ghosh D, Shah R, Varambally S, Kurachi K, Pienta KJ, Rubin MA, Chinnaiyan AM: Delineation of prognostic biomarkers in prostate cancer. Nature. 2001, 412: 822-826. 10.1038/35090585.
Li L, Darden TA, Weinberg CR, Levine AJ, Pedersen LG: Gene assessment and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method. Comb Chem High Throughput Screen. 2001, 4: 727-739.
Kishino H, Waddell PJ: Correspondence analysis of genes and tissue types and finding genetic links from microarray data. Genome Inform Ser Workshop Genome Inform. 2000, 11: 83-95.
Kriajevska MV, Cardenas MN, Grigorian MS, Ambartsumian NS, Georgiev GP, Lukanidin EM: Non-muscle myosin heavy chain as a possible target for protein encoded by metastasis-related mts-1 gene. J Biol Chem. 1994, 269: 19679-19682.
Lamant L, Gascoyne RD, Duplantier MM, Armstrong F, Raghab A, Chhanabhai M, Rajcan-Separovic E, Raghab J, Delsol G, Espinos E: Non-muscle myosin heavy chain (MYH9): a new partner fused to ALK in anaplastic large cell lymphoma. Genes Chromosomes Cancer. 2003, 37: 427-432. 10.1002/gcc.10232.
Xu L, Xie K, Mukaida N, Matsushima K, Fidler IJ: Hypoxia-induced elevation in interleukin-8 expression by human ovarian carcinoma cells. Cancer Res. 1999, 59: 5822-5829.
Gutgemann A, Golob M, Muller S, Buettner R, Bosserhoff AK: Isolation of invasion-associated cDNAs in melanoma. Arch Dermatol Res. 2001, 293: 283-290. 10.1007/s004030100232.
.;, .;, .;, .;, .;, .;, .;, .;, .;, ;, , Stephan C, Yousef GM, Scorilas A, Jung K, Jung M, Kristiansen G, Hauptmann S, Kishi T, Nakamura T, Loening SA, Diamandis EP: Hepsin is highly over expressed in and a new candidate for a prognostic indicator in prostate cancer. J Urol. 2004, 171: 187-191. 10.1097/01.ju.0000101622.74236.94.
Ernst T, Hergenhahn M, Kenzelmann M, Cohen CD, Ikinger U, Kretzler M, Hollstein M, Grone HJ: [Gene expression profiling in prostatic cancer]. Verh Dtsch Ges Pathol. 2002, 86: 165-175.
Valle M, Zavialov A, Li W, Stagg SM, Sengupta J, Nielsen RC, Nissen P, Harvey SC, Ehrenberg M, Frank J: Corrigendum: Incorporation of aminoacyl-tRNA into the ribosome as seen by cryo-electron microscopy. Nat Struct Biol. 2003, 10: 1074-10.1038/nsb1003.
Kikuno R, Nagase T, Waki M, Ohara O: HUGE: a database for human large proteins identified in the Kazusa cDNA sequencing project. Nucleic Acids Res. 2002, 30: 166-168. 10.1093/nar/30.1.166.
Nakayama M, Kikuno R, Ohara O: Protein-protein interactions between large proteins: two-hybrid screening using a functionally classified library composed of long cDNAs. Genome Res. 2002, 12: 1773-1784. 10.1101/gr.406902.
Bektic J, Wrulich OA, Dobler G, Kofler K, Ueberall F, Culig Z, Bartsch G, Klocker H: Identification of genes involved in estrogenic action in the human prostate using microarray analysis. Genomics. 2004, 83: 34-44. 10.1016/S0888-7543(03)00184-8.
Quackenbush J: Microarray data normalization and transformation. Nat Genet. 2002, 32 Suppl: 496-501. 10.1038/ng1032.
Hoffmann R, Seidl T, Dugas M: Profound effect of normalization on detection of differentially expressed genes in oligonucleotide microarray data analysis. Genome Biol. 2002, 3: RESEARCH0033-
Cherian MG, Jayasurya A, Bay BH: Metallothioneins in human tumors and potential roles in carcinogenesis. Mutat Res. 2003, 533: 201-209. 10.1016/j.mrfmmm.2003.07.013.
Saad RS, Liu YL, Nathan G, Celebrezze J, Medich D, Silverman JF: Endoglin (CD105) and vascular endothelial growth factor as prognostic markers in colorectal cancer. Mod Pathol. 2004, 17: 197-203. 10.1038/modpathol.3800034.
Haga K: [The mechanism for reduced expression of gelsolin, tumor suppressor protein, in bladder cancer]. Hokkaido Igaku Zasshi. 2003, 78: 29-37.
Campa MJ, Wang MZ, Howard B, Fitzgerald MC, Patz E. F., Jr.: Protein expression profiling identifies macrophage migration inhibitory factor and cyclophilin a as potential molecular targets in non-small cell lung cancer. Cancer Res. 2003, 63: 1652-1656.
Chow ML, Moler EJ, Mian IS: Identifying marker genes in transcription profiling data using a mixture of feature relevance experts. Physiol Genomics. 2001, 5: 99-111.
Nagase T, Ishikawa K, Miyajima N, Tanaka A, Kotani H, Nomura N, Ohara O: Prediction of the coding sequences of unidentified human genes. IX. The complete sequences of 100 new cDNA clones from brain which can code for large proteins in vitro. DNA Res. 1998, 5: 31-39.
Ghigna C, Moroni M, Porta C, Riva S, Biamonti G: Altered expression of heterogenous nuclear ribonucleoproteins and SR factors in human colon adenocarcinomas. Cancer Res. 1998, 58: 5818-5824.
Chaurand P, DaGue BB, Pearsall RS, Threadgill DW, Caprioli RM: Profiling proteins from azoxymethane-induced colon tumors at the molecular level by matrix-assisted laser desorption/ionization mass spectrometry. Proteomics. 2001, 1: 1320-1326. 10.1002/1615-9861(200110)1:10<1320::AID-PROT1320>3.3.CO;2-7.
Nicholson KM, Anderson NG: The protein kinase B/Akt signalling pathway in human malignancy. Cell Signal. 2002, 14: 381-395. 10.1016/S0898-6568(01)00271-6.
Cheong HK, Park JY, Kim EH, Lee C, Kim S, Kim Y, Choi BS, Cheong C: Structure of the N-terminal extension of human aspartyl-tRNA synthetase: implications for its biological function. Int J Biochem Cell Biol. 2003, 35: 1548-1557. 10.1016/S1357-2725(03)00070-0.
DeFatta RJ, Chervenak RP, De Benedetti A: A cancer gene therapy approach through translational control of a suicide gene. Cancer Gene Ther. 2002, 9: 505-512. 10.1038/sj.cgt.7700469.
Derenzini M, Trere D, Pession A, Montanaro L, Sirri V, Ochs RL: Nucleolar function and size in cancer cells. Am J Pathol. 1998, 152: 1291-1297.
Eray M, Tuomikoski T, Wu H, Nordstrom T, Andersson LC, Knuutila S, Kaartinen M: Cross-linking of surface IgG induces apoptosis in a bcl-2 expressing human follicular lymphoma line of mature B cell phenotype. Int Immunol. 1994, 6: 1817-1827.
Sreedharan SP, Huang JX, Cheung MC, Goetzl EJ: Structure, expression, and chromosomal localization of the type I human vasoactive intestinal peptide receptor gene. Proc Natl Acad Sci U S A. 1995, 92: 2939-2943.
.;, .;, .;, .;, .;, .;, .;Currie, .;, ., Shailubhai K, Yu HH, Karunanandaa K, Wang JY, Eber S. L, Wang Y, Joo NS, Kim HD, Miedema BW, Abbas SZ, Boddupalli SS, Currie MG, Forte LR: Uroguanylin treatment suppresses polyp formation in the Apc(Min/+) mouse and induces apoptosis in human colon adenocarcinoma cells via cyclic GMP. Cancer Res. 2000, 60: 5151-5157.
Gali H, Sieckman GL, Hoffman TJ, Kiefer GE, Chin DT, Forte LR, Volkert WA: Synthesis and in vitro evaluation of an 111In-labeled ST-peptide enterotoxin (ST) analogue for specific targeting of guanylin receptors on human colonic cancers. Anticancer Res. 2001, 21: 2785-2792.
Adhami VM, Ahmad N, Mukhtar H: Molecular targets for green tea in prostate cancer prevention. J Nutr. 2003, 133: 2417S-2424S.
Melendez JA, Davies KJ: Manganese superoxide dismutase modulates interleukin-1alpha levels in HT-1080 fibrosarcoma cells. J Biol Chem. 1996, 271: 18898-18903. 10.1074/jbc.271.31.18898.
Costantino L, Ferrari AM, Gamberini MC, Rastelli G: Nitrophenyl derivatives as aldose reductase inhibitors. Bioorg Med Chem. 2002, 10: 3923-3931. 10.1016/S0968-0896(02)00318-8.
Nithipatikom K, Isbell MA, Lindholm PF, Kajdacsy-Balla A, Kaul S, Campell WB: Requirement of cyclooxygenase-2 expression and prostaglandins for human prostate cancer cell invasion. Clin Exp Metastasis. 2002, 19: 593-601. 10.1023/A:1020915914376.
Zhang JP, Ying K, Xiao ZY, Zhou B, Huang QS, Wu HM, Yin M, Xie Y, Mao YM, Rui YC: Analysis of gene expression profiles in human HL-60 cell exposed to cantharidin using cDNA microarray. Int J Cancer. 2004, 108: 212-218. 10.1002/ijc.11405.
Ricci G, Caccuri AM, Lo Bello M, Parker MW, Nuccetelli M, Turella P, Stella L, Di Iorio EE, Federici G: Glutathione transferase P1-1: self-preservation of an anti-cancer enzyme. Biochem J. 2003, 376: 71-76. 10.1042/BJ20030860.
Eder IE, Haag P, Basik M, Mousses S, Bektic J, Bartsch G, Klocker H: Gene expression changes following androgen receptor elimination in LNCaP prostate cancer cells. Mol Carcinog. 2003, 37: 181-191. 10.1002/mc.10136.
Stahl JA, Leone A, Rosengard AM, Porter L, King CR, Steeg PS: Identification of a second human nm23 gene, nm23-H2. Cancer Res. 1991, 51: 445-449.
Carollo M, Parente L, D'Alessandro N: Dexamethasone-induced cytotoxic activity and drug resistance effects in androgen-independent prostate tumor PC-3 cells are mediated by lipocortin 1. Oncol Res. 1998, 10: 245-254.
Matsui H, Kubochi K, Okazaki I, Yoshino K, Ishibiki K, Kitajima M: Collagen biosynthesis in gastric cancer: immunohistochemical analysis of prolyl 4-hydroxylase. J Surg Oncol. 1999, 70: 239-246. 10.1002/(SICI)1096-9098(199904)70:4<239::AID-JSO8>3.3.CO;2-D.
Chesi M, Bergsagel PL, Shonukan OO, Martelli ML, Brents LA, Chen T, Schrock E, Ried T, Kuehl WM: Frequent dysregulation of the c-maf proto-oncogene at 16q23 by translocation to an Ig locus in multiple myeloma. Blood. 1998, 91: 4457-4463.
Postel EH, Berberich SJ, Flint SJ, Ferrone CA: Human c-myc transcription factor PuF identified as nm23-H2 nucleoside diphosphate kinase, a candidate suppressor of tumor metastasis. Science. 1993, 261: 478-480.
Huang KS, Wallner BP, Mattaliano RJ, Tizard R, Burne C, Frey A, Hession C, McGray P, Sinclair LK, Chow EP: Two human 35 kd inhibitors of phospholipase A2 are related to substrates of pp60v-src and of the epidermal growth factor receptor/kinase. Cell. 1986, 46: 191-199. 10.1016/0092-8674(86)90736-1.
The pre-publication history for this paper can be accessed here:http://www.biomedcentral.com/1471-2407/4/72/prepub
Indispensable support was provided by the doctoral fellowship from The University of Hong Kong (HKU) and well as the Hong Kong Innovation and Technology Fund (ITF), BIOSUPPORT Programme. Finally, we wish to thank Dr Ralf Altmeyer for his critical interest for this work as he came at the head of the HKU-Pasteur Research Centre.
The authors declare that they have no competing interests.
YLY proposed the idea, participated in the design, performed the statistical analysis and wrote the first draft of the manuscript.
AD participated in the design and overall coordination of this study as well as in the writing of the manuscript.
XWZ participated in the design of the study.
YCW, XHW and MTL participated during the revision phase of this study.
All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.