Classification between normal and tumor tissues based on the pair-wise gene expression ratio

Background: Precise classification of cancer types is critically important for early cancer diagnosis and treatment. Numerous efforts have been made to use gene expression profiles to improve the precision of tumor classification. However, reliable cancer-related signals are generally lacking.
Method: Using recent datasets on colon and prostate cancer, we propose a data transformation from single gene expression to pair-wise gene expression ratios. By exploiting the internal consistency of each expression profiling dataset, this transformation improves the signal-to-noise ratio of the dataset and uncovers new cancer-related signals (features). The efficiency of the transformed dataset for normal/tumor classification was investigated using feature partitioning with informative features (gene annotation) as discriminating axes (single gene expression or pair-wise gene expression ratio). Classification results were compared with those from the original datasets for classifiers of up to 10 features.
Results: From the colon and prostate datasets, 82 and 262 genes, respectively, were selected for their high correlation with tissue phenotype. Transformation of the highly noisy expression data lowered the coefficient of variation (CV) of the within-class samples and improved the correlation with tissue phenotype. In the colon cancer set, the minimum CV decreased from 45.3% to 16.5%. In prostate cancer, comparable CV was achieved with and without transformation. The improvement in CV, coupled with the improved correlation between the pair-wise gene expression ratio and tissue phenotype, yielded higher classification efficiency, especially in the colon dataset, where it rose from 87.1% to 93.5%. Over 90% of the top ten discriminating axes in both datasets showed significant improvement after data transformation. The high classification efficiency achieved suggests that some cancer-related signals exist in the form of pair-wise gene expression ratios.
Conclusion: The results of this study indicate that: 1) when the pair-wise expression ratio transformation achieves a lower CV and a higher correlation with tissue phenotype, a better classification of tissue type follows; 2) the comparable classification accuracy achieved after data transformation suggests that the pair-wise gene expression ratio between some pairs of genes can serve as a reliable marker for cancer.


Background
Tumor development is a process in which gene expression is modified, causing abnormal cell behaviour [1]. Many techniques have been developed to identify abnormalities of gene expression, as reflected by differences in the abundance of mRNA transcripts between normal and tumor tissue. The completion of the Human Genome Project and advances in DNA-array technology have allowed highly parallel genetic analyses to take place on a genome-wide scale. They have revolutionized the way tumors are studied, and promise to provide a better and more thorough understanding of the mechanisms underlying tumorigenesis. Eventually, they should lead to more comprehensive diagnosis and prognosis of tumors and to more effective therapeutic interventions.
Despite its advantages, DNA-array technology poses three major challenges that render the interpretation of expression data less efficient than expected. First, gene expression data are inherently variable, owing either to biological factors that remain difficult to control (cross-contamination of tumor and normal cells in a sample) or to difficulties in setting up the experiment (RNA extraction) [2]. This variability interferes with the subsequent array analysis aimed at identifying reliable markers that best correlate with tissue phenotype. Efforts to address it have included various forms of raw data scaling, data filtering and normalization, as well as improvements to the classifier algorithm [3]. Promising results have been reported, with claims of near-perfect classification accuracy [4]. However, the usually small number of samples per class in most studies and the highly biased cross-validation procedures cast doubt on the statistical significance of the reported accuracy [5]. Second, this statistical constraint is compounded by the fact that the number of features on an array is in the thousands whereas tissue samples are available only in limited numbers, so there is a high probability that a classification is correct by chance alone. Third, although it has recently been established that genes segregate into clusters of interacting networks [6] rather than acting as single entities, most cancer DNA-array studies have investigated only single-gene aberrations (up- or down-regulation) when comparing tumor expression profiles with their corresponding normal tissue controls. In an interesting study, Bø and Jonassen tried to circumvent some of these difficulties by investigating genes in pairs. They demonstrated that gene pairs can be used to improve discrimination between different tissue classes [7]. This idea of studying genes in pairs, or even in higher-order clusters, should be explored further to reveal new features of complex expression profiling datasets.
In this study, we introduced a novel data transformation to investigate relationships between pair-wise gene expression ratios and tissue phenotype within a given experiment. With this procedure, we aimed to discover strong cancer-related signals (features) that exist in the form of pair-wise ratios (or higher-order relationships when we extend to an N-feature model classifier with N > 2) in a given sample, while improving the signal-to-noise ratio of the dataset by minimizing its coefficient of variation (CV). The underlying concept for adopting pair-wise gene expression ratios as the discriminating axes for tissue type classification is that an experiment is self-consistent (in terms of factors arising from the biology of the phenomenon of interest, from the experimental setting, or from both). With this approach we could "subtract" correlated variations by considering the sample as a whole, without making inferences such as those needed for normalization. In essence, we avoided studying gene expression in absolute terms because this requires a robust normalization method to account for arrays from different experiments, different platforms and different profiling technologies. By analyzing features in the form of ratios, we attempted to minimize the effect of normalization and to look for co-varying signals within each experiment.

Colon and prostate cancer datasets
The colon cancer dataset comprises 62 samples with measurements for 1,988 gene probes; 40 samples were labelled as tumor and 22 as normal. The samples were collected from patients, and their RNAs were extracted and hybridised to Affymetrix Hum6000 arrays (see [8] for details). The normalized dataset can be downloaded at http://microarray.princeton.edu/oncology/affydata/index.html.
The prostate cancer dataset comprises 102 samples with measurements for 12,600 gene probes; 52 samples were labelled as tumor and 50 as normal. The samples were collected from patients, and their RNAs were extracted and hybridised to Affymetrix U95Av2 arrays (see [9] for details). The normalised dataset can be downloaded at http://www-genome.wi.mit.edu/MPR/Prostate.
Both datasets were pre-processed to eliminate two kinds of probe pairs: those that showed significant fluctuation in their hybridisation signals (more than 3 standard deviations away from the mean for their ESTs), and those that showed an overall higher intensity in their mismatch (MM) probe cells than in their corresponding perfect match (PM) probe cells, which indicates non-specific hybridisation by background RNAs. Both datasets used average intensity as the quantitative measure of the level of gene expression. A base-10 logarithmic transformation was applied to each dataset.

Initial gene selection
For downstream classification analysis, we extracted only the genes whose expression pattern correlated strongly with the tissue phenotype. To achieve this, we first calculated the correlation coefficient $r_i$ (Equation 1) for each gene $i$ using the full dataset, and ranked the genes according to $r_i$. For the calculation of $r$, we assigned a number to each tissue phenotype: 1 for normal tissue and 10 for cancer tissue. After obtaining the correlation coefficients for all genes, we used a simple threshold (|r| > 0.4) to select the set of cancer-related genes. There were two reasons for setting the threshold at 0.4. First, when lower thresholds were used, we incorporated many genes that were not known to be cancer-related (data not shown). Second, too many genes would later make the computation intractable when we calculate the pair-wise gene expression ratios for each tissue sample and, subsequently, the N-feature model classifiers. At |r| > 0.4, we were able to account for most of the previously known cancer-related genes.
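$$r = \frac{1}{n-1}\sum_{k=1}^{n}\frac{(V_{1,k}-\bar{V}_{1})\,(V_{sample,k}-\bar{V}_{sample})}{S_{V_1}\,S_{sample}} \qquad (1)$$

where $V_1$ is a vector representing the expression pattern of gene #1 across the $n$ tissue samples; $V_{sample}$ is the dichotomous representation of the tissues; $S_{V_1}$ and $S_{sample}$ are the standard deviations of $V_1$ and $V_{sample}$; and $\bar{V}_{1}$ and $\bar{V}_{sample}$ are the means of $V_1$ and $V_{sample}$.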
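The selection step can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names `pearson_r` and `select_genes` and the toy expression values are hypothetical, and only the phenotype coding (1 = normal, 10 = tumor) and the |r| > 0.4 cut-off come from the text.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def select_genes(expression, labels, threshold=0.4):
    """Return (gene_index, r) for genes with |r| > threshold, strongest first.

    expression: one expression vector per gene (one value per sample).
    labels: per-sample phenotype codes (1 = normal, 10 = tumor, as in the text).
    """
    scored = [(i, pearson_r(values, labels)) for i, values in enumerate(expression)]
    kept = [(i, r) for i, r in scored if abs(r) > threshold]
    return sorted(kept, key=lambda pair: abs(pair[1]), reverse=True)

# Hypothetical toy data: 3 genes x 6 samples (3 normal, then 3 tumor).
labels = [1, 1, 1, 10, 10, 10]
expression = [
    [2.0, 2.1, 1.9, 3.0, 3.2, 3.1],  # higher in tumor
    [2.5, 2.4, 2.6, 2.5, 2.6, 2.4],  # unrelated to phenotype
    [3.1, 3.0, 3.2, 2.0, 2.1, 1.9],  # lower in tumor
]
selected = select_genes(expression, labels)
for gene, r in selected:
    print(f"gene {gene}: r = {r:+.3f}")
```

Because the Pearson coefficient is unchanged by a positive linear rescaling of either variable, coding the phenotypes as 1/10 gives the same r as coding them as 0/1.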

Transforming the gene expression data to investigate the expression equilibrium between genes pairs
The raw expression data within a tissue sample were transformed into pair-wise gene expression ratios for all possible pairs of genes. For the 1,988 gene expression intensities of each sample ($e_1, e_2, \ldots, e_{1988}$), there are $\binom{1988}{2} = 1{,}975{,}078$ combinations ($e_1/e_2$, $e_1/e_3$, ...) of pair-wise gene expression ratios (Figure 1). This transformed matrix is referred to as M. Each row/column corresponds to a specific gene, and the entry at the intersection of row X and column Y corresponds to the expression equilibrium between gene X and gene Y. The diagonal entries of M equal 1 because $e_i/e_i$ is unity.
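As a small illustration of the transformation for a single sample, the sketch below builds the full matrix M and, separately, only its upper triangle. The function names and the four intensity values are hypothetical, and the sketch assumes strictly positive intensities (a zero would raise a division error).

```python
from itertools import combinations

def ratio_matrix(sample):
    """Full pair-wise ratio matrix M for one sample: M[x][y] = e_x / e_y."""
    return [[ex / ey for ey in sample] for ex in sample]

def ratio_features(sample):
    """Upper-triangular entries of M, keyed by gene index pair (x, y), x < y."""
    return {(x, y): sample[x] / sample[y]
            for x, y in combinations(range(len(sample)), 2)}

sample = [4.0, 2.0, 8.0, 1.0]  # hypothetical intensities for 4 genes
M = ratio_matrix(sample)
features = ratio_features(sample)

print(M[0][1])        # ratio of gene 0 to gene 1
print(len(features))  # number of distinct gene pairs
```

Since M[y][x] is the reciprocal of M[x][y], one triangle carries all the information, which is why only the lower or upper triangular part is needed as features.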

Feature partitioning method [4] for classification of normal/tumor tissues using single gene expression
In the Feature Partitioning Method (FPM), discriminating between normal and tumor tissues on the basis of a specific feature $i$ (single gene expression) begins with determining the threshold value $T_i$ that optimally splits the tissue samples into tumor and normal. The FPM algorithm also has a recursive version [4], in which a decision tree depicting the classification rules for tissue samples is generated recursively. The two versions differ in the way the $T_i$ are derived. Both are intuitive and non-parametric, and neither assumes an a priori distribution for the features used. We adopted the simple FPM, in which each feature is treated individually. A valid threshold value $T_i$ for a feature must satisfy two criteria.
First, it has to delineate correctly (discriminating efficiency = 100%) the one-dimensional region ($R_{feature_i}$) that contains either all the normal or all the tumor tissues, using all tissue samples. Second, it has to minimize the percentage of false predictions for the other tissue type. Take gene #1659 as an example. To fulfill the two criteria, it was determined that the region greater than 63.7 ($R_{\#1659}$) incorporates all the tumor samples (Figure 2). It classified all tumors correctly (discriminating efficiency = 100%) with an overall false prediction of 13

Now, to classify an unknown sample using a 2-feature model classifier, a combination of any two features and their corresponding pre-determined threshold values $T_i$ (selected from the $T_i$ of all features for each dataset) is used. The tissue class is assigned according to whether one or both expression values of the unknown sample fall within the normal or the cancer region ($R_{feature_i}$). That is, if either of the two features of the unknown sample meets the criterion ($R_{feature_i}$) for normal or tumor tissue (by our definition, $R_{feature_i}$ is a region with 100% discriminating efficiency for a specific tissue type), the unknown sample is assigned to normal or tumor accordingly. This is repeated exhaustively for all possible combinations of two features, and the procedure is repeated for all tissue samples to evaluate the overall classification accuracy of the 2-feature model classifier. In total, we evaluated the classification of tissue samples based on different combinations of N genes and investigated classifiers of up to 10 features.
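One way to read the two threshold criteria is sketched below for a single feature. This is an interpretation, not the published algorithm: it assumes the region is one-sided, takes the threshold to be the extreme value of the target class, and counts false predictions as a fraction of all samples. The function name, the label strings and the data are hypothetical.

```python
def best_threshold(values, labels, target="tumor"):
    """Find the one-sided region that contains every `target` sample while
    admitting as few samples of the other class as possible.

    Returns (direction, threshold, false_fraction): direction is ">=" or
    "<=", and false_fraction is counted over all samples.
    """
    target_vals = [v for v, lab in zip(values, labels) if lab == target]
    other_vals = [v for v, lab in zip(values, labels) if lab != target]
    n = len(values)

    # Candidate 1: the region ">= min(target)" holds all target samples.
    low = min(target_vals)
    false_high = sum(v >= low for v in other_vals)

    # Candidate 2: the mirror-image region "<= max(target)".
    high = max(target_vals)
    false_low = sum(v <= high for v in other_vals)

    # Criterion 2: keep the candidate with fewer false predictions.
    if false_high <= false_low:
        return ">=", low, false_high / n
    return "<=", high, false_low / n

# Hypothetical feature values: 4 tumor samples followed by 4 normal samples.
values = [5.0, 6.0, 7.0, 8.0, 1.0, 2.0, 3.0, 5.5]
labels = ["tumor"] * 4 + ["normal"] * 4

direction, threshold, false_fraction = best_threshold(values, labels)
print(direction, threshold, false_fraction)
```

In this toy case the region contains every tumor sample plus one normal sample, so one of eight samples is a false prediction.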

Classification of normal/tumor tissues using transformed datasets
The classification procedure and the two criteria for determining the threshold value were the same as explained in the previous section. The only difference is that "feature" here refers to a pair-wise gene expression ratio taken from the lower or upper triangular part of M. Take the ratio #1537/#1831 as an example. To fulfill the two criteria, it was determined that the region greater than 0.755 ($R_{\#1537/\#1831}$) incorporates all the tumor tissue samples (Figure 2). It classifies all tumor tissue samples correctly with a false prediction of 6.4%. This is performed repeatedly for all entries in M until all the threshold values are determined. Now, to classify an unknown sample using a 2-feature model classifier, a combination of any two features (pair-wise gene expression ratios) and their corresponding pre-determined threshold values $T_i$ (selected from the $T_i$ of all features for each dataset) is used. The tissue class is assigned according to whether one or both ratio values of the unknown sample fall within the normal or the cancer region ($R_{feature_i}$). That is, if either of the two features (pair-wise gene expression ratios) of the unknown sample meets the criterion ($R_{feature_i}$) for normal or tumor tissue (by our definition, $R_{feature_i}$ is a region with 100% discriminating efficiency for a specific tissue type), the unknown sample is assigned to normal or tumor accordingly. This is repeated exhaustively for all possible combinations of two features, and the procedure is repeated for all tissue samples to evaluate the overall classification accuracy of the 2-feature model classifier. In total, we evaluated the classification of tissue samples based on different combinations of N genes and investigated classifiers of up to 10 features.
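The combination step admits more than one reading. The sketch below takes one of them as an assumption: every rule defines a region that contains all tumor samples, so a sample is called tumor only if it lies inside every region, and leaving any single region is taken as evidence for normal. The rule format, function names and data are hypothetical.

```python
from itertools import combinations

def in_region(value, direction, threshold):
    """True if value lies in the one-sided region defined by threshold."""
    return value >= threshold if direction == ">=" else value <= threshold

def classify(sample, rules):
    """Assign 'tumor' only if the sample is inside every all-tumor region.

    sample: expression values indexed by gene.
    rules: ((x, y), direction, threshold) tuples acting on sample[x] / sample[y].
    """
    for (x, y), direction, threshold in rules:
        if not in_region(sample[x] / sample[y], direction, threshold):
            return "normal"
    return "tumor"

def accuracy(samples, labels, rules):
    """Fraction of samples whose assigned class matches the label."""
    hits = sum(classify(s, rules) == lab for s, lab in zip(samples, labels))
    return hits / len(samples)

def best_rule_set(samples, labels, candidate_rules, n_features):
    """Exhaustively score every combination of n_features candidate rules."""
    return max(combinations(candidate_rules, n_features),
               key=lambda rules: accuracy(samples, labels, rules))

# Hypothetical data: 3 genes, 3 tumor and 3 normal samples.
samples = [[4, 2, 1], [6, 2, 1], [8, 4, 2],    # tumor
           [2, 2, 2], [6, 2, 4], [2, 2, 0.5]]  # normal
labels = ["tumor"] * 3 + ["normal"] * 3
rule_a = ((0, 1), ">=", 2.0)  # gene0/gene1 >= 2 holds for every tumor sample
rule_b = ((0, 2), ">=", 4.0)  # gene0/gene2 >= 4 holds for every tumor sample

print(accuracy(samples, labels, [rule_a]))
print(accuracy(samples, labels, [rule_b]))
print(accuracy(samples, labels, [rule_a, rule_b]))
```

Each rule alone misclassifies one normal sample, but a different one, so the two-rule classifier separates the toy data completely. This mirrors the pattern reported in the text, where combined axes outperform each axis used individually.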

Constructing the relationship tree for the top 25 genes
We calculated the cross-correlation coefficient r (Equation 1) for all pair combinations of the top 25 genes listed in Table 6 and Table 7. Before constructing the relationship tree for the top 25 genes for colon and prostate cancer, the cross-correlation coefficients were used to construct the pair-wise distance matrix D, in which each entry is the value (1 - r). Each row/column corresponds to a specific gene, and the entry at the intersection of row X and column Y corresponds to the distance in gene expression between gene #X and gene #Y. The diagonal entries of D are 0. Only the lower or upper triangular part of D is required to construct the relationship tree. From it, the neighbor-joining (NJ) algorithm was used to construct the relationship tree [10].

Figure caption (fragment): However, we observed that the pair-wise gene expression ratio (#1537/#1831, the ratio between vascular endothelial growth factor and gelsolin precursor) has better discriminating efficiency, as tabulated in Table 7 ('*' and 'o' represent normal and cancer tissue types, respectively).
Table footnote: * discriminating efficiency using only a single gene as discriminating axis.
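The distance matrix D used for the relationship tree can be built as in the sketch below, with each entry equal to 1 - r. The function names and the three toy expression vectors are hypothetical, and the neighbor-joining step itself is not shown.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def distance_matrix(expression):
    """Symmetric matrix D with D[x][y] = 1 - r(gene x, gene y)."""
    n = len(expression)
    D = [[0.0] * n for _ in range(n)]  # diagonal stays 0
    for x in range(n):
        for y in range(x + 1, n):
            d = 1.0 - pearson_r(expression[x], expression[y])
            D[x][y] = D[y][x] = d
    return D

# Hypothetical expression patterns of three genes across four samples.
expression = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with the first gene
    [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with the first gene
]
D = distance_matrix(expression)
for row in D:
    print([round(d, 3) for d in row])
```

With this measure, distances range from 0 for perfectly correlated genes to 2 for perfectly anti-correlated genes, so up- and down-regulated genes end up far apart in the tree.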

Results
After initial gene selection, respectively 82 and 262 genes (|r|>0.4) were selected from the colon and prostate dataset for downstream analysis (Table 1 and Table 2). Topping the list in both tables were genes that have been found to be either over-expressed/under-expressed in tumors [11]. The first three genes most correlated to cancer in the colon dataset were heavy chain of non-muscle myosin, human monocyte-derived neutrophil-activating protein (MONAP) and human desmin genes. This agrees with the findings from [12,13] that used other statistical tests (z-score, t-test) in a comparable analysis. The heavy chain of non-muscle myosin, denoted as the embryonic smooth muscle myosin heavy chain (SMemb), was found to be down-regulated in cancer. It was also determined experimentally to be a target for the protein encoded by the metastasis-related mts-1 gene [14]. Furthermore, it was demonstrated recently by 5'RACE analysis that heavy chain of non-muscle myosin interacts with ALK genes that have tyrosine kinase activity and oncogenic properties [15]. The human monocyte-derived neutrophil-activating protein (MONAP, interleukin-8), was second on the list. It was significantly up-regulated in the tumor compared to the normal samples. This protein has been linked to the progression of several human cancer types [16]. It was believed that over-expression of MONAP plays an important role in tumor angiogenesis and tumor aggression. The human desmin gene is the third on the list, and it was found to be down-regulated in tumor. Interestingly, this gene also showed significantly reduced expression in other cancer types such as the melanoma cell line [17].
From the prostate dataset, the most cancer-correlated gene is the human hepatoma gene coding for the serine protease hepsin. A brief literature search in PubMed showed that hepsin is a well-characterized transmembrane protease that is expressed at high levels in tumors. Three separate studies identified hepsin as a significant cancer biomarker that can be used for cancer diagnosis [18]. The second gene on the list was the human mitochondrial matrix protein P1. This gene has been correlated with different cancer types, with consistent up-regulation in tumors [13]. The third gene is the carcinoma-associated antigen GA733-2, which was among the 216 cancer markers identified by Ernst's group in Germany [19].

Effect of data transformation on coefficient of variation
To date, reliable markers with a low coefficient of variation (CV) are generally lacking, yet discovering robust cancer markers is crucial for successful cancer diagnosis. We investigated the CV between samples after data transformation: the lowest CV decreased to 16.5% in the colon dataset, whereas it increased to 25.8% in the prostate dataset (Table 3 and Table 4). Topping the lists for the two datasets were the pair-wise gene expression ratios #119/#54 (elongation factor 1-delta and 40S ribosomal protein S24) and #10614/#5871 (zq58b03.r1 Homo sapiens cDNA and nuclear matrix protein NXP2), respectively, which revealed informative pair-wise gene interactions in relation to the corresponding tissue phenotypes. They reflect how the cell adjusts their pair-wise product in response to physiological changes. Based on these observations, we found that the relative abundance between the

Prior to data transformation, the lowest coefficients of variation for single gene expression were 45.3% and 24.5% for the colon and prostate datasets, respectively. With the proposed data transformation, a significant improvement was achieved in the colon dataset. Interestingly, this was accompanied by an improved correlation with tissue phenotype and improved classification efficiency. We did not observe a similar improvement in CV, correlation with tissue class or classification efficiency in the prostate dataset.
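The effect on CV can be illustrated with a toy example. The sketch assumes CV is computed as 100 × (sample standard deviation / mean) over the samples of one class; the text does not state which estimator was used. The two genes are hypothetical and were chosen so that they share a large sample-to-sample factor.

```python
import statistics

def cv_percent(values):
    """Coefficient of variation in percent: 100 * sample std / mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)

# Hypothetical within-class intensities: both genes vary strongly from
# sample to sample, but they vary together.
gene_a = [10.0, 20.0, 40.0, 80.0]
gene_b = [5.0, 10.5, 19.0, 41.0]
ratios = [a / b for a, b in zip(gene_a, gene_b)]

print(f"CV gene A : {cv_percent(gene_a):.1f}%")
print(f"CV gene B : {cv_percent(gene_b):.1f}%")
print(f"CV ratio  : {cv_percent(ratios):.1f}%")
```

When two genes co-vary like this, the shared variation cancels in the ratio and its CV drops well below that of either gene. When they do not co-vary, no such gain is expected, which is one possible explanation for the weaker effect reported for the prostate dataset.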

Correlations of the single gene expression and pair-wise gene expression ratio
The distribution of correlation coefficients between genes and tissue phenotypes for the colon and prostate datasets is shown in Figure 3. The distributions are positively and negatively skewed for both datasets. The two red lines separate genes with |r| > 0.4 from the bulk (Tables 1 and 2); these thresholds retained 82 and 262 genes from the colon and prostate datasets, respectively. To study possible interactions between pairs of genes, we estimated the statistical correlation between their expression patterns. The distributions of the correlation coefficient and the extreme cases are shown in Figures 4 and 5. Both figures emphasize the true nature of gene-gene co-regulation, a complex biological mechanism that has most often been over-simplified by treating the expression of each gene as an independent variable [22]. For example, Figures 4 and 5 suggest that the expression levels of genes belonging to a common subset are most likely correlated with each other (e.g., gene #31 vs #119 in colon cancer, r = 0.95306, and gene #7775 vs #10749 in prostate cancer, r = 0.92922). It should be pointed out that the two humps in the probability density function are not centered at zero but concentrated at nonzero correlation r. For the colon dataset, positive correlation was the dominant type. For the prostate dataset, a balanced distribution of gene correlations was observed. We determined that some improvement in tissue classification is achieved when the pair-wise gene expression ratio is used as a discriminating axis instead of single gene expression (Figure 2). The reason is that the pair-wise gene expression ratio has a higher correlation with tissue phenotype and a lower CV (Table 5).

Gene expression and tissue type correlation
Several previous studies have endeavored to identify correlations between the expression of specific genes and cancerous transformation [4,13,23]. In the present study, we identified several novel target genes that clearly distinguish the two tissue phenotypes with high discriminating efficiency (>74%) (Table 6 and Table 8). Some of these have previously been documented as cancer-related genes in studies that did not involve expression profiling (human monocyte-derived neutrophil-activating protein (MONAP) and human hepatoma mRNA for serine protease hepsin); others (human gene for heterogeneous nuclear ribonucleoprotein (hnRNP), P24480 CALGIZZARIN, human mitochondrial matrix protein P1, human mRNA for aldose reductase and human adipsin) have not been identified from in-silico studies of tissue DNA-array expression data. The cancer-related genes for colon and prostate cancer were ranked according to their discriminating predictive power. The list should provide hints for researchers selecting molecular targets for diagnosis, prognosis or therapy.
Overall classification results and accuracies for each N-feature model classifier across two datasets were reported in Table 6, 7 and 8. In the following section, we will discuss a few important genes or pair-wise gene expression ratios from Table 6 and Table 7 that resulted in the optimum classification accuracy (Table 8B). They are the most efficient combination of discriminating axes for classifying tissue types because they delineate correctly all the normal/tumor tissues with the lowest percentage of false prediction.
For the sake of brevity, we will discuss three single gene expressions and two pair-wise gene expression ratios from colon cancer. For prostate cancer, two single gene expressions and two pair-wise gene expression ratios will be discussed. For the prostate cancer pair-wise gene expression ratios, the two axes for discriminating tissue types are: 1) #6185/#5840 and 2) #6185/#6749. Their threshold values were 2.69 and 2.55, respectively.
To illustrate the result of tissue classification graphically, two examples, each based on three genes or pair-wise gene expression ratios that together yielded the optimum classification efficiency for prostate cancer, are shown (Figure 6, Figure 7).

Figure 3. Histogram of the correlation coefficient r between single gene expression and tissue type for the colon and prostate cancer datasets. The distribution shows the coefficient of correlation between single gene expression and cancer phenotype. The extrema with correlation coefficient |r| > 0.4 (marked by red lines) were extracted for downstream data analysis.

Figure 4. Distribution of the cross-correlation between two single gene expression patterns in the colon dataset. The distribution shows the coefficient of correlation between the expression patterns of any pair of gene markers. The extreme cases are also plotted with their corresponding r values.

Figure 5. Distribution of the cross-correlation between two gene expression patterns in the prostate dataset. The distribution shows the coefficient of correlation between any pair of gene markers. The extreme cases are also plotted with their corresponding r values.

Constructing the relationship tree for top 25 gene for colon and prostate cancer
The relationship trees for the top 25 genes listed in Table 6 and Table 7 were constructed based on the cross-correlation between gene expressions (Figure 8). We employed the established neighbor-joining clustering method [10] to group genes by their correlated expression patterns across all tissue samples (so that genes whose expression is correlated appear in the same branch of the tree), using a novel distance measurement to quantify how a change in the expression of one gene interferes with that of another. The principle of the method is to start from a star-like tree and, at each stage of clustering, to join the pair of operational taxonomic units (OTUs; here, neighbors with similar gene expression) that minimizes the total branch length. Figure 8 revealed two major clusters of genes: the first corresponded to down-regulated genes and the second to up-regulated genes. The most efficient discriminating axes (feature genes) reside at the basal position of each cluster. As a control study to validate this method of grouping genes, we exploited the fact that in bacteria many genes are co-expressed as single transcription units: we applied the distance measurement to bacterial gene arrays (B. subtilis and E. coli) and successfully recovered the co-regulated operon gene structures (supplementary file #1).

Data transformation to investigate pair-wise gene expression ratios
As expression profiling technologies mature, the identification of significant cancer-related signals from noisy datasets (characterized by a high CV) remains a major challenge. In particular, a robust normalization method is critical to ensure that arrays from two experiments are comparable, with minimum noise, prior to downstream analysis. However, existing normalization methods are limited by the lack of good models to account for the sources of experimental and biological variation [24]. Hoffmann et al. [25] applied different normalization methods to the same dataset and demonstrated that the number of genes detected as differentially expressed differed by a huge factor depending on which normalization method was used. The problem is exacerbated by the variety of array formats, experimental designs and methods.
Here, instead of resorting to single gene expression, which depends heavily on normalization, for tissue classification, we presented a transformation method that uses pair-wise gene expression ratios within the same experiment as the discriminating axes. By doing so, we aimed to minimize the influence of different normalization methods, considering that an experiment is self-consistent, with the same factors affecting all genes in the same fashion. The rationale is that even when the normalization methods differ between two array experiments, the pair-wise gene expression ratios within each experiment will remain relatively stable. If reliable cancer-related signals that exist in the form of pair-wise gene expression ratios are indeed discovered, they will be relatively independent of the normalization method used on a dataset.
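The stability argument can be made explicit under a simplifying assumption that is ours rather than the paper's: suppose a normalization step acts on array $s$ as a single multiplicative factor $a_s > 0$ applied to every gene. The factor then cancels in any pair-wise ratio:

```latex
e'_{i,s} = a_s\, e_{i,s}, \quad e'_{j,s} = a_s\, e_{j,s}
\;\;\Longrightarrow\;\;
\frac{e'_{i,s}}{e'_{j,s}} = \frac{a_s\, e_{i,s}}{a_s\, e_{j,s}} = \frac{e_{i,s}}{e_{j,s}}
```

Normalizations that are not a pure global rescaling (for example, intensity-dependent or quantile-based methods) do not cancel exactly, which is consistent with the ratios being described as only relatively stable.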
Figure 8. Inter-relationship of gene expression for the top 25 prostate cancer genes extracted from Table 8. The tree structure was derived using the neighbor-joining algorithm [10]. Two clusters of gene expression were observed, namely the up-regulated (#6185) and down-regulated (#8554) genes in cancer tissues.

The improvement in CV (Table 3) and in overall classification accuracy (Table 7) for the colon dataset after data transformation has two implications. First, the transformation is able to increase the signal-to-noise ratio (SNR) of the cancer-related signal, because the resulting pair-wise gene expression ratios correlate more strongly with tissue phenotype. Second, because the pair-wise gene expression ratios are less dispersed than single gene expression, using them to classify tissue types will be more reliable and accurate (Table 8). Despite these benefits, the data transformation introduces a computational limitation owing to the enormous number of feature combinations to be processed, especially when N-feature model classifiers with N > 4 are considered (if 100 features are selected and a 10-feature model classifier is investigated, the search space is $\binom{100}{10} = 1.731030945644 \times 10^{13}$ different combinations of features). As a result, more computation time is required to search all possibilities. As an example, the discriminating axes that accounted for the optimum accuracy in the 1- to 3-feature model classifiers are reported in Table 9.
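The growth of the search space is easy to tabulate. The short sketch below uses Python's `math.comb`; the function name `search_space` is hypothetical, and the 100-feature pool is the example used in the text.

```python
import math

def search_space(n_features, model_size):
    """Number of distinct feature subsets an exhaustive search must score."""
    return math.comb(n_features, model_size)

# Size of the exhaustive search for 1- to 10-feature classifiers
# drawn from a pool of 100 candidate features.
for k in range(1, 11):
    print(f"{k:2d}-feature models: {search_space(100, k):,}")
```

For the 10-feature case this gives 17,310,309,456,440 subsets, matching the figure quoted in the text. The same function gives 1,975,078 gene pairs for the 1,988 probes of the colon dataset.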
Regarding the high classification accuracy reported in Table 8, it should be stressed that it was achieved by using all tissue samples to derive the threshold value $T_i$ in the feature selection procedure. In other words, instead of adopting the more conservative accuracy test in which only a subset of tissue samples is used to derive the classification criteria (threshold values), we used all tissue samples so that the comparison between single gene expression and pair-wise gene expression ratio is unbiased and in line with our objective, which is to compare the classification efficiency of the two. Admittedly, our dataset is noisy, and selecting a subset of tissue samples that is representative of the entire dataset remains a challenge [5], given that the dataset is small and unbalanced (particularly the colon dataset). We might obtain ambiguous or contradictory results with a different subset of tissue samples. Furthermore, we might miss important features (single gene expression or pair-wise gene expression ratio) because of a biased training dataset. By including all tissue samples in both studies (single gene and pair-wise gene ratio), we aimed to derive the most reliable threshold values and classified tissue samples based on them. Since the same methodology was applied in both studies, the comparison of classification efficiency is valid and reflects how well each type of feature (single gene and pair-wise gene ratio) can be used to delineate tissue samples.

The implication derived from the classification results
For the colon dataset, the three axes for discriminating tissues are: 1) human monocyte-derived neutrophil-activating protein (MONAP); 2) the human desmin gene; and 3) the human cysteine-rich protein (CRP) gene. The association of the first two genes with cancer biology was discussed earlier, so only the human cysteine-rich protein gene is discussed here. The expression and induction of this protein have been associated with protection against DNA damage, oxidative stress and apoptosis [26]. In the colon dataset, we observed down-regulation of this protein in tumor tissue, which suggests a lack of protection against DNA damage.
For the colon cancer pair-wise gene expression ratio, the two axes for discriminating tissues are: 1) #1831/#1537 and 2) #753/#768. Using these two axes, 98.4% of the tissue samples can be classified correctly. The expression ratio between #1831 (gelsolin precursor) and #1537 (vascular endothelial growth factor) correctly discriminated 93.6% of all tissue samples. Vascular endothelial growth factor was recently determined to be a plausible biomarker for colon cancer [27]. Gelsolin has been found to suppress tumorigenicity in different cancer samples, including lung, bladder and breast [28]. When used individually as discriminating axes, these two genes classified only 66.1% and 67.7% of all tissue samples correctly, respectively. Furthermore, the expression ratio between #753 (human cysteine-rich protein) and #768 (macrophage migration inhibitory factor) correctly discriminated 90.3% of all tissue samples. The human cysteine-rich protein was discussed in the previous section. The macrophage migration inhibitory factor (MIF) functions as a pluripotent cytokine involved in broad-spectrum pathophysiological events in association with inflammation and immune responses. Several reports, including ours, have suggested that MIF is also involved in tumorigenesis [29]. When used individually as discriminating axes, these two genes classified only 83.9% and 66.1% of all tissue samples correctly, respectively.
For prostate cancer single gene expression, the two axes for discriminating tissues are: 1) human hepatoma mRNA for serine protease hepsin and 2) human adipsin. The first gene was discussed in the previous paragraph. The second gene, adipsin, has also been suggested by Chow et al. [30] as a good cancer marker for studying the basic biology of cancer.
For the prostate cancer pair-wise gene expression ratio, the two axes for discriminating tissues are: 1) #6185/#5840 and 2) #6185/#6749. Using these two axes, all tissue samples can be classified correctly. The expression ratio between #6185 (human hepatoma mRNA for serine protease hepsin) and #5840 (Homo sapiens mRNA for KIAA1109 protein) correctly discriminated 92.2% of all tissue samples. The human hepatoma mRNA for serine protease hepsin has been determined to be an important marker for cancer cell development [11,18]. The KIAA1109 protein is an uncharacterized protein encoded on human chromosome 4 [31]. A homology search against the non-redundant databases yielded no significant hit to known genes. When used individually as discriminating axes, these two genes classified only 86.3% and 61.8% of all tissue samples correctly, respectively. On the other hand, the expression ratio between #6185 (human hepatoma mRNA for serine protease hepsin) and #6749 (Homo sapiens mRNA for KIAA1055 protein) correctly discriminated 90.1% of all tissue samples. The KIAA1055 protein is an uncharacterized protein encoded on human chromosome 15 [21,31]. A homology search against the non-redundant databases yielded 40.7% DNA identity to a novel human cDNA that had been found to function as a cancer-inhibiting protein [21]. When used individually as discriminating axes, these two genes classified only 86.3% and 62.8% of all tissue samples correctly, respectively.