We have identified 1,233 genes that are significantly differentially expressed by 3-fold or more in clear cell tumors relative to adjacent normal tissue isolated from the same surgical samples. Our ability to identify so many differentially expressed genes by measuring mRNA hybridization intensities from just ten surgical samples was strongly effected by our data analysis strategy.
Perhaps most importantly, we found that an unpaired t-test strategy was more powerful than a paired-sample t-test for identifying differentially expressed genes. This suggests that the variability between patients might be less than our hybridization-intensity measurement error. We propose that in two-channel hybridization experiments (like many of the previous renal cell tumor expression profiling experiments) it may also be beneficial to average hybridization intensities across many samples rather than average the ratio of hybridization intensities – despite the added inter-array normalization requirements – to arrive at more accurate estimates of the difference in expression between groups.
Another key to the sensitivity of our analysis was removing samples that showed subtle technical defects. We did this by looking for samples with an unusually large number of intensity measurements in which the ratio of sequence-specific to non-specific hybridization is low, and identified one sample, N032, that was a clear outlier in this regard. Had we included this sample in our analysis we would have only identified about 6,800 probesets that display evidence of significant differential expression (p < 0.03) compared with the nearly 7,700 probesets we identified when not including this sample. In light of the dramatic effect of excluding this sample, we also examined the consequences of having included other samples that might be atypical for technical or non-technical reasons.
One such sample is the adjacent normal tissue from patient 035. This sample shows a pattern of gene expression that by principal-components analysis appears to be a mixture of the patterns seen in the normal tissue and clear cell tumor samples. Since we have no compelling reason to think that the pattern of gene expression we observe in sample N035 results from something other than normal biological variability, we have retained this sample in our analysis. Had we excluded N035, the number of probesets with evidence of significant differential expression would have increased to almost 8,100.
The other atypical sample that we have retained in our analysis is the tumor sample from patient 005. This sample has a pattern of gene expression that is the largest source of variation among the patterns of gene expression seen in the tumor samples as measured by principal components analysis. Excluding this sample would have increased the number of differentially expressed probesets by 89. Taken together, the effect of removing N035 and C005 suggests that our estimate of being able to identify greater than 95% of the genes with average signal intensity that are changed three-fold or more needs to interpreted somewhat cautiously as our data is unlikely to be uniformly normal and our power to detect differential expression might therefore be lower.
We validated our approach of using the number of uniquely poor sequence-specific probeset hybridizations to eliminate sample N032 with a more generalizable three-step approach of first identifying probesets that contain an intensity measurement that significantly deviates from the mean; next counting how many of these outlier values are contributed by each sample; and then determining if the fraction contributed by any one of the samples is significantly different from the other samples. This analysis also identifies sample N032 as a significant outlier (data not shown). But the probability of N032 being an outlier is lower with this analysis than with our analysis of hybridization sequence-specificity and is more complex. The scaling factor N032 hybridized to the U133A array is also exceptionally large suggesting that the defect we detected in our "uniquely absent" analysis might also be reflected in the scaling factor.
Another aspect of our data analysis strategy involved minimizing the number of false positives – genes that appear to be differentially expressed in our particular samples but are in fact expressed similarly in tumor and normal tissue. One approach we have taken to reduce the absolute number of false positives has been to reduce the number of hypotheses of differential expression we have tested. To do this, we chose not to test the hypothesis that any individual probeset is differentially expressed in the absence of strong evidence of sequence-specific hybridization to this probset in at least one of the tumor or normal samples. By including only those probesets where there is significant sequence-specific hybridization in one or more sample, we reduced the number of probesets we tested for differential expression from about 45,000 to just over 27,600. Our rationale for requiring significant sequence-specific hybridization is that it would be otherwise impossible to interpret the meaning of a differentially expressed probeset if that differential expression couldn't then be associated with a particular sequence and thereby a particular transcript. We set our threshold for evidence of significant differential expression such that the number of probesets that we would expect to exceed this threshold by chance alone is about 11% of the total number of probesets that we observed to exceed this threshold. It is quite likely that after applying our secondary threshold – of only considering those probesets that are differentially expressed by three-fold or more – that the fraction of false-positives is less than 11% but this cannot be determined directly.
By further characterization of the multiple probesets that interrogate some of the genes represented on the Affymetrix arrays, it may be possible to further reduce the number of hypotheses tested by testing for evidence of differential gene-level rather than probset-level hybridization. We found that the 27,600 probesets we analyzed for differential expression only interrogate 20,000 distinct unigene clusters. Of the 12,500 probesets that map to unigene clusters that are interrogated by multiple probesets, 5.8% are significantly differentially expressed by more than 3-fold, compared to 6.6% of the 14,600 probesets that map to a unique unigene cluster (chi-square test, p = 5 × 10-4). This suggests either that a subset of the multiple probesets that interrogate a particular gene have a reduced ability to detect differential expression or that it is more difficult to reliably detect the expression of those genes that are interrogated by multiple probesets. With a large enough collection of array experiments it might be possible to extend the probeset filtering protocol to determine if any of the multiple probesets that detect a single gene have a reduced ability to detect differential expression using one portion of the dataset, and eliminate these probeset measurements from the remaining data which would be used for the analysis of differential expression. Another strategy would be to filter out the probesets using the same data for filtering and analysis in a post-hoc scheme, but this type of approach could easily result in over-filtering. We took a conservative approach to the issue of multiple probesets by averaging both the significance and the fold-change across the probesets and this led us to discard about 16% of the three-fold changed genes that we would have retained using the post-hoc filtering strategy. An important issue associated with all three of these approaches for handling multiple probesets for any individual gene is that if heterogeneous evidence for differential expression across multiple probesets is the result of alternative splicing or incorrect unigene cluster assignment, any conclusion about the expression of that gene as a single entity will be invalid. Thus, while the most conservative approach would be to eliminate genes for which there is heterogeneous evidence of differential expression across multiple probesets, we chose the slightly less conservative approach of averaging the probesets since many of the probsets for a single unigene cluster had similar values.
Given the differences in gene-expression measurement platforms, data-analysis strategies, and reporting techniques, we were pleased to find that over a hundred of the twelve-hundred genes that we identified as being differentially expressed by three-fold or more in renal cell carcinoma had also been identified in two or more of the five previous studies of RCC gene expression with which we compared our results. We failed to identify only four genes that had been identified in three or more previous studies. For all four genes, we found that each had statistically significant evidence of a change in expression but that our requirement for a three-fold or larger change prevented us from scoring these genes as being differentially expressed. The fact that we identified so many of the genes that had been identified in multiple other studies suggests that our list contains an extensive catalog of three-fold differentially expressed genes. Of the twelve-hundred genes that we identify as being differentially expressed in RCC, 870 have not been previously reported.
Several factors related to the reporting of the data may have reduced our ability to identify additional shared conclusions about genes that are differentially expressed between the studies. First, Gieseg et al. only report 38% of the genes they identified as being differentially expressed. Second, for four of the datasets, we were only able to match 72 – 79% of the reported gene identifiers with genes in our dataset, and there might be additional agreements among those genes that we were unable to compare directly. Third, most previous studies of RCC gene expression have identified many fewer differentially expressed genes than we report here and this decreases the probability of finding agreements between our study and at least two additional studies.
We saw the highest level of agreement (83%) between our data set and the differentially expressed genes reported by Takahashi et al. Several factors likely contribute to our having identified differential expression for such a large fraction of the genes identified in this study. First, Takahashi et al. use a three-fold-change threshold similar to our analysis. Second, Takahashi et al. use a large number of patient samples in their analysis and this increases their power to detect differentially expressed genes. Third, while Takahashi et al. do not use an analysis strategy that calculates a probability of differential expression, they do use a very stringent sample-wise voting strategy (requiring greater than three-fold differential expression in more than 75% of the patient to normal comparisons) that may in fact be more stringent than the parametric statistical methods we used. It appears from highlighting the Takahashi et al. genes on the "volcano plot" of our dataset (genes plotted as a function of fold-change and statistical significance) that of the fourteen genes that are differentially expressed in the Takashi et al. dataset that are not differentially expressed in our dataset, there appear to be two classes of genes: six that we observe to be unchanged and eight that are changed but where the differences do not meet our criteria for differential expression. We suggest that the six Takahashi et al. genes that are unchanged in our dataset may reflect differences between the probes used for detecting the expression of these genes in the two distinct microarray systems.
When highlighting the differentially expressed genes from our re-analysis of the Higgins et al. dataset on the volcano plot of our dataset, we observe a similar apparent bifurcation of our observations into a group of genes that show very little change and another group of genes that is more highly changed. Despite using a similar analysis strategy for analyzing our dataset and our reanalysis of the Higgins et al. dataset we only find evidence of differential expression for 47% of the genes identified in the re-analysis of the Higgins et al. dataset. If we are correct that the genes that show very little change result from probe differences, it is possible that the rate of shared conclusions is influenced by a large number of probe-specific platform differences as well as the false-discovery rates of each analysis and/or an overestimation of our power to detect three-fold changed genes (especially for genes with low hybridization intensity which tend to have higher measurement error).
It is interesting that our observations of the differentially expressed genes identified in the remaining three studies do not show the same degree of apparent bifurcation between genes that are unchanged and genes that are changed but which did not meet our criteria for differential expression. These studies all use some sort of sample-voting analysis strategy though it appears that the fold-change and/or percent-vote thresholds are less stringent than those used by Takahashi et al.
Despite the differences in analysis strategies, the degree of shared conclusions about specific genes that are differentially expressed in RCC tumors is much higher than would be expected by chance alone. We were particularly interested to find that among the genes identified as differentially expressed in three or more studies, sixteen have no known function or informative homology to other genes with known function. This argues strongly for the ability of microarray-based expression profiling to identify genes that are differentially expressed in a manner that is not informed by our current understanding of underlying biological processes.
To begin the process of understanding what differential gene expression can tell us about the biological processes underlying renal cell carcinogenesis, we have taken a directed approach of asking whether specific keywords related to biological processes known to be important in RCC and cancer are associated with a larger number of the genes we have identified as being either induced or repressed in RCC than would be expected by chance alone. We found that genes associated with the keywords hypoxia, angiogenesis, tumor-necrosis factor, apoptosis, interferon, drug- and radiation-resistance, and metastasis are enriched among our list of induced genes. Genes associated with either the keyword kidney or renal are enriched among our list of repressed genes.
Among the eleven differentially expressed genes related to the keyword metastasis we were intrigued to find that some are connected directly or indirectly to MAP kinase and TNF-alpha. Expression of TIMP1 (induced 3.2-fold in RCC) and TNFAIP6 (induced 15.7 fold) are both induced by treatment with TNF-alpha [21, 22]. GPR54 (induced 8.2-fold) is the receptor for metastin and is upstream of MAPK  as is endothelin1 (EDN1, induced 3.2 fold) . Endothelin1 and VEGF (both induced 3.2 fold) promote the migration of endothelial cells  and angiogenesis .
ITGA5, MCAM, FXYD5, FUT3 and CHI3L1 – all associated with metastasis – are involved in cell adhesion like TIMP1. A number of other genes involved in cell adhesion, cell migration, and/or cytoskeletal organization – but not specifically known to be involved in metastasis – are also changed in RCC.
A number of the genes associated with metastasis that are induced in RCC also regulate or are regulated through HIF1 in response to hypoxia. Transcription of VEGF and endothelin1 are both stimulated by HIF1  and endothelin1 promotes the stabilization and accumulation of HIF1-alpha . EGLN3 a homolog of C. elegans EGL9 which is a prolyl hydroxylase that targets HIF1-alpha for destruction  is induced 12.5-fold in RCC. HIF1-alpha induces expression of carbonic anhydrase IX (induced 17 fold)  which is the renal cell carcinoma-associated antigen G250  and is induced in many cancer types. HIF1-alpha is also responsible for the hypoxia-induced expression of angiopoietin-like 4 (induced 24.2-fold), insulin-like growth factor binding protein 3 (induced 10.0-fold), RTP801 (induced 4.2-fold) and PLOD2 (induced 3.5-fold) . These changes in gene expression are consistent with the model that HIF1-alpha accumulation is a hallmark of renal cell carcinogenesis. The increase in HIF1-alpha-dependent gene expression despite the strong induction of EGLN3 suggests that HIF1-alpha is resistant to degradation perhaps as a result of defects in VHL-mediated proteolysis.
We were also intrigued to see the induction of the apoptosis-inhibitor BIRC3 (3.3-fold) which is induced by hypoxia through a HIF1-independent mechanism  as well as the induction of the chemokine receptor CCL5 (3.5-fold) which is repressed under hypoxic conditions . HIG2 (induced 7.4-fold) and ADORA3 (induced 3.4-fold) are also known to be induced in response to hypoxia [34, 35] though it is unclear if this is a HIF1-dependent process.
We found a number of genes with oncogenic potential that are induced in RCC that have not been noted in other microarray studies of RCC gene expression. One such gene is Axl (induced 3.5-fold), a receptor tyrosine kinase that causes transformation when overexpressed in NIH 3T3 cells . APOBEC3G (induced 4.0-fold) is highly similar to the catalytic subunit of the RNA editing enzyme APOBEC1 – overexpression of which causes elevated rates of carcinoma in transgenic mice . Both APOBEC3G and APOBEC1 have a potent DNA-mutator phenotype when expressed in E. coli . We observed that IMUP, a possible transcription factor that is up-regulated in SV40-transformed cells , is induced 5.1-fold.
Additional genes identified as upregulated in renal cancer are interesting from a tissue perspective. Several groups of genes suggest increases in tumor vasculature and inflammation. Endothelial cell specific genes found at increased abundance include von Willebrand factor (8.2-fold increased) and endothelial cell specific molecule-1 (ESM1) (3.0-fold increased) . Platelet/endothelial cell adhesion molecule-1 (PECAM1 or CD31) (4.1-fold increased) is expressed highly on endothelium and in leukocytes. Consistent with an overall inflammatory response, several T cell expressed genes are increased, which likely reflect T cell invasion. Moreover, classic major histocompatibility complex, class II DQ beta 1 (4.2-fold increased) and DP beta 1 (3.0 fold increased) are also increased. Perhaps consistent with tumor necrosis, several toll-like receptors are upregulated as well, including TLR7 (4.7-fold increased), TLR3 (4.5-fold increased) and TLR2 (3.4-fold increased).
A suprising number of G protein coupled receptors (GPR) and G protein signaling molecules are upregulated. These include GPR4 (3.4-fold increased), GPR54 (8.2-fold increased), GPR92 (3.4-fold increased), Rho GTPase activating protein 9 (3.3-fold increased), and particularly regulators of G protein signaling RGS1 (12.8-fold increased) and RGS5 (7.8-fold increased). Although activity of these pathways might be expected to increase with cellular transformation, their increased expression suggests an additional higher level of control of these pathways.
The majority of differentially expressed genes were downregulated. Moreover, many of the genes expressed at lower levels in the renal cancers are ones normally expressed at high levels in differentiated kidney tissue, such as renal epithelial transporters. In addition, some lower expressed genes are cell type-restricted and are therefore excluded from the cancer, such as glomerular podocyte-specific genes encoding the Wilms' tumor suppressor WT1, nephrin and podocin. The notion that primarily "kidney" genes are expressed at lower levels is also supported by the keyword results.
Nevertheless, a small number of the downregulated genes may be important in renal cancer biology due to their tumor suppressor phenotypes. For example, the FHIT tumor suppressor gene is frequently lost in renal cancer. Reduced FHIT expression noted here might also reflect this phenomenon. Surprisingly, the sequence-specific hybridization to the VHL tumor suppressor gene probeset was not detected in normal kidney tissue on the Affymetrix U133B array, which may indicate a problem with the VHL probeset. Thus, the group of downregulated genes might still include some additional important growth suppressors. For example, DLEC1, a gene deleted or mutated in a variety of cancers which inhibits cell growth when reintroduced into DLEC1-defective tumor cell lines , is repressed 6.9-fold. GAS1, overexpression of which causes growth arrest in p53-positive cell lines [42, 43], is repressed 5.4-fold. SSAT2, repressed 4.8-fold, is highly similar to murine SAT which catalyzes the rate-limiting step in polyamine biosynthesis. Overexpression of SAT causes growth arrest in human breast carcinoma cells . CALML3 which is induced in terminally differentiated epithelial cells  and which may have reduced expression in lung carcinoma  is repressed 4.0-fold. GADD45A, a p53-dependent DNA-damage inducible gene that inhibits mitotic CDK activity  and is required for DNA-damage induced growth arrest, is repressed 3.9-fold. AUH, which is involved in degrading AU-rich mRNA's – including many proto oncogenes  – is repressed 3.9-fold. HRASLS2 (down 3.9-fold) is highly similar to HRASL3 which is an anti-apoptotic tumor suppressor . ARHI (repressed 3.3-fold) is a member of the Ras homolog gene family. Expression of ARHI is lost in many ovarian and breast cancers . We also found three tumor suppressor genes to be upregulated in RCC: the BRCA1-binding protein BARD1, the CDK4 inhibitor CDKN2B, and Cystatin A.
Several genes involved in DNA synthesis (MCM5, TOP2A), G2 (BUB2, HEC) and mitosis (DOCK2, PRC1) are also induced in RCC. This could be the result of the higher mitotic index of the tumor cells or could reflect tumor-specific alterations to the cell-cycle machinery.
Among the genes that have not been previously noted to be differentially regulated in RCC, we found that aldeyhyde oxidase and sulfite oxidase but not xanthine dehydrogenase (the three human enzymes that use a molybdenum cofactor ), are down regulated. We also found that MOCS1 and molybdopterin synthase  as well as gephyrin , three of the four enzymes involved in molybdenum cofactor biosynthesis, are also down regulated. It is unclear if or how the differential expression of these molybdenum-related genes relates to RCC disease.
We also found that a component of the variation in gene expression among RCC tumors is significantly correlated with the Fuhrman grade of the tumor, and that this variation is more continuous than distinct between the Fuhrman grades. Since our study was designed to identify genes that are differentially regulated in RCC, we did not analyze enough samples of each of the different tumor grades to identify genes that vary as a function of Fuhrman grade with a rigorous false-positive threshold. Analysis of additional tumors from the different Fuhrman grades should allow us to identify these genes and may provide additional insight into the biological processes underlying RCC progression.