Gene expression meta-analysis identifies metastatic pathways and transcription factors in breast cancer

Background Metastasis is believed to progress in several steps including different pathways but the determination and understanding of these mechanisms is still fragmentary. Microarray analysis of gene expression patterns in breast tumors has been used to predict outcome in recent studies. Besides classification of outcome, these global expression patterns may reflect biological mechanisms involved in metastasis of breast cancer. Our purpose has been to investigate pathways and transcription factors involved in metastasis by use of gene expression data sets. Methods We have analyzed 8 publicly available gene expression data sets. A global approach, "gene set enrichment analysis" as well as an approach focusing on a subset of significantly differently regulated genes, GenMAPP, has been applied to rank pathway gene sets according to differential regulation in metastasizing tumors compared to non-metastasizing tumors. Meta-analysis has been used to determine overrepresentation of pathways and transcription factors targets, concordant deregulated in metastasizing breast tumors, in several data sets. Results The major findings are up-regulation of cell cycle pathways and a metabolic shift towards glucose metabolism reflected in several pathways in metastasizing tumors. Growth factor pathways seem to play dual roles; EGF and PDGF pathways are decreased, while VEGF and sex-hormone pathways are increased in tumors that metastasize. Furthermore, migration, proteasome, immune system, angiogenesis, DNA repair and several signal transduction pathways are associated to metastasis. Finally several transcription factors e.g. E2F, NFY, and YY1 are identified as being involved in metastasis. Conclusion By pathway meta-analysis many biological mechanisms beyond major characteristics such as proliferation are identified. Transcription factor analysis identifies a number of key factors that support central pathways. Several previously proposed treatment targets are identified and several new pathways that may constitute new targets are identified.


Background
Breast cancer is the most common cancer among women and the leading cause of cancer related death. Metastasis is the main cause of death of the disease but the knowledge of biological mechanisms in metastasis is still fragmentary. Risk of recurrence is evaluated by clinical and pathological criteria. However, the performance of this method is far from optimal. Gene expression profiling of tumors has been used for classification of cancer outcome in several studies with promising results for improvement of risk prediction [1][2][3][4][5][6][7][8][9]. Despite these promising clinical results, little insight to biological mechanisms has been obtained from the large amount of gene expression data. Pathway or gene ontology analyses in these studies are limited to genes in the outcome classifier. The main mechanism observed is up regulation of cell cycle while other pathways are more inconsistent between the studies [5,[10][11][12]. Another potential use of these data is to map deregulation of transcription factors and miRNA having impact on metastasis. If sufficient knowledge about sequence motifs for these regulators existed, groups of genes with common motifs might display concordant deregulation in metastasizing tumors. An extensive knowledge of transcription factor binding elements is gathered in TransFac database http://www.gene-regulation.com and bioinformatic prediction of target genes has been useful, while prediction of miRNA target genes is still in its infancy [13]. However, systematic investigation with these methods in prognostic gene expression data sets has not been performed for breast cancer. The combined knowledge of pathways and regulators of gene transcripts may deepen the understanding of biological mechanisms in metastasis.
Prognostic data set have been used for validation of prognostic significance of gene sets e.g. representing a certain pathway. For example Bild et al. developed cell models representing candidate metastatic pathways, defined characteristic gene sets by microarray gene expression analysis, and demonstrated prognostic value in the prognostic data sets performed with gene expression [14]. However, these candidate pathways presumably only represent a fraction and not necessarily the most important of the pathways involved in metastasis. We hypothesize that many biological pathways and transcriptional regulators are involved in metastasis of breast cancer and that they are reflected in gene expression patterns of primary tumors. Our aim is to elucidate metastatic pathways and secondly transcriptional regulators and to integrate the knowledge of these informations. Pathway analysis programs typically compare level of gene expression in two sets of samples by ranking genes according to a statistic model and apply a cut-off value for genes resulting in a group of significantly up-and downregulated genes. Overrepresentation analysis then identifies pathways significantly regulated in these groups. We have supplemented this kind of analysis with a more global approach by gene set enrichment analysis (GSEA), to identify pathways that are differentially regulated in metastasizing and non-metastasizing tumors. By this method, the imbalance of pathway gene-sets is examined in the entire list of ranked genes. In a single data set this analysis will generally not result in significant findings beyond major pathways like cell cycle. By performing meta-analysis of 8 data sets, we have increased the power to identify pathways and transcription factors involved in metastasis of breast cancer.

Data sets
Eight publicly available data sets of gene expression at RNA level in primary tumors were included in the analysis. These studies are performed with different platforms, different populations etc. as depicted in table 1. The outcome differs in that local and regional recurrences are included in some studies. However, non-metastatic relapse constitute a minority in clinical cohorts. There may be overlap in the samples in the different data set e.g. samples from Uppsala in Sotiriou 2006 and Uppsala data sets, but the total number of different tumor samples is at least 1200.
The normalizations performed in the studies were retained because the authors found these methods optimal for the data sets, and because pathway analysis was performed separately in each data set.

Gene set enrichment analysis of pathways, transcription factors and miRNA
GSEA v 2.0 [15] was used with 450 curated gene sets representing individual pathways. These pathway gene sets are adopted from KEGG http://www.genome.ad.jp/ KEGG, GenMapp http://www.genmapp.org, Biocarta http://www.biocarta.com etc. and gathered in the Molecular Signature Database implemented in GSEA. For the analysis of promoter and miRNA response elements 837 sequence elements predicted or validated to bind transcription factors or miRNA's were downloaded from MsigDB. The motifs are collected from TransFac http:// www.gene-regulation.com and mirBase http://micro rna.sanger.ac.uk and target gene sets defined by bioinformatic prediction of target genes [13,15]. The GSEA program ranks genes according to a signal-to-noise value: (X A -X B )/(s A +s B ), where X is the mean and s is the standard deviation for the two classes A and B (metastases and nonmetastases). When several probes recognized the same gene, the probe with max expression value was extracted using the "collapse to gene set" function. Gene sets represented by less than 15 genes in a data set was excluded except for the Sotiriou 2003 data set where this threshold was set to 10 genes because of the low number of genes on that chip.
The output from GSEA is an enrichment score, describing the imbalance in the distribution of ranks of gene expression in each gene set between metastasizing and nonmetastasizing tumors. The enrichment score is normal-ized according to size of the gene sets. Then, gene set were ranked according to the normalized enrichment score with gene sets up-regulated in metastasizing tumors on the top and down-regulated gene sets in the bottom.

GenMAPP pathway analysis
To perform pathway analysis with an independent method, GenMapp 2.0 software was used, by applying a significance cut-off for genes between metastasizing and non-metastasizing tumors of p = 0.05 in Students T-test. The software uses Fisher exact test to examine overrepresentation of up-regulated or down-regulated genes among 203 pathways. The output is a ranked list of up-regulated and down-regulated pathways respectively for each data set.

Meta-analysis
The ranked lists of gene sets for each analysis generated by GSEA or GenMAPP from the 8 data sets were integrated so that only gene sets represented in output from all data sets were included. The initial 450 pathway gene sets in MSigDb for the GSEA pathway analysis were reduced to 223 gene sets passing the threshold (at least 10 or 15 genes in gene sets) in all data sets. For analysis of transcription factor and miRNA binding sites, 837 motif gene sets from MSigDb were reduced to 761 motif gene sets. 203 GenMAPP pathways were reduced to 177 pathways present in all data sets. For each data set, individual gene sets were assigned a ranking value from 1 to the maximum number of gene sets, according to the ranking performed by GSEA or GenMAPP. The mean ranking value for each gene set was calculated across the data sets and finally the gene sets were ranked according to this value.
Our null-hypothesis is that the expressions of genes in pathway gene sets are unrelated to metastasis. This means that the ranking value for a given gene set in a given data set is expected to be a random value between 1 and the maximum number of gene sets analyzed. To simulate the distribution of mean ranking values across the 8 data sets fulfilling the null-hypothesis, random drawing of 8 ranking values were performed 10 6 times and the mean value was calculated each time. A null distribution of mean ranking values was generated from these results. To test the significance for a given gene set, the observed mean ranking value was compared to the null distribution. To fulfill the null-hypothesis an observed mean ranking value should be within 95% interval of the null-distribution. This calculation, estimation of p-value and correction of p-value by false discovery rate (FDR) was

Pathway analysis
Data from more than 1200 breast cancer patients were collected (table 1). GSEA only identified few significant pathways within each data set (data not shown). However, by performing meta-analysis of gene sets ranked by enrichment score, several gene sets turned out to have low ranking number in the majority of data sets indicating up-regulation of corresponding pathway in metastasizing tumors compared to non-metastasizing tumors. Similar, gene sets with a high mean ranking value indicated low expression in metastasizing tumors compared to nonmetastasizing tumors (table 2). False discovery rates indicated 38 of these gene sets to be significantly differentially expressed; 26 up-regulated and 12 down-regulated. The most striking pathways are DNA replication and cell cycle that are both up-regulated.
An independent method, GenMAPP, with a very different approach, including only the most significant genes, was applied to the same data (table 3). In this approach, analysis of up-and down-regulated gene sets are performed independently, resulting in two ranked output lists, with up-or down-regulated pathways on the top respectively. For this reason up-regulated as well as down-regulated pathways have low mean ranking numbers (table 3). Eleven pathways are up-regulated and 8 are down-regulated. Again cell cycle (KEGG and G1_to_S_control_reactome) and DNA replication reactome, but also electron transport chain, are strikingly upregulated.

Regulatory motifs
GSEA was also applied to gene sets having in common recognition elements for transcription factors within promoter regions or miRNA recognition elements in 3'UTR's (table 4). Gene sets with low mean ranking number and FDR below 0.05 were interpreted as significantly up-regulated and gene sets with high mean ranking number were considered down-regulated. In this analysis it is striking that all significant gene sets are up-regulated and E2F family members are strongly overrepresented. Several different gene sets predicted to be recognized by the same transcription factor are included in the analysis because different variants of binding motifs have been reported to TransFac database. Elements with unknown transcription factor are predicted by bioinformatic comparison of promoter regions [13].

Discussion
We have performed meta-analysis of 8 publicly available gene expression data sets to identify common biological mechanisms involved in metastasis of breast cancer. The identified pathways can be grouped into a limited number of categories: cell cycle and proliferation, growth factor pathways, metabolism, angiogenesis, gleevec, migration, signal transduction, proteasome pathway, immune system, and DNA damage sensing and repair, which are discussed in section 1 below. Some of the pathways are supported by key transcription factors identified by motif analysis. In section 2, the methods used for pathway analysis and the advantage of meta-analysis are discussed.

Cell cycle and proliferation
Among the pathways most up-regulated in metastasizing tumors, identified by GSEA, are cell cycle pathways represented by five cognate pathways (CELL_CYCLE_KEGG, CELLCYCLEPATHWAY, G1PATHWAY, G1_TO_S_CELL_REACTOME, and G2PATHWAY). This is supported by GenMAPP analysis identifying two of these pathways to be most differently regulated between metastasizing and non-metastasizing tumors (CELL_CYCLE_KEGG and G1_TO_S_CELL_ REAC-TOME). Up-regulation of cell cycle and proliferation is a hallmark of cancer cells compared to normal cells. Our observation, that cell cycle genes are up-regulated in cells   with metastatic potential compared to non-metastatic cells, is in agreement with some previously studies demonstrating cell cycle as the major up-regulated pathway in metastasizing breast cancer [5,10]. Up-regulation of DNA_REPLICATION_REACTOME is required for cell division and is sustained by both GSEA and GenMAPP. Up-regulation of Purine and pyrimidine metabolism pathways most likely also reflect elevated biosynthesis of DNA, and pentose phosphate pathway serves to generate ribose 5-phosphate that is a precursor in nucleotide synthesis and NADPH that donates electrons for biosynthesis of several molecules. Furthermore, pathways involved in transcription (RNA_TRANSCRIPTION_REACTOME, mRNA_PROCESSING_REACTOME) and translation (AMINOACYL_TRNA_BIOSYNTHESIS) are up-regulated. The crucial role of cell cycle and proliferation genes in metastasis is strongly supported by several up-regulated gene sets with recognition sites for E2F identified by motif analysis (table 4). This family of transcription factors have several targets involved in cell cycle e.g. cyclin E [16] and expression of some E2F family members have been associated with poor prognosis in breast carcinomas [17]. Furthermore, NRF1 a transcription factor that co regulates a large number of E2F target genes [18] and YY1 associated with unchecked cellular proliferation [19] are also up-regulated according to the motif analysis.

Growth factors
Two growth factor pathways are down-regulated in tumors with metastatic potential: Epidermal growth factor (EGF) and platelet-derived growth factor (PDGF) pathways. This is surprising and one might expect the opposite findings in cells with elevated proliferation. EGFR is known to transmit mitogenic signals from EGF and TGFα to several downstream signaling cascades: Phospholipase C, RAS, phosphatidylinositol-3 kinase (PI-3K), and STAT's. However, emerging evidence suggests an alternative mechanism that involves transport of activated EGFR from the cell membrane to the nucleus and direct association with gene promoters. Among the targets for nuclear EGFR is cyclin D1 and iNOS [20]. The present results indicate down-regulation of conventional mechanism which might reflect activation of nuclear localization pathway. The majority of genes in the EGF gene set are functioning in the second messenger system while the nuclear localization pathway only involves EGFR. EGFR itself is not differentially regulated (data not shown), but it might be activated by other mechanisms. However, the activity of the transcription factor ELK1 that is downstream target in the conventional pathway is up-regulated according to motif analysis indicating complex alterations of the EGF pathway. Conflicting results have been reported for the prognostic significance of EGF expression in breast cancer (reviewed by Chan et al [21]). PDGF signaling has previously been linked with breast cancer metastasis in a mouse model. Furthermore, increased expression of PDGFR and other key molecules has been measured in invasive carcinomas compared to intra ductal carcinomas and normal tissue [22]. This is conflicting with present results showing overall down-regulation of PDGF pathway. The oncogene RAS which gene product is activated by among others PDGF is however, up-regulated in 5 of 7 data sets (data not shown). The explanation for this discrepancy is unclear. The results indicate differential expression of several genes between metastasizing and non-metastasizing human breast tumors, while other mechanisms may be present in mice models and in the non-metastatic progression of tumors. Other mechanism than the traditional signal transduction pathways might also be involved. Furthermore, methodological problems for example in the definition of gene sets may have impact. Other mechanisms than gene regulation are surely involved in these pathways.

Metabolism
Several pathways involving basic energy metabolizing is activated in metastasizing tumor cells: Glycolysis and glyconeogenesis, citric acid cycle, oxidative phosphorylation, and electron transport chain. The pathway glycolysis and glyconeogenesis cover presumably only glycolysis, because glyconeogenesis is restricted to liver and certain other organs. Glycolysis pathway converts glucose to pyrovate which is subsequently decarboxylated in several reactions in citric acid cycle, a pathway that is up-regulated according to GenMAPP. NADH and FADH2, generated in citric acid cycle, contain electrons that are transferred to molecular oxygen in electron transport chain and oxidative phosphorylation. This results in generation of a proton gradient across mitochondrial inner membrane, driving synthesis of ATP. Fructose and mannose metabolism pathway, up-regulated according to GSEA, presumably cover the entry of fructose into glycolysis by hexokinase, a general mechanism in adipose tissue, where fructose level is high [23]. Hexokinase is believed to be central for maintaining a high glycolytic phenotype that characterizes cancer cells compared to normal cells especially under conditions of hypoxia often present in cancer cells [24]. This is supported by motif analysis indicating activation of NFY and SP1 transcription factors which has been shown to activate hexokinase [25]. Furthermore, pathways generating energy from fatty acid (fatty acid and clycerolipid metabolism) and amino acids (valine, leucine, isoleucine, beta-alanine, and histidine degradation) are down-regulated (GSEA).
These findings indicate that tumor cells with a metastatic potential derive energy, to a higher extend, from carbohydrates and to a lesser extend fatty acids and amino acids compared to non-metastasizing tumor cells. Furthermore, elevated synthesis and consumption of ATP appears to be a feature of metastasizing tumor cells. Higher glucose metabolism is also a characteristic of cancer cells, compared to normal cells, used for detection of tumors and metastases by PET scanning [26]. Glucose metabolism has been hypothesized to improve survival of cancer cells under hypoxia which is often observed in tumors [27].
Here we demonstrate prognostic disadvantage of higher glucose metabolism in primary breast tumors. Furthermore, targets for the transcription factor hypoxia inducible factor 1 (HIF1) is induced according to motif analysis, strongly supporting the recent theory that HIF1 is an inducer of glycolysis in response to hypoxia [27]. To our knowledge no previous report has linked both elevated glycolysis and decreased amino acid and fatty acid degradation to breast cancer metastasis. Furthermore, the metabolic shift is reflected in many sustaining pathways supporting the results.

Angiogenesis
Increased angiogenesis is believed to be an important qualification for survival of breast cancer cells during hypoxia often featuring breast tumors. Vasculature develops in response to growth factors like VEGF released by the tumor. Furthermore, VEGF receptors on tumor cells initiate a autocrine signaling response that facilitates survival in hypoxia and in response to other apoptotic stimuli [28]. This is supported by up-regulation of VEGF pathway in present data sets. VEGF-inhibitors are being tested as breast cancer drugs [29]. Furthermore, PPAR pathway is down-regulated in metastasizing tumor cells. PPARgamma is a ligand-activated transcription factor that has been associated with decreased angiogenesis and invasiveness and with increased patient survival [30][31][32] in agreement with present results.

Gleevec pathway
In chronic myelogenous leukemia (CML) the tyrosine kinase ABL is often activated by a chromosomal translocation, t(9;22), resulting in fusion of BCR and ABL genes. The kinase activity of ABL in the BCR-ABL fusion is activated and up-regulated; driving the uncontrolled cell growth observed in CML. Gleevec is developed to inhibit ABL and also inhibits the PDGF receptor tyrosine kinase and the c-kit tyrosine kinase. The gene set included in the present analysis only includes BCR-ABL pathway. Gleevec has been introduced as anti-cancer drug not only against cancer cells with BCR-ABL translocation, but also in some cancers not having the translocation [33]. However, the BCR-ABL and PDGF pathways inhibited by gleevec is down-regulated in metastasizing tumors according to our analysis, which might indicate that inhibition of ABL is not relevant for treatment of breast cancer.

Migration
Mammalian cell motility requires actin polymerization in the direction of movement to change membrane shape and extend cytoplasma into lamellipodia. Migration is believed to be central for primary tumor cells to reach blood vessels or lymphatics in order to metastasize [34]. In agreement with this, we find actin pathway is up-regulated in metastasizing tumors.

Signal transduction
S1P pathway is up-regulated according to GSEA. Sphingosine-1 phosphate (S1P) is a lysolipid, acting via cell surface coupled G-protein coupled receptors, and required for migration, proliferation and survival of breast cancer cells. Ectopic expression of the enzyme responsible for S1P production, SphK1, increases tumorigenity and angiogenesis and protect cell from apoptosis [35]. This supports our result, that increased expression of S1P pathway members is involved in metastasis.
Several other signal transduction pathways are regulated. The gene set GPCRDB_other, covers a group of G-protein coupled receptors targeted by a broad range of ligands, is down-regulated. Interleukin pathways (Il-3, Il-6, and Il-7) are also down-regulated possibly indicating decreased inflammatory response in metastasizing cells. Progesterone signaling pathway (MPRPATHWAY) is increased and ER transcriptional activity is enhanced according to motif analysis confirming that elevated sex-hormone load may lead to worse prognosis of breast cancer patients, which is supported by the widely use and benefit of anti-hormonal treatment of breast cancers retaining hormone receptors. On the other hand nuclear hormone receptors (NUCLEAR_RECEPTORS, table 2) theme self are downregulated in metastasizing tumors in agreement with loss of ER and PgR as potent negative prognostic markers.

Proteasome pathway
Another pathway targeted for inhibitory treatment is proteasome, a strategy that is supported by our finding that the pathway is up-regulated in metastasizing tumors. Proteasomes normally perform controlled degradation of proteins and proteins selected for degradation by tagging with a poly ubiquitin chain. In addition, proteasomes are important regulators of several key regulatory proteins including p53, cyclins, CDK inhibitors and NF-kB. The proteasome inhibitor bortezomib has been suggested for treatment of several cancers including breast [36].

Immune system
Immune function is also impaired; tollpathway, forming part of innate immune system, is down-regulated in metastasizing cells (table 2). Toll-like receptors (TLR) are activated by pathogen expressed molecules and initiate immune response by release of pro-inflammatory cytokines. Activation of TLR's with synthetic agonists, have shown promising results for treatment of several cancers, by inducing apoptosis and elevating sensitivity of to cancer cells to chemotherapy [37,38]. The present results may support this strategy in breast cancer [37].

DNA damage sensing and repair
ATRBRCA pathway is up-regulated in poor outcome tumors. Several genes in this pathway are mutated in hereditary breast cancer (BRCA1, BRCA2, ATM, p53, Fanconi anemia genes, and CHEK2) and the pathway is believed to prevent cancer development by inducing cell cycle arrest, DNA repair and apoptosis after DNA-damage [39]. Up-regulation in metastatic cells may indicate a compensatory mechanism or malfunction of this pathway. Motif analysis indicate that BACH1, a transcription factor that interacts directly with BRCA1, is up-regulated [40]. Increased DNA repair has also been reported from Wang et al when inspecting their prognostic classifier genes [5].

Gene expression based pathway and transcription factor analysis
The present analysis of gene expression profiles of primary tumors identifies several pathways associated to clinical outcome. The identified pathway gene sets display significant imbalanced expression between metastasizing and non-metastasizing breast tumors across several data sets. This supports our hypotheses that several pathways are involved in breast cancer metastasis and that they are reflected in expression profiles of primary tumors. Furthermore, motif analysis demonstrates many significantly deregulated gene sets with common transcription factor binding sites. The transcription factor binding sites are identified by bioinformatics analysis and are for the most part not functionally validated [13]. The finding that these gene sets are significantly differentially expressed indicates that the individual genes in the gene sets are actually responding to the relevant transcription factors. Several of the motif gene sets associated to metastasis have been linked to certain pathways as discussed in the pathway sections. Some of the transcription factors have also previously been linked directly to survival. A prominent example is estrogen receptor targeted genes that are upregulated in metastasizing tumors. Another well known transcription factor is Myc for which gene amplification is associated to poor prognosis [41]. Also transcription factor YY1 has been associated with metastatic potential in several cancers [19]. A number of predicted elements with unknown transcriptions factors are present in promoters of up-regulated genes. Further studies are needed to identify these factors and their functions. Interestingly, only up-regulated gene sets are identified in motif analysis, in agreement with our previous observation of a majority of up-regulated genes in metastasizing breast tumors [1]. Predominance of up-regulated pathways with both GSEA (26 out of 38 pathways) and GenMAPP (11 out of 19 pathways) meta-analysis supports this. The false discovery rate used for correction for multiple testing is dependent on the data structure and might bias this conclusion. However, the imbalanced distribution of significant pathway and motif gene sets is confirmed using Bonferroni method (data not shown). No miRNA targeted gene sets are identified in motif analysis which may reflect poor algorithms for prediction of binding sites.
Different outcomes are used in the included studies: relapse, metastasis, distant metastasis and death from breast cancer. This may bias the results because relapse (3 data sets) includes local recurrences and these may be result of suboptimal surgery or mechanism of spreading different from distant metastasis. Regional metastasis, i.e. recurrence in lymph nodes may also be the result of different metastasis mechanism compared to distant metastasis. However, local and regional recurrences constitute a minor fraction of a typical tumor-bank cohort, resulting in minor bias. To illustrate this, we examined a large Danish tumor-bank of primary tumors collected at Funen from 1989 to 1999. Among patients that experienced recurrences, 17% got local recurrence, 4% got regional metastasis, and 79% got distant metastasis (data from DBCG, http://www.dbcg.dk).
A varying fraction of the tumors in the individual data sets have disseminated cells to the lymph nodes. Classification of lymph node positive patients without recurrence as non-metastasis may be controversial. This may bias the results towards the metastatic mechanisms following primary spread to lymph node. Furthermore, a minor group of included patients have received adjuvant treatment that will bias the clinical outcome because a fraction of the patients are responding to the therapy. One data set, Huang, display conflicting results in pathway and motif analysis (table 2, 3, 4). Noteworthy, the patients in this study were all lymph node positive and received chemotherapy, which may explain the different results. Other subgroups of tumors e.g. molecular subtypes are also equalized by our approach resulting in identification of general mechanisms involved in metastasis.
Overall, the included studies mainly contain lower risk node negative untreated groups and the main outcome is distant metastasis. The noise from other groups and outcomes represented in the data sets may reduce the significance of identified pathways. This actually means that our statistical method is rather conservative. Furthermore, the statistical test is even more conservative because only ranking information from individual GSEA and Gen-MAPP analyses are included. Using actual statistical parameters from the analyses would increase the gathered significance. However, due to the different distribution of the parameters it was impossible to use them in metaanalysis.
Few other studies have investigated biological mechanisms in breast cancer gene expression data sets comparing metastatic and non-metastatic outcome. The approaches that have been used mainly concern the genes in outcome classifiers (typically below 100 genes) often including a large fraction of genes with unknown function. These studies have mainly reported cell cycle, growth and proliferation to be upregulated, but some have reported angiogenesis, invasion and signal transduction [10], while others have identified cell death, DNA replication, recombination and repair genes [5] or motility [11]. However, the concordance in these findings besides cell cycle is low. We have in a previous study performed meta-analysis of genes present in classifiers and found cell cycle related gene ontology categories to be strongly overrepresented, while other categories were not present in more than one gene set [12]. Furthermore, classifier genes are very dependent of classification algorithm, definition of training and testing set, and many different classifiers may result in same classification performance [42]. Yu et al. [43] recently performed pathway analysis of 500 gene signatures generated by re-sampling training sets from the Rotterdam data set. In this way they minimized the bias from definition of training set, however, they collected the 100 most significant genes each time, ignoring biological mechanisms beyond the most strongly deregulated pathways.
Very different from this restricted approach this study utilizes GSEA including all genes in data sets, offering the possibility to reveal mechanism beyond cell cycle associated to metastasis. Furthermore, meta-analysis identifies gene sets with remarkably concordance between data sets. The findings are strengthened by concordant findings with GenMAPP utilizing only the most significant genes. Compared to the overrepresentation analysis often performed for classifiers, GenMAPP includes app. 500 significant genes improving the likelihood of identification of relevant biological functions. Disagreement between the two programs for some gene sets identified by one program but not the other may be explained by the different algorithms. GenMAPP may tend to ignore gene sets that are composed of genes with weak individual contribution, while GSEA can give these a high combined enrichment score if the fraction of these genes in a gene set is high. Another explanation for discordant finding is different definition of gene sets. Several GSEA gene sets are adopted from GenMAPP, but there are several other contributers [15].
Sorlie at al. has also used global gene expression patterns of large sample sets of breast tumors to investigate biolog-ical mechanisms in breast cancer [44]. However, they used a very different approach by performing unsupervised clustering of samples to identify new subgroups that subsequently were found to be associated with clinical outcome. Their results point at different progenitor cell types and estrogen and HER2 receptor status, but do not elucidate the different biochemical pathways involved in metastasis. Very different from this, we have not aimed at sub grouping tumors but instead intended to investigate general metastatic pathways. Several pathways are also supported by motif analysis indicating plausibly causal explanation for observed gene expression differences.
Many of the identified pathways and transcription factor have previously been identified with different focused techniques. By our global approach we have validated these findings. Furthermore, a number of new pathways and transcription factors and promoter elements not previous linked to metastasis are identified. Further validation and functional testing is required to confirm their role in metastasis.

Conclusion
By performing pathway meta-analysis, we have identified several pathways involved in breast cancer metastasis. Cancer is a genetic disease, and somatic mutations and genomic instability are features of cancer development and progression. In agreement with this, we find DNA damage and repair pathway up-regulated in metastasizing breast tumors. Uncontrolled cell cycle, a feature of cancer cells compared to normal cells, also characterize metastatic cells compared to non-metastatic cells. Growth factors are often hypothesized to play central role in cancer proliferation and progression. We identify several changes in growth factor pathways: PDGF and EGF pathways are reduced, while signaling by VEGF, estrogen and progesterone, is enhanced. The high proliferation rate in tumors may lead to hypoxia, and in order for cancer cells to survive this, defense mechanisms are required and attraction of blood delivery is central. Changed metabolism towards glucolysis at expense of amino acids and lipids, help cancer cells to survive hypoxia, and angiogenesis improve blood delivery thereby reducing hypoxia. Angiogenesis also ensures a shorter distance for cells to the blood stream, and enhanced migration is essential to cover this distance. Immune system may protect against cancer progression supported by our finding that innate immune system is impaired in metastasizing tumors. Finally, proteasome pathway, an important regulator of proliferation and apoptosis, is impaired.