  • Research article
  • Open access

Prediction of breast cancer by profiling of urinary RNA metabolites using Support Vector Machine-based feature selection



Breast cancer is among the most frequent and severe cancer types in humans. Since excretion of modified nucleosides from increased RNA metabolism has been proposed as a potential target in the pathogenesis of breast cancer, the aim of the present study was to assess how well breast cancer can be predicted from urinary excreted nucleosides.


We analyzed urine samples from 85 women with breast cancer and corresponding healthy controls to assess the metabolic profiles of nucleosides by a comprehensive bioinformatic approach. All included nucleosides/ribosylated metabolites were isolated by cis-diol specific affinity chromatography and measured with liquid chromatography ion trap mass spectrometry (LC-ITMS). A valid set of urinary metabolites was selected by excluding all candidates with poor linearity and/or reproducibility in the analytical setting. The Oscillating Search Algorithm for Feature Selection (OSAF) was applied to iteratively improve the features used for training Support Vector Machines (SVM) to better predict breast cancer.


After identification of 51 nucleosides/ribosylated metabolites in the urine of women with breast cancer and/or controls by LC-ITMS coupling, a valid set of 35 candidates was selected for subsequent computational analyses. OSAF resulted in 44 pairwise ratios of metabolite features by iterative optimization. Based on this approach, estimated sensitivity and specificity of 83.5% and 90.6%, respectively, were obtained for the best prediction of breast cancer. The classification performance was dominated by metabolite pairs with SAH, which highlights its importance for RNA methylation in cancer pathogenesis.


Extensive RNA-pathway analysis based on mass spectrometric analysis of metabolites and subsequent bioinformatic feature selection allowed for the identification of significant metabolic features related to breast cancer pathogenesis. The combination of mass spectrometric analysis and subsequent SVM-based feature selection represents a promising tool for the development of a non-invasive prediction system.



Among all cancer diseases, breast cancer is the most frequent cause of death worldwide for women between 30 and 60 years of age, responsible for approximately 500,000 deaths in 2002 [1]. Successful treatment of cancer is inherently linked to early-stage diagnosis. The determination of tumor markers represents an integral part of clinical therapy concepts. Unfortunately, the established markers of breast cancer (e.g. CA-15-3 and CEA) offer only unsatisfactory prediction accuracy and are therefore not recommended for early diagnosis and therapy surveillance [2].

New technological and biological developments have the potential to increase the likelihood of discovering new biomarker candidates. In the systems biology context, novel targets have been identified on the genome-, transcriptome- and proteome level. Recently, the metabolome, representing the end products of physiological processes, has experienced increasing clinical attention.

Cell proliferation can also be controlled by metabolites in a way similar to direct gene regulation. By triggering concentration-dependent state changes in the expression of transcription factors or induction of epigenetic processes, metabolites are able to influence cancer pathogenesis and therefore may play a critical role during tumor progression.

Modified nucleosides, which are degradation products of the cellular RNA metabolism, are suggested to be important as possible tumor markers. In addition to the primary constituents adenosine, guanosine, uridine and cytidine, series of derived modified analogs are well known. These modifications (e.g. methylation, sulfur/oxygen-substitution, hypermodification) are posttranscriptionally implemented in the polynucleotide macromolecules and are considered to increase efficiency, activity and integrity of RNA function [3]. Currently more than 100 modified structures are known for various RNA types [4].

During RNA turnover, hydrolytic enzymes catabolize polynucleotides to the ribonucleoside level. The common ribonucleosides and corresponding nucleobases can partly be recycled to rebuild intracellular RNA in the salvage pathway. Due to the lack of specific phosphorylases, modified nucleosides cannot enter this recycling pathway and are therefore excreted quantitatively as biochemical end products [5]. Any disease or metabolic imbalance affecting RNA turnover consequently results in altered nucleoside excretion patterns, leading to the hypothesis that RNA metabolites may be useful as tumor markers. Supporting this idea, significantly increased amounts of modified nucleosides were found in urine from patients suffering from breast carcinoma [6], leukemia [7] and lung carcinoma [8].

In terms of analytics, the coupling of liquid-, gas- or capillary liquid chromatography with mass spectrometric techniques like ESI-IT MS [9] and ESI tandem MS [10] has been established as method of choice. In addition, systems such as ESI-TOF MS [11], MALDI-TOF MS [12] and especially FTICR MS [13] are valuable tools for the elucidation of chemical structures.

The aim of our study was to classify patients with breast cancer compared to healthy volunteers, based on LC-MS analysis of urinary nucleosides using machine learning techniques, which extract patterns from data and build predictors. For instance, principal component analysis (PCA) is a commonly used method, which was applied by Yang et al. for classification of liver cancer patients by means of HPLC-UV analysis. Based on a set of 15 nucleosides, 83% of the tumor patients were correctly classified [14]. Artificial neural network (ANN) analysis of urinary nucleosides was used by Seidel et al. to distinguish between healthy controls and patients suffering from various cancer diseases, yielding a sensitivity of 97% and a specificity of 85% [15].

Recently, the support vector machine (SVM) has become increasingly popular due to its kernel approach and high practical robustness. This technique has been applied in various clinical research projects analyzing tumor-associated variances in the genomic profile [16], in protein expression [17] and in metabolic patterns [18]. Modified nucleosides have also been the target of SVM approaches. For example, Mao et al. [19] utilized CE-MS measurements of RNA metabolites for classification of bladder cancer patients (sensitivity 90%, specificity 100%), whilst previous work in our research group also revealed the classification potential of modified nucleosides (sensitivity 94%, specificity 86% [20] and sensitivity 88%, specificity 90% [21]).

Whereas clinical metabolomics often analyzes absolute concentration values of a restricted set of metabolites [15, 19, 21], the present work follows an extended approach. In accordance with the network characteristics of metabolism, we additionally analyzed compounds from pathways interconnected with cellular RNA catabolism, such as histidine metabolism, purine biosynthesis and the methionine/polyamine cycle, as well as from nicotinate/nicotinamide metabolism (Figure 1). Furthermore, we used pairwise encoded metabolite ratios in order to assess tumor-associated shifts between substrates in the metabolic flux.

Figure 1

Some metabolite structures. Structures of some previously unknown urinary metabolites included in this study. M-4: structure proposal based on combined FT MS and IT MSn analysis. Others: identified in previous works [13].



Methanol LiChroSolv, hypergrade, purchased from Merck/VWR (Darmstadt, Germany) was used for liquid chromatography. Water was taken from an in-house double distillation system. All other chemicals used were of analytical grade.

Standard compounds available as reference for HPLC separation and/or compound identification [13] were dihydrouridine (DHU), pseudouridine (Ψ), cytidine (C), pyridine, 3-hydroxypyridine, uridine (U), 3-methylcytidine (m3C), 1-ribosyl-4-carbamoyl-5-aminoimidazole (AICA riboside), 1-methyladenosine (m1A), 7-methylguanosine (m7G), inosine (I), 3-methyluridine (m3U), adenylosuccinic acid (phosphorylated analog of N6-succinyladenosine), xanthosine (X), S-adenosylhomocysteine (SAH), 1-methylinosine (m1I), 1-methylguanosine (m1G), N4-acetylcytidine (ac4C), N2-methylguanosine (m2G), N2,N2,7-trimethylguanosine (m2,2,7G), N2,N2-dimethylguanosine (m2 2G), N6-threonylcarbamoyl-adenosine (t6A), 5'-deoxy-5'-methyl-thioadenosine (MTA).

All standards were from Sigma (Taufkirchen, Germany) except m2 2G, m2,2,7G and t6A, obtained from Biolog (Bremen, Germany), 1-methyl-L-histidine, purchased from Calbiochem/Merck (Nottingham, UK) and pyridine from Gruessing (Filsum, Germany). The internal standard isoguanosine was kindly donated by Prof. J.H. Kim of Seoul University, South Korea. Affigel boronate was purchased from Biorad (Richmond, USA).

Urine samples

Spot urine samples were collected from 85 female breast cancer patients (primarily in early tumor stage T1) at the Department of Gynecology and Obstetrics, University Hospital Tübingen, and from 85 healthy female volunteers in a private clinical office (i.e. women accompanying their children to the clinical office). Tumor stage and age distributions are given in Figures 2 and 3. The clinical trial was approved by the local ethics committee of University Hospital Tübingen. In order to minimize possible endo- and exogenous perturbations of the urinary metabolite pattern, we defined precise criteria for patient recruitment. The samples were taken preoperatively; neoadjuvant endocrine therapy, irradiation or chemotherapy were not allowed. Patients taking immunomodulating drugs, antibiotics, mistletoe preparations, virustatics, allopurinol or dipyridamole were not included in this study. Pregnancy, immune-mediated diseases, HIV, acute and chronic hepatitis, chronic renal failure, acute infection of the urinary tract, as well as the patient's participation in a clinical drug trial, were defined as exclusion criteria. All samples were stored at -80°C until extraction.

Figure 2

Tumor stage distribution. Histogram of the tumor stage distribution. The major fraction of patients had breast cancer in stage T1. The remaining patients were mostly T2, with the exception of 11.7% divided between the T3 and Tis stages.

Figure 3

Age distribution. Histogram of the age distributions for cancer and control patients.

Sample preparation

The metabolites were isolated from urine samples by cis-diol specific affinity chromatography with 500 mg affigel boronate per column (column dimensions: 150 × 15 mm). A volume of 1 mL urine was spiked with 50 μL of internal standard solution (0.1 mM isoguanosine in water), mixed with 9 mL ammonium acetate solution (0.25 mM, pH 8.8) and then loaded onto the column following preconditioning with 45 mL ammonium acetate solution (0.25 mM, pH 8.8). Because of the high backpressure from the affigel boronate material, compressed air was applied throughout the extraction procedure to maintain a moderate flow rate of 3–4 mL/min. Ribosylated compounds are bound reversibly and specifically at the 2',3'-cis-diol group. After washing with 10 mL ammonium acetate solution (0.25 mM, pH 8.8) and 4 mL ammonium acetate solution (0.25 mM, pH 8.8)/methanol (9.5:0.5, v/v), elution was carried out with 6 mL methanol/water (2:8, v/v) and 50 mL 0.2 M formic acid in methanol/water (1:1, v/v). The column was reconditioned with 25 mL methanol/water (2:8, v/v) and 45 mL ammonium acetate solution (0.25 mM, pH 8.8) for the next sample. After every second extraction, a blank sample (10 mL ammonium acetate solution (0.25 mM, pH 8.8)) was processed analogously to remove impurities from the column and to avoid possible carry-over effects. The solvent was removed from the sample eluate using a rotary evaporator and the residue was redissolved in 0.5 mL ammonium formate solution (5 mM, pH 5). A volume of 10 μL was injected into the HPLC-MS system.


The chromatographic separation of the urinary metabolites was performed on an Agilent 1100 Series HPLC system (Agilent, Waldbronn, Germany) consisting of a Solvent Degasser (G 1379 A), a binary capillary pump (G 1389), an autosampler (G 1313 A), a column oven operated at 25°C (G 1316 A) and a DAD (G 1315 B). The chromatographic system consisted of a Merck LiChroCART Superspher 100 RP-18 endcapped column (125 × 2.0 mm i.d., 4 μm (Merck, Darmstadt, Germany)) and a solvent system of 5 mM ammonium formate buffer, pH 5.0, and methanol-water (3:2, v:v + 0.1% formic acid) at a flow rate of 125 μL/min [9]. The LC-system was coupled to an Esquire HCT-Ion trap mass spectrometer (Bruker Daltonics, Bremen, Germany), equipped with an ESI source and operated in positive ion detection mode.

The capillary voltage was set to 4 kV, the drying gas temperature in the electrospray source was set to 350°C, the nebulizer gas was set to 45 psi and the drying gas to 9.0 L/min. The data were acquired in standard enhanced scan mode (8,100 m/z per second) over a mass range of m/z 200–600 via Bruker EsquireControl version 5.1. For post processing, Bruker DataAnalysis version 3.1 was used.

Integration procedure

Semiquantitative concentration values were obtained via integration of Extracted Ion Chromatograms (EIC). Due to the significant alkali affinity of certain analyzed metabolites, we generally summed the corresponding [MH]+, [MNa]+ and [MK]+ traces. The EICs were processed with a Gauss function smoothing algorithm contained in the DataAnalysis software. For analytical and physiological normalization, the integrated peak areas were related to the internal standard and the urinary creatinine level (in mg/dl):

$$ c_{\mathrm{rel}} = \frac{A_{\mathrm{metabolite}}}{A_{\mathrm{IS}} \cdot c_{\mathrm{creatinine}}} $$
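The normalization step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the quotient form (peak area divided by internal standard area and creatinine level) is inferred from the description above, and the function names and peak areas are hypothetical.

```python
def summed_adduct_area(adduct_areas):
    """Sum the integrated EIC areas of the [MH]+, [MNa]+ and [MK]+ traces."""
    return sum(adduct_areas.values())


def normalized_level(metabolite_area, internal_standard_area, creatinine_mg_dl):
    """Relate a peak area to the internal standard and to urinary creatinine.

    Assumption: the normalization is a plain quotient; any further scaling
    factor that may have been used in the study is not reproduced here.
    """
    if internal_standard_area <= 0 or creatinine_mg_dl <= 0:
        raise ValueError("internal standard area and creatinine must be positive")
    return metabolite_area / (internal_standard_area * creatinine_mg_dl)


# Hypothetical peak areas (arbitrary units) for one metabolite in one sample.
areas = {"[MH]+": 150.0, "[MNa]+": 40.0, "[MK]+": 10.0}
level = normalized_level(summed_adduct_area(areas), 100.0, 50.0)
```

Because only ratios of such values enter the later pairwise encoding, a constant scaling factor would cancel out there.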


Bioinformatic data analysis

For bioinformatic feature selection we encoded pairwise combinations of the semiquantitative concentrations of the 35 included metabolites (x, y). This resulted in 35 × 34 = 1190 encodings per sample. We used the encoding function

$$ e(x, y) = \arctan\left(\frac{x}{y}\right) $$

and defined the case $e(x, 0) = \pi/2$ when y was not detected (y = 0).

Two problems are solved by this encoding, which should be considered a normalization step. First, a consistent value is obtained for the case where a value was below the detection threshold; second, the encoding maps e(x, y) and e(y, x) onto different codomain ranges, conserving argument order information. For more information, see Additional files 1, 2 and 3.
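A minimal sketch of such a pairwise arctan-encoding is given here for illustration. It assumes the encoding is the arctangent of the ratio x/y with the limit π/2 for an undetected y; the handling of the case where both values are zero, the function names and the sample values are assumptions made for this example.

```python
import math
from itertools import permutations


def encode(x, y):
    """Arctan-encode the ratio x / y of two non-negative concentrations."""
    if y == 0:
        # y not detected: use the limit of arctan(x / y) for y -> 0.
        # Assumption: the same value is used when x is 0 as well.
        return math.pi / 2
    return math.atan(x / y)


def pairwise_encodings(sample):
    """Return {(a, b): e(a, b)} for all ordered pairs of distinct metabolites."""
    return {(a, b): encode(sample[a], sample[b])
            for a, b in permutations(sample, 2)}


# Hypothetical semiquantitative values for three metabolites.
sample = {"m1A": 2.0, "SAH": 0.5, "MTA": 0.0}
features = pairwise_encodings(sample)
```

For positive values, a ratio above 1 is mapped into (π/4, π/2) and its reciprocal into (0, π/4), which is how the argument order is preserved. With 35 metabolites the same function yields 35 × 34 = 1190 features per sample.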

Next, we applied Linear Discriminant Analysis (LDA) [22] to visualize our encoding. As in the case of Principal Component Analysis (PCA) a linear model is used to visualize the data. In contrast to PCA the aim of LDA is to find a linear model that maximally separates the classes on a straight line.
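To illustrate the idea of projecting two classes onto a separating line, the following is a minimal two-feature Fisher discriminant in plain Python. It is a sketch only: the study performed LDA in R on the full encoding, and the data points and function names used here are hypothetical.

```python
def mean_vec(rows):
    """Component-wise mean of a list of 2-D points."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter(rows, mu):
    """2x2 scatter matrix of the points around their mean mu."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mu[0], r[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class0, class1):
    """Direction w = Sw^-1 (m1 - m0) that best separates the two classes."""
    m0, m1 = mean_vec(class0), mean_vec(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    dm = [m1[0] - m0[0], m1[1] - m0[1]]
    # Explicit inverse of the 2x2 within-class scatter matrix.
    return [(sw[1][1] * dm[0] - sw[0][1] * dm[1]) / det,
            (-sw[1][0] * dm[0] + sw[0][0] * dm[1]) / det]


def project(rows, w):
    """Project each point onto the line defined by w."""
    return [r[0] * w[0] + r[1] * w[1] for r in rows]


# Hypothetical two-feature samples for both collectives.
healthy = [[0.2, 0.9], [0.4, 1.1], [0.3, 0.7], [0.5, 1.0]]
cancer = [[0.9, 0.8], [1.1, 1.2], [1.0, 0.9], [1.2, 1.0]]
w = fisher_direction(healthy, cancer)
```

In this toy data set, all projected healthy samples fall below all projected cancer samples, which is the kind of separation a one-dimensional LDA plot visualizes.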

To compare the encodings, we computed the LDA projections to visualize the data set with and without arctan-encoding (see figure 4). Because of the risk of overfitting using more than 1190 features, nonparametric feature selection is needed to reduce the number of features used for prediction.

Figure 4

LDA analysis. Projection of the class distribution onto a straight line by Linear Discriminant Analysis for the discrete and the arctan-encoded metabolite ratios. As can be seen, the pairwise-encoding offers a better partitioning of cancer and healthy collectives by a linear model than the sole concentration features.

Therefore we used the Oscillating Search Algorithm for Feature Selection (OSAF) [23] in combination with a SVM to select a reduced set of optimized features for classification. The OSAF wrapper method applies an efficient strategy to select combinations from the power set of features and uses the SVM as a black box to assess the information content [24].

Our implementation operates in up- and down-swing phases, which are based on Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS). SFS greedily includes the feature that most reduces the prediction error, while SBS removes the feature that contributes least to reducing the error.
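The alternation of up- and down-swing phases can be sketched as follows. This is a deliberately simplified, depth-one variant written for illustration, not the Perl implementation used in the study; the scoring function is a hypothetical stand-in for the cross-validated SVM performance, and all names are assumptions.

```python
def forward_step(selected, candidates, score):
    """SFS step: add the single feature that gives the highest score."""
    remaining = sorted(candidates - selected)
    best = max(remaining, key=lambda f: score(selected | {f}))
    return selected | {best}


def backward_step(selected, score):
    """SBS step: drop the single feature whose removal hurts the score least."""
    drop = max(sorted(selected), key=lambda f: score(selected - {f}))
    return selected - {drop}


def quality(subset, score):
    """Prefer a higher score; on ties prefer the smaller subset."""
    return (score(subset), -len(subset))


def oscillating_search(candidates, score, max_rounds=50):
    """Alternate one up-swing and one down-swing until nothing improves."""
    candidates = frozenset(candidates)
    best = frozenset()
    for _ in range(max_rounds):
        current, improved = best, False
        if current != candidates:  # up-swing
            current = forward_step(current, candidates, score)
            if quality(current, score) > quality(best, score):
                best, improved = current, True
        if len(current) > 1:  # down-swing
            current = backward_step(current, score)
            if quality(current, score) > quality(best, score):
                best, improved = current, True
        if not improved:
            break
    return best


# Hypothetical black-box score: features "a" and "b" help, others cost a little.
USEFUL = {"a", "b"}


def toy_score(subset):
    return len(subset & USEFUL) - 0.1 * len(subset - USEFUL)


selected = oscillating_search({"a", "b", "c", "d"}, toy_score)
```

In a wrapper setting, `toy_score` would be replaced by a function that trains the classifier on the given feature subset and returns its estimated generalization performance.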

Having selected a feature set, the algorithm uses the SVM to train a predictor for estimation of the generalization performance. We evaluated the 10-fold cross validated (CV) Matthews Correlation Coefficient (MCC) as measure.

Given the true positives (TP), the true negatives (TN), the false positives (FP) and the false negatives (FN), the MCC is computed as

$$ \mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} $$


This results in a value between +1.0 for perfect predictions and -1.0 for maximal false predictions.
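As a worked illustration, the MCC can be computed directly from the four confusion counts. The counts below are an assumption: they are inferred from the reported 10-fold cross-validation sensitivity (83.5%) and specificity (90.6%) for 85 patients and 85 controls, not read from the original results table.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from the four confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0 when a marginal sum is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


# Inferred counts: 71 of 85 patients and 77 of 85 controls classified correctly.
tp, fn, tn, fp = 71, 14, 77, 8
score = mcc(tp, tn, fp, fn)
```

With these inferred counts the MCC is roughly 0.74, well above the value of 0 expected for random predictions.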

Furthermore, we computed the Leave-one-out (LOO) estimate, which is an almost unbiased estimate for the true generalization error [25].
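Both estimates follow the same scheme: k-fold cross-validation holds out one fold at a time, and leave-one-out is the special case in which k equals the number of samples. The sketch below shows this with a trivial threshold classifier standing in for the SVM; the classifier, the data and all names are hypothetical.

```python
def fold_indices(n, k):
    """Assign sample indices 0..n-1 to k interleaved folds."""
    return [list(range(start, n, k)) for start in range(k)]


def cross_validate(xs, ys, k, train, predict):
    """Pool the confusion counts (TP, TN, FP, FN) over k held-out folds."""
    tp = tn = fp = fn = 0
    n = len(xs)
    for fold in fold_indices(n, k):
        held_out = set(fold)
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        model = train(train_x, train_y)
        for i in fold:
            predicted = predict(model, xs[i])
            if predicted == 1 and ys[i] == 1:
                tp += 1
            elif predicted == 1:
                fp += 1
            elif ys[i] == 0:
                tn += 1
            else:
                fn += 1
    return tp, tn, fp, fn


def train_threshold(xs, ys):
    """Toy stand-in for the SVM: threshold halfway between the class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


# Hypothetical one-feature data set: label 1 = cancer, 0 = healthy.
xs = [0.2, 0.3, 0.25, 0.4, 0.9, 1.1, 1.0, 0.8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
cv4 = cross_validate(xs, ys, 4, train_threshold, predict_threshold)
loo = cross_validate(xs, ys, len(xs), train_threshold, predict_threshold)
```

The pooled counts can then be passed to any performance measure, such as the MCC.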

During each evaluation of a feature set, the SVM model parameters were chosen by grid search. We used a modified implementation of LibSVM [26] that reports all statistics needed for the computation of the MCC, together with an OSAF wrapper written in Perl. The LDA analysis was performed in R and the mutual information below was computed using Matlab.

To remove redundancy in the list of features, we classified the selected features as tumor-relevant or non-tumor-relevant according to current literature knowledge. Then we applied SBS to remove all features which had no impact on the MCC and were not tumor-relevant.

To visualize the importance of each selected feature, we computed the mutual information [22], defined as

$$ I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} $$


This value represents a quantity which measures the mutual dependence between two variables (here class label and metabolite encodings). Although prediction performance is obtained from a complex set of variables, even those variables with small information content may be essential in combination with others (see [24]).
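For discrete variables the mutual information can be estimated from relative frequencies, as in the following sketch. It assumes that a continuous encoding has already been discretized into bins and reports the result in bits (base-2 logarithm); both choices, like the example data, are assumptions and may differ from the Matlab computation used in the study.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Estimate I(X; Y) in bits for two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x, y) / (p(x) p(y)) simplifies to count_xy * n / (count_x * count_y).
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


# Hypothetical class labels and two binned features.
labels = [1, 1, 0, 0]
informative = ["high", "high", "low", "low"]
uninformative = ["high", "low", "high", "low"]
mi_informative = mutual_information(informative, labels)
mi_uninformative = mutual_information(uninformative, labels)
```

A feature that determines a balanced binary label completely carries one bit of information, while a feature that is independent of the label carries none.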


Generating a valid metabolic profile

Based on a set of 51 detectable urinary cis-diol metabolites in the applied sample volume of 1 mL urine, we first attempted to define a valid metabolic profile. To this end, two main criteria were established for the inclusion of compounds. First, the respective metabolite should meet the analytical criteria of linearity and reproducibility. Second, the biochemical origin of the compound should constitute a possible tumor-associated background.

In this manner, 16 compounds were excluded due to poor linearity/reproducibility and/or missing pathophysiological relevance. In the latter case, we eliminated potential secondary metabolites from endosymbiontic bacteria, metabolites influenced by nutrition [13] and compounds originated or influenced in sample preparation. The resulting metabolic profile for SVM training is shown in Table 1.

Table 1 Included metabolites

Proof of reproducibility

For proof of reproducibility, 10 mL of a spot urine sample were spiked with 500 μL internal standard (0.1 mM isoguanosine in water). The obtained solution was divided into ten aliquots of 1 mL. Each aliquot was mixed with 9 mL ammonium acetate solution (0.25 mM, pH 8.8) to give a sample volume of 10 mL, vortexed and processed as described in sections "extraction procedure" and "integration procedure". The obtained values for reproducibility are shown in Table 1. A compound was considered to be reproducible for RSD values ≤ 15%.
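The reproducibility criterion can be illustrated with a short calculation of the relative standard deviation (RSD). The ten aliquot values are hypothetical, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption.

```python
import statistics


def rsd_percent(values):
    """Relative standard deviation in percent (sample standard deviation)."""
    return 100 * statistics.stdev(values) / statistics.mean(values)


# Hypothetical normalized signals of one metabolite in ten aliquots.
aliquots = [1.02, 0.98, 1.05, 0.95, 1.00, 1.03, 0.97, 1.01, 0.99, 1.00]
rsd = rsd_percent(aliquots)
is_reproducible = rsd <= 15
```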

Proof of linearity

For proof of linearity, two different spot urine samples were divided into specimens of 0.25 mL, 0.5 mL, 1 mL, 2 mL and 4 mL. Each sample was spiked with 50 μL internal standard and mixed with 9 mL ammonium acetate solution (0.25 mM, pH 8.8) to give a sample volume of 10 mL. The obtained solutions were processed as described. The obtained values are shown in Table 1. A compound was considered linear for regression coefficients ≥ 0.95.
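A linearity check of this kind can be sketched as below. It assumes that the regression coefficient refers to the Pearson correlation between urine volume and normalized signal (the squared coefficient would be a stricter alternative); the signal values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Urine volumes from the dilution series and hypothetical normalized signals.
volumes_ml = [0.25, 0.5, 1, 2, 4]
signal = [0.26, 0.49, 1.02, 1.95, 4.1]
r = pearson_r(volumes_ml, signal)
is_linear = r >= 0.95
```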

Feature selection with best classification performance

As can be seen in Figure 4, both collectives (cancer/healthy) are clearly separable using the arctan-encoding, while the use of the semiquantitative concentrations yields poor separation performance. Therefore, a learning algorithm can construct better separating models on the data transformed with the arctan-encoding than on the raw feature encoding.

The application of the OSAF yielded a set of 59 feature combinations with best classification performance. The subsequent pruning step with SBS left a set of 44 mainly pathophysiologically relevant feature combinations without degrading classification performance (Table 2). Final performance was a sensitivity of 83.5% and a specificity of 90.6% with a p-value << 0.05 (two-sided Fisher's exact test, Table 3) for 10-fold cross validation. The leave-one-out validation yielded a sensitivity of 83.5% and a specificity of 85.9%, also with a p-value << 0.05. Figure 5 shows the mutual information of the selected combinations. In comparison to prior work [20], the mutual information identifies more informative features, which are obtained by using a pairwise encoding.
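The significance statement can be illustrated with a small stdlib implementation of the two-sided Fisher's exact test. The 2 × 2 table used here is an assumption: its counts are inferred from the reported 10-fold cross-validation sensitivity and specificity for 85 patients and 85 controls, not taken from Table 3.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided p-value of Fisher's exact test for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are at most as probable as the observed one.
    return sum(p for p in (prob(k) for k in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-7))


# Rows: true class (cancer, healthy); columns: predicted (cancer, healthy).
p_value = fisher_exact_two_sided(71, 14, 8, 77)
```

For these inferred counts the p-value lies many orders of magnitude below 0.05, in line with the reported "<< 0.05".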

Table 2 Selected feature set
Table 3 Generalization performance
Figure 5

Mutual Information Content. This figure shows the mutual information content of the selected metabolite ratios. On the x-coordinate all pairwise encoded features are listed with their indexes in table 2 in brackets.


The obtained feature selection reflects characteristic, tumor-associated shifts in the analyzed metabolite patterns.

The action of methyltransferases plays a key role in the aberrant RNA metabolism in tumor genesis [27]. In this context, the selected feature combinations of methylated nucleosides No. 13 (m1A/m3C), 15 (m1A/m3U), 20 (m3U/m3C), 26 (m1I/m2 2G), 27 (m1G/m3U), 35 (m2,2,7G/m6t6A) and 36 (m2,2,7G/ms2t6A) show pathophysiologically motivated pattern shifts. Tsutsui et al. already reported significant alterations in the ratios of the monophosphorylated, methylated nucleosides m6A, m5C, m2G and m2 2G from tRNA in normal hepatocytes and Novikoff hepatoma cells [28]. Changes in enzyme specificity, resulting in an enlarged set of possible modification sites in the polynucleotide molecule, were postulated as the biochemical background. Analogous alterations in the methylation capacity have also been reported in breast cancer [29]. The observed shifts in the excretion ratios of certain methylated nucleosides can generally be traced back to this phenomenon.

A metabolic pathway with considerable classification potential was found to be the methionine-/polyamine cycle. Striking analogies have been found to our previous projects, dealing with metabolic profiling in cell culture supernatants of breast cancer cell line MCF-7. In this work, characteristic tumor-associated alterations in the methionine/polyamine cycle had been observed for the excretion behavior of the corresponding degradation products [30]. In particular, these were metabolites from the ubiquitous enzymatic co-substrate SAM. Figure 6 shows a connectivity map of the corresponding biochemical pathways.

Figure 6

Connectivity map. Connectivity map of SAM and related metabolites.

The ribosyl-conjugated methionine scaffold of SAM provides functional groups for various enzymatic reactions. The biosynthesis of the modified uridine derivative proceeds via selective transfer of the carboxyaminopropyl moiety on uridine positions in the RNA molecule [31]. In this context, the feature combination 10 (U/SAH) is of great importance. The high information content in the classification process is probably based on alterations in the competing reaction pathways SAM → SAH and SAM → U → acp3U. In cancer diseases, the elevated cellular methylation capacities lead to higher synthesis and thus excretion of SAH, consequently resulting in altered SAH/U ratios. This presumption is supported by the fact that ratio No. 3 (DHU/SAH) is also differing between breast cancer patients and healthy control subjects. DHU is a uridine derivative, modified through enzymatic reduction of uridine.

The most characteristic indication for tumor-associated alterations in the reaction of SAM-induced carboxyaminopropyltransfer and SAM-induced methyltransfer is reflected by feature combination 7 (acp3U/m3U). Both modified nucleosides represent the primary metabolites of uridine in the mentioned reaction pathways and contribute to the resulting classification performance of the SVM.

Distinctive metabolite ratios within the sets of modified uridines such as No. 4 (Ψ/m3U) and No. 38 (mcm5s2U/DHU), adenosines (No. 43, m6t6A/m1A) and cytidines (No. 29, ac4C/m3C) were selected in the course of the performed feature selection due to their high information content. Alterations in the concentration ratios within one nucleoside group can be attributed to tumor-associated changes in expression and activities of the involved modifying enzyme systems.

The deregulation of SAM-induced methyltransfer reactions in tumor genesis is reflected by three additional feature combinations No. 17 (m7G/SAH), 21 (m3U/SAH) and 37 (m2 2G/SAH). The methylated nucleosides m7G, m3U and m2 2G are posttranscriptionally synthesized via transfer of the activated SAM methyl function on defined positions in the polynucleotide molecules. The SAM cosubstrate involved is converted to SAH. As a consequence, the elevated methylation capacities in tumor cells result in higher levels of methylated nucleosides and thus an increased degradation of SAM yielding SAH. The latter is known as a potent inhibitor of methyltransferases [32]. An elevated level of excretion has already been observed in our studies on metabolite excretion in cell culture supernatants of tumor cell line MCF-7 compared to breast epithelial cell line MCF-10A [30]. As a main conclusion, tumor cells most likely avoid the aforementioned inhibitory effects by active excretion of surplus SAH, resulting in ratio shifts to methylated nucleosides.

In this context, the feature combinations No. 41 (MTA/m6 1A) and 42 (MTA/m6t6A) should also be mentioned. MTA is the primary degradation product of SAM in case of transfer reactions of the aminocarboxypropyl moiety on uridines in the RNA macromolecules. Furthermore it is built by transfer of propylamino groups on the polyamine compounds putrescine and spermidine via the decarboxylated byproduct of SAM, dcSAM. Polyamines are known to be involved in important cell growth and development processes, which thereby also have great impact in tumor genesis [33]. The tumor-associated, deregulated influence on the metabolic flow of the methionine/polyamine cycle probably leads to an accumulation of MTA. Due to its well-known inhibitory effects on methyltransfer reactions, a simultaneous elevated excretion might take place in tumor genesis [30]. Due to the contrarily proceeding SAM-induced methyltransfer reactions leading to m6 1A and m6t6A, shifts in the metabolite ratios involving MTA were observed.

Another interesting metabolite ratio is No. 28, featuring cytidine and its acetylated derivative ac4C. The latter is built in rRNA and tRNA by means of an acetyltransferase system and most probably acetyl-CoA as donor of the acetyl function [4].

In eukaryotic tRNA, ac4C is exclusively implemented on position 12 in the D-loop. The exact biological function is still unknown. A general stabilization of the tRNA structure has been discussed in [34]. Elevated amounts of acetylated cytidine have been described in numerous reports, dealing with the altered excretion of modified nucleosides in cancer diseases [35]. The selection of the C/ac4C combination in our classification approach appears in analogy to the results of our previous work with cell culture supernatants, which showed distinctive alterations in the excretion of cytidine in breast cancer cells [30].

Selection of feature combination No. 32 also reflects relevant attributes of tumor-associated alterations of RNA metabolism. The monomethylated guanosine derivative m2G and its dimethylated analog m2 2G derive from eukaryotic tRNA and rRNA [4] and have both been postulated as potential tumor markers [36]. During biosynthesis of the methylated guanosines, the precursor molecule m2G is converted to m2 2G via the tRNA-N2,N2-dimethyltransferase [37]. In tumors of liver and kidney, a distinctly elevated activity of the involved enzyme system has been observed by Craddock [38]. The resulting elevated biosynthesis of m2 2G explains the observed tumor-associated shifts in the m2G/m2 2G ratio.


In conclusion, we found a reasonable set of 44 tumor-related metabolite pairs measured by LC-IT MS with an SVM prediction performance of 83.5% sensitivity and 90.6% specificity (p-value << 0.05). We demonstrate that semiquantitative measurements are valuable for pattern detection using nonparametric machine learning algorithms. Our results constitute the basis for the development of a noninvasive and efficient screening method. Although we analyzed a balanced dataset of 170 urine samples and estimated the prediction performance using the nearly unbiased LOO, a validation study remains future work. A large-scale, multi-centric evaluation study is essential to establish the method as valid for clinical testing.


  1. World Health Organization (WHO), Causes of death. 2008, []

  2. Khatcheressian JL, Wolff AC, Smith TJ, Grunfeld E, Muss HB, Vogel VG, Halberg F, Somerfield MR, Davidson NE: American Society of Clinical Oncology 2006 update of the breast cancer follow-up and management guidelines in the adjuvant setting. J Clin Oncol. 2006, 24: 5091-5097.

  3. Garcia GA, Goodenough-Lashua DM: Mechanism of RNA-Modifying and -Editing Enzymes. Modification and Editing of RNA. Edited by: Grosjean H, Benne R. 1998, Washington: American Society for Microbiology, 1: 135-168. first

  4. The RNA Modification Database. 2008, []

  5. Schram KH: Urinary nucleosides. Mass Spectrom Rev. 1998, 17: 131-251.

  6. Tormey DC, Waalkes TP, Gehrke CW: Biological markers in breast carcinoma–clinical correlations with pseudouridine, N2, N2-dimethylguanosine, and 1-methylinosine. J Surg Oncol. 1980, 14: 267-273.

  7. Itoh K, Konno T, Sasaki T, Ishiwata S, Ishida N, Misugaki M: Relationship of urinary pseudouridine and 1-methyladenosine to activity of leukemia and lymphoma. Clin Chim Acta. 1992, 206: 181-189.

  8. Waalkes TP, Abeloff MD, Ettinger DS, Woo KB, Gehrke CW, Kuo KC, Borek E: Modified ribonucleosides as biological markers for patients with small cell carcinoma of the lung. Eur J Cancer Clin Oncol. 1982, 18: 1267-1274.

  9. Kammerer B, Frickenschmidt A, Muller CE, Laufer S, Gleiter CH, Liebich H: Mass spectrometric identification of modified urinary nucleosides used as potential biomedical markers by LC-ITMS coupling. Anal Bioanal Chem. 2005, 382: 1017-1026.

  10. Dudley E, El-Sharkawi S, Games DE, Newton RP: Analysis of urinary nucleosides. I. Optimisation of high performance liquid chromatography/electrospray mass spectrometry. Rapid Commun Mass Spectrom. 2000, 14: 1200-1207.

  11. Bullinger D, Frickenschmidt A, Pelzing M, Zey T, Zurek G, Laufer S, Kammerer B: Identification of urinary nucleosides by ESI-TOF-MS. LC-GC Europe. 2005, 5: 16-17.

  12. Kammerer B, Frickenschmidt A, Gleiter CH, Laufer S, Liebich H: MALDI-TOF MS analysis of urinary nucleosides. J Am Soc Mass Spectrom. 2005, 16: 940-947.

  13. Bullinger D, Fux R, Nicholson G, Plontke S, Belka C, Laufer S, Gleiter CH, Kammerer B: Identification of urinary modified nucleosides and ribosylated metabolites in humans via combined ESI-FTICR MS and ESI-IT MS analysis. J Am Soc Mass Spectrom. 2008, 19: 1500-1513.

  14. Yang J, Xu G, Zheng Y, Kong H, Pang T, Lv S, Yang Q: Diagnosis of liver cancer using HPLC-based metabonomics avoiding false-positive result from hepatitis and hepatocirrhosis diseases. J Chromatogr B Analyt Technol Biomed Life Sci. 2004, 813: 59-65.

  15. Seidel A, Brunner S, Seidel P, Fritz GI, Herbarth O: Modified nucleosides: an accurate tumour marker for clinical diagnosis of cancer, early detection and therapy control. Br J Cancer. 2006, 94: 1726-1733.

  16. Fujarewicz K, Jarzab M, Eszlinger M, Krohn K, Paschke R, Oczko-Wojciechowska M, Wiench M, Kukulska A, Jarzab B, Swierniak A: A multi-gene approach to differentiate papillary thyroid carcinoma from benign lesions: gene selection using support vector machines with bootstrapping. Endocr Relat Cancer. 2007, 14: 809-826.

  17. Oh JH, Nandi A, Gurnani P, Knowles L, Schorge J, Rosenblatt KP, Gao JX: Proteomic biomarker identification for diagnosis of early relapse in ovarian cancer. J Bioinform Comput Biol. 2006, 4: 1159-1179.

  18. Denkert C, Budczies J, Kind T, Weichert W, Tablack P, Sehouli J, Niesporek S, Konsgen D, Dietel M, Fiehn O: Mass spectrometry-based metabolic profiling reveals different metabolite patterns in invasive ovarian carcinomas and ovarian borderline tumors. Cancer Res. 2006, 66: 10795-10804.

  19. Mao Y, Zhao X, Wang S, Cheng Y: Urinary nucleosides based potential biomarker selection by support vector machine for bladder cancer recognition. Anal Chim Acta. 2007, 598: 34-40.

  20. Bullinger D, Fröhlich H, Klaus F, Neubauer H, Frickenschmidt A, Henneges C, Zell A, Laufer S, Gleiter CH, Liebich H, Kammerer B: Bioinformatical evaluation of modified nucleosides as biomedical markers in diagnosis of breast cancer. Analytica Chimica Acta. 2008, 618: 29-34.

  21. Frickenschmidt A, Frohlich H, Bullinger D, Zell A, Laufer S, Gleiter CH, Liebich H, Kammerer B: Metabonomics in cancer diagnosis: mass spectrometry-based profiling of urinary nucleosides from breast cancer patients. Biomarkers. 2008, 13: 435-449.

  22. Duda R, Hart P, Stork G: Pattern Classification. 2001, New York: Wiley Interscience

  23. Somol P, Pudil P: Oscillating search algorithms for feature selection. Proceedings of the International Conference on Pattern Recognition (ICPR'00). 2000, 2: 406-409.

  24. Guyon I, Elisseeff A: An Introduction to Variable and Feature Selection. J Machine Learning Research. 2003, 3: 1157-1182.

  25. Wasserman L: All of nonparametric statistics. 2006, New York: Springer Science and Business Media, LLC

  26. Chang CC, Lin CJ: LIBSVM: a library for support vector machines. 2001.

  27. Mandel LR, Hacker B, Maag TA: Altered transfer RNA methylase patterns in Marek's disease tumors. Cancer Res. 1971, 31: 613-616.

  28. Tsutsui E, Srinivasan PR, Borek E: TRNA methylases in tumors of animal and human origin. Proc Natl Acad Sci USA. 1966, 56: 1003-1009.

  29. Borek E: Transfer RNA and transfer RNA modification in differentiation and neoplasia. Introduction. Cancer Res. 1971, 31: 596-597.

  30. Bullinger D, Neubauer H, Fehm T, Laufer S, Gleiter CH, Kammerer B: Metabolic signature of breast cancer cell line MCF-7: profiling of modified nucleosides via LC-IT MS coupling. BMC Biochem. 2007, 8: 25-

  31. Fontecave M, Atta M, Mulliez E: S-adenosylmethionine: nothing goes to waste. Trends Biochem Sci. 2004, 29: 243-249.

  32. Kerr SJ: Competing methyltransferase systems. J Biol Chem. 1972, 247: 4248-4252.

  33. Tormey DC, Waalkes TP, Kuo KC, Gehrke CW: Biologic markers in breast carcinoma: clinical correlations with urinary polyamines. Cancer. 1980, 46: 741-747.

  34. Johansson Marcus JO, Bystrom AS: The Saccharomyces cerevisiae TAN1 gene is required for N4-acetylcytidine formation in tRNA. RNA. 2004, 10: 712-719.

  35. Thomale J, Nass G: Elevated urinary excretion of RNA catabolites as an early signal of tumor development in mice. Cancer Lett. 1982, 15: 149-159.

  36. La S, Cho J, Kim JH, Kim KR: Capillary electrophoretic profiling and pattern recognition analysis of urinary nucleosides from thyroid cancer patients. Anal Chim Acta. 2003, 486: 171-182.

  37. Constantinesco F, Motorin Y, Grosjean H: Characterisation and enzymatic properties of tRNA(guanine 26, N (2), N (2))-dimethyltransferase (Trm1p) from Pyrococcus furiosus. J Mol Biol. 1999, 291: 375-392.

  38. Craddock VM: Increased activity of transfer RNA N2-guanine dimethylase in tumors of liver and kidney. Biochimica et Biophysica Acta, Nucleic Acids and Protein Synthesis. 1972, 272: 288-296.

MS is supported by the Robert Bosch Foundation, Stuttgart, Germany.

Written consent for publication was obtained from the patients or their relatives.

Author information

Corresponding authors

Correspondence to Andreas Zell or Bernd Kammerer.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

CH performed bioinformatical data analysis. DB performed sample preparation and LC-MS analysis. NF extracted the urinary samples by boronate affinity chromatography. RF, HS, HN as well as CG designed the concept of the clinical study. SL, MS, AZ and BK supervised the study and critically revised the manuscript. All authors read and approved the final manuscript.

Carsten Henneges and Dino Bullinger contributed equally to this work.

Electronic supplementary material


Additional file 1: Note on the arctan encoding. This additional note contains more information about the ideas of using the arctan function for encoding pairwise relations. (PDF 79 KB)


Additional file 2: Phenotype permutation test. This table contains p-values for a phenotype permutation test performed for the arctan encoded pairwise features. (XLS 18 KB)


Additional file 3: Metabolite variability. This document contains a boxplot and a discussion of the value range of each measured metabolite for each cohort, i.e. patients and controls. (DOC 28 KB)

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Henneges, C., Bullinger, D., Fux, R. et al. Prediction of breast cancer by profiling of urinary RNA metabolites using Support Vector Machine-based feature selection. BMC Cancer 9, 104 (2009).
