MRM screening/biomarker discovery with linear ion trap MS: a library of human cancer-specific peptides

Background The discovery of novel protein biomarkers is essential in the clinical setting to enable early disease diagnosis and increase survivability rates. To facilitate differential expression analysis and biomarker discovery, a variety of tandem mass spectrometry (MS/MS)-based protein profiling techniques have been developed. For achieving sensitive detection and accurate quantitation, targeted MS screening approaches, such as multiple reaction monitoring (MRM), have been implemented. Methods MCF-7 breast cancer protein cellular extracts were analyzed by 2D-strong cation exchange (SCX)/reversed phase liquid chromatography (RPLC) separations interfaced to linear ion trap MS detection. MS data were interpreted with the Sequest-based Bioworks software (Thermo Electron). In-house developed Perl-scripts were used to calculate the spectral counts and the representative fragment ions for each peptide. Results In this work, we report on the generation of a library of 9,677 peptides (p < 0.001), representing ~1,572 proteins from human breast cancer cells, that can be used for MRM/MS-based biomarker screening studies. For each protein, the library provides the number and sequence of detectable peptides, the charge state, the spectral count, the molecular weight, the parameters that characterize the quality of the tandem mass spectrum (p-value, DeltaM, Xcorr, DeltaCn, Sp, no. of matching a, b, y ions in the spectrum), the retention time, and the top 10 most intense product ions that correspond to a given peptide. Only proteins identified by at least two spectral counts are listed. The experimental distribution of protein frequencies, as a function of molecular weight, closely matched the theoretical distribution of proteins in the human proteome, as provided in the SwissProt database. The amino acid sequence coverage of the identified proteins ranged from 0.04% to 98.3%. The highest-abundance proteins in the cellular extract had a molecular weight (MW)<50,000. 
Conclusion Preliminary experiments have demonstrated that putative biomarkers that are not detectable by conventional data dependent MS acquisition methods in complex un-fractionated samples can be reliably identified with the information provided in this library. Based on the spectral count, the quality of a tandem mass spectrum and the m/z values for a parent peptide and its most abundant daughter ions, MRM conditions can be selected to enable the detection of target peptides and proteins.


Background
The identification of novel protein biomarkers for early disease detection, risk assessment, treatment, and prediction of therapeutic response or toxicity will dramatically improve disease outcomes and survivability rates. The discovery process of protein biomarkers relies, essentially, on the detection and quantitation of protein differential expression patterns in diverse samples [1][2][3][4]. Recently, mass spectrometry has evolved into a powerful tool for the analysis of complex proteomic extracts, and various quantitative proteomic approaches (label-free/stable isotope labeling or absolute/relative) have been developed [5][6][7][8][9][10][11][12][13][14][15]. Large-scale quantitation is typically accomplished by comparing the sample of interest to a pre-defined reference sample of similar complexity. The classical data dependent MS/MS profiling technique, in which an attempt is made to detect all components in a proteome, has provided limited reproducibility for quantitation purposes and limited capability for detecting low-abundance proteins, as is the case for many biomarkers. At the cost of restricting the discovery potential, a targeted screening approach, i.e., multiple reaction monitoring, has been developed to enable the reliable detection and quantitation of representative peptides for selected proteins. While MRM is one of the most sensitive MS scanning modes for peptide identifications, it is best applicable to previously identified peptides with known MS/MS fragmentation patterns [16,17]. An MRM experiment is conducted by selecting representative peptides of a protein with known m/z values (precursor ions), fragmenting them through collision induced dissociation (CID), and monitoring only specific, pre-selected daughter fragments (product ions) that are characteristic of each precursor. The combination of precursor and product m/z values is known as a 'transition,' and is highly specific for a given peptide amino acid sequence.
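To make the notion of a transition concrete, the following minimal Python sketch computes the precursor m/z of a peptide at a given charge state, the singly charged b- and y-ion m/z values, and the resulting (precursor, product) pairs. It is an illustration only: the function names, the use of monoisotopic residue masses, and the restriction to unmodified b/y ions are assumptions that are not taken from the workflow described here, and a low-resolution ion trap is often matched against average masses instead.

```python
# Monoisotopic residue masses in Da (standard values; their use here is an assumption).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def precursor_mz(peptide, charge):
    """m/z of the intact, unmodified peptide carrying `charge` protons."""
    neutral = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge


def b_y_ions(peptide):
    """Singly charged b- and y-ion m/z values for every backbone cleavage site."""
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(RESIDUE_MASS[aa] for aa in peptide[:i]) + PROTON)
        y_ions.append(sum(RESIDUE_MASS[aa] for aa in peptide[i:]) + WATER + PROTON)
    return b_ions, y_ions


def transitions(peptide, charge):
    """All (precursor m/z, product m/z) pairs built from the b and y ions."""
    q1 = precursor_mz(peptide, charge)
    b_ions, y_ions = b_y_ions(peptide)
    return [(round(q1, 3), round(q3, 3)) for q3 in b_ions + y_ions]


if __name__ == "__main__":
    # Example peptide taken from the PCNA discussion in this article.
    for pair in transitions("NLAMGVNLTSMSK", 2)[:5]:
        print(pair)
```

In an actual MRM method only a handful of these pairs, those observed experimentally with high intensity, would be monitored.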
As only a narrow mass range around the m/z of the daughter ion is monitored by MS, the method provides for a fast and sensitive detection of selected peptides. When combined with methods that rely on the use of stable isotope-labeled peptide standards, this approach can be successfully applied for the absolute and relative quantitation of low-abundance components in complex samples. With this method, 47 high/intermediate-abundance proteins were quantified successfully in human plasma (<1 μg/mL level, coefficient of variation, CV = 2-22%) [18], and C-reactive protein [19], human growth hormone [20], and prostate-specific antigen [21] were measured in plasma or serum. Alternatively, MRM-based approaches have been used to identify the presence of phosphorylation on key cell cycle regulatory proteins [22], to quantify multisite phosphorylation [23], and to perform quantitative proteomic analysis of cellular signaling networks [24].
In most MS quantitative studies, the instrument of choice is a triple quadrupole mass spectrometer. Recently, a new type of MS instrument, i.e., the linear ion trap, has gained popularity among proteomics researchers. In a triple quadrupole instrument, CID is accomplished by accelerating the precursor ions in a dc/rf electrical field to induce fragmentation through successive collisions with background gas molecules (multi-step fragmentation). In an ion trap instrument, CID is accomplished by exciting the precursor ions at their resonant frequency. As the product ions have different masses than the precursor ion, they are not in resonance with the excitation frequency, and are not subjected to further ion fragmentation as in a triple quadrupole instrument (single-step fragmentation). Thus, the analysis of large peptides by MRM in linear ion-trap mass spectrometers can be performed with improved detection limits, due to the formation of fewer but more intense product ions in the ion trap vs. the triple quadrupole [25,26]. By using one-dimensional chromatographic separations and linear ion trap MS detection, the quantitation of 5 intermediate-abundance serum proteins by MRM, with good precision and accuracy, at ~1-30 μg/mL levels, was reported [27].
In order to perform MRM experiments, the m/z of a specific peptide precursor and its selected product ions must be known. Large-scale proteomic analyses on various mass spectrometry platforms have revealed that proteins are consistently identified by only a handful of possible tryptic peptides, and that frequently observed peptides are not necessarily generated from the most abundant proteins. The peptides that are preferentially observed for a protein are called "proteotypic" [28][29][30]. For example, Mallick et al. have classified a peptide as being proteotypic if it was observed in >50% of all identifications of a corresponding protein (based on data obtained from large yeast proteomic archives), and evaluated 494 numeric physicochemical property scales for amino acids (e.g., charge, secondary structure, hydrophobicity, etc.) to develop a computational tool that can predict the proteotypic propensity of a peptide [28]. In addition, machine-learning algorithms have been developed to generate information related to the peptide fragmentation pattern [31]. Nevertheless, such computational predictions are often mass spectrometry platform dependent. Tandem mass spectra of proteotypic peptides, most commonly generated on quadrupole or 3D ion trap instruments, have been collected so far in databases such as PeptideAtlas [32], GPM [33] and PRIDE [34]. Due to differences in the CID process, as discussed earlier, triple quadrupoles and ion traps often generate different peptide fragmentation patterns (i.e., different product ion species with different intensities), and to date, very few data generated by linear ion trap instruments have been made available through public repositories. In this work, we provide human breast cancer tandem mass spectrometry data generated on a linear ion trap instrument (LTQ/Thermo) that were collected into a library of 1,572 proteins matched by a list of 9,677 peptides.
Among many parameters, the spectral count for each ion species, the best p-value, and the top 10 most intense daughter ions are provided to enable the selection of the most frequently identified peptides for MRM proteomic explorations. Validation of protein identifications, and relative/absolute protein quantitation for biomarker discovery or screening, are envisioned to be the most relevant applications that would benefit from the information provided in this table.

Tandem MS data analysis
Data dependent MS analysis was performed by acquiring one MS scan (5 microscans averaged) followed by one zoom scan (5 microscans averaged) and one MS2 on the top 5 most intense peaks. The zoom scan width was ± 5 m/z, and the dynamic exclusion was enabled at repeat count 1, exclusion list size 200, exclusion duration 60 s, and exclusion mass width ± 1.5 m/z. Collision induced dissociation was performed by setting the ion isolation width at 3 m/z, normalized collision energy at 35%, activation Q at 0.25, and activation time at 30 ms. The combined results of 48 SCX-LC-MS/MS and 10 LC-MS/MS runs were used to perform protein database searching. Protein identification was performed with the Bioworks 3.3 software (Thermo Electron Corp, San Jose, CA, USA) by using a minimally redundant database downloaded from SwissProt (37,678 entries) in January 2008. The database search parameters were chosen as follows: only fully tryptic fragments were considered in the analysis, the number of allowed missed cleavage sites was 2, the peptide tolerance was 2 amu, and the fragment ion tolerance was 1 amu. Chemical and/or posttranslational modifications were not allowed. The capability to match one peptide sequence to multiple protein references within the database was not enabled. MRM data acquisition was performed using the same CID parameter settings as for data dependent analysis, and included the development of LC-MS/MS runs with 1-6 segments (20-240 min long) and 6-9 scan events/segment. Specific conditions for each transition are discussed in the following sections of the manuscript.

Library construction and content
Large scale proteomic studies on MCF-7 and/or other breast cancer cell lines have resulted in the combined identification of ~1,000-4,000 proteins by using 2D-gel electrophoresis or shotgun analysis protocols (false positive rates of <5%) [35][36][37][38]. In this work, a protein/peptide library was generated from 58 LC-MS/MS data dependent analyses (see Additional file 1: Appendix 1). Tandem MS data were filtered at the peptide level with the Xcorr vs. charge state filter set at Xcorr = 1.5 for z = 1, Xcorr = 2.0 for z = 2, and Xcorr = 3.0 for z ≥ 3, respectively, and at the protein level by considering only proteins with ≥ 2 spectral counts. A total of 2,286 proteins (p < 0.001) were identified. The library comprises 1,572 proteins (all with ≥ 2 spectral counts) matched by 9,677 peptides (all with p < 0.001, p being the probability of a random match as calculated by the Bioworks software). By using such MS data filtering parameters and by selecting only proteins and peptides with p < 0.001, the rate of false positive identifications [39] when searching the data against a forward/reversed human protein database was ~1.5% and ~4.5% at the peptide and protein levels, respectively. At the protein level, the library provides the p-value, the score, the sequence coverage, the molecular weight, and the number of total and unique peptides observed for each protein.
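The filtering rules above can be illustrated with a short Python sketch. It is not the Bioworks implementation: the record layout (dictionaries with the keys `protein`, `xcorr`, `charge` and `p`) is hypothetical, and treating the Xcorr values as inclusive lower bounds is an assumption.

```python
from collections import Counter

# Xcorr thresholds by charge state; any charge >= 3 falls back to 3.0.
XCORR_MIN = {1: 1.5, 2: 2.0}
XCORR_MIN_HIGH_CHARGE = 3.0
P_MAX = 0.001            # peptides must have p < 0.001
MIN_SPECTRAL_COUNT = 2   # proteins need at least two spectral counts


def passes_peptide_filter(xcorr, charge, p_value):
    """Peptide-level filter: Xcorr vs. charge state, plus the p-value cut-off."""
    threshold = XCORR_MIN.get(charge, XCORR_MIN_HIGH_CHARGE)
    return xcorr >= threshold and p_value < P_MAX


def filter_identifications(psms):
    """Keep peptide matches that pass the peptide filter and whose protein
    retains at least MIN_SPECTRAL_COUNT passing matches."""
    kept = [m for m in psms
            if passes_peptide_filter(m["xcorr"], m["charge"], m["p"])]
    counts = Counter(m["protein"] for m in kept)
    return [m for m in kept if counts[m["protein"]] >= MIN_SPECTRAL_COUNT]
```

The sketch omits the protein-level p-value and the forward/reversed database estimate of the false positive rate, both of which depend on details of the search software that are not reproduced here.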
The total number of observed peptides (or the peptide hits) represents the spectral count. In addition, based on the protein sequences provided in the SwissProt database, we calculated the theoretically observable peptides, i.e., the tryptic peptides with maximum 2 missed cleavages (we note that the raw data were searched against the human database by allowing for such peptides in the search). The ratio of the unique observed to observable peptides is indicative of the protein abundance, and was previously coined as the protein abundance index (PAI) [40,41]. At the peptide level, the library provides the amino acid sequence of each peptide, the charge state, the spectral count of each peptide at each identifiable charge state, the protonated mass (MH+), the parameters that characterize the quality of a tandem mass spectrum [DeltaM, p-value, Xcorr, DeltaCn, Sp, the number of matching ions (b, y and a) in the tandem mass spectrum], the retention time of the peptide, the length of the LC gradient (10 to 100% B), and 10 product ions from each tandem mass spectrum for MRM analysis. As every peptide sequence generated several tandem mass spectra, the data from Appendix 1 (see Additional file 1) correspond to the spectra with the best (i.e., the lowest) p-value. Four in-house developed Perl-scripts were used to generate the library. The first Perl-script was used to calculate the spectral count (from all 58 LC-MS/MS experiments) for each unique amino acid sequence peptide at a given charge state, and to select the best tandem mass spectrum for this peptide (i.e., the mass spectrum with the lowest p-value).
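The task of the first script can be sketched as follows. This is a Python illustration and not the original Perl code; the input format (one dictionary per identified spectrum, with the hypothetical keys `sequence`, `charge` and `p`) is assumed for the example.

```python
def summarize_peptides(psms):
    """Spectral count and best (lowest p-value) spectrum for every
    (peptide sequence, charge state) pair across all runs."""
    summary = {}
    for match in psms:
        key = (match["sequence"], match["charge"])
        entry = summary.setdefault(key, {"spectral_count": 0, "best": match})
        entry["spectral_count"] += 1
        # Keep the spectrum with the lowest probability of a random match.
        if match["p"] < entry["best"]["p"]:
            entry["best"] = match
    return summary
```

Keying on the sequence together with the charge state mirrors the library layout, in which a spectral count is reported for each peptide at each identifiable charge state.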
A second Perl-script was used to select representative ions for MRM analysis. The strategy involved the extraction of the top 10 most intense daughter ions from the DTA file associated with the best tandem mass spectrum of a peptide. Ions in the vicinity of the parent (m/z parent ± 60) were excluded to avoid the selection of adducts or neutral loss ions. In addition, ions in the immediate vicinity of a fragment (m/z fragment ± 3) that was already selected for MRM were excluded, as well, to avoid duplication by the selection of isotopic peaks. The third Perl-script was used to calculate the observable peptides for each protein.
The algorithm involved performing in-silico tryptic digestion for each protein in the SwissProt database, and counting the number of peptides with mass ranging from 500 to 4,000 Da and with 0, 1 or 2 missed cleavages. The fourth Perl-script was used to extract the LC retention time of each peptide from the Sequest result files.
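The product-ion selection strategy of the second script (top 10 most intense ions, excluding m/z parent ± 60 and m/z fragment ± 3 around ions already selected) can be sketched in Python as shown below. The function name and the peak representation as (m/z, intensity) tuples are hypothetical, and treating both exclusion windows as inclusive is an assumption; parsing of the DTA file itself is left out.

```python
def select_product_ions(peaks, parent_mz, n=10,
                        parent_window=60.0, fragment_window=3.0):
    """Pick up to `n` product ions for MRM from one tandem mass spectrum.

    peaks: iterable of (mz, intensity) tuples.
    Ions within `parent_window` of the parent m/z are skipped (possible
    adducts or neutral-loss ions); ions within `fragment_window` of an
    already selected, more intense ion are skipped (likely isotopic peaks).
    """
    selected = []
    for mz, intensity in sorted(peaks, key=lambda p: p[1], reverse=True):
        if abs(mz - parent_mz) <= parent_window:
            continue
        if any(abs(mz - chosen_mz) <= fragment_window
               for chosen_mz, _ in selected):
            continue
        selected.append((mz, intensity))
        if len(selected) == n:
            break
    return selected
```

Because the peaks are visited in order of decreasing intensity, the ion retained from each cluster of neighboring peaks is always the most intense one.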

Data evaluation
To obtain a qualitative view of how well this protein pool represented the human proteome, a chart reflecting the experimental frequency of the 1,572 identified proteins as a function of MW (that ranged from ~5,000 to ~1,000,000 Da) was constructed, and compared to a similar chart reflecting the theoretical protein distribution downloaded from the SwissProt/Expasy website (http://www.expasy.ch; see Figure 1). The MW was expressed in terms of number of amino acids per protein, by assigning to each amino acid the molecular weight of averagine (i.e., MW = 111.12) [42]. The experimental and theoretical distributions were fairly similar, illustrating that our dataset comprised a representative set of proteins, and that our experimental protocol performed well in sampling the human proteome. A small bias towards proteins with a larger number of amino acids was, however, observed. It was noticed that proteins with a sequence shorter than 200 amino acids (MW~22,200) were less frequently encountered. The theoretical and experimental protein distributions peaked at proteins containing 140-160 and 180-200 amino acids, respectively. Similar results were obtained if all proteins with p < 0.001, not just the ones with at least two spectral counts, were considered in the comparison. Assuming that there was no bias introduced by losing peptides belonging to small MW proteins during sample processing (e.g., by protein digestion, recovery of peptides from clean-up cartridges, etc.), we attributed this bias to a lower sampling rate during MS data dependent analysis, as a result of a smaller number of matching tryptic peptides that can be generated from low MW proteins. We would expect that large MW proteins will generate a larger number of peptides, thus increasing the likelihood of detection during a data dependent analysis process.
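The averagine-based conversion between protein length and MW used above is simple arithmetic, sketched here in Python with hypothetical function names.

```python
AVERAGINE_MW = 111.12  # average residue mass assigned to each amino acid


def approx_mw(n_residues):
    """Approximate protein MW from its number of amino acids."""
    return n_residues * AVERAGINE_MW


def approx_length(mw):
    """Approximate number of amino acids from a protein MW."""
    return mw / AVERAGINE_MW
```

For example, a 200-residue protein maps to 200 × 111.12 = 22,224, which is the MW~22,200 quoted above.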
For this data set (1,572 proteins), the increase in observable (theoretical) tryptic peptides with the protein MW is shown in Figure 2A, and the ratio of experimental percentage of identified proteins to the theoretical percentage (according to the SwissProt chart) vs. the number of amino acids in a protein, is shown in Figure 2B. The range of 20-1,940 amino acids/protein corresponds to a range of MW of 2,222<MW<215,573. Low MW proteins are clearly under-sampled in our extract, and the number of available peptides/protein for MS detection could provide at least a partial explanation for a more successful mapping, in terms of numbers, of high MW proteins. However, the dynamics of protein turnover is an additional factor that may affect the success of MS detection in complex cellular extracts. Effective sampling of a proteome, in a relevant biological context, will have to take into account correlations between protein function, protein half-life (that can vary from minutes to hours or days), and eventually protein MW.
Protein detectability is not only dependent on the number of observable peptides/protein, but also on the protein abundance and the proteotypic propensity of the matching peptides, and can be assessed in terms of sequence coverage. For this data set, we note that while the overall sequence coverage of the identified proteins was fairly broad (i.e., 0.04%-98.3%), the low MW proteins were clearly identified with a higher sequence coverage despite the smaller number of unique peptides/protein (Figure 3A). The observed number of unique or total peptide hits (spectral counts), while dependent on the protein MW, is also a strong indicator of the protein abundance and of the peptide propensity for MS identification. This quantitative relationship is represented in Figure 3B for unique peptides, and in Figure 3C for total spectral counts. To eliminate the bias introduced by high MW proteins generating more peptides, Figure 3B displays the ratio of the experimentally observed unique peptides to the theoretically observable peptides as a function of protein MW. Proteins with MW<50,000 were found to be more abundant in the cellular extract, the most abundant proteins peaking at MW~20,000-30,000. We must note, however, that many experimental factors can affect the interpretation of results. For example, the extraction, denaturation and tryptic digestion of proteins could correlate negatively with the MW of proteins, thus resulting in a lower number of observed peptides/protein. We should also note that peptides with propensity for identification will generate progressively increased spectral counts at higher abundance levels, as they elute as broader chromatographic peaks during LC-MS/MS analysis.
The chart that is displayed in Figure 3C eliminates the impact of peptide propensity for detection by providing the number of spectral counts/number of experimentally detected unique peptides as a function of protein MW, and strengthens the conclusion that proteins with MW<50,000 were, overall, more abundant (assuming that the MW of the originating proteins introduced no consistent bias in the proteotypic behavior of peptides). Further work will be, however, necessary to evaluate the impact of protein size, hydrophobic properties and packing on the effectiveness with which large MW proteins are processed and detected experimentally, to enable more general conclusions regarding the abundance of proteins in whole cellular extracts.
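The two normalizations behind Figures 3B and 3C can be sketched in Python as follows. The sketch re-implements the counting of observable peptides (in-silico tryptic digestion, 0-2 missed cleavages, 500-4,000 Da) under several assumptions that the text does not specify: cleavage after every K or R with no proline rule, monoisotopic masses, and counting of distinct sequences. All function names are hypothetical.

```python
# Monoisotopic residue masses in Da (an assumption; average masses may have been used).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(protein, max_missed=2):
    """Distinct tryptic peptides with up to `max_missed` missed cleavages."""
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":                      # cleave C-terminal to K or R
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])      # C-terminal peptide
    peptides = set()
    for i in range(len(pieces)):
        for j in range(i, min(i + max_missed + 1, len(pieces))):
            peptides.add("".join(pieces[i:j + 1]))
    return peptides


def count_observable(protein, min_mass=500.0, max_mass=4000.0, max_missed=2):
    """Number of theoretically observable peptides within the mass range."""
    def mass(pep):
        return sum(RESIDUE_MASS[aa] for aa in pep) + WATER
    return sum(1 for pep in tryptic_peptides(protein, max_missed)
               if min_mass <= mass(pep) <= max_mass)


def unique_to_observable_ratio(n_unique_observed, n_observable):
    """Quantity plotted in Figure 3B: observed unique / observable peptides."""
    return n_unique_observed / n_observable


def counts_per_unique_peptide(spectral_count, n_unique_observed):
    """Quantity plotted in Figure 3C: spectral counts / observed unique peptides."""
    return spectral_count / n_unique_observed
```

The first ratio removes the advantage that large proteins have in generating more peptides; the second removes the effect of peptide detectability, leaving a quantity that tracks abundance more closely.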

MRM analysis
The information provided in the protein/peptide library can be effectively used to perform MRM experiments. The spectral count of each peptide, at the detectable charge state, reflects its propensity for identification (we note that not all peptides with high spectral count are necessarily proteotypic according to the definition provided in reference 28, i.e., detectable in >50% of the trials that identified the corresponding protein). The p-value and the other SEQUEST scores reflect the quality of the tandem mass spectrum that led to the identification of the peptide. Up to ten MRM transitions can be set up for each parent ion. By displaying only peptides with p < 0.001 [i.e., -10log(p)>30], it was ensured that the ions selected by the Perl script were mostly a, b, y, H2O/NH3 neutral loss or multiple loss ions, but not noise or other contaminants. We note, however, that the experimental product ions were generated by enabling the database search with a fragment ion tolerance of 1 amu, thus contaminant product ions within this mass window are possible. Quick manual corroboration with software packages such as Protein Prospector (http://prospector.ucsf.edu) can confirm the validity of the product ions in the library, and help eliminate contaminant ions that do not belong to the considered peptide. Generally, the lower the p-value of a peptide [i.e., the higher the -10log(p)], the less likely is the presence of extraneous fragment ions in the list.
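As a small worked example of how these parameters might be combined, the Python sketch below converts a p-value to the -10log(p) score (taking the logarithm as base 10, which is consistent with p = 0.001 giving 30) and ranks candidate peptides for MRM. The ranking rule (spectral count first, p-value as tie-breaker) and the dictionary keys are illustrative assumptions, not a prescription from the library.

```python
import math


def score_from_p(p_value):
    """-10*log10(p); p = 0.001 corresponds to a score of 30."""
    return -10.0 * math.log10(p_value)


def rank_candidates(peptides, p_max=0.001):
    """Order peptides of one protein for MRM: highest spectral count first,
    lowest p-value as tie-breaker; peptides with p >= p_max are dropped."""
    eligible = [pep for pep in peptides if pep["p"] < p_max]
    return sorted(eligible, key=lambda pep: (-pep["spectral_count"], pep["p"]))
```

Peptides near the top of such a ranking are the ones whose listed product ions are least likely to include extraneous fragments.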
Figure 3. Charts that illustrate protein abundance as a function of protein MW, for the set of 1,572 proteins.
The applicability of this peptide library for the identification of putative biomarkers in proteomic samples is demonstrated with a few examples that involved the analysis of un-fractionated MCF-7 protein extracts. Whole cellular extracts represent a good testing system for demonstrating the effectiveness of MRM analysis because, due to their complexity, the extracts do not facilitate the detection of low-abundance components. When using a data dependent acquisition process, such extracts typically enabled the identification of only ~400-600 proteins with p < 0.001 (~200-300 proteins with 2 spectral counts) per LC-MS/MS run, i.e., ~5 times fewer than the SCX prefractionated samples that enabled the identification of ~2,000 proteins [35]. The following scenarios were encountered during data dependent analysis of a whole cellular extract: (1) the protein and all matching peptides from the library were identifiable; (2) the protein was identifiable by some, but not all matching peptides from the library; and (3) the protein was not identifiable by any of the peptides listed in the library. The detection of a set of seven putative biomarker proteins, as previously reported in the literature [43][44][45], was facilitated by enabling MRM transitions for the corresponding peptides that are shown in Table 1.
The proteins and the peptides that were not detectable in the whole extract by data dependent analysis are marked with "no ID." Peptides from the library with the largest number of spectral counts and best SEQUEST scores (most importantly with the lowest p-values) were selected for MRM analysis. The product ions that were monitored for these peptides were the first five most intense. Representative results of extracted ion chromatograms (EIC) for these transitions are summarized in Figure 4. As the LTQ is a relatively low mass accuracy/resolution instrument, the mass window that was monitored around a product ion in the EIC was fairly broad (m/z = ± 1.5), enabling, thus, contaminant fragments to interfere with the MRM analysis. However, the ability to detect all transitions at the retention time of the parent peptide can greatly increase the specificity of detection, as contaminants with the same precursor m/z, same fragment(s) m/z, and same retention time, are highly unlikely.
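The co-elution argument can be illustrated with a minimal Python sketch that builds an EIC for each monitored product ion (± 1.5 m/z window) and checks whether all transitions reach their apex at nearly the same retention time. The scan representation, the function names, and the 0.5 min apex tolerance are hypothetical choices made for this example.

```python
def extracted_ion_chromatogram(scans, product_mz, tol=1.5):
    """EIC of one product ion.

    scans: list of (retention_time_min, [(mz, intensity), ...]) MS/MS scans
    acquired for a single precursor. Returns (retention_time, summed intensity).
    """
    return [(rt, sum(i for mz, i in peaks if abs(mz - product_mz) <= tol))
            for rt, peaks in scans]


def apex_time(eic):
    """Retention time of the most intense EIC point, or None if no signal."""
    if not eic:
        return None
    rt, intensity = max(eic, key=lambda point: point[1])
    return rt if intensity > 0 else None


def transitions_coelute(scans, product_mzs, rt_tol=0.5, tol=1.5):
    """True if every transition is detected and all apexes fall within rt_tol min."""
    apexes = [apex_time(extracted_ion_chromatogram(scans, mz, tol))
              for mz in product_mzs]
    if not apexes or any(a is None for a in apexes):
        return False
    return max(apexes) - min(apexes) <= rt_tol
```

A contaminant that shares one or two product ions within the broad mass window will usually fail this check, because its remaining transitions are missing or peak at a different time.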
When a protein was detectable in the whole cell extract by data dependent analysis, very strong product ion peaks were observable in the EIC of each transition that was enabled for a peptide (see Figure 4A, peptide at 139.58 min).
In Figure 4A, each transition was enabled for 20 min. Nevertheless, such transitions can be enabled for a much shorter time, when the retention time (t_r) of a peptide is well controlled, or for longer times, or even for the entire length of the experiment, when the t_r is not known. Given that the LC-MS/MS analyses that contributed with data to this library were conducted for different lengths of time, the peptide t_r values in Figure 4 do not correspond to the t_r values from the library. To enable a rough prediction of a peptide t_r, the length of each LC gradient (10-100% B) is also provided. We note that (a) both retention time and gradient length include a ~20 min dead-time corresponding to the elution of non-retained components from the LC column, and (b) the gradient was not linear, as 80-90% of the gradient length was dedicated to increase the % B from 10 to 45%. Later experiments in our lab have confirmed that retention time/gradient length estimates could be obtained within ± 10-25% of the values provided in the library.
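One possible way to turn the library retention time and gradient length into a rough estimate for a new run is sketched below in Python. The proportional scaling after subtraction of the dead-time is an assumption of this example, not a procedure stated here; it presumes that the new gradient has the same shape as the library gradient, and the function names and default values are hypothetical.

```python
def predict_retention_time(library_tr, library_gradient, new_gradient,
                           dead_time=20.0):
    """Scale a library retention time (min) to a gradient of different length.

    Both retention time and gradient length are taken to include the same
    dead-time, which is removed before scaling and added back afterwards.
    """
    fraction = (library_tr - dead_time) / (library_gradient - dead_time)
    return dead_time + fraction * (new_gradient - dead_time)


def monitoring_window(predicted_tr, relative_error=0.25):
    """Time window (min) in which to enable a transition, assuming the
    estimate is accurate to within the given relative error."""
    half_width = predicted_tr * relative_error
    return (max(0.0, predicted_tr - half_width), predicted_tr + half_width)
```

For instance, under these assumptions a peptide listed at 80 min in a 140 min gradient would be expected near 140 min in a 260 min gradient, and would be monitored from 105 to 175 min with the 25% margin.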
The presence of contaminant peptide species with m/z close to that of the peptide of interest, and with several overlapping transitions, was observed in our MRM studies (see Figure 4B, monitored peptide at 72.77 min, contaminant peptide at 41.06 min). Interference from such peptides can be eliminated by narrowing the m/z window that is used for the generation of the EIC, or by narrowing the time window that is used for monitoring the MRM transition (when the elution time of the peptide is known). [...] were still detectable by most, if not all, MRM transitions (see Figure 4C, peptide at 157.43 min).

Table 1 footnote: Peptides or proteins marked with "no ID" were not detectable in the whole protein extract by data dependent MS analysis. Transitions that did not result in peptide identification in the EIC are marked in bold.
Missed transitions were especially observable in the case of library peptides identified by only one spectral count and p-values that were just above the threshold set for elimination from the list. For example, in the case of PCNA, the identification of peptide NLAMGVNLTSMSK was not conclusive based on the transitions that were provided in the library (Figure 4D), i.e., consistent transitions at the predicted peptide t_r were not observable. Cross validation with Protein Prospector revealed that the first three transitions were probably not even correct for this peptide. The protein was, however, identifiable by MRM transitions enabled for other matching peptides.

Conclusion
In summary, through this work, we make available for public use tandem MS information generated for a list of 1,572 proteins from MCF-7 human breast cancer cells. Unlike publicly available empirical databases, our library provides a large set of proteins and peptides that can be identified in human cancerous cells under a consistent set of experimental conditions. As the data were generated with a linear ion trap mass spectrometer, the library strategically complements existing information generated with ESI-quadrupole (Q), ESI-Q-time-of-flight (TOF), ESI-ion cyclotron resonance (ICR) or matrix assisted laser desorption ionization (MALDI)-TOF instruments. Moreover, the availability of spectral count data provides information related to the abundance and proteotypic propensity of peptides, at given charge states, in the context of complex cellular extracts. The library enables the development of MRM-MS protocols for the identification of possibly hundreds of target proteins with particular relevance to biomarker screening and discovery applications. Key for the identification of a set of protein biomarkers in a complex un-fractionated cellular extract will be the development of MRM strategies that involve the selection of several peptides/protein (possibly with the highest spectral count and best SEQUEST scores) and of multiple transitions/peptide. Custom-prepared isotopically labeled versions of selected peptides could be further used for performing quantitation studies.
Figure 4. Extracted ion chromatograms illustrating five MRM transitions/peptide for the identification of putative cancer biomarkers in whole cellular extracts (see Table 1).

Abbreviations
CID: collision induced dissociation; CV: coefficient of variation; DeltaCn: degree by which the lower ranked peptide scores differ from the correlation score of the best match; DeltaM: difference between the theoretical and experimental mass of a peptide; EIC: extracted ion chromatogram; ESI: electrospray ionization; ICR: ion cyclotron resonance; LTQ: linear trap quadrupole; MALDI: matrix assisted laser desorption ionization; MRM: multiple reaction monitoring; MS: mass spectrometry; MW: molecular weight; Q: quadrupole; RPLC: reversed phase liquid chromatography; Sp: preliminary score; SCX: strong cation exchange chromatography; TOF: time of flight; Xcorr: cross correlation score between virtual and experimental spectrum.