Evaluation of protein biomarkers of prostate cancer aggressiveness

Background Prognostic multibiomarker signatures in prostate cancer (PCa) may improve patient management and provide a bridge for developing novel therapeutics and imaging methods. Our objective was to evaluate the association between expression of 33 candidate protein biomarkers and time to biochemical failure (BF) after prostatectomy. Methods PCa tissue microarrays were constructed representing 160 patients for whom clinicopathologic features and follow-up data after surgery were available. Immunohistochemistry for each of 33 proteins was quantified using automated digital pathology techniques. Relationships between clinicopathologic features, staining intensity, and time to BF were assessed. Predictive modeling using multiple imputed datasets was performed to identify the top biomarker candidates. Results In univariate analyses, lymph node positivity, surgical margin positivity, non-localized tumor, age at prostatectomy, and biomarkers CCND1, HMMR, IGF1, MKI67, SIAH2, and SMAD4 in malignant epithelium were significantly associated with time to BF. HMMR, IGF1, and SMAD4 remained significantly associated with BF after adjusting for clinicopathologic features while additional associations were observed for HOXC6 and MAP4K4 following adjustment. In multibiomarker predictive models, 3 proteins including HMMR, SIAH2, and SMAD4 were consistently represented among the top 2, 3, 4, and 5 most predictive biomarkers, and a signature comprised of these proteins best predicted BF at 3 and 5 years. Conclusions This study provides rationale for investigation of HMMR, HOXC6, IGF1, MAP4K4, SIAH2, and SMAD4 as biomarkers of PCa aggressiveness in larger cohorts.

Raw RNA expression data were normalized within each dataset by applying the Robust Multichip Average (RMA) algorithm using Affymetrix Expression Console. A maximum threshold was set at the 99th percentile to mitigate the effects of extreme outliers. To correct for differences in differential intensity profiles between arrays within a dataset, arrays were least squares normalized by multiplying each array by 1/slope of an array constructed from the median expression value of each gene across all samples as described [4]. To provide a relative measure of fold-change which is more generalizable for cross-study analysis, arrays were adjusted by dividing each array was by its median expression value.

Raw cDNA expression data from Lapointe et al was not compatible with Affymetrix
Expression Console software so our normalization technique for the Lapointe dataset was meant to approximate RMA for a two channel non-Affymetrix chipset using four steps: background correction, normalization, log correction, and linear modeling as described [5]. A procedure for the quantile normalization step is outlined by Bolstad et al [6]. The linear modeling step was omitted since cDNA expression data does not need probe set summarization normally required for RNA microarrays.

Gene filtering
Datasets from Singh et al [1] and Yu et al [2] were used as training datasets for building a supervised prediction model. To compare expression data from individual genes with tumor aggressiveness or non-aggressiveness we calculated the signal-to-noise ratio (S x ) for each gene: where, μ is the mean expression value and σ is the standard deviation of the expression values for a given gene x across all non-aggressive (NA) or aggressive (A) specimens in a single dataset as described [4,7]. The 500 genes with the most positive S x ("non-aggressiveness genes") and 500 genes with the most negative S x ("aggressiveness genes") were selected from each dataset to create a group of the 1000 most informative genes in each dataset.
The top 1000 genes ranked by informational content (decreasing absolute value of S x ) in each training dataset are listed in Additional file 2: Table S1 (Singh et al) and Additional file 3: Table   S2 (Yu et al). UniGene names were then mapped from Affymetrix probe set identifiers using GeneSifter (Geospiza, Seattle, WA). A total of 110 genes were shared between both top 1000 lists although complement component genes C2 and C7 were excluded (due to their ubiquitous presence in blood/tissues and unlikely specificity for PCa) resulting in 108 shared genes retained within each dataset for further analysis. Expression values for genes with multiple probe set identifiers were averaged. Signal-to-noise ratios for this selected gene set were averaged and weighted to account for difference in sample size between datasets. The list of genes was truncated at 50 as we aimed to validate these markers by immunohistochemistry; an additional 4 genes (UGT2B11, ITPR1, DNM2, and RFPL3) were subsequently excluded because they were not represented by cDNA probes in the Lapointe et al validation dataset. This final ordered list of 46 genes was ranked by decreasing informational content (absolute value of weighted average S x ) (Additional file 4: Table S3) and used in weighted voting analyses.

Supervised prediction using a weighted voting process
To examine the prognostic value of n-gene signatures derived from our ranked list of 46 genes, we utilized a weighted voting and class prediction process described by Ramaswamy et al [4] and originally developed by Golub et al [7]. Gene expression data and signal-to-noise ratios described above from Singh et al [1] and Yu et al [2] were used as inputs for training the weighted voting algorithm to recognize aggressive from non-aggressive PCa. The independent cDNA expression dataset from Lapointe et al [3] was used for validation testing.
Aggressiveness or non-aggressiveness of a specimen y was predicted by the summation of weighted votes (V) from the n highest quality genes included in the model (consecutively added from the ranked set of top 46 genes in Additional file 4: Table S3). First, a vote (v x ) towards aggressive or non-aggressive was defined as v x = S x (g x y -b x ) and assigned to each ranked gene based on the signal-to-noise quality factor (S x ) and proximity of a given specimen's gene expression level (g x y ) to the gene's average expression level among non-aggressive (μ NA ) or aggressive specimens (μ A ) using its midpoint boundary (b x = (μ NA + μ A ) / 2). Weighted voting calculations are shown in Additional file 5: