Immunohistochemical staining of the breast TMA was performed using a standard two-step indirect avidin-biotin complex method (Vector Laboratories, Burlingame, CA) or a two-step polymer detection method (DakoCytomation, Inc., Carpinteria, CA) as previously described [[19–22]]. The following primary antibodies were used: BS106, BU101, Mammaglobin (Abbott Laboratories, Abbott Park, IL), prolactin-inducible protein (Signet Laboratories, Inc., Dedham, MA), S100A7 (Imgenex Corp., San Diego, CA), 14-3-3 σ (Research Diagnostics, Inc., Concord, MA), Her-2/neu (Zymed Laboratories, Inc., South San Francisco, CA), progesterone receptor, estrogen receptor alpha, p53 (DakoCytomation, Inc., Carpinteria, CA), RIN1, annexin A1, beta-catenin (BD Biosciences Transduction Laboratories, Lexington, KY), Na-KATPase-β1, Na-KATPase-α, GATA3, Smad2 (Santa Cruz Biotechnology, Inc., Santa Cruz, CA), Smad4 (Millipore, Billerica, MA), YY1 (Geneka Biotechnology, Inc., Montreal, Quebec, Canada), TGF β receptor II (Abcam, Inc., Cambridge, MA), H3K4 and H3K18 (Upstate, Lake Placid, NY), and MED28 (gift from Dr. Mai Brooks). Briefly, 4 μm sections were deparaffinized, treated with 0.3% hydrogen peroxide in methanol, blocked with 5% serum, and incubated with primary and secondary antibodies. Diaminobenzidine was used for color detection. A concentration-matched isotype control IgG was used for negative controls. Note that the 26 markers analyzed here were originally chosen for other oncogenic studies conducted in our laboratory.
The level of protein expression in glandular epithelial cells was quantitatively assessed by a pathologist blinded to all clinico-pathological variables. We used the percentage of positively staining cells, referred to as "pos", as the quantitative measure of protein expression. To arrive at a single staining measure per patient (referred to as "pos.mean"), we averaged the pos measures across the multiple cancer spots of each patient, as described elsewhere.
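The per-patient averaging can be sketched as follows. This is a minimal illustration in Python rather than the code used in the study; the function name `pos_mean`, the patient identifiers, the example values, and the choice to skip unscorable spots are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def pos_mean(spots):
    """Average the "pos" score (percentage of positively staining cells)
    over all cancer spots that belong to the same patient.

    spots: iterable of (patient_id, pos) pairs, one pair per TMA spot.
    Returns a dict mapping patient_id -> pos.mean.
    """
    by_patient = defaultdict(list)
    for patient_id, pos in spots:
        if pos is not None:  # assumption: unscorable spots are skipped
            by_patient[patient_id].append(pos)
    return {pid: mean(values) for pid, values in by_patient.items()}


# Hypothetical values for illustration only, not data from the study.
example = [("P1", 80.0), ("P1", 60.0), ("P1", 70.0), ("P2", 10.0), ("P2", None)]
scores = pos_mean(example)
```

Here patient P1 has three scored spots and P2 has one scored and one unscorable spot, so each patient ends up with exactly one pos.mean value.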
Validation data analysis
To validate our WGCNA* and COX mortality group definitions, we selected all Affymetrix HG-U133A gene expression data sets from the Gene Expression Omnibus (GEO) that were published in 2005 or later. This resulted in three independent data sets, published in 2005-2006, with the following GEO identifiers: Miller 2005 - GSE3494 (251 arrays), Pawitan 2005 - GSE1456 (159 arrays), and Sotiriou 2006 - GSE2990 (189 arrays) [[24–26]]. Data sets were pre-processed as described elsewhere. Briefly, within each data set we evaluated array quality by comparing inter-array correlations. Arrays with low inter-array correlation were removed according to default recommendations. When expression analysis was distributed across multiple centers, we checked for center-related batch effects. If batch effects were present, we removed them using the ComBat function. The pre-processing steps removed 3-12% of arrays in an unbiased fashion, leaving 222, 146 and 183 arrays for the Miller 2005, Pawitan 2005 and Sotiriou 2006 data sets, respectively. Finally, we removed all samples with missing survival data, leaving 207, 146 and 173 patients for the Miller, Pawitan and Sotiriou data sets, respectively.
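The array-quality step can be illustrated with a short Python sketch. The text does not state the exact removal rule, so the criterion used here (flag an array whose mean inter-array Pearson correlation lies more than `n_sd` standard deviations below the average over all arrays) is an assumption, as are the function names and the toy data. Batch-effect removal is not shown.

```python
from math import sqrt
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def low_correlation_arrays(arrays, n_sd=2.0):
    """Return the names of arrays whose mean inter-array correlation is more
    than n_sd standard deviations below the average (assumed cutoff rule).

    arrays: dict mapping array name -> list of expression values.
    """
    names = list(arrays)
    mean_iac = {
        a: mean(pearson(arrays[a], arrays[b]) for b in names if b != a)
        for a in names
    }
    centre = mean(mean_iac.values())
    spread = stdev(mean_iac.values())
    return [a for a in names if mean_iac[a] < centre - n_sd * spread]


# Toy data: seven arrays that are rescaled copies of one profile and one
# array (A8) with a scrambled profile. Hypothetical values only.
base = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
arrays = {f"A{k}": [k * v + 0.5 for v in base] for k in range(1, 8)}
arrays["A8"] = [6.0, 1.0, 5.0, 2.0, 4.0, 3.0]
flagged = low_correlation_arrays(arrays)
```

In this toy example only the scrambled array A8 is flagged. With real data the cutoff would need to follow the recommendations cited by the authors rather than the value assumed here.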
Univariate Cox proportional-hazards models were constructed for the WGCNA* patient groups and COX rule patient groups for each of the three data sets. We used a moderate significance level of 0.1 to allow for expected expression differences between genes and proteins.
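For readers unfamiliar with the model, a univariate Cox fit for a binary group indicator can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not the authors' analysis code: it uses the Breslow form of the partial likelihood, Newton-Raphson optimization, and a two-sided Wald test, none of which are specified in the text. The function name `univariate_cox` and the example data are hypothetical.

```python
from math import erfc, exp, sqrt


def univariate_cox(times, events, x, max_iter=50, tol=1e-9):
    """Fit h(t | x) = h0(t) * exp(beta * x) by Newton-Raphson on the Breslow
    partial likelihood.

    Returns (beta, standard error, two-sided Wald p-value).
    """
    n = len(times)
    beta, info = 0.0, 0.0
    for _ in range(max_iter):
        score, info = 0.0, 0.0
        for i in range(n):
            if not events[i]:
                continue  # censored subjects enter only through risk sets
            risk = [j for j in range(n) if times[j] >= times[i]]
            s0 = sum(exp(beta * x[j]) for j in risk)
            s1 = sum(x[j] * exp(beta * x[j]) for j in risk)
            s2 = sum(x[j] ** 2 * exp(beta * x[j]) for j in risk)
            score += x[i] - s1 / s0                # first derivative
            info += s2 / s0 - (s1 / s0) ** 2       # minus second derivative
        if info <= 0.0:
            raise ValueError("the data carry no information about beta")
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    se = 1.0 / sqrt(info)            # information from the last iterate
    z = beta / se
    p_value = erfc(abs(z) / sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return beta, se, p_value


# Hypothetical follow-up times, event indicators (1 = death, 0 = censored)
# and group labels (1 = predicted high-mortality group).
times = [2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
events = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]
group = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]

beta, se, p_value = univariate_cox(times, events, group)
significant = p_value < 0.1  # the 0.1 significance level used in the text
```

A positive `beta` corresponds to a higher hazard in the group coded 1, and `exp(beta)` is the estimated hazard ratio. In practice an established survival analysis package would be preferable to a hand-written optimizer.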