Inter-observer reproducibility of HER2 immunohistochemical assessment and concordance with fluorescent in situ hybridization (FISH): pathologist assessment compared to quantitative image analysis

Background In breast cancer patients, HER2 overexpression is routinely assessed by immunohistochemistry (IHC) and equivocal cases are subject to fluorescent in situ hybridization (FISH). Our study compares HER2 scoring by histopathologists with automated quantitation of staining, and determines the concordance of IHC scores with FISH results. Methods A tissue microarray was constructed from 1,212 invasive breast carcinoma cases with linked treatment and outcome information. IHC slides were semi-quantitatively scored by two independent pathologists on a range of 0 to 3+, and also analyzed with an Ariol automated system by two operators. 616 cases were scorable by both IHC and FISH. Results Using data from unequivocal positive (3+) or negative (0, 1+) results, both visual and automated scores were highly consistent: there was excellent concordance between two pathologists (kappa = 1.000, 95% CI: 1-1), between two machines (kappa = 1.000, 95% CI: 1-1), and between both visual and both machine scores (kappa = 0.898, 95% CI: 0.775–0.979). Two pathologists successfully distinguished negative, positive and equivocal cases (kappa = 0.929, 95% CI: 0.909–0.946), with excellent agreement with machine 1 scores (kappa = 0.835, 95% CI: 0.806–0.862; kappa = 0.837, 95% CI: 0.81–0.862), and good agreement with machine 2 scores (kappa = 0.698, 95% CI: 0.6723–0.723; kappa = 0.709, 95% CI: 0.684–0.732), whereas the two machines showed good agreement (kappa = 0.806, 95% CI: 0.785–0.826). When comparing categorized IHC scores and FISH results, the agreement was excellent for visual 1 (kappa = 0.814, 95% CI: 0.768–0.856), good for visual 2 (kappa = 0.763, 95% CI: 0.712–0.81) and machine 1 (kappa = 0.665, 95% CI: 0.609–0.718), and moderate for machine 2 (kappa = 0.535, 95% CI: 0.485–0.584). Conclusion A fully automated image analysis system run by an experienced operator can provide results consistent with visual HER2 scoring. 
Further development of such systems will likely improve the accuracy of detection and categorization of membranous staining, making this technique suitable for use in quality assurance programs and eventually in clinical practice.


Background
HER2/neu (also known as c-erbB-2) is a member of the ErbB protein family, more commonly known as the epidermal growth factor receptor (EGFR) family. The HER2 protein is a cell membrane surface-bound receptor tyrosine kinase that is involved in signal transduction pathways leading to cell growth and differentiation [1]. HER2 is a proto-oncogene located on the long arm of human chromosome 17 (17q11.2-q12). Overexpression of the protein, typically caused by amplification of the HER2 gene, leads to constitutive activity of the HER2 receptor and breast tumor development through enhanced cell proliferation, survival, motility and adhesion [2]. HER2 gene amplification has been reported in 10-35% of invasive breast carcinomas, and it is associated with an aggressive disease course, increased disease recurrence, and decreased disease-free and overall survival in lymph node-positive patients [2][3][4][5]. In addition to its prognostic role, HER2 has now become more important as a predictive marker of treatment response to Trastuzumab, a humanized murine monoclonal antibody to the HER2 protein.
In 1998, Trastuzumab (marketed as Herceptin, Genentech, San Francisco, California, USA) was approved for the targeted therapy of HER2-overexpressing metastatic breast cancer patients by the Food and Drug Administration (FDA) of the USA, and it has also recently been shown to be very effective in the adjuvant setting [2].
The effectiveness of Herceptin therapy depends on accurately evaluating HER2 status, which can be done either by immunohistochemical (IHC) assessment of HER2 protein expression or by evaluating HER2 gene amplification using in situ hybridization (ISH), most commonly fluorescent ISH (FISH). FISH shows excellent sensitivity and specificity in detecting HER2 gene amplification [6]. IHC assessment of HER2 status is an inexpensive and relatively standardized method that can be performed in all pathology laboratories. Of the various HER2 antibodies available, the FDA-approved Dako HercepTest (Dako, Glostrup, Denmark) has been considered the most reliable [7]. However, new antibodies such as the Ventana PATHWAY anti-HER2/neu (4B5) rabbit monoclonal antibody also provide excellent sensitivity, specificity, and interlaboratory reproducibility [8]. Based on the determination of staining intensity and percentage of cells with complete membrane staining, the results are scored semiquantitatively on a range of 0 to 3+. According to these four-tier criteria, 0 and 1+ scores are considered negative, a 3+ score is positive, while 2+ is equivocal (weakly positive) and requires confirmation by FISH [9][10][11]. The intraobserver reproducibility is generally satisfactory for both the percentage of positive cells and membrane staining [12][13][14][15]. The inter-observer agreement is excellent for scoring classes 0, 1+ and 3+ [11,16,17,18,19]. However, the determination of staining intensity and percentage of cells with complete membrane staining is subjective. This results in high inter-observer variability in assigning a 2+ score [11,17,20,21] and in discriminating between 2+ and 3+ classes [12]. Consequently, this leads to a high rate of false positives for intermediate IHC scores [22][23][24]. According to the HercepTest guidelines, cases with more than 10% of tumor cells showing strong circumferential membrane staining are classified as 3+. 
The American Society of Clinical Oncology/College of American Pathologists (ASCO/CAP) guidelines recommend using a 30% cut-off, in order to decrease the incidence of false positive cases [25].
It has been suggested that the use of digital microscopy improves the accuracy and inter-observer reproducibility of HER2 IHC analysis. Digital measurement of staining intensity is more accurate than assessment with a human eye because it is not influenced by factors such as the ambient light or pathologist fatigue [26,27]. We have recently shown that automated quantitation of estrogen receptor (ER) immunostaining yields results that do not differ from human scoring against dextran-coated charcoal biochemical assay and the most important clinicopathologic correlate, patient outcome [28]. Consistent, objective and reproducible results for HER2 assessment can be generated by a number of available automated scoring systems such as the automated cellular imaging system (ACIS) (ChromaVision, Inc, San Juan Capistrano, California, USA) [29,30] optimized for use with Dako HercepTest, Micrometastasis Detection System (MDS, Applied Imaging, San Jose, California, USA) [31], Extended Slide Wizard (Tripath Imaging, Inc. Burlington, North Carolina, USA) and others [32][33][34].
To determine the inter-observer variability, we have compared results of visual and automated scoring of HER2 immunostaining on TMAs constructed from invasive breast carcinomas, with data from 1,413 cases used for FISH analysis. 616 cases were scorable by both methods. We then evaluated the concordance of IHC and FISH results and performed Kaplan-Meier survival analysis to determine the prognostic significance of different analyses of HER2 status.

Methods
In this study, we used IHC data from 1,212 patients and FISH data from 616 patients. The data were derived from a series of 4,046 cases of invasive breast carcinoma diagnosed in 1986-1992, referred to the British Columbia Cancer Agency (BCCA) for treatment, and assembled into 17 tissue microarray (TMA) blocks. Ethical approval for the study was obtained from the Clinical Research Ethics Board of the BCCA [28]. Previously frozen breast cancer tissue samples were fixed in 10% neutral buffered formalin, embedded in paraffin and used to construct TMAs consisting of 0.6 mm tissue cores using a manual arrayer (Beecher Instruments, Inc., Silver Spring, Maryland, USA) as previously described [35,36].
From each TMA block, 4 μm thick sections were cut and immunostained on a Ventana Benchmark XT staining system (Ventana Medical Systems, Tucson, Arizona, USA). Sections were deparaffinized in xylene, dehydrated through three alcohol changes and transferred to Ventana Wash solution. Endogenous peroxidase activity was blocked in 3% hydrogen peroxide. Slides were then incubated with Ventana PATHWAY anti-HER2/neu (4B5) rabbit monoclonal antibody at 37°C for 32 min and developed in DAB for 10 min. Finally, sections were counterstained with hematoxylin and mounted. HER2 was scored visually by two independent pathologists (BG, GT) according to the HercepTest guidelines: 0 (negative): no staining is observed, or membrane staining is observed in <10% of the tumor cells; 1+ (negative): a faint/barely perceptible membrane staining is detected in >10% of tumor cells; the cells exhibit incomplete membrane staining; 2+ (weakly positive, equivocal): a weak to moderate complete membrane staining is observed in >10% of tumor cells; and 3+ (strongly positive): a strong complete membrane staining is observed in >10% of tumor cells. Only six 3+ cases (0.5%) showed heterogeneous staining, i.e. would have been interpreted as 2+ by ASCO/CAP guidelines. Therefore, the choice of scoring system is unlikely to affect the results and conclusions. Scores were entered into a standardized Excel worksheet with a sector map matching each TMA section. Cases were not included in the statistical analysis if there was no tumor tissue in the cores or the cores were cut through. Original scoring grids were converted to tables using Deconvoluter 1.10 [37] and combined in a single text file with TMA-Combiner 1.00 [38]. The resulting text files were imported into SPSS 15.0 and R 2.4.0 for Windows [39].
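The HercepTest criteria quoted above can be written as a small decision rule. The following Python sketch is illustrative only: the function names, the intensity labels and the handling of cases the text does not cover (exactly 10% of cells, or strong but incomplete staining) are assumptions made for the example, and it is not the algorithm used by the pathologists or by the Ariol software.

```python
# Illustrative sketch of the HercepTest criteria quoted in the text.
# All names are hypothetical; this is not the Ariol scoring algorithm.

def herceptest_score(percent_stained: float, intensity: str, complete: bool) -> str:
    """Return '0', '1+', '2+' or '3+' for one tumor sample.

    percent_stained: percentage of tumor cells with membrane staining (0-100)
    intensity: 'none', 'faint', 'moderate' (weak to moderate) or 'strong'
    complete: True if the membrane staining is circumferential
    """
    # Assumption: exactly 10% counts as meeting the threshold, because the
    # text only specifies "<10%" and ">10%".
    if intensity == "none" or percent_stained < 10:
        return "0"
    # Assumption: any incomplete or faint staining is scored 1+.
    if not complete or intensity == "faint":
        return "1+"
    if intensity == "strong":
        return "3+"
    return "2+"


def ihc_category(score: str) -> str:
    """Collapse the four-tier score into negative / equivocal / positive."""
    return {"0": "negative", "1+": "negative",
            "2+": "equivocal", "3+": "positive"}[score]


if __name__ == "__main__":
    s = herceptest_score(45, "moderate", True)
    print(s, ihc_category(s))
```

The three-way grouping returned by `ihc_category` corresponds to the categorized variables ({0, 1+} versus {2+} versus {3+}) used in the agreement analyses.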
The same slides were digitized with a commercial image analysis system Ariol (Applied Imaging Inc., San Jose, California, USA). For clinical lab applications, Ariol has received FDA clearance as an aid to pathologists in the detection, classification, and counting of cells of a particular color, intensity, size, pattern, and shape. Applied Imaging has received additional FDA 510(k) clearances for specific applications, including immunohistochemical assessment of HER2 in breast cancer. The Ariol system is based on an Olympus microscope with motorized stage and autofocus capabilities, and equipped with a black and white video camera. We regularly performed bright-field calibration using the Calibration slide to ensure accurate scanning and analysis. The system was set to Köhler illumination to capture high quality images. Slides were scanned at 20× objective magnification with three filters: red, green and blue. Ariol software, which converts these three-channel images into color reconstructions, was used for image analysis. The program was trained by a pathologist (DT) using representative cores containing areas that would be scored as 1+ and 3+ visually. Using the color pickup tool within the Ariol image analyzer, we selected membranes with weak positive staining and assigned "1+ intensity"; we then selected the membranes with strong positivity and assigned "3+ intensity". Similarly, we selected counterstained nuclei with the color pickup tool, and adjusted the desired size, roundness and other shape parameters under visual control. Numeric values for colors of the positive objects, i.e. membranes, and negative objects, i.e. nuclei, were stored on the hard drive in a color classifier file. Numeric values for the shape of the nuclei were stored in a separate shape classifier file. The program used these two files for segmentation of the nuclei and the membranes in all other cores, and the same two files were sent out to be used on machine 2. 
Scores from a "0" to a "3+" were automatically generated by the Ariol image analysis software for each core, based on the intensity and completeness of the positively stained membranes, and the percent of positive cells. The Ariol algorithm applies HercepTest criteria for the score calculations. Visual examples and a graphical explanation are given in Figure 1. The training step increases the specificity of the analysis as it ensures that extracellular matrix and most stromal cells are excluded from image analysis, and it allows the program to calculate percent of positive tumor cells more precisely. After the program training on one of the representative TMA cores, the rest of the analysis was performed without human supervision. All tissue cores were analyzed in toto; no specific pathologist selection of tumor tissue within the cores was made following the training step. For statistical analysis, we selected only cores with at least 50 tumor cells detected, i.e. all cores with fewer than 50 cells were considered unscorable. To estimate the demands placed on the operator of the Ariol system, the same slides were scanned and processed on an identical Ariol system by an operator with less than one week of experience working with this particular Ariol script (KM). The descriptors of the color and shape of the positive and negative tumor cells were transferred from one system to the other; therefore, variations in the image analysis results depended only on the scanner settings, i.e. brightfield calibration, positioning and white balance, but not on the image analysis settings.
The hematoxylin and eosin and IHC images of all cores used in this study are publicly available at the companion site [40]. The site was constructed using Genetic Pathology Evaluation Centre (GPEC) database and a Java applet provided by Bacus Laboratories, Inc. All slides were scanned with a BLISS scanner (Bacus Laboratories, Inc., Lombard, Illinois, USA), and posted on the site. WebSlide Browser for Windows (Bacus Laboratories, Inc., Lombard, Illinois, USA) can be used for viewing preview images of the arrays and images of individual cores.
Figure 1 Schematic illustration of automated HER2 scoring. a) Image analysis system Ariol (Applied Imaging Inc., San Jose, CA). b) Training window displaying the 3+ membrane and nuclear colors with fill mask. c) Outline of membrane as detected by the color classifier for the 3+ membrane color class. d) The border mask of nuclei as detected by the color classifier for the 3+ nuclei color class.
Six-micron sections of the TMA slides were hybridized with probes to LSI HER2 and CEP17 with the PathVysion™ HER2 DNA Probe Kit using a modified protocol, as previously described [41]. Analysis of FISH signals was performed using Metasystems™ automated image acquisition and analysis system, Metafer (Metasystems, Altlussheim, Germany). This automated system scores FISH signals by employing specific measurement algorithms to detect and quantify clustered signals. Average copy number for each probe was calculated and the amplification ratio (ratio between the average copy per cell for HER2 and the average copy for centromere 17) determined (MC). HER2 amplification was defined as a HER2/CEP17 ratio of 2.2 or more. A HER2/CEP17 ratio <1.8 was considered negative for HER2 amplification, and a ratio at or near the cut-off (1.8-2.2) was interpreted as equivocal.
Tumors that failed to hybridize were not included in the analysis. We only accepted scores if >40 tiles were counted. With Metafer system, one tile is considered one cell as the size of a tile is approximately the average size of a nucleus. Normal cells were excluded wherever possible, and the corresponding H&E slides were reviewed when needed.
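The FISH cut-offs and the tile-count requirement described above amount to a simple classification rule. The sketch below is a minimal Python illustration under stated assumptions: the function and parameter names are hypothetical, the inputs are taken to be the average HER2 and CEP17 copy numbers per cell, and it does not reproduce the Metafer measurement algorithms.

```python
# Minimal sketch of the FISH interpretation rule described in the text.
# Names are hypothetical; this is not the Metafer software.

def classify_fish(her2_mean_copies: float, cep17_mean_copies: float,
                  tiles_counted: int, min_tiles: int = 40) -> str:
    """Classify one case from average HER2 and CEP17 copy numbers per cell."""
    # Scores were accepted only if more than 40 tiles were counted.
    if tiles_counted <= min_tiles or cep17_mean_copies <= 0:
        return "unscorable"
    ratio = her2_mean_copies / cep17_mean_copies
    if ratio >= 2.2:
        return "amplified"
    if ratio < 1.8:
        return "non-amplified"
    return "equivocal"  # ratio from 1.8 up to, but not including, 2.2


if __name__ == "__main__":
    print(classify_fish(6.0, 2.0, tiles_counted=60))  # ratio 3.0
    print(classify_fish(4.0, 2.0, tiles_counted=60))  # ratio 2.0
    print(classify_fish(3.0, 2.0, tiles_counted=60))  # ratio 1.5
```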
For statistical analysis, we used data from 1,212 patients for the IHC and 616 patients for the IHC/FISH comparisons. Exclusion criteria included core drop-off during processing, insufficient or absent tumor tissue within the cores, and artifactual distortion of the tissue making discrimination of cellular structure impossible. Statistical analysis was performed in SPSS 15.0 for Windows (SPSS Inc., Chicago, Illinois) and R 2.4.0 [39]. All tests were two-sided and used a 5% alpha level to determine significance. 95% bootstrapped confidence intervals were calculated using the adjusted bootstrap percentile (bias-corrected and accelerated) method [42]. Breast cancer specific survival was estimated using Kaplan-Meier curves and survival differences were determined by log-rank tests. We used the open-source R 2.4.0 package to calculate differences between kappa statistics from visual to automated scoring comparisons; a permutation test with 10,000 permutations was implemented.
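To make the agreement statistic concrete, the following Python sketch computes Cohen's kappa with optional linear weights and a bootstrap confidence interval. It is a simplified illustration, not the analysis code of this study: the analysis reported here was run in SPSS and R, the choice of linear weights is an assumption (the weighting scheme is not stated in the text), and the interval below uses the plain percentile bootstrap rather than the bias-corrected and accelerated method. All function names are hypothetical.

```python
import random

def weighted_kappa(a, b, categories, weighted=True):
    """Cohen's kappa for two raters; linear disagreement weights if weighted."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(a)
    obs = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        obs[idx[x]][idx[y]] += 1
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    def w(i, j):
        # Disagreement weight: 0 on the diagonal, larger further from it.
        if weighted and k > 1:
            return abs(i - j) / (k - 1)
        return 0.0 if i == j else 1.0

    d_obs = sum(w(i, j) * obs[i][j] for i in range(k) for j in range(k)) / n
    d_exp = sum(w(i, j) * row[i] * col[j]
                for i in range(k) for j in range(k)) / (n * n)
    if d_exp == 0:
        return 1.0  # degenerate case: both raters used a single category
    return 1.0 - d_obs / d_exp


def bootstrap_ci(a, b, categories, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for kappa (simpler than the BCa method)."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        sample = [rng.randrange(n) for _ in range(n)]  # resample cases
        stats.append(weighted_kappa([a[i] for i in sample],
                                    [b[i] for i in sample], categories))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    cats = ["negative", "equivocal", "positive"]
    r1 = ["negative"] * 30 + ["equivocal"] * 10 + ["positive"] * 10
    r2 = ["negative"] * 28 + ["equivocal"] * 12 + ["positive"] * 10
    print(weighted_kappa(r1, r2, cats), bootstrap_ci(r1, r2, cats))
```

When two raters agree on every case, every bootstrap resample also shows perfect agreement, so the interval collapses to a single point; this is why a kappa of 1.000 can be reported with a 95% CI of 1-1.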

IHC and FISH results
The number of cases scorable by all four observers (visual or machine) on IHC slides, regardless of FISH status, was 1,212 (30%). Of 4,046 cases analyzed, FISH was successfully performed in 1,413 cases (34.9%). Of 1,413 FISH scorable cases, HER2 was amplified (HER2/CEP17 ratio of 2.2 or more) in 252 cases (17.8%). Borderline HER2 amplification (HER2/CEP17 ratio 1.8-2.2) was seen in 77 cases (5.4%), and 1,084 cases (76.7%) were found to be non-amplified (HER2/CEP17 ratio <1.8). The number of cases scorable by both IHC and FISH, including FISH equivocal cases, was 616. Table 1 shows the full breakdown of data by FISH and IHC scored by the four observers.
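The case counts and percentages reported above can be checked with a few lines of arithmetic. The Python snippet below is only a consistency check on the numbers given in the text; the variable names are illustrative.

```python
# Consistency check of the reported counts and percentages.
total_cases = 4046
ihc_scorable_all_observers = 1212
fish_scorable = 1413
fish_counts = {"amplified": 252, "borderline": 77, "non-amplified": 1084}

# The three FISH categories should add up to the FISH-scorable total.
fish_sum = sum(fish_counts.values())

def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

fish_pct = {name: pct(count, fish_scorable) for name, count in fish_counts.items()}
ihc_pct = pct(ihc_scorable_all_observers, total_cases)
fish_success_pct = pct(fish_scorable, total_cases)

if __name__ == "__main__":
    print(fish_sum, fish_pct, ihc_pct, fish_success_pct)
```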

Analysis of HER2 IHC inter-observer variability by Kappa statistics
Inter-observer variability was estimated by comparing the visual scores of two pathologists, and the automated scores generated by two operators on two different Ariol hardware systems. Categorized variables ({0, 1+} versus {2+} versus {3+}) from 1,212 patients were compared using weighted kappa statistics (R function wkappa(ψ)) (Table 3).
We also performed a kappa permutation test to assess whether the HER2 IHC scores differed in their ability to match the gold standard. This test included categorized variables (n = 352) to assess the ability of the HER2 score to indicate negative (0, 1+) versus equivocal (2+) versus positive (3+) cases where visual 1 IHC score is the gold standard (Table 4). The permutation test could not be done for binarized IHC scores because there were only 229 cases available for analysis when visual 1 IHC was used as the gold standard, and 382 cases were available when FISH was used as the gold standard. There were no discrepant cases between visual 1 and visual 2, with only one discrepant case between both visual scores and both machines (Table 5). This was likely caused by the large number of 2+ scores excluded (n = 234) and low number of 3+ scores (n = 6) available for this analysis. Therefore, the proportion of HER2-positive and HER2-negative cases was not fairly represented for the concordance analysis of the binarized data.
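One way to test whether two scoring methods differ in their agreement with a gold standard is a paired permutation test on the difference between their kappa values. The Python sketch below is an assumption about how such a test can be set up, not a reproduction of the R code used in this study: it uses unweighted kappa for brevity, and it builds the null distribution by randomly swapping the two methods' scores within each case. All names are hypothetical.

```python
import random

def kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    n = len(a)
    labels = set(a) | set(b)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    p_exp = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return 1.0 if p_exp == 1 else (p_obs - p_exp) / (1 - p_exp)


def kappa_difference_test(method1, method2, gold, n_perm=10000, seed=0):
    """Two-sided permutation p-value for kappa(method1, gold) - kappa(method2, gold).

    Assumption: under the null hypothesis the two methods are exchangeable
    within each case, so their scores are swapped with probability 0.5.
    """
    rng = random.Random(seed)
    observed = kappa(method1, gold) - kappa(method2, gold)
    extreme = 0
    for _ in range(n_perm):
        p1, p2 = [], []
        for x, y in zip(method1, method2):
            if rng.random() < 0.5:
                x, y = y, x
            p1.append(x)
            p2.append(y)
        diff = kappa(p1, gold) - kappa(p2, gold)
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (extreme + 1) / (n_perm + 1)


if __name__ == "__main__":
    gold = ["neg"] * 20 + ["eq"] * 10 + ["pos"] * 10
    visual = list(gold)
    machine = ["neg"] * 14 + ["eq"] * 16 + ["pos"] * 10
    print(kappa_difference_test(visual, machine, gold, n_perm=1000))
```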

Concordance of IHC and FISH results by Kappa statistics
The clinical consequences of using a machine for HER2 scoring are summarized in Table 6. Automated scoring on the Ariol machine would result in more 2+ scores (2-3 times as many as visual scoring) with a consequent increase of FISH assessments in clinical practice.

Kaplan-Meier survival analysis
For 1,212 patients whose tissue cores were scorable by all four observers on IHC slides, median age at diagnosis was 59 years, and median follow-up time was 12.24 years. Clinical-pathological characteristics of these patients are summarized in Table 7.
Kaplan-Meier survival analysis of cases stratified based on the HER2 status, as determined by visual or machine scoring of the immunostained slides, is shown in Figure 2.
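For readers unfamiliar with the method, the Kaplan-Meier estimate multiplies, at each observed event time, the fraction of patients at risk who survive that time. The Python sketch below is a minimal illustration with made-up follow-up times; it is not the code or the data behind Figure 2, and the function name is hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times: follow-up time for each patient
    events: 1 if the event (e.g. breast cancer death) occurred, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving  # events and censored cases both leave the risk set
    return curve


if __name__ == "__main__":
    # Hypothetical follow-up times in years; 0 marks a censored patient.
    print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
```

In a stratified analysis, the same estimate is computed separately for each HER2 category and the resulting curves are compared with a log-rank test.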

Discussion
In breast cancer patients, determination of prognosis and treatment strategies based on HER2 status greatly depends on the accurate evaluation of HER2 overexpression by IHC and/or FISH. HER2 immunohistochemistry is an inexpensive method that can be performed readily in all pathology laboratories on either standard paraffin sections or TMA sections [43]. However, the best methods, reagents, and cut-off points for determining HER2 status are still debated [25,28,44,45,46]. TMAs are useful for the assessment of automated unsupervised image analysis systems because of the careful selection of the areas of interest, the identical staining conditions for all cores on a single slide, and the small size of the tissue cores representable by a single image [37,38,47]. Problems inherent in TMA studies include taking cores from the non-cancerous areas, and a loss of cores during the staining procedure. We analyzed the results of visual (two pathologists) and automated (two operators on the Ariol image analysis system) scoring of HER2 immunostaining. Since only cores with at least 50 tumor cells detected were considered scorable on the Ariol system, the number of cases scorable by all four observers was 1,212. FISH was successfully performed in 1,413 cases (34.9%) with an amplification rate of 17.8%, which is within the reported range of 10-35% [2][3][4][5].
Since the evaluation of staining intensity and percentage of cells with complete membrane positivity is subjective, the inter-observer variability tends to be higher for scoring 2+ cases [11,17,20,21] and discriminating 1+ and 2+ [48] or 2+ and 3+ cases [12]. The percentage of disagreement in intraobserver reproducibility ranges from 0.9% to 3.7%. It is recommended that two expert pathologists evaluate all slides with a double-blind method and discuss discordant cases [49]. In our study, the inter-observer agreement was excellent for categorized variables (0, 1+ versus 2+ versus 3+) between the two pathologists (kappa = 0.929, 95% CI: 0.909-0.946). The first machine scores also showed excellent agreement with both pathologists (kappa = 0.835, 95% CI: 0.806-0.862; kappa = 0.837, 95% CI: 0.81-0.862). The worst concordance for categorized variables was observed between the second machine [...] also leads to more 1+ cases in comparison to visual scoring, which may give rise to more FISH-amplified cases to be scored as 1+ (negative). However, this would not change patient management for 0 and 1+ cases as these are both interpreted as negative.

Conclusion
The present study shows that fully automated image analysis with a system operated by an experienced operator, but without continuous human supervision, can provide results consistent with the scoring of HER2 immunostaining by pathologists. The inter-observer agreement was excellent between the two pathologists and between the experienced operator and the pathologists for both binarized and categorized HER2 scores, as well as between the two machines for binarized scores. There was a good agreement between the two machines, and between the less experienced operator and the pathologists for categorized HER2 scores. We have previously reported that automated quantitation of ER immunostaining on the same TMA series can produce results that do not differ from pathologist scoring and dextran-coated charcoal biochemical assay [28]. Unlike ER quantitation, automated scoring of HER2 staining on the Ariol system did not provide excellent agreement between machine scores or the gold standard FISH. Although Kaplan-Meier analysis showed similar accuracy of visual and machine scores in assessment of prognostic significance of HER2 status for categorized IHC variables, the automated quantitation could not distinguish 2+ scores better than the pathologists. It resulted in more 2+ cases which would lead to more FISH assessments in clinical practice. Further development of image analysis systems will likely improve the accuracy of detection and categorization of membranous staining in histological sections, making this technique more sensitive, specific and thus suitable for use in quality assurance programs.