Inference of core needle biopsy whole slide images requiring definitive therapy for prostate cancer
BMC Cancer volume 23, Article number: 11 (2023)
Prostate cancer is often a slowly progressive, indolent disease. Unnecessary treatment resulting from overdiagnosis is a significant concern, particularly for low-grade disease. Active surveillance has been considered as a risk management strategy to avoid the potential side effects of unnecessary radical treatment. In 2016, the American Society of Clinical Oncology (ASCO) endorsed the Cancer Care Ontario (CCO) Clinical Practice Guideline on active surveillance for the management of localized prostate cancer.
Based on this guideline, we developed a deep learning model to classify prostate adenocarcinoma into indolent (applicable for active surveillance) and aggressive (necessary for definitive therapy) on core needle biopsy whole slide images (WSIs). In this study, we trained deep learning models using a combination of transfer, weakly supervised, and fully supervised learning approaches using a dataset of core needle biopsy WSIs (n=1300). In addition, we performed an inter-rater reliability evaluation on the WSI classification.
We evaluated the models on a test set (n=645), achieving ROC-AUCs of 0.846 for indolent and 0.980 for aggressive. The inter-rater reliability evaluation showed S-scores in the range of 0.10 to 0.95, with the lowest on the WSIs that the model classified as both indolent and aggressive, and the highest on benign WSIs.
The results demonstrate the promising potential of deployment in a practical prostate adenocarcinoma histopathological diagnostic workflow system.
According to the Global Cancer Statistics 2020, prostate cancer was the second most frequent cancer and the fifth leading cause of cancer death among men in 2020. Prostate cancer is the most frequently diagnosed cancer in men in over one half (112 of 185) of the countries of the world. Therefore, it is necessary to define optimum strategies for the detection, treatment, and follow-up of prostate cancer patients. In recent years, pathologists have performed the histopathological diagnosis of prostate cancer based on Gleason pattern quantities, tumor growth patterns, and clinical practice advancements (e.g., multiparametric magnetic resonance imaging (mpMRI) targeted biopsy and fusion ultrasound/magnetic resonance imaging biopsy). Standard active treatments for prostate cancer include hormone therapy, radiotherapy, and radical prostatectomy. However, to avoid the unnecessary side effects associated with overdiagnosis and overtreatment, active surveillance is an important option for low-grade prostate cancer patients with reduced mortality risk [2, 4]. Active surveillance consists of regular follow-up of patients so that appropriate radical treatment can be provided to high-risk groups if necessary. The criteria for active surveillance are highly controversial [2,3,4,5,6]. According to the Cancer Care Ontario (CCO) Guideline and the American Society of Clinical Oncology (ASCO) Clinical Practice Guideline, it is generally accepted that active surveillance is applied when prostate cancer is diagnosed by biopsy and Gleason pattern 4 components account for less than 10% of the total cancer volume. However, inter-observer agreement on the Gleason score is not always high, and the inter-observer reproducibility of Gleason grading by general pathologists is often a problem [7,8,9,10].
Although International Society of Urological Pathology (ISUP) is making efforts to improve inter-observer agreement and equalize diagnostic quality for general pathologists by publishing consensus reviewing cases (https://isupweb.org/pib/), there are still cases that are not in agreement among pathologists in routine clinical practice.
In computational pathology, deep learning models have been widely applied to histopathological cancer classification on WSIs, cancer cell detection and segmentation, and the stratification of patient outcomes [11,12,13,14,15,16,17,18,19,20,21,22,23,24,25]. Recently, it has been reported that an artificial intelligence (AI)-powered platform used as a clinical decision support tool was able to detect, grade, and quantify prostate cancer with high accuracy and efficiency and was associated with significant reductions in inter-observer variability [26, 27]. In the Prostate cANcer graDe Assessment (PANDA) challenge, a global AI competition, a group of AI Gleason grading algorithms generalized well to intercontinental and multinational cohorts with pathologist-level performance. Other works [23, 28,29,30,31,32,33,34] have also looked into developing deep learning algorithms to classify prostate cancer Gleason scores based on histopathological images.
In this study, we investigated deep learning models to classify prostate adenocarcinoma into two classes based on the clinical response: indolent (applicable for active surveillance) and aggressive (necessary for definitive therapy). To define the criteria for indolent and aggressive, we referred to the CCO and ASCO guidelines and set the cut-off value at 20% Gleason pattern 4 and 5 components in total prostate adenocarcinoma (Fig. 1), rather than the 10% cut-off value proposed by CCO and ASCO, to reduce the possibility of inter-observer variability. To the best of our knowledge, this is the first study to establish a deep learning model that infers the suitability of active surveillance from prostate core needle biopsy histopathology whole slide images (WSIs). We trained deep learning models using a combination of transfer learning, weakly supervised learning, and fully supervised learning approaches and evaluated the trained models on a core needle biopsy test set, achieving ROC-AUCs of 0.846 (indolent) and 0.980 (aggressive). These findings suggest that it would be possible not only to detect adenocarcinoma on biopsy WSIs, but also to predict patients’ optimum clinical interventions (active surveillance or definitive therapy).
Materials and methods
Clinical cases and pathological records
This is a retrospective study. A total of 2,285 H&E (hematoxylin & eosin) stained histopathological core needle biopsy specimen slides of human prostate adenocarcinoma and benign (non-neoplastic) lesions (1,321 adenocarcinoma and 964 benign) were collected from the surgical pathology files of Kamachi Group Hospitals (Shinyukuhashi, Wajiro, and Shinkuki Hospitals) (Fukuoka, Japan) and Sapporo-Kosei General Hospital (Sapporo, Japan), after histopathological review of all specimens by surgical pathologists in each hospital. In Kamachi Group Hospitals, the histopathological specimens were selected randomly to reflect real clinical settings as much as possible. In Sapporo-Kosei General Hospital, only adenocarcinoma specimens were provided. Prior to the experimental procedures, each WSI diagnosis was observed and verified by at least two senior pathologists. All WSIs were scanned at a magnification of x20 using the same Leica Aperio AT2 Digital Whole Slide Scanner (Leica Biosystems, Tokyo, Japan) and were saved in SVS file format with JPEG2000 compression.
Table 1 shows breakdowns of the distribution of the specimens based on the following: all specimens, consensus specimens by two senior pathologists, training set, validation set, and test set of prostate core needle biopsy WSIs from Kamachi Group Hospitals and Sapporo-Kosei General Hospital. According to the Cancer Care Ontario Guideline  and American Society of Clinical Oncology (ASCO), patients with both low-volume (accounting for 10% total tumor) and intermediate-risk (Gleason score 3 + 4 = 7) prostate cancer may be offered active surveillance. At the same time, because of known interobserver variability associated with the identification of minor Gleason pattern 4 components, prospective intradepartmental consultation with other pathologists should be considered for quality assurance . Therefore, in this study, considering clinical responses, we have set two classes for prostate adenocarcinoma: indolent and aggressive. Indolent suggests observation (active surveillance) and aggressive suggests definitive therapy.
In this study, we labelled (classified) prostate adenocarcinoma WSIs as follows. If Gleason pattern 4 and Gleason pattern 5 components account for less than 20% of the total adenocarcinoma on a WSI, the WSI is classified as indolent (Fig. 1). If they account for more than 20%, the WSI is classified as aggressive (Fig. 1). We did not use a global Gleason score. We set the cut-off at 20% of total prostate adenocarcinoma on a WSI (Fig. 1), rather than 10%, to reduce the possibility of interobserver variability, because it has been widely reported that assessment of the percentage of Gleason pattern 4 in minute cancer foci has poor reproducibility among pathologists, especially for poorly formed glands [3, 35, 37,38,39,40]. We made this choice because we wanted to exclude cases with interobserver variability from the test set.
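The labelling rule above can be written as a short function. This is only an illustrative sketch: the function name `label_wsi`, the idea of passing per-pattern tumor areas, and the `"borderline"` return value for exactly 20% are assumptions, since the text does not state how a value of exactly 20% is handled or how the proportions were measured.

```python
def label_wsi(pattern3_area: float, pattern4_area: float, pattern5_area: float,
              cutoff: float = 0.20) -> str:
    """Assign a WSI-level label from hypothetical per-pattern tumor areas."""
    total = pattern3_area + pattern4_area + pattern5_area
    if total == 0:
        # No adenocarcinoma on the slide.
        return "benign"
    # Fraction of Gleason pattern 4 and 5 within total adenocarcinoma.
    high_grade_fraction = (pattern4_area + pattern5_area) / total
    if high_grade_fraction < cutoff:
        return "indolent"
    if high_grade_fraction > cutoff:
        return "aggressive"
    # Assumption: the text leaves the exact-cutoff case undefined.
    return "borderline"
```

For example, a slide with 90 units of pattern 3 and 10 units of pattern 4 would be labelled indolent, while 50 units of pattern 3 with 40 of pattern 4 and 10 of pattern 5 would be labelled aggressive.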
In total, we used indolent, aggressive, and benign as WSI labels for training the deep learning models at the WSI level. During the consensus review by two senior pathologists, 310 adenocarcinoma WSIs were excluded because of low concordance when classified into indolent or aggressive (Table 1). WSIs with low concordance generally involved a borderline proportion of Gleason pattern 4 and 5 components (predominantly between 10% and 20%). The training, validation, and test sets were selected randomly from the consensus WSIs (Table 1).
A senior pathologist, who performs routine histopathological diagnoses in general hospital, manually annotated 100 adenocarcinoma WSIs from the training set. The pathologist carried out annotations by free-hand drawing using an in-house online tool developed by customizing the open-source (OpenSeadragon) tool, which is a web-based viewer for zoomable images. On average, 10-15 lesions were annotated per WSI. The pathologists performed annotations based on the histopathological characteristics of Gleason pattern 3, 4, and 5. For example, well-formed glands with intraluminal crystalloids (Gleason pattern 3) (Fig. 2A), large irregular cribriform glands (Gleason pattern 4) (Fig. 2B), crowded fused glands (Gleason pattern 4) (Fig. 2C), poorly formed small-sized glands with some lumen-formation (Gleason pattern over 4) (Fig. 2D), ductal adenocarcinoma lined by columnar cells with elongated nuclei (Gleason pattern 4) (Fig. 2E), and infiltrating cords and single tumor cells without lumen formation (Gleason pattern 5) (Fig. 2F) were manually annotated. For training step, Gleason pattern 3 annotations were grouped as indolent and Gleason pattern 4 and 5 annotations as aggressive. The pathologist included cancer stroma which surrounds cancer cells in the annotation area. The average annotation time per WSI was about five minutes. All annotations performed by the pathologist were modified (if necessary), confirmed, and verified by a senior pathologist who performs routine histopathological diagnoses in general hospital.
Deep learning models
We trained the models via transfer learning using the partial fine-tuning approach. This is an efficient fine-tuning approach that consists of using the weights of an existing pre-trained model and only fine-tuning the affine parameters of the batch normalization layers and the final classification layer. For the model architecture, we used EfficientNetB1 starting with pre-trained weights on ImageNet. We used similar training methodology as [25, 43]. For clarity, we highlight the main parts below.
We performed tissue detection using Otsu’s thresholding method to exclude the white background. We then extracted tiles only from the tissue regions. During prediction, we extracted tiles from the entire tissue regions using a sliding window with a fixed-size stride. During training, we performed random balanced sampling of tiles, whereby we first randomly sampled three WSIs, one for each label. Then, from each of these WSIs, we randomly sampled an equal number of tiles. For aggressive or indolent WSIs, we randomly sampled from the annotated tissue regions; for benign WSIs, we randomly sampled from all the tissue regions.
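The balanced sampling step can be sketched as follows. The data layout is hypothetical: `tiles_by_wsi` maps a WSI identifier to its candidate tiles (assumed to already be restricted to annotated regions for indolent and aggressive WSIs, and to all tissue regions for benign WSIs), and `labels_by_wsi` maps a WSI identifier to its label. The function name and the handling of small tile pools are likewise assumptions.

```python
import random

def sample_balanced_tiles(tiles_by_wsi, labels_by_wsi, tiles_per_wsi, rng):
    """Pick one WSI per label, then the same number of tiles from each."""
    chosen_wsis = []
    for label in ("benign", "indolent", "aggressive"):
        candidates = sorted(w for w, l in labels_by_wsi.items() if l == label)
        chosen_wsis.append((rng.choice(candidates), label))
    # Assumption: if one WSI has fewer tiles than requested, reduce the
    # count for all three so the batch stays balanced.
    n = min([tiles_per_wsi] + [len(tiles_by_wsi[w]) for w, _ in chosen_wsis])
    batch = []
    for wsi, label in chosen_wsis:
        for tile in rng.sample(tiles_by_wsi[wsi], n):
            batch.append((wsi, tile, label))
    return batch
```

Passing a seeded `random.Random` instance as `rng` keeps the sampling reproducible between runs.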
After a few epochs, we switched to hard mining of tiles where we alternated between training and inference. During inference, the CNN was applied in a sliding window fashion on all of the tissue regions in the WSI, and we then selected the k tiles with the highest probability for being positive. This step effectively selects the tiles that are most likely to be false positives when the WSI is negative. The selected tiles were placed in a training subset, and once that subset contained N tiles, the training was run. We used \(k = 8\), \(N=256\), and a batch size of 32.
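A minimal sketch of this alternation between inference and training is given below, using the stated values \(k = 8\) and \(N = 256\). The input format (a sequence of WSI identifiers paired with `(tile_id, probability)` lists) and both function names are assumptions made for illustration; the actual training call on each subset is omitted.

```python
import heapq

def top_k_tiles(tile_probs, k=8):
    """Return the k (tile_id, probability) pairs with the highest probability."""
    return heapq.nlargest(k, tile_probs, key=lambda pair: pair[1])

def mined_training_subsets(wsi_tile_probs, k=8, n=256):
    """Yield a training subset each time n mined tiles have accumulated."""
    subset = []
    for wsi_id, tile_probs in wsi_tile_probs:
        # Inference step: keep only the k most positive tiles of this WSI.
        subset.extend((wsi_id, tile_id, p) for tile_id, p in top_k_tiles(tile_probs, k))
        # Training step would run on each full subset of n tiles.
        while len(subset) >= n:
            yield subset[:n]
            subset = subset[n:]
```

With \(k = 8\) and \(N = 256\), a training subset fills after 32 WSIs have been processed, which corresponds to 8 mini-batches at a batch size of 32.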
For fully-supervised training, we performed the initial random sampling from annotated regions followed by the hard mining. We refer to this as FS+WS. For weakly-supervised training, we only performed the hard mining as it did not involve any annotations. We refer to this as WS.
To obtain a single prediction for a WSI from the tile predictions, we took the maximum probability over all of the tiles. We used the Adam optimizer, with binary cross-entropy as the loss function and the following parameters: \(\beta_1=0.9\), \(\beta_2=0.999\), a batch size of 32, and a learning rate of 0.001 when fine-tuning. We used early stopping by tracking the performance of the model on a validation set, and training was stopped automatically when there was no further improvement in the validation loss for 10 epochs. We chose the model with the lowest validation loss as the final model.
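The WSI-level aggregation and the early-stopping rule can be illustrated with two small helpers. Both function names are hypothetical, and the early-stopping helper works on a precomputed list of validation losses purely for illustration; in practice the check would run inside the training loop.

```python
def wsi_probability(tile_probs):
    """WSI-level probability for one class: the maximum over its tiles."""
    return max(tile_probs)

def best_epoch_with_early_stopping(val_losses, patience=10):
    """Return (epoch, loss) of the lowest validation loss, stopping once
    `patience` epochs have passed without improvement."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_loss
```

Max aggregation means a single high-probability tile is enough to flag the whole WSI, which is consistent with the false positives on small benign regions described later in the text.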
Inter- and intra-rater reliability studies
To evaluate human pathologists’ inter-rater and intra-rater reliability, the following WSIs were randomly selected from the test set: (i) 25 true-negative WSIs (consensus classification by senior pathologists: benign; deep learning model (TL-Colon poorly ADC (x20, 512) and FS+WS) WSI classification: benign), (ii) 25 true-positive (indolent) WSIs (consensus: indolent; deep learning model: indolent), (iii) 25 false-positive WSIs (consensus: 13 indolent WSIs and 12 aggressive WSIs; deep learning model: all 25 WSIs classified as both indolent and aggressive), and (iv) 25 true-positive (aggressive) WSIs (consensus: aggressive; deep learning model: aggressive) (Table 4). The total of 100 WSIs was randomly shuffled and presented to volunteer pathologists using an in-house online tool developed by customizing the open-source OpenSeadragon tool, which is a web-based viewer for zoomable images. To assess intra-rater reliability (Table 5), we performed the same experiment twice with a one-month gap, randomising the order of WSIs each time. The volunteer pathologists recruited in this study consisted of 5 pathologists with less than 10 years of experience after board certification and 5 pathologists with more than 10 years of experience after board certification (10 pathologists in total) (Table 4).
Software and statistical analysis
The deep learning models were implemented and trained using TensorFlow. AUCs were calculated in Python using the scikit-learn package and plotted using matplotlib. The 95% CIs of the AUCs were estimated using the bootstrap method with 1000 iterations.
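The bootstrap estimate of the AUC confidence interval can be sketched in plain Python as follows. This is not the scikit-learn implementation used in the study: the AUC is computed here through its rank interpretation, and the percentile method, the fixed seed, and the skipping of single-class resamples are all assumptions, since the text only states that the bootstrap was run with 1000 iterations.

```python
import random

def roc_auc(labels, scores):
    """AUC as the probability that a positive scores above a negative
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # assumption: skip resamples that lack one class
        aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_iter)]
    upper = aucs[int((1 - alpha / 2) * n_iter) - 1]
    return lower, upper
```

The pairwise AUC computation is quadratic in the number of samples, so for larger test sets a rank-based or library implementation would be preferable.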
The true positive rate (TPR) was computed as
\[ \mathrm{TPR} = \frac{TP}{TP + FN} \]
and the false positive rate (FPR) was computed as
\[ \mathrm{FPR} = \frac{FP}{FP + TN} \]
where TP, FP, TN, and FN represent true positive, false positive, true negative, and false negative, respectively. The ROC curve was computed by varying the probability threshold from 0.0 to 1.0 and computing both the TPR and FPR at the given threshold.
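The threshold sweep behind the ROC curve can be illustrated with a short sketch that uses the standard definitions of TPR (TP divided by TP plus FN) and FPR (FP divided by FP plus TN). The function name and the convention that a score equal to the threshold counts as positive are assumptions.

```python
def roc_points(labels, scores, thresholds):
    """Return (threshold, FPR, TPR) for each probability threshold."""
    points = []
    for t in thresholds:
        tp = fp = tn = fn = 0
        for y, s in zip(labels, scores):
            predicted_positive = s >= t  # assumption: ties count as positive
            if y == 1 and predicted_positive:
                tp += 1
            elif y == 1:
                fn += 1
            elif predicted_positive:
                fp += 1
            else:
                tn += 1
        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        points.append((t, fpr, tpr))
    return points
```

At a threshold of 0.0 every WSI is predicted positive, so both rates equal 1; raising the threshold moves the curve toward the origin.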
To assess the histopathological diagnostic concordance of pathologists, we computed the S-score, which is a chance-adjusted index of inter-rater reliability for categorical measurements between two or more raters. To evaluate the intra-rater reliability of each pathologist, we computed the weighted kappa statistic [51, 52]. We calculated the S-scores and kappa values using Microsoft Excel 2016 MSO (16.0.13029.20232) 64 bit. The scale for interpretation is as follows: \(\le\)0.0, poor agreement; 0.01-0.20, slight agreement; 0.21-0.40, fair agreement; 0.41-0.60, moderate agreement; 0.61-0.80, substantial agreement; 0.81-1.00, almost perfect agreement (Tables 4, 5).
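As an illustration of the agreement statistics, the sketch below computes an S-score and maps a value onto the interpretation scale given above. The text does not state the exact formula used in the spreadsheet, so this assumes the common form of the S-score, in which the observed agreement is adjusted for a uniform chance agreement of one over the number of categories, and it assumes that the observed agreement is averaged over all pairs of raters. Both function names are hypothetical.

```python
from itertools import combinations

def s_score(ratings, n_categories):
    """S-score from `ratings`, a list with one list of rater labels per WSI.

    Assumption: observed agreement is the fraction of agreeing rater pairs,
    and chance agreement is 1 / n_categories.
    """
    agree = total = 0
    for item in ratings:
        for a, b in combinations(item, 2):
            agree += (a == b)
            total += 1
    p_observed = agree / total
    return (n_categories * p_observed - 1) / (n_categories - 1)

def interpret_agreement(value):
    """Map an agreement value onto the scale stated in the text."""
    if value <= 0.0:
        return "poor"
    if value <= 0.20:
        return "slight"
    if value <= 0.40:
        return "fair"
    if value <= 0.60:
        return "moderate"
    if value <= 0.80:
        return "substantial"
    return "almost perfect"
```

Under these assumptions, three categories (benign, indolent, aggressive) with half of all rater pairs agreeing give an S-score of 0.25, which falls in the fair agreement band.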
Availability of data and material
The datasets generated during and/or analysed during the current study are not publicly available due to specific institutional requirements governing privacy protection but are available from the corresponding author on reasonable request. The datasets that support the findings of this study are available from Kamachi Group Hospitals (Fukuoka, Japan) and Sapporo-Kosei General Hospital (Sapporo, Japan), but restrictions apply to the availability of these data, which were used under a data use agreement which was made according to the Ethical Guidelines for Medical and Health Research Involving Human Subjects as set by the Japanese Ministry of Health, Labour and Welfare (Tokyo, Japan), and so are not publicly available. However, the data are available from the authors upon reasonable request for private viewing and with permission from the corresponding medical institutions within the terms of the data use agreement and if compliant with the ethical and legal requirements as stipulated by the Japanese Ministry of Health, Labour and Welfare.
High AUC performance of prostate core needle biopsy WSI evaluation of indolent and aggressive adenocarcinoma histopathology images
We trained deep learning models using two different training approaches: one was the transfer learning (TL) and weakly supervised learning (WS) approach [25, 53] (TL-Colon poorly ADC (x20, 512) and WS), and the other was the TL and fully supervised (FS) pre-training followed by WS (FS + WS) approach (TL-Colon poorly ADC (x20, 512) and FS + WS). In both approaches, the models were applied in a sliding window fashion with input tiles of 512x512 pixels, a magnification of x20, and a stride of 256. For transfer learning, the colon poorly differentiated adenocarcinoma classification model (Colon poorly ADC (x20, 512)) was selected to provide the initial weights because it had the highest ROC-AUC (0.889, CI: 0.861 - 0.914) and lowest log-loss (0.415, CI: 0.378 - 0.457) (Table 2) on the test set (Table 1). The other existing deep learning models (Table 2) whose ROC-AUC and log-loss performances we compared were described previously: Stomach ADC, AD (x10, 512); Stomach signet ring cell carcinoma (SRCC) (x10, 224); Stomach poorly ADC (x20, 224); Colon ADC, AD (x10, 512); Pancreas EUS-FNA ADC (x10, 224); Breast IDC, DCIS (x10, 224). For FS pre-training, we used the annotations manually drawn by pathologists (Fig. 2). For the test set (Table 1), we computed the ROC-AUC, log loss, accuracy, sensitivity, and specificity, which are summarized in Table 3 and Fig. 3.
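The sliding window geometry (512x512 tiles with a stride of 256) can be sketched as a generator of tile origins. The function name is hypothetical, and the edge handling is an assumption: tiles that would extend past the image border are simply dropped, and the restriction to tissue regions is omitted.

```python
def tile_origins(width, height, tile=512, stride=256):
    """Top-left (x, y) coordinates of all full tiles inside the image."""
    if width < tile or height < tile:
        return []  # assumption: regions smaller than one tile are skipped
    xs = range(0, width - tile + 1, stride)
    ys = range(0, height - tile + 1, stride)
    return [(x, y) for y in ys for x in xs]
```

Because the stride is half the tile size, neighbouring tiles overlap by 50%, so each interior pixel is covered by up to four tiles.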
As for WSI classification, the deep learning model for FS pre-training followed by WS approach (TL-Colon poorly ADC (x20, 512) and FS + WS) slightly improved ROC-AUC, accuracy, and sensitivity and decreased log-loss as compared to the model for WS approach (TL-Colon poorly ADC (x20, 512) and WS) in aggressive WSIs but not in indolent WSIs (Fig. 3 and Table 3). On the other hand, when compared with and without FS learning ([TL-Colon poorly ADC (x20, 512) and FS + WS] and [TL-Colon poorly ADC (x20, 512) and WS]) models for indolent and aggressive prediction at tile level in WSIs, FS pre-training followed by WS (FS + WS) approach robustly predicted indolent (Gleason pattern 3) (Fig. 4A, C, D, F) and aggressive (Gleason pattern 4 and 5) (Fig. 4M, O, P, R) patterns on heatmap images as compared to the WS approach (TL-Colon poorly ADC (x20, 512) and WS) (Fig. 4A, B, D, E, M, N, P, Q). Interestingly, the model (TL-Colon poorly ADC (x20, 512) and FS + WS) predicted indolent pattern (Gleason pattern 3) area precisely where pathologists did not mark ink-dots when they performed diagnosis (Fig. 4G, I, J, L), which was not predicted by the WS approach (TL-Colon poorly ADC (x20, 512) and WS) (Fig. 4G, H, J, K).
True positive indolent and aggressive prediction of core needle biopsy WSIs
Our model (TL-Colon poorly ADC (x20, 512) and FS + WS) satisfactorily predicted indolent (Fig. 5A-D) and aggressive (Fig. 5E-H) patterns in core needle biopsy WSIs. According to the histopathological report and additional consensus review by pathologists, both the #1 and #2 tissue fragments (Fig. 5A) contained adenocarcinoma corresponding to Gleason pattern 3 (Gleason score = 3 + 3) (Fig. 5C), indicating an indolent adenocarcinoma pattern and indolent WSI classification. The heatmap image (Fig. 5B, D) shows true positive indolent predictions in the #1 and #2 fragments (Fig. 5B), which corresponded with the H&E morphology (Fig. 5C, D). In Fig. 5E, the #1 and #2 fragments were benign (non-neoplastic) lesions, and adenocarcinoma corresponding to Gleason pattern 4 (Gleason score = 4 + 4) was also present (Fig. 5G), indicating an aggressive adenocarcinoma pattern and aggressive WSI classification. The heatmap image (Fig. 5F, H) shows true positive aggressive predictions in the #3 and #4 fragments (Fig. 5F), which corresponded with the H&E morphology (Fig. 5G, H). False positive predictions were not observed in the benign tissue fragments (#1 and #2) (Fig. 5E, F).
True negative indolent and aggressive prediction of core needle biopsy WSIs
Our model (TL-Colon poorly ADC (x20, 512) and FS + WS) showed true negative predictions of indolent (Fig. 6A, C) and aggressive (Fig. 6A, D) patterns in core needle biopsy WSIs. In Fig. 6A, histopathologically, all tissue fragments (#1-#13) were benign (non-neoplastic) lesions. The heatmap image showed true positive prediction of benign (Fig. 6B), true negative predictions of indolent (Fig. 6C) and aggressive (Fig. 6D) patterns.
False positive indolent and aggressive prediction of core needle biopsy WSIs
According to the histopathological reports and additional review by pathologists, Fig. 7A shows prostatic hyperplasia and Fig. 7E shows chronic prostatitis, both of which are benign (non-neoplastic) lesions. Our model (TL-Colon poorly ADC (x20, 512) and FS + WS) showed false positive predictions of indolent (Fig. 7B) and aggressive (Fig. 7F) patterns, which caused indolent and aggressive WSI classifications. The indolent false positive tissue areas showed large and small dilated atrophic glands (Fig. 7C, D), and the aggressive false positive tissue areas showed severe infiltration of lymphocytes and histiocytes (Fig. 7G, H). Their morphological similarity to the indolent pattern (Gleason pattern 3) and the aggressive pattern (Gleason pattern 4 and 5), respectively, could be the primary cause of the false positives.
False negative indolent and aggressive prediction of core needle biopsy WSIs
According to the histopathological reports and additional consensus review by pathologists, in Fig. 8A, infiltrating adenocarcinoma showed an indolent pattern (Gleason pattern 3) in a limited area of fragment #1 (Fig. 8C). Fragments #2-#10 were benign (non-neoplastic) lesions. The heatmap image (Fig. 8B) showed a weakly indolent predicted tile (Fig. 8D), which corresponded with the Gleason pattern 3 histopathology (Fig. 8C). The WSI classification was therefore a false negative. In Fig. 8E, one fragment (#2) contained a few fragmented adenocarcinoma foci with a cribriform pattern, which indicated an aggressive pattern (Gleason pattern 4) (Fig. 8G). The heatmap image (Fig. 8F) showed true positive predictions of these few adenocarcinoma foci, but with low probability (Fig. 8H). This WSI classification was therefore also a false negative. Both of these WSIs (Fig. 8A, E) contain a very low volume of adenocarcinoma, which could be the primary cause of the false negatives.
Both indolent and aggressive prediction outputs of core needle biopsy WSIs
There were 114 out of 645 WSIs (17.7%) in the test set (Table 1) that were predicted as both indolent and aggressive by our model (TL-Colon poorly ADC (x20, 512) and FS + WS). After reviewing these WSIs carefully, we found that they tended to consist of a mixture of Gleason pattern 3 and Gleason pattern 4 adenocarcinoma in proportions close to the borderline (cut-off 20%) between indolent and aggressive evaluation (Fig. 1). For example, histopathologically, adenocarcinoma with small, indistinct, or fused glands (equivalent to Gleason pattern 4) was predominant (Fig. 9A, D, E). However, at the same time, Gleason pattern 3 adenocarcinoma was mixed in to various degrees (Fig. 9A, D, E) in the area of Gleason pattern 4 adenocarcinoma infiltration. Importantly, in all 114 WSIs predicted as both indolent and aggressive, the boundary between Gleason pattern 3 and Gleason pattern 4 adenocarcinoma was unclear and transitional, which was confirmed retrospectively by senior pathologists. The heatmap images of indolent (Fig. 9B) and aggressive (Fig. 9C) revealed that, to some extent, the indolent (Gleason pattern 3) (Fig. 9F, H) and aggressive (Gleason pattern 4 and 5) (Fig. 9G, I) prediction outputs overlapped. Therefore, the WSI prediction outputs (indolent or aggressive) were approximate values. For these WSIs, the WSI classification was taken as the class (indolent or aggressive) with the larger output value. If we compute ROC-AUC and log-loss while accepting double-label WSI classification outputs (meaning both indolent and aggressive prediction outputs), the scores are as follows: indolent ROC-AUC 0.956 [CI: 0.940-0.970], log-loss 0.969 [CI: 0.835-1.109]; aggressive ROC-AUC 0.980 [CI: 0.969-0.990], log-loss 0.213 [CI: 0.167-0.264].
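The handling of double-label outputs can be sketched as a small decision function. The probability threshold of 0.5, the tie-break toward aggressive, the fallback to benign when neither output is positive, and the function name are all assumptions; the text states only that the class with the larger output value was selected when both outputs were positive.

```python
def classify_wsi(p_indolent, p_aggressive, threshold=0.5):
    """Return (final_label, is_double_label) for one WSI."""
    indolent_positive = p_indolent >= threshold
    aggressive_positive = p_aggressive >= threshold
    if not indolent_positive and not aggressive_positive:
        # Assumption: neither adenocarcinoma output fired.
        return "benign", False
    is_double = indolent_positive and aggressive_positive
    # The larger output decides; assumption: ties go to aggressive.
    final = "aggressive" if p_aggressive >= p_indolent else "indolent"
    return final, is_double
```

The second return value makes it possible to flag such WSIs for review, since these are the cases on which pathologists also showed the lowest agreement.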
Inter- and intra-rater reliability study
To assess the inter-rater reliability of benign, indolent adenocarcinoma, and aggressive adenocarcinoma classification on WSIs, we selected WSIs based on the WSI prediction outputs of our deep learning model (TL-Colon poorly ADC (x20, 512) and FS + WS) and the consensus classification by senior pathologists. For the true-negative cohort (25 WSIs; consensus: benign, AI predicted label: benign), S-scores were in the range of 0.90-0.95, indicating “almost perfect agreement” (Table 4). For the true-positive indolent cohort (25 WSIs; consensus: indolent, AI predicted label: indolent), S-scores were in the range of 0.56-0.72, indicating “moderate to substantial agreement” (Table 4). For the cohort predicted as both indolent and aggressive (25 WSIs; consensus: 13 indolent and 12 aggressive, AI predicted label: indolent & aggressive), S-scores were in the range of 0.10-0.28, indicating “slight to fair agreement” (Table 4). For the true-positive aggressive cohort (25 WSIs; consensus: aggressive, AI predicted label: aggressive), S-scores were in the range of 0.48-0.81, indicating “moderate to almost perfect agreement” (Table 4). The inter-rater reliability study was performed twice on the same 100 WSIs, presented in randomized order, with a one-month interval between the 1st and 2nd studies. The S-scores in the 2nd study were slightly higher than in the 1st study, and the interpretations were modestly improved (Table 4). For the aggressive classification, the S-scores of the pathologists with more than 10 years of experience were higher than those of the pathologists with less than 10 years of experience (Table 4). Overall, WSIs that were predicted as both indolent and aggressive by our deep learning model (TL-Colon poorly ADC (x20, 512) and FS + WS) resulted in very low S-scores in the range of 0.10-0.28, meaning low inter-rater reliability (agreement) among pathologists regardless of experience (Table 4).
As for the intra-rater reliability, all 10 pathologists achieved robust weighted kappa values in the range of 0.93-0.97, indicating “almost perfect agreement” (Table 5). Figure 10 shows a representative example of a WSI with poor evaluation (diagnostic) concordance among pathologists. In the inter-rater reliability study, 5 pathologists evaluated this WSI as indolent and 5 pathologists as aggressive (Fig. 10A). In Fig. 10A, there is a wide variety of adenocarcinoma histopathologies. The heatmap images show both indolent (Fig. 10B) and aggressive (Fig. 10C) predictions by our deep learning model (TL-Colon poorly ADC (x20, 512) and FS + WS). In Fig. 10D, Gleason pattern 3 (indicating indolent) adenocarcinoma was predominant, and the area was predicted as indolent (Fig. 10E) and not aggressive (Fig. 10F). In Fig. 10G and J, Gleason pattern 3 (indicating indolent) and Gleason pattern 4 (indicating aggressive) adenocarcinoma were mixed and it was hard to decide between the two labels (indolent and aggressive); these areas were predicted as both indolent (Fig. 10H and K) and aggressive (Fig. 10I and L).
In this study, we trained deep learning models for the classification of indolent and aggressive prostate adenocarcinoma in core needle biopsy WSIs to infer patients’ optimum clinical interventions (active surveillance or definitive therapy). We trained deep learning models using a combination of transfer learning [25, 41, 55], weakly supervised, and fully supervised [24, 43, 54] learning approaches. The evaluation results at the WSI level showed no significant differences between the transfer learning and weakly supervised learning model (TL-Colon poorly ADC (x20, 512) and WS) and the transfer learning, fully supervised, and weakly supervised learning model (TL-Colon poorly ADC (x20, 512) and FS+WS) (Table 3). However, at the tile level (visualised via heatmap images), the model (TL-Colon poorly ADC (x20, 512) and FS+WS) predicted both indolent (Gleason pattern 3) and aggressive (Gleason pattern 4 and 5) areas more precisely than the weakly supervised learning model (TL-Colon poorly ADC (x20, 512) and WS) (Fig. 4). Therefore, we selected the model (TL-Colon poorly ADC (x20, 512) and FS+WS) as the best model, which achieved ROC-AUCs of 0.846 (CI: 0.813 - 0.879) (indolent) and 0.980 (CI: 0.969 - 0.990) (aggressive) (Table 3). To the best of our knowledge, this is the first study to demonstrate a deep learning model that predicts patients’ clinical interventions (active surveillance or definitive therapy) based on histopathological WSIs. A previously reported deep learning model achieved ROC-AUCs in the range of 0.855 (external test set) to 0.974 (internal test set) for the classification of benign and Gleason grade group 1-2 vs. Gleason grade group greater than or equal to 3. Our model (TL-Colon poorly ADC (x20, 512) and FS+WS) achieved a higher ROC-AUC for aggressive (0.980 (CI: 0.969 - 0.990)) (Table 3).
Inspection of the WSI heatmaps showed that our model (TL-Colon poorly ADC (x20, 512) and FS+WS) predicted indolent (Gleason pattern 3) and aggressive (Gleason pattern 4 and 5) lesions well (Fig. 4, 5, 6). The model still produced a few false positive and false negative predictions (Fig. 7, 8). It tends to show false positive predictions of indolent lesions where the tissue consists of atrophic glands, and of aggressive lesions where the tissue shows severe inflammatory cell infiltration (Fig. 7). It tends to show false negative predictions of indolent and aggressive lesions where the volume of adenocarcinoma tissue is limited (Fig. 8).
However, a major limitation of this study is the inability of the model to decide on borderline cases. Because we increased the cut-off from 10% to 20%, the model would be unable, for instance, to correctly predict a Gleason score 3 (85%) + 5 (15%) case, which according to the guidelines should be aggressive. Any prediction by our model on such a case would be unreliable due to the lack of borderline cases in the training set and our modified cut-off. Another limitation of this study is related to WSIs with both indolent (Gleason pattern 3) and aggressive (Gleason pattern 4 and 5) components mixed in various proportions, where there was wide variability in inter-rater (observer) concordance among pathologists, regardless of their years of experience after board certification (Table 4 and Fig. 10). On such WSIs, which consisted of a mixture of Gleason pattern 3 and Gleason pattern 4 adenocarcinoma in proportions close to the borderline (cut-off 20%) between indolent and aggressive evaluation (Fig. 1), our deep learning model (TL-Colon poorly ADC (x20, 512) and FS+WS) tends to predict both indolent and aggressive WSI outputs (17.7% of total WSIs in the test set), mirroring the disagreement among pathologists (Fig. 9, 10). Indeed, there were a certain number of WSIs in the test set with a Gleason pattern 4 or Gleason pattern 5 component of around 20% of total adenocarcinoma, which were the major cause of poor concordance among pathologists and of deep learning model WSI prediction outputs that were both indolent and aggressive (Fig. 10). It has been reported that percent Gleason pattern 4 was more difficult to assess in smaller foci with less than 10% involvement of the core, with only moderate agreement. Given that in a small focus only a few glands of a given pattern can markedly affect the percent Gleason pattern 4, consideration should be given to not recording percent Gleason pattern 4 in small foci of Gleason score 7 tumors on core needle biopsy.
This issue is inevitable when WSIs are classified on the basis of the percentages of adenocarcinoma components (Gleason pattern 3, 4, 5). Moreover, for a number of WSIs there was a marked discrepancy among pathologists as to whether the prostate adenocarcinoma should be classified as Gleason pattern 3 or Gleason pattern 4 (Fig. 10). In practice, the histopathological separation of Gleason pattern 3 from Gleason pattern 4 is often problematic [38, 59]. Currently, according to the diagnostic criteria for Gleason pattern 4 adenocarcinoma on core needle biopsy, poorly formed glands immediately adjacent to other well-formed glands, regardless of their number, and small foci of five or fewer poorly formed glands, regardless of their location, should be graded as Gleason pattern 3, which would be one of the primary causes of the double (both indolent and aggressive) prediction outputs. Moreover, in this study, instead of assigning an indolent or aggressive label to each core needle biopsy specimen, we considered all specimens on a WSI together as a single specimen. Inter-observer concordance among pathologists could therefore be poor when the total histopathological area to evaluate was too large (e.g., six or eight core specimens in a single WSI). However, this issue could be resolved by preparing one core needle biopsy specimen per glass slide (WSI) when the specimens are intended for deep learning model prediction. Interestingly, when we computed the performance of the model (TL-Colon poorly ADC (x20, 512) and FS+WS) under criteria that accept double-label WSI classification outputs (both indolent and aggressive), the indolent ROC-AUC increased (0.956 [CI: 0.940-0.970]) and the log-loss decreased (0.969 [CI: 0.835-1.109]) compared with Table 3.
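For readers who want to see how point estimates with confidence intervals of this kind can be obtained, the sketch below computes ROC-AUC and log-loss with a percentile bootstrap interval. It is a plain-Python stand-in written for illustration. The helper names, the toy labels and scores, the number of resamples, and the choice of the percentile method are all assumptions, and it does not reproduce the evaluation code or the double-label acceptance criteria of this study.

```python
import math
import random


def roc_auc(y_true, y_score):
    """ROC-AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def log_loss(y_true, y_score, eps=1e-15):
    """Mean binary cross-entropy, with scores clipped away from 0 and 1."""
    total = 0.0
    for t, s in zip(y_true, y_score):
        s = min(max(s, eps), 1.0 - eps)
        total -= t * math.log(s) + (1 - t) * math.log(1.0 - s)
    return total / len(y_true)


def bootstrap_ci(metric, y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample cases with replacement, recompute the metric."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        if len(set(yt)) < 2:
            continue  # skip resamples that contain only one class
        stats.append(metric(yt, [y_score[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical WSI-level labels (1 = class present) and model probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
y_score = [0.1, 0.3, 0.2, 0.6, 0.7, 0.9, 0.4, 0.8, 0.5, 0.65]

auc = roc_auc(y_true, y_score)
auc_ci = bootstrap_ci(roc_auc, y_true, y_score)
ll = log_loss(y_true, y_score)
ll_ci = bootstrap_ci(log_loss, y_true, y_score)
print(f"ROC-AUC {auc:.3f} [CI: {auc_ci[0]:.3f}-{auc_ci[1]:.3f}]")
print(f"log-loss {ll:.3f} [CI: {ll_ci[0]:.3f}-{ll_ci[1]:.3f}]")
```

With only ten toy cases the intervals are very wide. On a test set of several hundred WSIs the same procedure would give much narrower intervals.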
A further limitation of this study is the limited generalization of the deep learning model (TL-Colon poorly ADC (x20, 512) and FS+WS), because the training and test sets were provided by the same supplier hospitals (Kamachi Group Hospitals and Sapporo-Kosei General Hospital). Therefore, as a next step, to verify the versatility of the model (TL-Colon poorly ADC (x20, 512) and FS+WS), we need to perform a validation study using a sufficient number of WSIs from a diverse range of hospitals.
The main advantage of our deep learning model (TL-Colon poorly ADC (x20, 512) and FS+WS) is that it can predict the optimum clinical intervention for a patient (active surveillance: indolent; definitive therapy: aggressive) from core needle biopsy WSIs. For most patients with low-risk (Gleason score less than or equal to 6) prostate cancer, active surveillance is the recommended disease management strategy. At the same time, select patients with low-volume, intermediate-risk prostate cancer (indolent in this study) can be offered active surveillance. In routine histopathological diagnosis of prostate cancer in core needle biopsy specimens, pathologists have to report a Gleason score for each core for risk assessment using a microscope, which is tiring and laborious work. Moreover, significant inter-rater variability among pathologists in the diagnosis of prostate cancer has been reported [9, 35, 59]. By using our deep learning model as an initial screening, pathologists can check WSIs together with heatmap images highlighting indolent (Gleason pattern 3) and aggressive (Gleason pattern 4 and 5) adenocarcinoma and with the WSI prediction outputs (benign, indolent, and aggressive), which would be of great benefit to general pathologists in making diagnoses.
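Inter-rater agreement of the kind discussed above can be summarized with the S score of Bennett et al., cited in the reference list, which corrects the observed agreement for chance under the assumption that all categories are equally likely: S = (k × Po − 1) / (k − 1), where k is the number of categories and Po is the observed proportion of agreement. The sketch below implements the two-rater form. The function name `bennett_s`, the three-category setting, and the toy ratings are assumptions for illustration, and the way scores were aggregated across pathologists in this study is not shown here.

```python
def bennett_s(ratings_a, ratings_b, n_categories):
    """Bennett's S for two raters: chance-corrected agreement with uniform chance 1/k."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same, non-empty set of cases")
    if n_categories < 2:
        raise ValueError("at least two categories are required")
    agreements = sum(a == b for a, b in zip(ratings_a, ratings_b))
    p_observed = agreements / len(ratings_a)
    return (n_categories * p_observed - 1) / (n_categories - 1)


# Hypothetical WSI-level calls by two pathologists on six slides.
rater_1 = ["benign", "indolent", "aggressive", "indolent", "aggressive", "benign"]
rater_2 = ["benign", "indolent", "aggressive", "aggressive", "aggressive", "indolent"]

s_score = bennett_s(rater_1, rater_2, n_categories=3)
print(round(s_score, 3))
```

In this toy example the raters agree on four of six slides, so S = (3 × 4/6 − 1) / 2 = 0.5. Perfect agreement gives S = 1, and agreement at the chance level of 1/k gives S = 0.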
Availability of data and materials
The datasets generated and/or analysed during the current study are not publicly available due to specific institutional requirements governing privacy protection but are available from the corresponding author on reasonable request.
Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2021;71(3):209–49.
Chen RC, Rumble RB, Loblaw DA, Finelli A, Ehdaie B, Cooperberg MR, et al. Active surveillance for the management of localized prostate cancer (Cancer Care Ontario Guideline): American Society of Clinical Oncology clinical practice guideline endorsement. J Clin Oncol Off J Am Soc Clin Oncol. 2016;34(18):2182–90.
Van Leenders GJ, Van Der Kwast TH, Grignon DJ, Evans AJ, Kristiansen G, Kweldam CF, et al. The 2019 International Society of Urological Pathology (ISUP) consensus conference on grading of prostatic carcinoma. Am J Surg Pathol. 2020;44(8):e87.
Morash C, Tey R, Agbassi C, Klotz L, McGowan T, Srigley J, et al. Active surveillance for the management of localized prostate cancer: guideline recommendations. Can Urol Assoc J. 2015;9(5–6):171.
Cyll K, Löffeler S, Carlsen B, Skogstad K, Plathan ML, Landquist M, et al. No significant difference in intermediate key outcomes in men with low-and intermediate-risk prostate cancer managed by active surveillance. Sci Rep. 2022;12(1):1–9.
Russell JR, Siddiqui MM. Active surveillance in favorable intermediate risk prostate cancer: outstanding questions and controversies. Curr Opin Oncol. 2022;34(3):219–27.
Allsbrook WC Jr, Mangold KA, Johnson MH, Lane RB, Lane CG, Epstein JI. Interobserver reproducibility of Gleason grading of prostatic carcinoma: general pathologist. Hum Pathol. 2001;32(1):81–8.
Oyama T, Allsbrook WC Jr, Kurokawa K, Matsuda H, Segawa A, Sano T, et al. A comparison of interobserver reproducibility of Gleason grading of prostatic carcinoma in Japan and the United States. Arch Pathol Lab Med. 2005;129(8):1004–10.
Ozkan TA, Eruyar AT, Cebeci OO, Memik O, Ozcan L, Kuskonmaz I. Interobserver variability in Gleason histological grading of prostate cancer. Scand J Urol. 2016;50(6):420–4.
Bulten W, Kartasalo K, Chen PHC, Ström P, Pinckaers H, Nagpal K, et al. Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge. Nat Med. 2022;28(1):154–63.
Yu KH, Zhang C, Berry GJ, Altman RB, Ré C, Rubin DL, et al. Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features. Nat Commun. 2016;7:12474.
Hou L, Samaras D, Kurc TM, Gao Y, Davis JE, Saltz JH. Patch-based convolutional neural network for whole slide tissue image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Manhattan: IEEE address; 2016. p. 2424–2433.
Madabhushi A, Lee G. Image analysis and machine learning in digital pathology: Challenges and opportunities. Med Image Anal. 2016;33:170–5.
Litjens G, Sánchez CI, Timofeeva N, Hermsen M, Nagtegaal I, Kovacs I, et al. Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis. Sci Rep. 2016;6:26286.
Kraus OZ, Ba JL, Frey BJ. Classifying and segmenting microscopy images with deep multiple instance learning. Bioinformatics. 2016;32(12):i52–9.
Korbar B, Olofson AM, Miraflor AP, Nicka CM, Suriawinata MA, Torresani L, et al. Deep learning for classification of colorectal polyps on whole-slide images. J Pathol Inform. 2017;8:30.
Luo X, Zang X, Yang L, Huang J, Liang F, Rodriguez-Canales J, et al. Comprehensive computational pathological image analysis predicts lung cancer prognosis. J Thorac Oncol. 2017;12(3):501–9.
Coudray N, Ocampo PS, Sakellaropoulos T, Narula N, Snuderl M, Fenyö D, et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat Med. 2018;24(10):1559–67.
Wei JW, Tafe LJ, Linnik YA, Vaickus LJ, Tomita N, Hassanpour S. Pathologist-level classification of histologic patterns on resected lung adenocarcinoma slides with deep neural networks. Sci Rep. 2019;9(1):1–8.
Gertych A, Swiderska-Chadaj Z, Ma Z, Ing N, Markiewicz T, Cierniak S, et al. Convolutional neural networks can accurately distinguish four histologic growth patterns of lung adenocarcinoma in digital slides. Sci Rep. 2019;9(1):1483.
Bejnordi BE, Veta M, Van Diest PJ, Van Ginneken B, Karssemeijer N, Litjens G, et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA. 2017;318(22):2199–210.
Saltz J, Gupta R, Hou L, Kurc T, Singh P, Nguyen V, et al. Spatial organization and molecular correlation of tumor-infiltrating lymphocytes using deep learning on pathology images. Cell Reports. 2018;23(1):181–93.
Campanella G, Hanna MG, Geneslaw L, Miraflor A, Silva VWK, Busam KJ, et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat Med. 2019;25(8):1301–9.
Iizuka O, Kanavati F, Kato K, Rambeau M, Arihiro K, Tsuneki M. Deep learning models for histopathological classification of gastric and colonic epithelial tumours. Sci Rep. 2020;10(1):1–11.
Tsuneki M, Abe M, Kanavati F. A Deep Learning Model for Prostate Adenocarcinoma Classification in Needle Biopsy Whole-Slide Images Using Transfer Learning. Diagnostics. 2022;12(3):768.
Huang W, Randhawa R, Jain P, Iczkowski KA, Hu R, Hubbard S, et al. Development and Validation of an Artificial Intelligence-Powered Platform for Prostate Cancer Grading and Quantification. JAMA Netw Open. 2021;4(11):e2132554–e2132554.
Bulten W, Balkenhol M, Belinga JJA, Brilhante A, Çakır A, Egevad L, et al. Artificial intelligence assistance significantly improves Gleason grading of prostate biopsies by pathologists. Mod Pathol. 2021;34(3):660–71.
Singhal N, Soni S, Bonthu S, Chattopadhyay N, Samanta P, Joshi U, et al. A deep learning system for prostate cancer diagnosis and grading in whole slide images of core needle biopsies. Sci Rep. 2022;12(1):1–11.
Li W, Li J, Wang Z, Polson J, Sisk AE, Sajed DP, et al. PathAL: An Active Learning Framework for Histopathology Image Analysis. IEEE Trans Med Imaging. 2021;41(5):1176–87.
Melo PAdS, Estivallet CLN, Srougi M, Nahas WC, Leite KRM. Detecting and grading prostate cancer in radical prostatectomy specimens through deep learning techniques. Clinics. 2021;76:e3198.
Otálora S, Marini N, Müller H, Atzori M. Combining weakly and strongly supervised learning improves strong supervision in Gleason pattern classification. BMC Med Imaging. 2021;21(1):1–14.
Silva-Rodríguez J, Colomer A, Naranjo V. WeGleNet: A weakly-supervised convolutional neural network for the semantic segmentation of Gleason grades in prostate histology images. Computerized Medical Imaging and Graphics. 2021;88:101846.
Marginean F, Arvidsson I, Simoulis A, Overgaard NC, Åström K, Heyden A, et al. An artificial intelligence-based support tool for automation and standardisation of Gleason grading in prostate biopsies. Eur Urol Focus. 2021;7(5):995–1001.
Nagpal K, Foote D, Liu Y, Chen PHC, Wulczyn E, Tan F, et al. Development and validation of a deep learning algorithm for improving Gleason scoring of prostate cancer. NPJ Digit Med. 2019;2(1):1–10.
Sadimin ET, Khani F, Diolombi M, Meliti A, Epstein JI. Interobserver reproducibility of percent Gleason pattern 4 in prostatic adenocarcinoma on prostate biopsies. Am J Surg Pathol. 2016;40(12):1686–92.
van Leenders GJLH, van der Kwast TH, Grignon DJ, Evans AJ, Kristiansen G, Kweldam CF, et al. The 2019 International Society of Urological Pathology (ISUP) Consensus Conference on Grading of Prostatic Carcinoma. Am J Surg Pathol. 2020;44(8):e87–99. https://doi.org/10.1097/pas.0000000000001497.
McKenney JK, Simko J, Bonham M, True LD, Troyer D, Hawley S, et al. The potential impact of reproducibility of Gleason grading in men with early stage prostate cancer managed by active surveillance: a multi-institutional study. J Urol. 2011;186(2):465–9.
Egevad L, Algaba F, Berney DM, Boccon-Gibod L, Compérat E, Evans AJ, et al. Interactive digital slides with heat maps: a novel method to improve the reproducibility of Gleason grading. Virchows Arch. 2011;459(2):175–82.
Zhou M, Li J, Cheng L, Egevad L, Deng FM, Kunju LP, et al. Diagnosis of “Poorly Formed Glands” Gleason Pattern 4 Prostatic Adenocarcinoma on Needle Biopsy. Am J Surg Pathol. 2015;39(10):1331–9.
Harding-Jackson N, Kryvenko ON, Whittington EE, Eastwood DC, Tjionas GA, Jorda M, et al. Outcome of Gleason 3 + 5 = 8 prostate cancer diagnosed on needle biopsy: prognostic comparison with Gleason 4 + 4 = 8. J Urol. 2016;196(4):1076–81.
Kanavati F, Tsuneki M. Partial transfusion: on the expressive influence of trainable batch norm parameters for transfer learning. In: Medical Imaging with Deep Learning. Cambridge: PMLR; 2021. p. 338–353.
Tan M, Le Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning. Cambridge: PMLR; 2019. p. 6105–6114.
Kanavati F, Tsuneki M. A deep learning model for gastric diffuse-type adenocarcinoma classification in whole slide images. 2021. arXiv preprint arXiv:2104.12478.
Otsu N. A threshold selection method from gray-level histograms. IEEE Trans Syst Man Cybern. 1979;9(1):62–6.
Kingma DP, Ba J. Adam: A method for stochastic optimization. 2014. arXiv preprint arXiv:1412.6980.
Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Software available from tensorflow.org. https://www.tensorflow.org/.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12:2825–30.
Hunter JD. Matplotlib: A 2D graphics environment. Comput Sci Eng. 2007;9(3):90–5. https://doi.org/10.1109/MCSE.2007.55.
Efron B, Tibshirani RJ. An introduction to the bootstrap. Boca Raton: CRC press; 1994.
Bennett EM, Alpert R, Goldstein A. Communications through limited-response questioning. Public Opin Q. 1954;18(3):303–8.
Kundel HL, Polansky M. Measurement of observer agreement. Radiology. 2003;228(2):303–8.
Swan K, Speyer R, Scharitzer M, Farneti D, Brown T, Cordier R. A Visuoperceptual Measure for Videofluoroscopic Swallow Studies (VMV): A Pilot Study of Validity and Reliability in Adults with Dysphagia. J Clin Med. 2022;11(3):724.
Kanavati F, Toyokawa G, Momosaki S, Rambeau M, Kozuma Y, Shoji F, et al. Weakly-supervised learning for lung carcinoma classification using deep learning. Sci Rep. 2020;10(1):1–11.
Kanavati F, Ichihara S, Rambeau M, Iizuka O, Arihiro K, Tsuneki M. Deep learning models for gastric signet ring cell carcinoma classification in whole slide images. Technol Cancer Res Treat. 2021;20:15330338211027900.
Tsuneki M, Kanavati F. Deep learning models for poorly differentiated colorectal adenocarcinoma classification in whole slide images using transfer learning. Diagnostics. 2021;11(11):2074.
Naito Y, Tsuneki M, Fukushima N, Koga Y, Higashi M, Notohara K, et al. A deep learning model to detect pancreatic ductal adenocarcinoma on endoscopic ultrasound-guided fine-needle biopsy. Sci Rep. 2021;11(1):1–8.
Kanavati F, Ichihara S, Tsuneki M. A deep learning model for breast ductal carcinoma in situ classification in whole slide images. Virchows Archiv. 2022;480(5):1009–22.
Bulten W, Pinckaers H, van Boven H, Vink R, de Bel T, van Ginneken B, et al. Automated deep-learning system for Gleason grading of prostate cancer using biopsies: a diagnostic study. Lancet Oncol. 2020;21(2):233–41.
Meliti A, Sadimin E, Diolombi M, Khani F, Epstein JI. Accuracy of grading Gleason score 7 prostatic adenocarcinoma on needle biopsy: influence of percent pattern 4 and other histological factors. Prostate. 2017;77(6):681–5.
We are grateful for the support provided by Dr. Shigeo Nakano at Kamachi Group Hospitals (Fukuoka, Japan). We thank pathologists who have been engaged in reviewing cases and clinicopathological discussion for this study. This study is based on results obtained from a project, JPNP14012, subsidized by the New Energy and Industrial Technology Development Organization (NEDO).
Ethics approval and consent to participate
The experimental protocol was approved by the ethical boards of Kamachi Group Hospitals (No. 173) and Sapporo-Kosei General Hospital (No. 597). All research activities complied with all relevant ethical regulations and were performed in accordance with the relevant guidelines and regulations at all the hospitals mentioned above. Informed consent to use histopathological samples and pathological diagnostic reports for research purposes had been obtained from all patients prior to the surgical procedures at all hospitals, and the opportunity to refuse participation in research had been guaranteed in an opt-out manner.
Consent for publication
F.K. and M.T. are employees of Medmain Inc. All authors declare no competing interests.
Cite this article
Tsuneki, M., Abe, M., Ichihara, S. et al. Inference of core needle biopsy whole slide images requiring definitive therapy for prostate cancer. BMC Cancer 23, 11 (2023). https://doi.org/10.1186/s12885-022-10488-5
- Transfer learning
- Weakly supervised learning
- Fully supervised learning
- Deep learning
- Prostate cancer
- Active surveillance