Inference of core needle biopsy whole slide images requiring definitive therapy for prostate cancer

Background Prostate cancer is often a slowly progressive, indolent disease. Unnecessary treatment resulting from overdiagnosis is a significant concern, particularly for low-grade disease. Active surveillance has been considered a risk management strategy to avoid the potential side effects of unnecessary radical treatment. In 2016, the American Society of Clinical Oncology (ASCO) endorsed the Cancer Care Ontario (CCO) Clinical Practice Guideline on active surveillance for the management of localized prostate cancer. Methods Based on this guideline, we developed a deep learning model to classify prostate adenocarcinoma on core needle biopsy whole slide images (WSIs) as indolent (applicable for active surveillance) or aggressive (requiring definitive therapy). We trained deep learning models using a combination of transfer, weakly supervised, and fully supervised learning approaches on a dataset of core needle biopsy WSIs (n=1300). In addition, we performed an inter-rater reliability evaluation of the WSI classification. Results We evaluated the models on a test set (n=645), achieving ROC-AUCs of 0.846 for indolent and 0.980 for aggressive. The inter-rater reliability evaluation showed S-scores in the range of 0.10 to 0.95, with the lowest on WSIs classified by the model as both indolent and aggressive, and the highest on benign WSIs. Conclusion These results demonstrate the promising potential of deployment in a practical prostate adenocarcinoma histopathological diagnostic workflow system.


Introduction
According to the Global Cancer Statistics 2020, prostate cancer was the second most frequent cancer and the fifth leading cause of cancer death among men in 2020 [1]. Prostate cancer is the most frequently diagnosed cancer in men in over half (112 of 185) of the countries of the world [1]. Therefore, it is necessary to define optimum strategies for the detection, treatment, and follow-up of prostate cancer patients [2]. In recent years, pathologists have performed the histopathological diagnosis of prostate cancer based on Gleason pattern quantities, tumor growth patterns, and clinical practice advancements (e.g., multiparametric magnetic resonance imaging (mpMRI)-targeted biopsy and fusion ultrasound/magnetic resonance imaging biopsy) [3]. Standard active treatments for prostate cancer include hormone therapy, radiotherapy, and radical prostatectomy. However, to avoid the unnecessary side effects associated with overdiagnosis and overtreatment, active surveillance is an important option for low-grade prostate cancer patients with reduced mortality risk [2,4]. Active surveillance consists of performing regular follow-ups of patients so that appropriate radical treatment can be provided to high-risk groups if necessary [4]. The criteria for active surveillance are highly controversial [2][3][4][5][6]. According to the Cancer Care Ontario (CCO) Guideline and the American Society of Clinical Oncology (ASCO) Clinical Practice Guideline, it is generally accepted that active surveillance is applied when a prostate cancer is identified by biopsy and Gleason pattern 4 components account for less than 10% of the total cancer volume [2]. Unfortunately, however, the inter-observer agreement for the Gleason score is not always high, and the inter-observer reproducibility (variability) of Gleason grading by general pathologists is often a problem [7][8][9][10].
Although the International Society of Urological Pathology (ISUP) is making efforts to improve inter-observer agreement and equalize diagnostic quality for general pathologists by publishing consensus review cases (https://isupweb.org/pib/), there are still cases on which pathologists do not agree in routine clinical practice.
In computational pathology, deep learning models have been widely applied to histopathological cancer classification on WSIs, cancer cell detection and segmentation, and the stratification of patient outcomes [11][12][13][14][15][16][17][18][19][20][21][22][23][24][25]. Recently, it has been reported that an artificial intelligence (AI)-powered platform used as a clinical decision support tool was able to detect, grade, and quantify prostate cancer with high accuracy and efficiency, and was associated with significant reductions in inter-observer variability [26,27]. In the Prostate cANcer graDe Assessment (PANDA) challenge, a group of AI Gleason grading algorithms developed during a global competition generalized well to intercontinental and multinational cohorts with pathologist-level performance [10]. Other works [23,[28][29][30][31][32][33][34] have also looked into developing deep learning algorithms to classify prostate cancer Gleason scores based on histopathological images.
In this study, we investigated deep learning models to classify prostate adenocarcinoma into two classes based on the clinical responses: indolent (applicable for active surveillance) and aggressive (requiring definitive therapy). To define the criteria for indolent and aggressive, we referred to the CCO and ASCO guidelines [2] and set a cut-off value of 20% identified Gleason pattern 4 & 5 components in total prostate adenocarcinoma (Fig. 1) to reduce the possibility of inter-observer variability [35] as compared to the 10% cut-off value proposed by CCO and ASCO [2]. To the best of our knowledge, this is the first study to establish a deep learning model that infers the necessity of active surveillance from prostate core needle biopsy histopathology whole slide images (WSIs). We trained deep learning models using a combination of transfer, weakly supervised, and fully supervised learning approaches and evaluated the trained models on a core needle biopsy test set, achieving ROC-AUCs of 0.846 (indolent) and 0.980 (aggressive). These findings suggest that it would be possible not only to detect adenocarcinoma on biopsy WSIs, but also to predict patients' optimum clinical interventions (active surveillance or definitive therapy).

Fig. 1 The schematic diagram of classification labels for prostate adenocarcinoma according to clinical treatment. If a whole slide image (WSI) has Gleason pattern 4 and 5 components greater than or equal to 20% of the total area of prostate adenocarcinoma observed by pathologists, the WSI is classified as aggressive. WSIs with Gleason pattern 4 and 5 components less than 20% of the total area of prostate adenocarcinoma are classified as indolent.

Clinical cases and pathological records
This is a retrospective study. A total of 2,285 H&E (hematoxylin & eosin) stained histopathological core needle biopsy specimen slides of human prostate adenocarcinoma and benign (non-neoplastic) lesions (1,321 adenocarcinoma and 964 benign) were collected from the surgical pathology files of Kamachi Group Hospitals (Shinyukuhashi, Wajiro, and Shinkuki Hospitals) (Fukuoka, Japan) and Sapporo-Kosei General Hospital (Sapporo, Japan), after histopathological review of all specimens by surgical pathologists in each hospital. In the Kamachi Group Hospitals, the histopathological specimens were selected randomly to reflect real clinical settings as much as possible. Sapporo-Kosei General Hospital provided only adenocarcinoma specimens. Prior to the experimental procedures, each WSI diagnosis was reviewed and verified by at least two senior pathologists. All WSIs were scanned at a magnification of x20 using the same Leica Aperio AT2 Digital Whole Slide Scanner (Leica Biosystems, Tokyo, Japan) and saved in SVS file format with JPEG2000 compression. Table 1 shows breakdowns of the distribution of the specimens: all specimens, consensus specimens agreed by two senior pathologists, and the training, validation, and test sets of prostate core needle biopsy WSIs from Kamachi Group Hospitals and Sapporo-Kosei General Hospital. According to the Cancer Care Ontario Guideline and the American Society of Clinical Oncology (ASCO) [2], patients with both low-volume (accounting for less than 10% of total tumor) and intermediate-risk (Gleason score 3 + 4 = 7) prostate cancer may be offered active surveillance. At the same time, because of the known inter-observer variability associated with the identification of minor Gleason pattern 4 components, prospective intradepartmental consultation with other pathologists should be considered for quality assurance [2].
Therefore, in this study, considering clinical responses, we have set two classes for prostate adenocarcinoma: indolent and aggressive. Indolent suggests observation (active surveillance) and aggressive suggests definitive therapy.

Dataset
In this study, we labelled (classified) prostate adenocarcinoma WSIs as follows. If a WSI has less than 20% of Gleason pattern 4 and Gleason pattern 5 components in total adenocarcinoma, it is classified as indolent (Fig. 1). If a WSI has 20% or more of Gleason pattern 4 and Gleason pattern 5 components in total adenocarcinoma, it is classified as aggressive (Fig. 1). We did not use a global Gleason score [36]. We set the cut-off at 20% of total prostate adenocarcinoma on a WSI (Fig. 1), rather than the 10% of [2], to reduce the possibility of inter-observer variability, because it has been widely reported that assessment of the percentage of Gleason pattern 4 in minute cancer foci has poor reproducibility among pathologists, especially for poorly formed glands [3,35,[37][38][39][40]. We did this because we wanted to exclude cases with inter-observer variability from the test set.
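The labeling rule above can be sketched as a small function. This is an illustration only: the function name and area-based inputs are ours, not part of the study's pipeline.

```python
def label_wsi(gp4_area: float, gp5_area: float,
              total_adenocarcinoma_area: float) -> str:
    """Label a WSI by the fraction of Gleason pattern 4/5 within all
    adenocarcinoma. Areas may be in any consistent unit (e.g. pixels)."""
    if total_adenocarcinoma_area == 0:
        return "benign"  # no adenocarcinoma observed on the WSI
    fraction = (gp4_area + gp5_area) / total_adenocarcinoma_area
    # 20% cut-off per the study's modified CCO/ASCO criterion
    return "aggressive" if fraction >= 0.20 else "indolent"
```

For example, a WSI with a 15% Gleason pattern 4 component and no pattern 5 would be labelled indolent, while 15% pattern 4 plus 10% pattern 5 would be aggressive.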
In total, we used indolent, aggressive, and benign as WSI labels for training the deep learning models at the WSI level. During the consensus review by two senior pathologists, 310 adenocarcinoma WSIs were excluded because of low concordance when classified into indolent or aggressive (Table 1). WSIs with low concordance generally involved borderline proportions of Gleason pattern 4 and 5 (predominantly between 10% and 20%). The training, validation, and test sets were selected randomly from the consensus WSIs (Table 1).

Annotation
A senior pathologist, who performs routine histopathological diagnoses in a general hospital, manually annotated 100 adenocarcinoma WSIs from the training set. The pathologist carried out annotations by free-hand drawing using an in-house online tool developed by customizing the open-source OpenSeadragon tool, a web-based viewer for zoomable images. On average, 10-15 lesions were annotated per WSI. The annotations were based on the histopathological characteristics of Gleason patterns 3, 4, and 5. For example, well-formed glands with intraluminal crystalloids (Gleason pattern 3) (Fig. 2A), large irregular cribriform glands (Gleason pattern 4) (Fig. 2B), crowded fused glands (Gleason pattern 4) (Fig. 2C), poorly formed small-sized glands with some lumen formation (Gleason pattern 4) (Fig. 2D), ductal adenocarcinoma lined by columnar cells with elongated nuclei (Gleason pattern 4) (Fig. 2E), and infiltrating cords and single tumor cells without lumen formation (Gleason pattern 5) (Fig. 2F) were manually annotated. We did not annotate areas where it was difficult to determine cytologically that the lesions were cancerous. For the training step, Gleason pattern 3 annotations were grouped as indolent, and Gleason pattern 4 and 5 annotations as aggressive. The pathologist included the cancer stroma surrounding cancer cells in the annotation area. The average annotation time per WSI was about five minutes. All annotations were modified (if necessary), confirmed, and verified by a senior pathologist who performs routine histopathological diagnoses in a general hospital.

Deep learning models
We trained the models via transfer learning using the partial fine-tuning approach [41]. This is an efficient fine-tuning approach that uses the weights of an existing pre-trained model and fine-tunes only the affine parameters of the batch normalization layers and the final classification layer. For the model architecture, we used EfficientNetB1 [42], starting with weights pre-trained on ImageNet. We used a training methodology similar to that of [25,43]; for clarity, we highlight the main parts below. We performed tissue detection using Otsu's thresholding method [44] to exclude the white background, and then extracted tiles only from the tissue regions. During prediction, we extracted tiles from the entire tissue region using a sliding window with a fixed-size stride. During training, we performed random balanced sampling of tiles, whereby we first randomly sampled three WSIs, one for each label. Then, from each corresponding WSI, we randomly sampled an equal number of tiles. For aggressive or indolent WSIs, we sampled from the annotated tissue regions; for benign WSIs, we sampled from all tissue regions.
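As a rough illustration of the tissue-detection and tiling steps, the following is a NumPy-only sketch of Otsu's method and a sliding-window tile filter. The exact implementation, tile size, and the "mostly tissue" criterion here are our assumptions, not the authors' code.

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> float:
    """Otsu's method: pick the threshold maximizing between-class variance
    of a grayscale image with intensities in [0, 1]."""
    hist, bin_edges = np.histogram(gray.ravel(), bins=256, range=(0.0, 1.0))
    hist = hist.astype(float)
    total = hist.sum()
    w0 = np.cumsum(hist) / total                    # class-0 weight at each cut
    w1 = 1.0 - w0                                   # class-1 weight
    centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    cum_mean = np.cumsum(hist * centers) / total
    mean_total = cum_mean[-1]
    # Between-class variance; guard against division by zero at the ends
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mean_total * w0 - cum_mean) ** 2 / (w0 * w1)
    sigma_b = np.nan_to_num(sigma_b)
    return float(centers[int(np.argmax(sigma_b))])

def tile_coordinates(mask: np.ndarray, tile: int, stride: int):
    """Yield (row, col) of tiles whose area is mostly tissue (mask == True)."""
    h, w = mask.shape
    for r in range(0, h - tile + 1, stride):
        for c in range(0, w - tile + 1, stride):
            if mask[r:r + tile, c:c + tile].mean() > 0.5:
                yield r, c
```

In practice the threshold would be computed on a low-magnification thumbnail; since tissue is darker than the white slide background, pixels below the threshold are treated as tissue.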
After a few epochs, we switched to hard mining of tiles, alternating between training and inference. During inference, the CNN was applied in a sliding window fashion to all the tissue regions in the WSI, and we then selected the k tiles with the highest probability of being positive. This step effectively selects the tiles that are most likely to be false positives when the WSI is negative. The selected tiles were placed in a training subset, and once that subset contained N tiles, a training step was run. We used k = 8, N = 256, and a batch size of 32.
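The hard-mining loop can be sketched as follows; `model_predict`, the data structures, and the function names are stand-ins, not the authors' code.

```python
import numpy as np

def select_hard_tiles(tile_probs: np.ndarray, k: int = 8) -> np.ndarray:
    """Indices of the k tiles with the highest positive probability.
    On a negative (benign) WSI these are the likeliest false positives,
    which makes them the most informative training examples."""
    k = min(k, len(tile_probs))
    return np.argsort(tile_probs)[::-1][:k]

def hard_mining_round(model_predict, wsis, k=8, subset_size=256):
    """One inference pass: collect hard tiles until the training subset
    reaches subset_size (N), at which point a training step would run."""
    subset = []
    for tiles in wsis:                   # tiles: array of tile images per WSI
        probs = model_predict(tiles)     # sliding-window inference
        for idx in select_hard_tiles(probs, k):
            subset.append(tiles[idx])
            if len(subset) == subset_size:
                return subset            # subset full: hand off to training
    return subset
```

With the study's settings (k = 8, N = 256, batch size 32), each filled subset would supply eight training batches before the next inference pass.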
For fully-supervised training, we performed the initial random sampling from annotated regions followed by hard mining; we refer to this as FS+WS. For weakly-supervised training, we performed only the hard mining, as it does not involve any annotations; we refer to this as WS.
To obtain a single prediction for a WSI from the tile predictions, we took the maximum probability over all of the tiles. We used the Adam optimizer [45] with binary cross-entropy as the loss function and the following parameters: beta_1 = 0.9, beta_2 = 0.999, a batch size of 32, and a learning rate of 0.001 when fine-tuning. We used early stopping by tracking the performance of the model on a validation set; training was stopped automatically when there was no further improvement in the validation loss for 10 epochs. We chose the model with the lowest validation loss as the final model.
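The max-probability aggregation and the early-stopping rule can be sketched as below. In practice a framework callback such as `tf.keras.callbacks.EarlyStopping` would typically be used; this only illustrates the logic, and the class/function names are ours.

```python
import numpy as np

def wsi_prediction(tile_probs: np.ndarray) -> dict:
    """Aggregate per-tile probabilities (shape: n_tiles x 2, columns
    [indolent, aggressive]) into WSI-level scores via the tile maximum."""
    max_probs = tile_probs.max(axis=0)
    return {"indolent": float(max_probs[0]), "aggressive": float(max_probs[1])}

class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""
    def __init__(self, patience: int = 10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0   # improvement: reset
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

The tile maximum means a single confidently positive tile is enough to flag the whole WSI, which suits needle biopsies where the tumor focus can be tiny.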

Inter- and intra-rater reliability studies
To evaluate pathologists' inter-rater and intra-rater reliability, the following WSIs were randomly selected from the test set: (i) 25 true-negative benign WSIs, (ii) 25 true-positive indolent WSIs, (iii) 25 WSIs predicted as both indolent and aggressive, and (iv) 25 true-positive aggressive WSIs (Table 4). The resulting 100 WSIs were randomly shuffled and presented to volunteer pathologists using an in-house online tool developed by customizing the open-source OpenSeadragon tool, a web-based viewer for zoomable images. For the intra-rater reliability study (Table 5), we performed the same experiment twice with a one-month gap, randomizing the order of WSIs each time. The volunteer pathologists recruited for this study consisted of 5 pathologists with less than 10 years of experience after becoming board certified and 5 pathologists with more than 10 years of experience after becoming board certified (10 pathologists in total) (Table 4).

Software and statistical analysis
The deep learning models were implemented and trained using TensorFlow [46]. AUCs were calculated in Python using the scikit-learn package [47] and plotted using Matplotlib [48]. The 95% CIs of the AUCs were estimated using the bootstrap method [49] with 1000 iterations.
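A minimal sketch of the bootstrap CI estimation is shown below. The study used scikit-learn with 1000 iterations; here the AUC is computed directly via the Mann-Whitney statistic (mathematically equivalent to scikit-learn's `roc_auc_score`) so the example is self-contained, and the function names are ours.

```python
import numpy as np

def auc_score(y_true, y_prob) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs ranked
    correctly (Mann-Whitney U); tied scores count as half."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    pos, neg = y_prob[y_true == 1], y_prob[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

def bootstrap_auc_ci(y_true, y_prob, n_iter=1000, alpha=0.05, seed=0):
    """Percentile CI by resampling cases with replacement."""
    rng = np.random.default_rng(seed)
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    scores = []
    while len(scores) < n_iter:
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].min() == y_true[idx].max():
            continue                      # need both classes in the resample
        scores.append(auc_score(y_true[idx], y_prob[idx]))
    lo, hi = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```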
The true positive rate (TPR) was computed as TPR = TP / (TP + FN), and the false positive rate (FPR) was computed as FPR = FP / (FP + TN), where TP, FP, TN, and FN represent true positives, false positives, true negatives, and false negatives, respectively. The ROC curve was computed by varying the probability threshold from 0.0 to 1.0 and computing both the TPR and FPR at each threshold.
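The threshold sweep described above can be written out directly. This is a didactic sketch (scikit-learn's `roc_curve` does the same more efficiently); the function name is ours.

```python
import numpy as np

def roc_points(y_true, y_prob, thresholds):
    """(FPR, TPR) at each threshold, with prediction = probability >= t."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    points = []
    for t in thresholds:
        pred = y_prob >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        tn = np.sum(~pred & (y_true == 0))
        tpr = tp / (tp + fn) if tp + fn else 0.0   # TP / (TP + FN)
        fpr = fp / (fp + tn) if fp + tn else 0.0   # FP / (FP + TN)
        points.append((float(fpr), float(tpr)))
    return points
```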
To assess the histopathological diagnostic concordance of pathologists, we used the S-score statistic, a chance-adjusted index of inter-rater reliability for categorical measurements between two or more raters [50]. To evaluate the intra-rater reliability of each pathologist, we used weighted kappa statistics [51,52].
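For illustration, a two-rater form of Bennett's S and a quadratic-weighted kappa can be sketched as follows. The study evaluated ten raters and cites [50-52] for the exact statistics; the uniform-chance form of S and the quadratic weights here are common conventions shown as assumptions, and the function names are ours.

```python
import numpy as np

def s_score(r1, r2, q: int) -> float:
    """Bennett's S for two raters with q categories: observed agreement
    adjusted for uniform chance agreement (1/q)."""
    po = float(np.mean(np.asarray(r1) == np.asarray(r2)))
    return (po - 1.0 / q) / (1.0 - 1.0 / q)

def quadratic_weighted_kappa(r1, r2, q: int) -> float:
    """Weighted kappa with quadratic weights for ordinal labels 0..q-1."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    obs = np.zeros((q, q))
    for a, b in zip(r1, r2):
        obs[a, b] += 1                                # joint rating counts
    obs /= obs.sum()
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))  # chance expectation
    i, j = np.indices((q, q))
    w = (i - j) ** 2 / (q - 1) ** 2                   # quadratic disagreement
    return float(1.0 - (w * obs).sum() / (w * exp).sum())
```

Unlike kappa, S does not depend on the raters' marginal label frequencies, which makes it better behaved when one category (e.g. benign) dominates.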

Availability of data and material
The datasets generated during and/or analysed during the current study are not publicly available due to specific institutional requirements governing privacy protection but are available from the corresponding author on reasonable request. The datasets that support the findings of this study are available from Kamachi Group Hospitals (Fukuoka, Japan) and Sapporo-Kosei General Hospital (Sapporo, Japan), but restrictions apply to the availability of these data, which were used under a data use agreement which was made according to the Ethical Guidelines for Medical and Health Research Involving Human Subjects as set by the Japanese Ministry of Health, Labour and Welfare (Tokyo, Japan), and so are not publicly available. However, the data are available from the authors upon reasonable request for private viewing and with permission from the corresponding medical institutions within the terms of the data use agreement and if compliant with the ethical and legal requirements as stipulated by the Japanese Ministry of Health, Labour and Welfare.
For WSI classification, the model with FS pre-training followed by WS (TL-Colon poorly ADC (x20, 512) and FS+WS) slightly improved ROC-AUC, accuracy, and sensitivity and decreased log-loss compared to the WS-only model (TL-Colon poorly ADC (x20, 512) and WS) on aggressive WSIs, but not on indolent WSIs (Fig. 3 and Table 3). On the other hand, comparing the models with and without FS learning ([TL-Colon poorly ADC (x20, 512) and FS+WS] vs. [TL-Colon poorly ADC (x20, 512) and WS]) for indolent and aggressive prediction at the tile level, the FS+WS approach predicted indolent (Gleason pattern 3) (Fig. 4A, C, D, F) and aggressive (Gleason pattern 4 and 5) (Fig. 4M, O, P, R) patterns more robustly on heatmap images than the WS approach (Fig. 4A, B, D, E, M, N, P, Q). Interestingly, the FS+WS model (TL-Colon poorly ADC (x20, 512) and FS+WS) precisely predicted indolent pattern (Gleason pattern 3) areas where pathologists did not mark ink dots when performing diagnosis (Fig. 4G, I, J, L), which were not predicted by the WS approach (TL-Colon poorly ADC (x20, 512) and WS) (Fig. 4G, H, J, K).

False positive indolent and aggressive prediction of core needle biopsy WSIs
According to the histopathological reports and additional review by pathologists, Fig. 7A is prostatic hyperplasia and Fig. 7E is chronic prostatitis, both benign (non-neoplastic) lesions. Our model (TL-Colon poorly ADC (x20, 512) and FS+WS) showed false positive predictions of indolent (Fig. 7B) and aggressive (Fig. 7F) patterns, which caused indolent and aggressive WSI classifications. Indolent false positive tissue areas showed large and small dilated atrophic glands (Fig. 7C, D), and aggressive false positive tissue areas showed severe infiltration of lymphocytes and histiocytes (Fig. 7G, H); their morphological similarity to the indolent pattern (Gleason pattern 3) and the aggressive pattern (Gleason pattern 4 and 5), respectively, could be the primary cause of the false positives.

False negative indolent and aggressive prediction of core needle biopsy WSIs
According to the histopathological reports and additional consensus review by pathologists, in Fig. 8A, infiltrating adenocarcinoma showed an indolent pattern (Gleason pattern 3) in a limited area of fragment #1 (Fig. 8C); fragments #2-#10 were benign (non-neoplastic) lesions. The heatmap image (Fig. 8B) showed a weakly indolent-predicted tile (Fig. 8D) corresponding to the Gleason pattern 3 histopathology (Fig. 8C), resulting in a false negative WSI classification. In Fig. 8E, a few fragmented adenocarcinoma foci with a cribriform pattern, indicating an aggressive pattern (Gleason pattern 4) (Fig. 8G), were present in one fragment (#2). The heatmap image (Fig. 8F) showed true positive prediction of a few adenocarcinoma foci, but with low probability (Fig. 8H), again resulting in a false negative WSI classification. Both of these WSIs (Fig. 8A, E) contain a very low volume of adenocarcinoma, which could be the primary cause of the false negatives.

Both indolent and aggressive prediction outputs of core needle biopsy WSIs
There were 114 of 645 WSIs in the test set (Table 1) that were predicted as both indolent and aggressive by our model (TL-Colon poorly ADC (x20, 512) and FS+WS). After reviewing these WSIs carefully, we found that they tended to consist of a mixture of Gleason pattern 3 and Gleason pattern 4 adenocarcinoma near the borderline (20% cut-off) between the indolent and aggressive evaluations (Fig. 1). For example, histopathologically, small, indistinct, or fused gland (equivalent to Gleason pattern 4) adenocarcinoma was predominant (Fig. 9A, D, E), but at the same time Gleason pattern 3 adenocarcinoma was mixed in to various degrees (Fig. 9A, D, E) within the area of Gleason pattern 4 adenocarcinoma infiltration. Importantly, in all 114 WSIs predicted as both indolent and aggressive, the boundary between Gleason pattern 3 and Gleason pattern 4 adenocarcinoma was unclear and transitional, which was confirmed retrospectively by senior pathologists. The heatmap images of indolent (Fig. 9B) and aggressive (Fig. 9C) predictions revealed that, to some extent, the indolent (Gleason pattern 3) (Fig. 9F, H) and aggressive (Gleason pattern 4 and 5) (Fig. 9G, I) prediction outputs overlapped. Therefore, the WSI prediction outputs (indolent or aggressive) were approximate values. For these WSIs, the WSI classification was the label with the larger of the indolent and aggressive values. If we compute ROC-AUC and log-loss under criteria that accept double-label WSI classification outputs (meaning both indolent and aggressive prediction outputs), the scores are as follows:
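The decision rule for these ambiguous WSIs (take the larger of the two class outputs when both exceed the positive threshold) can be sketched as below. The 0.5 threshold and the function name are our assumptions, not stated in the study.

```python
def classify_wsi(indolent_prob: float, aggressive_prob: float,
                 threshold: float = 0.5) -> str:
    """WSI label from the two WSI-level class probabilities."""
    if indolent_prob < threshold and aggressive_prob < threshold:
        return "benign"
    if indolent_prob >= threshold and aggressive_prob >= threshold:
        # double-label output: report the class with the larger value
        return "indolent" if indolent_prob > aggressive_prob else "aggressive"
    return "indolent" if indolent_prob >= threshold else "aggressive"
```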

Inter- and intra-rater reliability study
To assess the inter-rater reliability of benign, indolent adenocarcinoma, and aggressive adenocarcinoma classification on WSIs, we selected WSIs based on our deep learning model's (TL-Colon poorly ADC (x20, 512) and FS+WS) WSI prediction outputs and the consensus classification by senior pathologists. For the true-negative cohort (25 WSIs; consensus: benign, AI-predicted label: benign), S-scores were in the range of 0.90-0.95, indicating "almost perfect agreement" (Table 4). For the true-positive indolent cohort (25 WSIs; consensus: indolent, AI-predicted label: indolent), S-scores were in the range of 0.56-0.72, indicating "moderate to substantial agreement" (Table 4). For the cohort predicted as both indolent and aggressive (25 WSIs; consensus: 13 indolent and 12 aggressive, AI-predicted label: indolent & aggressive), S-scores were in the range of 0.10-0.28, indicating "slight to fair agreement" (Table 4). For the true-positive aggressive cohort (25 WSIs; consensus: aggressive, AI-predicted label: aggressive), S-scores were in the range of 0.48-0.81, indicating "moderate to almost perfect agreement" (Table 4).
The inter-rater reliability study was performed twice on the identical set of 100 WSIs, randomized each time, with a one-month interval between the 1st and 2nd studies. The S-scores in the 2nd study were slightly higher than in the 1st study, and the interpretations in the 2nd study were modestly improved (Table 4). For the aggressive classification, the S-scores of the pathologists with more than 10 years of experience were higher than those of the pathologists with less than 10 years of experience (Table 4). Overall, WSIs that were predicted with both indolent & aggressive labels by our deep learning model (TL-Colon poorly ADC (x20, 512) and FS+WS) resulted in very low S-scores in the range of 0.10-0.28, meaning poor inter-rater reliability (agreement) among pathologists regardless of experience (Table 4). As for intra-rater reliability, all 10 pathologists achieved robust weighted kappa values in the range of 0.93-0.97, indicating "almost perfect agreement" (Table 5). Figure 10 shows a representative example WSI with poor evaluation (diagnostic) concordance among pathologists. In the inter-rater reliability study, 5 pathologists evaluated this WSI as indolent and 5 as aggressive (Fig. 10A). In Fig. 10A, there is a wide variety of adenocarcinoma histopathologies. The heatmap images show both indolent (Fig. 10B) and aggressive (Fig. 10C) predictions by our deep learning model (TL-Colon poorly ADC (x20, 512) and FS+WS). In Fig. 10D, Gleason pattern 3 (indicating indolent) adenocarcinoma was predominant and was predicted as indolent (Fig. 10E) but not aggressive (Fig. 10F). In Fig. 10G and J, Gleason pattern 3 (indicating indolent) and Gleason pattern 4 (indicating aggressive) adenocarcinoma were mixed, making it hard to choose between the two labels; these areas were predicted as both indolent (Fig. 10H and K) and aggressive (Fig. 10I and L).

Discussion
In this study, we trained deep learning models for the classification of indolent and aggressive prostate adenocarcinoma in core needle biopsy WSIs to infer patients' optimum clinical interventions (active surveillance or definitive therapy). We trained the models using a combination of transfer learning [25,41,55], weakly supervised [53], and fully supervised [24,43,54] learning approaches. The evaluation results at the WSI level showed no significant differences between the transfer plus weakly supervised learning model (TL-Colon poorly ADC (x20, 512) and WS) and the transfer, fully and weakly supervised learning model (TL-Colon poorly ADC (x20, 512) and FS+WS) (Table 3). However, at the tile level (visualised via heatmap images), the FS+WS model (TL-Colon poorly ADC (x20, 512) and FS+WS) predicted both indolent (Gleason pattern 3) and aggressive (Gleason pattern 4 and 5) areas more precisely than the weakly supervised model (TL-Colon poorly ADC (x20, 512) and WS) (Fig. 4). Therefore, we selected the FS+WS model (TL-Colon poorly ADC (x20, 512) and FS+WS) as the best model; it achieved ROC-AUCs of 0.846 (CI: 0.813-0.879) for indolent and 0.980 (CI: 0.969-0.990) for aggressive (Table 3).
To the best of our knowledge, this is the first study to demonstrate a deep learning model that predicts patients' clinical interventions (active surveillance or definitive therapy) based on histopathological WSIs. A previously reported deep learning model achieved ROC-AUCs in the range of 0.855 (external test set) to 0.974 (internal test set) for the classification of benign and Gleason grade groups 1-2 vs. Gleason grade group 3 or higher [58]. Our model (TL-Colon poorly ADC (x20, 512) and FS+WS) achieved better ROC-AUC performance for aggressive (0.980 (CI: 0.969-0.990)) (Table 3). On inspection of the WSI heatmaps, our model predicted indolent (Gleason pattern 3) and aggressive (Gleason pattern 4 and 5) lesions well (Fig. 4, 5, 6), but still produced a few false positive and false negative predictions (Fig. 7, 8). The model tends to show false positive predictions of indolent lesions where the tissue consists of atrophic glands and of aggressive lesions where the tissue shows severe inflammatory cell infiltration (Fig. 7). It tends to show false negative predictions of indolent and aggressive lesions where the adenocarcinoma volume is limited (Fig. 8).
However, a major limitation of this study is the inability of the model to decide on borderline cases. Because we increased the cut-off from 10% to 20%, the model would be unable, for instance, to predict a Gleason score 3 (85%) + 5 (15%) case, which according to the guidelines should be aggressive. Any prediction by our model on such a case would be unreliable due to the lack of borderline cases in the training set and our modified cut-off. Another limitation of this study relates to WSIs with both indolent (Gleason pattern 3) and aggressive (Gleason pattern 4 and 5) components mixed in various proportions, where there was wide variability in inter-rater (observer) concordance among pathologists, regardless of their years of experience after becoming board certified (Table 4 and Fig. 10) [59]. On such WSIs, which consisted of a mixture of Gleason pattern 3 and Gleason pattern 4 adenocarcinoma near the borderline (20% cut-off) between the indolent and aggressive evaluations (Fig. 1), our deep learning model (TL-Colon poorly ADC (x20, 512) and FS+WS) tended to predict both indolent and aggressive WSI outputs (17.7% of total WSIs in the test set), as did the pathologists (Fig. 9, 10). Indeed, a certain number of WSIs in the test set had a Gleason pattern 4 or Gleason pattern 5 component of around 20% of total adenocarcinoma, and these were the major cause both of poor concordance among pathologists and of deep learning model prediction outputs of both indolent and aggressive (Fig. 10). It has been reported that with less than 10% involvement of the core, smaller foci were more difficult to assess, with only moderate agreement [35].

Table 4 Inter-rater reliability between pathologists using the S-score for two experiments on the same set conducted with a one-month gap
Given that in a small focus only a few glands of a given pattern can markedly affect the percent Gleason pattern 4, consideration should be given to not recording percent Gleason pattern 4 in small foci of Gleason score 7 tumors on core needle biopsy [35]. This issue is inevitable when classifying WSIs based on percentages of adenocarcinoma components (Gleason patterns 3, 4, and 5). Moreover, there were a certain number of WSIs in which there was a marked discrepancy among pathologists as to whether the prostate adenocarcinoma should be classified as Gleason pattern 3 or Gleason pattern 4 (Fig. 10). In practice, the histopathological segregation of Gleason pattern 3 from Gleason pattern 4 is often problematic [38,59]. Currently, according to the diagnostic criteria for Gleason pattern 4 adenocarcinoma on core needle biopsy, poorly formed glands immediately adjacent to other well-formed glands, regardless of their number, and small foci of 5 or fewer poorly formed glands, regardless of their location, should be graded as Gleason pattern 3 [39], which would be one of the primary causes of both indolent and aggressive prediction outputs.

Fig. 10 In (D), Gleason pattern 3 adenocarcinoma was predominant, which was precisely predicted as indolent (E) but not as aggressive (F). In (G), Gleason pattern 3 and 4 adenocarcinoma were mixed and were predicted as both indolent (H) and aggressive (I). In (J), the majority of adenocarcinoma was mixed Gleason pattern 3 and Gleason pattern 4, predicted as both indolent (K) and aggressive (L). The model (TL-Colon poorly ADC (x20, 512) and FS+WS) predicted the WSI (A) as both indolent and aggressive. The heatmaps use the jet color map, where blue indicates low probability and red indicates high probability.
Moreover, in this study, instead of assigning an indolent or aggressive label to each core needle biopsy specimen, we considered all specimens on a WSI together as a single specimen. Therefore, inter-observer concordance among pathologists could be poor when the total histopathological area was too large to evaluate (e.g., six or eight core specimens on a single WSI). However, this issue could be resolved by preparing one core needle biopsy specimen per glass slide (WSI) when deep learning model prediction is anticipated.
Interestingly, when we compute the performance of the model (TL-Colon poorly ADC (x20, 512) and FS+WS) under criteria that accept double-label WSI classification outputs (both indolent and aggressive), the indolent ROC-AUC increased (0.956 [CI: 0.940-0.970]) and the log-loss decreased (0.969 [CI: 0.835-1.109]) compared to Table 3. A further limitation of this study is the limited generalization of the deep learning model (TL-Colon poorly ADC (x20, 512) and FS+WS), because the training and test sets were provided by the same supplier hospitals (Kamachi Group Hospitals and Sapporo-Kosei General Hospital). Therefore, as a next step, to verify the versatility of the model (TL-Colon poorly ADC (x20, 512) and FS+WS), we need to perform a verification study using a sufficient number of WSIs from a diverse range of hospitals. The main advantage of our deep learning model (TL-Colon poorly ADC (x20, 512) and FS+WS) is that it can predict patients' optimum clinical interventions (active surveillance for indolent, or definitive therapy for aggressive) on core needle biopsy WSIs. For most patients with low-risk (Gleason score 6 or lower) prostate cancer, active surveillance is the recommended disease management strategy [2]. At the same time, select patients with low-volume, intermediate-risk prostate cancer (indolent in this study) can be offered active surveillance [2]. In routine histopathological diagnosis of prostate cancer in core needle biopsy specimens, pathologists have to report Gleason scores for each core for risk assessment using a microscope, which is fatiguing and laborious work. Moreover, significant inter-rater variability among pathologists in the diagnosis of prostate cancer has been reported [9,35,59].
By using our deep learning model for initial screening, pathologists can review WSIs together with heatmap images highlighting indolent (Gleason pattern 3) and aggressive (Gleason pattern 4 and 5) adenocarcinoma and the WSI prediction outputs (benign, indolent, and aggressive), which would be of great benefit to general pathologists in making diagnoses.