Clinical samples
We obtained formalin-fixed, paraffin-embedded (FFPE) tissue samples from 55 primary lung cancer tumors (35 lung adenocarcinomas, hereafter denoted LUAD, and 20 lung squamous cell carcinomas, hereafter denoted LUSC). The material was derived from FFPE surgical resections at the Medical University of Lublin, Poland. None of the patients had received neoadjuvant therapy at the time of diagnosis and surgical resection of the primary cancer lesions. We collected the patients’ clinical and demographic data in a manner that protected their personal information. The study protocol received ethical approval from the Ethics Committee of the Medical University of Lublin, Poland (no. KE-0254/235/2016).
Extraction and annotation of the training dataset for ARA-CNN
We extracted the training dataset from the H&E slides sourced from the 55 lung cancer patients, with one slide per patient (Fig. 1a). Of these 55 slides, 26 were selected, and regions of contiguous tissue in them were annotated using QuPath [36] by an expert pathologist, who assigned each region to one of the following nine classes: tumor (neoplastic epithelial cells); stroma (connective tissue within the tumor or extra-tumoral connective tissue); mixed (connective tissue strongly infiltrated with immune cells); immune (lymphocytes and plasma cells, or fragments of pulmonary lymph nodes); vessel (smooth muscle layers of veins and arteries, with red blood cells within the lumen); bronchi (cartilage and bronchial mucosa); necrosis (necrotic tissue or necrotic debris); lung (lung parenchyma); and background of the tissue scan (no tissue). Of the 26 annotated slides, 13 were from LUAD, 10 from LUSC, 2 from large cell carcinoma and 1 from small-cell lung cancer patients. The TME in the original slides differed between patients, which gave us a diverse set of training examples (Sup. Figure 1 in Additional file 1). Some slides were largely covered by tumor and necrotic cells or stroma, while in others immune infiltration, vessels or the mixed class dominated. In most slides we observed “normal” lung structures, so bronchi were less common and required additional training data from many sections. All annotated regions were chosen to provide the best material for model training. To this end, for a given class we annotated only tissue that was unambiguously of that class and of which enough was visible in the slides to yield a sufficient number of annotated patches. For example, for the vessel class we did not consider arterioles, dilated capillaries or venules, as these structures were too small for the chosen patch scale. Lymphatics were ignored because they are imperceptible on H&E slides.
For the immune class, intrapulmonary lymph nodes were included due to their high concentration of well-visible lymphocytes, even though the presence of such lymph nodes is not correlated with the tumor’s immune response.
The annotated regions were then traced by a moving window, which cut out non-overlapping square patches of tissue with a side length of 87 μm (corresponding to 172 px). In addition to 87 μm, we also tested the training performance for patches with side lengths of 74 μm and 100 μm (see Sup. Table 1–3 in Additional file 1). They resulted in worse performance (mean classification accuracy of 84.64% for 74 μm and 84.35% for 100 μm, versus 85.21% for 87 μm), so we proceeded with the 87 μm patches. This gave us an initial version of the training dataset, which was then improved upon by utilizing human-in-the-loop active learning, as part of the previously proposed accurate, reliable and active (ARA) image classification framework [37] (Fig. 1b; see below). In total, we ended up with 23,199 patches, divided as follows: 3311 tumor, 1511 stroma, 716 mixed, 1196 immune, 1236 vessel, 2030 bronchi, 4448 necrosis, 6031 lung, and 2211 background patches.
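The moving-window extraction step can be sketched as follows (an illustrative Python fragment, not the code used in the study; the function name and the grid-only traversal are our simplifications):

```python
import numpy as np

# Non-overlapping square patches; 172 px corresponds to the 87 um
# side length reported in the text.
PATCH_PX = 172

def extract_patches(slide: np.ndarray, patch_px: int = PATCH_PX):
    """Yield (top, left, patch) for every full non-overlapping patch."""
    h, w = slide.shape[:2]
    for top in range(0, h - patch_px + 1, patch_px):
        for left in range(0, w - patch_px + 1, patch_px):
            yield top, left, slide[top:top + patch_px, left:left + patch_px]

# Example: a dummy "slide" of 3 x 2 patch widths yields 6 patches.
dummy = np.zeros((3 * PATCH_PX, 2 * PATCH_PX, 3), dtype=np.uint8)
patches = list(extract_patches(dummy))
print(len(patches))  # 6
```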
Training and validation of the ARA-CNN model
The main component of the ARA framework is ARA-CNN, a Convolutional Neural Network (CNN) architecture inspired in part by Microsoft's ResNet [38] and DarkNet-19 [39]. It includes standard techniques for such models: batch normalization [40] (for normalization and to reduce overfitting) and dropout [41] (to reduce overfitting). The latter also allowed us to apply variational dropout [42] during testing. Variational dropout is used to estimate uncertainty for every input image, which is returned together with the predicted class. For an overview of the model architecture, refer to Fig. S1 in Rączkowski et al. [37].
The model was trained in three active learning rounds, each improving on the previous one. After the first training round, the distribution of uncertainty was measured separately for the images in each class; based on their higher median uncertainty, three classes were identified as needing more training examples: mixed, vessel and bronchi. These were passed on to the pathologist, who labeled new regions belonging to these classes. The resulting new training samples were extracted and added to the previous training dataset. After each iteration of this adaptive training procedure, the uncertainty for each class was measured again. After three iterations, the uncertainty results, with a median uncertainty level below 1.5 for each class, were deemed satisfactory (uncertainty was measured as the entropy of the predictions [37]; see Sup. Figure 2 for the entropy levels obtained after the last iteration).
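The entropy-based uncertainty measure can be sketched as follows (an illustrative fragment; averaging over stochastic forward passes follows the variational dropout idea, and all names are ours):

```python
import numpy as np

def predictive_entropy(mc_probs: np.ndarray) -> float:
    """Entropy (in nats) of the mean class distribution over T
    stochastic forward passes; mc_probs has shape (T, n_classes)."""
    p = mc_probs.mean(axis=0)
    p = np.clip(p, 1e-12, 1.0)  # guard against log(0)
    return float(-(p * np.log(p)).sum())

# A confident prediction has near-zero entropy ...
confident = np.tile([0.98, 0.01, 0.01], (10, 1))
# ... while a uniform one over 3 classes has entropy log(3) ~ 1.099.
uncertain = np.tile([1 / 3, 1 / 3, 1 / 3], (10, 1))
print(predictive_entropy(confident), predictive_entropy(uncertain))
```

With nine tissue classes the maximum possible entropy is log(9) ≈ 2.20, so a per-class median below 1.5 indicates predictions that are clearly more peaked than uniform.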
For training and evaluation of the ARA-CNN model on the lung cancer tissue patches, we used stratified 10-fold cross-validation. This was the case during the active learning process as well. In a given iteration, the whole gathered dataset of images was split into a training dataset and a test dataset used for evaluation. For the final evaluation, the full dataset of 23,199 patches was divided into the test dataset containing 2316 patches and the training dataset consisting of 20,883 patches. Each class was split in exactly the same proportion: 10% were sent to the test dataset and 90% to the training dataset.
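The splitting scheme can be sketched with scikit-learn (an illustrative fragment using synthetic labels, not the study's code):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for per-patch labels from the nine tissue classes.
rng = np.random.default_rng(0)
labels = rng.integers(0, 9, size=1000)
indices = np.arange(len(labels))

# 90/10 stratified split: each class keeps the same proportion.
train_idx, test_idx = train_test_split(
    indices, test_size=0.1, stratify=labels, random_state=0)

# Stratified 10-fold cross-validation over the training portion.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
folds = list(skf.split(train_idx, labels[train_idx]))
print(len(folds), len(test_idx))  # 10 100
```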
Additionally, in each training epoch the training data was split into two parts: the actual training data and a validation dataset. The latter informed the learning rate reducer: we monitored accuracy on the validation set and, if it stopped improving, reduced the learning rate by a factor of 0.1. The split between the actual training data and the validation set was in proportion 90 to 10%, respectively.
Each training process included real-time data augmentation, chosen at random from the following set of transformations: horizontal flip, vertical flip, rotation (up to 90 degrees), zoom (up to 40%), width shift (up to 10%) and height shift (up to 10%).
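A simplified version of such on-the-fly augmentation is sketched below (illustrative only: random flips and an axis-aligned rotation; the continuous rotations, zooms and shifts listed above would normally come from a deep learning framework's image generator):

```python
import numpy as np

def augment(patch: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply a random subset of simple, label-preserving transforms."""
    if rng.random() < 0.5:
        patch = patch[:, ::-1]   # horizontal flip
    if rng.random() < 0.5:
        patch = patch[::-1, :]   # vertical flip
    if rng.random() < 0.5:
        patch = np.rot90(patch)  # 90-degree rotation
    return patch

rng = np.random.default_rng(42)
patch = np.arange(172 * 172 * 3, dtype=np.uint8).reshape(172, 172, 3)
augmented = augment(patch, rng)
print(augmented.shape)  # (172, 172, 3) -- shape is preserved
```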
For parameter optimisation, we used the Adam optimiser [43]. The training time was set to 100 epochs. The training data was passed to the network in batches of 32, while the validation and test data were split into batches of 128 images. The loss function used during training was categorical cross-entropy. The final ARA-CNN model was trained on the whole dataset of 23,199 patches (still split into training and validation parts in a 90 to 10% proportion). This final model was used in the further experiments: segmentation of slides from an independent TCGA dataset, and prediction of mutations and patient survival (see below).
TCGA data extraction and processing
Independent patient data, including H&E images, mutations, and clinical information, was extracted from the TCGA database (as of 2020-05-14) through the REST API provided by TCGA. The database contained 478 LUAD patients with at least one H&E tissue slide each. Frozen tissue slides were filtered out, which left 514 images. We developed a parallelized pipeline that downloaded the slides, ran all necessary calculations and then removed the processed images.
Each processed slide was split into non-overlapping patches with a side length of 87 μm, the same as the patches in LubLung. As an optimisation step, we filtered the extracted patches and excluded those whose area was mostly empty. To perform this filtering, we first converted each patch to grayscale using the standard Rec. 601 luma formula. Then we mapped each pixel to either black (pixel value < 200) or white (pixel value ≥ 200) and counted both. A patch was deemed relevant if the ratio of black to white pixels was larger than 0.05.
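The filtering rule can be sketched as follows (illustrative Python; the thresholds are those stated above, the function name is ours):

```python
import numpy as np

def is_relevant(patch_rgb: np.ndarray, threshold: int = 200,
                min_ratio: float = 0.05) -> bool:
    """Keep a patch only if enough of it is dark (i.e. tissue)."""
    # Rec. 601 luma: Y = 0.299 R + 0.587 G + 0.114 B
    luma = (0.299 * patch_rgb[..., 0] +
            0.587 * patch_rgb[..., 1] +
            0.114 * patch_rgb[..., 2])
    black = np.count_nonzero(luma < threshold)   # "tissue" pixels
    white = luma.size - black                    # "empty" pixels
    if white == 0:
        return black > 0  # fully dark patch: certainly tissue
    return black / white > min_ratio

empty = np.full((172, 172, 3), 255, dtype=np.uint8)   # all white
tissue = np.full((172, 172, 3), 120, dtype=np.uint8)  # all dark
print(is_relevant(empty), is_relevant(tissue))  # False True
```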
In addition to the slides, clinical and mutation data was also extracted from TCGA. The former was downloaded using the curatedTCGAData R package [44] and contained data for 518 LUAD patients. We removed data for Asian patients, with race determined from the ‘race’ column present in the TCGA clinical data. These patients are noted to be very distinct in terms of disease progression [45, 46] and mutational profile; in particular, EGFR is mutated in up to 45% of Asian cases, compared with closer to 10% for Caucasians. These factors, together with the low number of such cases relative to the whole dataset, led us to omit Asian patients from further analysis. This left us with clinical data for 510 patients. Mutation data for 563 LUAD patients was downloaded from the UCSC Xena Browser [47] (dated 07-20-2019), in the form of a TCGA-LUAD.muse_snv.tsv file. This dataset was then reduced to include only genes selected as relevant to lung cancer. To this end, we examined a set of genes that are either known to be frequently mutated in lung cancer and important for patient prognosis and treatment, as characterized by the SureSelect Cancer All-In-One Catalog and Custom Assays [48], or were studied previously by Kather et al. [34]. Specifically, we selected those genes from the Kather et al. study that showed an AUC higher than 70% or a p-value < 0.001 in the task of mutation prediction by a deep learning model. From this set of 24 genes, we filtered out genes mutated in 15 or fewer LUAD patients. This resulted in the following set of 13 selected genes: ALK, BRAF, DDR2, EGFR, KEAP1, KRAS, MET, PIK3CA, RET, ROS1, STK11, TP53, PDGFRB.
All datasets were merged together by the TCGA patient identifier. This left an intersection between the image, clinical and mutation datasets, which contained 444 patients and 506 slides.
As a last step, the clinical variables were pre-processed in the following manner. Age was dichotomized into two groups: 65 years and older, and younger than 65 years. Sex was set to 1 for male and 0 for female patients. Pack years were categorized into three groups: non-smoker (0 pack years, or smoking history set to ‘lifelong non-smoker’ or, if information about pack years was missing, smoking history set to ‘current reformed smoker for > 15 years’), light smoker (less than 30 pack years or, if information about pack years was missing, smoking history set to ‘current reformed smoker for < or = 15 years’) and heavy smoker (30 or more pack years, or smoking history set to ‘current smoker’). Pathologic stage was likewise mapped into three groups: early (stage I, Ia, Ib), locally advanced (stage II, IIa, IIb, IIIa) and advanced (stage IIIb, IV).
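This preprocessing can be sketched in pure Python (the history strings follow the TCGA categories quoted above; the function names, the precedence of ‘current smoker’ over pack years, and the handling of missing values are our assumptions):

```python
def age_group(age: float) -> int:
    """1 for 65 and older, 0 for younger than 65."""
    return 1 if age >= 65 else 0

def sex_code(sex: str) -> int:
    """1 for male, 0 for female."""
    return 1 if sex.lower() == "male" else 0

def smoking_group(pack_years, history: str) -> str:
    if history == "current smoker":
        return "heavy smoker"
    if pack_years is None:  # fall back on smoking history
        if history in ("lifelong non-smoker",
                       "current reformed smoker for > 15 years"):
            return "non-smoker"
        if history == "current reformed smoker for < or = 15 years":
            return "light smoker"
        return "unknown"
    if pack_years == 0:
        return "non-smoker"
    return "light smoker" if pack_years < 30 else "heavy smoker"

def stage_group(stage: str) -> str:
    s = stage.lower().replace("stage ", "")
    if s in ("i", "ia", "ib"):
        return "early"
    if s in ("ii", "iia", "iib", "iiia"):
        return "locally advanced"
    return "advanced"  # iiib, iv

print(age_group(70), sex_code("Female"),
      smoking_group(25, ""), stage_group("Stage IIa"))
```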
Due to its limited availability, we did not include treatment information in the main dataset. For the sake of additional analyses, we prepared a smaller dataset, containing 377 patients, with treatment data included (Sup. Methods in Additional File 1).
TCGA H&E patch normalization
Because the tissue slides stored in TCGA exhibit high color variation, they needed to be normalized to a common color space matching that of the training dataset. Three normalization algorithms were considered: Reinhard et al. [49], Macenko et al. [50] and Vahadane et al. [51]. To decide which of the three should be used on the TCGA data, a series of experiments was conducted in which the LubLung training dataset was normalized with each of these algorithms and ARA-CNN was then trained in a cross-validation schema. The best classification performance (mean accuracy of 81.52% for Macenko et al., 82.39% for Vahadane et al. and 85.76% for Reinhard et al.) was achieved for the dataset variant normalized with the Reinhard et al. algorithm. Consequently, each relevant patch extracted from the TCGA slides was normalized individually with the Reinhard et al. procedure. The normalization was performed with a region-of-interest image selected at random from the training dataset; all image patches from the TCGA database were transformed to match the color space of that image.
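The core of Reinhard-style normalization is matching per-channel color statistics to a reference image. The sketch below applies this mean/standard-deviation matching directly to the RGB channels for brevity, whereas the actual method [49] operates in the Lab color space:

```python
import numpy as np

def reinhard_like(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Shift/scale each channel of `source` so its mean and standard
    deviation match those of `target` (simplified Reinhard matching)."""
    src = source.astype(np.float64)
    tgt = target.astype(np.float64)
    out = np.empty_like(src)
    for c in range(src.shape[-1]):
        s_mu, s_sd = src[..., c].mean(), src[..., c].std() + 1e-8
        t_mu, t_sd = tgt[..., c].mean(), tgt[..., c].std()
        out[..., c] = (src[..., c] - s_mu) / s_sd * t_sd + t_mu
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, (172, 172, 3)).astype(np.uint8)
reference = rng.integers(100, 200, (172, 172, 3)).astype(np.uint8)
normalized = reinhard_like(patch, reference)
```

In the pipeline described above, `reference` would be the randomly chosen region-of-interest image from the training dataset, applied to every TCGA patch.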
TCGA image data segmentation using ARA-CNN
The normalized patches served as input to ARA-CNN. For each input patch, the model returned a classification probability into each of the nine predefined classes. With these results, each patch was labeled with the class with the highest probability and then the labeled patches were merged back into their full respective slides and colored by the label. This created segmented slides, with clearly visible continuous areas of differing tissue.
The segmented slides were next validated by an expert pathologist, who determined that 39 slides needed to be excluded from further analysis, for two reasons. The first was erroneous classifications returned by ARA-CNN: 21 of the 506 slides contained errors of this nature. The other 18 slides were excluded due to colored ink markings and other staining artifacts. After this process, the final dataset contained 467 slides from 411 patients.
Quantification of spatial features for the segmented tumor tissues
The obtained segmented images from TCGA were then processed further in order to extract spatial information in the form of two types of features, which we referred to as tissue prevalence (TIP) and tumor microenvironment composition (TMEC). TIP is a distribution of tissue classes within the whole tissue area, i.e. excluding the background class. TMEC measures a distribution of tissues that neighbor the tumor tissue within a predefined margin.
The prevalence $t_i$ of tissue $i$ is expressed as:
$$t_i=\frac{n_i}{N},$$
where $n_i$ is the number of patches of tissue $i$, $N$ is the total number of tissue patches (excluding the background class), and $i \in$ {TUMOR, STROMA, MIXED, IMMUNE, VESSEL, BRONCHI, NECROSIS, LUNG}. The vector with entries given by $t_i$ makes up the TIP features. The background class was omitted, as it is not relevant to the tissue structure.
The microenvironment composition $m_j$ for tissue $j$ is:
$$m_j=\frac{b_j}{B},$$
where $b_j$ is the number of patches of class $j$ that neighbor the tumor class and $B$ is the total number of patches neighboring the tumor class (excluding the tumor itself and the background class), with $j \in$ {STROMA, MIXED, IMMUNE, VESSEL, BRONCHI, NECROSIS, LUNG}. The TMEC features are organized in a vector with the $m_j$ as its entries. Neighboring patches are considered only within a margin around the borders of tumor regions: each tumor patch is considered separately and up to eight neighbors around it are counted. These neighbors are summed to give $b_j$ for each class $j$ and $B$ in total.
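The neighbor-counting step can be sketched on a grid of patch labels (illustrative Python; the integer class codes are our assumptions):

```python
import numpy as np

# Assumed integer codes for a few of the nine classes.
TUMOR, STROMA, IMMUNE, BACKGROUND = 0, 1, 3, 8

def tmec_counts(grid: np.ndarray) -> dict:
    """Count, per class, the up-to-eight neighbors of every tumor
    patch, skipping tumor and background neighbors."""
    counts = {}
    h, w = grid.shape
    for r, c in zip(*np.nonzero(grid == TUMOR)):
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == dc == 0:
                    continue
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w:
                    cls = grid[rr, cc]
                    if cls not in (TUMOR, BACKGROUND):
                        counts[cls] = counts.get(cls, 0) + 1
    return counts

grid = np.array([[TUMOR, STROMA, BACKGROUND],
                 [IMMUNE, TUMOR, STROMA],
                 [BACKGROUND, IMMUNE, IMMUNE]])
b = tmec_counts(grid)           # per-class neighbor counts b_j
B = sum(b.values())             # total neighbor count B
m = {cls: n / B for cls, n in b.items()}  # TMEC fractions m_j
```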
Using the microenvironment and prevalence data, we also calculated three spatial metrics that were previously defined in the literature: intra-tumor lymphocyte ratio (ITLR) [21], Simpson diversity index [52], Shannon diversity index [53]. We used a simplified version of these metrics - instead of cell-wise, we calculated them patch-wise. Specifically, these metrics were computed as follows:
$$ITLR=\frac{b_{IMMUNE}}{n_{TUMOR}}$$
$$Shannon=-{\sum}_i{t}_i\mathit{\log}\left({t}_i\right)$$
$$Simpson={\sum}_i{t}_i^2,$$
where $b_{IMMUNE}$ is the number of immune patches that neighbor the tumor and $n_{TUMOR}$ is the number of tumor patches in the whole slide.
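The patch-wise metrics can be computed directly from these definitions (illustrative Python; names are ours):

```python
import numpy as np

def tip(tissue_counts: dict) -> dict:
    """Tissue prevalence t_i = n_i / N over non-background classes."""
    n = sum(tissue_counts.values())
    return {cls: k / n for cls, k in tissue_counts.items()}

def shannon(t: dict) -> float:
    """Shannon diversity: -sum_i t_i log(t_i)."""
    return -sum(p * np.log(p) for p in t.values() if p > 0)

def simpson(t: dict) -> float:
    """Simpson diversity: sum_i t_i^2."""
    return sum(p * p for p in t.values())

def itlr(b_immune: int, n_tumor: int) -> float:
    """Patch-wise intra-tumor lymphocyte ratio."""
    return b_immune / n_tumor

counts = {"tumor": 50, "stroma": 25, "immune": 25}
t = tip(counts)
print(round(simpson(t), 3))  # 0.375
```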
Multivariate survival modeling using Cox model
The aforementioned predictors were used as input to the Cox proportional hazards model. They were organized into the following basic variants: clinical, clinical + ITLR, clinical + Shannon diversity index, clinical + Simpson diversity index, clinical + TMEC, clinical + TIP, clinical + TMEC + TIP. In addition, variants with mutation data added on top of clinical data were considered. Each variant was trained in a 10-fold cross-validation schema. For categorical variables, the hazard ratio of their basal values was set to 1. For the sex variable, the basal value was ‘Female’. For the Stage variable, the basal value was ‘Early stage’. For mutation variables (EGFR, STK11 and TP53) the basal value was the absence of alteration. Finally, for smoking status, non-smoker was set as basal.
Mutation classification
The processed data from TCGA served as input in the mutation classification task. The predictor variables were the same as in the survival prediction task (minus mutation status). The response variables were binary and were defined by the mutation status for the 13 previously chosen frequently mutated LUAD genes (see TCGA data extraction and processing).
For each classification task, where the class was specified by the presence of mutation of a given gene, the dataset was oversampled so that positive (mutation occurred) and negative (mutation did not occur) subsets of examples were equal in size. Oversampling was done by inserting multiple copies of the positive examples so that their number reached that of the negative ones.
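A minimal sketch of this oversampling (assuming, as in the text, that the positive class is the smaller one):

```python
import numpy as np

def oversample(X: np.ndarray, y: np.ndarray, seed: int = 0):
    """Replicate positive examples until both classes are equal in size."""
    rng = np.random.default_rng(seed)
    pos = np.nonzero(y == 1)[0]
    neg = np.nonzero(y == 0)[0]
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    idx = np.concatenate([neg, pos, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]

X = np.arange(10).reshape(10, 1)
y = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])  # 2 positive, 8 negative
Xb, yb = oversample(X, y)
print(int(yb.sum()), len(yb))  # 8 16
```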
Eight combinations of predictive features were tested: clinical, clinical + ITLR, clinical + Shannon diversity index, clinical + Simpson diversity index, clinical + TMEC, clinical + TIP, clinical + TMEC + TIP, TMEC only. To classify the mutation status for each gene, two distinct machine learning models were trained and compared. The first one was a simple linear model in the form of logistic regression. It was fitted using the Liblinear solver [54], with the L2 (ridge) penalty [55] and up to 2000 iterations. The second one was the Random Forest algorithm [56]. We used the implementation from the sklearn Python library [57] with default parameter values.
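The two classifiers with the stated settings can be sketched with scikit-learn (synthetic features stand in for the clinical and spatial predictors):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data: the first feature carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Logistic regression: Liblinear solver, L2 penalty, up to 2000 iterations.
logreg = LogisticRegression(solver="liblinear", penalty="l2", max_iter=2000)
# Random Forest with default parameter values.
forest = RandomForestClassifier()

# 10-fold cross-validation, evaluated with AUC (as in the text).
auc_lr = cross_val_score(logreg, X, y, cv=10, scoring="roc_auc").mean()
auc_rf = cross_val_score(forest, X, y, cv=10, scoring="roc_auc").mean()
print(round(auc_lr, 2), round(auc_rf, 2))
```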
All models were trained 100 times with 10-fold cross-validation, and the resulting classification metrics were averaged. Classification performance was evaluated using the AUC metric.