We begin with a description of the computational pipeline. As noted above, a key aspect of our approach is the use of FT-IR imaging data on a serial section that is H&E-stained to enhance the segmentation of nuclei and lumens. The first two components of the pipeline are geared to this functionality, while the next three components exploit the segmented features obtained from image data to classify the tissue sample (Figure 3).
Image Registration
Given two images, the image registration problem can be defined as finding the optimal spatial and intensity transformation [41] of one image to the other. Here, the two images are an H&E-stained image (I_reference) and an "IR classified" image (I_target), which were acquired from adjacent tissue samples. The IR classified image represents the FT-IR imaging data, processed as indicated in Figure 2, to classify each pixel as a particular cell type. Although the two tissue samples were physically in the same intact tissue and are structurally similar, the two images have different properties (total image and pixel sizes, contrast mechanisms, and data values). Hence, finding features to spatially register the images is not trivial. The H&E image provides detailed morphological information that could ordinarily be used for registration, but the IR image lacks such information. On the other hand, the IR image specifies the exact areas corresponding to each cell type, but the difficulty in precisely extracting such regions from the H&E image hinders us from using cell-type information for registration. The only obvious features are the macroscopic tissue sample shape and the empty spaces (lumens) inside the tissue samples. To utilize these two features and to avoid problems due to differences in the two imaging techniques, both images are first converted into binary images. Due to the binarization, no intensity transformation is necessary. As a spatial transformation, we use an affine transformation (f) [41] where a coordinate (x_1, y_1) is transformed to the coordinate (x_2, y_2) after translations (t_x, t_y), rotation by θ, and scaling by factor s:

x_2 = s (x_1 cos θ - y_1 sin θ) + t_x
y_2 = s (x_1 sin θ + y_1 cos θ) + t_y
Accordingly, we find the optimal parameters of the affine transformation that minimize the absolute intensity difference between the two images (I_reference and I_target). In other words, image registration amounts to finding the optimal parameter values

(t_x*, t_y*, θ*, s*) = argmin_{t_x, t_y, θ, s} Σ_{x, y} |I_reference(x, y) - f(I_target)(x, y)|

The downhill simplex method [42] is applied to solve the above equation. An example of this registration process is shown in Figure 4. (See [Additional file 1: Image Registration] for details.)
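As an illustration, the parameter search over (t_x, t_y, θ, s) can be sketched with SciPy's Nelder-Mead (downhill simplex) implementation. The toy binary images, initial guess, and interpolation order below are assumptions for demonstration, not the paper's actual settings.

```python
import numpy as np
from scipy.ndimage import affine_transform
from scipy.optimize import minimize

def apply_f(img, tx, ty, theta, s):
    # Inverse mapping for scipy: output[o] = input[matrix @ o + offset],
    # so content is translated by (tx, ty), rotated by theta, scaled by s.
    c, si = np.cos(theta), np.sin(theta)
    matrix = np.array([[c, -si], [si, c]]) / s
    offset = -matrix @ np.array([tx, ty])
    return affine_transform(img, matrix, offset=offset, order=1)

def abs_diff(params, ref, tgt):
    # Sum of absolute intensity differences between I_reference and f(I_target)
    tx, ty, theta, s = params
    return np.abs(ref - apply_f(tgt, tx, ty, theta, s)).sum()

# Toy binary images: the "target" square is the "reference" square
# shifted by (5, 3) pixels (hypothetical data).
ref = np.zeros((100, 100)); ref[40:60, 40:60] = 1.0
tgt = np.zeros((100, 100)); tgt[35:55, 37:57] = 1.0

res = minimize(abs_diff, x0=[0.0, 0.0, 0.0, 1.0], args=(ref, tgt),
               method='Nelder-Mead')   # downhill simplex search
```

The objective is evaluated directly on the binary images, mirroring the fact that binarization removes the need for an intensity transformation.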
Identification of epithelial cells and their morphologic features
While a number of tissue characteristics are known to be altered in cancerous tissues, epithelial morphology is utilized as the clinical gold standard. Hence, we focus here on the cellular and nuclear morphology of epithelial nuclei and lumens. These structures are different in normal and cancerous tissues, but are not widely used in automated analysis for a few reasons. First, as described above, simple detection of epithelium from H&E images is difficult. Second, detection of epithelial nuclei may be confounded by a stromal response that is not uniform across all grades and types of cancers. We focused first on addressing these two challenges, which hinder automatic parsing of morphologic features such as the size and number of epithelial nuclei and lumens, the distance from nuclei to lumens, the geometry of the nuclei and lumens, and others (Feature Extraction). In order to use these properties, the first step is to detect nuclei and lumens correctly, and we sought to develop a robust strategy to do so.
Lumen Detection
In H&E stained images, lumens are recognizable as empty white spaces surrounded by epithelial cells. In normal tissues, lumens are larger in diameter and can have a variety of shapes. In cancerous tissues, lumens are progressively smaller with increasing grade and generally have less distorted elliptical or circular shapes. Our strategy to detect lumens was to find empty areas that are located next to areas rich in epithelium. White spots inside the tissue sample can be found in the H&E image by using a suitable threshold value (>200) on the intensity of the Red, Green, and Blue channels, and the pixels corresponding to epithelial cells can be mapped onto the H&E image from the IR classified image through image registration. Even after restricting attention to white areas adjacent to epithelial cells, in our observations, many artifactual lumens are still present. Therefore, the size and shape of the lumens are also examined to eliminate such artifacts. We note that while lumens are ideally completely surrounded by epithelial cells (complete lumens), some tissue samples have lumens (incomplete lumens) that violate this criterion because only part of the lumen is present in the tissue sample. To identify these incomplete lumens, we model the entire tissue sample as a circle, and the white spots between the tissue sample and the circle are the candidate incomplete lumens. As in complete lumen detection, the same threshold value is used to identify white areas. To identify artifacts, we use heuristic criteria based on the size and shape of the areas and the presence of epithelial cells and background around them. In addition, the distances from the center of the tissue to the white spots are examined to identify crescent-shaped artifacts, which result from the small gaps between the tissue sample and the circle fitted to the sample. (See [Additional file 1: Lumen Detection] for details.)
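A minimal sketch of the white-spot step: threshold all three channels at >200, label connected components, and filter by area. The area bounds and the toy image are illustrative assumptions; the paper's full procedure also checks shape and epithelial adjacency.

```python
import numpy as np
from scipy.ndimage import label

def candidate_lumens(rgb, threshold=200, min_area=50, max_area=20000):
    # White pixels: bright in Red, Green, and Blue simultaneously
    white = np.all(rgb > threshold, axis=2)
    labels, n = label(white)                 # scipy's default is 4-connectivity
    kept = np.zeros_like(labels)
    count = 0
    for i in range(1, n + 1):
        area = int(np.sum(labels == i))
        if min_area <= area <= max_area:     # size-based artifact filter
            count += 1
            kept[labels == i] = count
    return kept, count

# Hypothetical H&E-like image: tissue-gray background with one real lumen
# (12 x 12 white patch) and one white speck (2 x 2) that should be rejected.
img = np.full((100, 100, 3), 120, dtype=np.uint8)
img[20:32, 20:32] = 255
img[70:72, 70:72] = 255
mask, n_lumens = candidate_lumens(img)
```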
Nucleus Detection - single epithelial cells
Epithelial nucleus detection by automated analysis is more difficult than lumen detection due to variability in the staining and experimental conditions under which the set of H&E images was acquired. Differences between normal and cancerous tissues, and among different grades of cancerous tissues, also hamper facile detection. To handle such variations and make the contrast of the images consistent, we perform smoothing [43] and adaptive histogram equalization [44] prior to nucleus identification. Nuclei are relatively dark and can be modeled as small elliptical areas in the stained images. This geometrical model is often confounded because multiple nuclei can be so close as to appear as one large, arbitrarily shaped nucleus. Also, small folds or edge staining around lumens can make the darker shaded regions difficult to analyze. Here, we exploit the information provided by the IR classified image to limit ourselves to epithelial cells, and use a thresholding heuristic on a color space-transformed image to identify nuclei with high accuracy. By superimposing the IR classified image on the H&E image, pixels corresponding to epithelium can be identified in the H&E image. These epithelial pixels are dominated by one of two colors, blue or pink, which arise from the nuclear and cytoplasmic components, respectively. For nuclei restricted to epithelial cells in this manner, a set of general observations led us to convert the stained image to a new image where each pixel has the intensity value |R + G - B| (R, G, and B represent the intensities of the Red, Green, and Blue channels, respectively). This transformation, followed by suitable thresholding, successfully characterizes the areas where nuclei are present. The threshold values are determined adaptively for the Red and Green channels due to variations in color intensity.
Finally, after filling holes and gaps within nuclei by a morphological closing operation [45], the segmentation of each nucleus is accomplished using a watershed algorithm [45], followed by elimination of false detections. The size, shape, and average intensity are considered to identify and remove artifactual nuclei. Figure 5 details the nucleus detection procedure. (See [Additional file 1: Nucleus Detection] for details.)
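The |R + G - B| step can be sketched as below. The fixed threshold, synthetic colors, and 3 × 3 closing element are illustrative assumptions (the paper adapts the thresholds per image), and the watershed split of touching nuclei is omitted.

```python
import numpy as np
from scipy.ndimage import binary_closing, label

def epithelial_nucleus_mask(rgb, epithelium, threshold=60):
    # Blue (hematoxylin-rich) nuclei give small |R + G - B|;
    # pink cytoplasm and white background give large values.
    r, g, b = (rgb[..., i].astype(int) for i in range(3))
    t = np.abs(r + g - b)
    candidate = (t < threshold) & epithelium
    closed = binary_closing(candidate, structure=np.ones((3, 3)))  # fill small gaps
    labels, n = label(closed)
    return labels, n

# Hypothetical epithelial region: pink cytoplasm with one blue nucleus
img = np.zeros((20, 20, 3), dtype=np.uint8)
img[..., 0], img[..., 1], img[..., 2] = 230, 150, 200   # pink: |230+150-200| = 180
img[8:13, 8:13] = (70, 70, 160)                          # blue: |70+70-160| = 20
epi = np.ones((20, 20), dtype=bool)                      # pretend all pixels are epithelium
labels, n_nuclei = epithelial_nucleus_mask(img, epi)
```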
Feature Extraction
As mentioned above, the characteristics of nuclei and lumens change in cancerous tissues. In normal tissue, epithelial cells are located mostly in thin layers around lumens. In cancerous tissue, these cells generally grow to fill the lumens, resulting in a decrease in lumen size, with the shape of the lumens becoming more elliptical or circular. The epithelial association with a lumen becomes inconsistent, and epithelial foci may adjoin lumens or may exist without an apparent lumen. Epithelial cells invading the extracellular matrix also cause a deviation from the well-formed lumen structure; this is well recognized as a hallmark of cancer. Because epithelial cells fill the lumen space and invade the extracellular space, their number density in the tissue increases. The size of individual epithelial cells and their nuclei also tends to increase as the malignancy of a tumor increases. Motivated by such recognized morphological differences between normal and cancerous tissues, we chose to use epithelial nuclei and lumens as the basis of the quantitative features that our classification system works with. (See examples of such features in Figure 6.) Notably, these observations are qualitative in actual clinical practice and have not previously been quantified.
Epithelial cell-related features
Epithelial cell information is available from the IR data. However, individual epithelial cells in the tissue are not easily delineated. Therefore, in addition to features directly describing epithelial cells, we also quantify properties of epithelial nuclei, which are available from the segmentation described above. The quantities we measure in defining features are: (1) size of epithelial cells, (2) size of epithelial nuclei, (3) number of nuclei in the tissue sample, (4) distance from a nucleus to the closest lumen, (5) distance from a nucleus to the epithelial cell boundary, (6) number of "isolated" nuclei (nuclei that have no neighboring nucleus within a certain distance), (7) number of nuclei located "far" from lumens, and (8) entropy of the spatial distribution of nuclei (Figure 6G). [Additional file 1: Epithelium-related Features] provides specifics of these measures and their calculation.
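As one concrete example, feature (8) can be sketched as the Shannon entropy of nucleus centers binned on a grid. The grid size is an assumption here, since the exact definition is given in the supplement.

```python
import numpy as np

def spatial_entropy(centers, shape, bins=8):
    # Bin nucleus centers (row, col) on a bins x bins grid and compute
    # the Shannon entropy of the resulting occupancy distribution.
    h, w = shape
    hist, _, _ = np.histogram2d(centers[:, 0], centers[:, 1],
                                bins=bins, range=[[0, h], [0, w]])
    p = hist.ravel() / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Clustered nuclei (one grid cell) vs spread nuclei (two grid cells)
clustered = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0]])
spread = np.array([[5.0, 5.0], [75.0, 75.0]])
```

Clustered nuclei yield low entropy and uniformly spread nuclei yield high entropy, which is the distinction the feature is meant to capture.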
Lumen-related features
Features describing glands have been shown to be effective in PCa classification [21, 25]. Here, we characterize lumens, focusing mostly on differences in their shape. The quantities we measure in defining these features are: (1) size of a lumen, (2) number of lumens, (3) lumen "roundness" [25], defined as L_peri / (2πr), where L_peri is the perimeter of the lumen, L_area is the size of the lumen (i.e., the number of pixels in the lumen), and r is the radius of a circle of size L_area, (4) lumen "distortion" (Figure 6A), computed as STD(d)/AVG(d), where d is the distance from the center of a lumen to the boundary of the lumen and AVG(·) and STD(·) represent the average and standard deviation, (5) lumen "minimum bounding circle ratio" (Figure 6B), defined as the ratio of the size of a minimum bounding circle of a lumen to the size of the lumen, (6) lumen "convex hull ratio" (Figure 6C), which is the ratio of the size of a convex hull of a lumen to the size of the lumen, (7) symmetric index of lumen boundary (Figure 6E, see [Additional file 1: Lumen-related Features]), (8) symmetric index of lumen area (Figure 6F, see [Additional file 1: Lumen-related Features]), and (9) spatial association of lumens and cytoplasm-rich regions (Figure 6D, see [Additional file 1: Lumen-related Features]). Features (3) - (8) are various ways to summarize lumen shapes, while feature (9) is motivated by the loss of functional polarization of epithelial cells in cancerous tissues.
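Lumen roundness and distortion can be sketched from a binary lumen mask as follows. The boundary-pixel perimeter and the centroid-based center are discrete approximations chosen here for illustration.

```python
import numpy as np

def roundness_and_distortion(mask):
    # mask: boolean array, True inside the lumen.
    area = mask.sum()
    r = np.sqrt(area / np.pi)                 # radius of the equal-area circle
    padded = np.pad(mask, 1)
    # Boundary pixels: inside the lumen with at least one 4-neighbor outside
    interior = (padded[1:-1, 2:] & padded[1:-1, :-2] &
                padded[2:, 1:-1] & padded[:-2, 1:-1])
    boundary = mask & ~interior
    l_peri = boundary.sum()
    roundness = l_peri / (2 * np.pi * r)      # L_peri / (2*pi*r)
    ys, xs = np.nonzero(boundary)
    cy, cx = np.nonzero(mask)[0].mean(), np.nonzero(mask)[1].mean()
    d = np.hypot(ys - cy, xs - cx)            # center-to-boundary distances
    distortion = d.std() / d.mean()           # STD(d) / AVG(d)
    return roundness, distortion

# A rasterized disk should be nearly round with low distortion
yy, xx = np.mgrid[0:61, 0:61]
disk = (yy - 30) ** 2 + (xx - 30) ** 2 <= 20 ** 2
```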
Global & local tissue features
We have described above the individual measures of epithelium- and lumen-related quantities that form the basis of the features used by our classification system. These quantities must be summarized over the entire tissue sample or desired classification area. Hence, we employ the average (AVG) or standard deviation (STD), and in some cases the sum total (TOT), of these quantities for further analysis. These features are called "global" features since they are calculated from the entire tissue sample. However, in some cases global features may be misleading, especially when only a part of the tissue sample is indicative of cancer. Therefore, in addition to global features, we define "local" features by sliding a rectangular window of fixed size (100 × 100 pixels) across a tissue sample. For each window, the AVG and/or TOT of the epithelium- and lumen-related quantities are computed. The STD or extremal values (MIN or MAX) of these AVG and/or TOT values over all windows become the local feature values (Figure 7). In all, 67 features (29 global and 38 local) are defined, capturing various aspects of tissue morphology.
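The local-feature construction can be sketched as below. The non-overlapping stride is an assumption (the paper specifies only a 100 × 100 sliding window), and the toy value map stands in for any per-pixel quantity.

```python
import numpy as np

def local_features(value_map, window=100, step=100):
    # Compute the AVG of a per-pixel quantity in each window, then
    # summarize the per-window values by STD / MIN / MAX.
    h, w = value_map.shape
    means = [value_map[r:r + window, c:c + window].mean()
             for r in range(0, h - window + 1, step)
             for c in range(0, w - window + 1, step)]
    means = np.asarray(means)
    return {"STD": means.std(), "MIN": means.min(), "MAX": means.max()}

# Toy map: left half "normal" (0), right half "cancer-like" (1); the global
# average (0.5) blurs the two regions, but local windows separate them.
vmap = np.zeros((200, 200))
vmap[:, 100:] = 1.0
feats = local_features(vmap)
```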
Feature Selection
Feature selection is the step where the classifier examines all available features (67 in our case) with respect to the training data, and selects a subset to use on test data. This selection is generally based on the criterion of high accuracy on training data, but also strives to ensure generalizability beyond the training data. We adopt a two-stage feature selection approach here. In the first stage, we generate a set of candidate features (C_candidate) by using the so-called minimum-redundancy-maximum-relevance (mRMR) criterion [46] (see [Additional file 1: mRMR]). In each iteration, given the feature set chosen thus far, mRMR chooses the single additional feature that is least redundant with the chosen features, while being highly correlated with the class label. C_candidate is a set of features that is expected to be close to the optimal feature set for the data set and classifier under consideration. It is constructed as follows. Given a feature set F = (f_1, ..., f_M) ordered by mRMR, the area under the ROC curve (AUC) of the set of i top-ranked features is computed for varying values of i. We limit the value of i to be ≤ 30. The feature subset with the best AUC is chosen as C_candidate. In the second stage, feature selection continues with C_candidate as the starting point, using the sequential floating forward selection (SFFS) method [47]. This method sequentially adds new features, followed by conditional deletion(s) of already selected features. Starting with C_candidate, SFFS searches for the feature x ∉ C_candidate that maximizes the AUC among all feature sets C_candidate ∪ {x}, and adds it to C_candidate. Then, it finds the feature x ∈ C_candidate that maximizes the AUC among all feature sets C_candidate - {x}. If the removal of x improves the highest AUC obtained by C_candidate, x is deleted from C_candidate. As long as this removal improves upon the highest AUC obtained so far, the removal step is repeated. SFFS repeats the addition and removal steps until the AUC reaches 1.0 or the number of additions and deletions exceeds 20, and the feature set with the highest AUC thus far is chosen as the optimal feature set. The classification capability of a feature set, required for feature selection, is measured by AUC, obtained by cross-validation on the training set. SFFS can be applied directly to the original feature set; however, using mRMR may reduce the search space and time, and help build the optimal classifier by providing a good initial feature set for SFFS.
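The SFFS loop can be sketched generically as below, with a toy scoring function in place of the cross-validated AUC; the feature labels, scoring function, and step cap are illustrative assumptions.

```python
def sffs(all_features, score, start, max_steps=20):
    """Sequential floating forward selection (sketch): add the best
    feature, then conditionally remove features while doing so improves
    the best score seen so far."""
    selected = set(start)
    best = score(selected)
    for _ in range(max_steps):
        if best >= 1.0:
            break
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        # Forward step: add the feature maximizing the score
        f_add = max(candidates, key=lambda f: score(selected | {f}))
        selected.add(f_add)
        best = max(best, score(selected))
        # Floating backward step: keep removing while it improves best
        while len(selected) > 1:
            f_rm = max(selected, key=lambda f: score(selected - {f}))
            if score(selected - {f_rm}) > best:
                selected.remove(f_rm)
                best = score(selected)
            else:
                break
    return selected, best

# Toy score peaking at the subset {1, 3} (a stand-in for AUC)
def toy_score(s):
    good, bad = len(s & {1, 3}), len(s - {1, 3})
    return (good - 0.1 * bad) / 2

chosen, auc = sffs([0, 1, 2, 3], toy_score, start={0})
```

Starting from the deliberately poor candidate {0}, the floating deletions discard it once better features are added, illustrating how SFFS can correct an imperfect initial set.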
Classification
We note that there are two levels of classification here. In the first, IR spectral data are used to produce histologic images in which each pixel has been classified as a cell type. In the second, the measures from the H&E and IR images are used to classify the tissue into disease states. For the first classification task, we used a Bayesian classifier built on 18 spectral features, which previously achieved >0.99 AUC on cell-type classification [48, 49]. For the latter task, we used a well-established classification algorithm, the support vector machine (SVM) [50]. As a kernel function, a radial basis function with parameter γ ∈ {10, 1, 0.1, 0.01, 0.001} is used. Two cost factors are introduced to deal with the imbalance in the training data [51]. The ratio between the two cost factors was chosen as the ratio of negative to positive training examples, to make the potential total cost of the false positives and the false negatives the same. (See [Additional file 1: SVM] for details.)
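A sketch of the cost-weighted RBF SVM using scikit-learn (assumed here as a stand-in for the paper's implementation); the toy data, γ value, and weights are illustrative, with the class-weight ratio set to the negative-to-positive sample ratio as described above.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy imbalanced training data standing in for the tissue feature vectors
X_cancer = rng.normal(1.0, 0.4, size=(30, 2))    # majority class (label 1)
X_benign = rng.normal(-1.0, 0.4, size=(10, 2))   # minority class (label 0)
X = np.vstack([X_cancer, X_benign])
y = np.array([1] * 30 + [0] * 10)

# Cost factors with C_+ / C_- = n_- / n_+, equalizing the potential
# total cost of false positives and false negatives
weights = {1: len(X_benign) / len(X_cancer), 0: 1.0}
clf = SVC(kernel='rbf', gamma=0.1, class_weight=weights).fit(X, y)
```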
Samples and Data preparation
All of the H&E stained images were acquired on a standard optical microscope at 40× magnification; the size of each pixel is 0.9636 µm × 0.9636 µm. In contrast, the pixel size of the IR images is 6.25 µm × 6.25 µm.
Two different tissue microarrays (TMAs) were obtained from two different sources (the Tissue Microarray Research Program at the National Institutes of Health and Clinomics Inc.). The first data set ("Data1") consisted of 240 tissue samples from 180 patients, and the second ("Data2") consisted of 160 tissue samples from 80 patients. Both sets of tissue samples were cut into ~7 µm thick sections, with one section placed on an IR-transparent BaF2 slide and a serial section on a standard glass slide. The acquisition of the data is described elsewhere [28]. Unfortunately, we were not able to use all of these tissue samples, for several reasons. Each data set has two TMAs: one H&E stained and one imaged by IR. Since these were experimental arrays, some TMA spots were missing in one or both arrays due to processing and plating on the salt plates used for IR analysis. Since our method focuses on epithelial cells, tissue samples that did not have enough epithelial cells (>100) in either of the two images (H&E and IR) were not considered in this study. Moreover, some tissue samples in Data2 are spatially displaced and fused with neighboring tissue samples. After eliminating those tissue samples, 66 benign and 115 cancer tissue samples remained for Data1, and 14 benign and 36 cancer tissue samples remained for Data2. An example of the H&E images for both data sets is shown in Figure 8.