Patients and clinical variables
This retrospective study was approved by the institutional review board of the West China Hospital of Sichuan University. We identified candidate cases by reviewing the discharge records of patients at West China Hospital from January 2010 to July 2017. The following terms were used to extract the data: lung cancer, lung adenocarcinoma, lung squamous carcinoma, non-small cell lung cancer, small cell lung cancer; inflammatory lung nodule, benign lung nodule, benign lung tumor, lung hamartoma, lung sclerosing hemangioma, lung tuberculosis, lung granuloma. Patients were enrolled if they met the following criteria: (a) an untreated, pathologically confirmed, 5–30 mm noncalcified solid nodule was detected on chest CT; (b) the CT slice thickness was less than or equal to 1 mm. Patients were excluded if (a) multiple pulmonary nodules, pleural effusion, atelectasis, or lymph node enlargement was observed; or (b) the lesion was not a primary lung tumor.
In total, the current study enrolled 720 patients with 720 nodules, of which 348 were benign and 372 were malignant. The pathology of the benign nodules was confirmed by surgery (N = 315, 90.5%) or CT-guided percutaneous lung biopsy (N = 33, 9.5%), while that of the malignant nodules was confirmed by surgery (N = 365, 98.1%), CT-guided percutaneous lung biopsy (N = 4, 1.1%) or transbronchial lung biopsy (N = 3, 0.8%).
The following clinical characteristics were recorded: age, sex, smoking status, history of malignancy, family history of malignancy, nodule diameter, location, pathology and clinical stage. As surgically resected adenocarcinomas predominated among the malignant nodules, prognostic data were collected for survival analysis.
CT image acquisition and nodule segmentation
A pretreatment thoracic CT scan was obtained for each patient. All images were acquired on GE, Siemens or Philips scanners, with a tube voltage of 100–120 kVp and a tube current of 60–250 mAs. Reconstructions were performed using a standard convolution kernel. Detailed information on manufacturer, manufacturer’s model and slice thickness is summarized in Table S1 and Table S2.
All target nodules were first manually segmented in 3D by one author with 4 years of clinical experience in pulmonology, using the ITK-SNAP software. Then, for 100 randomly selected patients, the same author and a second author each manually segmented the target nodules again, and intra-rater and inter-rater consistency was assessed with the Dice similarity coefficient. Both authors were blinded to the pathological results of the lesions.
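As an illustration of the consistency check, the Dice similarity coefficient between two segmentations A and B is 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch, assuming each mask is represented as a set of voxel indices; the function name and the toy masks are hypothetical and are not taken from the study.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks.

    Each mask is given as a collection of (z, y, x) voxel indices
    that belong to the segmented nodule (an assumed representation).
    """
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        # Convention chosen here: two empty masks agree perfectly.
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


# Toy example: two raters share 3 of their 4 voxels each.
first_rater = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
second_rater = {(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 1)}
dice_value = dice_coefficient(first_rater, second_rater)  # 2 * 3 / (4 + 4) = 0.75
```

A value of 1 indicates identical segmentations and 0 indicates no overlap.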
CNN models
Patients were randomly divided into training and testing sets at a ratio of 7:3 during model establishment. The overall framework of the CNN model is shown in Fig. 1. We used transfer learning from a pre-trained benign-malignant nodule classification model, which had been trained on 1715 pathologically confirmed nodules and 14,735 unlabeled nodules [20]. In detail, the network comprised one 3D convolution layer with a kernel size of 3 and a stride of 1 as the input block, four 3D convolution layers with a kernel size of 3 and a stride of 2 as the downsampling block, and two fully connected layers as the output block for the benign-malignant classification task. In addition, class activation mapping was used to guide the network to focus on the nodule region, with attention maps generated by back-propagating the weights of the fully connected layer onto the convolutional feature maps [21]. In total, two CNN models were established, one with and one without clinical features.
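To make the effect of the downsampling block concrete, the sketch below traces the side length of the feature map through the five convolution layers using the standard output-size formula floor((n + 2p - k) / s) + 1. The input patch size (64 voxels per side) and the padding (1) are assumptions chosen for illustration only; neither is stated in the text.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Output side length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_feature_map_sizes(input_size, padding=1):
    """Side length after the input block and each downsampling layer.

    Layer layout follows the description in the text: one convolution
    with kernel 3 and stride 1, then four with kernel 3 and stride 2.
    The padding value is an assumption.
    """
    sizes = [input_size]
    size = conv_output_size(input_size, kernel=3, stride=1, padding=padding)
    sizes.append(size)
    for _ in range(4):
        size = conv_output_size(size, kernel=3, stride=2, padding=padding)
        sizes.append(size)
    return sizes


# Hypothetical 64-voxel cubic patch.
sizes = trace_feature_map_sizes(64)  # [64, 64, 32, 16, 8, 4]
```

Under these assumptions each stride-2 layer halves the spatial resolution, so the fully connected layers operate on a 4 × 4 × 4 feature volume per channel.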
Radiomics models
First, radiomics features were extracted from the segmented nodules, including 42 dedicated handcrafted features and 104 widely used radiomics features. Details of the handcrafted features are described in a previous study [22]. The widely used first-order image intensity statistics, shape and texture features were extracted using PyRadiomics [23]. Then, three random forest (RF) models were established using radiomics features, clinical features, and both, respectively. To avoid overfitting and retain predictive features, the least absolute shrinkage and selection operator (LASSO) was applied for radiomics feature selection, shrinking the regression coefficients of irrelevant variables to zero. To achieve the best performance, Bayesian optimization was used to tune the hyperparameters.
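The way LASSO shrinks the coefficients of irrelevant features to exactly zero can be sketched with a small coordinate-descent solver. This is an illustrative pure-Python sketch, not the Scikit-learn implementation used in the study: it assumes a squared-error objective, (1/2n)·||y − Xw||² + α·||w||₁, with centered features and no intercept, and the toy data and the penalty α = 0.1 are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1/2n)*||y - Xw||^2 + alpha*||w||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # constant-zero feature carries no information
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(col[i] * (residual[i] + col[i] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / z
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= col[i] * delta
                w[j] = new_w
    return w


# Toy data: feature 0 drives the target, feature 1 has a tiny effect,
# feature 2 has none. Columns are centered and mutually orthogonal.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [2.05, 1.95, -1.95, -2.05]  # 2 * feature0 + 0.05 * feature1
coefficients = lasso_coordinate_descent(X, y, alpha=0.1)
selected = [j for j, c in enumerate(coefficients) if c != 0.0]  # only feature 0 survives
```

Features whose coefficient is exactly zero are dropped, and the remaining ones form the input to the downstream classifier.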
To compare the diagnostic performance of established models with manual visual assessment, two junior radiologists were invited to blindly classify the solid nodules in the testing set.
Statistical analysis
The continuous variables, age and nodule diameter, were presented as mean ± standard deviation and compared with Student’s t-test. Follow-up time was compared with the Mann-Whitney U test. Categorical data were described as number of cases (proportion) and compared with the chi-square test.
The classification performance of the models was evaluated by sensitivity, specificity, accuracy, receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC). Calibration curves were also plotted to evaluate the accuracy of the risk estimates. Additionally, Brier scores were calculated to quantify the mean squared difference between predicted probabilities and observed outcomes; a lower score indicates better prediction. Differences in AUC values were assessed by the DeLong test [24].
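The metrics above can be written out explicitly. The sketch below is a plain-Python illustration rather than the Scikit-learn code used in the study; the decision threshold of 0.5 and the toy labels and probabilities are assumptions. The AUC is computed as the fraction of malignant-benign pairs in which the malignant nodule receives the higher predicted probability, with ties counted as one half.

```python
def sensitivity_specificity_accuracy(y_true, y_prob, threshold=0.5):
    """Threshold-based metrics; label 1 = malignant, 0 = benign."""
    tp = fp = tn = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(y_true)


def auc_score(y_true, y_prob):
    """AUC as the probability that a positive outranks a negative."""
    positives = [p for label, p in zip(y_true, y_prob) if label == 1]
    negatives = [p for label, p in zip(y_true, y_prob) if label == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - label) ** 2 for label, p in zip(y_true, y_prob)) / len(y_true)


# Toy example with three malignant and three benign nodules.
labels = [1, 1, 1, 0, 0, 0]
probabilities = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]
sens, spec, acc = sensitivity_specificity_accuracy(labels, probabilities)
auc = auc_score(labels, probabilities)      # 8 of 9 pairs ranked correctly
brier = brier_score(labels, probabilities)  # 0.87 / 6 = 0.145
```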
For prognostic analysis, a Rad-score was computed for each patient by combining the LASSO-selected radiomics features. According to the Rad-score, patients were classified into a low-risk or a high-risk group using the cutoff determined by X-tile (version 3.6.1, http://tissuearray.org/) [25]. The potential association of the radiomics signature with disease-free survival (DFS) was evaluated by Kaplan-Meier survival analysis and multivariate Cox regression. Similarly, the prognostic value of the malignancy score derived from the CNN model (with clinical features) was also evaluated. Differences in survival curves were assessed by the log-rank test.
The LASSO analysis, ROC curves, calibration curves and Brier scores were implemented with the open-source Scikit-learn package (version 1.1.2) in Python. The Kaplan-Meier survival analysis and multivariate Cox regression were performed with the survival (version 3.1-8) and survminer (version 0.4.8) packages in R. All statistical tests were two-sided, and differences with P < 0.05 were considered statistically significant. All statistical analyses were conducted using R version 3.6.0 and Python version 3.7.0.
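The risk stratification and survival estimation can be sketched as follows. This is a minimal illustration under stated assumptions, not the R workflow used in the study: the Rad-score is assumed to be a linear combination of the selected features weighted by their LASSO coefficients (the text does not give the formula), the cutoff of 0.5 is a hypothetical stand-in for the X-tile result, and the patient records are toy values.

```python
def rad_score(features, coefficients, intercept=0.0):
    """Assumed form: weighted sum of selected features plus an intercept."""
    return intercept + sum(c * f for c, f in zip(coefficients, features))


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as (time, survival) at each distinct event time.

    events[i] is 1 if recurrence/death was observed at times[i],
    and 0 if the patient was censored at that time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = removed = 0
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            removed += 1
            i += 1
        if observed > 0:
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        at_risk -= removed  # events and censored cases both leave the risk set
    return curve


# Toy records: (Rad-score, DFS in months, event indicator).
patients = [
    (0.10, 12, 0), (0.20, 24, 1), (0.30, 30, 0), (0.25, 36, 0),
    (0.70, 6, 1), (0.80, 10, 1), (0.90, 10, 0), (0.60, 20, 1),
]
cutoff = 0.5  # hypothetical cutoff; the study derived it with X-tile

low = [(t, e) for score, t, e in patients if score < cutoff]
high = [(t, e) for score, t, e in patients if score >= cutoff]
low_curve = kaplan_meier([t for t, _ in low], [e for _, e in low])
high_curve = kaplan_meier([t for t, _ in high], [e for _, e in high])
```

In this toy data the high-risk curve drops at 6, 10 and 20 months, while the low-risk curve has a single step at 24 months; a log-rank test would then compare the two curves.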