Data source
We collected data from the SEER database of the National Cancer Institute in the United States [7]. The data were obtained using the SEER*Stat software. Data items and codes were documented by the North American Association of Central Cancer Registries (NAACCR) [8]. Primary cancer histology and site were coded according to the third edition of the International Classification of Diseases for Oncology (ICD-O-3).
Cohort selection
Patients with lung tumors (site codes C34.0-C34.9) diagnosed from 2004 to 2014 were included in this study. The following histologic codes were designated as NSCLC: 8010, 8012, 8013, 8014, 8015, 8020, 8021, 8022, 8031, 8032, 8046, 8050–8052, 8070–8078, 8140–8147, 8250–8255, 8260, 8310, 8323, 8430, 8480, 8481, 8482, 8490, 8560, and 8570–8575. Patients who did not receive radical surgery or who were aged 18 years or younger were excluded. In accordance with the requirements for using the SEER database [9], we obtained the data use agreement. Figure 1 shows the flow chart of the patient selection procedure. The SEER database conducted follow-up for all patients and recorded their follow-up time, survival status, and survival time, so we could investigate follow-up time and OS for these patients. Records with missing data that could not be used to assess survival status were excluded before statistical analysis.
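The selection criteria above can be sketched as a simple record filter. This is a minimal illustration in Python, not the SEER*Stat query used in the study: the record layout and the field names (`site`, `histology`, `year`, `age`, `radical_surgery`, `survival_status`) are hypothetical, and only the code lists and thresholds come from the text.

```python
# Histology codes designated as NSCLC in the text; tuples are inclusive ranges.
NSCLC_SPEC = [
    8010, 8012, 8013, 8014, 8015, 8020, 8021, 8022, 8031, 8032, 8046,
    (8050, 8052), (8070, 8078), (8140, 8147), (8250, 8255),
    8260, 8310, 8323, 8430, 8480, 8481, 8482, 8490, 8560, (8570, 8575),
]


def expand_codes(spec):
    """Expand single codes and inclusive (low, high) ranges into a set."""
    codes = set()
    for item in spec:
        if isinstance(item, tuple):
            low, high = item
            codes.update(range(low, high + 1))
        else:
            codes.add(item)
    return codes


NSCLC_CODES = expand_codes(NSCLC_SPEC)
LUNG_SITES = {f"C34.{i}" for i in range(10)}  # C34.0 through C34.9


def is_eligible(record):
    """Apply the inclusion and exclusion criteria to one record.

    `record` is a dict with hypothetical keys chosen for this sketch.
    """
    return bool(
        record["site"] in LUNG_SITES
        and record["histology"] in NSCLC_CODES
        and 2004 <= record["year"] <= 2014
        and record["age"] > 18                      # aged 18 or younger excluded
        and record["radical_surgery"]               # no radical surgery excluded
        and record["survival_status"] is not None   # unusable survival data excluded
    )


example = {"site": "C34.1", "histology": 8140, "year": 2010, "age": 64,
           "radical_surgery": True, "survival_status": "dead"}
print(is_eligible(example))
```

Expanding the ranges once into a set keeps the membership test explicit and makes the code list easy to compare against the text.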
Statistical analysis
Demographic and clinical variables included in the analysis were age, gender, ethnicity, primary tumor location, anatomic site, histological subtype, tumor extent, differentiation, clinical stage, tumor size, lymph node involvement, examined lymph nodes (ELNs), positive lymph nodes (PLNs), chemotherapy, and radiotherapy. Categorical variables were grouped for clinical reasons, and the grouping decisions were made before data analysis. Means, medians, and ranges were reported for continuous variables, as appropriate. Frequencies and proportions were reported for categorical variables.
The primary endpoint of this study was cause-specific survival. According to the cause-of-death (COD) code, we classified the cause of death into three groups: lung cancer related, other cancer related, and non-cancer related. The cumulative incidence function (CIF) was used to describe the probability of death from each cause over time. The CIF was compared across groups using Gray’s test [10]. Fine and Gray competing-risks proportional hazards regression was performed to predict five- and ten-year probabilities of the three causes of death [11]. For nomogram construction, two thirds of the patients were randomly assigned to the training data set (n = 31,415) and one third to the validation data set (n = 13,465). We used restricted cubic splines with three knots at the 10%, 50%, and 90% empirical quantiles to model continuous variables [12]. A model selection technique based on the Bayesian information criterion was employed to avoid overfitting when establishing the competing-risks models (eTable S1) [13].
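To make the CIF concrete, the following is a minimal Python sketch of the standard nonparametric (Aalen-Johansen type) estimator for competing risks. It is an illustration only and not the code used in this study, which relied on R packages; the function name, the argument layout, and the coding of causes as 0 (censored) and 1 to 3 (the three cause-of-death groups) are assumptions made for the example.

```python
from collections import Counter


def cumulative_incidence(times, causes, n_causes=3):
    """Nonparametric cumulative incidence estimate under competing risks.

    times  : follow-up time for each patient
    causes : 0 for censored, 1..n_causes for the cause of death
    Returns a list of (time, [CIF_1, ..., CIF_k]) at each distinct death time.
    """
    data = sorted(zip(times, causes))
    n_at_risk = len(data)
    surv = 1.0                 # overall survival just before the current time
    cif = [0.0] * n_causes
    out = []
    i = 0
    while i < len(data):
        t = data[i][0]
        counts = Counter()
        j = i
        while j < len(data) and data[j][0] == t:   # gather ties at time t
            counts[data[j][1]] += 1
            j += 1
        n_deaths = sum(c for k, c in counts.items() if k != 0)
        if n_deaths > 0:
            for k in range(1, n_causes + 1):
                # increment = P(alive just before t) * cause-k hazard at t
                cif[k - 1] += surv * counts[k] / n_at_risk
            surv *= 1.0 - n_deaths / n_at_risk
            out.append((t, list(cif)))
        n_at_risk -= j - i     # deaths and censored patients leave the risk set
        i = j
    return out


# Toy data: four patients, one censored at time 3.
curve = cumulative_incidence([1, 2, 3, 4], [1, 2, 0, 1])
print(curve[-1])
```

Unlike one minus the Kaplan-Meier estimate computed separately for each cause, these cause-specific curves and the overall survival probability sum to one at every time point, which is why the CIF is preferred when causes of death compete.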
The performance of the nomogram, including its discrimination and calibration, was tested using the validation data set. Discrimination is the ability of a model to separate subject outcomes and was measured by Harrell's C index [14, 15]. Calibration, which compares predicted with actual survival, was evaluated with a calibration plot. We used the validation set to compare the probability of death predicted by the final reduced model with the observed 5- and 10-year cumulative incidence of death. The predictions fall on a 45-degree diagonal line if the model is well calibrated. In addition, bootstrapping with 1000 resamples was used for internal validation of the developed model.
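As an illustration of what the C index measures, the sketch below computes Harrell's C for ordinary right-censored data in Python. It is a simplified example and not the procedure used in this study: it treats a single event type and does not handle competing events, and the function and argument names are assumptions made for the example.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's C index for right-censored data (single event type).

    times       : follow-up time for each patient
    events      : True if the event was observed, False if censored
    risk_scores : model output, higher meaning higher predicted risk
    A pair (i, j) is usable when patient i had the event before time j.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue                      # the earlier patient must have an event
        for j in range(n):
            if times[i] < times[j]:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0     # earlier event has the higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5     # tied predictions count as half
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


times = [1, 2, 3, 4]
events = [True, True, True, True]
print(harrell_c_index(times, events, [4, 3, 2, 1]))  # perfectly ordered risks
```

A value of 0.5 corresponds to predictions no better than chance and 1.0 to perfect ordering of patients by risk.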
All statistical analyses were performed using R software (version 3.3.3; http://www.r-project.org). We used the R packages cmprsk, rms, and mstate for modeling and for developing the nomogram. All reported significance levels were two-sided, with statistical significance set at 0.05.