Development and validation of a risk model with variables related to non-small cell lung cancer in patients with pulmonary nodules: a retrospective study

Background Lung cancer is a major global threat to public health, and a novel predictive nomogram is urgently needed. Non-small cell lung cancer (NSCLC), which accounts for the majority of lung cancer cases, is attracting increasing attention.
Patients and methods We designed a novel predictive nomogram using a design dataset of 515 pulmonary nodules, with external validation performed using two separate datasets of 140 nodules and 237 nodules. Significant variables were selected for inclusion in this model using a least absolute shrinkage and selection operator (LASSO) logistic regression model, after which a corresponding nomogram was developed. C-index values, calibration plots, and decision curve analyses were used to gauge the discrimination, calibration, and clinical utility, respectively, of this predictive model. Validation was then performed using internal bootstrapping validation and the external cohorts.
Results A predictive nomogram was successfully constructed incorporating hypertension status, plasma fibrinogen levels, blood urea nitrogen (BUN), density, ground-glass opacity (GGO), and pulmonary nodule size as significant variables associated with nodule status. This model exhibited good discriminative ability, with a C-index value of 0.765 (95% CI: 0.722-0.808), and was well-calibrated. In validation analyses, this model yielded C-index values of 0.892 (95% CI: 0.844-0.940) for the external cohort and 0.853 (95% CI: 0.807-0.899) for external cohort 2. In the internal bootstrapping validation, the C-index value still reached 0.753. Decision curve analyses supported the clinical value of this predictive nomogram when used at an NSCLC probability threshold of 18%.
Conclusion The nomogram constructed in this study, which incorporates hypertension status, plasma fibrinogen levels, BUN, density, GGO status, and pulmonary nodule size, was able to reliably predict NSCLC risk in this Chinese cohort of patients presenting with pulmonary nodules.


Introduction
Lung cancer is a form of malignancy arising due to the unrestrained growth of bronchial and lung cells [1,2], and it is one of the leading causes of mortality in the world [3]. Rates of lung cancer have been rising rapidly in recent years, particularly in more heavily industrialized nations [4]. Currently, lung cancer patients exhibit 5-year survival rates of approximately 16.6% [5], and roughly 1 million individuals in China are predicted to be diagnosed with lung cancer by the year 2025, such that China exhibits the highest global lung cancer incidence rate. Because NSCLC accounts for the majority of lung cancer cases, its early treatment is a focus of current research.
Key risk factors associated with lung cancer development include specific genetic mutations, smoking, and environmental exposures such as air pollution. There is also some evidence suggesting that factors such as a poor diet, alcohol intake, estrogen levels, the smoking of marijuana, and infection with human papillomavirus (HPV), human immunodeficiency virus (HIV), and Epstein-Barr virus may increase lung cancer risk, although such evidence remains somewhat inconclusive [6]. Analyses of patient computed tomography (CT) scans often reveal pulmonary nodules, and many models have been developed to gauge the link between such nodules and lung cancer risk, including the Brock model [7] and the Mayo model. These models, however, often do not take epidemiological variables, clinical findings, and CT scan results into consideration at the same time, limiting their value as predictors of the relative risk of a given pulmonary nodule being malignant. The development of more reliable and accurate predictive tools has the potential to enable early intervention and treatment for NSCLC patients, maximizing their odds of positive outcomes. Herein, we analyzed 28 variables that were considered potentially relevant to the diagnosis of a given pulmonary nodule as being benign or malignant based on previous studies [1,7-9].
By analyzing epidemiological, clinical, and CT-related factors for patients with pulmonary nodules that had undergone surgical treatment, we sought to develop a simple but robust predictive model that would enable the relative assessment of NSCLC risk based only upon characteristics that can be readily assessed prior to surgery or other therapeutic interventions.

Patients
The Ethics Committee of the Affiliated Lihuili Hospital of Ningbo University approved this study (approval no. KY2020PJ141). Patients enrolled in the design cohort were individuals from China recruited at the Xingning campus of Lihuili Hospital between October 2020 and February 2022, with the external validation cohort being recruited from April 2022 to June 2022. An external validation cohort 2 was recruited from March 2023 to June 2022 at the Eastern campus of Lihuili Hospital. Eligible patients were individuals that had undergone surgical resection following pulmonary nodule identification. Small cell lung cancer cases were excluded to avoid bias arising from their small sample size. Patients provided written informed consent to participate in this study. Inclusion criteria: (1) pulmonary nodules that were detected through CT scanning; (2) patients who were asymptomatic at the time of diagnosis; (3) patients physically able to undergo surgery. Any patients diagnosed with serious cognitive or physical impairments, or other serious diseases, were excluded from the study cohort. Data including patient clinical, demographic, and disease-related characteristics were retrieved from patient medical records.

Statistical analysis
Data are given as numbers (percentages), and were analyzed using R (v 4.2.1; https://www.R-project.org) and IBM SPSS Statistics 23.0. The LASSO method, which enables the reduction of high-dimensional datasets, was utilized as a means of selecting the optimal predictors associated with NSCLC risk among the included pulmonary nodule patients. Those features that yielded non-zero coefficient values in this LASSO regression analysis were retained for nomogram incorporation. The final predictive model was constructed via a univariate logistic regression analysis followed by a multivariate logistic regression analysis, with all significance levels being two-sided. The design cohort was used to develop the predictive model, with calibration curves being used to assess nomogram calibration. Significant calibration curve results were indicative of a model that was not perfectly calibrated. Model discrimination performance was assessed based on the value of Harrell's C-index. Validation of this nomogram was additionally performed to calculate an accurate C-index value by internal bootstrapping validation (1,000 bootstrap resamples) and external validation. Decision curve analyses were used to assess the clinical utility of this NSCLC risk nomogram by quantifying the net benefit at different probability thresholds in the cohorts, with the net benefit being calculated by subtracting the proportion of patients with false-positive results from the proportion of patients with true-positive results and by weighing the relative harm of failing to intervene against the potential negative outcomes associated with an unnecessary intervention. Receiver operating characteristic (ROC) curves were also used to assess the precision of this predictive risk model.
The net reclassification improvement index (NRI) and integrated discrimination improvement index (IDI) analyses were performed to calculate the improvement of the new model.
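The net-benefit calculation that underlies the decision curve analysis described above can be sketched as follows. This is a minimal illustration, assuming the standard weighting of false positives by the odds of the threshold probability, pt/(1 − pt); the function names and the toy data are hypothetical and are not taken from the study, which was analyzed in R and SPSS.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of intervening on patients whose predicted risk >= threshold.

    y_true: list of 0/1 outcomes (1 = malignant), y_prob: predicted risks.
    """
    n = len(y_true)
    true_pos = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    false_pos = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # Odds of the threshold: how much harm one false positive carries
    # relative to the benefit of one true positive.
    weight = threshold / (1.0 - threshold)
    return true_pos / n - (false_pos / n) * weight


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: intervene on every patient."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    outcomes = [1, 1, 0, 0]
    risks = [0.9, 0.6, 0.4, 0.1]
    for pt in (0.2, 0.5):
        print(pt, net_benefit(outcomes, risks, pt), net_benefit_treat_all(outcomes, pt))
```

The treat-none strategy has a net benefit of zero at every threshold, so a model is useful at a given threshold only if its net benefit exceeds both zero and the treat-all value.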

Patient characteristics
In total, data from 515 patients with pulmonary nodules that visited our clinic between October 2020 and February 2022 were included in the design cohort for this study, while data from 140 patients collected from April 2022 to June 2022 were designated as the external validation cohort and patients from the Eastern campus were set as external validation cohort 2. Patients aged 21-86 (mean age: 58.97 ± 12.02 years) in the design cohort were separated into groups with benign nodules and malignant lesions, as were patients aged 26-78 (mean age: 57.17 ± 11.18 years) in the external cohort and patients aged 22-85 (mean age: 56.52 ± 12.20 years) in external cohort 2. For details regarding the demographic and clinical characteristics of patients in these groups, see Table 1.

Feature selection and predictive model development
In total, 28 potentially relevant features were evaluated for inclusion in a predictive model. Of these features, 14 were ultimately selected through a LASSO regression analysis of the 515 patients in the design cohort (Fig. 1A and B). These features included "border clear", vessels pass through, hypertension status, smoking history, drinking history, blood glucose, BUN, serum uric acid (SUA), triglyceride (TG) levels, plasma fibrinogen levels, density, ground-glass opacities (GGOs), spicule sign, and pulmonary nodule size. Univariate and multivariate logistic regression analyses were then performed (Table 2). The P-value of 0.624 in the Hosmer-Lemeshow test indicated non-significance. CT characteristics and correlative pathological results of representative nodules are shown in Fig. 2. A predictive model incorporating these significant variables was developed using the design cohort (Fig. 3). Nodule density was defined as being "low" when it exhibited a CT value that was higher than that of pulmonary tissue but lower than that of pulmonary vessels, "intermediate" for nodules with solid and GGO components, and "high" when CT values were greater for the nodule than for pulmonary vessels. The features of the Mayo model, by comparison, were smoking history, age, nodule diameter, cancer history, upper lobe location, and spicule sign. This model is described by the formula P = e^x / (1 + e^x), where x = −6.8272 + (0.7917 × smoking history) + (0.0391 × age) + (0.1274 × nodule diameter) + (1.3388 × cancer history) + (0.7838 × upper lobe location) + (1.0407 × spicule sign). A recent study [10] published in Chest showed that the parsimonious Brock model (including gender, size, upper location and spicule sign) could predict cancer risk well, and we calculated the performance of this model in our cohorts.
We classified GGOs as pure GGOs (pGGOs, n = 208) and mixed GGOs (mGGOs, n = 81). The relationship between GGO type and lung cancer was further analyzed. A P-value of 0.460 was obtained in the univariate analysis, and NSCLC was excluded in the forward likelihood ratio logistic analyses (Table 3). Further, mGGOs were positively correlated with nodule size when compared with pGGOs.
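The Mayo model formula quoted above can be evaluated with a few lines of code. The sketch below uses only the intercept and coefficients given in the text; the binary coding of the indicator variables (1 = present, 0 = absent), the units (age in years, diameter in mm), and all names are assumptions made for illustration.

```python
import math

# Intercept and coefficients as quoted in the text for the Mayo model.
MAYO_INTERCEPT = -6.8272
MAYO_COEFFICIENTS = {
    "smoking_history": 0.7917,  # assumed coding: 1 = yes, 0 = no
    "age": 0.0391,              # assumed unit: years
    "nodule_diameter": 0.1274,  # assumed unit: mm
    "cancer_history": 1.3388,   # assumed coding: 1 = yes, 0 = no
    "upper_lobe": 0.7838,       # assumed coding: 1 = upper lobe, 0 = other
    "spicule_sign": 1.0407,     # assumed coding: 1 = present, 0 = absent
}


def mayo_probability(features):
    """Return P = e^x / (1 + e^x) for a dict of the six Mayo model features."""
    x = MAYO_INTERCEPT + sum(
        coefficient * features[name]
        for name, coefficient in MAYO_COEFFICIENTS.items()
    )
    # 1 / (1 + e^-x) is algebraically identical to e^x / (1 + e^x).
    return 1.0 / (1.0 + math.exp(-x))


if __name__ == "__main__":
    # Hypothetical patient, for illustration only.
    patient = {
        "smoking_history": 1,
        "age": 60,
        "nodule_diameter": 10,
        "cancer_history": 0,
        "upper_lobe": 1,
        "spicule_sign": 0,
    }
    print(round(mayo_probability(patient), 4))
```

Because every coefficient is positive, the predicted probability rises monotonically with each feature, which makes the model easy to sanity-check on individual cases.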

Assessment of predictive risk model performance
Calibration curves for this predictive nomogram when used to analyze the design cohort revealed it to be well-calibrated, with a C-index value of 0.765 (95% CI: 0.722-0.808) (Fig. 4A). Similarly, the C-index values for the external validation cohort, external validation cohort 2 and internal bootstrapping validation were 0.892 (95% CI: 0.844-0.940) (Fig. 4B), 0.853 (95% CI: 0.807-0.899) (Fig. 4C) and 0.753, respectively, consistent with the discriminative value of this model, suggesting that it exhibits good predictive utility.
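For a binary outcome, the C-index is the proportion of malignant/benign pairs in which the malignant nodule receives the higher predicted risk, with ties counted as one half. The sketch below computes it and a percentile bootstrap interval. It resamples fixed predictions only, which is an assumption made for brevity: it is not the optimism-corrected procedure that refits the model in every resample, and the study's exact bootstrap method is not specified here. All names and data are hypothetical.

```python
import random


def c_index(y_true, y_prob):
    """Concordance index for binary outcomes (equals the ROC AUC)."""
    positives = [p for y, p in zip(y_true, y_prob) if y == 1]
    negatives = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    concordant = 0.0
    for p_pos in positives:
        for p_neg in negatives:
            if p_pos > p_neg:
                concordant += 1.0
            elif p_pos == p_neg:
                concordant += 0.5  # ties count as half-concordant
    return concordant / (len(positives) * len(negatives))


def bootstrap_c_index(y_true, y_prob, n_resamples=1000, seed=0):
    """Return (mean, 2.5th percentile, 97.5th percentile) over resamples."""
    n = len(y_true)
    if not 0 < sum(y_true) < n:
        raise ValueError("need at least one positive and one negative case")
    rng = random.Random(seed)
    values = []
    while len(values) < n_resamples:
        indices = [rng.randrange(n) for _ in range(n)]
        resampled_y = [y_true[i] for i in indices]
        # Skip resamples containing a single class; the C-index is undefined.
        if 0 < sum(resampled_y) < n:
            values.append(c_index(resampled_y, [y_prob[i] for i in indices]))
    values.sort()
    lower = values[int(0.025 * (len(values) - 1))]
    upper = values[int(0.975 * (len(values) - 1))]
    return sum(values) / len(values), lower, upper


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    outcomes = [1, 1, 1, 0, 0, 0, 0, 1]
    risks = [0.9, 0.7, 0.4, 0.5, 0.2, 0.1, 0.3, 0.6]
    print(c_index(outcomes, risks))
    print(bootstrap_c_index(outcomes, risks, n_resamples=200))
```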

Different types of NSCLC compared with normal cases by multinomial logistic analyses
We classified NSCLC into carcinoma in situ, minimally invasive adenocarcinoma (MIA), invasive adenocarcinoma (IAC) and other types according to pathological results. MIA and IAC accounted for 80.32%, and the remaining types of NSCLC accounted for only 19.68%. Among them, for example, the degree of invasion increases from MIA to IAC. Therefore, an assessment of each factor in the model among the various types of NSCLC was necessary. We excluded carcinoma in situ because of its small sample size and concentrated mainly on evaluating the associations with MIA and IAC in

Analysis of model clinical utility
Decision curve analyses for this predictive nomogram were next performed (Fig. 5). These analyses revealed that when the threshold probability of a patient and a doctor is > 18% and < 90%, and > 3%, in the two cohorts, respectively, this nomogram exhibits value as a means of predicting NSCLC risk. Net benefit was comparable, with some overlap, within this range when assessing NSCLC. Our model showed a higher overall net benefit (Fig. 5A, B, C) when compared with the Mayo model (the red line) and the simplified Brock model (the green line) in the two cohorts.

ROC curve analysis
ROC curve analyses of the cohorts included in this study confirmed the predictive value of the models, with area under the curve values of 0.765 vs. 0.548 vs. 0.565 for the design cohort (Fig. 6A), 0.892 vs. 0.741 vs. 0.672 for the external validation cohort (Fig. 6B) and 0.853 vs. 0.715 vs. 0.728 for external validation cohort 2 (Fig. 6C). The area under the ROC curve (AUC) values of our model (the blue line) were all higher than those of the Mayo model (the red line) and the parsimonious version of the Brock model (the green line).

NRI and IDI analysis of the three models
As a supplement to the comparison of the AUC values, we calculated the net reclassification improvement index (NRI) and integrated discrimination improvement index (IDI) of the models to assess the improvement offered by our model. The results indicated that our model could identify benign and malignant nodules more accurately.
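The two indices can be illustrated with a short sketch. It assumes the category-free (continuous) form of the NRI, in which any increase in predicted risk counts as upward reclassification; the text does not state whether a categorical or continuous NRI was used, so this choice, the function names, and the toy data are all assumptions.

```python
def _mean(values):
    return sum(values) / len(values)


def idi(y_true, p_old, p_new):
    """Integrated discrimination improvement of p_new over p_old.

    IDI = (mean risk change in events) - (mean risk change in non-events).
    """
    event_change = _mean([n - o for y, o, n in zip(y_true, p_old, p_new) if y == 1])
    nonevent_change = _mean([n - o for y, o, n in zip(y_true, p_old, p_new) if y == 0])
    return event_change - nonevent_change


def continuous_nri(y_true, p_old, p_new):
    """Category-free NRI: net upward moves in events plus net downward moves in non-events."""
    events = [(o, n) for y, o, n in zip(y_true, p_old, p_new) if y == 1]
    nonevents = [(o, n) for y, o, n in zip(y_true, p_old, p_new) if y == 0]
    up_events = sum(1 for o, n in events if n > o) / len(events)
    down_events = sum(1 for o, n in events if n < o) / len(events)
    up_nonevents = sum(1 for o, n in nonevents if n > o) / len(nonevents)
    down_nonevents = sum(1 for o, n in nonevents if n < o) / len(nonevents)
    return (up_events - down_events) + (down_nonevents - up_nonevents)


if __name__ == "__main__":
    # Hypothetical toy data: the new model raises risk for both malignant
    # nodules and lowers it for both benign nodules.
    outcomes = [1, 1, 0, 0]
    old_risks = [0.6, 0.5, 0.5, 0.4]
    new_risks = [0.8, 0.7, 0.3, 0.2]
    print(idi(outcomes, old_risks, new_risks))
    print(continuous_nri(outcomes, old_risks, new_risks))
```

Positive values of either index indicate that the new model moves predicted risks in the correct direction more often, or by a larger amount, than the reference model.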

Discussion
Nomograms are valuable predictive tools that have been widely utilized in oncology and other clinical and research fields, offering a user-friendly approach to intuitively assessing the odds of a given diagnosis or outcome based on a set of specific variables, thereby aiding in clinical decision-making [11]. Many models for the treatment of pulmonary nodules were established based upon certain epidemiological variables and CT scan results. However, clinical findings such as hematological biomarkers are also very important for the diagnosis of lung cancer [1]. Moreover, for some of these variables, such as GGO, the surgical criteria are not well defined such that treatments are often conducted according to the experience of the operating surgeons [12,13]. As such, we herein sought to develop a new nomogram capable of predicting the relative risk of malignancy when evaluating patients with pulmonary nodules.
We designed and validated a novel predictive model capable of assessing the risk of a given lung nodule being benign or malignant based on analysis of data from patients that had undergone pulmonary nodule resection. The resultant model incorporated demographic, disease-, and treatment-related features to easily predict the odds of a given pulmonary nodule corresponding to
Prior studies have confirmed that hypertension is a common comorbidity in cancer patients [14]. Several mechanisms may explain this observation, including the fact that hypertension can increase VEGF levels in the plasma [15]. We identified hypertension as a risk factor for lung nodule malignancy. Fibrinogen has also been significantly linked to the risk of lung cancer in the past [16], with Kuang et al. having demonstrated that a combination of the beta and gamma chains of fibrinogen may offer value as a sensitive biomarker for differentiating between lung nodules that are benign and malignant [17], potentially explaining the significance of plasma fibrinogen levels in our model. One study indicated that the BUN to serum albumin ratio might predict patients with serious pulmonary cancer [18]. BUN has a positive relationship with pulmonary tumor risk and was therefore included in a risk prediction model [19]. Some studies demonstrated that a maximum nodule diameter > 8 mm was an independent risk factor for malignancy [20] and that the presence of a solid element in GGO nodules might cause lymph node metastasis [21]. GGO findings have been reported to be associated with cancer rates as high as 63%, with many surgeons believing that GGO nodules should be resected, particularly if they grow in size. Persistent GGO nodules may be indicative of a greater risk of malignancy when solid components are evident [12]. Tu et al. found CT density to be a valuable feature when differentiating between nodules that were malignant and benign [22]. Qiu et al. further determined that solitary ground-glass opacity nodule size and density upon high-resolution CT evaluation were associated with invasive adenocarcinoma risk [23]. Nodule size may be the most important variable included in our predictive model, given that nodule diameter is a key determinant of treatment under the British Thoracic Society guidelines [24] and Fleischner Society Guidelines [25]. For nodules ≥ 10 mm in diameter, the odds of malignancy in the NELSON screening study were 15.2% [26]. As such, we included nodule diameter as the size variable in the present study. As the comparison of AUC values between different models has certain limitations, we calculated the NRI and IDI of the models to explain the improvement of our model.
Herein, we thus designed a risk nomogram that may aid clinicians in differentiating between patients with benign or malignant lung nodules. It may also aid in the optimal selection of pulmonary nodules in the context of clinical research. For example, this model might be used to aid investigators in selecting patients with larger nodules and other risk-related findings when identifying candidates for surgical procedures or other interventions. Early interventions including CT scans, biochemical analyses of blood samples, and family support can better benefit low-risk patients, while regular clinical examination can ensure the appropriate monitoring of lung nodules to better guide the appropriate assessment of patient diagnosis.
Previous classical models based on large-scale screening experiments have been widely used for clinical evaluation. However, patients who attend different hospitals for treatment are inevitably subject to selection by human factors. For example, as a tertiary hospital, our hospital serves for
Although our model showed good accuracy and stability in different validation cohorts, among the variables included in the model, BUN demonstrated statistical significance solely within the training cohort and did not exhibit significance in either the external validation cohort or external validation cohort 2. This suggests potential instability of this index and highlights room for improvement within the model. These findings indicate that there is still scope for enhancing the current model, which is limited by its inclusion of a restricted number of variables. With the continuous advancement of artificial intelligence technology, we believe that future research endeavors will benefit from larger training cohorts encompassing more diverse variables, thereby facilitating the establishment of more precise and straightforward prediction models.

Limitations
There are multiple limitations to this study. For one, all patients in our study were enrolled from a single center over a relatively limited study period. Additionally, risk factor analyses did not incorporate all possible risk factors that may be relevant to the differentiation between benign and malignant nodules. Other relevant factors not included in this analysis included the number of nodules and specific comorbidity incidence rates. In addition, the selection of variables was made by taking previous studies into account and the patients were from a tertiary referral center, potentially contributing to significant bias affecting these statistical analyses. Also, the comparison of AUC values between different models has certain limitations. Lastly, while a bootstrap testing approach was used to validate our nomogram, the patients used for this validation approach may not be sufficient to ensure the generalizability of these data to patients from other countries or regions. As such, further external validation in a wider pulmonary nodule patient population will be essential in the future.

Conclusions
In summary, we herein designed a novel nomogram with good accuracy that offers value as a means of differentiating between benign and malignant pulmonary nodules, enabling clinicians to better plan patient treatment. Such individualized risk analyses offer clinicians an opportunity to appropriately monitor and treat patients. However, further work will be needed to validate this nomogram in larger patient populations and to establish whether the treatment decisions made based on this nomogram will reduce rates of incorrect diagnosis and treatment planning for patients with pulmonary nodules.

Fig. 1 LASSO binary logistic regression model-based clinicopathological feature selection. Notes: A Five-fold cross-validation was used for optimal parameter (λ) selection in the LASSO model via minimum criteria, with a partial likelihood deviance (binomial deviance) curve being plotted against log(λ). Dashed vertical lines were drawn at the optimal values using the minimum criteria and the 1-SE criteria. The selected optimal λ value was 0.021. B LASSO coefficient profiles for 28 potential features were generated and plotted against the log(λ) sequence. Five-fold cross-validation was used to draw vertical lines at selected values, with optimal λ results yielding fourteen total features with non-zero coefficient values. SE: standard error

Fig. 2 CT characteristics of different lung nodules (A) and correlative pathology stained by hematoxylin & eosin (B). Row 1: the pathological finding of a 16-mm high-density nodule for a 52-year-old woman was benign (X 100); Row 2: the pathological finding of a 5.7-mm low-density nodule for a 34-year-old woman was a carcinoma in situ (X 200); Row 3: the pathological finding of an 8-mm low-density nodule for a 56-year-old man was an MIA (X 100); Row 4: the pathological finding of an 11-mm low-density nodule for a 50-year-old woman was an IAC (X 100); Row 5: the pathological finding of a 7-mm partly solid-density nodule for a 63-year-old woman was an MIA (X 100); Row 6: the pathological finding of an 11-mm partly solid-density nodule for an 81-year-old woman was an IAC (X 100)

Fig. 4 Calibration curves for NSCLC nomogram predictions in the design cohort (A), external validation cohort (B) and external validation cohort 2 (C). Note: Predicted risk of NSCLC and actual NSCLC diagnoses are shown on the x-axis and y-axis, respectively, with the dotted line corresponding to a diagnostic model with perfect predictive accuracy and the solid line corresponding to actual nomogram performance. The closer these lines are to one another, the better the predictive performance of this nomogram

Fig. 5 Decision curve analysis. Notes: Net benefit is shown on the y-axis, with the blue or red line corresponding to the NSCLC risk nomogram. The thin and thick lines respectively correspond to the assumptions that all patients or no patients had NSCLC, with the decision curve demonstrating that if the threshold probability of a patient and a doctor is > 18% and < 90% (A), > 3% (B) and > 7% and < 90% (C) in our model for the three cohorts, respectively, then the use of this nomogram to predict the risk of NSCLC is more beneficial than a treat-all or treat-none interventional scheme for these patients. The red line stands for the Mayo model, the blue line stands for our model and the green line stands for the parsimonious version of the Brock model

Fig. 6 Receiver operating characteristic curve analyses for the design cohort (A), external validation cohort (B) and external validation cohort 2 (C). The red line stands for the Mayo model, the blue line stands for our model and the green line stands for the parsimonious version of the Brock model

Table 1 (continued) GGT gamma-glutamyl transpeptidase, CR creatinine, BUN blood urea nitrogen, SUA serum uric acid

similar trend. Sizes of nodules in the different types of NSCLC were all significant (P: 0.015 vs. 0.000 vs. 0.017); moreover, when the pulmonary nodule size was ≥ 8 mm, the degree of infiltration might be deeper in MIA and IAC.
In addition to the above features, IAC and other types in the multinomial models shared hypertension as a risk factor, while BUN and plasma fibrinogen levels seemed to be risk factors for MIA.

Table 2 The features for patients with pulmonary nodules in the design cohort using multivariate logistic regression analyses

Table 3 Variable significance analyses in the two cohorts using multivariate logistic regression with the forward stepwise likelihood ratio method. BMI body mass index, GGT gamma-glutamyl transpeptidase

Table 4 Ten-fold cross-validation analysis for the design cohort. The values of kappa, accuracy and AUC in the design cohort are shown in the table. The final average is calculated from ten groups (AUC = 0.747 ± 0.081; Accuracy = 0.732 ± 0.064; Kappa = 0.376 ± 0.155)

a NSCLC diagnosis. The model developed herein was accurate, and exhibited good calibration and discrimination in our validation cohort. The C-index value in this validation cohort was also high, indicating that the nomogram can be accurately used to gauge patient risk of pulmonary nodule malignancy.

Table 5 Ten-fold cross-validation analysis for the validation cohort. The values of kappa, accuracy and AUC in the validation cohort are shown in the table. The final average is calculated from ten groups (AUC = 0.849 ± 0.104; Accuracy = 0.761 ± 0.092; Kappa = 0.321 ± 0.259)

Table 6 Ten-fold cross-validation of the model in validation cohort 2. The values of kappa, accuracy and AUC in the validation cohort are shown in the table. The final average is calculated from ten groups (Kappa = 0.461 ± 0.179; Accuracy = 0.737 ± 0.087; AUC = 0.833 ± 0.079)

Table 7 Odds ratios of model variables in different types of NSCLC compared with normal cases using binomial and multinomial logistic regression analyses. PF abnormal plasma fibrinogen levels, Other other types of NSCLC, BUN abnormal blood urea nitrogen

many patients come from subordinate hospitals, which may explain the high proportion of patients with nodules ≥ 8 mm and malignant nodules in our cohorts. Therefore, it is necessary to develop clinical assessment models for pulmonary nodules based on different groups of patients. Accurate predictive evaluation can aid surgeons in predicting lung cancer risk in individual patients, ensuring timely intervention for high-risk patients while reducing the need for interventional treatment in low-risk patients. Accurately predicting the risk of lung cancer in a given patient is very challenging, and appropriate measurements together with multifaceted interventional approaches are thus the most reliable approach to detecting and evaluating patients with pulmonary nodules. Further research on this topic is warranted, as the accurate detection of pulmonary nodules alone is necessary but insufficient for treating affected patients, underscoring directions for future study.