
Deep learning approaches for differentiating thyroid nodules with calcification: a two-center study



Calcification is a common phenomenon in both benign and malignant thyroid nodules. However, the clinical significance of calcification remains unclear. Therefore, we explored a more objective method for distinguishing between benign and malignant thyroid calcified nodules.


This retrospective study, conducted at two centers, involved a total of 631 thyroid nodules, all of which were pathologically confirmed. Ultrasound image sets were employed for analysis. The primary evaluation index was the area under the receiver-operator characteristic curve (AUROC). We compared the diagnostic performance of deep learning (DL) methods with that of radiologists and determined whether DL could enhance the diagnostic capabilities of radiologists.


The Xception classification model exhibited the highest performance, achieving an AUROC of up to 0.970, followed by the DenseNet169 model, which attained an AUROC of up to 0.959. Notably, both DL models outperformed radiologists (P < 0.05). The success of the Xception model can be attributed to its use of depthwise separable convolution, which reduces the parameter count of each convolutional layer. This design enables the model to extract features more effectively, resulting in superior performance, particularly when dealing with limited data.


This study conclusively demonstrated that DL outperformed radiologists in differentiating between benign and malignant calcified thyroid nodules. Additionally, the diagnostic capabilities of radiologists could be enhanced with the aid of DL.



Thyroid nodules represent a common endocrine disease, with a detection rate of up to 68% in adults [1, 2]. The majority of these nodules are benign and pose minimal risk [3]. However, malignant thyroid nodules can burden patients psychologically, with certain high-risk cases carrying the potential for metastasis. Therefore, it is crucial to differentiate between benign and malignant thyroid nodules. Ultrasonography (US) and fine needle aspiration (FNA) constitute the two primary methods for distinguishing between benign and malignant thyroid nodules [4, 5]. FNA, being an invasive procedure, can be significantly burdensome for patients [6]. Hence, US is the preferred method for thyroid nodule evaluation [3]. During ultrasound examinations, several nodule features, including calcification, a hypoechoic appearance, irregular margins, and a taller-than-wide shape, are strongly correlated with malignancy [2]. Among these features, calcification is present in approximately 19.8–32.1% of all thyroid nodules and is considered one of the most critical ultrasound indicators [7, 8]. However, calcification is not a specific marker for malignancy since it can also occur in benign nodules [9].

Currently, radiologists primarily differentiate between benign and malignant calcified thyroid nodules by assessing the type of calcification present [10]. Calcification can be categorized into three types: microcalcification, macrocalcification, and peripheral calcification [11]. Among these types, microcalcification carries the highest risk of malignancy. The American College of Radiology classifies thyroid nodules with echogenic foci in ultrasound images into three groups: macrocalcification is assigned one point, peripheral calcification is assigned two points, and the presence of punctate echogenic foci is assigned three points [12]. The higher the point value, the greater the associated risk of malignancy [13]. However, both ultrasound imaging and US-FNA methods have limitations in determining the type of calcification. Although microcalcification (indicated by punctate echogenic foci) exhibits the highest correlation with malignancy, nodules with microcalcification are not necessarily malignant [14,15,16]. Furthermore, nodules exhibiting either macrocalcification or peripheral calcification cannot be automatically classified as benign [17]. In conclusion, the clinical significance of various calcification types remains unclear, underscoring the need for a more objective approach to distinguishing between benign and malignant thyroid nodules featuring calcification.

In recent years, there has been a growing interest in artificial intelligence (AI), particularly in the context of deep learning (DL) and its automatic image analysis capabilities. Liu et al. [18] developed a combined DL model with a high discrimination ability for predicting malignant microcalcification in Breast Imaging-Reporting and Data System (BI-RADS) 4 breast nodules, outperforming junior doctors. Patel et al. [19] utilized DL segmentation and computed tomography radiomics to evaluate the microarchitectural changes in cardiovascular calcification following in vivo interventions. Nam et al. [20] developed DL algorithms capable of detecting calcification in chest radiographs, while Yao et al. [21] developed a multimodal DL model for predicting cervical lymph node metastasis in papillary thyroid carcinoma. However, limited studies currently apply DL techniques to predict the malignancy risk associated with calcified thyroid nodules.

The risk of malignancy linked to various types of calcifications in thyroid nodules remains inconclusive, and the clinical significance of calcification has not been adequately studied. Therefore, our study employs DL techniques to automatically extract malignant features from ultrasound images of calcified thyroid nodules and predict the risk of malignancy. Our objective is to compare the diagnostic performance of our method with that of radiologists and investigate whether it can enhance the diagnostic capabilities of radiologists in making accurate clinical decisions for patients with thyroid nodules.


Study design and datasets

This retrospective diagnostic study was conducted at two centers and included patients who met the following criteria: (1) age 18 years or older; (2) ultrasonographic diagnosis of thyroid nodules; (3) detection of calcification in both ultrasound and pathology reports; and (4) a definitive pathological diagnosis of either benign or malignant nodules (confirmed through surgical specimen or FNA [Bethesda II or VI]). Two pathologists made the pathological diagnoses, and in case of any disagreement, the diagnosis of a third senior pathologist was used. Ultrasound images of low quality with incomplete lesions were excluded after screening.

The requirement for written informed consent was waived by the Independent Ethics Committee of Zhejiang Cancer Hospital and the Medical Ethics Committee of Taizhou Cancer Hospital (IRB-2020-287, IRB-2023-001-IIT), and all images and data were anonymized.

Evaluation metrics

The primary outcome measure was the area under the receiver-operator characteristic curve (AUROC) for diagnosing calcified thyroid nodules. The secondary outcomes included accuracy (ACC), sensitivity (SEN), specificity (SPE), positive predictive value (PPV), and negative predictive value (NPV) for calcified thyroid nodules. We compared the diagnostic performance of the DL models with that of radiologists with varying levels of seniority and investigated whether radiologists could improve their diagnostic accuracy with the aid of the DL model.


Figure 1 presents an overall schematic of this retrospective diagnostic study. We conducted a retrospective search in the thyroid ultrasound image databases of both Zhejiang Cancer Hospital (Center 1) and Taizhou Cancer Hospital (Center 2) to identify ultrasound images of patients with thyroid nodules recorded between January 2020 and December 2022. We included images with complete ultrasound data and clear pathological results, resulting in a total of 1265 images of 546 nodules from Center 1 and 126 images of 85 nodules from Center 2. We divided the 546 nodules from Center 1 into training and validation sets in an 8:2 ratio. The 85 nodules from Center 2 were designated as a separate test set. Notably, there was no patient overlap between Center 1 and Center 2.
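
The 8:2 division described above can be illustrated with a minimal sketch. It assumes the split is made at the nodule level, so that all images of one nodule land in the same subset; the text does not state a random seed or whether the split was stratified by pathology. The function name `split_by_nodule` and the toy image identifiers are hypothetical.

```python
import random


def split_by_nodule(image_to_nodule, train_fraction=0.8, seed=0):
    """Split image IDs into training and validation lists at the nodule level."""
    # Shuffle the unique nodule IDs reproducibly (the seed is an assumption).
    nodules = sorted(set(image_to_nodule.values()))
    random.Random(seed).shuffle(nodules)

    # The first 80% of shuffled nodules form the training set.
    n_train = round(len(nodules) * train_fraction)
    train_nodules = set(nodules[:n_train])

    train = [img for img, nod in image_to_nodule.items() if nod in train_nodules]
    val = [img for img, nod in image_to_nodule.items() if nod not in train_nodules]
    return train, val


# Hypothetical toy data: 10 nodules with 2 images each.
image_to_nodule = {
    f"img_{n}_{k}": f"nodule_{n}" for n in range(10) for k in range(2)
}
train_images, val_images = split_by_nodule(image_to_nodule)
```

Splitting on nodule identifiers rather than on individual images avoids placing two views of the same nodule on both sides of the split, which would inflate validation performance.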

Fig. 1

Overall schematic for differentiating between benign and malignant thyroid nodules with calcification using DL models

A total of 1391 ultrasound images of thyroid nodules in Digital Imaging and Communications in Medicine (DICOM) format underwent a lossless conversion to the Joint Photographic Experts Group (JPG) format. Extraneous information around the original image, including ultrasound equipment and patient information, was manually cropped out. As shown in Fig. 2, two radiologists, each with over five years of work experience, used LabelMe software to outline the nodule area by marking it within a rectangular box on the complete ultrasound image. Following this delineation, a JSON file was generated containing the coordinates of the upper-left corner of the box together with its width and height. Subsequently, the nodules were extracted from the complete ultrasound image using the JSON files, resulting in images that only featured the nodule area. To meet the input size requirements of the various models, the cropped nodule images were resized to 224 × 224, 299 × 299, or 331 × 331 pixels, depending on the model, and then normalized. Due to the limited number of thyroid calcified nodules and the sparse nature of medical data, various data augmentation techniques such as rotation, flipping, scaling, translation, and mixing [22, 23] were applied to transform the existing ultrasound images. These augmentation methods serve to enhance the generalization capabilities of DL models. The ultrasound images of thyroid calcified nodules from Center 1 were divided into a training set (80%) and a validation set (20%). The training set was expanded to five times its original volume through augmentation and used for training the DL models.
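
The crop, resize, and normalize steps described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the JSON field names (`x`, `y`, `width`, `height`) are hypothetical, the image is represented as a plain 2-D list of grey levels, and nearest-neighbour resizing with division by 255 stands in for whatever interpolation and normalization the study actually used.

```python
import json


def crop_nodule(image, box):
    """Keep only the pixels inside the box (upper-left x, y plus width, height)."""
    x, y, w, h = box["x"], box["y"], box["width"], box["height"]
    return [row[x:x + w] for row in image[y:y + h]]


def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize of a 2-D pixel list to out_h x out_w."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def normalize(image, max_value=255.0):
    """Scale pixel values to the range [0, 1] (assumed normalization)."""
    return [[p / max_value for p in row] for row in image]


# Hypothetical annotation and a tiny 6 x 8 synthetic frame.
annotation = json.loads('{"x": 2, "y": 1, "width": 4, "height": 3}')
frame = [[r * 8 + c for c in range(8)] for r in range(6)]

patch = crop_nodule(frame, annotation)                  # 3 rows x 4 columns
resized = normalize(resize_nearest(patch, 224, 224))    # 224 x 224 model input
```

In practice an imaging library would handle decoding and interpolation; the sketch only shows how the box coordinates select the nodule region before the image is brought to a model's input size.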

Fig. 2

An illustration of the process of inputting images

In this study, we employed a total of five convolutional neural network (CNN) models, each with distinct structures and depths: DenseNet121, DenseNet169, NASNetLarge, ResNet101v2 and Xception. These models were utilized to extract features from thyroid calcified nodules and establish the classification models. The neural network’s backpropagation method was employed to iteratively update the model parameters, ultimately facilitating the classification of benign and malignant thyroid calcified nodules.

For transfer learning in this study, we initialized these models with weights pre-trained on ImageNet. The original fully connected layers were removed and replaced with four fully connected layers containing 1024, 512, 512, and 2 neurons, respectively. Fine-tuning was applied to speed up convergence, with the models trained for 300 epochs. To prevent overfitting, three dropout layers were introduced between the fully connected layers, each randomly deactivating 50% of the neurons during training. Binary cross-entropy loss and the Adam optimizer were employed to iteratively update the model’s weight parameters. The initial learning rate for the optimizer was set at 0.001, with a dynamic adjustment strategy: if the ACC did not increase for five consecutive epochs during training, the learning rate was multiplied by a factor of 0.5. A batch size of eight was used. The SoftMax activation function was applied in the final fully connected classification layer to output the probabilities of benign and malignant thyroid calcified nodules. We re-evaluated the generalization performance of the model by conducting 3-fold cross-validation, independently re-adjusting the hyperparameters for each model during this process. Given that we employed DL models with varying convolutional layers and structures, maintaining consistency in hyperparameters across all models was crucial to ensure fair model training. Therefore, all hyperparameters were initialized with the default values provided by the TensorFlow library.
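
The learning-rate adjustment described above can be made concrete with a small sketch: start at 0.001 and multiply by 0.5 whenever the monitored accuracy fails to improve for five consecutive epochs. Whether training or validation accuracy was monitored is not stated, so the input here is simply a list of per-epoch accuracies; the function name `plateau_schedule` is hypothetical. The behaviour resembles the Keras `ReduceLROnPlateau` callback, although the text does not name the mechanism used.

```python
def plateau_schedule(accuracies, initial_lr=0.001, factor=0.5, patience=5):
    """Return the learning rate in effect at each epoch."""
    lr = initial_lr
    best = float("-inf")
    stale_epochs = 0
    rates = []
    for acc in accuracies:
        rates.append(lr)          # rate used while training this epoch
        if acc > best:
            best = acc            # improvement: reset the counter
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                lr *= factor      # plateau reached: halve the rate
                stale_epochs = 0
    return rates


# Hypothetical accuracy trace: two improving epochs, five flat ones, then a gain.
trace = [0.70, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.80]
rates = plateau_schedule(trace)
```

With this trace the rate stays at 0.001 for the first seven epochs and drops to 0.0005 for the eighth, after five epochs without improvement.
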
The ROC curve was generated from the average probabilities calculated during the three-fold cross-validation, and the remaining metrics were determined through voting to establish the final category results based on the outcomes of the three-fold cross-validation. Two junior and two senior radiologists were tasked with diagnosing 126 identical ultrasound images from Center 2, depicting thyroid nodules, to assess the potential clinical applicability of our model. The four radiologists initially conducted independent diagnoses of the thyroid nodules in the test set. Subsequently, after a washout period exceeding two months, they re-evaluated the same images with the assistance of a DL model. During the initial diagnosis, the radiologists were required to independently identify benign and malignant thyroid calcified nodules. Importantly, the pathological diagnosis for all nodules was established but kept confidential from the radiologists. The radiologists were informed that their diagnostic performance would be compared with that of the DL models. Following this initial phase, and after the washout period, we furnished the radiologists with the DL model’s reference diagnosis, which included the malignant probability of nodules and the classification of benign and malignant nodules based on the DL model’s assessment. The radiologists had the option to either adhere to their initial diagnosis or incorporate the DL model’s results into their final diagnosis. The diagnostic efficacy of the DL model remained undisclosed to the radiologists.
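
The fold-aggregation step mentioned at the start of the paragraph above (averaged probabilities for the ROC curve, voting for the final class) can be sketched as follows. The 0.5 decision threshold and the function names are assumptions; the text does not specify how each fold's vote was derived from its probability.

```python
def average_probabilities(fold_probs):
    """Mean malignant probability per image across the cross-validation folds."""
    n_folds = len(fold_probs)
    return [sum(per_image) / n_folds for per_image in zip(*fold_probs)]


def majority_vote(fold_probs, threshold=0.5):
    """Label an image malignant (1) when most folds exceed the threshold."""
    n_folds = len(fold_probs)
    labels = []
    for per_image in zip(*fold_probs):
        votes = sum(prob >= threshold for prob in per_image)
        labels.append(1 if votes * 2 > n_folds else 0)
    return labels


# Hypothetical malignant probabilities: 3 folds (rows) x 3 images (columns).
fold_probs = [
    [0.9, 0.4, 0.2],
    [0.8, 0.6, 0.1],
    [0.7, 0.3, 0.6],
]
mean_probs = average_probabilities(fold_probs)  # scores for the ROC curve
labels = majority_vote(fold_probs)              # classes for ACC, SEN, SPE, ...
```

Using the averaged probability for ranking and the vote for the hard label means the two kinds of metrics can occasionally disagree for a borderline image, which is worth keeping in mind when reading the tables.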

Statistical analysis

All statistical analyses were conducted using Python (version 3.8.13), NumPy (version 1.22.3), and SciPy (version 1.8.0). Quantitative data were expressed as mean ± SD. The main evaluation index was the AUROC. A 2 × 2 confusion matrix was generated to calculate ACC, SEN, SPE, PPV, NPV and F1-score (F1) [24, 25]. The receiver operator characteristic (ROC) curve was plotted, depicting the true positive rate (SEN) against the false positive rate (1 − SPE). Subsequently, AUROC values were calculated, and the significance of AUROC differences was assessed using the DeLong test, with P < 0.05 considered indicative of a significant difference between the two AUROCs.
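
The metrics listed above follow directly from the confusion-matrix counts, and the AUROC can be computed as the probability that a randomly chosen malignant nodule receives a higher score than a randomly chosen benign one. The sketch below shows both with plain Python; the function names and toy data are hypothetical, zero denominators are not guarded against, and the DeLong test itself is not implemented.

```python
def confusion_metrics(y_true, y_pred):
    """ACC, SEN, SPE, PPV, NPV and F1 from binary labels (1 = malignant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return {
        "ACC": (tp + tn) / len(pairs),
        "SEN": sen,
        "SPE": spe,
        "PPV": ppv,
        "NPV": npv,
        "F1": 2 * ppv * sen / (ppv + sen),
    }


def auroc(y_true, scores):
    """Fraction of (malignant, benign) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 4 malignant and 4 benign nodules.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
y_pred = [int(s >= 0.5) for s in scores]

metrics = confusion_metrics(y_true, y_pred)
area = auroc(y_true, scores)
```

The pairwise definition used in `auroc` is equivalent to the area under the ROC curve and is the quantity whose variance the DeLong test estimates.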



The study included a total of 631 thyroid nodules, and Table 1 provides an overview of the characteristics of the patients and nodules examined in the study. In Center 1, the mean age of the patients was 51.5 ± 11.6 years, comprising 380 females and 105 males. Meanwhile, Center 2 recorded a mean patient age of 54.5 ± 11.0 years, encompassing 58 females and 22 males. The average nodule size in Center 1 measured 1.3 ± 1.1 cm, with 72.7% of the nodules being smaller than 1.5 cm. The malignancy rate for nodules in Center 1 was 54.6%. Similarly, Center 2 reported an average nodule size of 1.4 ± 1.2 cm, with 69.4% of the nodules being smaller than 1.5 cm. The malignancy rate for nodules in Center 2 was 58.8%. Table 1 presents an overview of the image distribution within the dataset.

Table 1 Patient and Nodule Characteristics*

Performance of models

After 300 epochs of learning, the external test set from Center 2 was employed to evaluate the performance of the five classification models. The Xception classification model exhibited the highest performance, achieving an AUROC of up to 0.970, followed by the DenseNet169 model with an AUROC of up to 0.959. The key distinction between the Xception model and the other four models lies in its use of depthwise separable convolution. This structural element separates the channel axis of the image data from the spatial axes and conducts separate convolutions along each: a spatial (depthwise) convolution applied to every channel independently, followed by a 1 × 1 (pointwise) convolution that mixes channels. This separation reduces the parameter count of each convolutional layer, lowers computational complexity, and enhances overall efficiency. Furthermore, this structure enhances the model’s capacity to capture both local features and global information within the image, ultimately elevating the model’s ability to comprehend image content. Given these properties, the Xception model exhibits superior generalization performance and overall performance, particularly when dealing with limited data samples. Figure 3 illustrates the ROC curves for all five models, while Table 2 provides an overview of the evaluation metrics for each model.
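
The parameter saving attributed to separable convolution above can be checked with simple arithmetic. The sketch below compares the weight count of a standard convolution with that of a depthwise convolution followed by a 1 × 1 pointwise convolution, ignoring biases; the 3 × 3 kernel and 256 channels are illustrative values, not figures taken from the study.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard convolution: one kernel per (input, output) pair."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Weights in a depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = kernel * kernel * c_in   # one spatial kernel per input channel
    pointwise = c_in * c_out             # 1 x 1 convolution mixing channels
    return depthwise + pointwise


# Illustrative layer: 3 x 3 kernel, 256 input and 256 output channels.
standard = standard_conv_params(3, 256, 256)
separable = separable_conv_params(3, 256, 256)
ratio = standard / separable
```

For this layer the standard convolution needs 589,824 weights and the separable version 67,840, roughly a 8.7-fold reduction. This is a per-layer comparison; the total size of a network also depends on its depth and width.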

Table 2 Diagnostic performance of DL models in the test set

Performance of radiologists with and without DL assistance

The Xception classification model outperformed other models, closely followed by the DenseNet169 model (Table 2). Table 3 presents the performance of both junior and senior radiologists in diagnosing benign and malignant thyroid calcified nodules. The combined AUROC of the two junior radiologists was 0.674, while that of the two senior radiologists was 0.745. These findings indicate that both the Xception and DenseNet169 models outperformed the radiologists, with a statistically significant difference (P < 0.05; Table 4).

Table 3 Diagnostic performance of four radiologists
Table 4 The p-values of the DeLong test for different methods in the test set

Table 3 and Table 5 provide insight into the diagnostic performance of radiologists in distinguishing between benign and malignant thyroid calcified nodules, both with and without the assistance of the Xception model. In the absence of the DL model, the combined AUROC for the two junior radiologists was 0.674 (0.574, 0.774). However, with the DL model’s assistance, the AUROC for the junior radiologists increased to 0.743 (0.650, 0.836). The combined ACC also improved from 0.694 (0.675, 0.855) to 0.753 (0.662, 0.845). Similarly, when not aided by the DL model, the AUROC for Radiologist 4 was 0.719 (0.623, 0.815), and with the DL model’s assistance, it increased to 0.764 (0.674, 0.855). The ACC also improved from 0.729 (0.635, 0.823) to 0.788 (0.701, 0.875).

Table 5 Diagnostic performance of four radiologists aided by the Xception model
Fig. 3

ROC curves of the DL models and the performance of radiologists aided by the Xception model

Attention maps generated by CAM

Upon forwarding the image of the calcified thyroid nodule to the Xception model, a convolution operation is performed, followed by the application of the Rectified Linear Unit (ReLU) activation function to nonlinearly process the extracted features. Subsequently, the image is subjected to depthwise separable convolution for further feature extraction. Skip connections are incorporated within this process to augment feature utilization. After multiple repetitions of this block, global average pooling (GAP) is employed to reduce the dimensionality of the features. The fully connected layers operate on the pooled features, facilitating feature learning and classification prediction, ultimately producing the benign and malignant probability scores for calcified nodules. The Xception model workflow is illustrated in Fig. 4. Throughout this process, we used class activation mapping (CAM) to generate attention maps highlighting the regions most critical to the prediction for each nodule. Figure 5 displays these maps, with the red areas indicating the regions of primary focus for the DL model. The nodules shown are instances where doctors’ judgments were incorrect, but the model made accurate assessments.
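
As a rough illustration of how an attention map of this kind is formed, the sketch below implements the original CAM formulation: each feature map of the last convolutional layer is multiplied by the weight connecting its pooled value to the class of interest, and the weighted maps are summed. This assumes GAP feeds a single linear classification layer, which differs from the multi-layer head described earlier, so the authors' implementation may well differ (the discussion mentions a gradient-based variant). Clipping negative values and rescaling for display are also assumptions, and all names and numbers are hypothetical.

```python
def class_activation_map(feature_maps, class_weights):
    """Sum over channels k of class_weights[k] * feature_maps[k][y][x]."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam


def rescale_to_unit(cam):
    """Clip negatives to zero and divide by the peak so values lie in [0, 1]."""
    clipped = [[max(v, 0.0) for v in row] for row in cam]
    peak = max(max(row) for row in clipped)
    if peak == 0:
        return clipped
    return [[v / peak for v in row] for row in clipped]


# Hypothetical 2-channel, 2 x 2 feature maps and weights for the malignant class.
feature_maps = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
malignant_weights = [0.5, -1.0]

heatmap = rescale_to_unit(class_activation_map(feature_maps, malignant_weights))
```

The resulting low-resolution map would then be upsampled to the input size and overlaid on the ultrasound image, with the largest values rendered in red.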

Fig. 4

Workflow diagram of the Xception model, “B” stands for benign and “M” for malignant

Fig. 5

CAM-Generated attention maps, “B” stands for benign and “M” for malignant


This study evaluated various methods for differentiating between benign and malignant thyroid calcified nodules. Results from the external test set demonstrated that the Xception classification model achieved the highest performance with an AUROC of up to 0.970, followed by the DenseNet169 model with an AUROC of up to 0.959. The Xception classification model, which demonstrated superior generalization ability, provided valuable assistance to radiologists of varying experience levels in differentiating between benign and malignant calcified nodules. These findings underscore the diagnostic potential of DL methods in assisting radiologists to make more accurate clinical diagnoses, potentially reducing the need for unnecessary FNA procedures for thyroid calcified nodules. DL models may outperform radiologists in part because they can discern intricate details beyond human perception. In this study, we employed the gradient-weighted CAM (Grad-CAM) technique to generate attention maps, highlighting regions of interest for the CNN. In the future, additional feature visualization methods for DL models could be used to visualize areas of concern, enabling the model to communicate directly to doctors which features it deems significant when rendering judgments.

Calcification serves as a common indicator in thyroid nodules, and distinct types of calcifications suggest varying degrees of malignancy. Previous studies have reported that malignant calcification in thyroid nodules typically arises from the proliferation of blood vessels, dense fibrous tissue, and the accumulation of calcium salts [10]. Calcification can be present in both benign and malignant thyroid nodules, and the morphological features often overlap [26]. In postoperative histopathological examinations, microcalcification is associated with the presence of psammoma bodies, which are round, lamellar, crystalline calcified deposits measuring between 10 and 100 μm in size and are a characteristic feature of papillary thyroid carcinoma. Hence, microcalcifications are highly indicative of malignancy, with an SPE ranging from 86% to 95% and a PPV varying between 42% and 94% [27]. However, the identification of microcalcifications in ultrasound images is highly dependent on the scanning angle and the radiologists’ experience, and outcomes may differ across imaging machines. Regarding peripheral calcification, certain studies [4, 28, 29] have suggested that the disruption and thickening of peripheral calcification, along with the presence of a peripheral halo around thyroid nodules, strongly indicate a heightened likelihood of malignancy. However, several other studies [17, 30] refute this conclusion, possibly due to the subjective interpretation of radiologists regarding peripheral calcification [30, 31]. Similarly, previous studies have reported conflicting and inconsistent results regarding the malignant risk associated with macrocalcification [32,33,34,35,36,37]. This variability may stem from variations in the composition and echogenic characteristics displayed in ultrasound images of thyroid nodules.
Discrepancies may also arise from differences in how radiologists define distinct calcification types and the specific attributes of the study populations. Ha et al. [37] integrated the Thyroid Imaging, Reporting and Data System (TI-RADS) to stratify the malignant risk of thyroid nodules with different echogenic foci types. Their findings revealed that the PPV for nodules featuring large echogenic foci without shadowing, macrocalcification, peripheral curvilinear or eggshell echogenic foci, with or without shadowing, was relatively low (33.3–56.4%). However, when the highly suspicious categories within TI-RADS were combined, the PPV notably increased to a range of 50.0–90.9%. In our study, the PPV achieved by the DenseNet169 and Xception models could reach as high as 90.9% and 91.4%, surpassing the traditional combination of TI-RADS for assessing the malignant risk associated with calcified nodules.

Both US and FNA have limitations in determining the malignant risk of thyroid calcified nodules. FNA, for instance, can only observe morphological and structural changes in a limited number of cells. Additional constraints include a limited view of the overall tissue structure and the potential for undetected cases due to unsatisfactory sampling. Consequently, there is a pressing need for supplementary diagnostic methods. DL methods offer potential advantages over traditional diagnostic techniques due to their objectivity and reproducibility. These methods operate by automatically classifying images through the training of CNNs on extensive datasets. A CNN comprises multiple convolutional layers capable of automatically extracting meaningful features from input data and integrating them as they pass through deeper layers. Specifically, CNNs excel at automatically classifying images by extracting optimal features, identifying and analyzing the characteristics of thyroid nodules, and effectively distinguishing between benign and malignant nodules [38, 39]. Given that some radiologists may overestimate the malignant risk associated with calcification, it is crucial to maintain a high SPE in the identification of thyroid calcified nodules to minimize unnecessary FNA or surgery for benign calcified nodules. The DL method proposed in this study demonstrates its ability to enhance the diagnostic accuracy of radiologists with varying levels of experience in distinguishing between benign and malignant thyroid calcified nodules, potentially averting unnecessary invasive procedures for benign calcified nodules. A distinguishing feature of the Xception model compared to the other four models lies in its use of depthwise separable convolution within the model structure. This structural choice reduces the parameter count and computational complexity of each convolutional layer.
Simultaneously, this structural element enhances the model’s capability to capture features across feature maps of different scales during feature extraction, thereby strengthening its capacity to comprehend and convey image content. Furthermore, this enhancement allows the Xception model to exhibit robust learning capabilities even with limited sample data, achieving superior performance with the same volume of data. Although the AUROC of the Xception model reached 0.970, surpassing the performance of the other models and radiologists with varying levels of experience, the radiologists’ performance did not match or exceed that of the Xception model, even when aided by the DL model. Radiologists displayed only marginal improvements in diagnostic proficiency when comparing the results with and without the DL model’s assistance. This divergence could potentially be attributed to radiologists’ lack of awareness regarding the performance of our DL model during their second diagnosis, fostering a heightened sense of confidence and subjectivity. The noticeable decline in performance exhibited by Radiologist 3 further underscores the significant subjectivity that can persist among radiologists during the process of assisted image interpretation. In future studies, we intend to inform radiologists of the model’s performance, which may lead to improved DL-assisted image interpretation. Additionally, it is worth noting that we only provided static images, and the DL model could identify numerous features not discernible to the naked eye. Nonetheless, the limited cut surface may have resulted in the loss of valuable information for radiologists. As demonstrated in Tables 3 and 5, junior radiologists were able to match or even surpass the performance of the senior radiologists with the assistance of the DL model.

Our study has several limitations that warrant acknowledgment. Firstly, selection bias was inevitable as we exclusively included calcified nodules with confirmed pathological diagnosis. A significant proportion of benign nodules do not undergo pathological examination, potentially contributing to the higher incidence of malignancy observed in our study. Secondly, the static ultrasound images used in this study offer a limited angle for viewing thyroid nodules, possibly resulting in lower ACC for radiologists compared to dynamic video identification. Future studies should consider including dynamic videos to enhance accuracy. Thirdly, while radiologists were tasked with assessing benign and malignant thyroid nodules in our study, in clinical practice, radiologists may prescribe FNA for certain suspicious benign nodules. Consequently, the performance of radiologists in our study may have been overestimated. Finally, our study specifically focused on thyroid calcified nodules, thus neglecting consideration of other non-calcified nodules. Early nodule screening relies heavily on the expertise of radiologists and the capabilities of imaging equipment, especially given the small size and concealed location of nodules. Our approach, being an AI method based on ultrasound images, is inherently limited in this regard. Alternatively, the utilization of T-cell receptor (TCR) sequencing data offers a promising avenue for early cancer diagnosis and demonstrates greater generalizability than our approach, which solely targets thyroid calcified nodules [40,41,42]. However, this method is expensive and does not accurately pinpoint the nodule’s location, potentially impacting subsequent treatment planning. In the future, it would be prudent to integrate TCR data with existing imaging screening methods and assess whether it can elevate the diagnostic proficiency of doctors.


In conclusion, our study trained and validated DL models using 1265 images of 546 nodules in Center 1, with an external test set of 126 images of 85 nodules from Center 2. Our findings affirm that DL methods outperform radiologists in the evaluation of thyroid nodules with calcification, establishing them as valuable adjunctive tools. However, further training and validation on multicenter data are necessary before integrating this method into clinical practice.

Data Availability

The datasets generated and/or analysed during the current study are not publicly available due to privacy restrictions but are available from the corresponding author on reasonable request. The neural networks used in our AI system were developed in TensorFlow 2.4.0 (GPU). Code for preprocessing the data and running the inference, including the weights of the neural networks, sufficient to evaluate our system on other datasets, is available for research purposes upon request to the corresponding author. Requests will be answered within one week. At this point, we are not sharing the code publicly in order not to compromise potential commercialization of our system.





AI: Artificial intelligence
AUROC: Area under the receiver-operator characteristic curve
BI-RADS: Breast Imaging-Reporting and Data System
CAM: Class activation mapping
CNN: Convolutional neural network
DICOM: Digital Imaging and Communications in Medicine
DL: Deep learning
FNA: Fine needle aspiration
GAP: Global average pooling
JPG: Joint Photographic Experts Group
NPV: Negative predictive value
PPV: Positive predictive value
ReLU: Rectified Linear Unit
ROC: Receiver operator characteristic
TCR: T cell receptor
TI-RADS: Thyroid Imaging, Reporting and Data System




  1. Karkada M, Costa AF, Imran SA, Hart RD, Bullock M, Ilie G, et al. Incomplete thyroid Ultrasound reports for patients with thyroid nodules: implications regarding Risk Assessment and Management. AJR Am J Roentgenol. 2018;211(6):1348–53.


  2. Kobaly K, Kim CS, Mandel SJ. Contemporary management of thyroid nodules. Annu Rev Med. 2022;73:517–28.


  3. Alexander EK, Doherty GM, Barletta JA. Management of thyroid nodules. Lancet Diabetes Endocrinol. 2022;10(7):540–8.

  4. Haugen BR, Alexander EK, Bible KC, Doherty GM, Mandel SJ, Nikiforov YE, et al. 2015 American Thyroid Association Management Guidelines for adult patients with thyroid nodules and differentiated thyroid cancer: the American Thyroid Association Guidelines Task Force on thyroid nodules and differentiated thyroid cancer. Thyroid. 2016;26(1).

  5. Ha EJ, Chung SR, Na DG, Ahn HS, Chung J, Lee JY, et al. 2021 Korean Thyroid Imaging Reporting and Data System and Imaging-Based Management of Thyroid Nodules: Korean Society of Thyroid Radiology Consensus Statement and Recommendations. Korean J Radiol. 2021;22(12):2094–123.

  6. Todsen T, Bennedbaek FN, Kiss K, Hegedüs L. Ultrasound-guided fine-needle aspiration biopsy of thyroid nodules. Head Neck. 2021;43(3):1009–13.

  7. Lu Z, Mu Y, Zhu H, Luo Y, Kong Q, Dou J, et al. Clinical value of using ultrasound to assess calcification patterns in thyroid nodules. World J Surg. 2011;35(1):122–7.

  8. Chen G, Zhu XQ, Zou X, Yao J, Liang JX, Huang HB, et al. Retrospective analysis of thyroid nodules by clinical and pathological characteristics, and ultrasonographically detected calcification correlated to thyroid carcinoma in South China. Eur Surg Res. 2009;42(3):137–42.

  9. Khoo MLC, Asa SL, Witterick IJ, Freeman JL. Thyroid calcification and its association with thyroid carcinoma. Head Neck. 2002;24(7):651–5.

  10. Yin L, Zhang W, Bai W, He W. Relationship between morphologic characteristics of Ultrasonic calcification in thyroid nodules and thyroid carcinoma. Ultrasound Med Biol. 2020;46(1):20–5.

  11. Lacout A, Chevenet C, Thariat J, Marcy PY. Thyroid calcifications: a pictorial essay. J Clin Ultrasound. 2016;44(4):245–51.

  12. Tessler FN, Middleton WD, Grant EG, Hoang JK, Berland LL, Teefey SA, et al. ACR thyroid imaging, reporting and Data System (TI-RADS): White Paper of the ACR TI-RADS Committee. J Am Coll Radiol. 2017;14(5):587–95.

  13. Tappouni RR, Itri JN, McQueen TS, Lalwani N, Ou JJ. ACR TI-RADS: pitfalls, solutions, and future directions. Radiographics. 2019;39(7):2040–52.

  14. Lee J, Lee SY, Cha S-H, Cho BS, Kang MH, Lee O-J. Fine-needle aspiration of thyroid nodules with macrocalcification. Thyroid. 2013;23(9):1106–12.

  15. Erdem Toslak I, Martin B, Barkan GA, Kılıç AI, Lim-Dunham JE. Patterns of Sonographically detectable echogenic Foci in Pediatric thyroid carcinoma with corresponding histopathology: an observational study. AJNR Am J Neuroradiol. 2018;39(1):156–61.

  16. Gwon HY, Na DG, Noh BJ, Paik W, Yoon SJ, Choi SJ, et al. Thyroid nodules with isolated macrocalcifications: malignancy risk of isolated macrocalcifications and postoperative risk stratification of malignant tumors manifesting as isolated macrocalcifications. Korean J Radiol. 2020;21(5):605–13.

  17. Shin HS, Na DG, Paik W, Yoon SJ, Gwon HY, Noh BJ, et al. Malignancy risk stratification of thyroid nodules with macrocalcification and Rim Calcification based on Ultrasound patterns. Korean J Radiol. 2021;22(4):663–71.

  18. Liu H, Chen Y, Zhang Y, Wang L, Luo R, Wu H, et al. A deep learning model integrating mammography and clinical factors facilitates the malignancy prediction of BI-RADS 4 microcalcifications in breast cancer screening. Eur Radiol. 2021;31(8):5902–12.

  19. Patel NR, Setya K, Pradhan S, Lu M, Demer LL, Tintut Y. Microarchitectural changes of Cardiovascular calcification in response to in vivo interventions using deep-learning segmentation and computed Tomography Radiomics. Arterioscler Thromb Vasc Biol. 2022;42(8):e228–e41.

  20. Nam JG, Kim M, Park J, Hwang EJ, Lee JH, Hong JH et al. Development and validation of a deep learning algorithm detecting 10 common abnormalities on chest radiographs. Eur Respir J. 2021;57(5).

  21. Yao Jincao LZ, Yue Wenwen F, Bojian L, Wei O, Di F, Na L, Yidan X, Jing C, Wencong Y, Chen W, Lijing W, Liping L, Junping W, Peiying. Xu Hui-Xiong, Xu Dong. DeepThy-Net: a Multimodal Deep Learning Method for Predicting Cervical Lymph Node Metastasis in Papillary thyroid Cancer. Adv Intell Syst. 2022.

  22. Zhang H, Cisse M, Dauphin YN, Lopez-Paz D. mixup: Beyond Empirical Risk Minimization. Learning. 2017.

  23. DeVries T, Taylor GW. Improved Regularization of Convolutional Neural Networks with Cutout. Computer Vision and Pattern Recognition. 2017.

  24. Wu Y. Joint comparison of the predictive values of multiple binary diagnostic tests: an extension of McNemar’s test. J Biopharm Stat. 2022.

  25. Leisenring W, Alonzo T, Pepe MS. Comparisons of predictive values of binary medical diagnostic tests for paired designs. Biometrics. 2000;56(2):345–51.

  26. Kwak JY, Han KH, Yoon JH, Moon HJ, Son EJ, Park SH, et al. Thyroid imaging reporting and data system for US features of nodules: a step in establishing better stratification of cancer risk. Radiology. 2011;260(3):892–9.

  27. Rago T, Vitti P. Risk stratification of thyroid nodules: from Ultrasound features to TIRADS. Cancers (Basel). 2022;14(3).

  28. Kim BM, Kim MJ, Kim E-K, Kwak JY, Hong SW, Son EJ, et al. Sonographic differentiation of thyroid nodules with eggshell calcifications. J Ultrasound Med. 2008;27(10):1425–30.

  29. Park M, Shin JH, Han B-K, Ko EY, Hwang HS, Kang SS, et al. Sonography of thyroid nodules with peripheral calcifications. J Clin Ultrasound. 2009;37(6):324–8.

  30. Malhi HS, Velez E, Kazmierski B, Gulati M, Deurdulian C, Cen SY, et al. Peripheral thyroid nodule calcifications on Sonography: evaluation of malignant potential. AJR Am J Roentgenol. 2019;213(3):672–5.

  31. Hoang JK, Middleton WD, Farjat AE, Teefey SA, Abinanti N, Boschini FJ, et al. Interobserver variability of Sonographic features used in the American College of Radiology thyroid imaging reporting and Data System. AJR Am J Roentgenol. 2018;211(1):162–7.

  32. Malhi H, Beland MD, Cen SY, Allgood E, Daley K, Martin SE, et al. Echogenic foci in thyroid nodules: significance of posterior acoustic artifacts. AJR Am J Roentgenol. 2014;203(6):1310–6.

  33. Zayadeen AR, Abu-Yousef M, Berbaum K. JOURNAL CLUB: retrospective evaluation of Ultrasound features of thyroid nodules to assess malignancy risk: a step toward TIRADS. AJR Am J Roentgenol. 2016;207(3):460–9.

  34. Frates MC, Benson CB, Doubilet PM, Kunreuther E, Contreras M, Cibas ES, et al. Prevalence and distribution of carcinoma in patients with solitary and multiple thyroid nodules on sonography. J Clin Endocrinol Metab. 2006;91(9):3411–7.

  35. Seo H, Na DG, Kim J-H, Kim KW, Yoon JW. Ultrasound-based risk stratification for malignancy in thyroid nodules: a four-tier categorization system. Eur Radiol. 2015;25(7):2153–62.

  36. Moon W-J, Jung SL, Lee JH, Na DG, Baek J-H, Lee YH, et al. Benign and malignant thyroid nodules: US differentiation–multicenter retrospective study. Radiology. 2008;247(3):762–70.

  37. Ha SM, Chung YJ, Ahn HS, Baek JH, Park SB. Echogenic foci in thyroid nodules: diagnostic performance with combination of TIRADS and echogenic foci. BMC Med Imaging. 2019;19(1):28.

  38. Mazurowski MA, Buda M, Saha A, Bashir MR. Deep learning in radiology: an overview of the concepts and a survey of the state of the art with focus on MRI. J Magn Reson Imaging. 2019;49(4):939–54.

  39. Buda M, Wildman-Tobriner B, Hoang JK, Thayer D, Tessler FN, Middleton WD, et al. Management of thyroid nodules seen on US images: Deep Learning May Match Performance of Radiologists. Radiology. 2019;292(3):695–701.

  40. Xu Y, Qian X, Zhang X, Lai X, Liu Y, Wang J. DeepLION: Deep Multi-Instance Learning improves the prediction of Cancer-Associated T Cell receptors for Accurate Cancer Detection. Front Genet. 2022;13:860510.

  41. Beshnova D, Ye J, Onabolu O, Moon B, Zheng W, Fu Y-X, et al. De novo prediction of cancer-associated T cell receptors for noninvasive cancer detection. Sci Transl Med. 2020;12(557).

  42. Ostmeyer J, Christley S, Toby IT, Cowell LG. Biophysicochemical Motifs in T-cell receptor sequences distinguish repertoires from Tumor-Infiltrating lymphocyte and adjacent healthy tissue. Cancer Res. 2019;79(7):1671–80.


Acknowledgements

Not applicable.


Funding

This research was funded by the National Natural Science Foundation of China, grant number 82071946; the “Pioneer” and “Leading Goose” R&D Program of Zhejiang, grant number 2023C04039; the Research Program of National Health Commission Capacity Building and Continuing Education Center, grant number CSJRZC2021JJSJ001; the Natural Science Foundation of Zhejiang Province, grant number LZY21F030001; and the Research Program of Zhejiang Provincial Department of Health, grant numbers 2021KY099 and 2022KY110.

Author information

Authors and Affiliations



CC conceived and designed the study. WK, ZML, PQM, WH, WJX and QXQ were responsible for data correction and interpretation. CC and LYZ were responsible for image data analysis. SF, TY, GL and YYJ were responsible for literature research. LYZ was responsible for statistical analysis. CC was a major contributor in writing the manuscript. XD, WYF and YJC were responsible for manuscript reviewing and editing. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Yifan Wang or Dong Xu.

Ethics declarations

Ethics approval and consent to participate

The protocol of this retrospective study was approved by the Independent Ethics Committee of Zhejiang Cancer Hospital (No. IRB-2020-287) and the Medical Ethics Committee of Taizhou Cancer Hospital (No. IRB-2023-001-IIT). Both committees waived the requirement for written informed consent from patients. Patient records were anonymized and de-identified before analysis. We confirm that all methods were performed in accordance with the relevant guidelines and regulations.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Chen, C., Liu, Y., Yao, J. et al. Deep learning approaches for differentiating thyroid nodules with calcification: a two-center study. BMC Cancer 23, 1139 (2023).
