Artificial intelligence for pre-operative lymph node staging in colorectal cancer: a systematic review and meta-analysis

Background Artificial intelligence (AI) is increasingly being used in medical imaging analysis. We aimed to evaluate the diagnostic accuracy of AI models used for detection of lymph node metastasis on pre-operative staging imaging for colorectal cancer. Methods A systematic review was conducted according to PRISMA guidelines using a literature search of PubMed (MEDLINE), EMBASE, IEEE Xplore and the Cochrane Library for studies published from January 2010 to October 2020. Studies reporting on the accuracy of radiomics models and/or deep learning for the detection of lymph node metastasis in colorectal cancer by CT/MRI were included. Conference abstracts and studies reporting accuracy of image segmentation rather than nodal classification were excluded. The quality of the studies was assessed using a modified questionnaire of the QUADAS-2 criteria. Characteristics and diagnostic measures from each study were extracted. Pooled estimates of the area under the receiver operating characteristic curve (AUROC) were calculated in a meta-analysis. Results Seventeen eligible studies were identified for inclusion in the systematic review, of which 12 used radiomics models and five used deep learning models. High risk of bias was found in two studies and there was significant heterogeneity among radiomics papers (73.0%). In rectal cancer, there was a per-patient AUROC of 0.808 (0.739–0.876) and 0.917 (0.882–0.952) for radiomics and deep learning models, respectively. Both models performed better than the radiologists, who had an AUROC of 0.688 (0.603–0.772). Similarly, in colorectal cancer, radiomics models with a per-patient AUROC of 0.727 (0.633–0.821) outperformed the radiologists, who had an AUROC of 0.676 (0.627–0.725). Conclusion AI models have the potential to predict lymph node metastasis more accurately in rectal and colorectal cancer; however, radiomics studies are heterogeneous and deep learning studies are scarce. Trial registration PROSPERO CRD42020218004.
Supplementary Information The online version contains supplementary material available at 10.1186/s12885-021-08773-w.


Background
Colorectal cancer (CRC) is the second most common malignancy and the third leading cause of cancer-related mortality in the world, accounting for 862,000 deaths annually [1]. CRC nodal metastases play a pivotal role in disease-free survival and in determining appropriate adjuvant and neoadjuvant treatment [2]. As a result of the application of preoperative staging MRI in patients with rectal cancer, neoadjuvant chemoradiation has become the standard of care in locally advanced tumours, resulting in improved local control and resectability. Owing to the lower accuracy of lymph node staging in colon cancer at diagnosis, neoadjuvant treatment is not as commonly recommended [3,4]. However, this may change following the results of the recent Fluoropyrimidine, Oxaliplatin and Targeted Receptor Pre-Operative Therapy (FOXTROT) trial showing the safety and efficacy of neoadjuvant chemotherapy in patients with locally advanced colon cancer [5]. Therefore, improved accuracy in clinical nodal staging at diagnosis may become critical in surgical planning and targeting effective neoadjuvant treatment for these patients [6,7].
Clinical staging of CRC is typically performed by radiologists assessing contrast-enhanced computed tomography (CT) images in patients with colorectal cancer and, in addition, magnetic resonance imaging (MRI) in patients with rectal cancer. The staging accuracy of CT and MRI is affected by multiple factors, such as equipment performance, standardised imaging protocols, the reporting radiologist's experience, and patient-specific factors. Overall, published series have reported a 70% accuracy of diagnosing lymph node metastasis on CT, and 69% on MRI, using standard criteria [8,9].
The limited diagnostic and staging accuracy of current staging paradigms may be overcome by using artificial intelligence (AI) models. AI-enabled radiomics involves the extraction of a large number of investigator-defined features from medical images using advanced computational algorithms [10]. While radiomics models have been used to predict lymph node metastasis in CRC with partial success, previous studies by Ding et al. and Wang et al. demonstrate that deep learning algorithms have the potential to identify more subtle patterns that may elude conventional radiological and statistical methods [11][12][13]. Deep learning is a technique that uses convolutional neural networks to train an algorithm on useful representations of images, thus bypassing the step of extracting manually designed features [14]. In recent years, radiomics nomograms and deep learning models have started to make a meaningful contribution to radiological diagnoses [15].
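To make "investigator-defined features" concrete, the sketch below computes a small subset of first-order intensity features from a 2-D region of interest. The function name, feature subset and 16-bin histogram are illustrative choices only; real radiomics pipelines (e.g. PyRadiomics) extract hundreds of shape, intensity and texture features that then feed a classifier.

```python
import numpy as np

def first_order_features(roi):
    """Compute a small, illustrative subset of first-order radiomic
    features from a 2-D region of interest (real pipelines extract
    hundreds of shape, intensity and texture features)."""
    values = np.asarray(roi, dtype=float).ravel()
    hist, _ = np.histogram(values, bins=16)   # 16-bin intensity histogram
    p = hist / hist.sum()
    p = p[p > 0]                              # drop empty bins before log
    return {
        "mean": float(values.mean()),               # average intensity
        "std": float(values.std()),                 # intensity spread
        "entropy": float(-np.sum(p * np.log2(p))),  # histogram entropy
    }
```

A perfectly homogeneous region yields zero standard deviation and zero entropy; nodal heterogeneity on imaging shows up as higher values of such texture-type features.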
The aim of this systematic review and meta-analysis is to evaluate the accuracy of AI models in diagnosing lymph node metastasis on CT and/or MRI in colorectal cancer patients.

Search strategy
This systematic review and meta-analysis was performed according to the recommendations of the Preferred Reporting Items for Systematic Review and Meta-Analyses (PRISMA) guidelines and was registered with the International Prospective Register of Systematic Reviews with an analysis plan prior to conducting the research. A systematic search of the Cochrane Library, PubMed (MEDLINE), EMBASE and IEEE Xplore databases was performed for studies published between January 1st 2010 and October 1st 2020. The following search terms were used: artificial intelligence, deep learning, convolutional neural network, machine learning, automatic detection, radiomics, radiomic, CT/MRI, lymph node, lymph node metastasis, colon, rectal, colorectal (Additional file 1: Table S1). Reference lists of articles retrieved were also searched manually to identify additional eligible studies.

Study selection
Articles were included if they met the following criteria: (1) included patients with histopathological diagnosis of CRC; (2) developed or used a radiomics or deep learning algorithm to assess CT or MRI pre-operative lymph node metastasis detection and (3) published in English language. Exclusion criteria were (1) case reports, review articles, editorials, letters, comments, and conference abstracts; (2) studies focusing on segmentation or feature extraction methods only and (3) animal studies. After removing duplicates, titles and abstracts were reviewed for eligibility by two independent reviewers (SB and NNDV) using Covidence systematic review software (Veritas Health Innovation, Melbourne, Australia, available at www.covidence.org). Any disagreements were resolved by consensus arbitrated by a third author (TS).

Data extraction
Data from selected full-text articles were reviewed for reporting on the type of radiomics or deep learning model, study characteristics and outcome measures. The extracted data included the first author, year of publication, country, study type, number of patients, sample size for diagnostic accuracy, age, imaging modality, type of malignancy, AI model, and referenced standard. Data related to the accuracy of the radiologists' assessment derived from studies using clinical nodal staging or clinical nomograms solely based on N-staging was also collected. To obtain diagnostic accuracy data of AI models and radiologists' assessment, two-by-two contingency tables, sensitivity, specificity, accuracy, and area under the receiver operating characteristic curve (AUROC) were extracted or reconstructed. The primary endpoint was AUROC, secondary endpoints included sensitivity, specificity, and accuracy.
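Where a study reported only a two-by-two contingency table, the secondary endpoints can be derived directly from its cells. A minimal sketch with invented counts (not taken from any included study):

```python
# Illustrative 2x2 contingency table for a nodal-status classifier
# (counts are invented for the example, not taken from any study):
# rows = model prediction, columns = pathology reference standard.
tp, fp = 45, 12   # predicted node-positive
fn, tn = 15, 78   # predicted node-negative

sensitivity = tp / (tp + fn)                 # true-positive rate
specificity = tn / (tn + fp)                 # true-negative rate
accuracy = (tp + tn) / (tp + fp + fn + tn)   # overall agreement

print(f"sensitivity={sensitivity:.3f}, "
      f"specificity={specificity:.3f}, accuracy={accuracy:.3f}")
```

For these counts the script prints sensitivity 0.750, specificity 0.867 and accuracy 0.820; the reverse direction (reconstructing the table from reported sensitivity, specificity and sample size) uses the same identities.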

Quality assessment and publication bias
The modified version, as proposed by Sollini et al., of the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool was used to assess the methodological quality of the included studies [16]. Minimum criteria for fulfilling each QUADAS-2 item were discussed by two reviewers (SB and NNDV) and disagreements were resolved by consensus. Publication bias was assessed using the Egger regression test and is presented as a funnel plot of diagnostic AUROC.

Statistical analysis
Meta-analysis was performed using testing set results of studies that presented absolute numbers for AUROC and 95% confidence intervals, contingency tables or provided sufficient information to derive the numbers manually. If results were not reported in an independent test set, cross validation or full test sample results are presented in this review. When results of different AI algorithms were reported in one article, the proposed algorithm with the highest diagnostic performance was analysed.
Three software packages, MedCalc for Windows, version 16.4.3 (MedCalc Software, Ostend, Belgium), RevMan, version 5.3.21 and Meta-DiSc version 1.4, were utilised for statistical analysis. Missing data were computed using a confusion matrix calculator or derived manually using the formulas in Additional file 1: Table S2. Pooling of sensitivity, specificity and AUROC data was conducted using the Mantel-Haenszel method (fixed-effects model) and the DerSimonian-Laird method (random-effects model) [17,18]. To assess heterogeneity between studies, the inconsistency index (I²) was used [19]. Heterogeneity was quantified as low, moderate, and high, with upper limits of 25, 50 and 75% for I², respectively. Forest plots were drawn to show AUROC estimates in each study in relation to the summary pooled estimate. A funnel plot was constructed to visually assess publication bias.
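The random-effects pooling described above can be sketched as follows. This is a generic DerSimonian-Laird implementation with Cochran's Q and the I² index, not the exact MedCalc or Meta-DiSc code, and the function name and inputs are illustrative.

```python
import numpy as np

def dersimonian_laird(effects, variances):
    """Pool per-study effect sizes with a DerSimonian-Laird
    random-effects model. Returns (pooled estimate, standard error,
    I^2 heterogeneity in %). In practice AUROCs are often pooled on
    a transformed (e.g. logit) scale; raw values are used here only
    to keep the sketch short."""
    effects = np.asarray(effects, dtype=float)
    variances = np.asarray(variances, dtype=float)
    w = 1.0 / variances                      # inverse-variance (fixed) weights
    fixed = np.sum(w * effects) / np.sum(w)  # fixed-effects pooled estimate
    k = len(effects)
    q = np.sum(w * (effects - fixed) ** 2)   # Cochran's Q statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)       # between-study variance estimate
    w_star = 1.0 / (variances + tau2)        # random-effects weights
    pooled = np.sum(w_star * effects) / np.sum(w_star)
    se = float(np.sqrt(1.0 / np.sum(w_star)))
    i2 = max(0.0, (q - (k - 1)) / q) * 100.0 if q > 0 else 0.0
    return float(pooled), se, float(i2)
```

For example, two hypothetical studies with AUROCs of 0.60 and 0.90 and equal within-study variance 0.01 pool to 0.75 with I² of roughly 78%, which this review would class as high heterogeneity (I² > 75%).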

Study characteristics
Twelve studies used radiomics models and five used deep learning models (Additional file 1: Table S3). All included studies were published between 2011 and 2020. Study design was retrospective in 11 studies and prospective in six. Fourteen studies were single-center and three were multi-center. Patients were predominantly male, with a median age of 60 years (54–64). Eight studies used MRI and nine used CT to train their algorithm. The type of malignancy was colorectal in three studies, colon only in two, and rectal only in 12. Eleven studies used per-patient diagnostic output (the patient is node positive or negative) and six studies used per-nodal diagnostic output of lymph node metastasis (each individual node analysed separately). Fifteen studies used the postoperative pathology report as reference standard, and one study used a radiology report as the reference standard. The reference standard for the one remaining study was not reported.

Quality assessment and publication bias
The methodologic quality of included studies is summarized in Fig. 2. As per the QUADAS-2 tool, risk of bias in patient selection was low in 15 (88%) studies and high in two (12%) studies. Risk of bias in the index test was high in one study (6%) and low in 16 (94%). Risk of bias in the reference standard test was low in 15 (88%), high in one study (6%) and unclear in one study (6%). For flow and timing, all 17 studies had unclear risk of bias. Overall applicability concerns were low (Additional file 1: Table S4). Funnel plot assessment (Additional file 1: Figure S1) showed no significant publication bias (Egger's intercept 1.11, 95%CI − 1.22 to 3.42, p = 0.313).

Diagnostic accuracy
Of the 12 studies that could be included in the quantitative analysis, ten used radiomics and two used deep learning. For each outcome, summary estimates of sensitivity, specificity and AUROC were produced with 95% confidence intervals on a per-patient and per-nodal basis (Table 1). Pooled colorectal and rectal, per-patient and per-node detailed diagnostic measures reported by individual studies are shown in Table 2. The data for radiomics models in rectal cancer showed high heterogeneity, with the exception of per-node AUROC and sensitivity. On a per-patient basis, radiomics in rectal cancer pooled AUROC was 0.808 (95%CI 0.739–0.876; Fig. 3) and pooled sensitivity and specificity were 0.776 (95%CI 0.685–0.851) and 0.676 (95%CI 0.608–0.739), respectively. On a per-nodal basis, radiomics in rectal cancer pooled AUROC was 0.846 (95%CI 0.803–0.890) and pooled sensitivity and specificity were 0.896 (95%CI 0.834–0.941) and 0.743 (95%CI 0.665–0.811), respectively. On a per-patient basis, radiomics in CRC pooled AUROC was 0.727 (95%CI 0.633–0.821). The radiologists' per-patient assessment in rectal cancer had a pooled AUROC of 0.688 (95%CI 0.603–0.772) (Table 1).

Discussion
To our knowledge, this is the first systematic review and meta-analysis of deep learning and radiomics performance in the assessment of lymph node metastasis in colorectal cancer.

A number of research studies have already suggested that AI has the potential to transform the healthcare sector, particularly in areas where image recognition can be applied [35][36][37]. In terms of colorectal diseases, AI has been applied to colonic polyps, adenomas, colorectal cancer, ulcerative colitis and intestinal motility disorders [38][39][40][41]. Owing to the rapid development of AI technology, AI is bound to play an increasingly important role in the field of colorectal diagnosis and treatment [42]. Furthermore, the increase in computing power, paired with the availability of large imaging databases, offers the opportunity to develop more accurate AI algorithms [10]. At present, applications of deep learning to medical imaging are in vogue. However, deep learning models have several drawbacks, including sensitivity to variability in the images, the need for large sample sizes, poor generalization and extensive computing requirements. These models tend to rely on superficial data patterns and often fail when external factors, such as different imaging acquisition parameters and types of scanners, cause a distribution shift [43].
In this review, most studies used radiomics (n = 12) rather than deep learning methodology (n = 5), largely owing to deep learning technology being more recent, but also because it requires specific expertise. This limits the ability to draw definitive comparisons between the two AI models, as one is over-represented in the data. Additionally, most studies were retrospective in design, making them prone to confounding and selection bias. Several studies focused on the technical aspects of the algorithm and did not address key limitations such as input variation, absence of clinical information (age, tumour site, patient history) and potential data overfitting, often caused by noise in the data, overcomplicated models, and small sample sizes. Another issue, particularly common in deep learning studies, is the failure to report contingency tables or sufficient detail to enable their reconstruction. We had to exclude five (29%) studies from the meta-analysis due to incomplete data. Most studies were conducted at a single center and used internal verification or resampling methods (cross validation). Internal validation, however, tends to overestimate the AUROC owing to the model's lack of generalizability, limiting the integration of AI models into the clinical setting [44]. Therefore, external validation of prediction models using images from different hospitals is required to create reliable estimates of the level of performance at other sites [45]. The number of studies diagnosing lymph node metastasis on a per-nodal basis was also limited. This review has several limitations. Firstly, few deep learning studies were available for inclusion. This, along with the heterogeneity seen in radiomics studies, means that the summary estimates of AUROCs have to be interpreted with caution. Secondly, because of incomplete reporting of results by several studies, estimates of diagnostic performance were calculated using limited data. Thirdly, given that the majority of the included studies originate from China, there is a potential for geographical bias.
Lastly, the wide range of scanner types, imaging protocols, and criteria for lymph node metastasis used may have affected the accuracy of results. Results for radiomics and the radiologist assessment were highly heterogeneous, which may be attributed to the different imaging modalities and small sample sizes. In the future, diagnostic AI models will have to be rigorously evaluated on their clinical benefit in comparison to the current standard of care, as not all are suitable for clinical practice. Therefore, studies comparing AI with clinicians' performance are most valuable and are more likely to ensure safe and effective implementation of AI technology into daily practice [46,47].

Conclusion
AI models have the potential to predict lymph node metastasis more accurately on a per-patient basis in colorectal cancer than the radiologists' assessment; however, radiomics studies are heterogeneous and deep learning studies are scarce. With further development and refinement, AI models capable of accurately predicting nodal stage may represent a significant advance in preoperative staging of colorectal cancer, to better inform clinicians and patients.
Abbreviations CRC: Colorectal cancer; MRI: Magnetic resonance imaging; CT: Computed tomography; AI: Artificial intelligence; QUADAS-2: Quality Assessment of Diagnostic Accuracy Studies tool 2; AUROC: Area under the receiver operating characteristic curve
Additional file 1: Table S1. Search Strategy. Table S2. Diagnostic accuracy measures. Table S3. Selected characteristics of included studies. Table S4. Quality assessment of studies included in systematic review, according to the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) Tool adapted with signalling questions by Sollini et al.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.