A systematic review reporting quality of radiomics research in neuro-oncology: toward clinical utility and quality improvement using high-dimensional imaging features

Background To evaluate radiomics analysis in neuro-oncologic studies according to a radiomics quality score (RQS) system to find room for improvement in clinical use. Methods PubMed and Embase were searched using the terms radiomics or radiogenomics and gliomas or glioblastomas until February 2019. From 189 articles, 51 original research articles reporting the diagnostic, prognostic, or predictive utility were selected. The quality of the methodology was evaluated according to the RQS. The adherence rates for the six key domains were evaluated: image protocol and reproducibility, feature reduction and validation, biologic/clinical utility, performance index, a high level of evidence, and open science. Subgroup analyses for journal type (imaging vs. clinical) and biomarker (diagnostic vs. prognostic/predictive) were performed. Results The median RQS was 11 out of 36 and the adherence rate was 37.1%. Only 29.4% performed external validation. The adherence rate was high for reporting imaging protocol (100%), feature reduction (94.1%), and discrimination statistics (96.1%), but low for conducting test-retest analysis (2%), prospective study (3.9%), demonstrating potential clinical utility (2%), and open science (5.9%). None of the studies conducted a phantom study or cost-effectiveness analysis. Prognostic/predictive studies received higher scores than diagnostic studies for comparison to a gold standard (P < .001), use of calibration (P = .02), and cut-off analysis (P = .001). Conclusions The quality of reporting of radiomics studies in neuro-oncology is currently insufficient. Validation using external datasets is necessary, and improvements need to be made in feature reproducibility, demonstration of clinical utility, pursuit of a higher level of evidence, and open science.


Background
Radiomics is a powerful tool for developing and testing medical hypotheses, involving the use of high-dimensional quantitative imaging features for predictive purposes. The extraction of high-throughput quantitative features and the use of sophisticated bioinformatics tools enable the development of models with potential diagnostic, prognostic, or predictive utility in cancer studies [1][2][3]. In the field of neuro-oncology, a large number of radiomics studies have demonstrated their diagnostic, prognostic, and predictive use in differential diagnosis [4,5], molecular classification [6][7][8], survival analysis [9,10], and response to antiangiogenic treatment [11].
Although radiomics research shows great potential, its current use is rather confined to the academic literature, without real-world clinical applications. This is in part due to a lack of efficient and effective strategies for biomarker translation [12], which hampers the development of radiomics as an imaging biomarker that can cross the 'translational gap' and guide clinical decisions [13,14]. A standardized evaluation of the performance, reproducibility, and/or clinical utility of radiomics biomarkers is needed. In response to the great need for qualified reporting, a system of metrics to determine the validity and completeness of radiomics studies was developed by Lambin et al. [2] in the form of the radiomics quality score (RQS). The RQS comprises 16 components, chosen to emulate the Transparent Reporting of a multivariable prediction model for Individual Prognosis OR Diagnosis (TRIPOD) initiative [15]. These are applied to a radiomics-specific design that considers high-dimensional data and modeling, and emphasizes clinical adoption of modeling research as in the TRIPOD guidelines.
A subsequent RQS study from the developer [3] reported an average score of less than 50% over 41 radiomics studies using various modalities and conditions, including ultrasound, computed tomography, positron emission tomography, and magnetic resonance imaging (MRI). However, these results do not represent the field of neuro-oncology, as both the disease and the imaging modality vary. Most neuro-oncologic imaging studies are based on MRI, for which generalizability and robustness in radiomics analysis are particularly challenging because it has non-standardized pixel values and large variations in signal intensity. To our knowledge, the quality of the science and reporting in radiomics research studies in neuro-oncologic imaging is largely unknown.
In this study, we evaluated the radiomics analysis conducted in previous publications, summarizing six domains from the RQS: image protocol and feature reproducibility, feature reduction and validation, biologic/clinical validation and utility, performance index, high level of evidence, and open science. Our intention was to promote the quality of radiomics research studies as diagnostic, prognostic, and/or predictive biomarkers, to allow radiomics to become a viable tool for medical decision-making by facilitating the combined analysis of clinical data and high-throughput imaging features. The purpose of our study was to evaluate the quality of reporting of radiomics-based analysis in neuro-oncologic studies using the RQS.

Article search strategy and study selection
To identify all potentially relevant original research papers published in the neuro-oncology field, a database search was conducted in the MEDLINE (National Center for Biotechnology Information, NCBI) and EMBASE databases for articles published at any time up to February 28, 2019. The search terms used to find radiomics studies were ("glioma" OR "glioblastoma") AND ("radiomic" OR "radiogenomic"). The search identified 293 candidate articles. Retrieved articles were screened for eligibility. After removal of 122 duplicates, the abstracts of the remaining 171 articles were screened. Abstract review excluded a further 98 articles for the following reasons: 31 non-radiomics studies, 27 reviews, 26 technical notes, 10 conference abstracts, 2 letters or opinions, 1 case report, and 1 animal study. Full-text reviews of the 73 potentially eligible articles were performed by two experienced reviewers (H.S.K., with 20 years of experience in neuro-oncologic imaging, and J.H.K., with 25 years of experience in neurosurgery), who selected in consensus the articles that tested performance as a diagnostic, prognostic, or predictive biomarker. Radiogenomics studies that searched only for correlations and did not contain a performance measurement for a diagnostic, prognostic, or predictive biomarker were excluded. This process removed a further 21 articles, as they did not include performance tests for diagnostic, prognostic, or predictive utility. Finally, 52 articles were included in the main analysis (Fig. 1).
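The screening counts in this paragraph can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the numbers stated above; the variable names are illustrative and the snippet is not part of the original review workflow.

```python
# Screening flow using only the counts stated in the paragraph above.
# All variable names are illustrative.
identified = 293
duplicates = 122

abstract_exclusions = {
    "non-radiomics studies": 31,
    "reviews": 27,
    "technical notes": 26,
    "conference abstracts": 10,
    "letters or opinions": 2,
    "case reports": 1,
    "animal studies": 1,
}
full_text_exclusions = 21

# Articles left after each stage of the selection process.
screened = identified - duplicates
full_text_reviewed = screened - sum(abstract_exclusions.values())
included = full_text_reviewed - full_text_exclusions

print(screened, full_text_reviewed, included)
```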

Data extraction and analysis
Three reviewers (J.E.P. and D.K., with 5 and 2 years of experience in neuro-oncologic imaging, respectively, and H.S.K.) evaluated the eligible radiomics performance studies. Before the analysis, the reviewers attended a seminar to review and discuss the items listed in the RQS and to ensure that they all had a clear understanding of it.
The detailed RQS with its 16 components is defined elsewhere [2] (Additional file 1: Table S1). The reviewers extracted the data using a predetermined RQS evaluation according to six domains. Domain 1 covered protocol quality and reproducibility in image and segmentation: well-
Each article was evaluated by two of the three independent reviewers. Disagreements between any two reviewers were discussed at a research meeting attended by all three reviewers and an additional statistical reviewer. The following topics were subject to some initial disagreement and were discussed until a consensus was reached.
1) Multiple segmentation (domain 1): when there were two or more readers, the article earned an additional point as segmentation variability was considered. Automatic segmentation using a convolutional neural network or other automatic software earned a point as the method pursued better segmentation reproducibility.
2) Validation (domain 2): validation was defined as missing (−5 points) when the article performed cross-validation or nested cross-validation using only the training data, as validation needs to be performed without retraining and without adaptation of the cut-off value derived from the training data.
3) Multivariate analysis with non-radiomics features (domain 3): When the main study endpoint was survival and the selected radiomics features were further correlated with non-radiomics features, e.g., MGMT (O-6-methylguanine-DNA methyltransferase) promoter methylation status, the article earned an additional point. However, if the main outcome was prediction of IDH (isocitrate dehydrogenase) mutation and the radiomics features were selected for IDH mutation only, the article did not gain any additional score, as this would not provide a more holistic radiomics model.
4) Comparison with the gold standard (domain 3): As there is no TNM staging in neuro-oncology, the well-known parameters of age, Karnofsky performance score, extent of resection [16], IDH, or MGMT status were considered as gold standards for survival analysis.
5) Potential clinical utility (domain 3): According to the consensus statement of the FDA-NIH Biomarker Working Group [13], 'clinical utility' is thought to be achieved when a biomarker leads to net improvement of health outcomes or provides information useful for prevention, diagnosis, treatment, and management of a disease [13,17,18]. For example, a study earned additional points if decision curve analysis was performed and demonstrated net improvement. Discussion of the potential utility of radiomics without proper analysis did not earn additional points.
The multicentricity of the data source was further investigated by one reviewer (J.E.P.). Each study was evaluated as to whether model validation was conducted using an external dataset, and whether the data used to construct the training set came from a single center, multiple centers, or public data. In addition, whether the data were acquired on a 1.5 Tesla or a 3.0 Tesla magnet was recorded.

Statistical analysis
The RQS was calculated for each article, both in total and for each component. Across all included articles, the total RQS (score range, −8 to 36) was expressed as median and interquartile range.
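As an illustration of this summary step, the sketch below computes the median, the quartiles, and the median as a percentage of the ideal score of 36. The example scores are hypothetical, and the "inclusive" quantile method is an assumption: SPSS and R may use different quartile conventions and give slightly different interquartile ranges.

```python
from statistics import median, quantiles

IDEAL_SCORE = 36  # maximum attainable RQS, as stated in the text


def summarize_rqs(total_scores):
    """Return the median, quartiles (Q1, Q3), and median as % of the ideal score."""
    # Assumption: "inclusive" quartiles; other software may use other definitions.
    q1, _, q3 = quantiles(total_scores, n=4, method="inclusive")
    med = median(total_scores)
    return {
        "median": med,
        "q1": q1,
        "q3": q3,
        "percent_of_ideal": 100 * med / IDEAL_SCORE,
    }


# Hypothetical total scores for seven articles (not data from the review).
example_scores = [-6, 3, 10, 11, 12, 13, 16]
summary = summarize_rqs(example_scores)
print(summary)
```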
For the six domains in the RQS (protocol quality and segmentation, feature selection and validation, biologic/clinical validation and utility, model performance index, high level of evidence, and open science and data), basic adherence was assigned when a score of at least 1 point was obtained without minus points. The basic adherence rate was then calculated in a descriptive manner using proportions, and the proportion (%) of articles that fulfilled each reporting domain was determined.
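One possible reading of the basic adherence rule is sketched below: a domain counts as adherent for an article when its item scores sum to at least 1 point and none of them is negative. This interpretation, the function names, and the example scores are assumptions made for illustration only.

```python
def has_basic_adherence(item_scores):
    """True if the domain earns at least 1 point and no item score is negative."""
    return sum(item_scores) >= 1 and all(score >= 0 for score in item_scores)


def adherence_rate(domain_scores_per_article):
    """Percentage of articles showing basic adherence for one domain."""
    flags = [has_basic_adherence(scores) for scores in domain_scores_per_article]
    return 100 * sum(flags) / len(flags)


# Hypothetical domain 2 scores (feature reduction, validation) for four articles.
domain2_scores = [(3, 2), (3, -5), (-3, -5), (3, 5)]
rate = adherence_rate(domain2_scores)
print(rate)
```

In this hypothetical example, the second article earns points for feature reduction but loses adherence because missing validation carries minus points.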
A graphical display for the proportion of studies with a basic adherence rate was adopted from the suggested graphical display for Quality Assessment of Diagnostic Accuracy Studies-2 results [19].
Subgroup analyses were performed to determine whether the reporting quality differed according to intended use (diagnostic or prognostic) and published journal (imaging journal or clinical journal). Before subgroup analysis, the RQS was plotted for each journal to observe whether there was a systematic difference between journals (Additional file 2: Figure S2); as no systematic difference was observed between journals, this effect was not considered. The nonparametric Mann-Whitney U test was used to compare the RQS between groups. All statistical analyses were performed using SPSS (SPSS version 22; SPSS, Chicago, IL) and R (R version 3.3.3; R Foundation for Statistical Computing, Vienna, Austria), and a P value < .05 was considered statistically significant.
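To make the group comparison concrete, the sketch below computes the Mann-Whitney U statistic directly from its pairwise definition. It is not the SPSS or R code used in the study, the scores are hypothetical, and the P value is deliberately left to statistical software.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y), where U_x counts pairs with x > y and ties count 0.5."""
    u_x = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u_x += 1.0
            elif xi == yj:
                u_x += 0.5
    # The two U statistics always sum to the number of pairs.
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


# Hypothetical RQS values for two subgroups (not data from the review).
clinical_journals = [11, 12, 16]
imaging_journals = [3, 10, 11, 12]
u_clinical, u_imaging = mann_whitney_u(clinical_journals, imaging_journals)
print(u_clinical, u_imaging)
```

A U value close to the total number of pairs for one group indicates that its scores tend to be higher than those of the other group.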
Other non-glial tumors were studied in 9.8%. The study purposes included molecular or genomic classification (49.0%), survival prediction (25.5%), differential diagnosis of gliomas from non-glial tumors (11.8%), histopathologic grading (9.8%), and assessment of response to treatment (5.9%). One study predicted the occurrence of epilepsy in patients with low-grade glioma (LGG) using radiomics analysis [38]. Radiomics analysis was most frequently studied as a diagnostic biomarker (70.6%), followed by use as a prognostic biomarker (25.5%), and as a predictive  Table S2).
Basic adherence rate of the reporting quality according to the six key domains
Table 2 summarizes the basic adherence rate of the radiomics studies according to the six key domains. In domain 1, all studies reported well-documented image acquisition protocols or use of publicly available image databases. Multiple segmentations by two readers or automatic segmentation were performed in 14 of the 51 studies (27.4%). Notably, only one study [10] conducted imaging at multiple time points and tested feature robustness. After test-retest analysis, 37.0% of the radiomic features (386 out of the 1043 extracted features in the study) were stable and reproducible [10] over three different sessions on the same machine.
In domain 2, most studies adopted appropriate feature reduction or adjustment for multiple testing (48 out of 51, 94.1%). The studies used either false discovery rate control with univariate logistic regression or two-sample t-tests (for binary outcomes), or a variety of statistical and machine learning methods such as the least absolute shrinkage and selection operator (LASSO), random forest, recursive feature elimination, and support vector machines. Many studies performed validation using datasets from the same or a different institution (35 out of 51, 68.6%). Six studies earned the full 5 points for validation [21,24,31,36,54,55], using data from three datasets from distinct institutions or public datasets.
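For readers unfamiliar with false discovery rate control, the sketch below shows the Benjamini-Hochberg step-up procedure applied to per-feature P values. This is an assumption for illustration: the reviewed studies may have used other procedures, and the P values shown are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected at FDR alpha."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose P value is at most (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank

    # Reject every hypothesis ranked at or below that k (step-up rule).
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical univariate P values for eight radiomics features.
feature_p_values = [0.041, 0.001, 0.205, 0.008, 0.039, 0.06, 0.074, 0.042]
selected = benjamini_hochberg(feature_p_values)
print(selected)
```

In this hypothetical example only the two features with the smallest P values are retained, although five of the eight have unadjusted P values below .05.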
In domain 3, many studies performed multivariate analysis of the radiomics features with non-radiomics features (62.7%), and most of the studies found biological correlates (74.5%) to provide a more holistic model and imply biological relevance. Fewer than half of the studies (37.2%) compared results with an existing gold standard.
By contrast, in terms of clinical utility, only one study [50] analyzed a net improvement in health outcomes using decision curve analysis or other statistical tools.
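For reference, decision curve analysis summarizes clinical utility as the net benefit at a chosen threshold probability: the true-positive rate minus the false-positive rate weighted by the odds of the threshold. The sketch below applies this standard formula to hypothetical counts; it is not taken from the cited study [50].

```python
def net_benefit(true_positives, false_positives, n_patients, threshold_probability):
    """Net benefit of acting on a model at a given threshold probability."""
    if not 0 < threshold_probability < 1:
        raise ValueError("threshold_probability must lie strictly between 0 and 1")
    # Odds of the threshold: how many false positives one true positive is worth.
    weight = threshold_probability / (1 - threshold_probability)
    return true_positives / n_patients - weight * false_positives / n_patients


# Hypothetical cohort of 100 patients, 40 of whom have the outcome.
nb_model = net_benefit(true_positives=30, false_positives=10,
                       n_patients=100, threshold_probability=0.2)
# Reference strategy: treat every patient as positive.
nb_treat_all = net_benefit(true_positives=40, false_positives=60,
                           n_patients=100, threshold_probability=0.2)
print(nb_model, nb_treat_all)
```

A model shows net improvement at a given threshold when its net benefit exceeds that of both reference strategies, treating all patients and treating none (net benefit of zero).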
In domain 4, most studies (96.1%) reported discrimination statistics, but two studies [40,41] provided hazard ratios and P values from a log-rank test for survival analysis instead of the C-index.
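The C-index mentioned here measures discrimination for survival outcomes as the proportion of comparable patient pairs in which the patient with the higher predicted risk has the shorter survival. The following is a minimal sketch of Harrell's concordance index under simplifying assumptions (pairs with tied survival times are ignored); the data are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data; tied survival times are ignored."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had the event before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical survival times (months), event indicators (1 = death), and risks.
times = [5, 10, 15, 20]
events = [1, 0, 1, 1]
risk_scores = [0.9, 0.8, 0.3, 0.4]
c_index = concordance_index(times, events, risk_scores)
print(c_index)
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, which a hazard ratio with a log-rank P value does not convey.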
In domain 5, studies were surprisingly deficient in demonstrating a high level of evidence, such as a prospective design or cost-effectiveness analysis. Only two studies partly performed validation using a prospective registry [24,29], but the studies themselves were conducted retrospectively. For domain 6, only three studies [4,20,47] made their code and/or data publicly available. The adherence rate according to the six key domains is shown in Fig. 2.

Assessment of the radiomics quality score
The median RQS of the 51 studies was 11.0 (interquartile range [IQR], 3-12.75), which was 30.5% of the ideal score of 36 (Table 3). The lowest score was −6 and the highest score was 16 (44.4% of the ideal score). Compared with the ideal score, the RQS of the selected studies was lowest in the high level of evidence domain and the open science and data domain (0%), followed by biological/clinical validation, and feature reproducibility in image and segmentation.
Both feature reduction and validation were missing from the study [51] with the lowest score. Meanwhile, studies with the highest scores [10,24,29,31,36,55] earned additional points by using publicly available images from the TCIA (The Cancer Imaging Archive) [24,31,36,55], registry or trial data [24,29], multiple segmentation [31], test-retest analysis [10], and calculation of calibration index and bootstrapping [10], with all studies fulfilling requirements for image protocol quality, feature reduction, validation, and use of a discrimination index.

Subgroup analysis
The results of the subgroup analysis according to journal type and biomarker study design are shown in Table 3. Studies in clinical journals showed a trend toward a higher RQS than those in imaging journals (median 11.5 vs. 10), but this was not statistically significant. Studies in clinical journals (n = 20) were similar to those in imaging journals (n = 31) for most RQS items, except for open science and data (P = .03). All three studies [4,20,47] that made their code and/or data publicly available were published in the same clinical journal (Neuro-Oncology). Prognostic/predictive studies showed a trend toward a higher RQS than diagnostic studies (median 12 vs. 10), but this was not statistically significant. Prognostic studies received a higher score than diagnostic studies for biologic correlates (P = .001) and comparison to a 'gold standard' (P < .001). Also, prognostic/predictive studies used calibration statistics (P = .02) and cut-off analysis (P = .001) more frequently than diagnostic studies, which is potentially useful for future modeling studies.

Discussion
In this study, radiomics studies in neuro-oncologic imaging were evaluated with respect to the quality of both the science and the reporting, using the RQS. Overall, radiomics studies still have room for improvement, with a basic adherence rate of 37.1% across the 16 items. In terms of protocol quality, radiomics studies were particularly deficient in testing image stability using test-retest analysis and a phantom study. Selected radiomics features were often correlated with non-radiomics features or biological phenotype, but linking them to clinical validation and achievement of clinical utility was insufficient. The radiomics models were often measured with discrimination statistics, while calibration statistics and cut-off analyses were underutilized. A high level of evidence for radiomics studies is critically lacking, with further limitations being demonstrated in their openness to data and code. Our results imply that a low quality of reporting may hamper the use of radiomics as a tool for clinical decision-making, and several key domains in radiomics studies require significant improvement.

Table 3 (excerpt; remaining cells unavailable) Subgroup analysis of the RQS according to journal type and biomarker study design. Values are median (interquartile range), listed in their original order, with P values for journal type and for biomarker type (diagnostic vs. prognostic/predictive).
Total RQS: P = .57 (journal type); 10 (3-11), 12 (6-14), P = .07 (biomarker type)
Domain 1: Protocol quality and stability in image and segmentation (0 to 5 points)
Protocol quality (2)
Feature reduction or adjustment of multiple testing (−3 or 3): (3-3), (3-3), (3-3), .849; (3-3), (3-3), 1
Validation (−5, 2, 3, 4, or 5): .975; 2 (−5-3), 2 (−5-3), .833
Domain 3: Biologic/clinical validation and utility (0 to 6 points)
Non-radiomics features (1): 1 (0-1), 1 (0-1), 1 (0-1), .799; 1 (0-1), 1 (0-1), .159
Biologic correlates (1): .001
Comparison to 'gold standard' (2): 0 (0-2), 0 (0-2), 0 (0-2), .97; 0 (0-0), 2 (2-2), <.001
Potential clinical utility (2)

The six key domains used in this study were designed to support the integration of the RQS to facilitate its use in radiomics approaches. Adopted from the consensus statement of the FDA-NIH Biomarker Working Group [13], the three aspects of technical validation, biological/clinical validation, and assessment of cost-effectiveness for imaging biomarker standardization were included in domains 1, 3, and 5, respectively. With regard to technical validation, radiomics approaches are yet to become a reliable measure for the testing of hypotheses in clinical cancer research, with insufficient data supporting their precision or technical bias. Precision analysis using repeatability and reproducibility tests was conducted in one study [10], but reproducibility needs to be tested using different geographical sites and different equipment. Furthermore, none of the evaluated studies used a phantom study to analyze technical bias, which describes the systematic difference between the measurements of a parameter and its real values [62]. According to our results, further technical validation needs to be achieved before radiomics analysis can be related to clinical outcomes.
Along with technical performance assessment, imaging biomarkers need to pass clinical performance assessment in multicenter studies before they can be considered for clinical use [13,63]. For clinical validation, prospective testing of an imaging biomarker in clinical populations is required [64], but until now few studies in the field of neuro-oncology have been conducted prospectively. After biological/clinical validation, the cost-effectiveness of radiomics needs to be studied to ensure it provides good value for money compared with the other currently available biomarkers. However, no study conducted cost-effectiveness analysis and only one study demonstrated net benefit improvement. Also, external validation was conducted in only 29.4% of the studies. From the current standpoint, the clinical use of radiomics may seem far away, and technical and clinical validation is still required.
Subgroup analysis demonstrated that prognostic and predictive radiomics studies showed better quality than diagnostic studies with regard to comparison to a 'gold standard', use of calibration statistics, and use of cut-off analysis. These are important for the adoption of radiomics modeling, as calibration relates the point estimates of prediction to the actual data and cut-offs can be applied in future studies. These measures further emphasize the utility of radiomics modeling in the clinic.
Biological validation of imaging biomarkers occurs relatively late in the development process [13]. In terms of biological validation, 74.5% of studies related radiomics features to biological correlates, such as the molecular subtype of IDH mutation or MGMT methylation status, while 62.7% of studies performed multivariate analysis using both radiomics and non-radiomic features. Nonetheless, biological validation of imaging-pathologic correlations [13] is not currently available for radiomics, as extensive spatial heterogeneity exists [1] and colocalized pathology data are currently not achievable. The unclear relationship with tumor biology is probably the reason why radiomics approaches have not influenced clinical decision-making or exhibited potential clinical utility.
It is important to adhere to the standardized nomenclature and calculation of radiomics features according to the IBSI (Image Biomarker Standardisation Initiative) to improve the reproducibility of scientific research. Only three studies made their code open, and many studies did not describe the calculation of radiomics features in enough detail for it to be verified. Future studies need to adhere to the standardization of radiomics features. Also, studies that utilized public data sources such as the TCIA did not make their own data publicly available. Further determination of the reproducibility of radiomics techniques requires collaborative multicenter studies, which would benefit greatly from the open availability of data and models.
Our study had several limitations. First, the publication of radiomics studies is not limited to the neuro-oncologic field or to MRI. Limits were placed on the search to permit in-depth analysis of applications for a particular disease in which radiomics research seems to be most actively performed. Given the impact and the population sizes of the selected studies, our results may actually represent a higher than average quality. Second, radiomics is still a developing imaging biomarker and the suggested RQS may be too 'ideal' to be fulfilled. The criteria of a phantom study and multiple imaging acquisitions may be unrealistic in clinical situations. Also, segmentation stability earned a point when segmentation was performed by two or more readers, but recently developed deep learning segmentation may provide more robust results than manual segmentation. However, aiming for a higher level of evidence is necessary for the use of radiomics approaches in future clinical trials and in the clinic.