A systematic review of test accuracy studies evaluating molecular microsatellite instability testing for the detection of individuals with Lynch syndrome

Background A systematic review was conducted to assess the diagnostic test accuracy of polymerase chain reaction (PCR)-based microsatellite instability (MSI) testing for identifying Lynch syndrome in patients with colorectal cancer (CRC). Unlike previous reviews, this was based on assessing MSI testing against best practice for the reference standard, and included CRC populations that were unselected, age-limited or high-risk for Lynch syndrome. Methods Single- and two-gate diagnostic test accuracy studies, or similar, were identified, assessed for inclusion, data extracted and quality appraised by two reviewers according to a pre-specified protocol. Sensitivity of MSI testing was estimated for all included studies. Specificity, likelihood ratios and predictive values were estimated for studies that were not based on high-risk samples. Narrative synthesis was conducted. Results Nine study samples were included. When MSI-Low results were considered to be negative, sensitivity estimates ranged from 67% (95% CI 47, 83) to 100% (95% CI 94, 100). Three studies contributed to estimates of both sensitivity and specificity, with specificity ranging from 61% (95% CI 57, 65) to 93% (95% CI 89, 95). Good sensitivity was achieved at the expense of specificity. When MSI-L was considered to be positive (effectively lowering the threshold for a positive index test result), sensitivity increased and specificity decreased. Between-study heterogeneity in both the MSI and reference standard testing, combined with the low number of studies contributing to both sensitivity and specificity estimates, precluded pooling by meta-analysis. Conclusions MSI testing is an effective screening test for Lynch syndrome. However, there is significant uncertainty surrounding what balance of sensitivity and specificity will be achieved in clinical practice and how this relates to specific characteristics of the test (such as the panel of markers used or the thresholds used to denote a positive test).


Background
Lynch syndrome is caused by heritable constitutional pathogenic mutations in the mismatch repair (MMR) genes (MLH1, MSH2, MSH6 and PMS2) or, rarely, by certain mutations in nearby genes that affect expression of the adjacent MMR gene, (i.e. EPCAM and MSH2, and LRRFIP2 and MLH1), due to epigenetic silencing caused by promoter methylation [1,2]. It is responsible for around 2.8% of colorectal cancer (CRC), [3] conveys a high risk of colorectal and endometrial cancer, and increases the risk of other cancers, such as ovarian and gastric cancer [4,5]. In people with Lynch syndrome, CRC has an earlier onset than CRC in the general population (44 years, compared with 60-65 years respectively) [5,6]. Currently, the best method for diagnosing Lynch syndrome is comprehensive screening for constitutional mutations in the MMR genes and EPCAM, using a combination of (i) DNA sequencing in order to detect point mutations and small insertions and deletions, and (ii) multiplex ligation-dependent probe amplification (MLPA) to detect large structural DNA abnormalities [7].
Patients with CRC can be selected for comprehensive screening for constitutional mutations by first applying other diagnostic tests. Because there is a high probability of loss of MMR function in Lynch syndrome cancers, and tumours that have lost MMR function display microsatellite instability, one such test is microsatellite instability (MSI) testing. This involves polymerase chain reaction (PCR) amplification of DNA markers (using tumour tissue and healthy tissue). The two samples are compared to assess whether abnormal patterns of microsatellite repeats are observed in the tumour DNA. Mono- and dinucleotide markers are the most frequently used, with the Bethesda/NCI markers (BAT-25, BAT-26, D2S123, D5S346, D17S250) often being used [8]. However, other markers are in use (e.g. BAT-40, MYCL, MONO-27, NR-21, NR-24), and it has been argued that the panel should contain at least three mononucleotide markers [9]; individual laboratories may therefore develop their own panels [10]. Microsatellite instability is categorised trimodally (MSI-High, MSI-Low, or MS-Stable) or bimodally (MSI-positive or negative), according to the proportion of markers demonstrating MSI. When a trimodal categorisation is initially used, a decision must then be taken as to whether MSI-Low (MSI-L) will be further categorised as a positive or negative test result.
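The categorisation described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the default `high_cutoff` of 0.3 are assumptions made for the example, not values taken from any study, and laboratories set their own cut-offs and panels.

```python
def classify_msi(unstable: int, tested: int, high_cutoff: float = 0.3) -> str:
    """Trimodal MSI call from the proportion of unstable markers.

    high_cutoff is an assumed, laboratory-specific proportion: at or above
    it the tumour is called MSI-H; below it (but with at least one
    unstable marker) MSI-L; with no unstable markers MSS.
    """
    if tested <= 0 or not 0 <= unstable <= tested:
        raise ValueError("need tested > 0 and 0 <= unstable <= tested")
    if unstable == 0:
        return "MSS"
    return "MSI-H" if unstable / tested >= high_cutoff else "MSI-L"


def is_index_positive(status: str, msi_l_positive: bool = False) -> bool:
    """Collapse a trimodal call into a positive or negative index test result."""
    return status == "MSI-H" or (msi_l_positive and status == "MSI-L")


# Hypothetical five-marker panel with one unstable marker.
status = classify_msi(unstable=1, tested=5)
print(status, is_index_positive(status), is_index_positive(status, msi_l_positive=True))
```

The `msi_l_positive` flag makes explicit the decision on whether MSI-L counts as a positive or a negative index test result.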
A previous Health Technology Assessment in England and Wales evaluated the diagnostic test accuracy of MSI for Lynch syndrome in early-onset (aged under 50 years) CRC patients [7]. However, most of the included studies were at risk of bias because the reference standard was not conducted on all participants. Additionally, because this previous review did not include unselected CRC samples, the results may have been subject to spectrum effects and not generalisable to all CRC patients. Furthermore, this previous review [7], and others before it [4,9,11], have been obliged to include a wide range of techniques as their reference standard rather than focusing on the primary standard of comprehensive screening for constitutional mutations using a DNA sequencing method combined with MLPA or another appropriate technique to detect large structural DNA abnormalities.
This systematic review was, therefore, conducted to address the need for clearer information regarding the diagnostic test accuracy of PCR-based MSI testing (with or without BRAF V600E mutation testing and with or without MLH1 methylation testing) for identifying Lynch syndrome in patients in the general CRC population. The review was conducted as part of a Diagnostics Assessment Report (DAR) which was commissioned by the National Institute for Health Research (NIHR) Health Technology Assessment Programme in England and Wales to support the National Institute for Health and Care Excellence (NICE) Diagnostics Assessment of molecular testing for Lynch syndrome in people with colorectal cancer [https://www.nice.org.uk/guidance/dg27] [12].

Methods
The systematic review was undertaken in accordance with a predefined protocol. The protocol for the review (and other reviews in the DAR) can be found at http://www.crd.york.ac.uk/PROSPERO/display_record.asp?ID=CRD42016033879. This review departs from the diagnostic test accuracy review described in the original protocol in that it focuses only on PCR-based MSI testing as the index test, whereas the full review also included immunohistochemistry (IHC) as an index test. However, no studies were found that directly compared MSI testing with IHC and the two index tests were, therefore, reviewed in parallel but evaluated separately.

Searches
The following bibliographic databases were searched using population terms for Lynch syndrome and intervention terms for MSI or IHC: Four key systematic reviews [4,7,9,11] (and other systematic reviews identified by the bibliographic database searches) were screened in order to source further relevant studies published before 2006 and additional studies published after 2006. These four key systematic reviews [4,7,9,11] were examined prior to the start of this review and were judged to have sufficiently robust searching methods to identify relevant studies published before 2006. Studies which cited the included studies were identified using Scopus (Elsevier). The reference lists of all included studies were screened in order to identify any additional relevant studies.

Study selection
Titles and abstracts of the studies retrieved from the searches were screened, independently by two reviewers, according to the predefined inclusion criteria specified below. Disagreements between reviewers were resolved by discussion, with arbitration from a third reviewer where necessary. Full texts of included titles/abstracts (from bibliographic database searches, and forward and backward citation chasing), and full texts of studies identified from systematic reviews, were obtained. These were screened in the same way as titles and abstracts.

Inclusion criteria
Studies had to be single-gate (also known as diagnostic cohort studies) or two-gate (also known as diagnostic case-control studies) diagnostic test accuracy studies (or a variation of one of these designs) [13]. They had to recruit individuals with colorectal cancer and investigate the diagnostic test accuracy of molecular MSI testing (with or without BRAF V600E mutation testing and with or without MLH1 methylation testing). MSI must have been compared with the reference standard, which was constitutional MMR mutation testing (including DNA sequencing of MLH1, MSH2 and MSH6 and MLPA or another appropriate technique for detecting large genomic abnormalities as a minimum), by providing sufficient data for at least sensitivity to be estimated. Other outcomes (in addition to sensitivity) were: specificity, likelihood ratios (LR+ and LR-), predictive values (PPV and NPV), concordance (with the reference standard), diagnostic yield, and test failure rates. To be included in the review, studies must have been designed for all participants to receive both the index test and reference standard. However, in studies recruiting a representative sample of all patients with CRC (including where an age limit was applied), the reference standard may have been applied to all MSI-positive tumours and to a representative (e.g., random) sample of MSI-negative tumours. Abstracts were included if they reported data from an included study that was published in full.
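Where the reference standard is applied to all MSI-positive tumours but only to a random sample of MSI-negative tumours, the verified MSI-negative counts under-represent the true numbers of false and true negatives. One common way to handle this, sketched below, is to weight the verified MSI-negative counts by the inverse of the sampling fraction. This is an assumption about how such data could be analysed, not a description of the method used in the included studies; the function name and all counts are hypothetical.

```python
def corrected_accuracy(tp, fp, fn_verified, tn_verified, sampling_fraction):
    """Sensitivity and specificity under partial verification.

    All index-test positives (tp, fp) received the reference standard,
    but only a random fraction of index-test negatives did. The verified
    negative counts are scaled up by 1 / sampling_fraction to estimate
    the counts that full verification would have produced.
    """
    if not 0 < sampling_fraction <= 1:
        raise ValueError("sampling_fraction must be in (0, 1]")
    weight = 1.0 / sampling_fraction
    fn = fn_verified * weight
    tn = tn_verified * weight
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("no diseased or no non-diseased participants")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical counts: a quarter of MSI-negative tumours were verified.
naive = corrected_accuracy(40, 60, 2, 198, sampling_fraction=1.0)
weighted = corrected_accuracy(40, 60, 2, 198, sampling_fraction=0.25)
print(naive, weighted)
```

Without weighting, the same counts would overstate sensitivity and understate specificity, because index-test negatives are under-represented among the verified participants.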

Data extraction and quality appraisal (risk of bias assessment)
All included studies were given a study identification label: first-author date. Where needed for clarity, included studies are identified by their study identification label in the results and discussion sections below. Data extraction and quality appraisal were conducted, for all included studies, by one reviewer and checked by another. Discrepancies were resolved by discussion with the involvement of a third reviewer where necessary. Extracted data included details of the study's design and methods, participants' characteristics and study results. Risk of bias in individual studies was assessed according to criteria in Phase 3 of the QUADAS-2 tool [14].

Analysis and synthesis
The data extracted from the included studies were analysed in STATA 13 (StataCorp LP) using the "diagt" command [15]. For single-gate studies that were not based on high-risk samples (including age-limited population studies), and where data permitted, sensitivity, specificity, LR+, LR-, PPV and NPV, diagnostic yield and concordance with the reference standard (with 95% confidence intervals [CIs]) were calculated. However, for the studies based on high-risk samples, only sensitivity (with 95% CI) was calculated (spectrum effects that occur when using a high-risk sample have not been found to lead to significant bias in sensitivity estimates for MSI) [9]. Although not considered an outcome, for illustrative purposes, disease prevalence was also calculated for all included studies, based upon data extracted to 2×2 tables, and representing, therefore, the prevalence in the analysed samples rather than the recruited population. Where extracted data resulted in zero counts in one or more cells, one-sided 95% CIs were calculated.
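For readers without access to Stata, the interval calculation can be approximated as follows. This is a sketch under stated assumptions: it uses an exact (Clopper-Pearson) binomial interval, and at the boundaries (zero events or all events) it switches to a one-sided interval using the full alpha, which is one reading of the one-sided rule above. The function names are hypothetical, and the code is not a reproduction of the "diagt" command.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve(f, target, increasing):
    """Bisection on [0, 1] for the p where monotone f(p) equals target."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid  # root lies to the right of mid
        else:
            hi = mid  # root lies to the left of mid
    return (lo + hi) / 2


def exact_ci(x, n, alpha=0.05):
    """Clopper-Pearson interval for x events out of n.

    Assumption: at x == 0 or x == n a one-sided interval with the full
    alpha is returned instead of a two-sided one.
    """
    if n <= 0 or not 0 <= x <= n:
        raise ValueError("need n > 0 and 0 <= x <= n")
    if x == 0:
        return 0.0, 1 - alpha ** (1 / n)
    if x == n:
        return alpha ** (1 / n), 1.0
    # Lower limit: P(X >= x | p) = alpha / 2, increasing in p.
    lower = _solve(lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2, True)
    # Upper limit: P(X <= x | p) = alpha / 2, decreasing in p.
    upper = _solve(lambda p: binom_cdf(x, n, p), alpha / 2, False)
    return lower, upper


# Hypothetical 2x2 cell counts: 20 true positives, 10 false negatives.
tp, fn = 20, 10
print(tp / (tp + fn), exact_ci(tp, tp + fn))
```

The same function applies to specificity by passing the true negative count and the total number of reference standard negatives.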
In primary analyses, MSI-Low was considered as a negative index test result. Unclassified variants (mutations of unknown clinical significance with regard to Lynch syndrome), where reported, were considered to be negative reference standard results. The main method of synthesis was narrative.

Results
Ten studies were included in the HTA upon which this review is based (Fig. 1). However, in two of these studies, MSI was not assessed. Therefore, eight studies were included in this review. It should be noted that Poynter 2008 [21] had two distinct samples (a population-based sample and a high-risk sample) and, therefore, had two distinct sets of includable data. These two samples were treated separately and both included in this review.

Study and participant characteristics
Of the nine study samples included in this review, three report data from a population-based sample, although only Poynter 2008 recruited an apparently unselected CRC population [21]. The other two studies restricted the population by applying a maximum limit to age of diagnosis: Barnetson 2006 applied an age limit of <55 years and Southey 2005 applied a limit of <45 years [16,23]. All three of these studies used single-gate designs but varied in size, with Barnetson 2006 and Poynter 2008 recruiting 1259 participants and 1061 participants respectively but Southey 2005 recruiting only 131 participants [16,21,23]. Disease prevalence in the analysed study samples is provided in Table 1.
The remaining six studies selected participants with CRC who were at high-risk for Lynch syndrome (Caldes 2004, Hendriks 2003, Mueller 2009, Overbeek 2007, Shia 2005 and the other sample in Poynter 2008) [17][18][19][20][21][22]. Five of these studies had a single-gate design [17,19-22]. The remaining study (Hendriks 2003) was a variation on a two-gate study design; although participants with positive reference standard results were recruited, no reference standard negatives were included [18]. We referred to this as a reference standard positive study design. The six high-risk studies varied in size; the largest study was Poynter 2008 with 172 participants and the smallest study was Hendriks 2003 with 45 participants [18,21]. Further details on study and participant characteristics, including disease prevalence in the analysed samples, are given in Table 1.
There was a great deal of between-study variation in both the reference standard and in the MSI testing methods as well as in the reporting of methods. For example, in the studies by Poynter 2008, Mueller 2009, and Overbeek 2007 microdissection techniques were not reported [19][20][21]. In addition, none of the population-based studies assessed the same panel of markers, with differences existing in both the number and type of markers (see Table 1).
Fig. 1 notes: 1 Abstracts were excluded when they were not linked to an included study and did not provide sufficient methodological information to meet the review inclusion criteria or have data extracted. 2 These studies evaluated IHC and not MSI. 3 One of these studies included two distinct populations, both of which are included in this review. Although there are eight included studies, there are, therefore, nine included datasets.
… [20,22], and Caldes 2004 used a bimodal categorisation but defined tumours as either MSI-H or MSS [17]. The thresholds used to categorise the MSI status of tumours also varied across studies (Table 1), with some studies using positivity of only 20% of markers as the cut-off between MSI-H and MSI-L (Barnetson 2006) [16], and others requiring 50% (Southey 2005) [23], although different numbers of markers were also used in these two studies. It should also be mentioned that two of the population-based or age-limited studies (Poynter 2008, Barnetson 2006) and three of the high-risk studies (Caldes 2004, Shia 2005, Hendriks 2003) report on unclassified variants (i.e. mutations where the association with Lynch syndrome is unclear) [16-18, 21, 22]. Notably, all of the nine studies included in this report predate what is now considered to be the definitive interpretation of MMR gene variants [24]. In this review, therefore, unclassified variants have primarily been considered to be reference standard negative results.

Risk of bias in individual studies
None of the included studies displayed any evidence to suggest that they were at high risk of bias (Table 2). It should be noted that an absence of evidence to suggest that the included studies were at high risk of bias does not mean that the studies were free from bias. In fact, none of the studies adequately reported whether MSI was interpreted without knowledge of the reference standard results, or whether the thresholds used to denote a positive MSI result were pre-specified, so it was unclear whether the conduct of the MSI test would have introduced bias. Likewise, for all of the included studies, it was not clear whether the flow and timing of the study would have introduced bias. In all studies except Hendriks 2003 it was unclear whether the conduct of the reference standard would have introduced bias [18]. Additionally, only Barnetson 2006 and Southey 2005 reported sufficient information to determine that participant selection was unlikely to have introduced bias [16,23].

Sensitivity and specificity
Table 3 gives sensitivity and specificity estimates from primary analyses, where unclassified variants were considered to be reference standard negative results and MSI-L was considered to be an index test negative result (for studies using a tri-modal distribution of MSI). Only three studies contributed estimates of both sensitivity and specificity (Barnetson 2006, Southey 2005, Poynter 2008) [16,21,23]. These suggested, in one study (Poynter 2008), that good sensitivity (100.0%, 95% CI 93.9, 100.0) could be achieved at the expense of specificity (61.1%, 95% CI 57.0, 65.1) [21]. Conversely, a second study (Barnetson 2006) achieved good specificity (92.5%, 95% CI 89.1, 95.2) but at the expense of sensitivity (66.7%, 95% CI 47.2, 82.7) [16]. The third study (Southey 2005) had intermediate values of sensitivity (72.2%, 95% CI 46.5, 90.3) and specificity (87.8%, 95% CI 73.8, 95.9) [23].
Although this pattern would be consistent with a threshold effect, it is difficult to establish this because, in addition to using different numbers of unstable markers to denote a positive MSI result, the panel of markers differed between studies (Table 1).
The sensitivities in the single-gate, high-risk samples (Caldes 2004, Mueller 2009, Overbeek 2007, Shia 2005 and the other sample in Poynter 2008) [17,19-22] and the reference standard positive study (Hendriks 2003) [18] all fell within the range of sensitivities identified in the three single-gate population-based sample studies (Barnetson 2006, Poynter 2008, Southey 2005) [16,21,23]. In secondary analyses, when MSI-L was considered to be a positive index test result, sensitivity increased (for the six study samples where a tri-modal distribution of MSI was used) and specificity decreased (for the three population-based samples). This is unsurprising; including MSI-L as an index test positive essentially decreases the threshold for a positive test result. Fig. 2 illustrates this effect for the three population-based studies.
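The effect of reclassifying MSI-L can be seen by collapsing a three-row table in two different ways. The sketch below uses invented counts (they are not taken from any included study) and a hypothetical helper, `accuracy`, purely to show the direction of the change.

```python
# Hypothetical counts, not study data: each MSI call maps to
# (reference standard positive, reference standard negative).
counts = {"MSI-H": (18, 30), "MSI-L": (4, 50), "MSS": (2, 396)}


def accuracy(table, positive_calls):
    """Sensitivity and specificity when the given calls count as positive."""
    tp = sum(table[call][0] for call in positive_calls)
    fp = sum(table[call][1] for call in positive_calls)
    diseased = sum(row[0] for row in table.values())
    non_diseased = sum(row[1] for row in table.values())
    return tp / diseased, (non_diseased - fp) / non_diseased


primary = accuracy(counts, {"MSI-H"})             # MSI-L treated as negative
secondary = accuracy(counts, {"MSI-H", "MSI-L"})  # MSI-L treated as positive
print(primary, secondary)
```

Because moving MSI-L into the positive group can only add true positives and false positives, sensitivity cannot fall and specificity cannot rise, whatever the counts.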
Further analyses were also conducted where unclassified variants were considered to be positive reference standard results. Only Caldes 2004 and Hendriks 2003 provided sufficient data for these analyses [17,18]. Both of these studies were based on high-risk populations, so only sensitivity estimates were made, and these were largely unchanged from the primary analyses, most likely because of the low absolute numbers of unclassified variants involved.
Pooling of sensitivity and specificity data was considered but rejected, primarily because of the marked between-study methodological and clinical heterogeneity (e.g. differences in sequencing methods, which genes were tested, in techniques used to test for large genomic deletions and alterations, whether unclassified variants were tested and in the number and nature of the microsatellite markers assessed). Further, as only three studies provided both sensitivity and specificity estimates, this precluded the application of potentially useful test accuracy meta-analytic models such as the hierarchical summary ROC model [25]. Although nine study samples provided estimates of sensitivity, pooling of sensitivity alone is not recommended because of the interdependence with specificity [25][26][27].

Secondary outcomes
For the population-based sample in Poynter 2008 and for the other two studies that recruited population-based samples (Barnetson 2006 and Southey 2005) [16,21,23], LR+, LR−, PPV, NPV, diagnostic yield and concordance with the reference standard were also calculated. These were primarily estimated based on MSI-L being an index test negative result and these results are given in Table 4. None of the studies reported test failure rates.
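As a rough guide to how these secondary outcomes follow from sensitivity and specificity, the sketch below applies the standard formulae to the rounded point estimates quoted in this review for Barnetson 2006. The prevalence of 2.8% is the overall share of CRC attributed to Lynch syndrome cited in the Background, used here only for illustration; it is not the prevalence in that study's analysed sample, so the outputs should not be expected to match Table 4. Function names are hypothetical.

```python
def likelihood_ratios(sens, spec):
    """LR+ and LR- from sensitivity and specificity (requires 0 < spec < 1)."""
    lr_pos = sens / (1 - spec)
    lr_neg = (1 - sens) / spec
    return lr_pos, lr_neg


def predictive_values(sens, spec, prevalence):
    """PPV and NPV at a given prevalence, via Bayes' theorem."""
    tp = sens * prevalence
    fn = (1 - sens) * prevalence
    fp = (1 - spec) * (1 - prevalence)
    tn = spec * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


# Rounded point estimates quoted for Barnetson 2006 (primary analysis).
lr_pos, lr_neg = likelihood_ratios(0.667, 0.925)
# Illustrative prevalence only (assumption): 2.8% of CRC.
ppv, npv = predictive_values(0.667, 0.925, 0.028)
print(round(lr_pos, 2), round(lr_neg, 2), round(ppv, 3), round(npv, 3))
```

Unlike the likelihood ratios, the predictive values depend on prevalence, which is one reason they were estimated only for samples that were not selected on the basis of high risk.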

Discussion
This systematic review was conducted by an independent, experienced research team working to a prespecified protocol. It is notable that, of the nine study samples included in this review, only one (Poynter 2008) appeared to recruit an unselected CRC population, although even in this study it was not clear how the participants were selected [21]. This paucity of large, unselected, population-based studies is unsurprising; it is costly to provide all participants with both the index test and the reference standard. However, it would be methodologically acceptable to perform the reference standard on a random sample of index test negatives (and all index test positives), and this would decrease costs. Despite improvements in reference standard techniques and, therefore, stricter definitions of the reference standard in this review compared with previous reviews, a similar range of test accuracy estimates was found [4,7,9,11]. Indeed, across all nine study samples, sensitivity ranged from 66.7% (95% CI 47.2, 82.7) to 100.0% (95% CI 93.9, 100.0) in primary analyses. This is broadly in line with previous reviews [9,11]. In the three population-based studies identified in this review, specificity estimates ranged from 61.1% (95% CI 57.0, 65.1) in Poynter 2008 to 92.5% (95% CI 89.1, 95.2) in Barnetson 2006 [16, 21].
Table 2 (risk of bias assessment, QUADAS-2) criteria
Patient selection: Was a consecutive or random sample of patients enrolled? Did the study avoid inappropriate exclusions? Could the selection of patients have introduced bias? Is there concern that the included patients do not match the review question?
Index test: Is there concern that the index test, its conduct, or interpretation differ from the review question?
Reference standard: Is the reference standard likely to correctly classify the target condition? Were the reference standard results interpreted without knowledge of the results of the index test? Could the reference standard, its conduct, or its interpretation have introduced bias? Is there concern that the target condition as defined by the reference standard does not match the review question?
Flow and timing: Was there an appropriate interval between index test(s) and reference standard? Did all patients receive a reference standard? Did patients receive the same reference standard? Were all patients included in the analysis? Could the patient flow have introduced bias?
Notes: a Poynter was assessed twice because data were reported for both a population-based sample and a high-risk sample; b A case-control design was only avoided because there was no control group (half a case control study); c An unbiased estimate of sensitivity (but not specificity) can be ascertained from this study design, however an unclear rating is given because it is not clear if a consecutive or random sample was recruited
Key: L = low, N = no, U = unclear, Y = yes
Across all included studies, sensitivity estimates did not appear to be greatly impacted by the type of population; the estimates generated from the high-risk samples [17][18][19][20][21][22] were not obviously dissimilar from the population-based samples [16,21,23]. Similar results have been noted in a previous review, where it was suggested that spectrum bias may not be an issue for estimating sensitivity of MSI testing [9]. There is, however, good evidence that MSI prevalence in sporadic cancers increases with age [28], which could result in increased false positive results in older populations and, therefore, impact upon specificity estimates [29].
There are many possible explanations for the between-study differences in the sensitivity and specificity estimates. It is clear from the current review that MSI testing is not a universally standard procedure; differences exist in the way in which these tests are performed (including different thresholds used in each study to denote an MSI positive result, in the specific genetic testing procedures used for the reference standard, and in the different panels of MSI markers used in each study). These differences may impact upon the number of false negatives and false positives generated, and therefore the sensitivity and specificity of the test. As there is always a trade-off between sensitivity and specificity, and this would likely be influenced by differences in the MSI testing procedures, it is important to consider whether sensitivity should be maximized at the expense of specificity or vice versa. There was not a sufficient body of high-quality evidence to conduct meta-regression, which could make it possible to predict the balance of sensitivity and specificity for a given set of test characteristics, but it is unavoidable that using a lower threshold (e.g. including MSI-L as a positive test result) would increase sensitivity at the expense of specificity.

Fig. 2 panels: MSI-L as a negative result; MSI-L as a positive result
Because an MSI test is a triage test rather than a definitive diagnostic test, and assuming an aim of maximising the number of individuals with Lynch syndrome who eventually receive a correct diagnosis, it would be preferable to maximize sensitivity (i.e. try to minimize false negative MSI results at the expense of false positives). False positive results from MSI testing are likely to be corrected by subsequent testing, while false negative results are unlikely to be corrected until there is another cancer in the individual or their family. However, false positive results can still have direct and undesirable effects on health (e.g., anxiety related to genetic counselling and genetic testing) and also on the health service (due to the cost of unnecessary testing), and in some patients genetic testing may not be conclusive, which can lead to difficulties in identifying appropriate clinical management [24]. Indeed, it is virtually impossible to estimate the relative harms of false negative and false positive results without some form of evidence synthesis approach, such as decision modelling [30].
One of the key strengths of the current systematic review is that it included studies from a range of populations and using a range of different MSI testing strategies (with different panels of markers and thresholds). Diagnostic test accuracy estimates are provided for each of the included studies rather than providing pooled estimates that may not apply to a particular population or testing strategy. For clinicians, patients, academics and anyone else wanting estimates of the sensitivity and specificity of MSI testing for identifying pathogenic Lynch syndrome mutations, it may be prudent to look at the estimates from studies whose samples and testing methods are most similar to the population of interest rather than using pooled estimates from heterogeneous studies. Unfortunately, the paucity of similar studies precluded the statistical investigation of factors impacting sensitivity and specificity estimates (e.g., markers, thresholds). If future studies and analyses accurately quantify the trade-off between sensitivity and specificity according to test characteristics, a decision modelling approach could be used to select the appropriate characteristics to maximise the desired objective (e.g., cost-effectiveness).

Conclusion
MSI testing is an effective screening test for Lynch syndrome. However, there is a paucity of studies that evaluate test accuracy in unselected, population-based samples. In addition, the studies that were identified in this review displayed heterogeneity in both the MSI and reference standard testing methods. As such, there is significant uncertainty surrounding what balance of sensitivity and specificity will be achieved in clinical practice and how this relates to specific characteristics of the test (such as the panel of markers used or the thresholds used to denote a positive test).