Matters Arising | Open access

Kappa statistic considerations in evaluating inter-rater reliability between two raters: which, when and context matters

The original Research article to which this comment refers was published on 07 December 2022

Abstract

Background

In research designs that rely on observational ratings provided by two raters, assessing inter-rater reliability (IRR) is a frequently required task. However, some studies misapply the relevant statistical procedures, omit information essential for interpreting their findings, or fail to address how IRR affects the statistical power of subsequent hypothesis tests.

Methods

This article delves into the recent publication by Liu et al. in BMC Cancer, analyzing the controversy surrounding the Kappa statistic and methodological issues concerning the assessment of IRR. The primary focus is on the appropriate selection of Kappa statistics, as well as the computation, interpretation, and reporting of two frequently used IRR statistics when there are two raters involved.

Results

Cohen’s Kappa is typically used to assess the level of agreement between two raters when there are two categories, or when the variable is an unordered categorical variable with three or more categories. In contrast, the weighted Kappa is the widely used measure for evaluating agreement between two raters on ordered categorical variables with three or more categories.

Conclusion

Despite not substantially affecting the findings of Liu et al.’s study, the statistical dispute underscores the significance of employing suitable statistical methods. Rigorous and accurate statistical results are crucial for producing trustworthy research.

Assessing inter-rater reliability (IRR) is a common requirement in many research designs, particularly for demonstrating consistency among observational ratings provided by two raters [1]. Unfortunately, some studies misuse statistical procedures, fail to report critical information necessary to interpret their results, or do not adequately address how IRR affects the power of subsequent analyses for hypothesis testing [2]. This Matters Arising paper examines the recent publication by Liu et al. in BMC Cancer [3], highlighting the controversy surrounding the Kappa statistic and methodological concerns related to IRR assessment. The focus is on the selection of appropriate Kappa statistics, as well as the computation, interpretation, and reporting of two commonly used IRR statistics between two raters.

Kappa statistic

Classical statistical techniques such as the Kappa statistic, which encompasses Cohen’s Kappa and its adaptations, are typically used to evaluate IRR for categorical data, whether nominal or ordinal.

Cohen’s kappa

Cohen’s Kappa [4] is a frequently employed classical statistical method for assessing IRR, and it is suitable only for fully-crossed designs with exactly two raters. It is commonly used for two raters when there are two categories, or for unordered categorical variables with three or more categories [5, 6].

Ordered variables, also known as ordinal variables, possess a natural ordering or hierarchy among their categories: the categories can be ranked in a meaningful way based on the magnitude or intensity of the variable being measured. Unordered variables, also known as nominal variables, have no inherent ordering or ranking among their categories. A Likert scale is a common example of an ordered variable: respondents rate their agreement or disagreement on a scale typically ranging from “strongly disagree” to “strongly agree,” for instance “1. Strongly Disagree, 2. Disagree, 3. Neutral, 4. Agree, 5. Strongly Agree.” The categories have a logical sequence and can be ranked by level of agreement. By contrast, a variable such as eye color, with categories such as blue, green, brown, and hazel, has no natural order or hierarchy among its categories and is therefore an unordered variable.

Cohen’s Kappa is calculated as follows:

$$k_{C}=\frac{\sum_{j=1}^{n}u_{jj}\left(ii^{\prime}\right)-\sum_{j=1}^{n}p_{ij}\,p_{i^{\prime}j}}{1-\sum_{j=1}^{n}p_{ij}\,p_{i^{\prime}j}}$$
(1)

Here, \(u_{jj}\left(ii^{\prime}\right)\) is the proportion of objects placed in the same category \(j\) by both raters \(i\) and \(i^{\prime}\), and \(p_{ij}\) is the proportion of objects that rater \(i\) assigned to category \(j\).
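
To make Eq. (1) concrete, the following is a minimal Python sketch (not part of the original article) that computes Cohen’s Kappa from a square confusion matrix of counts; the 3 × 3 matrix in the example is purely hypothetical and is not drawn from Liu et al.’s data.

```python
# Minimal sketch of Cohen's Kappa (Eq. 1): observed agreement on the diagonal
# versus agreement expected by chance from the two raters' marginal proportions.
import numpy as np

def cohens_kappa(confusion: np.ndarray) -> float:
    """Cohen's Kappa from an n x n confusion matrix (rows: rater i, columns: rater i')."""
    p = confusion / confusion.sum()            # cell proportions
    observed = np.trace(p)                     # sum_j u_jj(ii'): objects placed in the same category
    expected = p.sum(axis=1) @ p.sum(axis=0)   # sum_j p_ij * p_i'j: chance agreement
    return (observed - expected) / (1.0 - expected)

# Hypothetical 3-category example (e.g. two raters labeling PR / SD / PD)
counts = np.array([[20, 5, 2],
                   [4, 15, 6],
                   [1, 3, 14]])
print(round(cohens_kappa(counts), 3))
```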

One limitation of Cohen’s Kappa is its sensitivity to the prevalence of agreement in the data. When the categories being rated are imbalanced, or when one category is highly prevalent, Cohen’s Kappa tends to be biased and may not accurately reflect the true agreement between raters. Another limitation is that Cohen’s Kappa assumes the raters are independent, meaning their ratings are not influenced by each other; when raters do influence each other’s ratings, agreement estimates can be inflated [7].

On the other hand, Cohen’s Kappa has several advantages. It accounts for agreement that would occur by chance, providing a more accurate measure of agreement between raters than simple percent agreement [8], quantifying agreement beyond chance by weighing the observed agreement against the agreement expected by chance. Additionally, Cohen’s Kappa is applicable to categorical variables with two or more categories, making it a versatile measure across a wide range of research fields. Researchers should be aware of the limitations of Cohen’s Kappa and consider alternative measures, such as weighted Kappa, when dealing with imbalanced data or potential rater dependency.

Weighted kappa

In cases where the level of agreement between two raters must be evaluated for ordered categorical variables with three or more categories, the weighted Kappa is frequently used [9]. Weighted Kappa comes in two forms: the linear weighted Kappa (LWK) [10] and the quadratic weighted Kappa (QWK) [11]. The LWK extends Cohen’s Kappa by assigning different weights to categories of agreement and disagreement based on the linear distance between the categories on the rating scale [10]. In contrast, the QWK assigns weights based on the quadratic distance between the categories, allowing a more nuanced analysis of the agreement between raters [11]. Both LWK and QWK are valuable measures of IRR because they provide more information about the agreement between raters than Cohen’s Kappa; the choice between them depends on the specific situation and the data being analyzed. Reporting both LWK and QWK coefficients is recommended when not all disagreements carry equal weight, as this provides a more comprehensive picture of the distribution of disagreements [12]. Doing so ensures a more accurate and detailed evaluation of the consistency and reliability of the data, which is particularly crucial for complex datasets [13].

Weighted Kappa is calculated as follows:

$$w_{ij}^{\left(m\right)}=1-\left(\frac{\left|i-j\right|}{n-1}\right)^{m}$$
(2)
$$k_{m}=1-\frac{1-\sum_{i=1}^{n}\sum_{j=1}^{n}w_{ij}^{\left(m\right)}p_{ij}}{1-\sum_{i=1}^{n}\sum_{j=1}^{n}w_{ij}^{\left(m\right)}p_{i}q_{j}}$$
(3)

Where \(m\ge 1\), and \(p\) and \(q\) are the marginal relative frequencies, i.e., proportions of the total sample: \(p_{i}=\sum_{j=1}^{n}p_{ij}\) and \(q_{j}=\sum_{i=1}^{n}p_{ij}\). In the special cases \(m=1\) and \(m=2\), \(k_{1}\) is the LWK and \(k_{2}\) is the QWK.
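
For illustration, the sketch below implements Eqs. (2) and (3) for an arbitrary exponent m, so that m = 1 yields the LWK and m = 2 the QWK; as before, the confusion matrix is hypothetical and serves only to show the computation.

```python
# Minimal sketch of the weighted Kappa of Eqs. (2)-(3) with agreement weights
# w_ij = 1 - (|i - j| / (n - 1))^m; m = 1 gives the LWK, m = 2 the QWK.
import numpy as np

def weighted_kappa(confusion: np.ndarray, m: int = 1) -> float:
    n = confusion.shape[0]
    p = confusion / confusion.sum()                    # cell proportions p_ij
    i, j = np.indices((n, n))
    w = 1.0 - (np.abs(i - j) / (n - 1)) ** m           # weights w_ij^(m) from Eq. (2)
    p_i, q_j = p.sum(axis=1), p.sum(axis=0)            # marginal proportions
    observed = (w * p).sum()                           # weighted observed agreement
    expected = (w * np.outer(p_i, q_j)).sum()          # weighted agreement expected by chance
    return 1.0 - (1.0 - observed) / (1.0 - expected)   # Eq. (3)

counts = np.array([[20, 5, 2],
                   [4, 15, 6],
                   [1, 3, 14]])
print(round(weighted_kappa(counts, m=1), 3))  # linear weighted Kappa (LWK)
print(round(weighted_kappa(counts, m=2), 3))  # quadratic weighted Kappa (QWK)
```

Because the quadratic weights give more credit to disagreements between adjacent categories, the QWK often exceeds the LWK when most disagreements occur between neighboring categories.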

A limitation of weighted Kappa is its complexity and potential subjectivity in assigning weights [14]. The choice of weights relies on expert judgment or empirical evidence, and different weightings can lead to varying results. Additionally, weighted Kappa requires a clear understanding of the underlying data and the appropriate selection of weighting schemes, which can be challenging. However, weighted Kappa offers several advantages [15]. Firstly, it allows for a more nuanced analysis of agreement, taking into account the severity or importance of disagreements. This is particularly valuable when the categories have different levels of relevance or when certain disagreements are more critical than others. Secondly, weighted Kappa can be useful when dealing with ordinal or interval categorical variables, as it captures the inherent ordering of categories. It provides a more accurate representation of the agreement by considering the magnitude of disagreement. Overall, the advantages of using weighted Kappa lie in its ability to capture the relative importance of disagreements and provide a more comprehensive assessment of agreement. However, it requires careful consideration and application of appropriate weighting schemes, making it essential to interpret the results in conjunction with the specific context and research objectives.

The statistical controversy over Liu et al.’s article

Liver metastases occur in about 5% of newly diagnosed cancer patients and are associated with reduced survival. Treatment options include systemic chemotherapy, ablation, and surgery, depending on the stage and source of the metastasis. Radiological assessment using computed tomography (CT) or magnetic resonance imaging (MRI) is critical to treatment decisions, with MRI being superior for evaluating hepatic metastases and diffusion-weighted imaging (DWI) being useful for tumor assessment. The Response Evaluation Criteria in Solid Tumor 1.1 (RECIST 1.1) is the standard method for evaluating tumor response, but its application is subject to inter-reader variability and practical challenges. To address this, researchers have developed computer-aided systems for automated lesion segmentation. Liu et al. [3] proposed a deep learning-based liver metastases segmentation method, assessed treatment response according to RECIST 1.1, and compared the accuracy of automated segmentation with radiologists’ readings. While the authors’ conclusions have merit, their statistical approach requires further evaluation.

After reevaluating the Kappa values in the authors’ data, we identified statistical discrepancies in three groups: R1 vs. the reference standard in the testing dataset, R1 vs. the reference standard in the validation cohort, and R2 vs. the reference standard in the testing dataset (Table 1). In the testing dataset, the authors overestimated the agreement between R1 and the reference standard: our analysis indicated fair agreement, with an LWK of 0.38 and a QWK of 0.40, whereas the authors reported moderate agreement with a Kappa value of 0.48. In the validation cohort, our analysis indicated substantial agreement between R1 and the reference standard, with an LWK of 0.67 and a QWK of 0.75, differing from the authors’ reported Kappa value of 0.63. For R2 vs. the reference standard in the testing dataset, the IRR showed no statistically significant agreement (p > 0.05), contradicting the authors’ reported fair agreement with a Kappa value of 0.30; we suggest that the authors provide further clarification. For the other three groups in Table 1, our linear weighted Kappa values were consistent with the authors’ reported values.

Table 1 The confusion matrix of the response assessment results with respect to reference standard and the IRRs of treatment response assessment
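
A reanalysis of this kind can also be cross-checked with standard software; the sketch below uses scikit-learn’s cohen_kappa_score, which is an assumed tooling choice rather than the method used by either Liu et al. or this reassessment, and the label vectors are hypothetical stand-ins rather than the data summarized in Table 1.

```python
# Unweighted, linear-weighted, and quadratic-weighted Kappa from two raters' labels,
# computed with scikit-learn as an independent cross-check.
from sklearn.metrics import cohen_kappa_score

# Hypothetical response labels for illustration only: 0 = PR, 1 = SD, 2 = PD
rater = [0, 0, 1, 2, 1, 1, 2, 0, 2, 1]
reference = [0, 1, 1, 2, 1, 0, 2, 0, 2, 2]

print(cohen_kappa_score(rater, reference))                        # Cohen's Kappa
print(cohen_kappa_score(rater, reference, weights="linear"))      # LWK
print(cohen_kappa_score(rater, reference, weights="quadratic"))   # QWK
```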

Conclusion

In summary, Cohen’s Kappa is appropriate for assessing agreement between two raters when there are two categories, or when the variable is an unordered (nominal) categorical variable with three or more categories. Weighted Kappa, specifically the LWK or QWK, is employed for ordered categorical variables with three or more categories, as it accounts for the magnitude of agreement and disagreement. The selection between Cohen’s Kappa and weighted Kappa depends on whether the categories are nominal or ordered, and on whether the research question requires a nuanced analysis of agreement and disagreement.

Liu et al. [3] assessed the level of agreement between two raters for an ordered categorical variable comprising three categories: PR, SD, and PD. Weighted Kappa is therefore a more appropriate choice than Cohen’s Kappa in this scenario. After reevaluating the Kappa values, we found discrepancies in three groups: R1 vs. the reference standard in the testing dataset and validation cohort, and R2 vs. the reference standard in the testing dataset. The authors overestimated the agreement between R1 and the reference standard in the testing dataset, where our analysis showed fair rather than moderate agreement. In the validation cohort, our analysis indicated substantial agreement for R1, with an LWK of 0.67 and a QWK of 0.75, differing from the authors’ reported Kappa value of 0.63. Moreover, there was no statistically significant agreement between R2 and the reference standard, contradicting the authors’ report of fair agreement with a Kappa value of 0.30. Although the statistical controversy does not substantially affect the conclusions of Liu et al.’s paper, it underscores the importance of addressing and resolving such methodological issues.

Data Availability

Not applicable.

Abbreviations

IRR:

Inter-rater reliability

CT:

Computed tomography

MRI:

Magnetic resonance imaging

DWI:

Diffusion-weighted imaging

RECIST:

Response Evaluation Criteria in Solid Tumor

LWK:

Linear weighted Kappa

QWK:

Quadratic weighted Kappa

R1:

An attending radiologist with 8 years’ reading experience

R2:

A fellow radiologist with 4 years’ reading experience

κc :

Cohen’s Kappa value

κlw :

Linear weighted Kappa Value

κqw :

Quadratic weighted Kappa value

κ*:

Kappa value calculated by Liu et al

CI:

Confidence interval

PR:

Partial response

SD:

Stable disease

PD:

Progressive disease

References

  1. Alavi M, Biros E, Cleary M. A primer of inter-rater reliability in clinical measurement studies: pros and pitfalls. J Clin Nurs. 2022;31(23–24):e39–e42.

  2. Hughes J. Sklar’s omega: a gaussian copula-based framework for assessing agreement. Stat Comput. 2022;32(3):46.

  3. Liu X, Wang R, Zhu Z, Wang K, Gao Y, Li J, et al. Automatic segmentation of hepatic metastases on DWI images based on a deep learning method: assessment of tumor treatment response according to the RECIST 1.1 criteria. BMC Cancer. 2022;22(1):1285.

  4. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20(1):37–46.

  5. Kim JW, Park SH, Choi SA, Kim SK, Koh EJ, Won JK, et al. Molecular subgrouping of medulloblastoma in pediatric population using the NanoString assay and comparison with immunohistochemistry methods. BMC Cancer. 2022;22(1):1221.

  6. Freitas-Junior R, de Oliveira VM, Frasson AL, Cavalcante FP, Mansani FP, Mattar A, et al. Management of early-stage triple-negative breast cancer: recommendations of a panel of experts from the Brazilian Society of Mastology. BMC Cancer. 2022;22(1):1201.

  7. Byrt T, Bishop J, Carlin JB. Bias, prevalence and kappa. J Clin Epidemiol. 1993;46(5):423–9.

  8. McHugh ML. Interrater reliability: the Kappa statistic. Biochem Med (Zagreb). 2012;22:276–82.

  9. Oda Y, Tanaka K, Hirose T, Hasegawa T, Hiruta N, Hisaoka M, et al. Standardization of evaluation method and prognostic significance of histological response to preoperative chemotherapy in high-grade non-round cell soft tissue sarcomas. BMC Cancer. 2022;22(1):94.

  10. Cicchetti DV, Allison T. A new procedure for assessing reliability of scoring EEG sleep recordings. Am J EEG Technol. 1971;11(3):101–10.

  11. Fleiss JL, Cohen J. The equivalence of weighted Kappa and the intraclass correlation coefficient as measures of reliability. Educ Psychol Meas. 1973;33(3):613–9.

  12. Vanbelle S. A new interpretation of the weighted kappa coefficients. Psychometrika. 2016;81(2):399–410.

  13. Bayram KB, Şengül İ, Aşkin A, Tosun A. Inter-rater reliability of the Australian Spasticity Assessment Scale in poststroke spasticity. Int J Rehabil Res. 2022;45(1):86–92.

  14. Feinstein AR, Cicchetti DV. High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol. 1990;43(6):543–9.

  15. Cohen J. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull. 1968;70(4):213–20.

Acknowledgements

Not applicable.

Funding

This work was supported by the Chinese Ministry of Education “Chunhui Plan” International Scientific Research Cooperation Project (HLJ2019017), the Fundamental Research Funds in Heilongjiang Provincial Universities (145209121), and the Heilongjiang Province Leading Talent Echelon Reserve Leader Funding Project. The funding bodies played no role in the design of the study; the collection, analysis, and interpretation of data; or the writing of the manuscript.

Author information

Authors and Affiliations

Authors

Contributions

ML wrote the original draft of the manuscript. QG was involved in the analysis and interpretation of the data. TY contributed to the conception and design of the study. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Tianfei Yu.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Li, M., Gao, Q. & Yu, T. Kappa statistic considerations in evaluating inter-rater reliability between two raters: which, when and context matters. BMC Cancer 23, 799 (2023). https://doi.org/10.1186/s12885-023-11325-z

  • DOI: https://doi.org/10.1186/s12885-023-11325-z

Keywords