Interrater reliability: the kappa statistic

Cited by: 12,978
Author
McHugh, Mary L. [1 ]
Affiliation
[1] National University, Department of Nursing, San Diego, CA, USA
Keywords
kappa; reliability; rater; interrater
DOI
10.11613/bm.2012.031
CLC Classification
R446 [Laboratory Diagnosis]; R-33 [Experimental Medicine, Medical Experiments]
Discipline Code
1001
Abstract
The kappa statistic is frequently used to test interrater reliability. The importance of rater reliability lies in the fact that it represents the extent to which the data collected in the study are correct representations of the variables measured. Measurement of the extent to which data collectors (raters) assign the same score to the same variable is called interrater reliability. While a variety of methods for measuring interrater reliability exist, it was traditionally measured as percent agreement, calculated as the number of agreement scores divided by the total number of scores. In 1960, Jacob Cohen critiqued the use of percent agreement because it does not account for chance agreement. He introduced Cohen's kappa, developed to account for the possibility that raters actually guess on at least some variables due to uncertainty. Like most correlation statistics, kappa can range from -1 to +1. Although kappa is one of the most commonly used statistics for testing interrater reliability, it has limitations, and judgments about what level of kappa should be acceptable for health research remain open to question. Cohen's suggested interpretation may be too lenient for health-related studies because it implies that a score as low as 0.41 might be acceptable. Kappa and percent agreement are compared, and levels of both that should be demanded in healthcare studies are suggested.
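A minimal sketch in Python of the two statistics the abstract contrasts. Kappa is defined as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed percent agreement and p_e is the agreement expected by chance from each rater's marginal category frequencies. The rating data below are hypothetical illustrations, not taken from the paper.

    # Percent agreement and Cohen's kappa for two raters (hypothetical data).
    from collections import Counter

    rater_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "neg", "neg"]
    rater_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "neg"]
    n = len(rater_a)

    # Percent agreement: fraction of items scored identically by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal frequencies per category.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(rater_a) | set(rater_b))

    # Cohen's kappa: chance-corrected agreement, ranging from -1 to +1.
    kappa = (p_o - p_e) / (1 - p_e)
    print(f"percent agreement = {p_o:.2f}, kappa = {kappa:.2f}")

On this illustrative data the raters agree on 80% of items, yet kappa is only about 0.58; that gap between raw agreement and chance-corrected agreement is exactly the distinction the abstract highlights.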
Pages: 276-282
Page count: 7
References
11 items
[1] Bluestein D. American Family Physician. 2008;78:1186.
[2] Bonnyman AM, Webber CE, Stratford PW, MacIntyre NJ. Intrarater reliability of dual-energy X-ray absorptiometry-based measures of vertebral height in postmenopausal women. Journal of Clinical Densitometry. 2012;15(4):405-412.
[3] Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement. 1960;20(1):37-46.
[4] Fahey MT, Irwig L, Macaskill P. Meta-analysis of Pap test accuracy. American Journal of Epidemiology. 1995;141(7):680-689.
[5] Kottner J, Halfens R, Dassen T. An interrater reliability study of the assessment of pressure ulcer risk using the Braden scale and the classification of pressure ulcers in a home care setting. International Journal of Nursing Studies. 2009;46(10):1307-1312.
[6] Marston L. Introductory Statistics for Health and Nursing Using SPSS. 2010.
[7] Marusteri M. Biochemia Medica. 2010;20:15.
[8] Simundic AM. Biochemia Medica. 2008;18:154.
[9] Simundic AM, Nikolac N, Ivankovic V, Ferenec-Ruzic D, Magdic B, Kvaternik M, Topic E. Comparison of visual vs. automated detection of lipemic, icteric and hemolyzed specimens: can we rely on a human eye? Clinical Chemistry and Laboratory Medicine. 2009;47(11):1361-1365.
[10] Stemler SE. Practical Assessment, Research and Evaluation. 2004;9:4. DOI: 10.7275/96jp-xz07.