One can distinguish between two possible uses of kappa: as a way to test rater independence (i.e., as a test statistic), and as a way to quantify the level of agreement (i.e., as an effect-size measure). The first use involves testing the null hypothesis that there is no more agreement than might occur by chance given random guessing; that is, one makes a qualitative, "yes or no" decision about whether raters are independent or not. Kappa is appropriate for this purpose (although to know that raters are not independent is not very informative; raters are dependent by definition, inasmuch as they are rating the same cases).
It is the second use of kappa--quantifying actual levels of agreement--that is the source of concern. Kappa's calculation uses a term called the proportion of chance (or expected) agreement. This is interpreted as the proportion of times raters would agree by chance alone. However, the term is relevant only under the conditions of statistical independence of raters. Since raters are clearly not independent, the relevance of this term, and its appropriateness as a correction to actual agreement levels, is very questionable.
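To make the role of this term concrete, here is a minimal Python sketch of the usual two-rater calculation (the function name and the 2x2 table are illustrative, not taken from any particular study). The expected-agreement term p_e is the sum, over categories, of the products of the two raters' marginal proportions, which is exactly the agreement rate that would arise if the raters were statistically independent.

```python
import numpy as np

def cohens_kappa(table):
    """Cohen's kappa for two raters from a square cross-classification table.

    table[i][j] = number of cases that rater A placed in category i
    and rater B placed in category j.
    """
    t = np.asarray(table, dtype=float)
    n = t.sum()
    p_o = np.trace(t) / n        # observed proportion of agreement
    row = t.sum(axis=1) / n      # rater A's marginal proportions
    col = t.sum(axis=0) / n      # rater B's marginal proportions
    p_e = (row * col).sum()      # expected ("chance") agreement under independence
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical 2x2 table: 100 cases rated positive/negative by two raters
table = [[40, 10],
         [ 5, 45]]
print(cohens_kappa(table))       # p_o = 0.85, p_e = 0.50, kappa = 0.70
```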
Thus, the common statement that kappa is a "chance-corrected measure of agreement" is misleading. As a test statistic, kappa can verify that agreement exceeds chance levels. But as a measure of the level of agreement, kappa is not "chance-corrected"; indeed, in the absence of some explicit model of rater decision making, it is by no means clear how chance affects the decisions of actual raters and how one might correct for it.
A better case for using kappa to quantify rater agreement is that, under certain conditions, it approximates the intra-class correlation. But this too is problematic in that (1) these conditions are not always met, and (2) one could instead directly calculate the intra-class correlation.
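For comparison, the intra-class correlation can be computed directly from the raw ratings rather than approximated through kappa. The sketch below implements the one-way random-effects form, ICC(1,1), from the standard ANOVA mean squares; the binary ratings are made up purely for illustration.

```python
import numpy as np

def icc_oneway(ratings):
    """One-way random-effects intraclass correlation, ICC(1,1).

    ratings: n_subjects x n_raters array of numeric ratings.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand_mean = x.mean()
    subj_means = x.mean(axis=1)
    # Mean squares from a one-way ANOVA with subjects as the grouping factor
    ms_between = k * ((subj_means - grand_mean) ** 2).sum() / (n - 1)
    ms_within = ((x - subj_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical binary (0/1) ratings from two raters on ten cases
ratings = [[1, 1], [0, 0], [1, 1], [0, 1], [1, 1],
           [0, 0], [1, 0], [0, 0], [1, 1], [0, 0]]
print(icc_oneway(ratings))
```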
Pros and Cons
Pros
- Kappa statistics are easily calculated and software is readily available (e.g., SAS PROC FREQ).
- Kappa statistics are appropriate for testing whether agreement exceeds chance levels for binary and nominal ratings.
Cons
- Kappa is not really a chance-corrected measure of agreement (see above).
- Kappa is an omnibus index of agreement. It does not make distinctions among various types and sources of disagreement.
- Kappa is influenced by trait prevalence (distribution) and base rates. As a result, kappas are seldom comparable across studies, procedures, or populations (Thompson & Walter, 1988; Feinstein & Cicchetti, 1990); a numerical sketch after this list illustrates the effect.
- Kappa may be low even though there are high levels of agreement and even though individual ratings are accurate. Whether a given kappa value implies a good or a bad rating system or diagnostic method depends on what model one assumes about the decision making of raters (Uebersax, 1988).
- With ordered category data, one must select weights arbitrarily to calculate weighted kappa (Maclure & Willett, 1987); a second sketch after this list shows how the choice of weights changes the result.
- Kappa requires that the two raters (or procedures) use the same rating categories. There are situations where one is interested in measuring the consistency of ratings made on different scales (e.g., one rater uses a scale of 1 to 3, another a scale of 1 to 5), and kappa cannot accommodate this.
- Tables that purport to categorize ranges of kappa as "good," "fair," "poor" etc. are inappropriate; do not use them.
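To illustrate the base-rate point above with numbers (the tables and values here are hypothetical), consider two cross-classifications that both show 90% observed agreement. Because the marginal distributions differ, the resulting kappa values differ sharply.

```python
import numpy as np

def kappa(table):
    t = np.asarray(table, dtype=float)
    n = t.sum()
    p_o = np.trace(t) / n
    p_e = float((t.sum(axis=1) / n) @ (t.sum(axis=0) / n))
    return (p_o - p_e) / (1.0 - p_e)

# Both hypothetical tables show 90% observed agreement.
balanced = [[45,  5],
            [ 5, 45]]          # trait present in about half the cases
skewed   = [[85,  5],
            [ 5,  5]]          # trait present in about 90% of the cases
print(kappa(balanced))         # kappa = 0.80
print(kappa(skewed))           # kappa is approximately 0.44
```

The same observed agreement thus yields very different kappas depending on prevalence, which is why comparisons of kappa across studies, procedures, or populations are hazardous.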
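The weighting issue can be illustrated the same way. The sketch below (again with a made-up table) computes weighted kappa for three ordered categories using the agreement-weight formulation; switching from linear to quadratic weights changes the value, even though the data are identical.

```python
import numpy as np

def weighted_kappa(table, weights="linear"):
    """Weighted kappa for ordered categories (agreement-weight formulation)."""
    t = np.asarray(table, dtype=float)
    k = t.shape[0]
    p = t / t.sum()
    i, j = np.indices((k, k))
    if weights == "linear":
        w = 1.0 - np.abs(i - j) / (k - 1)
    else:  # quadratic
        w = 1.0 - ((i - j) / (k - 1)) ** 2
    expected = np.outer(p.sum(axis=1), p.sum(axis=0))  # independence model
    p_o = (w * p).sum()
    p_e = (w * expected).sum()
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical 3-category ordinal ratings from two raters
table = [[20,  5,  1],
         [ 4, 15,  6],
         [ 1,  3, 10]]
print(weighted_kappa(table, "linear"))     # about 0.60
print(weighted_kappa(table, "quadratic"))  # about 0.67
```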