Cohen's kappa coefficient is a statistical measure of
inter-rater agreement (or inter-annotator agreement) for qualitative
(categorical) items. It is generally thought to be a more robust measure than
a simple percent-agreement calculation, since κ takes into account the agreement
occurring by chance. Some researchers[2][citation needed] have expressed
concern over κ's tendency to take the observed categories' frequencies as
givens, which can have the effect of underestimating agreement for a category
that is also commonly used; for this reason, κ is considered an overly
conservative measure of agreement.
Others[3][citation needed] contest the assertion that kappa
"takes into account" chance agreement. To do this effectively would
require an explicit model of how chance affects rater decisions. The so-called
chance adjustment of kappa statistics supposes that, when not completely
certain, raters simply guess—a very unrealistic scenario.
Calculation
Cohen's kappa measures the agreement between two raters who
each classify N items into C mutually exclusive categories. The first mention
of a kappa-like statistic is attributed to Galton (1892).
The equation for κ is:
\kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}
where Pr(a) is the relative observed agreement among raters,
and Pr(e) is the hypothetical probability of chance agreement, using the
observed data to calculate the probabilities of each observer randomly saying
each category. If the raters are in complete agreement then κ = 1. If there is
no agreement among the raters other than what would be expected by chance (as
defined by Pr(e)), κ = 0.
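As an illustration (a minimal sketch, not part of the original formulation; the function name and the plain nested-list representation of the confusion matrix are assumptions of this sketch), the definition translates directly into code:

def cohen_kappa(confusion):
    # confusion[i][j] = number of items rater A put in category i
    # and rater B put in category j (a square list of lists of counts).
    n = sum(sum(row) for row in confusion)                      # total items
    k = len(confusion)                                          # number of categories
    p_a = sum(confusion[i][i] for i in range(k)) / n            # observed agreement Pr(a)
    row_marg = [sum(confusion[i]) / n for i in range(k)]        # rater A's proportions
    col_marg = [sum(confusion[i][j] for i in range(k)) / n
                for j in range(k)]                              # rater B's proportions
    p_e = sum(row_marg[i] * col_marg[i] for i in range(k))      # chance agreement Pr(e)
    return (p_a - p_e) / (1 - p_e)

print(cohen_kappa([[20, 5], [10, 15]]))   # about 0.40 for the example worked out below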
The seminal paper introducing kappa as a new technique was
published by Jacob Cohen in the journal Educational and Psychological
Measurement in 1960.
A similar statistic, called pi, was proposed by Scott
(1955). Cohen's kappa and Scott's pi differ in terms of how Pr(e) is
calculated.
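For comparison (a hedged note in my own notation, where p_{c,A} and p_{c,B} denote the proportion of items that raters A and B each assign to category c): Cohen's kappa takes the chance agreement from the product of each rater's own marginals, whereas Scott's pi pools the two raters' marginals before squaring:
\Pr(e)_{\kappa} = \sum_{c} p_{c,A}\,p_{c,B} \qquad \Pr(e)_{\pi} = \sum_{c} \left(\frac{p_{c,A}+p_{c,B}}{2}\right)^2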
Note that Cohen's kappa measures agreement between two
raters only. For a similar measure of agreement (Fleiss' kappa) used when there
are more than two raters, see Fleiss (1971). The Fleiss kappa, however, is a
multi-rater generalization of Scott's pi statistic, not Cohen's kappa.
Example
Suppose that you were analyzing data related to a group of
50 people applying for a grant. Each grant proposal was read by two readers and
each reader either said "Yes" or "No" to the proposal.
Suppose the data were as follows, where rows are reader A and columns are
reader B:
              Reader B
              Yes    No
Reader A Yes   20     5
         No    10    15
Note that there were 20 proposals that were granted by both
reader A and reader B, and 15 proposals that were rejected by both readers.
Thus, the observed percentage agreement is Pr(a) = (20 + 15) / 50 = 0.70.
To calculate Pr(e) (the probability of random agreement) we
note that:
Reader A said "Yes" to 25 applicants and
"No" to 25 applicants. Thus reader A said "Yes" 50% of the
time.
Reader B said "Yes" to 30 applicants and
"No" to 20 applicants. Thus reader B said "Yes" 60% of the
time.
Therefore the probability that both of them would say
"Yes" randomly is 0.50 · 0.60 = 0.30 and the probability that both of
them would say "No" is 0.50 · 0.40 = 0.20. Thus the overall
probability of random agreement is Pr(e) = 0.3 + 0.2 = 0.5.
So now applying our formula for Cohen's Kappa we get:
\kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)} = \frac{0.70 - 0.50}{1 - 0.50} = 0.40
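As a quick numerical check (a sketch only; the variable names are mine), the same arithmetic in Python:

# Counts from the example: rows = reader A, columns = reader B.
yes_yes, yes_no, no_yes, no_no = 20, 5, 10, 15
n = yes_yes + yes_no + no_yes + no_no              # 50 proposals

p_a = (yes_yes + no_no) / n                        # observed agreement: 0.70
a_yes = (yes_yes + yes_no) / n                     # reader A says "Yes" 50% of the time
b_yes = (yes_yes + no_yes) / n                     # reader B says "Yes" 60% of the time
p_e = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)    # chance agreement: 0.30 + 0.20 = 0.50

kappa = (p_a - p_e) / (1 - p_e)                    # (0.70 - 0.50) / 0.50
print(kappa)                                       # about 0.40 (floating-point rounding aside)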
Same percentages but different numbers
A case sometimes considered to be a problem with Cohen's
kappa occurs when comparing the kappa calculated for two pairs of raters where
the two raters in each pair have the same percentage agreement, but one pair
gives a similar number of ratings in each class while the other pair gives a
very different number of ratings.[6] For instance, in the following two cases
there is equal agreement between A and B (60 out of 100 in both cases), so we
would expect the relative values of Cohen's kappa to reflect this. However,
calculating Cohen's kappa for each:
              Reader B
              Yes    No
Reader A Yes   45    15
         No    25    15
\kappa = \frac{0.60-0.54}{1-0.54} = 0.1304
              Reader B
              Yes    No
Reader A Yes   25    35
         No     5    35
\kappa = \frac{0.60-0.46}{1-0.46} = 0.2593
we find that kappa shows greater similarity between A and B in
the second case than in the first.
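To make the comparison concrete, the two values can be reproduced from the raw counts; the helper function below is a sketch (its name and layout are my own):

def kappa_2x2(yes_yes, yes_no, no_yes, no_no):
    # Rows = rater A, columns = rater B; arguments are counts of items.
    n = yes_yes + yes_no + no_yes + no_no
    p_a = (yes_yes + no_no) / n
    a_yes = (yes_yes + yes_no) / n
    b_yes = (yes_yes + no_yes) / n
    p_e = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    return (p_a - p_e) / (1 - p_e)

print(kappa_2x2(45, 15, 25, 15))   # first table:  about 0.1304
print(kappa_2x2(25, 35, 5, 35))    # second table: about 0.2593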
Significance and magnitude
Statistical significance makes no claim about how important
the magnitude is in a given application, or about what is considered high or
low agreement.
Statistical significance for kappa is rarely reported,
probably because even relatively low values of kappa can nonetheless be
significantly different from zero but not of sufficient magnitude to satisfy
investigators. Still, its standard error has been described and is computed by various computer programs.
If statistical significance is not a useful guide, what
magnitude of kappa reflects adequate agreement? Guidelines would be helpful,
but factors other than agreement can influence its magnitude, which makes
interpretation of a given magnitude problematic. As Sim and Wright noted, two
important factors are prevalence (are the codes equiprobable or do their probabilities
vary) and bias (are the marginal probabilities for the two observers similar or
different). Other things being equal, kappas are higher when codes are
equiprobable. On the other hand, kappas are higher when codes are distributed
asymmetrically by the two observers. In contrast to probability variations, the
effect of bias is greater when kappa is small than when it is large.
Another factor is the number of codes. As the number of codes
increases, kappas become higher. Based on a simulation study, Bakeman and
colleagues concluded that for fallible observers, values for kappa were lower
when codes were fewer. And, in agreement with Sim & Wright's statement
concerning prevalence, kappas were higher when codes were roughly equiprobable.
Thus Bakeman et al. concluded that "no one value of kappa can be regarded
as universally acceptable."[11]:357 They also provide a computer program
that lets users compute values for kappa, specifying the number of codes, their
probability, and observer accuracy. For example, given equiprobable codes and
observers who are 85% accurate, values of kappa are 0.49, 0.60, 0.66, and 0.69
when the number of codes is 2, 3, 5, and 10, respectively.
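These figures can be reproduced under one simple model of a fallible observer, which I assume here for illustration (it is not necessarily the exact procedure of Bakeman et al.): each of two independent observers reports the true, equiprobable code with probability 0.85 and otherwise guesses uniformly among the remaining codes.

def expected_kappa(num_codes, accuracy=0.85):
    # Both observers report the true (equiprobable) code with the given accuracy,
    # otherwise guess uniformly among the remaining num_codes - 1 codes.
    p_agree = accuracy ** 2 + (1 - accuracy) ** 2 / (num_codes - 1)   # Pr(a)
    p_chance = 1 / num_codes                                          # Pr(e): marginals stay uniform
    return (p_agree - p_chance) / (1 - p_chance)

for k in (2, 3, 5, 10):
    print(k, round(expected_kappa(k), 2))   # 0.49, 0.6, 0.66, 0.69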
Nonetheless, magnitude guidelines have appeared in the
literature. Perhaps the first was Landis and Koch,[12] who characterized values
< 0 as indicating no agreement and 0–0.20 as slight, 0.21–0.40 as fair,
0.41–0.60 as moderate, 0.61–0.80 as substantial, and 0.81–1 as almost perfect
agreement. This set of guidelines is however by no means universally accepted;
Landis and Koch supplied no evidence to support it, basing it instead on
personal opinion. It has been noted that these guidelines may be more harmful
than helpful.[13] Fleiss's[14]:218 equally arbitrary guidelines characterize
kappas over 0.75 as excellent, 0.40 to 0.75 as fair to good, and below 0.40 as
poor.
Weighted kappa
Weighted kappa lets you count disagreements differently and is especially
useful when codes are ordered. Three matrices are involved: the matrix of
observed scores, the matrix of expected scores based on chance agreement, and
the weight matrix. Cells of the weight matrix located on the diagonal
(upper-left to bottom-right) represent agreement and thus contain zeros.
Off-diagonal cells contain weights indicating the seriousness of the
disagreement. Often, cells one step off the diagonal are weighted 1, those two
steps off are weighted 2, and so on.
The equation for weighted κ is:
\kappa = 1 - \frac{\sum_{i=1}^{k} \sum_{j=1}^{k} w_{ij} x_{ij}}{\sum_{i=1}^{k} \sum_{j=1}^{k} w_{ij} m_{ij}}
where k = number of codes and w_{ij}, x_{ij}, and m_{ij} are
elements in the weight, observed, and expected matrices, respectively. When
diagonal cells contain weights of 0 and all off-diagonal cells weights of 1,
this formula produces the same value of kappa as the calculation given above.
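A minimal sketch of the weighted computation (the function and variable names are mine; the expected matrix m is formed from the products of the marginals, as in the unweighted case):

def weighted_kappa(observed, weights):
    # observed[i][j]: count of items rater A placed in code i and rater B in code j.
    # weights[i][j]:  disagreement weight (0 on the diagonal).
    k = len(observed)
    n = sum(sum(row) for row in observed)
    row_marg = [sum(observed[i]) for i in range(k)]
    col_marg = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    # Expected counts under chance agreement: m_ij = row_i * col_j / n.
    expected = [[row_marg[i] * col_marg[j] / n for j in range(k)] for i in range(k)]
    num = sum(weights[i][j] * observed[i][j] for i in range(k) for j in range(k))
    den = sum(weights[i][j] * expected[i][j] for i in range(k) for j in range(k))
    return 1 - num / den

# With 0/1 weights this reduces to the unweighted kappa of the earlier example:
w = [[0, 1], [1, 0]]
print(weighted_kappa([[20, 5], [10, 15]], w))   # about 0.40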
Kappa maximum
Kappa assumes its theoretical maximum value of 1 only when
both observers distribute codes the same, that is, when corresponding row and
column sums are identical. Anything less is less than perfect agreement. Still,
the maximum value kappa could achieve given unequal distributions helps
interpret the value of kappa actually obtained. The equation for κ maximum
is:
\kappa_{\max} = \frac{P_{\max} - P_{\exp}}{1 - P_{\exp}}
where P_{\exp} = \sum_{i=1}^k P_{i+}P_{+i}, as usual,
P_{\max} = \sum_{i=1}^k \min(P_{i+},P_{+i}),
k = number of codes, P_{i+} are the row probabilities, and
P_{+i} are the column probabilities.
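A short sketch of this calculation (names are mine), taking the confusion matrix as counts and forming the row and column probabilities from it:

def kappa_max(confusion):
    # confusion[i][j]: count of items in row code i (rater A) and column code j (rater B).
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    p_row = [sum(confusion[i]) / n for i in range(k)]                       # P_{i+}
    p_col = [sum(confusion[i][j] for i in range(k)) / n for j in range(k)]  # P_{+i}
    p_exp = sum(p_row[i] * p_col[i] for i in range(k))                      # P_exp
    p_max = sum(min(p_row[i], p_col[i]) for i in range(k))                  # P_max
    return (p_max - p_exp) / (1 - p_exp)

# For the grant example, the maximum kappa attainable given the unequal marginals:
print(kappa_max([[20, 5], [10, 15]]))   # (0.9 - 0.5) / 0.5 = 0.8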