Research Article

Received 04 June 2013; Accepted 22 January 2014; Published online 16 February 2014 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/sim.6113

Kappa statistic for clustered matched-pair data

Zhao Yang^{a,*,†} and Ming Zhou^{b}

The kappa statistic is widely used to assess the agreement between two procedures for independent matched-pair data. For matched-pair data collected in clusters, we propose, on the basis of the delta method and sampling techniques, a nonparametric variance estimator for the kappa statistic that requires neither a within-cluster correlation structure nor distributional assumptions. The results of an extensive Monte Carlo simulation study demonstrate that the proposed kappa statistic provides consistent estimation and that the proposed variance estimator behaves reasonably well for at least a moderately large number of clusters (e.g., $K \geq 50$). Compared with the variance estimator that ignores dependence within a cluster, the proposed variance estimator performs better in maintaining the nominal coverage probability when the intra-cluster correlation is fair ($\rho \geq 0.3$), with more pronounced improvement as $\rho$ increases further. To illustrate the practical application of the proposed estimator, we analyze two real data examples of clustered matched-pair data. Copyright © 2014 John Wiley & Sons, Ltd.

Keywords: kappa statistic; clustered matched-pair data; confidence interval; agreement; delta method

1. Introduction

The kappa statistic (or Cohen's kappa statistic) [1], a landmark in the development of agreement theory, is used for summarizing the cross-classification of two nominal variables with identical categories in matched-pair data [2]. The kappa statistic was introduced as a measure of inter-rater agreement between two observers classifying subjects into mutually exclusive categories. It adjusts the observed proportional agreement by first subtracting the amount of agreement that would be expected by chance, that is, taking the amount by which the observed agreement exceeds that expected by chance alone, and then normalizing this difference by its maximum possible value. The kappa statistic has been widely applied to cross-classifications encountered in psychometrics, educational measurement, epidemiology [3], and diagnostic imaging [4]. Various extensions of the kappa statistic have been developed, including weighted kappas for ordinal categories [5], multi-rater kappas [6–8], kappas using stratified samples [9], kappas comparing a single rater with a consensus group of raters [10], kappas allowing for multiple observations per subject and multiple categorizations [11], kappas for groups of raters [12, 13], kappas for sample survey data [14], kappas for longitudinal study data [15], kappas for data with hierarchical structure [16], and correlated kappa statistics [17–23]. King and Chinchilli [24] investigated the kappa statistic from a different perspective by unifying it as the generalized concordance correlation coefficient. Recently, to evaluate the agreement between the ratings of physicians and their patients regarding their discussions of the same event, Kang et al. [25] proposed a simulation-based cluster bootstrap method to analyze correlated clustered binary data. That is, patients seen by the same physician form a cluster, within which the patients' responses and the responses of their physician tend to be more similar.
Zhou and Yang [26] called this type of data the 'physician-patients data' and suggested that it is plausible to assume, as in [25], that responses from patients of the same physician are conditionally independent given their physician's responses. This assumption of conditional independence enabled Zhou and Yang [26] to develop an asymptotic variance estimator with a semi-parametric flavor for the kappa statistic without further parametric assumptions.

a UCB BioSciences, Inc., 8010 Arco Corporate Drive, Raleigh, NC 27617, U.S.A.
b Global Biometric Sciences, Bristol-Myers Squibb, Hopewell, NJ 08534, U.S.A.
*Correspondence to: Zhao Yang, UCB BioSciences, Inc., 8010 Arco Corporate Drive, Raleigh, NC 27617, U.S.A.
†E-mail: [email protected]

Statist. Med. 2014, 33 2612–2633


In practice, there are many cases in which sampling units are recruited in clusters, such as dental studies with multiple teeth (sampling units) of a subject (cluster). In such cases, units from the same cluster tend to have similar outcomes. To facilitate further understanding of the data structure, consider a study whose objective is to investigate the agreement between computerized axial tomography (CAT) and positron emission tomography (PET) scan images of patients with metastatic colorectal cancer. For each patient, there are 10 sites separately scanned using CAT and PET. Hence, in this envisioned study, each patient is a cluster, each site of a patient is a unit, and the CAT and PET scans are the diagnostic procedures. Without loss of generality, we call any evaluation method a 'procedure', a term broad enough to cover other commonly used terms such as rater, observer, scorer, and coder. The key characteristic of clustered matched-pair data is that every unit is assessed by both procedures. Although some methods are available for clustered matched-pair data, existing research has focused primarily on testing marginal homogeneity or on confidence interval (CI) construction ([27–29] and references therein), and very little work has been performed to estimate kappa. Oden [30], Schouten [31], and Dixon et al. [32] proposed a pooled kappa statistic for situations where two raters evaluate a set of paired units, such as pairs of eyes; however, their methods are limited when cluster sizes vary. Admittedly, to estimate kappa for clustered matched-pair data, the most challenging task is to determine the asymptotic variance of the kappa statistic while accounting for the correlation within clusters.
In practice, often little is known about the appropriateness of any assumed correlation structure; hence, analysis methods that adjust for multiple units within a cluster, yet avoid specific assumptions on the within-cluster correlation structure, are desired. On the basis of sampling techniques [33], Obuchowski [34] presented a method for testing marginal homogeneity that adjusts for the clustering effect without assuming a specific distribution or within-cluster correlation structure. This strategy grants flexibility in situations with varying cluster sizes and in situations where it is infeasible to assume homogeneous within-cluster correlation. Moreover, it provides us with a tool to fill the research gap by proposing a kappa statistic and its corresponding variance estimator. The proposed method does not postulate any within-cluster correlation structure or distributional assumptions and can also be applied to the physician-patients data as a nonparametric approach [26]. We investigate the performance of the proposed estimator in terms of empirical coverage probability and mean squared error (MSE). Section 2 presents the kappa statistic for the traditional matched-pair data. Section 3 presents the proposed kappa statistic and its variance. Section 4 summarizes the Monte Carlo simulation study for the evaluation of the proposal. In Section 5, we provide two detailed real data analyses to illustrate the application of the methods. We give further discussion and a final summary in Section 6, followed by the Appendix covering all the technical details.

2. Kappa statistic for independent data

For the traditional matched-pair data summarized in a $g \times g$ contingency table with total sample size $N$, assume that two independent procedures assess each of $N$ items (subjects or objects) and assign it to one of $g$ exhaustive and mutually exclusive categories of a nominal or ordinal scale. Let $n_{ij}$ be the number of items that the first and second procedures assigned to categories $i$ and $j$, respectively, $n_{i+}$ be the total number of items assigned to the $i$th category by procedure 1, and $n_{+j}$ be similarly defined for the $j$th category by procedure 2. Also, we denote the probability of falling into the $i$th category of procedure 1 and the $j$th category of procedure 2 as $p_{ij}$ and define the marginal probabilities $p_{i+} = \sum_j p_{ij}$ and $p_{+j} = \sum_i p_{ij}$. Clearly $\sum_i \sum_j p_{ij} = 1$. Note that $p_{ij}$ is often estimated by $\hat p_{ij} = n_{ij}/N$, which is also the maximum likelihood estimator (MLE) under the multinomial model. When there are only two categories (binary case), the $g \times g$ table reduces to a $2 \times 2$ contingency table, where the two categories are often labeled as 0 and 1.

To define the agreement index, Cohen [1] considered the observed proportion of agreement, $\hat P_o$, subtracted by the proportion of agreement expected by chance, $\hat P_e$. The result is then normalized to obtain a value of 1 for perfect agreement, a value of 0 when the agreement is only due to chance, and a negative value when the observed proportion of agreement is lower than that expected by chance. Specifically, the kappa statistic is

\[
\hat K = \frac{\hat P_o - \hat P_e}{1 - \hat P_e}, \tag{1}
\]


where $\hat P_o$ and $\hat P_e$ are

\[
\hat P_o = \sum_{i=1}^{g} \hat p_{ii} = \frac{1}{N} \sum_{i=1}^{g} n_{ii}, \qquad
\hat P_e = \sum_{i=1}^{g} \hat p_{i+} \hat p_{+i} = \frac{1}{N^2} \sum_{i=1}^{g} n_{i+} n_{+i},
\]

and

\[
\hat p_{i+} = \sum_{j=1}^{g} \hat p_{ij} = \frac{1}{N} \sum_{j=1}^{g} n_{ij}, \qquad
\hat p_{+i} = \sum_{j=1}^{g} \hat p_{ji} = \frac{1}{N} \sum_{j=1}^{g} n_{ji}, \quad \text{for } i = 1, \ldots, g.
\]

The kappa statistic is a measure of agreement with desirable properties [35]. On the other hand, although the kappa statistic is widely used to assess the agreement between two procedures for independent matched-pair data, researchers have shown that it may not perform satisfactorily from the perspective of practical interpretation. For example, it may exhibit a high observed proportion of agreement $\hat P_o$ but a low kappa value $\hat K$ when the distribution of marginal totals is 'highly symmetrically unbalanced' [36, 37], or it may not be able to distinguish various types and sources of disagreement [38]. Shoukri [39] (p. 168–173) provided some useful illustrative examples to demonstrate the limitations of the kappa statistic.

For estimating the asymptotic variance of $\hat K$, Fleiss et al. [40] originally developed and presented the first correct expression in 1969 (their Equation 13):

\[
\hat V_F = \frac{1}{N(1-\hat P_e)^4} \Bigg\{ \sum_{i=1}^{g} \hat p_{ii} \big[ (1-\hat P_e) - (\hat p_{i+} + \hat p_{+i})(1-\hat P_o) \big]^2 + (1-\hat P_o)^2 \sum_{i=1}^{g} \sum_{j \neq i}^{g} \hat p_{ij} \big( \hat p_{+i} + \hat p_{j+} \big)^2 - \big( \hat P_o \hat P_e - 2\hat P_e + \hat P_o \big)^2 \Bigg\}. \tag{2}
\]

For convenience of subsequent comparison, we call it the 'Fleiss' estimator. A second, alternative expression of the variance estimator for $\hat K$ was derived using the delta method and first given by Bishop et al. ([41], p. 502) (a reprint of the 1975 edition published by the Massachusetts Institute of Technology Press). Although (2) can also be found in the literature, for example, Shoukri ([39], p. 157), the expression in (3) seems to be more broadly used, for example, in the two influential books by Agresti ([2], p. 434) and Congalton and Green ([42], p. 106). Appendix A provides straightforward algebra to establish the equivalence between (2) and (3):

\[
\hat V_D = \widehat{\mathrm{Var}}\big(\hat K\big) = \frac{1}{N} \Bigg\{ \frac{\hat P_o (1-\hat P_o)}{(1-\hat P_e)^2} + \frac{2(1-\hat P_o)\big[ 2\hat P_o \hat P_e - \sum_{i=1}^{g} \hat p_{ii} (\hat p_{i+} + \hat p_{+i}) \big]}{(1-\hat P_e)^3} + \frac{(1-\hat P_o)^2 \big[ \sum_{i=1}^{g} \sum_{j=1}^{g} \hat p_{ij} (\hat p_{j+} + \hat p_{+i})^2 - 4 \hat P_e^2 \big]}{(1-\hat P_e)^4} \Bigg\} \tag{3}
\]

\[
= \frac{1}{N} \mathbf S \mathbf W \mathbf S^{\mathrm T}, \tag{4}
\]

where $\mathbf S^{\mathrm T}$ denotes the transpose of the matrix $\mathbf S$ and

\[
\mathbf S = \left( \frac{1}{1-\hat P_e}, \; \frac{\hat P_o - 1}{(1-\hat P_e)^2} \right), \qquad
\mathbf W = \begin{pmatrix}
\hat P_o (1-\hat P_o) & \sum_{i=1}^{g} \hat p_{ii} (\hat p_{i+} + \hat p_{+i}) - 2\hat P_o \hat P_e \\[4pt]
\sum_{i=1}^{g} \hat p_{ii} (\hat p_{i+} + \hat p_{+i}) - 2\hat P_o \hat P_e & \sum_{i=1}^{g} \sum_{j=1}^{g} \hat p_{ij} (\hat p_{j+} + \hat p_{+i})^2 - 4\hat P_e^2
\end{pmatrix}.
\]

To facilitate the development for the clustered matched-pair data, we formulate the large-sample variance for $\hat K$ as the alternative expression in (4). We give the detailed derivation of (4) using the delta method in Appendix B. This derivation not only presents a formal proof of the asymptotic variance (3) of $\hat K$ but also, interestingly, provides the basis for the subsequent development of the variance estimator for the kappa statistic under the clustered matched-pair data. The asymptotic distribution of $\hat K$ is $\hat V_D^{-1/2}\big(\hat K - K\big) \xrightarrow{d} N(0,1)$, where $K$ is the true kappa. With $\hat K$ and its variance estimator $\hat V_D$, an approximate $100(1-\alpha)\%$ CI for $K$ is $\hat K \pm z_{1-\alpha/2}\sqrt{\hat V_D}$, where $z_{1-\alpha/2}$ is the upper $100(1-\alpha/2)$th percentage point of the standard normal distribution.
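The delta-method estimator (3) and the Wald-type CI can be sketched in Python as follows; the function name and the example usage are our own illustrative assumptions, not the authors' code:

```python
import math

def kappa_ci(table):
    """Kappa with delta-method variance (3) and a 95% Wald CI.

    `table` is a g x g list of counts; returns (kappa, variance, (lo, hi)).
    """
    g = len(table)
    N = sum(sum(row) for row in table)
    p = [[table[i][j] / N for j in range(g)] for i in range(g)]
    row = [sum(p[i]) for i in range(g)]                       # p_{i+}
    col = [sum(p[i][j] for i in range(g)) for j in range(g)]  # p_{+j}
    Po = sum(p[i][i] for i in range(g))
    Pe = sum(row[i] * col[i] for i in range(g))
    K = (Po - Pe) / (1 - Pe)
    # The three terms of expression (3)
    t1 = Po * (1 - Po) / (1 - Pe) ** 2
    t2 = (2 * (1 - Po)
          * (2 * Po * Pe - sum(p[i][i] * (row[i] + col[i]) for i in range(g)))
          / (1 - Pe) ** 3)
    t3 = ((1 - Po) ** 2
          * (sum(p[i][j] * (row[j] + col[i]) ** 2
                 for i in range(g) for j in range(g)) - 4 * Pe ** 2)
          / (1 - Pe) ** 4)
    var = (t1 + t2 + t3) / N
    z = 1.959963984540054  # z_{0.975}
    half = z * math.sqrt(var)
    return K, var, (K - half, K + half)
```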

3. Kappa statistic for clustered data

As illustrated in Table I, suppose there are $K$ independent clusters in the study population, where $n_k$ $(k = 1, \ldots, K)$ is the fixed sample size for the $k$th cluster, with $\sum_{k=1}^{K} n_k = N$. Building on the earlier notation, for the $k$th cluster, we denote the corresponding numbers of observations as $n_{ij,k}$, $n_{i+,k}$, and $n_{+j,k}$, the probability of falling into the $i$th row and $j$th column as $p_{ij,k}$, and the marginal probabilities as $p_{i+,k} = \sum_j p_{ij,k}$ and $p_{+j,k} = \sum_i p_{ij,k}$. Moreover, we denote

\[
\omega_k = n_k \Big( \sum_{k=1}^{K} n_k \Big)^{-1} = \frac{n_k}{N}, \qquad
\hat P_{o,k} = \frac{1}{n_k} \sum_{i=1}^{g} n_{ii,k}, \qquad
\hat p_{i+,k} = \frac{n_{i+,k}}{n_k}, \qquad
\hat p_{+i,k} = \frac{n_{+i,k}}{n_k}.
\]

Assuming homogeneity among the $K$ clusters, that is, $E(n_{ij,k}/n_k) = p_{ij}$ for $k = 1, \ldots, K$, we expect to have the same parameters $p_{ij}, p_{i+}, p_{+i}, P_o, P_e$ for all $K$ clusters. For the clustered matched-pair data, we can calculate the kappa statistic $\hat K_c$ using (1) with $\hat P_o$ and $\hat P_e$ being

\[
\hat P_o = \sum_{k=1}^{K} \omega_k \hat P_{o,k} = \frac{1}{N} \sum_{k=1}^{K} \sum_{i=1}^{g} n_{ii,k}, \qquad
\hat P_e = \sum_{i=1}^{g} \hat p_{i+} \hat p_{+i}, \tag{5}
\]

where

\[
\hat p_{i+} = \sum_{k=1}^{K} \omega_k \hat p_{i+,k} = \frac{1}{N} \sum_{k=1}^{K} n_{i+,k}, \qquad
\hat p_{+i} = \sum_{k=1}^{K} \omega_k \hat p_{+i,k} = \frac{1}{N} \sum_{k=1}^{K} n_{+i,k}.
\]

Remark 1
The kappa statistic $\hat K_c$ is identical to $\hat K$ in (1) calculated using all the data while ignoring the within-cluster correlation.
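Remark 1 can be checked numerically: computing $\hat K_c$ from the $\omega_k$-weighted cluster margins in (5) and computing $\hat K$ from the pooled table give the same value. The cluster counts below are made up for illustration:

```python
# Made-up 2 x 2 counts for two clusters
clusters = [
    [[4, 1], [2, 3]],
    [[6, 2], [0, 5]],
]
g = 2
N = sum(sum(map(sum, c)) for c in clusters)

# Clustered form (5): the omega_k-weighted averages of per-cluster
# quantities telescope to sums over all clusters divided by N
Po = sum(c[i][i] for c in clusters for i in range(g)) / N
p_row = [sum(sum(c[i]) for c in clusters) / N for i in range(g)]
p_col = [sum(c[i][j] for c in clusters for i in range(g)) / N for j in range(g)]
Pe = sum(p_row[i] * p_col[i] for i in range(g))
K_c = (Po - Pe) / (1 - Pe)

# Pooled form: collapse the clusters into one table and apply (1) directly
pooled = [[sum(c[i][j] for c in clusters) for j in range(g)] for i in range(g)]
Po2 = sum(pooled[i][i] for i in range(g)) / N
Pe2 = sum((sum(pooled[i]) / N) * (sum(pooled[r][i] for r in range(g)) / N)
          for i in range(g))
K = (Po2 - Pe2) / (1 - Pe2)

assert abs(K_c - K) < 1e-12  # Remark 1: the two computations agree
```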

Table I. Summary information for the $g \times g$ table of the $k$th cluster. For each cell, the number in parentheses is the probability.

                                       Procedure 2
Procedure 1   1                   ...  j                   ...  g                   Total
1             n_{11,k} (p_{11,k}) ...  n_{1j,k} (p_{1j,k}) ...  n_{1g,k} (p_{1g,k}) n_{1+,k} (p_{1+,k})
2             n_{21,k} (p_{21,k}) ...  n_{2j,k} (p_{2j,k}) ...  n_{2g,k} (p_{2g,k}) n_{2+,k} (p_{2+,k})
...           ...                 ...  ...                 ...  ...                 ...
i             n_{i1,k} (p_{i1,k}) ...  n_{ij,k} (p_{ij,k}) ...  n_{ig,k} (p_{ig,k}) n_{i+,k} (p_{i+,k})
...           ...                 ...  ...                 ...  ...                 ...
g             n_{g1,k} (p_{g1,k}) ...  n_{gj,k} (p_{gj,k}) ...  n_{gg,k} (p_{gg,k}) n_{g+,k} (p_{g+,k})
Total         n_{+1,k} (p_{+1,k}) ...  n_{+j,k} (p_{+j,k}) ...  n_{+g,k} (p_{+g,k}) n_k


To derive the asymptotic variance of $\hat K_c$, we use the delta method and follow a similar process as in Appendix B. First, let $\hat{\mathbf p}_{\cdot+} = (\hat p_{1+}, \ldots, \hat p_{g+})^{\mathrm T}$ and $\hat{\mathbf p}_{+\cdot} = (\hat p_{+1}, \ldots, \hat p_{+g})^{\mathrm T}$ denote the row- and column-margin vectors, and define the vector $\hat{\boldsymbol\theta} = \big(\hat P_o, \hat{\mathbf p}_{\cdot+}^{\mathrm T}, \hat{\mathbf p}_{+\cdot}^{\mathrm T}\big)^{\mathrm T}$, which can be written as

\[
\hat{\boldsymbol\theta} = \sum_{k} \omega_k \begin{pmatrix} \hat P_{o,k} \\ \hat{\mathbf p}_{\cdot+,k} \\ \hat{\mathbf p}_{+\cdot,k} \end{pmatrix}.
\]

Then, by the Cramér–Wold device and Lindeberg's central limit theorem [43], under mild regularity conditions, when $K \to \infty$ and

\[
\lim_{K \to \infty} \Bigg[ \max_{1 \leq k \leq K} n_k \Big( \sum_{k=1}^{K} n_k^2 \Big)^{-1/2} \Bigg] = 0,
\]

$\hat{\boldsymbol\theta}$ is asymptotically normally distributed with variance

\[
\mathrm{Var}\big(\hat{\boldsymbol\theta}\big) = \frac{1}{K} \begin{pmatrix} V_{11} & V_{12} & V_{13} \\ V_{21} & V_{22} & V_{23} \\ V_{31} & V_{32} & V_{33} \end{pmatrix}. \tag{6}
\]

For the independent data, as shown in (10) of Appendix B, the explicit form of $\mathrm{Var}\big(\hat{\boldsymbol\theta}\big)$ relies on the multinomial model. Without distributional assumptions, we use the sampling techniques from [33, 34] to derive an estimator for (6). To proceed, we denote

\[
\hat{\mathbf P}_o = \big(\hat P_{o,1}, \ldots, \hat P_{o,K}\big), \qquad
\boldsymbol\omega = (\omega_1, \ldots, \omega_K)^{\mathrm T},
\]
\[
\hat{\mathbf P}_{\cdot+} = \big(\hat{\mathbf p}_{\cdot+,1}, \ldots, \hat{\mathbf p}_{\cdot+,K}\big)_{g \times K}, \qquad
\hat{\mathbf P}_{+\cdot} = \big(\hat{\mathbf p}_{+\cdot,1}, \ldots, \hat{\mathbf p}_{+\cdot,K}\big)_{g \times K},
\]

so that $\hat{\mathbf p}_{\cdot+} = \hat{\mathbf P}_{\cdot+} \boldsymbol\omega$ and $\hat{\mathbf p}_{+\cdot} = \hat{\mathbf P}_{+\cdot} \boldsymbol\omega$. The elements in (6) can then be estimated by

\[
\hat V_{11} = \frac{K^2}{K-1} \sum_{k=1}^{K} \omega_k^2 \big( \hat P_{o,k} - \hat P_o \big)^2,
\]
\[
\hat V_{12} = \hat V_{21}^{\mathrm T} = \frac{K^2}{K-1} \sum_{k=1}^{K} \omega_k^2 \big( \hat P_{o,k} - \hat P_o \big) \big( \hat{\mathbf p}_{\cdot+,k} - \hat{\mathbf p}_{\cdot+} \big)^{\mathrm T},
\]
\[
\hat V_{13} = \hat V_{31}^{\mathrm T} = \frac{K^2}{K-1} \sum_{k=1}^{K} \omega_k^2 \big( \hat P_{o,k} - \hat P_o \big) \big( \hat{\mathbf p}_{+\cdot,k} - \hat{\mathbf p}_{+\cdot} \big)^{\mathrm T}, \tag{7}
\]
\[
\hat V_{23} = \hat V_{32}^{\mathrm T} = \frac{K^2}{K-1} \sum_{k=1}^{K} \omega_k^2 \big( \hat{\mathbf p}_{\cdot+,k} - \hat{\mathbf p}_{\cdot+} \big) \big( \hat{\mathbf p}_{+\cdot,k} - \hat{\mathbf p}_{+\cdot} \big)^{\mathrm T},
\]
\[
\hat V_{22} = \frac{K^2}{K-1} \sum_{k=1}^{K} \omega_k^2 \big( \hat{\mathbf p}_{\cdot+,k} - \hat{\mathbf p}_{\cdot+} \big) \big( \hat{\mathbf p}_{\cdot+,k} - \hat{\mathbf p}_{\cdot+} \big)^{\mathrm T}, \qquad
\hat V_{33} = \frac{K^2}{K-1} \sum_{k=1}^{K} \omega_k^2 \big( \hat{\mathbf p}_{+\cdot,k} - \hat{\mathbf p}_{+\cdot} \big) \big( \hat{\mathbf p}_{+\cdot,k} - \hat{\mathbf p}_{+\cdot} \big)^{\mathrm T}.
\]

Following a similar derivation to that in Appendix B, the estimator of the asymptotic variance of $\hat K_c$ is

\[
\hat V = \widehat{\mathrm{Var}}\big(\hat K_c\big) = \frac{1}{K} \mathbf S \mathbf U \mathbf S^{\mathrm T}, \tag{8}
\]

where

\[
\mathbf S = \left( \frac{1}{1-\hat P_e}, \; \frac{\hat P_o - 1}{(1-\hat P_e)^2} \right), \qquad
\mathbf U = \begin{pmatrix}
\hat V_{11} & \hat V_{12} \hat{\mathbf p}_{+\cdot} + \hat V_{13} \hat{\mathbf p}_{\cdot+} \\[4pt]
\hat{\mathbf p}_{+\cdot}^{\mathrm T} \hat V_{21} + \hat{\mathbf p}_{\cdot+}^{\mathrm T} \hat V_{31} & \hat{\mathbf p}_{+\cdot}^{\mathrm T} \hat V_{22} \hat{\mathbf p}_{+\cdot} + \hat{\mathbf p}_{\cdot+}^{\mathrm T} \hat V_{33} \hat{\mathbf p}_{\cdot+} + 2 \hat{\mathbf p}_{+\cdot}^{\mathrm T} \hat V_{23} \hat{\mathbf p}_{\cdot+}
\end{pmatrix}.
\]

For convenience of subsequent comparison, we call this estimator the 'Proposal'. It can be shown (detailed in Appendix C) that when there is exactly one unit per cluster $(n_k = 1)$, the variance estimator in (8) reduces to $\hat V_D$ (or $\hat V_F$) multiplied by a correction factor, $K/(K-1)$. The asymptotic distribution of $\hat K_c$ is $\hat V^{-1/2}\big(\hat K_c - K\big) \xrightarrow{d} N(0,1)$, and an approximate $100(1-\alpha)\%$ CI for $K$ can be constructed as $\hat K_c \pm z_{1-\alpha/2}\sqrt{\hat V}$.

Remark 2
The weights $\omega_k$ $(k = 1, \ldots, K)$ are used to conveniently express the kappa statistic in (1); for example, $\hat P_o$ can be written as a weighted average of the observed proportions of agreement over all individual clusters, and similarly for $\hat P_e$. Zhou and Yang [26] heuristically demonstrated that, for any given $N$, having equal weights (i.e., $\omega_1 = \omega_2 = \cdots = \omega_K$) is beneficial in the sense of a good large-sample variance property.
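The proposed estimator can be sketched in Python/numpy as follows. This is our own sketch of the estimation steps (the per-cluster quantities, the sampling-based block estimates in (7), and the delta-method mapping to the $(\hat P_o, \hat P_e)$ scale), not the authors' code; the cluster tables used in the usage example are made up:

```python
import numpy as np

def clustered_kappa(tables):
    """Kappa statistic and the proposed variance estimator (8) for
    clustered matched-pair data; a sketch, not the authors' implementation.

    `tables` is a list of K g x g count arrays, one per cluster.
    Returns (kappa, variance).
    """
    tabs = [np.asarray(t, dtype=float) for t in tables]
    K = len(tabs)
    n = np.array([t.sum() for t in tabs])           # cluster sizes n_k
    N = n.sum()
    w = n / N                                       # weights omega_k

    Po_k = np.array([np.trace(t) / t.sum() for t in tabs])
    prow_k = np.stack([t.sum(axis=1) / t.sum() for t in tabs])  # K x g, p_{i+,k}
    pcol_k = np.stack([t.sum(axis=0) / t.sum() for t in tabs])  # K x g, p_{+i,k}

    Po = w @ Po_k
    prow = w @ prow_k                               # pooled p_{i+}
    pcol = w @ pcol_k                               # pooled p_{+i}
    Pe = prow @ pcol
    kappa = (Po - Pe) / (1 - Pe)

    # Sampling-based estimates (7) of the blocks of Var(theta_hat)
    c = K * K / (K - 1)
    dPo = w * (Po_k - Po)                  # omega_k (P_{o,k} - P_o)
    dr = w[:, None] * (prow_k - prow)      # omega_k (p_{.+,k} - p_{.+})
    dc = w[:, None] * (pcol_k - pcol)      # omega_k (p_{+.,k} - p_{+.})
    V11 = c * (dPo @ dPo)
    V12 = c * (dPo @ dr)
    V13 = c * (dPo @ dc)
    V22 = c * (dr.T @ dr)
    V33 = c * (dc.T @ dc)
    V23 = c * (dr.T @ dc)

    # Delta-method mapping: dPe/dp_{.+} = p_{+.} and dPe/dp_{+.} = p_{.+}
    U11 = V11
    U12 = V12 @ pcol + V13 @ prow
    U22 = pcol @ V22 @ pcol + prow @ V33 @ prow + 2 * pcol @ V23 @ prow
    U = np.array([[U11, U12], [U12, U22]])
    S = np.array([1 / (1 - Pe), (Po - 1) / (1 - Pe) ** 2])
    return kappa, (S @ U @ S) / K
```

By Remark 1, the returned point estimate coincides with the pooled-table kappa; only the variance accounts for the clustering.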

4. Monte Carlo simulation

In this section, we provide a detailed description of the data-generation procedure for the simulation study based on the clustered matched-pair $3 \times 3$ data structure; the simulation study was performed and evaluated using SAS version 9.2 via proc iml and the macro language. The calculation of the kappa statistic, the estimation of its variance, and the construction of CIs follow the proposal in Section 3.

4.1. Generating clustered matched-pair $3 \times 3$ data

To model the correlation within a cluster, the within-cluster correlation structure $\mathbf R_{2n_k \times 2n_k}$ in (9), investigated by Obuchowski [34], is used herein: (i) intra-cluster correlation between units that respond to procedure 1, $r_1$; (ii) intra-cluster correlation between units that respond to procedure 2, $r_2$; (iii) correlation between procedures for the same unit, $r_3$; and (iv) correlation between procedures for different units within a cluster, $r_4$:

\[
\mathbf R_{2n_k \times 2n_k} = \begin{pmatrix}
(1-r_1)\mathbf I + r_1 \mathbf J & (r_3-r_4)\mathbf I + r_4 \mathbf J \\
(r_3-r_4)\mathbf I + r_4 \mathbf J & (1-r_2)\mathbf I + r_2 \mathbf J
\end{pmatrix}, \tag{9}
\]


where $\mathbf I$ is the $n_k \times n_k$ identity matrix and $\mathbf J$ is the $n_k \times n_k$ square matrix with all elements being 1. The components of a simulation configuration are as follows: (i) the number of clusters ($K = 15, 25, 50, 100$); (ii) the cluster sizes $n_k$; (iii) the correlation-related parameters $r_1$ and $r_3$; and (iv) $p_{1+}, p_{2+}, p_{+1}$, and $p_{+2}$. For each configuration, we generate 2000 independent clustered matched-pair data sets; we then evaluate the performance of the candidate estimators by the empirical coverage probability of their CIs.

Clearly, the dimensions of $\mathbf R$ depend on the cluster sizes $n_k$. We allow varying cluster sizes by generating $n_k$ from a binomial distribution with success probability 0.60 and substituting 1 for any generated 0. In the simulation, we investigate three scenarios: fixed cluster sizes $n_k = 2$, varying cluster sizes of $n_k \leq 5$ (with average cluster size $0.6 \times 5 + 0.4^5 \approx 3.01$), and $n_k \leq 10$ (with average cluster size $0.6 \times 10 + 0.4^{10} \approx 6$). For the $k$th cluster, each of the $n_k$ units is assessed by both procedures 1 and 2 and has a response in one of the three categories. Let the categorical responses of unit $i$ be $X_{i1}$ and $X_{i2}$ for procedures 1 and 2, respectively, and denote the corresponding latent continuous responses of unit $i$ by $Y_{i1}$ and $Y_{i2}$. To better understand the relationship between the categorical and the latent continuous responses, take $X_{i1}$ and $Y_{i1}$ as an example: the latent continuous response $Y_{i1}$ is related to the categorical response $X_{i1}$ via the 'threshold concept'. That is, for three categories $(C = 3)$, the threshold values can be set as $\lambda_1$ and $\lambda_2$ (with $\lambda_0 = -\infty$ and $\lambda_3 = \infty$), and $X_{i1}$ takes category $c$ if $\lambda_{c-1} < Y_{i1} \leq \lambda_c$, $c = 1, 2, 3$. In this way, the correlation between $Y_{i1}$ and $Y_{i2}$, characterized by $r_3$, is related to the kappa statistic calculated from the categorical responses $X_{i1}$ and $X_{i2}$.

In order for $\mathbf R$ to be a valid correlation matrix, it has to be nonnegative definite. In Appendix D, we provide the sufficient and necessary condition on $\mathbf R$ in terms of $r_1, r_2, r_3$, and $r_4$. Let $r_1, r_2, r_3$, and $r_4$ all be nonnegative, $r_1 = r_2 = r$, $r_4$ be no more than $r_3$ (i.e., $r_4 \leq r_3$), and $r_4 = r_1/2 = r_2/2 = 0.5r$. Then, the domain of $r_3$ varies for different values of $r$. For the investigated scenarios of $r = 0, 0.1, 0.3, 0.5$, and $0.8$, the corresponding ranges of $r_3$ are (i) $0 \leq r_3 \leq 1$ for $r = 0$; (ii) $0.05 \leq r_3 \leq 0.95$ for $r = 0.1$; (iii) $0.15 \leq r_3 \leq 0.80$ for $r = 0.3$; (iv) $0.25 \leq r_3 \leq 0.75$ for $r = 0.5$; and (v) $0.4 \leq r_3 \leq 0.6$ for $r = 0.8$, respectively. Here, we increase $r_3$ in steps of 0.05 in each scenario. Specifically, we generate a $2n_k \times 1$ random vector from the multivariate normal distribution with mean zero and covariance matrix $\mathbf R$ in (9). Then, for each $i = 1, \ldots, n_k$, we couple the $i$th and $(n_k + i)$th elements of the generated $2n_k \times 1$ normal response into a bivariate vector, say $(Y_{i1}, Y_{i2})$, which is bivariate normally distributed with off-diagonal correlation $r_3$. Note that there are $n_k$ such pairs, with the correlation between different pairs characterized by $r_1, r_2$, and $r_4$.

Moreover, we also generate data with a heterogeneous within-cluster correlation matrix $\mathbf R$. We set $r_3$ in increments of 0.05 in the range $0.4 \leq r_3 \leq 0.6$, and for each fixed true kappa, we select the correlation matrix $\mathbf R$ as a mixture of varying $r_1$ values across clusters. That is, we create cluster identification sequence numbers from 1 to $K$ and then divide the clusters into five approximately equal-sized groups such that (i) $r_1 = 0$ for cluster sequence $\# \leq K/5$; (ii) $r_1 = 0.1$ for $K/5 <$ cluster sequence $\# \leq 2K/5$; (iii) $r_1 = 0.3$ for $2K/5 <$ cluster sequence $\# \leq 3K/5$; (iv) $r_1 = 0.5$ for $3K/5 <$ cluster sequence $\# \leq 4K/5$; and (v) $r_1 = 0.8$ for $4K/5 <$ cluster sequence $\# \leq K$.

In the simulation, three scenarios are investigated for the fixed marginal probabilities $p_{1+}, p_{2+}, p_{+1}$, and $p_{+2}$: (scenario 1) $p_{1+} = 0.35$, $p_{2+} = 0.25$, $p_{+1} = 0.3$, $p_{+2} = 0.2$; (scenario 2) $p_{1+} = 0.2$, $p_{2+} = 0.4$, $p_{+1} = 0.35$, $p_{+2} = 0.25$; (scenario 3) $p_{1+} = 0.45$, $p_{2+} = 0.35$, $p_{+1} = 0.5$, $p_{+2} = 0.25$. All three scenarios are commonly encountered in application. In our current settings, there are a total of 2520 simulation configurations with a homogeneous within-cluster correlation matrix $\mathbf R$; we summarize those settings in Table II. To visualize the three scenarios for procedure 1, we can scatter-plot $(1, p_{1+}), (2, p_{2+}), (3, p_{3+})$ and connect the three dots to form a line; similarly for procedure 2. The shapes of the connected lines from the two procedures then differ across the three scenarios: a U shape for both procedures in scenario 1, a decreasing shape for both procedures in scenario 3, and a U shape versus an increasing shape between the two procedures in scenario 2.
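The generation steps above can be sketched in Python/numpy (the paper used SAS proc iml). The thresholds are the standard normal quantiles of the scenario-1 marginals; the seed and the particular $(r_1, r_2, r_3, r_4)$ values are illustrative choices, not the authors' code:

```python
import numpy as np
from statistics import NormalDist  # stdlib normal quantiles

def make_R(nk, r1, r2, r3, r4):
    """Within-cluster correlation matrix (9) for a cluster of size nk."""
    I, J = np.eye(nk), np.ones((nk, nk))
    return np.block([
        [(1 - r1) * I + r1 * J, (r3 - r4) * I + r4 * J],
        [(r3 - r4) * I + r4 * J, (1 - r2) * I + r2 * J],
    ])

def gen_cluster(nk, R, cuts1, cuts2, rng):
    """One cluster: a latent 2*nk-variate normal cut into 3 categories.

    cuts1/cuts2 hold the thresholds (lambda_1, lambda_2) for each
    procedure; category c is taken when lambda_{c-1} < Y <= lambda_c.
    Returns (X1, X2), arrays of categories 1..3 of length nk.
    """
    y = rng.multivariate_normal(np.zeros(2 * nk), R)
    x1 = np.searchsorted(cuts1, y[:nk]) + 1   # procedure 1 categories
    x2 = np.searchsorted(cuts2, y[nk:]) + 1   # procedure 2 categories
    return x1, x2

# Scenario 1 marginals: p1+ = 0.35, p2+ = 0.25; p+1 = 0.3, p+2 = 0.2
nd = NormalDist()
cuts1 = [nd.inv_cdf(0.35), nd.inv_cdf(0.35 + 0.25)]
cuts2 = [nd.inv_cdf(0.30), nd.inv_cdf(0.30 + 0.20)]
rng = np.random.default_rng(2014)             # seed is arbitrary
R = make_R(2, 0.3, 0.3, 0.5, 0.15)            # r1 = r2 = 0.3, r4 = 0.5 * r1
clusters = [gen_cluster(2, R, cuts1, cuts2, rng) for _ in range(2000)]
```

With many generated clusters, the empirical marginal proportions approach the targets (e.g., about 35% of procedure-1 responses fall in category 1).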

Table II. Summary of the total of 2520 simulation configurations with homogeneous within-cluster correlation.

Scenario #   Number of clusters (K)   Cluster size n_k   Within-cluster correlation (r_1)   r_3 (increment of 0.05)   Number of configurations
Scenario 1   15                       2                  0                                  0 <= r_3 <= 1             21
                                                         0.1                                0.05 <= r_3 <= 0.95       19
                                                         0.3                                0.15 <= r_3 <= 0.80       14
                                                         0.5                                0.25 <= r_3 <= 0.75       11
                                                         0.8                                0.40 <= r_3 <= 0.60       5
...          ...                      ...                ...                                ...                       ...

Note: The Cartesian combination of three scenarios, four cases for the number of clusters (K = 15, 25, 50, 100), and three cases for the cluster sizes (n_k = 2, n_k <= 5, and n_k <= 10) results in the total number of configurations: 3 x 4 x 3 x (21 + 19 + 14 + 11 + 5) = 2520.


The purpose of looking into three different scenarios is to investigate the impact of different shapes of marginal probabilities on the performance of the different methods. Denote a generic pair from the bivariate random vectors generated by the aforementioned procedure as $(Y_1, Y_2)$ and the quantile $\Phi^{-1}(p)$ of the standard normal distribution as $Q_p$. Then we select the appropriate cutoff points for $Y_1$ and $Y_2$ to generate three response categories for both procedures on the basis of the requirements for $p_{1+}, p_{2+}, p_{+1}$, and $p_{+2}$. With the information on $r_3, p_{1+}, p_{2+}, p_{+1}$, and $p_{+2}$, the true probabilities of the diagonal elements $p_{11}, p_{22}$, and $p_{33}$ in the $3 \times 3$ table can be calculated from the bivariate normally distributed pair $(Y_1, Y_2)$ as

\[
p_{11} = \Pr\big( Y_1 < Q_{p_{1+}}, \, Y_2 < Q_{p_{+1}} \big),
\]
\[
\begin{aligned}
p_{22} &= \Pr\big( Q_{p_{1+}} < Y_1 < Q_{p_{1+}+p_{2+}}, \; Q_{p_{+1}} < Y_2 < Q_{p_{+1}+p_{+2}} \big) \\
&= \Pr\big( Y_1 < Q_{p_{1+}+p_{2+}}, \, Y_2 < Q_{p_{+1}+p_{+2}} \big) - \Pr\big( Y_1 < Q_{p_{1+}}, \, Y_2 < Q_{p_{+1}+p_{+2}} \big) \\
&\quad - \Pr\big( Y_1 < Q_{p_{1+}+p_{2+}}, \, Y_2 < Q_{p_{+1}} \big) + p_{11},
\end{aligned}
\]
\[
\begin{aligned}
p_{33} &= \Pr\big( Y_1 > Q_{p_{1+}+p_{2+}}, \, Y_2 > Q_{p_{+1}+p_{+2}} \big) \\
&= 1 - \Pr\big( Y_1 < Q_{p_{1+}+p_{2+}} \big) - \Pr\big( Y_2 < Q_{p_{+1}+p_{+2}} \big) + \Pr\big( Y_1 < Q_{p_{1+}+p_{2+}}, \, Y_2 < Q_{p_{+1}+p_{+2}} \big).
\end{aligned}
\]

For example, let $r_3 = 0.6$, $p_{1+} = 0.35$, $p_{2+} = 0.25$, $p_{+1} = 0.3$, $p_{+2} = 0.2$; using the SAS function PROBBNRM, $p_{11}, p_{22}$, and $p_{33}$ can be calculated as $p_{11} = 0.1916$, $p_{22} = 0.0613$, $p_{33} = 0.2987$, and

Figure 1. Mean kappa statistic and mean squared error (MSE) for scenario 3: $p_{1+} = 0.45$, $p_{2+} = 0.35$, $p_{+1} = 0.5$, $p_{+2} = 0.25$. The number of clusters is $K = 25$, with cluster sizes $n_k = 2$, $n_k \leq 5$, and $n_k \leq 10$, varying $r_1 = 0, 0.1, 0.3, 0.5$, and $0.8$, and $r_3$ taking values in the constrained range in the simulation setup with increments of 0.05. Panels (a.1) to (a.3) show the mean kappa statistic; panels (a.4) to (a.6) show the MSE.


the true kappa is $K = 0.3047$. The true kappa is then used to evaluate the MSE and the empirical coverage probability.

4.2. Simulation evaluation

The evaluation procedure involves the average kappa statistic, the MSE, and the empirical coverage probability of the CI. The MSE is calculated as

\[
\mathrm{MSE} = \frac{1}{R} \sum_{i=1}^{R} \big( \hat K_i - K \big)^2,
\]

where $R = 2000$ is the number of Monte Carlo replications in each simulation configuration, $K$ is the underlying true kappa, and $\hat K_i$ is the estimated kappa statistic from the $i$th replication, $i = 1, \ldots, R$. Let $\alpha = 0.05$ denote the nominal level; the nominal coverage probability is then 0.95. The empirical coverage probability of the CI is estimated as

\[
\text{Empirical coverage probability} = \frac{\text{Number of times the true kappa } (K) \text{ lies in the CI}}{\text{Total number of simulations } (R)}.
\]

That is, we calculate the empirical coverage probability as $\#\big\{ \hat K_{i,l} < K < \hat K_{i,u} \big\}/R$, where $\hat K_{i,l}$ and $\hat K_{i,u}$ are the lower and upper limits of the 95% CI obtained from the $i$th replication. Ideally, with a good

Figure 2. Ratio of bias to the estimate of the standard deviation of the kappa statistic for scenario 3: $p_{1+} = 0.45$, $p_{2+} = 0.35$, $p_{+1} = 0.5$, $p_{+2} = 0.25$. The number of clusters is $K = 15$ or $K = 25$, with cluster sizes $n_k = 2$, $n_k \leq 5$, and $n_k \leq 10$, varying $r_1 = 0, 0.3$, and $0.8$, and $r_3$ taking values in the constrained range in the simulation setup with increments of 0.05.

estimate of the variance for $\hat K$, the estimated coverage probability should be close to the target nominal coverage probability of 0.95. Because of sampling error, with $R = 2000$, it is expected to fall within an approximate 95% CI of the nominal level of 0.95, namely $(0.9404, 0.9596)$; that is, using 0.95 as the true coverage probability and $\sqrt{(0.05)(0.95)/2000}$ as the standard error (SE) to obtain the lower and upper limits.

4.3. Simulation results

The extensive simulation results in Figures 1–5, Table III, and the Supporting information/Web appendix enable us to conclude that the results from the three investigated scenarios for $p_{1+}, p_{2+}, p_{+1}$, and $p_{+2}$ are largely compatible with respect to the mean kappa statistic, its SE, and the empirical coverage probability. That is, the performance of the different methods is largely consistent under different shapes of marginal probabilities. Moreover, the results suggest the following:

(1) All the results (the line of the true kappa overlaying the lines of the estimated kappa statistic under different values of $r_1$) for the mean kappa statistic [e.g., (a.1) to (a.3) in Figure 1] suggest that the proposed kappa estimator provides quite consistent estimation of the true value, even when the number of clusters is small, for example, $K = 15$ or $K = 25$. This statement is also supported by Figure 2, where the ratio of bias to the estimate of the standard deviation for the kappa statistic is

0.6

0.8

1.0

0.5

0.40

0.45

Scenario 3 Proposal, nk ≤ 5 Fleiss, nk ≤ 5

0.8

1.0

0.55

0.60

Proposal, nk ≤ 10 Fleiss, nk ≤ 10 (a.6) : K = 25, r1 = 0.8

90 85

Coverage Probability(%)

95

96 94 92 90

70

86

88

Coverage Probability(%)

96 94 92 90

0.6

r3

0.50

r3

(a.5) : K = 25, r1 = 0.3

88

0.4

85

0.7

r3

(a.4) : K = 25, r1 = 0

0.2

90

95 0.3

Proposal, nk = 2 Fleiss, nk = 2

0.0

80

Coverage Probability(%)

70 0.1

r3

80

0.4

75

0.2

75

94 92 90 86

88

Coverage Probability(%)

94 92 90

Coverage Probability(%)

88 0.0

Coverage Probability(%)

(a.3) : K = 15, r1 = 0.8

96

(a.2) : K = 15, r1 = 0.3

96

(a.1) : K = 15, r1 = 0

0.1

0.3

0.5

r3

0.7

0.40

0.45

0.50

0.55

0.60

r3

Copyright © 2014 John Wiley & Sons, Ltd.

Statist. Med. 2014, 33 2612–2633

2621

Figure 3. The empirical coverage probabilities (%) for scenario 3: p1C D 0:45; p2C D 0:35; pC1 D 0:5, pC2 D 0:25. The number of clusters is K D 15 or K D 25 with varying r1 D 0; 0:3, and 0:8, and r3 taking values in the constrained range in the simulation setup with increment of 0.05. The three horizontal dotted lines correspond to the nominal coverage probability 95% and its lower and upper limits of 95% CI as 94:04% and 95:96%, respectively.

Z. YANG AND M. ZHOU


Figure 4. The empirical coverage probabilities (%) for scenario 3: p1+ = 0.45, p2+ = 0.35, p+1 = 0.5, p+2 = 0.25. The number of clusters is K = 50 or K = 100 with varying r1 = 0, 0.3, and 0.8, and with r3 taking values in the constrained range of the simulation setup in increments of 0.05. Each panel compares the proposed and Fleiss variance estimators for nk = 2, nk ≤ 5, and nk ≤ 10. The three horizontal dotted lines correspond to the nominal coverage probability 95% and the lower and upper limits of its 95% CI, 94.04% and 95.96%, respectively.


displayed; Efron and Tibshirani ([44], p. 128) suggested that, if the estimate of the bias is small compared with the estimate of the standard deviation, we do not need to worry about the bias; that is, a bias of less than 0.25 of the standard deviation can be ignored.

(2) Because of the (asymptotic) unbiasedness of the kappa estimator, the magnitude of the MSE is primarily due to the variance of the estimator. The results [e.g., (a.4) to (a.6) in Figure 1] indicate that increasing the number of clusters or the cluster sizes decreases the MSE and provides better precision for the kappa estimation. Moreover, increasing the intra-cluster correlation (ρ) is accompanied by increased MSE.

(3) Assuming independence within a cluster, that is, intra-cluster correlation ρ = 0, the performance in terms of empirical coverage probability using V̂_F in (2) and V̂ in (8) is satisfactory for K ≥ 50. Moreover, V̂_F can even provide acceptable performance for a small number of clusters, for example, K = 15 or K = 25, especially when the cluster sizes are moderate, for example, nk ≤ 10. Meanwhile, r3, the correlation between the latent responses from both procedures for the same unit in a cluster, has minimal impact on the empirical coverage probability from V̂ in (8) but is influential on that from V̂_F in (2), especially when the intra-cluster correlation becomes large. When the assumption of independence does not hold, the empirical coverage probability from V̂_F in (2) decreases rapidly as the average cluster sizes increase [e.g., (a.3) and (a.6) in Figures 3-5].



Figure 5. The empirical coverage probabilities (%) for scenario 3: p1+ = 0.45, p2+ = 0.35, p+1 = 0.5, p+2 = 0.25. The number of clusters is K = 15, 25, 50, and 100 with a heterogeneous within-cluster correlation matrix R and with r3 taking values in the constrained range [0.4, 0.6] in increments of 0.05. Each panel compares the proposed and Fleiss variance estimators for nk = 2, nk ≤ 5, and nk ≤ 10. The three horizontal dotted lines correspond to the nominal coverage probability 95% and the lower and upper limits of its 95% CI, 94.04% and 95.96%, respectively.

(4) As the within-cluster correlation increases, the empirical coverage probability from V̂_F in (2) becomes poorer, especially for situations with at least fair correlation (ρ > 0.3). This is true even for a relatively large number of clusters, for example, K = 100. On the other hand, the empirical coverage probability based on V̂ in (8) performs reasonably well for a moderately large number of clusters, K ≥ 50 (e.g., Figures 3 and 4), and this also holds for the situation with a heterogeneous within-cluster correlation matrix (e.g., Figure 5). For a small number of clusters such as K = 15 or K = 25, the results indicate that the proposed variance estimator may also exhibit the underestimation issue, leading to liberal empirical coverage probabilities.
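Point (1) above leans on the Efron and Tibshirani rule that a bias below 0.25 standard deviations is ignorable. As a hedged, simplified illustration (plain multinomial sampling with hypothetical cell probabilities, not the clustered design of Section 4), the bias-to-standard-deviation ratio of the kappa estimator can be checked by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical 2x2 true cell probabilities; the true kappa follows from them.
p = np.array([[0.60, 0.10],
              [0.05, 0.25]])
N, R = 200, 4000                        # matched pairs per replicate, replicates

def kappa(q):
    po = np.trace(q)                    # observed agreement
    pe = q.sum(axis=1) @ q.sum(axis=0)  # chance agreement
    return (po - pe) / (1.0 - pe)

true_k = kappa(p)
tables = rng.multinomial(N, p.ravel(), size=R).reshape(R, 2, 2) / N
est = np.array([kappa(t) for t in tables])
bias_ratio = abs(est.mean() - true_k) / est.std(ddof=1)
```

Under this setup the ratio comes out far below 0.25, consistent with the near-unbiasedness reported for the kappa point estimator.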


Finally, note that the empirical coverage probability is expected to fall in the interval (0.9404, 0.9596), which is the basis for defining four intervals: <0.9404, [0.9404, 0.95], (0.95, 0.9596], and >0.9596. Because the performance of the different methods is largely consistent under the different shapes of marginal probabilities in the three investigated scenarios, the simulation results for all 2520 simulation configurations with homogeneous within-cluster correlation are pooled together and summarized by these four intervals in Table III. The CI from a variance estimator is conservative if its empirical size is >0.9596 and liberal if its empirical size is <0.9404. Because there are only two cases from V̂_F in (2) with the empirical coverage probability exceeding 0.9596, we do not present the summary results for the interval >0.9596 in Table III. Clearly, the results in Table III support the aforementioned conclusions about the empirical coverage probability from another perspective.
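The interval (0.9404, 0.9596) and the four summary bands can be reproduced directly; the helper below (our naming, not the paper's) classifies an empirical coverage probability:

```python
import math

R = 2000                                  # Monte Carlo replications per configuration
p0 = 0.95                                 # nominal coverage probability
se = math.sqrt(p0 * (1 - p0) / R)         # binomial standard error
lo, hi = p0 - 1.96 * se, p0 + 1.96 * se   # approximate 95% CI -> (0.9404, 0.9596)

def band(cp):
    """Classify an empirical coverage probability into the four intervals."""
    if cp < 0.9404:
        return "<0.9404"
    if cp <= 0.95:
        return "[0.9404, 0.95]"
    if cp <= 0.9596:
        return "(0.95, 0.9596]"
    return ">0.9596"
```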


Table III. Simulation summary of percentages of empirical coverage probabilities falling into the specified intervals, based on 2000 Monte Carlo replications.

                              Percentage (average coverage probability, %)
                              Proposal                                          Fleiss
Setting  nk   r1       n      <=0.94          (0.94,0.95]    (0.95,0.96)       <=0.94          (0.94,0.95]    (0.95,0.96)
K <= 25  2    0        126    91.27 (92.87)   8.73 (94.27)   —                 71.43 (93.20)   28.57 (94.36)  —
         2    0.1-0.3  198    93.43 (92.90)   6.57 (94.28)   —                 85.35 (93.14)   14.65 (94.32)  —
         2    >0.3     96     83.33 (92.82)   16.67 (94.37)  —                 100.00 (91.36)  —              —
         <=5  0        126    90.48 (92.83)   9.52 (94.32)   —                 43.65 (93.52)   48.41 (94.41)  7.94 (95.27)
         <=5  0.1-0.3  198    92.42 (92.63)   7.58 (94.22)   —                 69.70 (92.77)   28.28 (94.37)  2.02 (95.11)
         <=5  >0.3     96     97.92 (92.06)   2.08 (94.13)   —                 100.00 (88.03)  —              —
         <=10 0        126    91.27 (92.81)   8.73 (94.22)   —                 30.16 (93.56)   54.76 (94.48)  15.08 (95.21)
         <=10 0.1-0.3  198    96.46 (92.31)   3.54 (94.16)   —                 78.28 (91.75)   20.20 (94.53)  1.52 (95.15)
         <=10 >0.3     96     96.88 (92.03)   3.13 (94.12)   —                 100.00 (82.26)  —              —
K >= 50  2    0        126    12.70 (93.73)   72.22 (94.53)  15.08 (95.25)     7.94 (93.72)    72.22 (94.61)  19.84 (95.28)
         2    0.1-0.3  198    13.13 (93.82)   70.71 (94.53)  16.16 (95.25)     25.25 (93.59)   62.12 (94.51)  12.12 (95.26)
         2    >0.3     96     11.46 (93.81)   77.08 (94.50)  11.46 (95.30)     92.71 (92.33)   7.29 (94.11)   —
         <=5  0        126    7.14 (93.88)    67.46 (94.56)  25.40 (95.25)     1.59 (93.53)    57.14 (94.67)  41.27 (95.31)
         <=5  0.1-0.3  198    17.17 (93.79)   71.72 (94.51)  11.11 (95.23)     40.91 (93.20)   44.44 (94.54)  14.65 (95.24)
         <=5  >0.3     96     13.54 (93.68)   79.17 (94.52)  7.29 (95.31)      100.00 (89.70)  —              —
         <=10 0        126    15.87 (93.80)   73.02 (94.49)  11.11 (95.20)     3.17 (93.84)    64.29 (94.67)  31.75 (95.28)
         <=10 0.1-0.3  198    21.72 (93.71)   73.74 (94.49)  4.55 (95.19)      58.59 (92.08)   37.37 (94.52)  4.04 (95.23)
         <=10 >0.3     96     16.67 (93.76)   77.08 (94.47)  6.25 (95.28)      100.00 (83.83)  —              —

Note: The numbers in parentheses are the average empirical coverage probabilities. A summary of the simulation configurations is given in Table II. n is the total number of simulation configurations for each combination of the K, nk, and r1 categories and is the denominator for the percentages in each row. Percentages in a row may sum to less than 100; for each such case, the Fleiss method has one record with an empirical coverage probability greater than 0.9596.


5. Examples

In this section, we present two real data examples assessing the agreement between two procedures. We provide a SAS macro, ClusterKappa, in the Supporting information/Web appendix (also available and maintained at http://works.bepress.com/zyang/18/); a brief introduction to the macro is given in Appendix E. For both examples, we present the summary data and the key results related to the kappa statistic calculation, for example, P̂o, P̂e, K̂, and the 95% CIs, in Table IV; results from testing marginal homogeneity can be found in [27].

The first example uses data from a diagnostic clinical trial comparing two principal gamma imaging techniques in nuclear medicine: positron emission tomography (PET) and single photon emission computed tomography (SPECT). The data consist of 21 subjects with 51 glands confirmed at surgery not to have hyperparathyroidism. For each subject, the cluster size nk varies from 1 to 4, with an average of 2.4 glands. Although PET provides good-resolution images for detecting abnormal parathyroid glands, it is expensive; SPECT is an alternative, less expensive technique with lower resolution. These data have been frequently used to illustrate statistical testing for marginal homogeneity in clustered matched-pair data ([28] and references therein).

The second example, from [45], aims to compare the responses of psychiatrists and their patients on issues relating to patient concerns and treatment. Each of 135 patients, matched with one of 29 psychiatrists, was administered a checklist of patient problems, goals, and methods for treatment. One of these items was the perceived relevance to treatment of the patient 'not being able to be in touch with reality'. The 29 psychiatrists were each asked to identify, in their opinion, whether this item was relevant to each of their patients. The 135 patients receiving treatment from these psychiatrists were also asked this question, as applied to themselves.
Each psychiatrist saw a different number of patients; that is, the cluster sizes nk vary from 1 to 8 with an average of 4.7 patients. We summarize the results in Table IV, where +(1) / -(0) corresponds to 'Identifies item' or 'Does not identify item', respectively.

For the kappa statistic, only values between 0 and 1 have a useful meaning. Landis and Koch [10] suggested ranges of values for the kappa statistic with respect to the degree of agreement; for example, values between 0.40 and 0.75 may be taken to represent fair to good agreement beyond chance. Therefore, for the first example, after taking clustering into consideration, the estimated kappa statistic K̂ = K̂c = 0.422 indicates that there is good agreement between the PET and SPECT procedures for the diagnosis of parathyroid status. In contrast, for the second example, K̂ = K̂c = -0.016, suggesting that the agreement between the patient's and the psychiatrist's assessments of the patient's psychological state is even worse than chance agreement. This may also indicate the propensity of a patient to avoid the psychological state assessment made by the psychiatrist.
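The pooled point estimates for the first example can be verified from the Table IV counts; this sketch reproduces P̂o, P̂e, and the kappa point estimate (the clustered variance estimator of Section 3 needs the cluster-level data and is not reproduced here):

```python
import numpy as np

# Cell counts from Table IV, example 1 (rows: SPECT +/-, columns: PET +/-), N = 51.
counts = np.array([[39, 7],
                   [1, 4]])
p = counts / counts.sum()
po = np.trace(p)                      # observed agreement
pe = p.sum(axis=1) @ p.sum(axis=0)    # chance agreement from the marginals
kappa_hat = (po - pe) / (1 - pe)      # pooled kappa point estimate
```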

Table IV. Summary information of the data analysis for the kappa statistic.

Example 1 (parathyroid status, N = 51)
                 PET +(1)      PET -(0)      Total
SPECT +(1)       39 (0.7647)   7 (0.1373)    46 (0.9020)
SPECT -(0)       1 (0.0196)    4 (0.0784)    5 (0.0980)
Total            40 (0.7843)   11 (0.2157)   51

Example 2 (psychological state, N = 135)
                 Psychiatrist +(1)   Psychiatrist -(0)   Total
Patient +(1)     25 (0.1852)         21 (0.1556)         46 (0.3407)
Patient -(0)     50 (0.3704)         39 (0.2889)         89 (0.6593)
Total            75 (0.5556)         60 (0.4444)         135

Parameters          Example 1           Example 2
P̂o, P̂e            0.8431, 0.7286      0.4741, 0.4823
K̂ (SE)             0.4221 (0.1606)     -0.0159 (0.0784)
95% CI for K̂       (0.1073, 0.7369)    (-0.1696, 0.1378)
K̂c (SE)            0.4221 (0.1554)     -0.0159 (0.0928)
95% CI for K̂c      (0.1176, 0.7266)    (-0.1977, 0.1659)

PET, positron emission tomography; SPECT, single photon emission computed tomography; SE, standard error.


Note in the first example that the SE (= 0.1554) for K̂c is slightly smaller than the SE (= 0.1606) for K̂, which seems counter-intuitive. To obtain a better understanding of this result, we perform a simulation study to mimic the setting in this example. Following the simulation setup in Section 4, we fix the parameters in the within-cluster correlation structure R as r1 = r2 = ρ̂ = 0.45795 [28], r3 = 0.78094 (derived to obtain K̂ = 0.422), and r4 = 0.6ρ̂. The other parameters used are K = 21, nk ≤ 4 with an average cluster size of 2.4, p1+ = 0.9020, p2+ = 0.0980, p+1 = 0.7843, and p+2 = 0.2157. On the basis of the results from 2000 Monte Carlo replications, the mean SE for K̂c is estimated to be 0.1593, and the mean SE for K̂ is 0.1527. Therefore, the results intuitively support that the SE from the method assuming independence tends to underestimate the SE of the kappa statistic on average. Moreover, the simulation results also suggest that the SE from the method assuming independence can be larger than the SE from the proposed method (in 38% of the simulation replications), which indicates that this counter-intuitive situation may happen in real data analysis.
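The independence-based SE of 0.1606 can be checked directly. The sketch below implements the large-sample variance in the style of Fleiss, Cohen, and Everitt [40] together with the delta-method form in (3) (which Appendix A shows to be algebraically identical) and recovers the SE for the first example; the clustered SE of 0.1554 requires the cluster-level data and is not reproduced here:

```python
import numpy as np

# Cell proportions for example 1 (Table IV), N = 51.
p = np.array([[39.0, 7.0], [1.0, 4.0]]) / 51.0
N = 51
pr, pc = p.sum(axis=1), p.sum(axis=0)       # row and column marginals
po, pe = np.trace(p), pr @ pc
k = (po - pe) / (1 - pe)

w = np.add.outer(pc, pr)                    # w[i, j] = p_{+i} + p_{j+}

# (a) Large-sample variance in the style of Fleiss, Cohen, and Everitt [40].
diag_term = (np.diag(p) * (1 - (pr + pc) * (1 - k)) ** 2).sum()
off_term = (p * w ** 2).sum() - (np.diag(p) * np.diag(w) ** 2).sum()
var_f = (diag_term + (1 - k) ** 2 * off_term
         - (k - pe * (1 - k)) ** 2) / (N * (1 - pe) ** 2)

# (b) Delta-method form corresponding to (3).
s1 = (np.diag(p) * (pr + pc)).sum()
num = (po * (1 - po) * (1 - pe) ** 2
       + 2 * (1 - po) * (1 - pe) * (2 * po * pe - s1)
       + (1 - po) ** 2 * ((p * w ** 2).sum() - 4 * pe ** 2))
var_d = num / (N * (1 - pe) ** 4)

se_f, se_d = np.sqrt(var_f), np.sqrt(var_d)
```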

6. Conclusions

Matched-pair data collected in clusters are frequently encountered in medical research. For such data, the SE of the kappa statistic computed assuming within-cluster independence tends to underestimate the standard deviation of the kappa statistic on average, particularly when there is a strong within-cluster correlation. The results of our extensive Monte Carlo simulation study show that this underestimation yields CIs that are often too narrow and have poor performance in terms of the empirical coverage probability. For clustered matched-pair data, very limited research has been performed on estimating kappa and its SE. On the basis of the delta method and sampling techniques, we propose a nonparametric variance estimator for the kappa statistic, which provides a general solution without requiring a structural within-cluster correlation or distributional assumptions. The proposed kappa statistic produces fairly consistent estimates of the true kappa, even when the number of clusters is as small as K = 15. For a moderately large number of clusters (e.g., K ≥ 50), the empirical coverage probabilities from the proposed variance estimator are reasonably close to the nominal level after taking into account the within-cluster correlation. Although the performance of the proposal is satisfactory when the number of clusters is large, for a small number of clusters, say K < 50, alternative strategies to develop or improve the variance estimation for the kappa statistic are worth exploring in future research. Meanwhile, we note that a limitation of the current research is its assumption of homogeneity of kappa across the K clusters. In practice, this assumption may not always hold, and it needs formal testing.
For example, consider a hypothetical scenario based on the second example in Section 5: because different numbers of patients are matched to each psychiatrist, if a more experienced psychiatrist is assigned to evaluate patients with more severe psychological states, a higher degree of disagreement with those patients may be expected. This hypothetical scenario may therefore lead to heterogeneity of kappa, and the proposed kappa statistic and its corresponding variance estimator would not be appropriate. Following ideas similar to those used for testing the homogeneity of independent kappa statistics by Donner et al. [46] and Nam [47], we will explore this potential research topic and develop omnibus tests of homogeneity of kappa statistics for clustered matched-pair data. When the test of homogeneity admits heterogeneous kappa across the K clusters, it will be interesting to develop improved estimators that appropriately account for cluster heterogeneity by utilizing the idea from Barlow [9], or to develop diagnostic tools to identify the outlying cluster(s) and exclude them from the summary. Finally, prudent efforts at the study design stage in carefully choosing the clusters can help avoid violation of the homogeneity assumption.

Appendix A: Equivalence of V̂_F and V̂_D

To establish the equivalence between (2) and (3), it is helpful to write
$$\sum_{i=1}^{g}\sum_{j=1}^{g}\hat p_{ij}\left(\hat p_{j+}+\hat p_{+i}\right)^{2}=\sum_{i=1}^{g}\hat p_{ii}\left(\hat p_{i+}+\hat p_{+i}\right)^{2}+\sum_{i=1}^{g}\sum_{\substack{j=1\\ j\neq i}}^{g}\hat p_{ij}\left(\hat p_{j+}+\hat p_{+i}\right)^{2}.$$

From (3), we have
$$\hat V_D\,N\left(1-\hat P_e\right)^{4}=\hat P_o\left(1-\hat P_o\right)\left(1-\hat P_e\right)^{2}+2\left(1-\hat P_o\right)\left(1-\hat P_e\right)\left[2\hat P_o\hat P_e-\sum_{i=1}^{g}\hat p_{ii}\left(\hat p_{i+}+\hat p_{+i}\right)\right]+\left(1-\hat P_o\right)^{2}\left[\sum_{i=1}^{g}\sum_{j=1}^{g}\hat p_{ij}\left(\hat p_{j+}+\hat p_{+i}\right)^{2}-4\hat P_e^{2}\right].$$
Substituting the decomposition above for the double sum and collecting the terms that involve $\hat p_{ii}$ gives
$$\hat V_D\,N\left(1-\hat P_e\right)^{4}=\hat P_o\left(1-\hat P_o\right)\left(1-\hat P_e\right)^{2}-2\left(1-\hat P_o\right)\left(1-\hat P_e\right)\left[\sum_{i=1}^{g}\hat p_{ii}\left(\hat p_{i+}+\hat p_{+i}\right)-2\hat P_o\hat P_e\right]+\left(1-\hat P_o\right)^{2}\left[\sum_{i=1}^{g}\hat p_{ii}\left(\hat p_{i+}+\hat p_{+i}\right)^{2}+\sum_{i=1}^{g}\sum_{\substack{j=1\\ j\neq i}}^{g}\hat p_{ij}\left(\hat p_{j+}+\hat p_{+i}\right)^{2}-4\hat P_e^{2}\right]=\hat V_F\,N\left(1-\hat P_e\right)^{4},$$
where the last equality follows after expanding $\hat V_F$ in (2) with $1-\hat K=\left(1-\hat P_o\right)/\left(1-\hat P_e\right)$ and matching terms. Therefore, the variance estimator (2) by Fleiss et al. [6] is just an alternative expression of the variance estimator in (3), which was derived using the delta method.

Appendix B: Derivation of the asymptotic variance of K̂

Before proceeding with the detailed derivation of the asymptotic variance (3) for K̂ using the delta method, we list two straightforward lemmas to be used in the subsequent proof of Proposition 1.

Lemma 1
Suppose $(X_1,\ldots,X_g)$ follows a multinomial distribution with parameters $(p_1,\ldots,p_g)$ and $\sum_{i=1}^{g}X_i=N$. Then:

(1) The MLE for $p_i$ is $\hat p_i = X_i/N$.
(2) $X_i \sim \mathrm{Binomial}(N, p_i)$, for any $i = 1,\ldots,g$.
(3) The covariance of $X_i$ and $X_j$ is $\mathrm{Cov}(X_i, X_j) = -N p_i p_j + N p_i I\{i=j\}$.
(4) Collapsing any two cells, say $X_{i-1}$ and $X_i$ without loss of generality, and letting $Y = X_{i-1} + X_i$, then $(X_1,\ldots,X_{i-2}, Y, X_{i+1},\ldots,X_g) \sim \mathrm{multinomial}(N;\, p_1,\ldots,p_{i-2},\, p_{i-1}+p_i,\, p_{i+1},\ldots,p_g)$.
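Item (3) can be checked numerically; the following sketch (arbitrary N and p, not taken from the paper) compares the empirical covariance of multinomial counts with the formula Cov(X_i, X_j) = -N p_i p_j + N p_i I{i = j}:

```python
import numpy as np

rng = np.random.default_rng(7)
N, p = 100, np.array([0.2, 0.3, 0.5])
X = rng.multinomial(N, p, size=200_000)       # 200k multinomial draws

emp = np.cov(X, rowvar=False)                 # empirical covariance of the counts
theory = N * (np.diag(p) - np.outer(p, p))    # -N p_i p_j + N p_i 1{i=j}
```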

Lemma 2
Let $a$ be a scalar and $b, c$ be $g\times 1$ vectors. For the continuous transformation
$$f(a, b, c)=\begin{pmatrix} a \\ b^{T}c \end{pmatrix},$$
the gradient of $f$ with respect to $\left(a, b^{T}, c^{T}\right)^{T}$ is
$$J(a, b, c)=\begin{pmatrix} 1 & \mathbf 0^{T} & \mathbf 0^{T} \\ 0 & c^{T} & b^{T} \end{pmatrix}.$$

For the matched-pair data in Section 2, assuming a multinomial model $\{n_{ij}\}\sim\mathrm{multinomial}\left(N,\{p_{ij}\}_{g^{2}\times 1}\right)$, under mild regularity conditions the MLE of $\{p_{ij}\}_{g^{2}\times 1}$ is $\hat{\boldsymbol\pi}=\left(\hat p_{11},\ldots,\hat p_{1g},\ldots,\hat p_{g1},\ldots,\hat p_{g,g-1}\right)^{T}$, which follows an asymptotic multivariate normal distribution.


Let $\hat{\mathbf p}_{r}=\left(\hat p_{1+},\ldots,\hat p_{g+}\right)^{T}$ and $\hat{\mathbf p}_{c}=\left(\hat p_{+1},\ldots,\hat p_{+g}\right)^{T}$ denote the vectors of row and column marginals. The quantity $\hat{\boldsymbol\theta}=\left(\hat P_o,\hat{\mathbf p}_{r}^{T},\hat{\mathbf p}_{c}^{T}\right)^{T}$ can be written as a linear transformation of $\hat{\boldsymbol\pi}$; that is, there exists some constant matrix $A$ such that $\hat{\boldsymbol\theta}=A\hat{\boldsymbol\pi}$. Therefore, $\hat{\boldsymbol\theta}$ follows an asymptotic multivariate normal distribution and is the MLE of $\boldsymbol\theta=\left(P_o,\mathbf p_{r}^{T},\mathbf p_{c}^{T}\right)^{T}$. By the consistency of the MLE, $\hat{\boldsymbol\theta}\to\boldsymbol\theta$.

Proposition 1
Assuming a multinomial sampling model $n_{ij}\sim\mathrm{multinomial}\left(N,\{p_{ij}\}_{g^{2}\times 1}\right)$, let $\hat{\boldsymbol\theta}=\left(\hat P_o,\hat{\mathbf p}_{r}^{T},\hat{\mathbf p}_{c}^{T}\right)^{T}$. The asymptotic variance of $\hat{\boldsymbol\theta}$ is
$$\mathrm{Var}\left(\hat{\boldsymbol\theta}\right)=\frac{1}{N}\begin{pmatrix} V_{11} & \mathbf V_{12} & \mathbf V_{13} \\ \mathbf V_{21} & \mathbf V_{22} & \mathbf V_{23} \\ \mathbf V_{31} & \mathbf V_{32} & \mathbf V_{33} \end{pmatrix}, \qquad (10)$$
where $V_{11}$ is a scalar and both $\mathbf V_{22}$ and $\mathbf V_{33}$ are $g\times g$ matrices. The elements of $\mathrm{Var}(\hat{\boldsymbol\theta})$ are
$$\begin{aligned}
V_{11} &= P_o(1-P_o), \\
\mathbf V_{12}=\mathbf V_{21}^{T} &= N\,\mathrm{Cov}\left(\hat P_o,\hat{\mathbf p}_{r}\right)=\left(p_{11},\ldots,p_{gg}\right)-P_o\mathbf p_{r}^{T}, \\
\mathbf V_{13}=\mathbf V_{31}^{T} &= N\,\mathrm{Cov}\left(\hat P_o,\hat{\mathbf p}_{c}\right)=\left(p_{11},\ldots,p_{gg}\right)-P_o\mathbf p_{c}^{T}, \\
\mathbf V_{23}=\mathbf V_{32}^{T} &= N\,\mathrm{Cov}\left(\hat{\mathbf p}_{r},\hat{\mathbf p}_{c}\right)=\left(p_{ij}\right)_{g\times g}-\mathbf p_{r}\mathbf p_{c}^{T}, \\
\mathbf V_{22} &= N\,\mathrm{Var}\left(\hat{\mathbf p}_{r}\right)=\mathrm{diag}\left(p_{1+},\ldots,p_{g+}\right)-\mathbf p_{r}\mathbf p_{r}^{T}, \\
\mathbf V_{33} &= N\,\mathrm{Var}\left(\hat{\mathbf p}_{c}\right)=\mathrm{diag}\left(p_{+1},\ldots,p_{+g}\right)-\mathbf p_{c}\mathbf p_{c}^{T}.
\end{aligned}$$

Proof
Note that $\hat P_o=N^{-1}\sum_{i=1}^{g}n_{ii}$; by item 4 in Lemma 1, $\sum_{i=1}^{g}n_{ii}\sim\mathrm{Binomial}(N,P_o)$; thus, $V_{11}=P_o(1-P_o)$. For $\mathbf V_{12}$ and $\mathbf V_{21}$,
$$\mathbf V_{12}=N\,\mathrm{Cov}\left(\frac{1}{N}\sum_{i=1}^{g}n_{ii},\ \frac{1}{N}\left(n_{1+},\ldots,n_{g+}\right)^{T}\right)=\frac{1}{N}\left(\sum_{i=1}^{g}\sum_{k=1}^{g}\mathrm{Cov}\left(n_{ii},n_{1k}\right),\ldots,\sum_{i=1}^{g}\sum_{k=1}^{g}\mathrm{Cov}\left(n_{ii},n_{gk}\right)\right).$$
By item 3 in Lemma 1, it is noted that
$$\sum_{i=1}^{g}\sum_{k=1}^{g}\mathrm{Cov}\left(n_{ii},n_{jk}\right)=\sum_{i=1}^{g}\sum_{k=1}^{g}\left(Np_{ii}I\{i=j,\,i=k\}-Np_{ii}p_{jk}\right)=Np_{jj}-N P_o p_{j+}.$$
Therefore,
$$\mathbf V_{12}=\mathbf V_{21}^{T}=\left(p_{11}-P_o p_{1+},\ldots,p_{gg}-P_o p_{g+}\right)=\left(p_{11},\ldots,p_{gg}\right)-P_o\mathbf p_{r}^{T}.$$
By the symmetry of $\mathbf p_{r}$ and $\mathbf p_{c}$, $\mathbf V_{13}$ and $\mathbf V_{31}$ can be derived similarly as
$$\mathbf V_{13}=\mathbf V_{31}^{T}=\left(p_{11}-P_o p_{+1},\ldots,p_{gg}-P_o p_{+g}\right)=\left(p_{11},\ldots,p_{gg}\right)-P_o\mathbf p_{c}^{T}.$$


It is also noted, by item 3 in Lemma 1, that
$$\mathrm{Cov}\left(\hat p_{i+},\hat p_{+j}\right)=\frac{1}{N^{2}}\sum_{k=1}^{g}\sum_{m=1}^{g}\mathrm{Cov}\left(n_{ik},n_{mj}\right)=\frac{1}{N^{2}}\sum_{k=1}^{g}\sum_{m=1}^{g}\left\{Np_{ij}I\{i=m,\,k=j\}-Np_{ik}p_{mj}\right\}=\frac{1}{N}\left(p_{ij}-p_{i+}p_{+j}\right).$$
Therefore, for $\mathbf V_{23}$ and $\mathbf V_{32}$,
$$\mathbf V_{23}=\mathbf V_{32}^{T}=N\,\mathrm{Cov}\left(\hat{\mathbf p}_{r},\hat{\mathbf p}_{c}\right)=\left(p_{ij}\right)_{g\times g}-\mathbf p_{r}\mathbf p_{c}^{T}.$$

For $\mathbf V_{22}$ and $\mathbf V_{33}$, by item 4 in Lemma 1, it is observed that $N\hat{\mathbf p}_{r}\sim\mathrm{multinomial}(N,\mathbf p_{r})$ and $N\hat{\mathbf p}_{c}\sim\mathrm{multinomial}(N,\mathbf p_{c})$, and this leads to
$$\begin{aligned}
\mathbf V_{22} &= N^{-1}\mathrm{Var}\left(N\hat{\mathbf p}_{r}\right)=\left(p_{i+}I\{i=j\}-p_{i+}p_{j+}\right)_{g\times g}=\mathrm{diag}\left(p_{1+},\ldots,p_{g+}\right)-\mathbf p_{r}\mathbf p_{r}^{T}, \\
\mathbf V_{33} &= N^{-1}\mathrm{Var}\left(N\hat{\mathbf p}_{c}\right)=\left(p_{+i}I\{i=j\}-p_{+i}p_{+j}\right)_{g\times g}=\mathrm{diag}\left(p_{+1},\ldots,p_{+g}\right)-\mathbf p_{c}\mathbf p_{c}^{T}.
\end{aligned}$$

On the basis of the quantities in Proposition 1 and noting that $P_e=\mathbf p_{r}^{T}\mathbf p_{c}$, the following identities are derived and useful for the simplification of the final variance of $\hat K$ in (3):
$$\begin{aligned}
\mathbf p_{c}^{T}\mathbf V_{22}\mathbf p_{c} &= \sum_{i=1}^{g}p_{+i}^{2}p_{i+}-P_e^{2}=\sum_{i=1}^{g}\sum_{j=1}^{g}p_{+i}^{2}p_{ij}-P_e^{2}, \\
\mathbf p_{r}^{T}\mathbf V_{33}\mathbf p_{r} &= \sum_{j=1}^{g}p_{j+}^{2}p_{+j}-P_e^{2}=\sum_{i=1}^{g}\sum_{j=1}^{g}p_{ij}p_{j+}^{2}-P_e^{2}, \qquad (11)\\
\mathbf p_{c}^{T}\mathbf V_{23}\mathbf p_{r} &= \sum_{i=1}^{g}\sum_{j=1}^{g}p_{+i}p_{ij}p_{j+}-P_e^{2}.
\end{aligned}$$

The calculation in (1) suggests that the quantity $\boldsymbol\eta=(\hat P_o,\hat P_e)^{T}$ can be written as a function of $(\hat P_o,\hat{\mathbf p}_{r},\hat{\mathbf p}_{c})^{T}$, that is, $(\hat P_o,\hat P_e)^{T}=f(\hat P_o,\hat{\mathbf p}_{r},\hat{\mathbf p}_{c})$. By Lemma 2 and the delta method, the asymptotic variance of $\sqrt N(\hat P_o,\hat P_e)^{T}$ is
$$\mathrm{Var}\left(\boldsymbol\eta\right)=N\,\mathrm{Var}\begin{pmatrix}\hat P_o\\ \hat P_e\end{pmatrix}=J\left(P_o,\mathbf p_{r},\mathbf p_{c}\right)\left[N\,\mathrm{Var}\left(\hat{\boldsymbol\theta}\right)\right]J\left(P_o,\mathbf p_{r},\mathbf p_{c}\right)^{T}.$$
After some straightforward simplification and applying the results in (11), we have
$$\mathrm{Var}\left(\boldsymbol\eta\right)=\begin{pmatrix} V_{11} & \mathbf V_{12}\mathbf p_{c}+\mathbf V_{13}\mathbf p_{r} \\ \mathbf p_{c}^{T}\mathbf V_{21}+\mathbf p_{r}^{T}\mathbf V_{31} & \mathbf p_{c}^{T}\mathbf V_{22}\mathbf p_{c}+\mathbf p_{r}^{T}\mathbf V_{33}\mathbf p_{r}+2\mathbf p_{c}^{T}\mathbf V_{23}\mathbf p_{r}\end{pmatrix}
=\begin{pmatrix} P_o\left(1-P_o\right) & \displaystyle\sum_{i=1}^{g}p_{ii}\left(p_{i+}+p_{+i}\right)-2P_oP_e \\ \displaystyle\sum_{i=1}^{g}p_{ii}\left(p_{i+}+p_{+i}\right)-2P_oP_e & \displaystyle\sum_{i=1}^{g}\sum_{j=1}^{g}p_{ij}\left(p_{+i}+p_{j+}\right)^{2}-4P_e^{2}\end{pmatrix}.$$


Finally, because $K$ is a function of $P_o$ and $P_e$, that is, $K=h(P_o,P_e)=(P_o-P_e)/(1-P_e)$, and $\hat K=h(\hat P_o,\hat P_e)$, the estimator of the asymptotic variance of $\hat K$ can be presented as (3) after applying the delta method.
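As a numerical sanity check of the resulting asymptotic variance (3), a sketch with hypothetical cell probabilities (not from the paper) can compare the formula against the Monte Carlo variance of K̂ under multinomial sampling:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([[0.55, 0.10],
              [0.05, 0.30]])          # hypothetical true cell probabilities
N, R = 500, 20_000                    # sample size per table, replications

pr, pc = p.sum(axis=1), p.sum(axis=0)
po, pe = np.trace(p), pr @ pc
w = np.add.outer(pc, pr)              # w[i, j] = p_{+i} + p_{j+}
s1 = (np.diag(p) * (pr + pc)).sum()
num = (po * (1 - po) * (1 - pe) ** 2
       + 2 * (1 - po) * (1 - pe) * (2 * po * pe - s1)
       + (1 - po) ** 2 * ((p * w ** 2).sum() - 4 * pe ** 2))
var_asym = num / (N * (1 - pe) ** 4)  # asymptotic variance from (3)

tables = rng.multinomial(N, p.ravel(), size=R).reshape(R, 2, 2) / N
po_hat = np.trace(tables, axis1=1, axis2=2)
pe_hat = (tables.sum(axis=2) * tables.sum(axis=1)).sum(axis=1)
k_hat = (po_hat - pe_hat) / (1 - pe_hat)
ratio = k_hat.var(ddof=1) / var_asym  # should be close to 1 for large N
```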


Appendix C: Special case of all cluster sizes $n_k = 1$

Consider the case where the cluster size is 1 for all clusters in Section 3, that is, $n_k = 1$ and $\sum_{i,j}n_{ij,k}=1$ for all $k=1,\ldots,K$. Then $n_{ij,k},\hat P_{o,k},\hat p_{i+,k},\hat p_{+i,k}\in\{0,1\}$ for all $i,j,k$, and
$$N=\sum_{k=1}^{K}n_k=K,\qquad \boldsymbol\omega=\frac{1}{K}\mathbf 1_{K\times 1},\qquad \boldsymbol\Omega=\frac{1}{K^{2}}\left(\mathbf I-\frac{1}{K}\mathbf 1\mathbf 1^{T}\right).$$
Denote
$$\tilde n_{ii}=\sum_{k=1}^{K}n_{ii,k},\qquad \tilde n_{i+}=\sum_{k=1}^{K}n_{i+,k},\qquad \tilde n_{+i}=\sum_{k=1}^{K}n_{+i,k}.$$
It can be easily verified that $\hat P_o$ and $\hat P_e$ are the same as those in (1):
$$\hat P_o=\frac{1}{K}\sum_{k=1}^{K}\hat P_{o,k}=\frac{1}{K}\sum_{i=1}^{g}\tilde n_{ii},\qquad \hat P_e=\frac{1}{K^{2}}\sum_{i=1}^{g}\tilde n_{i+}\tilde n_{+i},$$
and each element in (6) can be simplified. For $\hat V_{11}$, we have
$$\hat V_{11}=\frac{1}{K-1}\left(\sum_{k=1}^{K}\hat P_{o,k}^{2}-K\hat P_o^{2}\right)=\frac{1}{K-1}\left(K\hat P_o-K\hat P_o^{2}\right)=\frac{K}{K-1}\hat P_o\left(1-\hat P_o\right).$$
Similarly, the term $\hat{\mathbf V}_{12}$ can be simplified as
$$\hat{\mathbf V}_{12}=\frac{1}{K-1}\left[\sum_{k=1}^{K}\hat P_{o,k}\hat{\mathbf p}_{r,k}^{T}-\frac{1}{K}\left(\sum_{k=1}^{K}\hat P_{o,k}\right)\left(\sum_{k=1}^{K}\hat{\mathbf p}_{r,k}^{T}\right)\right]=\frac{1}{K-1}\left[\sum_{k=1}^{K}\left(n_{11,k},\ldots,n_{gg,k}\right)-\frac{1}{K}\left(\sum_{k=1}^{K}\hat P_{o,k}\right)\left(\sum_{k=1}^{K}\hat{\mathbf p}_{r,k}^{T}\right)\right]=\frac{K}{K-1}\left\{\left(\hat p_{11},\ldots,\hat p_{gg}\right)-\hat P_o\hat{\mathbf p}_{r}^{T}\right\},$$
and, by symmetry, $\hat{\mathbf V}_{13}$ can be shown to be
$$\hat{\mathbf V}_{13}=\frac{K}{K-1}\left\{\left(\hat p_{11},\ldots,\hat p_{gg}\right)-\hat P_o\hat{\mathbf p}_{c}^{T}\right\}.$$
Notice that $\hat{\mathbf p}_{r,k}\hat{\mathbf p}_{c,k}^{T}=\left(\hat p_{i+,k}\hat p_{+j,k}\right)_{g\times g}$ and $\hat p_{i+,k}\hat p_{+j,k}=n_{ij,k}^{2}=n_{ij,k}$; for $\hat{\mathbf V}_{23}$, we have
$$\hat{\mathbf V}_{23}=\frac{1}{K-1}\left\{\sum_{k=1}^{K}\hat{\mathbf p}_{r,k}\hat{\mathbf p}_{c,k}^{T}-\frac{1}{K}\left(\sum_{k=1}^{K}\hat{\mathbf p}_{r,k}\right)\left(\sum_{k=1}^{K}\hat{\mathbf p}_{c,k}^{T}\right)\right\}=\frac{K}{K-1}\left\{\left(\hat p_{ij}\right)_{g\times g}-\hat{\mathbf p}_{r}\hat{\mathbf p}_{c}^{T}\right\}.$$
Finally, it is noted that $\hat{\mathbf p}_{r,k}\hat{\mathbf p}_{r,k}^{T}=\left(\hat p_{i+,k}\hat p_{j+,k}\right)_{g\times g}=\left(n_{i+,k}I\{i=j\}\right)_{g\times g}$ and $\hat{\mathbf p}_{c,k}\hat{\mathbf p}_{c,k}^{T}=\left(n_{+i,k}I\{i=j\}\right)_{g\times g}$; hence
$$\hat{\mathbf V}_{22}=\frac{K}{K-1}\left\{\mathrm{diag}\left(\hat p_{1+},\ldots,\hat p_{g+}\right)-\hat{\mathbf p}_{r}\hat{\mathbf p}_{r}^{T}\right\},\qquad \hat{\mathbf V}_{33}=\frac{K}{K-1}\left\{\mathrm{diag}\left(\hat p_{+1},\ldots,\hat p_{+g}\right)-\hat{\mathbf p}_{c}\hat{\mathbf p}_{c}^{T}\right\}.$$
We notice that the only difference between these simplified estimators and the corresponding quantities in Proposition 1 is the finite correction factor $K/(K-1)$, which is negligible when the number of clusters $K$ is large. Therefore, when all cluster sizes $n_k=1$, the final variance estimator is expected to be $\hat V_D$ (or $\hat V_F$) multiplied by the finite correction factor $K/(K-1)$.
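The $\hat V_{11}$ simplification rests on the fact that, for 0/1 cluster-level agreements, the ddof = 1 sample variance equals $K/(K-1)$ times $\hat P_o(1-\hat P_o)$; a quick check (with an arbitrary K):

```python
import numpy as np

rng = np.random.default_rng(3)
K = 40
# With n_k = 1, each cluster-level observed agreement P_hat_{o,k} is 0 or 1.
x = rng.integers(0, 2, size=K).astype(float)
p_o = x.mean()                            # overall P_hat_o

v_sample = x.var(ddof=1)                  # (1/(K-1)) * sum_k (x_k - P_hat_o)^2
v11_hat = K / (K - 1) * p_o * (1 - p_o)   # V_hat_11 from Appendix C
```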


Appendix D: Restrictions on the correlation matrix R

In this section, we derive restrictions and justifications for the parameters $r_1, r_2, r_3$, and $r_4$ of the correlation matrix $\mathbf R$ in Section 4.1. One can apply these restrictions to assure the nonnegative definiteness of $\mathbf R$ when it is used to generate the correlated multivariate normal responses. One can obtain the eigenvalues $\lambda$ of $\mathbf R$ by solving
$$\det\left(\mathbf R-\lambda\mathbf I\right)=\left\{\left[\lambda-(1-r_1)\right]\left[\lambda-(1-r_2)\right]-\left(r_3-r_4\right)^{2}\right\}^{n_k-1}\times\left\{\left[\lambda-(n_k-1)r_1-1\right]\left[\lambda-(n_k-1)r_2-1\right]-\left[r_3+(n_k-1)r_4\right]^{2}\right\}=0,$$
which leads to
$$\lambda=\frac{1}{2}\left[(2-r_1-r_2)\pm\sqrt{(r_1-r_2)^{2}+4(r_3-r_4)^{2}}\right]\quad\text{and}\quad \lambda=\frac{1}{2}\left[2+(n_k-1)(r_1+r_2)\pm\sqrt{(n_k-1)^{2}(r_1-r_2)^{2}+4\left[r_3+(n_k-1)r_4\right]^{2}}\right].$$
Assuming the domain of $r_1, r_2, r_3$, and $r_4$ is $[-1, 1]$, it can be verified that $\mathbf R$ is nonnegative definite if and only if all of the following conditions are satisfied:
$$\begin{aligned}
(C1)&\quad (1-r_1)(1-r_2)\geq(r_3-r_4)^{2}, \\
(C2)&\quad 2+(n_k-1)(r_1+r_2)\geq 0, \\
(C3)&\quad \left[1+(n_k-1)r_1\right]\left[1+(n_k-1)r_2\right]\geq\left[r_3+(n_k-1)r_4\right]^{2}.
\end{aligned}$$
It is noted that condition (C1) is independent of the cluster size $n_k$. If the domain of $r_1, r_2, r_3$, and $r_4$ is $[0, 1]$, which is used in the simulation, condition (C2) is always satisfied; conditions (C1) and (C3) then determine the constrained range of $r_3$ used in the simulations.
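The conditions (C1)-(C3) can be checked against a direct eigenvalue computation. The block-exchangeable form of R below is inferred from the determinant factorization above, and the parameter values are arbitrary illustrations:

```python
import numpy as np

def build_R(nk, r1, r2, r3, r4):
    # Correlation for the 2*nk latent responses of one cluster: r1 (r2)
    # within procedure 1 (2); r3 across procedures for the same unit;
    # r4 across procedures for different units (structure inferred from
    # the eigenvalue equation in Appendix D).
    I, J = np.eye(nk), np.ones((nk, nk))
    R1 = (1 - r1) * I + r1 * J
    R2 = (1 - r2) * I + r2 * J
    R12 = (r3 - r4) * I + r4 * J
    return np.block([[R1, R12], [R12.T, R2]])

nk, r1, r2, r3, r4 = 3, 0.3, 0.3, 0.5, 0.3
C1 = (1 - r1) * (1 - r2) >= (r3 - r4) ** 2
C2 = 2 + (nk - 1) * (r1 + r2) >= 0
C3 = (1 + (nk - 1) * r1) * (1 + (nk - 1) * r2) >= (r3 + (nk - 1) * r4) ** 2
eigs = np.linalg.eigvalsh(build_R(nk, r1, r2, r3, r4))
```

When (C1)-(C3) hold, the smallest eigenvalue is nonnegative, so the matrix can safely parameterize a multivariate normal draw.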

Appendix E: SAS macro for calculating the kappa statistics

The SAS macro ClusterKappa is developed to calculate the kappa statistics for clustered matched-pair data; the parameters included in the macro are listed as follows:

%macro ClusterKappa(
  dsin   = ,  /* the input dataset                                          */
  clusV  = ,  /* the cluster variable; must be numeric                      */
  unitV  = ,  /* the unit variable; must be numeric                         */
  trt1V  = ,  /* variable for procedure 1                                   */
  Resp1V = ,  /* response variable for procedure 1; numeric integer and > 0 */
  trt2V  = ,  /* variable for procedure 2                                   */
  Resp2V =    /* response variable for procedure 2; numeric integer and > 0 */
);
  ...
%mend;

To perform the data analysis and call the macro, the input dataset is expected to have the following structure, as shown in the example SAS dataset response:


/***** Here is the example-1 ********/
data response;
  input Cluster Subject Trt2 Resp2 Trt1 Resp1;
  cards;
1 1 1 0 2 0
1 2 1 0 2 1
1 3 1 0 2 1
.........
run;

data response;
  set response;
  if Resp1 = 1 then Resp1 = 2; else Resp1 = 1;
  if Resp2 = 1 then Resp2 = 2; else Resp2 = 1;
run;

%ClusterKappa(dsin = response, clusV = cluster, unitV = subject,
              trt1V = trt1, Resp1V = Resp1,
              trt2V = trt2, Resp2V = Resp2);


The final SAS output from the macro contains two parts: the first part presents the results from the SAS procedure proc freq assuming within-cluster independence, including the kappa statistic, its asymptotic standard error, and the lower and upper limits of the 95% CI; the second part includes the results using the proposed estimators to calculate the statistics for the clustered matched-pair data. Sample output for Example 1 in Section 5 is shown below.

*** Part 1: Kappa statistics assuming independence ***
Simple Kappa Coefficient
--------------------------------
Kappa                  0.4221
ASE                    0.1606
95% Lower Conf Limit   0.1073
95% Upper Conf Limit   0.7369
Sample Size = 51

*** Part 2: Kappa statistics for clustered matched-pair data ***
P_O        P_E        kappa      Standard Error
0.8431373  0.7285659  0.4220963  0.155379
95% CI Lower limit    95% CI Upper limit
0.1175534             0.7266392

Acknowledgements The authors are grateful to the editor, the associate editor, and two anonymous referees for their careful reading of the manuscript. Their valuable, constructive criticisms and suggestions improved the quality and presentation of this material.

References


1. Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 1960; 20(1):37-46.
2. Agresti A. Categorical Data Analysis. John Wiley & Sons: New Jersey, 2002.
3. Jakobsson U, Westergren A. Statistical methods for assessing agreement for ordinal data. Scandinavian Journal of Caring Sciences 2005; 19(4):427-431.
4. Kundel HL, Polansky M. Measurement of observer agreement. Radiology 2003; 228(2):303-308.
5. Cohen J. Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin 1968; 70(4):213-220.
6. Fleiss JL. Measuring nominal scale agreement among many raters. Psychological Bulletin 1971; 76(5):378-382.
7. Conger AJ. Integration and generalization of kappas for multiple raters. Psychological Bulletin 1980; 88(2):322-328.
8. Warrens MJ. Inequalities between multi-rater kappas. Advances in Data Analysis and Classification 2010; 4(4):271-286.
9. Barlow W, Lai M-Y, Azen S. A comparison of methods for calculating a stratified kappa. Statistics in Medicine 1991; 10(9):1465-1472.
10. Landis J, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977; 33(1):159-174.
11. Kraemer HC. Extension of the kappa coefficient. Biometrics 1980; 36(2):207-216.
12. Vanbelle S, Albert A. Agreement between two independent groups of raters. Psychometrika 2009; 74(3):477-491.
13. Vanbelle S, Albert A. Agreement between an isolated rater and a group of raters. Statistica Neerlandica 2009; 63(1):82-100.
14. Lin H-M, Kim H-Y, Williamson JM, Lesser VM. Estimating agreement coefficients from sample survey data. Survey Methodology 2012; 38(1):63-72.
15. Ma Y, Tang W, Feng C, Tu XM. Inference for kappas for longitudinal study data: applications to sexual health research. Biometrics 2008; 64(3):781-789.
16. Vanbelle S, Mutsvari T, Declerck D, Lesaffre E. Hierarchical modeling of agreement. Statistics in Medicine 2012; 31(28):3667-3680.
17. McKenzie DP, Mackinnon AJ, Péladeau N, Onghena P, Bruce PC, Clarke DM, Harrigan S, McGorry PD. Comparing correlated kappas by resampling: is one level of agreement significantly different from another? Journal of Psychiatric Research 1996; 30(6):483-492.
18. Vanbelle S, Albert A. A bootstrap method for comparing correlated kappa coefficients. Journal of Statistical Computation and Simulation 2008; 78(11):1009-1015.
19. Donner A, Shoukri MM, Klar N, Bartfay E. Testing the equality of two dependent kappa statistics. Statistics in Medicine 2000; 19(3):373-387.
20. Klar N, Lipsitz SR, Ibrahim JG. An estimating equations approach for modelling kappa. Biometrical Journal 2000; 42(1):45-58.
21. Williamson JM, Lipsitz SR, Manatunga AK. Modeling kappa for measuring dependent categorical agreement data. Biostatistics 2000; 1(2):191-202.
22. Gonin R, Lipsitz SR, Fitzmaurice GM, Molenberghs G. Regression modelling of weighted kappa by using generalized estimating equations. Journal of the Royal Statistical Society: Series C (Applied Statistics) 2000; 49(1):1-18.
23. Barnhart HX, Williamson JM. Weighted least-squares approach for comparing correlated kappa. Biometrics 2002; 58(4):1012-1019.
24. King TS, Chinchilli VM. A generalized concordance correlation coefficient for continuous and categorical data. Statistics in Medicine 2001; 20(14):2131-2147.
25. Kang C, Qaqish B, Monaco J, Sheridan SL, Cai J. Kappa statistic for clustered dichotomous responses from physicians and patients. Statistics in Medicine 2013; 32(21):3700-3719.
26. Zhou M, Yang Z. A note on the kappa statistic for clustered dichotomous data. Statistics in Medicine 2014, in press. DOI: 10.1002/sim.6098.
27. Yang Z, Sun X, Hardin JW. A note on the tests for clustered matched-pair binary data. Biometrical Journal 2010; 52(5):638-652.
28. Yang Z, Sun X, Hardin JW. Confidence intervals for the difference of marginal probabilities in clustered matched-pair binary data. Pharmaceutical Statistics 2012; 11(5):386-393.
29. Yang Z, Sun X, Hardin JW. Testing marginal homogeneity in clustered matched-pair data. Journal of Statistical Planning and Inference 2011; 141(3):1313-1318.
30. Oden NL. Estimating kappa from binocular data. Statistics in Medicine 1991; 10(8):1303-1311.
31. Schouten HJ. Estimating kappa from binocular data and comparing marginal probabilities. Statistics in Medicine 1993; 12(23):2207-2217.
32. Dixon SN, Donner A, Shoukri MM. Adjusted inference procedures for the interobserver agreement in twin studies. Statistical Methods in Medical Research 2013. DOI: 10.1177/0962280213476771.
33. Rao J, Scott A. A simple method for the analysis of clustered binary data. Biometrics 1992; 48(2):577-585.
34. Obuchowski NA. On the comparison of correlated proportions for clustered data. Statistics in Medicine 1998; 17(13):1495-1507.
35. Von Eye A, Mun EY. Analyzing Rater Agreement: Manifest Variable Methods. Psychology Press: New Jersey, 2004.
36. Kraemer HC. Ramifications of a population model for kappa as a coefficient of reliability. Psychometrika 1979; 44(4):461-472.
37. Feinstein AR, Cicchetti DV. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology 1990; 43(6):543-549.
38. Thompson WD, Walter SD. A reappraisal of the kappa coefficient. Journal of Clinical Epidemiology 1988; 41(10):949-958.
39. Shoukri MM. Measures of Interobserver Agreement and Reliability. Taylor & Francis: New York, 2010.
40. Fleiss JL, Cohen J, Everitt B. Large sample standard errors of kappa and weighted kappa. Psychological Bulletin 1969; 72(5):323-327.
41. Bishop YM, Fienberg SE, Holland PW. Discrete Multivariate Analysis: Theory and Practice. Springer: New York, 2007.
42. Congalton RG, Green K. Assessing the Accuracy of Remotely Sensed Data: Principles and Practices. CRC Press: New York, 2008.
43. Athreya KB, Lahiri SN. Measure Theory and Probability Theory. Springer: New York, 2006.
44. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. CRC Press: New York, 1993.
45. Donner A, Petryshen P. The statistical analysis of matched data in psychiatric research. Psychiatry Research 1989; 28(1):41-46.
46. Donner A, Eliasziw M, Klar N. Testing the homogeneity of kappa statistics. Biometrics 1996; 52(1):176-183.
47. Nam J. Homogeneity score test for the intraclass version of the kappa statistics and sample-size determination in multiple or stratified studies. Biometrics 2003; 59(4):1027-1035.

Supporting information Additional supporting information may be found in the online version of this article at the publisher’s web site.
