Biometrical Journal 56 (2014) 1, 129–140

DOI: 10.1002/bimj.201300048

129

Stratified Fisher’s exact test and its sample size calculation Sin-Ho Jung∗,1,2 1 2

Samsung Cancer Research Institute, Samsung Medical Center, Seoul, Korea Department of Biotatistics and Bioinformatics, Duke University, Durham, NC 27710, USA

Received 6 March 2013; revised 4 July 2013; accepted 3 September 2013

Chi-squared test has been a popular approach to the analysis of a 2 × 2 table when the sample sizes for the four cells are large. When the large sample assumption does not hold, however, we need an exact testing method such as Fisher’s test. When the study population is heterogeneous, we often partition the subjects into multiple strata, so that each stratum consists of homogeneous subjects and hence the stratified analysis has an improved testing power. While Mantel–Haenszel test has been widely used as an extension of the chi-squared test to test on stratified 2 × 2 tables with a large-sample approximation, we have been lacking an extension of Fisher’s test for stratified exact testing. In this paper, we discuss an exact testing method for stratified 2 × 2 tables that is simplified to the standard Fisher’s test in single 2 × 2 table cases, and propose its sample size calculation method that can be useful for designing a study with rare cell frequencies.

Keywords: Conditional type I error; Exact test; Hypergeometric distribution; Many 2 × 2 tables; Odds ratio.

1 Introduction Suppose that we want to compare the response probabilities between two groups, experimental (or case) and control. Oftentimes in a two group comparison, the characteristics of study subjects may be heterogeneous. In this case, the heterogeneity is characterized by some stratification factors, and a stratified method is applied in the final analysis. When the distribution of the stratification factors is identical between two groups, an unstratified testing ignoring the population heterogeneity controls the type I error rate but loses the efficiency. If the distribution of the stratification factors is different between two groups, however, an unstratified testing does not control the type I error rate. We want to test if the two groups have equal response probabilities or not while accounting for heterogeneity of the population defined by strata. Various asymptotic testing methods have been proposed for testing on stratified 2 × 2 tables. Under the assumption that the odds ratios are identical among strata, Cochran (1954) proposes an asymptotic method for testing if the common odds ratio is 1 or not. Under the assumption of constant risk ratios across strata, Gart (1985) proposes an asymptotic method for testing if the common risk ratio is 1 or not. Woolson et al. (1986) and Nam (1992) propose sample size calculation methods for the Cochran test, and Nam (1998) proposes a sample size method for the Gart test. In order to use these tests for testing on many 2 × 2 tables, we have to check the assumptions of common odds ratios or risk ratios in advance. For testing the common odds ratio assumption, Zelen (1971) proposes an exact method, which is implemented by StatXact, and Breslow and Day (1980) propose an asymptotic method. ∗ Corresponding author: e-mail: [email protected], Phone: +1-919-668-8658; Fax: +1-919-668-5888

 C 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

130

S.-H. Jung: Stratified Fisher’s exact test

If the assumptions of common odds ratios or risk ratios do not seem to be valid, we need a robust test requiring no assumptions on the primary parameters for testing. Mantel and Haenszel (1959) propose an asymptotic test for testing if two groups have equal response probabilities without any assumption of common odds ratio or common risk ratio. Jung et al. (2007) propose a sample size calculation method for the Mantel–Haenszel test. With rare cell frequencies, however, these asymptotic tests may not be able to control the type I error rate accurately. Addressing this issue, Fisher’s (1935) exact test has been most popularly used when the patient population is homogeneous. In this paper, we extend Fisher’s exact test for testing stratified 2 × 2 tables with rare cell frequencies, and propose its sample size calculation method. These methods can be used in designing and analyzing small case-control studies or clinical trials. In a real data example presented in Section 2, the one-sided p-value by the stratified Fisher’s exact test is much larger than that by Mantel–Haenszel test possibly because of the very small numbers of failures across the strata that can lead to a biased p-value for the asymptotic Mantel–Haenszel test. In sample size calculation, the input parameters to be specified for stratified Fisher’s exact test are exactly the same as those for the sample size calculation of the Mantel–Haenszel test. We will compare the performance of the proposed test with that of the asymptotic Mantel–Haenszel test and the standard Fisher’s exact test ignoring strata under some practical settings. Our stratified exact test is simplified to the standard Fisher’s exact test in a single 2 × 2 table case.

2 Stratified Fisher’s exact test Suppose that there are J strata. The frequency data in stratum j(= 1, . . . , J ) can be summarized as  in Table 1. Let N denote the total sample size, and n j the sample size in stratum j ( Jj=1 n j = N). Among n j subjects in stratum j(= 1, . . . , J ), m j are allocated to group 1 (case or experimental) and ¯ j = n j − m j to group 2 (control). For stratum j, group 1 has a response probability p j and group m 2 has a response probability q j . Let p¯ j = 1 − p j , q¯ j = 1 − q j , and θ j = p j q¯ j /(q j p¯ j ) denote the odds ratio in stratum j. Suppose that we want to test H0 : θ1 = · · · = θJ = 1 against H1 : θ j > 1 for some j = 1, . . . , J. For stratum j(= 1, . . . , J ), let x j and y j denote the numbers of responders for groups 1 and 2, respectively, and z j = x j + y j denote the total number of responses. Table 1 Frequency data of 2 × 2 table for stratum j(= 1, . . . , J ). Here, n j is the total sample size, x j ¯ j patients of case and control groups, respectively, and y j are the number of responders among m j and m and z j = x j + y j is the total number of responders from stratum j. Group Response Yes No Total

Case

Control

Total

xj mj − xj

yj ¯ j − yj m

zj nj − zj

mj

¯j m

nj

 C 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

www.biometrical-journal.com

Biometrical Journal 56 (2014) 1

131

 We propose to reject H0 in favor of H1 if S = Jj=1 x j is large. Under H0 , conditioning on the margin totals (z j , m j , n j ), x j has the hypergeometric distribution m n −m  j

j

j

z j −x j

xj

f0 (x j | z j , m j , n j ) = m j+

i=m j−

m n −m  i

j

j

z j −i

j

for m j− ≤ x j ≤ m j+ , where m j− = max(0, z j − n j + m j ) and m j+ = min(z j , m j ). Let z = (z1 , . . . , zJ ),  m = (m1 , . . . , mJ ), and n = (n1 , . . . , nJ ). Given (z, m, n), the conditional p-value for s = Jj=1 x j , pv = pv(s | z, m, n), is obtained by pv = P(S ≥ s | z, m, n, H0 ) ⎛ ⎞ m1+ mJ+ J J   

= ··· I⎝ i j ≥ s⎠ f0 (i j | z j , m j , n j ). i1 =m1−

iJ =mJ−

j=1

j=1

Given type I error rate α ∗ , we reject H0 if pv < α ∗ . Similarly for the other one-sided alternative hypothesis H2 : θ j < 1 for some j = 1, . . . , J, the conditional p-value given (z, m, n) is obtained by pv = P(S ≤ s | z, m, n, H0 ) ⎛ ⎞ m1+ mJ+ J J   

⎝ ⎠ = ··· I ij ≤ s f0 (i j | z j , m j , n j ). i1 =m1−

iJ =mJ−

j=1

j=1

A two-sided p-value may be calculated as two times the minimum of the two one-sided p-values. Without loss of generality, we focus our discussions on the one-sided alternative hypothesis H1 in our paper. Note that the Mantel–Haenszel test also rejects H0 in favor of H1 for a large value of S, and its p-value is calculated using the standardized test statistic S−E W = √ V   that is asymptotically N(0, 1) under H0 , where E = Jj=1 E j , V = Jj=1 V j , E j = z j m j /n j and V j = ¯ j (n j − z j )/{n2j (n j − 1)}. Westfall et al. (2002) propose a permutation procedure for the stratified zjmjm Mantel–Haenszel test, which permutes the two-sample binary data within each stratum in the context of multiple testing. Their permutation maintains the margin totals for 2 × 2 tables, {(z j , m j , n j ), 1 ≤ j ≤ J}, and E j and V j depend on the margin totals only, so that the permutation-based Mantel– Haenszel test will be identical to our stratified Fisher’s exact test if they go through all the possible J j=1 (m j+ − m j− + 1) permutations. Their permutation test is implemented by SAS. Compared to our exact test, the permutation test requires a much longer computing time. Furthermore, a permutation test often randomly selects partial permutations to approximate the exact p-value. In this case, the resulting approximate p-value will be different depending on the selected seed number for random  C 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

www.biometrical-journal.com

132

S.-H. Jung: Stratified Fisher’s exact test

Table 2 Response to thymosin in bronchogenic carcinoma patients (T = thymosin, P = placebo). Stratum 1

Success Failure

T

P

10 1

12 1

11

13

Stratum 2 T

P

22 2

9 0

11 1

24

9

12

Stratum 3 T

P

20 1

8 0

7 3

15 3

21

8

10

18

number generation or the number of permutations, while the exact method always provides a constant exact p-value. A real data example is taken from Li et al. (1979), where the investigators are interested in whether thymosin (experimental), compared to placebo (control), has any effect in the treatment of bronchogenic carcinoma patients receiving radiotherapy. Table 2 summarizes the data for three strata. The one-sided p-values are 0.1563 by the stratified Fisher’s exact test, 0.1577 for (unstratified) Fisher’s test, and 0.0760 by Mantel–Haenszel test. Note that stratified and unstratified Fisher’s tests have similar p-values since the three strata have similar allocation proportions. Stratified Fisher’s test has a larger p-value than Mantel–Haenszel test because of its conservative type I error control as demonstrated in Section 4 or because of the very small numbers of failures across the strata that can lead to a biased p-value for the asymptotic Mantel–Haenszel test.

3 Power and sample size calculation Jung et al. (2007) propose a sample size calculation method for the Mantel–Haenszel test. In this section, we derive a sample size formula for stratified Fisher’s exact test by specifying the values of the same input parameters as those for Mantel–Haenszel test by Jung et al. (2007). Following are input parameters to be specified for a sample size calculation. _____________________________________________________________________________________ Input parameters

r r r

Type I and II error probabilities: (α ∗ , β ∗ ). Success probabilities for group 2 (control): (q1 , . . . , qJ ). Odds ratios: (θ1 , . . . , θJ ) under H1 , where θ j > 0. Note that, given q j and θ j , the success probability for group 1 (experimental) is given as p j = θ j q j /(q¯ j + θ j q j ) in stratum j(= 1, . . . , J ). r Prevalence for each stratum: (a1 , . . . , aJ ), where a j = E (n j /N ). Note that a j > 0 and Jj=1 a j = 1. r Allocation probability for group 1 (experimental) within each stratum, (b1 , . . . , bJ ), where b j = E (m j /n j ) with 0 < b j < 1. _____________________________________________________________________________________

3.1

When group and stratum allocations are random

Given N, the stratum size m j is often randomly chosen by the prevalence of the stratum in the population. Also, we may randomly allocate each patient between two groups. In this case, given N, {(x j , z j , m j , n j ), 1 ≤ j ≤ J} are random variables with following marginal or conditional probability mass functions that are indexed by the above input parameters.  C 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

www.biometrical-journal.com

Biometrical Journal 56 (2014) 1

133

_____________________________________________________________________________________ Distribution functions

r

Conditional distribution of x j given (z j , m j , n j ): m 

¯ j  xj m θ z j −x j j

j

xj

f j (x j | z j , m j , n j ) = m j+

i=m j−

r

m  m¯  i j θ j i z −i j j

¯ j ), m j+ = min(z j , m j ) and j = 1, . . . , J. Under for m j− ≤ x j ≤ m j+ , where m j− = max(0, z j − m H0 , this is simplified to f0 (x j | z j , m j , n j ). Conditional distribution of z j given (m j , n j ): ¯ j , q j ) are independent, so that the conditional probGiven (m j , n j ), x j ∼ B(m j , p j ) and y j ∼ B(m ability mass function of z j = x j + y j is expressed as m j+

 ¯ −z +x z −x m ¯j m j x m j −x m p j p¯ j q j j q¯ j j j g j (z j | m j , n j ) = x zj − x x=m j−

for z j = 0, 1, . . . , n j and j = 1, . . . , J, where B(m, p) denotes the binomial distribution with number of trials m and success probability p. Under H0 , this is simplified to n −z j

z

g0 j (z j | m j , n j ) = q j j q¯ j j

m j+



 ¯j mj m . x zj − x x=m j−

r

 Note that 00 p0 (1 − p)0 = 1 for p ∈ (0, 1). Conditional distribution of m j given n j : At the moment, we assume that, given a total sample size n j of stratum j, the sample size of group 1 m j is a binomial random variable with probability mass function

n j mj b (1 − b j )n j −m j mj j

h j (m j | n j ) =

r

for 0 ≤ m j ≤ n j and j = 1, . . . , J. Conditional distribution of (n1 , . . . , nJ ) given N is multinomial with probability mass function N! lN (n1 , . . . , nJ ) = J j=1

J

nj!

n

ajj

j=1

 for 0 ≤ n1 ≤ N, . . . , 0 ≤ nJ ≤ N and Jj=1 n j = N. _____________________________________________________________________________________  C 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

www.biometrical-journal.com

134

S.-H. Jung: Stratified Fisher’s exact test

We first derive the power function for a given sample size N using these distribution functions. Given (z, m, n) and type I error rate α ∗ , the critical value cα∗ = cα∗ (z, m, n) is the smallest integer c satisfying m

P(S ≥ c | z, m, n, H0 ) =

⎛ ⎞ J J 

I⎝ i j ≥ c⎠ f0 (i j | z j , m j , n j ) ≤ α ∗ .

m

1+ 

J+ 

···

i1 =m1−

iJ =mJ−

j=1

j=1

Note that s ≥ cα∗ (z, m, n) if and only if pv(s | z, m, n) ≤ α ∗ . We call α(z, m, n) = P(S ≥ cα∗ | z, m, n, H0 ) the conditional type I error rate given (z, m, n). Similarly, the conditional power 1 − β(z, m, n) given (z, m, n) is obtained by m

P(S ≥ cα∗ | z, m, n, H1 ) =

m

1+ 

J+ 

···

i1 =m1−

iJ =mJ−

⎛ ⎞ J J 

⎝ ⎠ I i j ≥ cα ∗ f j (i j | z j , m j ). j=1

j=1

For a chosen N, the marginal type I error rate and power are given as αN ≡ E{α(z, m, n) | H0 } = E n (E m [E z {α(z, m, n) | m, n, H0 } | n]) =

n1  

···

n∈D N m1 =0

×

⎧ J ⎨



nJ n1  

···

mJ =0 z1 =0

g0 j (z j | m j , n j )

j=1

nJ 

α(z1 , . . . , zJ ; m1 , . . . , mJ ; n1 , . . . ., nJ )

zJ =0

⎫⎧ J ⎬ ⎨

⎭⎩

j=1

⎫ ⎬

h j (m j | n j ) lN (n1 , . . . , nJ ) ⎭

(1)

and 1 − βN ≡ E{1 − β(z, m, n) | H1 } = E n (E m [E z {1 − β(z, m, n) | m, n, H1 } | n]) =

n1   n∈DN m1 =0

×

⎧ J ⎨



···

nJ n1  

···

mJ =0 z1 =0

g j (z j | m j , n j )

j=1

nJ 

{1 − β(z1 , . . . , zJ ; m1 , . . . , mJ ; n1 , . . . ., nJ )}

zJ =0

⎫⎧ J ⎬ ⎨

⎭⎩

j=1

⎫ ⎬

h j (m j | n j ) lN (n1 , . . . , nJ ), ⎭

(2)

 respectively, where DN = {(n1 , . . . , nJ ) : 0 ≤ n1 ≤ N, . . . , 0 ≤ nJ ≤ N, Jj=1 n j = N} and E w (·) denotes the expected value with respect to a random vector w. Since α(z, m, n) ≤ α ∗ for all (z, m, n), we have αN ≤ α ∗ . Given power 1 − β ∗ , the required sample size is chosen by the smallest integer N satisfying 1 − βN ≥ 1 − β ∗ . In other words, while the statistical testing controls the conditional type I error α(z, m, n), the sample size is determined to guarantee a specified level of marginal power. In summary, a sample size is calculated as follows.  C 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

www.biometrical-journal.com

Biometrical Journal 56 (2014) 1

135

_____________________________________________________________________________________ Sample size calculation (A) Specify input parameters: J, (α ∗ , β ∗ ), (q1 , . . . , qJ ), (θ1 , . . . , θJ ), (a1 , . . . , aJ ), (b1 , . . . , bJ ). (B) Starting from the sample size for the Mantel–Haenszel test NMH , do the following by increasing N by 1,  (B1) For j = 1, . . . , J, z j ∈ [0, n j ], m j ∈ [0, n j ], n j ∈ [0, N], and Jj=1 n j = N, 1. Find cα∗ = cα∗ (z, m, n). 2. Calculate 1 − β(z, m, n) = P(S ≥ cα∗ | z, m, n, H1 ) (B2) Calculate 1 − βN = E{1 − β(z, m, n) | H1 }. (C) Stop (B) if 1 − βN ≥ 1 − β ∗ . This N is the required sample size. _____________________________________________________________________________________ 3.2

When stratum allocation is fixed

In a case-control study or a clinical trial, one may want to assign a fixed number of subjects to stratum j given N regardless of its prevalence. In this case, when calculating αN and 1 − βN , n j are fixed at Na j and the step to calculate the expectations with respect to n are omitted. That is, given N, we set  n j = [Na j ], . . . , nJ−1 = [NaJ−1 ], nJ = N − J−1 j=1 n j , and calculate (1) and (2) by αN =

n1 

···

m1 =0

× and 1 − βN =



g0 j (z j | m j , n j )

nJ n1  

···

···

mJ =0 z1 =0

⎧ J ⎨

α(z1 , . . . , zJ ; m1 , . . . , mJ ; n1 , . . . ., nJ )

zJ =0

j=1

m1 =0



nJ 

···

mJ =0 z1 =0

⎧ J ⎨

n1 

×

nJ n1  

nJ 

⎫⎧ J ⎬ ⎨

⎭⎩

j=1

g j (z j | m j , n j )



{1 − β(z1 , . . . , zJ ; m1 , . . . , mJ ; n1 , . . . ., nJ )}

zJ =0

j=1

h j (m j | n j )

⎫ ⎬

⎫⎧ J ⎬ ⎨

⎭⎩

j=1

⎫ ⎬

h j (m j | n j ) , ⎭

where [a] is the round-off of a. 3.3

When group allocation is fixed

In a case-control study or a clinical trial, one may accrue all eligible patients from each stratum while randomizing a fixed proportion of subjects to stratum j given n j . In this case, in calculating αN and 1 − βN , we have m j = [n j b j ] given n j , and calculate (1) and (2) by αN =

n1   n∈DN z1 =0

×

⎧ J ⎨



j=1

···

nJ  zJ =0

α(z1 , . . . , zJ ; m1 , . . . , mJ ; n1 , . . . ., nJ ) ⎫ ⎬

g0 j (z j | m j , n j ) lN (n1 , . . . , nJ ) ⎭

 C 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

www.biometrical-journal.com

136

S.-H. Jung: Stratified Fisher’s exact test

and n1  

1 − βN =

n∈DN z1 =0

×

3.4

⎧ J ⎨



j=1

···

nJ  zJ =0

{1 − β(z1 , . . . , zJ ; m1 , . . . , mJ ; n1 , . . . ., nJ )} ⎫ ⎬

g j (z j | m j , n j ) lN (n1 , . . . , nJ ). ⎭

When both group and stratum allocations are fixed

In a more simplified study design, we may want to further fix the number of patients from each stratum (n) given N and the number of patients in each group (m) within each stratum given n. In this case, given N, n j , and m j are fixed at [Na j ] and [Na j b j ], respectively, and the calculation of (1) and (2) is further simplified to αN =

n1 

···

z1 =0

nJ 

α(z1 , . . . , zJ ; m1 , . . . , mJ ; n1 , . . . ., nJ )

zJ =0

J

g0 j (z j | m j , n j )

j=1

and 1 − βN =

n1  z1 =0

···

nJ 

{1 − β(z1 , . . . , zJ ; m1 , . . . , mJ ; n1 , . . . ., nJ )}

zJ =0

J

g j (z j | m j , n j ).

j=1

4 Numerical studies We want to compare the small sample performance of stratified Fisher’s test and Mantel–Haenszel test using simulations. We generate B = 10,000 simulation samples of size N = 25, 50 or 75 with J = 2 strata under a1 = 0.25, 0.5 or 0.75; (b1 , b2 ) = (1/4, 3/4), (1/2, 1/2) or (3/4, 1/4); q1 = 0.1, q2 = 0.3 or 0.7; (θ1 , θ2 ) = (1, 1), (5, 10), (7.5, 7.5) or (10, 5). Stratified Fisher’s test, the standard (unstratified) Fisher’s test and Mantel–Haenszel test are applied to each simulation sample, and empirical power for each test is calculated as the proportion of simulation samples rejecting H0 with one-sided α ∗ = 0.05. The exact type I error rate and power for stratified Fisher’s test can be calculated by using the methods in Section 3, but through simulations we want to compare the performance of the these testing methods applied to the same datasets. We consider large odds ratios to investigate the performance of Fisher’s tests and Mantel–Haenszel test with small sample sizes. Table 3 summarizes the simulation results. Under H0 : θ1 = θ2 = 1, stratified Fisher’s test is conservative overall. With 10,000 simulations and α ∗ = 0.05, the 95% confidence limits for the empirical type I error rate are 0.05 ± 0.004. Due to the discreteness of the exact tests and the conservative control of conditional type I error at all possible outcomes, stratified Fisher’s test is always conservative as expected, especially with a small sample size (N = 25). Unstratified test has a similar type I error rate to stratified Fisher’s test when allocation proportions are identical between two strata (i.e. b1 = b2 = 1/2). However, if more patients are allocated in the stratum with higher response probabilities (i.e. b1 = 1/4

 C 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

www.biometrical-journal.com

Biometrical Journal 56 (2014) 1

137

Table 3 Empirical power of stratified Fisher’s test/unstratified Fisher’s test/Mantel–Haenszel test with one-sided α ∗ = 0.05, J = 2 strata, and q1 = 0.1. a1

(b1 , b2 )

(a) N = 25 .25 (1/4, 3/4) (1/2, 1/2) (3/4, 1/4) .5

(1/4, 3/4) (1/2, 1/2) (3/4, 1/4)

.75

(1/4, 3/4) (1/2, 1/2) (3/4, 1/4)

(b) N = 50 .25 (1/4, 3/4) (1/2, 1/2) (3/4, 1/4) .5

(1/4, 3/4) (1/2, 1/2) (3/4, 1/4)

.75

(1/4, 3/4) (1/2, 1/2) (3/4, 1/4)

(c) N = 75 .25 (1/4, 3/4) (1/2, 1/2) (3/4, 1/4)

q2

(θ1 , θ2 ) = (1, 1)

(5,10)

(7.5,7.5)

(10,5)

.3 .7 .3 .7 .3 .7 .3 .7 .3 .7 .3 .7 .3 .7 .3 .7 .3 .7

.009/.035/.037 .015/.179/.057 .018/.018/.052 .016/.019/.049 .017/.005/.060 .006/.001/.031 .010/.056/.048 .016/.299/.067 .012/.012/.046 .016/.021/.051 .013/.003/.054 .005/.000/.025 .010/.038/.056 .012/.212/.067 .010/.011/.043 .010/.017/.048 .003/.001/.032 .002/.000/.016

.481/.786/.681 .246/.716/.506 .616/.588/.760 .261/.240/.500 .445/.282/.645 .085/.021/.252 .375/.798/.590 .229/.814/.474 .485/.461/.665 .256/.218/.480 .326/.174/.539 .095/.015/.268 .283/.651/.504 .224/.682/.446 .357/.368/.548 .251/.227/.453 .190/.132/.392 .100/.026/.270

.430/.723/.633 .256/.714/.501 .576/.576/.718 .291/.269/.520 .406/.323/.605 .107/.035/.293 .409/.785/.622 .285/.836/.537 .537/.525/.696 .344/.291/.560 .363/.270/.572 .155/.031/.367 .381/.688/.596 .332/.750/.562 .474/.481/.654 .398/.351/.595 .290/.247/.501 .198/.059/.410

.347/.599/.539 .240/.678/.464 .467/.485/.626 .280/.267/.485 .335/.322/.522 .119/.044/.302 .386/.707/.592 .316/.832/.556 .505/.513/.669 .398/.345/.604 .347/.327/.558 .202/.051/.429 .438/.683/.648 .405/.797/.636 .555/.557/.714 .501/.446/.686 .348/.344/.565 .294/.108/.520

.3 .7 .3 .7 .3 .7 .3 .7 .3 .7 .3 .7 .3 .7 .3 .7 .3 .7

.019/.081/.042 .023/.387/.053 .029/.027/.054 .022/.028/.046 .025/.005/.055 .017/.000/.047 .019/.135/.052 .023/.600/.059 .021/.021/.050 .024/.029/.053 .020/.001/.051 .014/.000/.041 .018/.102/.055 .020/.464/.062 .021/.021/.053 .021/.025/.052 .014/.002/.046 .008/.000/.032

.827/.984/.907 .561/.970/.742 .934/.917/.966 .661/.504/.793 .827/.575/.905 .436/.033/.624 .734/.986/.846 .529/.987/.710 .855/.825/.917 .625/.470/.761 .719/.374/.828 .425/.016/.597 .602/.939/.749 .496/.953/.670 .730/.703/.828 .603/.493/.738 .577/.324/.712 .414/.034/.574

.792/.971/.882 .570/.970/.743 .906/.896/.946 .686/.561/.806 .789/.634/.873 .465/.061/.643 .767/.985/.865 .611/.991/.773 .882/.867/.934 .736/.596/.840 .754/.554/.848 .549/.044/.702 .741/.958/.849 .663/.977/.804 .852/.839/.915 .789/.683/.876 .702/.541/.811 .610/.113/.742

.685/.916/.795 .532/.956/.712 .834/.835/.898 .643/.558/.766 .679/.617/.790 .457/.087/.618 .742/.967/.837 .666/.993/.805 .861/.856/.917 .779/.668/.866 .733/.658/.838 .607/.085/.739 .788/.958/.882 .763/.986/.871 .898/.894/.943 .876/.796/.933 .760/.687/.857 .718/.206/.828

.3 .7 .3 .7 .3 .7

.025/.127/.047 .028/.558/.055 .026/.028/.047 .029/.032/.052 .025/.003/.056 .023/.000/.047

.955/.999/.978 .757/.997/.865 .992/.987/.996 .866/.701/.922 .955/.768/.976 .710/.039/.814

.934/.998/.964 .769/.997/.871 .984/.981/.992 .873/.750/.924 .934/.825/.965 .731/.086/.823

.876/.989/.925 .729/.995/.837 .956/.954/.973 .844/.754/.900 .870/.816/.920 .701/.124/.800

 C 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

www.biometrical-journal.com

138

S.-H. Jung: Stratified Fisher’s exact test

Table 3 Continued a1

(b1 , b2 )

q2

(θ1 , θ2 ) = (1, 1)

(5,10)

(7.5,7.5)

(10,5)

.5

(1/4, 3/4)

.3 .7 .3 .7 .3 .7 .3 .7 .3 .7 .3 .7

.023/.197/.047 .027/.787/.055 .024/.026/.050 .028/.031/.054 .023/.001/.049 .021/.000/.048 .024/.152/.054 .024/.653/.057 .024/.024/.049 .025/.032/.053 .019/.002/.047 .017/.000/.041

.904/.999/.947 .725/.999/.838 .966/.951/.982 .839/.659/.904 .898/.545/.941 .665/.013/.774 .806/.992/.884 .698/.995/.814 .909/.891/.949 .807/.666/.878 .797/.482/.867 .644/.040/.758

.921/.999/.959 .807/.999/.895 .977/.969/.987 .910/.779/.950 .915/.741/.952 .779/.047/.862 .902/.996/.947 .853/.998/.921 .967/.959/.984 .935/.852/.963 .892/.736/.932 .830/.158/.900

.906/.998/.949 .840/.999/.915 .969/.966/.981 .929/.840/.961 .903/.830/.941 .824/.112/.889 .928/.997/.963 .917/.999/.958 .983/.979/.991 .975/.931/.987 .926/.870/.958 .904/.303/.948

(1/2, 1/2) (3/4, 1/4) .75

(1/4, 3/4) (1/2, 1/2) (3/4, 1/4)

and b2 = 3/4), then unstratified Fisher’s test becomes anticonservative. On the other hand, if more patients are allocated to the stratum with smaller response probabilities (i.e. b1 = 3/4 and b2 = 1/4), then unstratified Fisher’s test becomes very conservative. In this sense, a testing ignoring the strata cannot control the type I error rate unless the allocation proportions are identical across strata. With N = 25 or 50, Mantel–Haenszel test is anti-conservative with q2 = 0.7 (i.e. when two strata have very different response rates) or with a1 = 0.75 (i.e. when a small number of subjects are allocated to the stratum with large response probabilities). The anti-conservativeness diminishes as N increases, but is still of some issue with a1 = 0.75 and N = 75. When allocation proportions are equal across the strata, ignoring the strata results in a slight loss of statistical power. Stratified Fisher’s test is less powerful than Mantel–Haenszel test, but the difference in power decreases in N. For all three testing methods, the power increases when more subjects are allocated to the stratum with the larger odds ratio, for example θ1 < θ2 and a1 < a2 . A referee suggested to conduct some additional simulations under (θ1 , θ2 ) = (1, 10). We have conducted simulations under N = 50; a1 = 0.25, 0.5 or 0.75; b1 = b2 = 1/2; (q1 , q2 ) = (0.1, 0.7); (θ1 , θ2 ) = (1, 10). The power of the stratified Fisher’s test was observed to be 0.46, 0.23 and 0.07 with a1 = 0.25, 0.5 and 0.75, respectively. Note that, even with θ1 = 1, the stratified test has some power although it decreases as a1 increases. Note that a1 = 1 corresponds to a setting of the null hypothesis. Table 4 reports sample sizes for Mantel–Haenszel test and stratified Fisher’s test. Also reported are sample sizes for stratified Fisher’s test by fixing (m, n) or only n at their expected values. The design parameters are set at one-sided α ∗ = 0.05; 1 − β ∗ = 0.9; J = 2 strata; a1 = 0.25, 0.5, or 0.75; (b1 , b2 ) = (0.25, 0.25), (0.25,0.75), (0.5,0.5), (0.75,0.25), or (0.75,0.75); (q1 , q2 ) = (0.1, 0.3); (θ1 , θ2 ) = (5, 10), (7.5, 7.5), or (10, 5). For stratified Fisher’s test, fixing (m, n) at their expected values reduces N, while fixing only n requires almost the same N compared to the case with random (m, n). The sample sizes are minimized with a balanced allocation, that is b1 = b2 = 1/2. We also observe that the cases of (b1 , b2 ), (1 − b1 , b2 ), (b1 , 1 − b2 ), and (1 − b1 , 1 − b2 ) require similar sample sizes. That is, when the allocation between two groups is unbalanced, the required sample size does not much depend on whether the larger group is control or experimental across the different strata. Under each setting, the sample size for stratified Fisher’s test is about 30% larger than that of Mantel–Haenszel test. This difference results from the conservative type I error and power control of stratified Fisher’s test. For example, from Table 3, with (a1 , b1 , b2 , q1 , q2 ) = (0.5, 0.25, 0.75, 0.1, 0.3), stratified Fisher’s test controls the type I error at 0.0230 with N = 75 and has a power of 0.9042

 C 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

www.biometrical-journal.com

Biometrical Journal 56 (2014) 1

139

Table 4 Sample size for Mantel–Haenszel test/stratified Fisher’s test with (m, n) fixed/stratified Fisher’s test with n fixed/stratified Fisher’s test under J = 2 strata, (q1 , q2 ) = (0.1, 0.3), one-sided α ∗ = 0.05, and 1 − β ∗ = 0.9. (θ1 , θ2 ) a1 0.25

0.5

0.75

(b1 , b2 )

(5,10)

(7.5,7.5)

(10,5)

(0.25,0.25) (0.25,0.75) (0.5,0.5) (0.75,0.25) (0.75,0.75) (0.25,0.25) (0.25,0.75) (0.5,0.5) (0.75,0.25) (0.75,0.75) (0.25,0.25) (0.25,0.75) (0.5,0.5) (0.75,0.25) (0.75,0.75)

46/59/61/61 45/53/60/61 36/43/45/45 46/59/61/61 46/53/60/61 58/72/75/76 58/65/75/75 45/53/56/54 59/72/75/76 59/69/75/76 78/96/97/98 77/89/97/97 61/70/73/73 80/96/98/99 80/85/98/99

51/64/66/66 50/62/66/66 39/48/49/49 51/64/66/67 51/60/66/66 55/72/72/71 54/65/70/71 43/50/53/53 56/66/71/71 55/63/71/71 59/75/76/76 59/69/76/76 47/55/57/57 61/75/77/77 61/69/77/77

65/80/82/82 65/79/82/82 50/59/62/62 65/80/83/83 65/76/83/83 58/72/75/75 58/72/75/75 45/54/56/56 59/72/76/78 59/68/76/76 52/64/67/67 52/64/67/67 41/49/51/51 53/65/69/69 53/62/69/69

at (θ1 , θ2 ) = (5, 10). Under this design setting, stratified Fisher’s test requires a sample size N = 75 with (α ∗ , 1 − β ∗ ) = (0.05, 0.9) from Table 4. For Mantel–Haenszel test, the required sample size with (α ∗ , 1 − β ∗ ) = (0.0230, 0.9042) under the same design setting is N = 73 which is close to N = 75 required for stratified Fisher’s test. In other words, the conservativeness of the Fisher test results from the discreteness of the exact testing distributions. Mantel–Haenszel test approximates this exact distribution when the sample size is large. Crans and Schuster (2008) propose to conduct Fisher’s test with a larger type I error α ∗ = α +  ( > 0) so that the maximal marginal type I error rate within the whole range [0, 1] of the response probability under H0 becomes close to the intended α level. Suppose that we want to design a study similar to that of Li et al. (1979). Since this is a balanced randomized study, we fix (n1 , n2 , n3 ) at (N/3, N/3, N/3) and (m1 , m2 , m3 ) at (N/6, N/6, N/6). We further assume that (q1 , q2 , q3 ) = (0.9, 0.75, 0.6), and (θ1 , θ2 , θ3 ) = (1, 30, 30). (The estimates from Table 2 are θˆ1 = 0.833 and θˆ2 = θˆ3 = ∞.) In order to control the one-sided conditional type I error at α ∗ = 0.1 and the marginal power at 1 − β ∗ = 0.8, we need N = 62. Under the design, this sample size provides marginal αN = 0.0565 and power 1 − βN = 0.8105.

5 Discussions Numerous testing methods have been proposed to test on two binomial proportions adjusting for stratum effect based on different assumptions. For example, the Cochran (1954) test assumes common odds ratios across strata and Gart (1985) assumes common relative risks. The Mantel–Haenszel test makes no assumption on the parameters. These methods are based on large sample theories, so that their testing results may be distorted with a small sample size or sparse data. In this paper, we propose to use an exact test extending Fisher’s test to the analysis of many 2 × 2 tables together with its sample size calculation method. This test does not make any assumption of large sample size or equal parameter values across strata, so that it does not require to check any  C 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

www.biometrical-journal.com

140

S.-H. Jung: Stratified Fisher’s exact test

assumption before conducting a testing. The power and sample sizes are compared between the exact test and Mantel–Haenszel test using simulations and the proposed sample size formulas. Exact tests are to calculate p-values exactly and to maintain the type I error rate no higher than a prespecified level, rather than to control it exactly at the specified level. Due to the discreteness of exact tests, their type I error control may be conservative. Since the stratified Fisher’s exact test is a conditional test and maintains the conditional type I error rate for all values of the conditioning variable, that is the total number of responders, below the specified level, the marginal type I error rate will be conservative. In contrast, asymptotic tests, like Mantel–Haenszel test can be anti-conservative as well as conservative with a small sample size or sparse data. When the effect size is so large that the required sample size is small (say, about N = 70 or smaller), the exact test needs about 20% to 30% larger sample size than Mantel–Haenszel test. However, due to the small sample sizes, the increase in sample size in this case is not very large in absolute number (say, 10–20), so that, for robustness of the testing results, we propose to use the exact test by slightly increasing the sample size rather than obtaining a biased result by an asymptotic test. If J ≥ 3, the sample size calculation for stratified Fisher’s test requires a long computing time. We found, in calculating the marginal type I error rate and power, that conditioning the sizes of strata (n1 , . . . , nJ ) on their expected numbers provides very accurate sample sizes for the stratified Fisher’s test even when (n1 , . . . , nJ ) are random, while drastically saving the computing time. The computing time of Table 4 with J = 2 is very short. With J = 3, it takes about 5 s to calculate the sample size of Section 5 using a laptop computer with an Intel core at 2.8 GHz. If we fix n j or n j given N in this example, it takes about 30 min for a sample size calculation. Conflict of Interest The authors have declared no conflict of interest.

References Breslow, N. E. and Day, N. E. (1980). The Analysis of Case-Control Studies. No. 32, IARC Scientific Publications, Lyon, France. Cochran, W. C. (1954). Some methods of strengthening the common χ 2 tests. Biometrics 10, 417–451. Crans, G. G. and Schuster, J. J. (2008). How conservative is Fisher’s exact test? A quantitave evaluation of the two-sample comparative binomiial trial. Statistics in Medicine 27, 3598–3611. Fisher, R. A. (1935). The logic of inductive inference (with discussion). Journal of Royal Statistical Society 98, 39–82. Gart, J. J. (1985). Approximate tests and interval estimation of the common relative risk in the combination of 2 × 2 tables. Biometrika 72, 673–677. Jung, S. H., Chow, S. C. and Chi, E. M. (2007). A note on sample size calculation based on propensity analysis in nonrandomized trials. Journal of Biopharmaceutical Statistics 17, 35–41. Li, S. H., Simon, R. M. and Gart, J. J. (1979). Small sample properties of the Mantel–Haenszel test. Biometrika 66, 181–183. Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute 22, 719–748. Nam, J. M. (1992). Sample size determination for case-control studies and the comparison of stratified and unstratified analyses. Biometrics 48, 389–395. Nam, J. M. (1998). Power and sample size for stratified prospective studies using the score method for testing relative risk. Biometrics 54, 331–336. Westfall, P. H., Zaykin, D. V. and Young, S. S. (2002). Multiple tests for genetic effects in association studies. In: Stephen Looney (Ed.), Methods in Molecular Biology, vol. 184, Biostatistical Methods, Humana Press, Toloway, NJ, pp. 143–168. Woolson, R. F., Bean, J. A. and Rojas, P. B. (1986). Sample size for case-control studies using Cochran’s statistic. Biometrics 42, 927–932. Zelen, M. (1971). The analyses of several 2x2 contingency tables. Biometrika 58, 129–137.

 C 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

www.biometrical-journal.com

Stratified Fisher's exact test and its sample size calculation.

Chi-squared test has been a popular approach to the analysis of a 2 × 2 table when the sample sizes for the four cells are large. When the large sampl...
137KB Sizes 0 Downloads 0 Views