MAIN PAPER (wileyonlinelibrary.com) DOI: 10.1002/pst.1632

Published online 22 July 2014 in Wiley Online Library

Detecting qualitative interactions in clinical trials with binary responses

Andreas Kitsche*

This study considers the detection of treatment-by-subset interactions in a stratified, randomised clinical trial with a binary-response variable. The focus lies on the detection of qualitative interactions. In addition, the presented method is useful more generally, as it can assess the inconsistency of the treatment effects among strata by using an a priori-defined inconsistency margin. The methodology presented is based on the construction of ratios of treatment effects. In addition to multiplicity-adjusted p-values, simultaneous confidence intervals are recommended for detecting the source and the amount of a potential qualitative interaction. The proposed method is demonstrated on a multi-regional trial using the open-source statistical software R. Copyright © 2014 John Wiley & Sons, Ltd.

Keywords: qualitative interaction; heterogeneity; inconsistency; simultaneous confidence intervals

Institut für Biostatistik, Leibniz Universität Hannover, Herrenhäuser Straße 2, Hannover, Germany
*Correspondence to: Andreas Kitsche, Institut für Biostatistik, Leibniz Universität Hannover, Herrenhäuser Straße 2, 30419 Hannover, Germany. E-mail: [email protected]

Pharmaceut. Statist. 2014, 13 309–315

1. INTRODUCTION

In biomedical research, treatment-by-subset interaction—the evaluation of the difference in the treatment effects among subsets of patients—plays an important role in the analysis of randomised clinical trials. Peto [1] distinguished two types of treatment-by-subset interactions for the assessment of the heterogeneity of patient populations: quantitative interactions, which are present when the treatment effect varies among subsets in its magnitude but not in its sign, and qualitative interactions, which occur when the treatment effect differs in its magnitude and in its sign among subsets of patients. As stated by Peto [1], quantitative interactions are to be expected and are not important in a clinical context. In contrast, qualitative interactions are very unlikely but more relevant in a clinical context. Therefore, within this paper, the focus of interest is in detecting a directional change of the treatment effect, rather than detecting a difference in the size of the effect across subsets of patients. In clinical studies, a number of stratification factors exist. These factors lead to different subsets of patients: (i) multi-regional trials, where the subjects are incorporated from many countries or regions [2,3]; (ii) multi-centre trials, in which the strata refer to medical centres or clinics [4,5]; (iii) subgroup analysis, where the subsets of patients are defined by prognostic factors (e.g. age or disease severity) [6]; (iv) meta analysis, in which the strata are different studies of the same sort [7]; (v) biomarker studies, where treatment-effect-modifying biomarkers define subgroups of patients [8]; and (vi) genome-wide association studies with focus on gene–gene and gene–environment interactions, in which the subsets are defined by the genotype of a single nucleotide polymorphism and a potential environmental factor such as smoking status [9]. 
Throughout this paper, the term treatment-by-subset interaction is used to define the heterogeneity of the treatment effect among subsets of patients, where the subsets are defined depending on the context under consideration. In pharmaceutical and biomedical research, the outcome of interest is often an ‘event’ with the data taking a binary form commonly denoted as success or failure. Agresti and Hartzel [4] published a tutorial article where they discussed strategies for analysing binary-response data when observations occur for several strata. Unfortunately, their paper gives no recommendation for the analysis of qualitative interactions. Nevertheless, a number of tests for qualitative interactions are applicable for binary-response data; see, for example, Azzalini and Cox [10], Gail and Simon [11] and Piantadosi and Gail [12], among others. All of these tests provide only global inference for the detection of qualitative interactions, meaning that no statement on the source of the qualitative interaction is possible. A further disadvantage of these qualitative interaction tests is that the resulting p-values do not describe the extent of the heterogeneity of the treatment effect among the subgroups.

In this study, a method to detect qualitative treatment-by-subset interactions by applying multiple contrast tests for the ratio of risk differences is proposed. Using product-type interaction contrasts as presented by Gabriel et al. [13] allows the detection of the source of the inconsistency of the treatment effects among the subsets. In addition to the multiplicity-adjusted p-values, simultaneous confidence intervals for the ratios of risk differences are provided to quantify the magnitude of heterogeneity and to assess its clinical relevance.

This paper is organised as follows. In Sections 2 and 3, the stochastic model and the methodology to detect qualitative treatment-by-subset interactions are introduced. In Section 4, simulation studies are conducted to investigate the adequacy of the asymptotic approximation of the proposed confidence intervals and the power of the suggested method. The presented approach is applied to a multi-regional clinical trial in Section 5. Finally, some concluding remarks are given in Section 6.

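Table I. Contingency tables for the supposed study design including one treatment factor ($i = 1, \ldots, I$) and one pre-specified stratification factor ($j = 1, \ldots, J$).

             Stratum j = 1                            Stratum j = J
             Outcome                                  Outcome
Treatment    Success   Failure        Total    ...    Success   Failure        Total
1            y_11      n_11 - y_11    n_11     ...    y_1J      n_1J - y_1J    n_1J
2            y_21      n_21 - y_21    n_21     ...    y_2J      n_2J - y_2J    n_2J
...          ...       ...            ...      ...    ...       ...            ...
I            y_I1      n_I1 - y_I1    n_I1     ...    y_IJ      n_IJ - y_IJ    n_IJ
Total        y_+1      n_+1 - y_+1    n_+1     ...    y_+J      n_+J - y_+J    n_+J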

2. THE MODEL

A completely randomised design including one treatment factor and one pre-specified stratification factor is supposed. Let $I$ be the number of levels of the first factor (with index $i = 1, \ldots, I$), representing the treatment groups. Furthermore, let $J$ be the number of levels of the second factor (with index $j = 1, \ldots, J$), representing regions, centres, environments and so on. The number of observations for each factor combination is permitted to vary and is denoted by $n_{ij}$. The total number of observations in the study is given by $N = \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij}$. The primary endpoint is a binary outcome represented by 1 and 0, with the generic labels success and failure. The total number of successes in each factor combination is given by $y_{ij}$. Representative contingency tables for this study design are given in Table I, where the plus notation means the sum over the corresponding factor. It is further assumed that $y_{ij}$ follows a binomial distribution with parameters $n_{ij}$ and $\pi_{ij}$, denoted by $\mathrm{bin}(n_{ij}, \pi_{ij})$. The parameter $\pi_{ij}$ corresponds to the success probability of the $i$th treatment in the $j$th subset:

$$\pi_{ij} = P(Y = 1 \mid I = i, J = j). \tag{1}$$

The maximum likelihood estimator of the success probability is the sample proportion $\hat{\pi}_{ij} = y_{ij}/n_{ij}$, and its standard error is $\sigma(\hat{\pi}_{ij}) = \sqrt{\pi_{ij}(1 - \pi_{ij})/n_{ij}}$ [14]. The vector of success probabilities for each factor combination is defined by $\pi = (\pi_{11}, \pi_{21}, \ldots, \pi_{1J}, \pi_{2J})$, where the elements of $\pi$ are ordered primarily according to the stratification factor and, within the levels of the stratification factor, according to the treatment factor. For the presented method, the vector $\pi$ is given the new index $l$, yielding $\pi = (\pi_1, \ldots, \pi_L)$, where $L = I \times J$ is the number of treatment-by-subset combinations.

For illustrative purposes and without loss of generality, a study with one control and one treatment group ($I = 2$) is assumed. Furthermore, a treatment effect is defined as the difference of success probabilities between the two treatment groups, also known as the risk difference, $\beta_j = \pi_{1j} - \pi_{2j}$, with standard error

$$\sigma(\hat{\beta}_j) = \sqrt{\frac{\pi_{1j}(1 - \pi_{1j})}{n_{1j}} + \frac{\pi_{2j}(1 - \pi_{2j})}{n_{2j}}}$$

[14].
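The estimators of this section can be illustrated with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, the standard errors are evaluated at the estimated proportions (plug-in), and the counts are those of the Belgian stratum in Table II (3 events among 68 patients under Meto CR/XL, 13 among 66 under placebo).

```python
from math import sqrt


def proportion(y, n):
    """Sample proportion y/n and its plug-in standard error."""
    p = y / n
    return p, sqrt(p * (1 - p) / n)


def risk_difference(y1, n1, y2, n2):
    """Estimated risk difference (group 1 minus group 2) and its standard error."""
    p1, _ = proportion(y1, n1)
    p2, _ = proportion(y2, n2)
    beta = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return beta, se


# Belgian stratum of Table II: Meto CR/XL as group 1, placebo as group 2
# (the group ordering is an assumption made for this illustration).
beta_hat, se_hat = risk_difference(3, 68, 13, 66)
print(f"risk difference = {beta_hat:.4f}, standard error = {se_hat:.4f}")
```

Dividing the estimated risk difference by its standard error gives roughly 2.78 in absolute value for this stratum.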

3. QUALITATIVE INTERACTIONS

3.1. Review of existing methods

Several tests have been published to detect qualitative interactions. In their seminal paper, Gail and Simon [11] proposed a likelihood-ratio test to check for qualitative interactions. A much less well-known approach in biomedical research was previously presented by Azzalini and Cox [10]. Piantadosi and Gail [12] developed a standardised range test as an extension to the likelihood-ratio test of Gail and Simon. Some extensions of these methods were published later by Ciminera et al. [15], Pan and Wolfe [16] and Li and Chan [17]. Among these methods, only the Azzalini and Cox test is applicable in situations with more than two treatment groups. Furthermore, all of these methods provide only a global decision about the presence or absence of qualitative interactions among subgroups. Recently, Kitsche and Hothorn [18] proposed a method to detect qualitative interactions for normally distributed outcome variables. They suggested using the ratio of treatment effects to test the null hypothesis of no qualitative interaction. This method is able to detect the source of a significant qualitative interaction and to give a statement on the biomedical relevance by providing simultaneous confidence intervals.

3.2. Proposed method

The product-type interaction contrasts developed by Gabriel et al. [13] are applicable to detect interactions between the treatment factor and the stratification factor. In the simplest case, product-type interaction contrasts result in pairwise differences of risk differences: $(\pi_{1j} - \pi_{2j}) - (\pi_{1j'} - \pi_{2j'}) = \beta_j - \beta_{j'}$, with $j \neq j'$. Nevertheless, this approach does not distinguish between quantitative and qualitative interactions. In contrast, building the ratio of treatment effects allows the detection of qualitative interactions: if the treatment effects differ in their sign (qualitative interaction), the resulting ratio of treatment effects is negative, whereas the ratio of treatment effects is positive in cases where the treatment effects have the same sign (quantitative interaction). (For a detailed description, see Kitsche and Hothorn [18].) The ratio of risk differences can also be formulated as the ratio of linear combinations of binomial proportions:

$$\gamma_m = \frac{h_m'\pi}{d_m'\pi} = \frac{\sum_{l=1}^{L} h_{lm}\pi_l}{\sum_{l=1}^{L} d_{lm}\pi_l}, \qquad m = 1, \ldots, M, \tag{2}$$

where $h_m$ and $d_m$ represent, respectively, the $m$th numerator and denominator contrasts, which define linear combinations of proportions. The elements of the numerator and denominator contrasts of $\gamma_m$ consist of the user-defined contrast coefficients $h_{lm}$ and $d_{lm}$. (For details on the formulation of contrasts, see, e.g. Kirk [19].) To explain the source of the qualitative interaction, a total number of $M$ ratios of treatment effects is considered. The vectors of the $M$ linear combinations are stored in the contrast matrices $C_{\mathrm{Numerator}} = (h_1, \ldots, h_M)$ and $C_{\mathrm{Denominator}} = (d_1, \ldots, d_M)$ of size $M \times L$, where $m = 1, \ldots, M$ indexes the contrasts of interest and $l = 1, \ldots, L$ indexes the factor combinations. The choice of the contrast matrices depends on the research question under consideration. For example, in a multi-centre clinical trial, the comparison of each centre’s treatment effect to the overall treatment effect is appropriate (see Appendix A, available online as Supporting Information).

The global null hypothesis of no qualitative interaction and the corresponding alternative can be formulated as follows:

$$H_0: (\gamma_1 \geq \psi \cap \gamma_2 \geq \psi \cap \ldots \cap \gamma_M \geq \psi) \tag{3}$$

and

$$H_A: (\gamma_1 < \psi \cup \gamma_2 < \psi \cup \ldots \cup \gamma_M < \psi). \tag{4}$$
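To make the ratio in Equation (2) concrete, the following sketch evaluates it for a toy design with two treatments and three strata. The success probabilities are invented for illustration, the helper names are hypothetical, and the denominator contrast (an unweighted average of the stratum-specific risk differences) is only one possible choice; the contrasts of Appendix A are not reproduced in this text.

```python
def ratio_of_contrasts(h, d, pi):
    """Ratio of two linear combinations of proportions: (h'pi) / (d'pi)."""
    numerator = sum(h_l * pi_l for h_l, pi_l in zip(h, pi))
    denominator = sum(d_l * pi_l for d_l, pi_l in zip(d, pi))
    return numerator / denominator


def numerator_contrast(j, n_strata):
    """Picks the risk difference pi_1j - pi_2j of stratum j (0-based)."""
    h = [0.0] * (2 * n_strata)
    h[2 * j], h[2 * j + 1] = 1.0, -1.0
    return h


def denominator_contrast(n_strata):
    """Unweighted mean of all stratum-specific risk differences (assumed choice)."""
    return [sign / n_strata for _ in range(n_strata) for sign in (1.0, -1.0)]


# pi is ordered by stratum, then by treatment: (pi_11, pi_21, pi_12, pi_22, pi_13, pi_23).
pi = [0.30, 0.50, 0.30, 0.50, 0.60, 0.50]  # hypothetical values; stratum 3 is reversed
gammas = [
    ratio_of_contrasts(numerator_contrast(j, 3), denominator_contrast(3), pi)
    for j in range(3)
]
print([round(g, 3) for g in gammas])
```

In this toy setting the first two ratios are positive and the third is negative, which is the pattern that marks stratum 3 as the source of a qualitative interaction.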

To test for the presence of qualitative interaction, $\psi$ equals 0, resulting in $H_0: (\gamma_1 \geq 0 \cap \gamma_2 \geq 0 \cap \ldots \cap \gamma_M \geq 0)$. If two treatment effects differ in their sign, the ratio of these effects $\gamma_m$ is less than zero and therefore indicates a qualitative interaction (Kitsche and Hothorn [18]).

3.2.1. Test for qualitative interaction. The ratios of linear combinations of binomial proportions in Equation (2) can be reformulated as the linear form $L_m = (h_m - \psi d_m)'\hat{\pi}$ [20]. Using this linear form and dividing it by the corresponding standard error leads to the test statistic:

$$T_m = \frac{(h_m - \psi d_m)'\hat{\pi}}{\sqrt{(h_m - \psi d_m)'\hat{V}M(h_m - \psi d_m)}}, \tag{5}$$

where $M$ is the diagonal matrix containing the reciprocals of the sample sizes $n_{ij}$ and $\hat{V}$ is a diagonal matrix containing the estimated group variances. The test statistic $T_m$ approximately follows a t-distribution with $\nu = \sum_{i=1}^{I} \sum_{j=1}^{J} (n_{ij} - 1)$ degrees of freedom. Under $H_0$, the random vector of test statistics $T = (T_1, \ldots, T_M)$ jointly follows a central, multivariate t-distribution with $\nu$ degrees of freedom and a correlation matrix $\hat{R} = [\hat{\rho}_{mm'}]$. Each element of $\hat{R}$ can be described by

$$\hat{\rho}_{mm'} = \frac{(h_m - \psi d_m)'\hat{V}M(h_{m'} - \psi d_{m'})}{\sqrt{(h_m - \psi d_m)'\hat{V}M(h_m - \psi d_m)}\,\sqrt{(h_{m'} - \psi d_{m'})'\hat{V}M(h_{m'} - \psi d_{m'})}}, \tag{6}$$

where $m \neq m'$. The global null hypothesis in Equation (3) is rejected if $|T_m| > t_{\alpha}(\nu)$ for at least one $m$. The critical point $t_{\alpha}(\nu)$ is the $\alpha$-level equi-coordinate percentage point of an $M$-variate t-distribution with the correlation matrix $\hat{R}$ and $\nu$ degrees of freedom. The associated, one-tailed, adjusted p-values can be calculated as $p_m = 1 - P(t \leq |T_m|)$, where the variables $t$ follow the multivariate t-distribution $\mathrm{Mt}_{df = \nu, \hat{R}}$.

Please note that in the case of testing for qualitative interaction ($\psi = 0$), the information provided by the denominator drops out, meaning that the local null hypotheses in Equation (3) reduce to $\gamma_m = h_m'\pi \geq 0$. Because the numerator contrast represents the $m$th treatment effect (see Appendix A, available online as Supporting Information), this test statistic tests the null hypothesis that the $m$th treatment effect is smaller than zero. This hypothesis directly corresponds to the global one-sided hypothesis of the Gail and Simon test, that all treatment effects are smaller than zero: $H_0: \delta \in O^{+}$. Therefore, the proposed test to detect a qualitative interaction is only applicable if the direction of the considered treatment effect is known a priori. Nonetheless, if interest is in testing for the heterogeneity of the treatment effect ($\psi > 0$), the proposed test is applicable without any assumptions on the direction of the treatment effect.

3.2.2. Simultaneous confidence intervals. In contrast to adjusted p-values, simultaneous confidence intervals for the parameters $\gamma_m$ provide information on the amount of qualitative interaction and can therefore be used to assess its clinical relevance. The approach published by Dilba et al. [21] to construct simultaneous confidence intervals for multiple ratios is adopted here. The corresponding test statistics are now defined as

$$T_m(\gamma_m) = \frac{(h_m - \gamma_m d_m)'\hat{\pi}}{\sqrt{(h_m - \gamma_m d_m)'\hat{V}M(h_m - \gamma_m d_m)}}. \tag{7}$$

The vector of test statistics $T_m(\gamma_m)$ follows an $M$-variate t-distribution with $\nu$ degrees of freedom and a correlation matrix $\hat{R}(\gamma_1, \ldots, \gamma_M)$. The correlations in $\hat{R}(\gamma_1, \ldots, \gamma_M)$ are similar to those defined in Equation (6), except that $\psi$ is replaced by $\gamma_m$ and $\gamma_{m'}$. The correlations between two contrasts depend on the user-defined contrasts $h_m$ and $d_m$, the sample sizes $n_{ij}$, the estimated group variances and the unknown ratios $\gamma$. This paper considers a plug-in approach proposed by Dilba et al. [21], where the unknown ratios in $\hat{R}(\gamma_1, \ldots, \gamma_M)$ are replaced by their maximum likelihood estimators. The smallest solution of the quadratic equation in $\gamma_m$ from $T_m^2(\gamma_m) = t_{\alpha}^2(\nu)$ leads to the $m$th one-sided upper confidence interval for $\gamma_m$ [21]. The null hypothesis of no qualitative interaction in Equation (3) is rejected if the upper confidence limit is smaller than the margin $\psi = 0$. The tools for the calculation of adjusted p-values and simultaneous confidence intervals for the ratios of user-defined linear combinations of proportions are available in the add-on package mratios [22] for the statistical software package R [23].

4. MONTE CARLO SIMULATION

4.1. Coverage probability and power calculations

Monte Carlo simulations for different parameter settings were conducted to investigate the adequacy of the asymptotic approximation of the proposed confidence intervals and the power of the suggested method for detecting qualitative interactions for binomial-response variables. In the first instance, the simultaneous coverage probabilities of the presented one-sided confidence intervals were analysed. Here, the coverage probability was defined as the probability that each true value $\gamma_m$, for $m = 1, \ldots, M$ ratios of treatment effects of interest, is less than or equal to the corresponding upper bound of the simultaneous confidence intervals ($U_m$). In mathematical notation, the coverage probability is

$$P(\gamma_m \in (-\infty, U_m],\ \forall m = 1, \ldots, M). \tag{8}$$

Second, the empirical power for different parameter settings for the proposed method was calculated. The frequently applied Gail and Simon test was chosen as the comparison method. (For a detailed description of the Gail and Simon test, see Appendix B, available online as Supporting Information.) Both methods test the global null hypothesis of no qualitative interaction. The test statistic of the Gail and Simon test is based on the treatment differences, whereas the test statistic of the proposed method is based on the simultaneous assessment of user-defined ratios of the treatment differences. The empirical power was computed as the rate of rejected null hypotheses out of the simulated data sets. For the proposed method, the global null hypothesis was rejected if any of the local null hypotheses was rejected:

$$P(p_m \leq 0.05;\ \exists m = 1, \ldots, M). \tag{9}$$

4.2. Design

For the simulation study, 10 levels of the stratification factor, representing 10 regions or centres ($j = 1, \ldots, 10$), and two levels of the treatment factor, representing one control and one treatment group ($i = 1, 2$), were considered. For each simulation study, the vector of true binomial proportions $\pi$ and the corresponding vector of sample sizes $n$ were defined. For the analysis of the simultaneous coverage probabilities, the proportions in the first subset were set as $\pi_{11} = 0.2 + \delta$ and $\pi_{21} = 0.1$, whereas the proportions in the remaining subsets were set as $\pi_{i=1, j \neq 1} = 0.1$ and $\pi_{i=2, j \neq 1} = 0.2 + \delta$. This design corresponds to a qualitative interaction due to subset one. The shift $\delta$ was varied between 0 and 0.5 in increments of 0.01 to increase the treatment effect in the subsets. Furthermore, the sample size for each treatment-by-subset combination was varied: (i) $n_{ij} = 40$ for all $i$ and $j$; (ii) $n_{ij} = 30$ for all $i$ and $j$; and (iii) $n_{ij} = 20$ for all $i$ and $j$.

For the power comparisons, the number of reversal subsets was varied, whereas the number of observations per factor combination was set to $n_{ij} = 40$. The success probabilities for the treatment groups were set to $\pi_{1j} = 0.3$ and $\pi_{2j} = 0.5$, resulting in a treatment effect of $\beta_j = 0.2$. To simulate a qualitative interaction, the treatment effect for a specified number of subsets was inverted by an increasing amount $\delta$ ($\pi_{1j} = 0.5 - \delta$ and $\pi_{2j} = 0.5$, with $\delta = 0, \ldots, 0.45$). The number of reversal subsets was set to 1, 2 and 3. For each parameter setting, 10,000 data sets were simulated.

4.3. Results

Figure 1 shows the coverage probability, as defined in Equation (8), versus the shift parameter $\delta$ for the proposed simultaneous confidence intervals for the parameters $\gamma_m$. The confidence level was set to 95%, and the number of observations for each treatment-by-subset combination was specified by $n_{ij} = 20$, 30 and 40. The under-coverage of the proposed intervals decreases as the parameter $\delta$ increases. Furthermore, the coverage probability converges to the nominal level of 95% with an increasing number of observations per treatment-by-subset combination.

Figure 2 presents the empirical power of the proposed method and of the Gail and Simon test to detect a qualitative interaction against the shift parameter $\delta$. The plots show several scenarios, namely different numbers of subsets with a reversal treatment effect. As expected, for both methods, the empirical power increases with an increasing amount of qualitative interaction (increasing $\delta$). Furthermore, the empirical power of both methods to detect a qualitative interaction increases as the number of subsets with a reversal treatment effect increases. In the case of one reversal treatment effect, the proposed method is more powerful than the Gail and Simon test (up to 25%). With an increasing number of reversal treatment effects, the two methods under consideration perform similarly. On balance, it is recommended to use the proposed method because it is at least as powerful as the Gail and Simon test. Please note that the major advantage of the proposed method is its provision of additional information on the source and the amount of the qualitative interaction, rather than its gain in power in special situations.
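The logic of such a Monte Carlo study can be sketched in Python using only the standard library. The sketch below is not the simulation reported here: it uses the design with one reversed stratum (stratum 1), assumes that the expected direction of the effect is a negative risk difference, and replaces the equi-coordinate point of the multivariate t-distribution by a Bonferroni-adjusted normal quantile, which is a simpler and typically more conservative stand-in. The function name and the default number of simulated data sets are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_rejection_rate(delta, n=40, n_strata=10, n_sim=200, alpha=0.05, seed=1):
    """Share of simulated trials in which at least one stratum shows a
    significantly positive risk difference (Bonferroni-adjusted z-test)."""
    rng = random.Random(seed)
    # Stand-in for the multivariate t critical point: Bonferroni normal quantile.
    z_crit = NormalDist().inv_cdf(1 - alpha / n_strata)
    # Stratum 1 is reversed; the remaining strata share the opposite direction.
    probs = [(0.2 + delta, 0.1)] + [(0.1, 0.2 + delta)] * (n_strata - 1)
    rejections = 0
    for _ in range(n_sim):
        reject = False
        for p1, p2 in probs:
            y1 = sum(rng.random() < p1 for _ in range(n))  # binomial draw, group 1
            y2 = sum(rng.random() < p2 for _ in range(n))  # binomial draw, group 2
            e1, e2 = y1 / n, y2 / n
            se = sqrt(e1 * (1 - e1) / n + e2 * (1 - e2) / n)
            z = (e1 - e2) / se if se > 0 else 0.0  # guard against degenerate samples
            if z > z_crit:
                reject = True
        rejections += reject
    return rejections / n_sim


for delta in (0.0, 0.25, 0.5):
    print(delta, simulate_rejection_rate(delta))
```

A larger number of simulated data sets (10,000 in the study above) reduces the Monte Carlo error of the estimated rate; the small default here only keeps the sketch fast.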

Figure 1. The coverage probability for the simultaneous confidence intervals for the parameters $\gamma_m$. The horizontal black line marks the nominal level of 95%. An increasing shift parameter $\delta$ corresponds to the increasing qualitative interaction. The number of observations for each treatment-by-subset combination was set to $n_{ij} = 20$, 30 and 40 (dotted, dashed and solid lines, respectively). [Plot not reproduced; axes: shift ($\delta$) and coverage.]

Figure 2. The probability of detecting at least one qualitative interaction among 10 subsets of patients. Solid lines represent the proposed approach (Ratio); dashed lines represent the Gail and Simon test (Gail Simon). Different columns reflect increasing numbers of subsets with reversal treatment effects (1, 2 and 3). [Plot not reproduced; panels: one, two and three reversal subsets; axes: $\delta$ and empirical power.]

5. ILLUSTRATIVE EXAMPLE

In this section, the proposed method is demonstrated on a multi-regional trial, namely the Metoprolol Controlled-Release Randomized Intervention Trial in Heart Failure [24]. The large-scale randomised, double-blind, placebo-controlled trial was conducted to investigate the treatment effect of adding once-daily doses of metoprolol controlled-release/extended-release (Meto CR/XL) to the optimum standard therapy in terms of lowering mortality in patients with symptomatic heart failure. A total of 3991 patients were randomised into the placebo or the Meto CR/XL group in 14 countries (Table II). Following Quan et al. [25], the data from Finland were combined with the data from Denmark, and the data from the Netherlands were combined with the data from Switzerland because no event was observed in the Meto CR/XL group in Finland and Switzerland.

Table II. The number of successes and failures for each region in the Metoprolol Controlled-Release Randomized Intervention Trial in Heart Failure.

                                             Outcome             Total          Proportion
Region ($j$)                   Treatment     Success   Failure   ($n_{ij}$)     Success
Belgium                        Meto CR/XL    3         65        68             0.04
                               Placebo       13        53        66             0.20
Czech Republic                 Meto CR/XL    9         114       123            0.07
                               Placebo       17        107       124            0.14
Denmark/Finland                Meto CR/XL    11        150       161            0.07
                               Placebo       13        151       164            0.08
Germany                        Meto CR/XL    19        233       252            0.08
                               Placebo       31        216       247            0.13
Hungary                        Meto CR/XL    16        195       211            0.08
                               Placebo       29        183       212            0.14
Iceland                        Meto CR/XL    2         17        19             0.11
                               Placebo       2         20        22             0.09
Norway                         Meto CR/XL    6         91        97             0.06
                               Placebo       11        94        105            0.10
Poland                         Meto CR/XL    8         94        102            0.08
                               Placebo       8         94        102            0.08
Sweden                         Meto CR/XL    2         37        39             0.05
                               Placebo       9         37        46             0.20
The Netherlands/Switzerland    Meto CR/XL    14        285       299            0.05
                               Placebo       26        265       291            0.09
UK                             Meto CR/XL    4         83        87             0.05
                               Placebo       9         74        83             0.11
USA                            Meto CR/XL    51        481       532            0.10
                               Placebo       49        490       539            0.09
Total                          Meto CR/XL    145       1845      1990           0.07
                               Placebo       217       1784      2001           0.11

Meto CR/XL, metoprolol controlled-release/extended-release.

From Table II, the overall proportion of events is lower under Meto CR/XL than under placebo, whereas in two regions, Iceland and the USA, it is higher. The goal is now to decide whether the regions Iceland and USA represent a significant interaction, or whether this heterogeneity of the treatment effect occurs by chance only. According to Wedel et al. [26], significant qualitative interactions were of particular interest, especially significant departures from the overall effect among any of the participating countries. To verify this statement, the data set is further analysed with the Gail and Simon test and with the method proposed here.

Gail Simon test. The test statistic for the one-sided Gail and Simon test can be calculated as $Q = 0.024 + 0.078 = 0.102$. This value is smaller than the critical value 12.60 ($\alpha = 0.05$, $J = 12$), and therefore the null hypothesis of no qualitative interaction cannot be rejected. The corresponding p-value is 0.996.
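The two contributions to the Gail and Simon statistic can be retraced from the counts in Table II. The sketch below assumes that the one-sided statistic is the sum of squared standardised risk differences over the regions whose estimated effect points against the overall direction (here, a higher event proportion under Meto CR/XL), with unpooled binomial variances. Appendix B is not reproduced in this text, so this form and the names used are assumptions.

```python
# Events and group sizes per region: (y_meto, n_meto, y_placebo, n_placebo), from Table II.
TABLE_II = {
    "Belgium": (3, 68, 13, 66),
    "Czech Republic": (9, 123, 17, 124),
    "Denmark/Finland": (11, 161, 13, 164),
    "Germany": (19, 252, 31, 247),
    "Hungary": (16, 211, 29, 212),
    "Iceland": (2, 19, 2, 22),
    "Norway": (6, 97, 11, 105),
    "Poland": (8, 102, 8, 102),
    "Sweden": (2, 39, 9, 46),
    "The Netherlands/Switzerland": (14, 299, 26, 291),
    "UK": (4, 87, 9, 83),
    "USA": (51, 532, 49, 539),
}


def reversal_contributions(table):
    """Squared standardised risk differences of regions with a positive
    (Meto CR/XL minus placebo) difference in event proportions."""
    contributions = {}
    for region, (y1, n1, y2, n2) in table.items():
        p1, p2 = y1 / n1, y2 / n2
        diff = p1 - p2
        variance = p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2
        if diff > 0:
            contributions[region] = diff ** 2 / variance
    return contributions


contributions = reversal_contributions(TABLE_II)
for region, value in contributions.items():
    print(f"{region}: {value:.3f}")
print(f"Q = {sum(contributions.values()):.3f}")
```

Under these assumptions only Iceland and the USA contribute, and their rounded contributions agree with the two summands 0.024 and 0.078 given above; summing the unrounded values gives a total that differs from 0.102 only in the third decimal.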

Ratio of treatment effects. Applying the method proposed in this article to detect the source of a potential qualitative interaction, the interest is now on the ratios of treatment effects:

$$\gamma_m = \frac{h_m'\pi}{d_m'\pi}, \qquad m = 1, \ldots, 12.$$

To determine the deviation of each region from the overall effect, the parameters $\gamma_m$ are defined as the ratios of the risk difference of each region to the overall risk difference. Therefore, the numerator and denominator contrast matrices from Equations (A.1) and (A.2) are used. The corresponding estimated parameters $\hat{\gamma}_m$, the test statistics, the multiplicity-adjusted p-values and the simultaneous upper confidence limits are presented in Table III. Although the estimates $\hat{\gamma}_{\mathrm{Iceland}} = -0.405$ and $\hat{\gamma}_{\mathrm{USA}} = -0.140$ would suggest a qualitative interaction, the null hypothesis of no qualitative interaction cannot be rejected on the basis of either the adjusted p-values or the simultaneous confidence intervals.

The observed reversal treatment effect in the US population of the Metoprolol Controlled-Release Randomized Intervention Trial in Heart Failure was already the subject of a serious discussion in the scientific literature; see, for example, Wedel et al. [26], Moyé [27] and Wittes [28]. The Food and Drug Administration (FDA) decided to perform a treatment-by-country interaction analysis [26]. The FDA interpreted the result as in this quote from Moyé [27]:

    The finding of adverse United States mortality could of course be attributable to chance, but it could alternatively be a genuine finding, the result of US-differences in demographics or concomitant therapy.

The FDA handled the discordant finding by approving the drug, which agrees with the conclusion drawn from the results in Table III.
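The point estimates of the ratios and the test statistics for $\psi = 0$ can be retraced from Table II with a short script. Two assumptions are made, since Appendix A is not reproduced in this text: the overall effect in the denominator is taken as the average of the regional risk differences weighted by region size, $(n_{1j} + n_{2j})/N$, and risk differences are computed as Meto CR/XL minus placebo. With these choices the magnitudes printed below agree with the estimates and test statistics reported for this trial; the sign of the test statistic depends on the assumed ordering of the groups.

```python
from math import sqrt

# Events and group sizes per region: (y_meto, n_meto, y_placebo, n_placebo), from Table II.
TABLE_II = {
    "Belgium": (3, 68, 13, 66),
    "Czech Republic": (9, 123, 17, 124),
    "Denmark/Finland": (11, 161, 13, 164),
    "Germany": (19, 252, 31, 247),
    "Hungary": (16, 211, 29, 212),
    "Iceland": (2, 19, 2, 22),
    "Norway": (6, 97, 11, 105),
    "Poland": (8, 102, 8, 102),
    "Sweden": (2, 39, 9, 46),
    "The Netherlands/Switzerland": (14, 299, 26, 291),
    "UK": (4, 87, 9, 83),
    "USA": (51, 532, 49, 539),
}


def regional_ratios(table):
    """Ratio of each regional risk difference to the size-weighted overall
    risk difference, and the standardised regional risk difference."""
    total_n = sum(n1 + n2 for _, n1, _, n2 in table.values())
    diffs, errors = {}, {}
    for region, (y1, n1, y2, n2) in table.items():
        p1, p2 = y1 / n1, y2 / n2
        diffs[region] = p1 - p2
        errors[region] = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # Assumed denominator: region-size-weighted mean of the risk differences.
    overall = sum(
        (n1 + n2) / total_n * diffs[region]
        for region, (_, n1, _, n2) in table.items()
    )
    ratios = {region: diffs[region] / overall for region in table}
    statistics = {region: diffs[region] / errors[region] for region in table}
    return ratios, statistics


ratios, statistics = regional_ratios(TABLE_II)
for region in TABLE_II:
    print(f"{region:28s} ratio = {ratios[region]:7.3f}   |T| = {abs(statistics[region]):.3f}")
```

The negative ratios for Iceland and the USA reflect regional effects whose sign differs from the overall effect; the adjusted p-values and upper confidence limits additionally require the multivariate t-distribution and are not computed in this sketch.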

Table III. Estimated parameters of interest $\hat{\gamma}_m$, resulting test statistic, multiplicity-adjusted p-values to test for qualitative interaction and simultaneous upper confidence limits for the parameters $\gamma_m$ in the multi-regional trial.

Country                        $\hat{\gamma}_m$   Test statistic   Adj. p-value   Upper
Belgium                        4.315              2.783            1.000          13.443
Czech Republic                 1.805              1.648            1.000          6.199
Denmark/Finland                0.309              0.378            1.000          2.758
Germany                        1.415              1.866            1.000          4.206
Hungary                        1.721              2.045            1.000          5.082
Iceland                        -0.405             0.154            0.999          7.862
Norway                         1.211              1.111            1.000          5.253
Poland                         0.000              0.000            1.000          3.121
Sweden                         4.076              2.113            1.000          14.064
The Netherlands/Switzerland    1.200              2.053            1.000          3.577
UK                             1.763              1.529            1.000          6.674
USA                            -0.140             0.279            0.997          0.984

6. DISCUSSION

In this study, a post hoc method to detect the type (quantitative versus qualitative) and the source of a potential treatment-by-subset interaction for binomial-outcome variables is presented. Several preliminary global tests for the null hypothesis of the homogeneity of the treatment effect between strata for binary-response variables are available, for example, the Breslow–Day test [29] or a likelihood-ratio test [4,14]. Nevertheless, none of them is able to differentiate between quantitative and qualitative interactions. This paper’s approach provides adjusted p-values and simultaneous confidence intervals for the ratios of linear combinations of binomial proportions to detect qualitative interactions. Within this approach, the difference of proportions was used to measure effects because of its simple interpretation, especially when working with the ratio of treatment effects. The usefulness of adjusted p-values is limited for the assessment of qualitative interactions, because the test statistic reduces to a test for the numerator. Therefore, the test is only applicable with an a priori assumption on the direction of the treatment effects. This hypothesis corresponds to the one-sided Gail and Simon test.

Depending on the stratification factor’s structure and the research question, the investigator is free to choose the appropriate numerator and denominator contrast matrices to define the parameters of interest. This technique makes the approach widely applicable to several biomedical research fields, for example, multi-regional trials, multi-centre trials, meta-analysis and biomarker studies. In this paper, the comparison with the grand mean is chosen to detect a region with a reversal treatment effect. If the number of strata is small, then all pairwise ratios of treatment effects are meaningful.

Although this paper focused on a treatment factor with two levels, the methodology is also applicable for treatment factors with more than two levels, for example, one control group and several dose groups. In those cases, the user has to define appropriate contrasts for the treatment factor. Some possible choices for the contrasts of a treatment factor with several dose groups are given in Bretz and Hothorn [30]. The proposed method has its limitations in cases where the denominator of the ratio of treatment effects is not significantly different from zero because the simultaneous confidence intervals are not calculable in this case. Nevertheless, the adjusted p-values are available in those cases.

Finally, it should be noted that the application of the suggested method is not restricted to detecting qualitative interactions (where the margin $\psi$ in Equation (3) is zero). Depending on the formulation of the null hypothesis, the ratio of treatment differences is applicable to (i) detect general treatment-by-subset interactions ($\psi = 1$), (ii) assess the homogeneity (consistency) of the treatment effect ($\psi > 0$) or (iii) test for qualitative interaction ($\psi = 0$). To test for the homogeneity of the treatment effect over subgroups, the margin $\psi$ could be set to a medically relevant threshold [31]. The Japanese health authority, the Ministry of Health, Labour and Welfare, defines consistency of the treatment effect for Japanese patients in a multi-regional trial as $\delta_{\mathrm{Japan}}/\delta_{\mathrm{Overall}} > \psi$, where $\psi \geq 0.5$. This definition corresponds to a formulation of the local null hypothesis $H_0^{\mathrm{Japan}}: \gamma_{\mathrm{Japan}} \geq 0.5$ to detect an inconsistent treatment effect for the Japanese region. A global null hypothesis is the union of all region-specific null hypotheses along with their own consistency thresholds.

REFERENCES [1] Peto R. Treatment of cancer. Chapman & Hall: London, 1982. [2] Quan H, Li M, Chen J, Gallo P, Binkowitz B, Ibia E, Tanaka Y, Ouyang SP, Luo X, Li G, Menjoge S, Talerico S, Ikeda K. Assessment of consistency of treatment effects in multiregional clinical trials. Drug Information Journal 2010; 44(5):617–632. [3] Tsou HH, James Hung H, Chen YM, Huang WS, Chang WJ, Hsiao CF. Establishing consistency across all regions in a multi-regional clinical trial. Pharmaceutical Statistics 2012; 11(4):295–299. [4] Agresti A, Hartzel J. Strategies for comparing treatments on a binary response with multi-centre data. Statistics in Medicine 2000; 19(8):1115–1139. [5] Dragalin V, Fedorov V. Design of multi-centre trials with binary response. Statistics in Medicine 2006; 25(16):2701–2719. [6] Wang R, Ware J. Detecting moderator effects using subgroup analyses. Prevention Science 2013; 14(2):111–120.

[7] Higgins J, Thompson S, Spiegelhalter D. A re-evaluation of random-effects meta-analysis. Journal of the Royal Statistical Society. Series A: Statistics in Society 2009; 172(1):137–159.
[8] Michiels S, Potthoff RF, George SL. Multiple testing of treatment-effect-modifying biomarkers in a randomized clinical trial with a survival endpoint. Statistics in Medicine 2011; 30(13):1502–1518.
[9] Han S, Rosenberg P, Chatterjee N. Testing for gene-environment and gene-gene interactions under monotonicity constraints. Journal of the American Statistical Association 2012; 107(500):1441–1452.
[10] Azzalini A, Cox DR. Two new tests associated with analysis of variance. Journal of the Royal Statistical Society. Series B (Methodological) 1984; 46(2):335–343.
[11] Gail M, Simon R. Tests for qualitative interactions between treatment effects and patient subsets. Biometrics 1985; 41(2):361–372.
[12] Piantadosi S, Gail M. A comparison of the power of two tests for qualitative interactions. Statistics in Medicine 1993; 12:1239–1248.
[13] Gabriel KR, Putter J, Wax Y. Simultaneous confidence intervals for product-type interaction contrasts. Journal of the Royal Statistical Society Series B - Statistical Methodology 1973; 35(2):234–244.
[14] Agresti A. Categorical data analysis (3rd edn), Wiley series in probability and statistics. Wiley-Interscience: Hoboken NJ, 2013.
[15] Ciminera J, Heyse J, Nguyen H, Tukey J. Tests for qualitative treatment-by-centre interaction using a ‘pushback’ procedure. Statistics in Medicine 1993; 12:1033–1045.
[16] Pan G, Wolfe D. Test for qualitative interaction of clinical significance. Statistics in Medicine 1997; 16:1645–1652.
[17] Li J, Chan ISF. Detecting qualitative interactions in clinical trials: an extension of range test. Journal of Biopharmaceutical Statistics 2006; 16(6):831–841.
[18] Kitsche A, Hothorn LA. Testing for qualitative interaction using ratios of treatment differences. Statistics in Medicine 2013; 33(9):1477–1489.
[19] Kirk RE. Experimental design: procedures for the behavioral sciences (3rd edn). Brooks/Cole: Pacific Grove, Calif, 1995.
[20] Fieller EC. Some problems in interval estimation. Journal of the Royal Statistical Society. Series B (Methodological) 1954; 16(2):175–185.
[21] Dilba G, Bretz F, Guiard V. Simultaneous confidence sets and confidence intervals for multiple ratios. Journal of Statistical Planning and Inference 2006; 136(8):2640–2658.

[22] Dilba G, Hasler M, Gerhard D, Schaarschmidt F. mratios: inferences for ratios of coefficients in the general linear model, 2012. Available at: http://CRAN.R-project.org/package=mratios (accessed 15.07.2013).
[23] R Core Team. R: a language and environment for statistical computing, 2012. Available at: http://www.R-project.org/ (accessed 15.07.2013).
[24] MERIT-HF Study Group. Effect of metoprolol CR/XL in chronic heart failure: Metoprolol CR/XL Randomised Intervention Trial in Congestive Heart Failure (MERIT-HF). Lancet 1999; 353(9169):2001–2007.
[25] Quan H, Li M, Shih W, Ouyang S, Chen J, Zhang J, Zhao PL. Empirical shrinkage estimator for consistency assessment of treatment effects in multi-regional clinical trials. Statistics in Medicine 2013; 32(10):1691–1706, DOI: 10.1002/sim.5543.
[26] Wedel H, Demets D, Deedwania P, Fagerberg B, Goldstein S, Gottlieb S, Hjalmarson A, Kjekshus J, Waagstein F, Wikstrand J, MERIT-HF Study Group. Challenges of subgroup analyses in multinational clinical trials: experiences from the MERIT-HF trial. American Heart Journal 2001; 142(3):502–511.
[27] Moyé LA. Multiple analyses in clinical trials: fundamentals for investigators, Statistics for biology and health. Springer: New York, NY, USA, 2003.
[28] Wittes J. Why is this subgroup different from all other subgroups? Thoughts on regional differences in randomized clinical trials. In Proceedings of the Fourth Seattle Symposium in Biostatistics: Clinical Trials. Springer: New York, NY, USA, 2013. DOI: 10.1007/978-1-4614-5245-4_7.
[29] Breslow N, Day N. Statistical methods in cancer research Volume I The analysis of case-control studies. IARC Scientific Publications 1980; 32:5–338.
[30] Bretz F, Hothorn LA. Testing dose-response relationships with a priori unknown, possibly nonmonotone shapes. Journal of Biopharmaceutical Statistics 2001; 11(3):193–207.
[31] Ministry of Health Labour and Welfare of Japan. Basic concepts for joint international clinical trials, 2007.

SUPPORTING INFORMATION
Additional supporting information may be found in the online version of this article at the publisher’s web site.
