STATISTICS IN MEDICINE, VOL. 10, 1-6 (1991)

MULTIPLE COMPARISONS IN OVER-THE-COUNTER DRUG CLINICAL TRIALS WITH BOTH POSITIVE AND PLACEBO CONTROLS

RALPH B. D'AGOSTINO
Department of Mathematics, Boston University, 111 Cummington Street, Boston, MA 02215, U.S.A.

AND

TIMOTHY C. HEEREN
School of Public Health, Boston University, 80 E. Concord Street, Boston, MA 02118, U.S.A.

SUMMARY

Evaluations of the efficacy of over-the-counter drugs using ANOVA techniques often misuse multiple comparison procedures. Studies that involve both a placebo control and established drugs as positive controls are especially prone to these problems. The most common mistake is use of a procedure that does not control the experimentwise type I error rate, usually the Duncan procedure or some version of multiple t tests. These procedures control the comparisonwise type I error rate, but lack the important experimentwise error control. The purpose of this paper is to clarify the issues involved in performing ANOVA followed by a multiple comparison procedure for over-the-counter drug studies involving both placebo and positive controls.

1. INTRODUCTION

This paper is motivated by the author's (RBD) experience since 1974 as a reviewer for the Food and Drug Administration's Over-the-Counter Drug Division and as a consultant to a number of pharmaceutical companies with over-the-counter (OTC) drugs. Evaluations of the efficacy of OTC drugs using ANOVA techniques often misuse multiple comparison procedures. Studies that involve both a placebo control and established drugs as positive controls are especially prone to these problems. The most common mistake involves use of a procedure that does not control the experimentwise type I error rate, usually the Duncan procedure or some version of multiple t tests (that is, multiple pairwise t tests or the least significant difference procedure). These procedures control the comparisonwise type I error rate, but lack the important experimentwise error control. Unfortunately, selective references to articles or software manuals have been used to justify the inappropriate use of these techniques.1,2 Statements appear in the literature that imply that one can control the experimentwise error level by performing multiple testing conditional on rejection of an overall F test from the ANOVA. These statements are incorrect.3,4 Such conditioning only controls experimentwise type I error in the situation where all means are equal; in the situation where some, but not all, means may be equal (as expected in OTC studies with both placebo and positive controls) the experimentwise type I error rate is uncontrolled. One well

0277-6715/91/010001-06$05.00
© 1991 by John Wiley & Sons, Ltd.

Received June 1988 Revised May 1990


known text states that if comparisons flow naturally from the study, the t test is appropriate. However, when many comparisons 'flow naturally' from the study, the experimentwise type I error rate will still be unacceptably high. The purpose of this paper is to clarify the issues involved in performing ANOVA followed by a multiple comparison procedure for OTC drug studies involving both placebo and positive controls. While we offer some opinions about specific procedures, it is not our intent to compare the various multiple comparison procedures. These procedures are extensively discussed in the statistical literature3,4 and presented in many basic texts that cover analysis of variance.2,5,6

2. EXAMPLE

Suppose we have interest in demonstrating the effectiveness of drug A, an analgesic for the treatment of tension or muscular contraction headache pain. Our study also involves one placebo treatment P and two positive control drugs B and C. Suppose we randomize subjects in a double-blind, parallel-sample fashion to the four treatments A, B, C and P; that is, each subject receives only one treatment. Also, suppose there is one primary measurement variable to determine treatment efficacy. Our study has two possible objectives. The first (Objective 1) is to demonstrate that drug A is superior to drugs B and C. The second (Objective 2) is to demonstrate that drug A is as effective as drugs B and C. We could have interest in showing that drug A is superior to B and as effective as C, but in this paper we will treat the two objectives as mutually exclusive. Our discussion is easily amended to the situation where both objectives are of interest.

3. STEPS IN THE ANALYSIS

We analyse the data from our study first with a one-factor analysis of variance, followed by multiple comparisons to investigate specific drug differences. These multiple comparisons accomplish two goals. First, we need to establish 'downside sensitivity' (see Section 6) to validate the experiment. Second, we need to make specific comparisons among drugs to address either Objective 1 or 2. The goals of the two objectives differ, and the choice of an appropriate multiple comparison procedure will depend on which objective we wish to pursue.

4. MULTIPLE COMPARISON PROCEDURES

Unfortunately, there is no one multiple comparison technique that is best in all situations. As a result, a number of such techniques have been proposed. A major distinction between techniques is whether they control the experimentwise or the comparisonwise type I error. Control of the experimentwise error rate means that the selected significance level describes the chance of at least one type I error among the set of comparisons that are made. Control of the comparisonwise error rate means that the selected significance level describes the chance of a type I error for any one comparison. In our example, there are six possible pairwise comparisons. If we make all six comparisons with an experimentwise significance level of 0.05, there is a 5 per cent chance of at least one type I error in the set of comparisons. If we make all six comparisons with a comparisonwise significance level of 0.05, the chance of a type I error somewhere in the set of comparisons may be as high as 30 per cent. Among the most commonly used procedures that control the experimentwise type I error are the Bonferroni,7 Scheffé8 and Tukey9,10 (or Tukey/Kramer11 when sample sizes differ across treatments) procedures. Dunnett's procedure12 is designed specifically for the comparison of a placebo to all other treatments, and controls the experimentwise error rate in this situation.
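The arithmetic behind these two error rates can be sketched as follows. The 30 per cent figure is the Bonferroni upper bound, 6 × 0.05; if the six tests were independent (only an approximation, since pairwise tests on four groups share data) the experimentwise rate would be roughly 26 per cent:

```python
# Experimentwise vs. comparisonwise error for the six pairwise
# comparisons among the four treatments A, B, C and P.
from math import comb

alpha = 0.05          # comparisonwise significance level
k = comb(4, 2)        # number of pairwise comparisons among 4 treatments

# Bonferroni upper bound on the experimentwise type I error rate
bonferroni_bound = k * alpha

# Exact experimentwise rate if the six tests were independent
# (an approximation here, since the pairwise tests share data)
independent_rate = 1 - (1 - alpha) ** k

print(k, bonferroni_bound, round(independent_rate, 4))  # 6 0.30... 0.2649
```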


Commonly used procedures that control the comparisonwise type I error are multiple pairwise t tests, the least significant difference method5 and Duncan's procedure.13

5. THE OVERALL TEST

The first step in the analysis should be an overall test of the null hypothesis that the treatments under investigation are equal in effectiveness. The overall test should reject the null hypothesis and show significant main effects for treatment. Some analysts might argue that it is reasonable to investigate specific differences, say between drug A and the placebo, even when the overall test fails to show significance. However, these OTC drug evaluations are not exploratory or investigational projects; they are confirmatory studies based on pilot research. We believe that in this situation, with known effective drugs in the study, differences between the positive controls and the placebo should be large enough to be reflected in the overall test. Non-rejection (that is, acceptance) of the overall null hypothesis should mean termination of the analysis.

6. DOWNSIDE SENSITIVITY

Given that the overall test shows a difference between treatments, the study must show that drugs A, B and C are all more effective than the placebo. This is referred to as establishment of downside sensitivity. While we have primary interest in the effectiveness of drug A, we believe it essential to show that the positive control drugs B and C are effective in the context of this study. Whether due to sample selection, sample size or the vagaries of conditions treated by OTC drugs (for example, headaches may go away spontaneously), it is possible that a study may fail to show the effectiveness of an established drug. If this happens, comparisons among drugs A, B and C are not meaningful, since established effects do not appear in the study. While some analysts would consider the finding of A superior to placebo a positive result, even when B or C fails to demonstrate effectiveness, our opinion is that failure to demonstrate a known effect makes the entire study suspect. There are several approaches to establishing downside sensitivity. The primary concern here is control of the experimentwise type I error rate. One approach is to treat the question of downside sensitivity as a separate hypothesis, and compare drugs A, B and C to placebo through Dunnett's multiple comparison procedure, a method specifically designed for the comparison of treatments to a control. A second approach is to include the comparisons for establishment of downside sensitivity in a larger set of comparisons. For example, we may examine all pairwise comparisons between the four treatment groups both to establish downside sensitivity and to test Objective 1, whether drug A is superior to either B or C. Arguments abound that favour either of these approaches, or some other approach. For example, while the authors have never seen it in practice, it is theoretically possible to develop a likelihood ratio test to test simultaneously the superiority of drugs A, B and C to the placebo.
However, most importantly, one should establish downside sensitivity with a multiple comparison procedure or with some other procedure that controls the experimentwise type I error rate. Techniques that control only the comparisonwise error rate are inappropriate in this situation.

7. STUDIES TO ESTABLISH DRUG DIFFERENCES

When the study's objective is to establish differences among the drugs, the primary concern in the selection of a multiple comparison procedure is control of the experimentwise type I error rate. In making pairwise comparisons only, we recommend Tukey's procedure, the most powerful of the experimentwise procedures for examination of all pairwise comparisons. If one also examines other linear contrasts, then we recommend Scheffé's procedure. While Bonferroni's procedure is appropriate in this situation, its power is poor as the number of comparisons increases. Also, one should not use Bonferroni's procedure to examine new contrasts suggested by the data; adjustment of the rejection rule does not compensate for data examination. In studies to establish drug differences, where the need is for control of the experimentwise error rate, we find it natural to use one multiple comparison procedure to make all pairwise comparisons. We would first examine the comparisons that establish downside sensitivity. Given the establishment of downside sensitivity, we would then examine the comparisons between the active drugs.
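As a sketch of this recommendation, scipy.stats.tukey_hsd (SciPy 1.8 and later) carries out Tukey's procedure for all pairwise comparisons; the scores below are illustrative only:

```python
# All pairwise comparisons with Tukey's procedure, which controls the
# experimentwise type I error rate. Data are hypothetical.
from scipy.stats import tukey_hsd

drug_a  = [7.1, 6.8, 7.4, 6.9, 7.2]
drug_b  = [6.5, 6.9, 6.2, 6.7, 6.4]
drug_c  = [6.6, 6.3, 6.8, 6.1, 6.5]
placebo = [4.2, 4.8, 4.5, 4.1, 4.6]

res = tukey_hsd(drug_a, drug_b, drug_c, placebo)
labels = ["A", "B", "C", "P"]
for i in range(4):
    for j in range(i + 1, 4):
        print(f"{labels[i]} vs {labels[j]}: p = {res.pvalue[i, j]:.4f}")
```

With data such as these, the drug-versus-placebo comparisons establish downside sensitivity, after which the A-versus-B and A-versus-C p-values address Objective 1.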

8. STUDIES TO ESTABLISH DRUG EQUALITY

The establishment of equality in effectiveness of OTC drugs involves different considerations than the establishment of differences. While there are procedures to test directly for treatment equality,14-16 examination of treatment equality often entails multiple comparison procedures in the context of ANOVA. Here control of the type II error for pairwise comparisons of the active treatments becomes essential. Multiple comparison procedures that control comparisonwise type I error rates give better type II error protection and so are preferred when acceptance of treatment equality is the desired outcome. Versions of multiple t tests (multiple pairwise t tests or the least significant difference procedure) or Duncan's procedure are appropriate here. Of these, we recommend the least significant difference procedure, which is based on the t test but uses the pooled variance estimate from the overall ANOVA. Duncan's procedure has similar power to techniques based on the t test, but we find it more confusing to present. We emphasize here that the 'acceptance' of the null hypothesis can only lead to a statement of drug equality if there is sufficient sample size to control the type II error. One advantage of a technique based on the t test to establish drug equality is that sample size and power estimates are straightforward. Convention indicates at least 80 per cent power to detect what is considered the minimal clinically meaningful difference. In studies to establish drug equality, we find it natural first to establish downside sensitivity through Dunnett's procedure or some other technique that controls the experimentwise type I error rate. One can then investigate equality of the active drugs through the least significant difference procedure or some other technique that controls the comparisonwise type I error rate.
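The sample size calculation referred to above can be sketched with the usual normal approximation for a two-sided two-sample t test. The minimal clinically meaningful difference and within-group standard deviation below are assumed values for illustration:

```python
# Approximate per-group sample size for 80 per cent power to detect the
# minimal clinically meaningful difference with a two-sided two-sample
# t test (normal approximation). delta and sigma are assumed values.
from scipy.stats import norm

alpha = 0.05   # comparisonwise significance level
power = 0.80   # conventional minimum power
delta = 1.0    # assumed minimal clinically meaningful difference
sigma = 2.0    # assumed within-group standard deviation

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
n_per_group = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
print(f"Approximately {int(round(n_per_group))} subjects per group")
```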

9. OTHER MULTIPLE COMPARISON PROCEDURES

There is a set of stepwise multiple comparison procedures (the Ryan-Einot-Gabriel-Welsch procedures2-4) for all pairwise comparisons when there are equal sample sizes per treatment group. These procedures control the experimentwise type I error rate and are more powerful than Tukey's procedure for all pairwise comparisons. Also, there is a modification to the least significant difference procedure that controls the experimentwise error rate.17 While these procedures are less commonly used, they are certainly appropriate in studies to establish drug differences. Another commonly used multiple comparison procedure is the Newman-Keuls multiple range test.6 While it is often suggested that this procedure controls the experimentwise type I error rate, this is only true under the null hypothesis of all means equal.3,4 In the presence of a placebo assumed different from the other treatments, this procedure does not control experimentwise type I error, and so is not appropriate for Objective 1 studies. The type II error protection for this procedure is poor compared to procedures that control the comparisonwise type I error, and so we do not recommend this procedure for Objective 2 studies.
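The partial-null problem underlying this point, and the conditional-testing fallacy noted in Section 1, can be illustrated by a small Monte Carlo sketch: when the three active drugs are truly equal, unadjusted pairwise t tests among them make at least one false rejection far more often than 5 per cent. All numbers here are simulated, not from any real trial:

```python
# Monte Carlo sketch of the partial-null problem: three truly identical
# active drugs, all pairwise t tests at the comparisonwise level 0.05.
# The chance of at least one false rejection per study well exceeds 0.05.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(12345)
n, reps, alpha = 20, 2000, 0.05

false_families = 0
for _ in range(reps):
    # three active drugs with identical true means (a placebo arm would
    # differ, but is not needed to expose the problem)
    a = rng.normal(1.0, 1.0, n)
    b = rng.normal(1.0, 1.0, n)
    c = rng.normal(1.0, 1.0, n)
    # unadjusted pairwise t tests among the equal drugs
    if any(ttest_ind(x, y).pvalue < alpha
           for x, y in [(a, b), (a, c), (b, c)]):
        false_families += 1

rate = false_families / reps
print(f"Experimentwise type I error among the equal drugs: {rate:.3f}")
```

The simulated rate lands near 12 per cent with three comparisons; with six, it approaches the 26 to 30 per cent range of Section 4.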

10. OTHER CONSIDERATIONS IN STUDY DESIGN

Studies of OTC drugs are at times more complex than our example. There may be a stratified randomization scheme for assignment of subjects to treatments. Multicentre trials involving data collection at several sites are common. There may be baseline data on severity of pain. There usually are several measures of efficacy. Because of these complications, OTC drug studies often require more involved analysis than a one-factor ANOVA. For example, if the study is multicentre, the analysis should include centre as a main-effects factor, and one should consider explicitly in the analysis the interaction of centres and treatments. Often OTC trials involve evaluation of subjects on some baseline measure, such as pain intensity in our headache example, and only those patients with moderate or severe pain enter into the study. We can represent these initial values in the analysis, either as a covariate or as another main factor. With initial condition treated as a factor, the design may become unbalanced, and we must use appropriate procedures for unbalanced data. Again, we need to include appropriate interaction terms. The multiple comparison procedures discussed above do extend to these more elaborate situations. We can account for significant interactions as needed, through use of an appropriate mean square error term. In the case of analysis of covariance or unbalanced designs, we must compare adjusted means. These adjusted means account for differences in covariates or other factors across groups, and their standard errors reflect the association between the covariate and the dependent variable. Standard statistical packages give pairwise comparisons (through multiple t tests) of adjusted means, with comparisonwise p-values. The Bonferroni procedure applies readily to control the experimentwise type I error rate. A related problem is that OTC trials often involve more than one efficacy (dependent) variable.
Studies such as our headache example commonly involve five or more outcome variables (pain scale measured at 1, 2, 3 and 4 hours after medication, an average pain scale, a maximum pain scale, and so on). Multiple testing issues arise in the comparison of groups on this multitude of dependent variables. There are several ways to address this problem. We could employ some multivariate analysis (that is, MANOVA) to compare treatment groups on all outcomes simultaneously. Assuming we reject the null hypothesis, we could then perform separate analyses on each efficacy variable, with a Bonferroni adjustment to the significance level of each of these ANOVAs. Keeping in mind that the goal of these OTC trials is to define clearly the effectiveness of the drugs under study, we prefer a less formal solution. We suggest designation of one or two efficacy variables as the primary variables in the study protocol. To declare differences between treatments, one must show differences on all of these primary efficacy variables and, in general, on most of the other variables.

REFERENCES

1. Carmer, S. G. and Swanson, M. R. 'An evaluation of ten pairwise multiple comparison procedures by Monte Carlo methods', Journal of the American Statistical Association, 68, 66-74 (1973).
2. SAS Institute. SAS User's Guide: Statistics (Version 5 Edition), SAS Institute Inc., Cary, North Carolina, 1985.
3. Einot, I. and Gabriel, K. R. 'A study of the powers of several methods of multiple comparisons', Journal of the American Statistical Association, 70, 351-360 (1975).


4. Ramsey, P. H. 'Power differences between pairwise multiple comparisons', Journal of the American Statistical Association, 73, 479-485 (1978).
5. Kleinbaum, D. G. and Kupper, L. L. Applied Regression Analysis and Other Multivariable Methods, Duxbury Press, North Scituate, MA, 1978.
6. Zar, J. H. Biostatistical Analysis, 2nd Edition, Prentice Hall, Englewood Cliffs, NJ, 1984.
7. Miller, R. G. Simultaneous Statistical Inference, Springer-Verlag, New York, 1981.
8. Scheffé, H. 'A method for judging all contrasts in the analysis of variance', Biometrika, 40, 87-104 (1953).
9. Tukey, J. W. 'Allowances for various types of error rates', unpublished IMS address, Chicago, 1952.
10. Tukey, J. W. 'The problem of multiple comparisons', unpublished manuscript, 1953.
11. Kramer, C. Y. 'Extension of multiple range tests to group means with unequal numbers of replications', Biometrics, 12, 307-310 (1956).
12. Dunnett, C. W. 'New tables for multiple comparisons with a control', Biometrics, 20, 482-491 (1964).
13. Duncan, D. B. 'Multiple range and multiple F tests', Biometrics, 11, 1-42 (1955).
14. Council on Dental Therapeutics. 'Report of workshop aimed at defining guidelines for caries clinical trials: superiority and equivalency claims for anticaries dentifrices', Journal of the American Dental Association, 117, 663-665 (1988).
15. Blackwelder, W. C. 'Proving the null hypothesis true in clinical trials', Controlled Clinical Trials, 3, 345-353 (1982).
16. Makuch, R. W. and Johnson, M. F. 'Some issues in the design and interpretation of negative clinical studies', Archives of Internal Medicine, 146, 986-989 (1986).
17. Hayter, A. J. 'The maximum familywise error rate of Fisher's least significant difference test', Journal of the American Statistical Association, 81, 1000-1004 (1986).
