A Test for the Difference Between Two Treatments in a Continuous Measure of Outcome When There Are Dropouts

Morton B. Brown, PhD

ABSTRACT: Consider a randomized clinical trial that is designed to compare two treatments in which the treatment continues during the entire period of the study. Some subjects may refuse to complete the protocol and will not return for the final evaluation. Since the reason for dropping out may be related to the subject's self-assessed evaluation of the usefulness of the treatment or to undesirable side effects of the treatment, subjects who drop out cannot be treated as a random sample of those who entered the trial. We consider the situation where the measure of efficacy of the treatment is continuous. Under the assumption that the expected value of the measure for those who drop out is not better (the direction depends on the measure) than that for those who complete the study, we propose an adjustment to the usual test for a difference between treatments that allows for the inclusion of the probable effect of the dropouts; this provides a bound on the test for efficacy of the treatment. First, we estimate a predetermined percentile, such as the median score, of the control, or placebo, group and assign this score to all those who dropped out from both groups and to all subjects in both groups with scores that are worse than the assigned score. A Mann-Whitney statistic is then used to test the equality of the distributions of the two groups. We show by simulation that this modified test is equivalent to a test using the complete data and has greater power than that obtained when including the dropouts by assigning the worst observed score to them. This test will be less sensitive to bias that is induced by exclusion of dropouts from the final evaluation.

KEY WORDS: Nonresponse, withdrawals, Mann-Whitney test, two-sample test, imputation

Address reprint requests to: Morton B. Brown, PhD, Department of Biostatistics, School of Public Health, University of Michigan, 109 South Observatory, Ann Arbor, MI 48109-2029.
Controlled Clinical Trials 13:213-225 (1992). © Elsevier Science Publishing Co., Inc. 1992, 655 Avenue of the Americas, New York, New York 10010. 0197-2456/92/$5.00

INTRODUCTION

A major concern in the analysis of randomized clinical trials is the treatment of data from subjects who do not complete the required protocol. This problem is especially acute in the analysis of trials of slow-acting drugs that must be continuously taken throughout the study [1]. In these studies subjects may drop out if they perceive that their treatment is not beneficial or if there are side effects that cause them to be removed from the study medication. In either case the subject can be considered to be a failure of the treatment received. Because subjects who complete a trial may differ from those who do not, the exclusion of the latter from the analysis may produce either an overly optimistic estimate of the efficacy of treatment when the withdrawals are primarily from the treatment group or, conversely, an underestimate of the efficacy when the withdrawals are primarily from the placebo group. For this reason Peto et al [2] state that it is necessary to analyze data for all subjects based on the initial allocation to groups, i.e., "intent to treat."

When the outcome measure is dichotomous (e.g., death or recurrence), a conservative procedure is to assign subjects for whom there is no final examination as failures prior to the analysis of the data. An alternative, less conservative procedure would be to stratify subjects by recognized risk factors and to allocate dropouts within each stratum to outcomes proportionally to the observed outcomes.

When the outcome measure is continuous, it is less clear how to include data from those who refuse to return for a final examination. Bombardier and Tugwell [1] summarize various possible methods of data analysis:

1. Exclude dropouts from the analysis. This is the most common method but can introduce bias since those who drop out differ from those who complete the study.
2. Include all subjects that were randomized in their "intent-to-treat" group. This requires a final examination for all subjects.
3. Use the last observed value for each subject, including those who drop out prematurely. Although this is likely to be less biased than method 1, bias may still be present since there is no reason to assume that the value of the outcome measure would remain unchanged over time had the subject remained in the study.
4. Extrapolate the values of the outcome measure for those who drop out to estimate what their outcome would have been at the end of the study had they remained until the end.
This may introduce large differences (either positive or negative) when the subject drops out early in the study following a large initial change.

Murray and Findlay [3] propose a method for imputing values for those who drop out based on the assumption that the data are missing at random. Values imputed in this manner are only appropriate when the underlying probability model that induces dropouts is one of ignorable nonresponse [4]. There is little reason for assuming that this underlying assumption is correct since the decision of the subject to drop out may be affected by his or her perception of the efficacy of the treatment for the individual case.

When one assumes that the decision to drop out is affected by the current status of the subject, it is not possible to impute a value to the subject had he or she continued to the end of the study. However, since most studies include a control group on placebo, or conventional, therapy, it may be reasonable to assume that on the average those who drop out will not do better than the control subjects. When the outcome measure is dichotomous, the conventional intent-to-treat analysis includes dropouts in the group of those who do not improve or do worse; we propose an equivalent approach for a continuous outcome. First, estimate a typical value that can be used to represent the dropouts and then use this value for all subjects who drop out or do worse than this typical value, i.e., the distributions of the two groups are truncated at this typical value. After transforming the data in this manner, the equality of the distributions of the two groups is tested using a Mann-Whitney rank sum statistic. Since all subjects are included in the analysis, this is an intent-to-treat analysis.

Simulation is used to study the effect of the level of truncation and the effect of the pattern of dropout. The proposed statistic is compared by simulation to the $\chi^2$ statistic that would be formed if the outcome measure was dichotomized at the typical value. We show that the median of the placebo group is a good choice (in the sense of maximizing power) for use as a typical value and that the proposed test statistic is more powerful than a $\chi^2$ statistic that is appropriate for a dichotomous outcome.

An example in which such an analysis may be appropriate is a clinical study of a drug for the treatment of arthritis. The subject in such a study may be required to take the drug continuously for a fixed period, such as 1 or 2 years. A measure of efficacy is the change in the number of swollen joints and/or the number of painful joints during the trial. A change in the measure of outcome can be expected in the placebo group due to a placebo effect, e.g., an average improvement as large as 30% has been observed in the placebo group. The assignment of the worst observed outcome (i.e., change) to all the dropouts may be too conservative an approach. It is desirable to use an alternative method to adjust for dropouts, such as the proposed test statistic.

DEVELOPMENT OF THE TEST STATISTIC

Let $x$ represent the value of the measure of efficacy for the control group that is usually observed, but $x$ is not observed when the subject is a dropout. Let $y$ have a similar meaning for the alternative treatment. Assume that

$$x \sim F(x, \theta_x) \quad \text{and} \quad y \sim G(y, \theta_y)$$

where $F(\cdot)$ and $G(\cdot)$ represent cumulative distribution functions and $\theta_x$ and $\theta_y$ are the medians of the respective distributions. The usual null and alternate hypotheses can be stated as:

$$H_0: \theta_x = \theta_y \qquad H_1: \theta_x < \theta_y$$

We have chosen to replace these hypotheses by the stochastic inequality:

$$H_0: F(z, \theta_x) = G(z, \theta_y) \quad \text{for all } z$$

$$H_1: F(z, \theta_x) > G(z, \theta_y) \quad \text{for all } z$$

When the two distributions are identical except for a possible change in location (but not scale), then the two sets of hypotheses are equivalent. When the two densities differ in form, the distributional inequality implies the inequality of medians (or percentiles) but the converse is not necessarily true. Conventional parametric statistical hypotheses assume that the distributional forms are similar. Obvious modifications need to be made in the test statistic when the inequality is reversed. Also, the extension to two-sided alternate hypotheses is direct.
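As a brief illustration of the location-shift case, suppose the treated distribution is the control distribution shifted by an amount $\delta > 0$. The symbol $\delta$ and the requirement that $F$ be strictly increasing are assumptions made here for the illustration and do not appear in the original text. Under those assumptions, the stochastic inequality and the ordering of the medians hold together:

```latex
\begin{aligned}
G(z, \theta_y) &= F(z - \delta, \theta_x) < F(z, \theta_x) \quad \text{for all } z, \\
G(\theta_x + \delta, \theta_y) &= F(\theta_x, \theta_x) = \tfrac{1}{2}
\;\Longrightarrow\; \theta_y = \theta_x + \delta > \theta_x .
\end{aligned}
```

Setting $\delta = 0$ gives $F = G$ and $\theta_x = \theta_y$, which is the null hypothesis in both formulations.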


Let $K$ be any constant such that $K$ is less than the 100th percentile of $F$. If all observations in both groups that are less than $K$ were replaced by $K$, then the stochastic inequality $F(z, \theta_x) > G(z, \theta_y)$ would continue to hold for the distributions $F^*(z, \theta_x \mid K)$ and $G^*(z, \theta_y \mid K)$ that represent the distributions truncated at the value $K$. If one could determine a value $K$ such that $K$ is greater than or equal to the expected outcome for all dropouts, then it would be possible to use $F^*$ and $G^*$, formed by truncating the distributions of $F$ and $G$ at $K$, as a surrogate for testing the equality of $F$ and $G$.

In a trial in which the treatment continues over a long period of time and the reason for dropping out is likely to be the perceived ineffectiveness of the treatment, one approximation to an upper bound for the dropouts is the median of the placebo subjects. There are two possible ways to compute this median: (1) as the sample median ($k_1$) of all the placebo subjects who complete the trial, or (2) as the sample median ($k_2$) of all the placebo subjects who start the trial, where the subjects who drop out are assigned a score less than the median. Then $k_1 > k_2$ since $k_1$ is the $(100 + NR)/2$ percentile of the distribution that includes dropouts (the intent-to-treat distribution), where $NR$ is the percent of dropouts in the control group.

Any distribution-free test of the stochastic inequality $F^*(z, \theta_x \mid k) > G^*(z, \theta_y \mid k)$ may be used. We have chosen to use the Mann-Whitney statistic because of its recognized efficiency relative to the two-sample t test even when the data are normal.

We propose to construct the following test statistic:

1. Estimate the median ($k$) of the control group, either excluding dropouts or including dropouts as equal to the worst value in that group (the method should be predetermined).
2. Replace all values in both groups that are less than $k$ by $k$. This is a one-sided truncation of the sample distributions using $k$ as the limit for truncation.
3. Set the values for all dropouts equal to $k$.
4. Compute the Mann-Whitney test statistic on the modified data using an appropriate correction for ties.
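The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's program: the function names (`midranks`, `mann_whitney_z`, `truncated_mw_z`) are hypothetical, dropouts are marked with `None`, larger scores are assumed to be better, the median is taken over control completers only (the first of the two variants in step 1), and the usual normal approximation with a tie-corrected variance is used for the Mann-Whitney statistic.

```python
import math
from statistics import median


def midranks(values):
    """Return 1-based ranks (ties get the average rank) and the tie-group sizes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def mann_whitney_z(x, y):
    """Tie-corrected normal-approximation z based on the rank sum of x."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks, tie_sizes = midranks(list(x) + list(y))
    r1 = sum(ranks[:n1])
    tie_term = sum(t ** 3 - t for t in tie_sizes) / 12
    var = n1 * n2 / (n * (n - 1)) * ((n ** 3 - n) / 12 - tie_term)
    return (r1 - n1 * (n + 1) / 2) / math.sqrt(var)


def truncated_mw_z(control, treated):
    """Steps 1-4; None marks a dropout and larger scores are assumed better."""
    # Step 1: median of the control completers (assumed variant).
    k = median([v for v in control if v is not None])

    def modify(group):
        # Steps 2 and 3: scores below k and all dropouts are set to k.
        return [k if v is None or v < k else v for v in group]

    # Step 4: Mann-Whitney statistic on the modified data.
    return mann_whitney_z(modify(control), modify(treated))


# Made-up scores for illustration only.
control = [1, 2, None, 6, 4, None, 0, 5]
treated = [7, 5, None, 9, 2, 6, 8, 4]
z = truncated_mw_z(control, treated)  # negative when the control ranks lower
```

Because the statistic is built from the rank sum of the first (control) group, a beneficial treatment produces a negative `z` in this sketch; for a two-sided test the absolute value would be compared to the critical value.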

POWER OF THE TEST STATISTIC

Since as many as one-half of the observed values of the control group as well as some in the treatment group are being reset to a common value, it is anticipated that there will be a reduction in the power for a specific expected difference. To evaluate this effect on the test statistic, the following simulation was performed:

Let $n_1 = n_2 = N$ ($N = 11, 21, 31, \ldots, 101$) be the sample size of each group. For each $N$, 1000 pairs of samples of size $N$ were drawn from a standard normal distribution. Uniform random numbers were generated by the algorithm of Wichmann and Hill [5] and then the Box and Muller [6] transformation was applied to obtain standard normal deviates. To determine which observations to treat as dropouts, the following three patterns were used:


1. Uniform: each observation has the same probability of being a dropout.
2. Linear: the probability of being a dropout is proportional to the rank of the observation, where rank 1 represents the best outcome and $N$ the worst outcome.
3. Quadratic: the probability of being a dropout is proportional to the rank squared, where the ranking is from best (1) to worst ($N$).

The probability of being a dropout (denoted by $d$) was set at 0%, 20%, or 33% and applied to each group independently. That is, the two groups were not merged together when assigning ranks and the expected number of dropouts was the same in the two groups. If the two groups had been merged together prior to assignment of the ranks, fewer observations in the treated group would be designated as being dropouts and the power of the test statistic would be increased relative to the method used. Since side effects of the treatment may increase the number of dropouts in the treated group relative to the control group, the method of selecting dropouts used in the simulation was chosen to be conservative.

The six patterns of deletion for a sample size of $N = 50$ are presented in Figure 1. The horizontal lines represent uniform dropout; the straight lines represent linear dropout; and the curves represent quadratic dropout patterns. Note that the linear and quadratic patterns bias the selection of observations so that subjects with "worse" results are more likely to become dropouts than those with "better" results; uniform dropout is not biased in this manner.

Figure 1. Patterns used to create dropouts. Observations in each sample are ranked from 1 (best) to N (worst). N = 50. (Horizontal axis: rank of observation, 1 = best, 50 = worst; curves: uniform, linear, and quadratic patterns for d = 0.2 and d = 0.33.)

If the population median of the control group were used as the typical value for truncation and the null hypothesis were true, 50% of the dropouts would be misclassified into the failure category under the uniform pattern, 25% would be misclassified under the linear pattern, and only 12.5% would be misclassified under the quadratic pattern. However, when the alternative hypothesis is true, the misclassification rates for the placebo group will remain unchanged, but those for the treated group will increase since the typical value corresponds to a lower percentile in the treated group than in the placebo group; hence, the method of replacing dropouts induces bias against the treatment effect.

Let $\Delta$ ($\Delta = 0.0, 0.1, 0.2, \ldots, 1.0, 1.25, 1.5$) be the expected difference in the outcome measure between the two groups. The first group was considered to be the control, or placebo, group. The Mann-Whitney statistic was evaluated for each pair of samples where the expected difference $\Delta$ was added in turn to each observed value in the second group. The formula for the Mann-Whitney statistic included an adjustment for tied observations [7]:

$$MW = \frac{R_1 - n_1(n_1 + n_2 + 1)/2}{\left( \dfrac{n_1 n_2}{(n_1 + n_2)(n_1 + n_2 - 1)} \left[ \dfrac{(n_1 + n_2)^3 - (n_1 + n_2)}{12} - \sum T \right] \right)^{1/2}}$$

where $n_1$ and $n_2$ are the sample sizes of the two groups, $R_1$ is the sum of the ranks of the first group, and $\sum T = \sum (t^3 - t)/12$, where $t$ is the number of values that have tied ranks (the sum runs over the groups of tied values).

The Mann-Whitney statistic was first computed for the observed data (after dropouts were identified by the appropriate pattern). It was then recomputed to obtain the modified statistic by using, in turn, five different percentile levels to determine the truncation value in the first group (the control group) that is to be assigned to all lower values. The rank orders of the observations to be assigned as the truncation values were the rounded integers corresponding to $(N + 1)$ multiplied by the following factors: 1/6, 1/3, 1/2, 2/3, 5/6. Note that the factor 1/2 corresponds to the use of the median; factors greater than 1/2 to values in the upper tail of the distribution and factors less than 1/2 to values in the lower tail. All values in both groups that were less than the identified value were reassigned to that value.

In addition to the Mann-Whitney statistic, we computed a $\chi^2$ statistic that treated the outcome as a dichotomous variable where failure was defined as less than or equal to the truncation value. All dropouts were treated as failures.

All Mann-Whitney statistics were compared to a critical value of 1.96, which corresponds to a two-sided 5% test of significance, and the number of rejections was tabulated. Similarly, the $\chi^2$ statistic was compared to 3.84; no correction for continuity was included in calculation of the statistic. Since 1000 repetitions were performed, the standard errors for the estimates of size and power should not exceed 0.7% and 1.6%, respectively.

RESULTS AND DISCUSSION

Figure 2a contains plots of the power of the Mann-Whitney test as a function of $\Delta$ at each of several different percentile (truncation) levels that were used to choose the typical value.
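A single point on a power curve of this kind can be approximated with a short Monte Carlo sketch. The code below is an illustration under stated assumptions rather than the original program. All function names are hypothetical. Python's `random.gauss` stands in for the Wichmann and Hill generator with the Box and Muller transformation. Each observation is dropped independently with its rank-based probability, and that probability is capped at 1. Larger values are assumed to be better, and the truncation value is the median of the control completers. The defaults mirror one of the conditions described in the text: $N = 50$, the quadratic pattern, and $d = 0.33$.

```python
import math
import random
from statistics import median


def dropout_probs(n, d, power):
    """Dropout probability for ranks 1 (best) .. n (worst), proportional to rank**power.

    power = 0, 1, 2 gives the uniform, linear, and quadratic patterns; the
    probabilities are scaled so that their mean equals d (and capped at 1).
    """
    weights = [rank ** power for rank in range(1, n + 1)]
    scale = d * n / sum(weights)
    return [min(1.0, scale * w) for w in weights]


def apply_dropout(sample, probs, rng):
    """Mark dropouts with None; the largest value is treated as rank 1 (best)."""
    order = sorted(range(len(sample)), key=lambda i: -sample[i])
    out = list(sample)
    for rank0, idx in enumerate(order):
        if rng.random() < probs[rank0]:
            out[idx] = None
    return out


def truncate(group, k):
    """Set dropouts and all values below k to k."""
    return [k if v is None or v < k else v for v in group]


def mw_z(x, y):
    """Tie-corrected Mann-Whitney z based on the rank sum of x."""
    pooled = sorted(list(x) + list(y))
    n1, n = len(x), len(x) + len(y)
    rank_of, tie_term, i = {}, 0.0, 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + j) / 2 + 1  # average rank of the tie group
        t = j - i + 1
        tie_term += (t ** 3 - t) / 12
        i = j + 1
    r1 = sum(rank_of[v] for v in x)
    var = n1 * (n - n1) / (n * (n - 1)) * ((n ** 3 - n) / 12 - tie_term)
    return (r1 - n1 * (n + 1) / 2) / math.sqrt(var)


def estimate_power(delta, n=50, d=0.33, power=2, reps=200, seed=1):
    """Fraction of replicates with |z| > 1.96 under median truncation."""
    rng = random.Random(seed)
    probs = dropout_probs(n, d, power)
    rejections = 0
    for _ in range(reps):
        control = apply_dropout([rng.gauss(0, 1) for _ in range(n)], probs, rng)
        treated = apply_dropout([rng.gauss(delta, 1) for _ in range(n)], probs, rng)
        k = median([v for v in control if v is not None])
        if abs(mw_z(truncate(control, k), truncate(treated, k))) > 1.96:
            rejections += 1
    return rejections / reps
```

Calling `estimate_power(delta)` over a grid of `delta` values traces out an approximate power curve. The default of 200 replicates is chosen only to keep the sketch fast and gives a noisier estimate than the 1000 replicates used in the study.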
We present the plots when each sample has 50 observations and the pattern of dropout followed the quadratic loss pattern with an average dropout rate per group equal to 33%. The curves representing median truncation and 67% truncation are similar and both have higher power than the other curves. These results are typical of the many conditions that were studied. The rationale for this result is that when the groups differ, a truncation value that is above the result for many of the control subjects is desirable, but it has to be balanced with not eliminating too many of the treatment subjects since the rankings of the treatment subjects contribute to the sensitivity of the Mann-Whitney statistic.

Figure 2. Power curves as a function of level of truncation. The dropout rate is 33%, the pattern of loss is quadratic, and N = 50 for each group. (a) The Mann-Whitney z statistic. (b) The $\chi^2$ statistic. (Horizontal axis: Delta; curves: no truncation and truncation at 16%, 33%, the median, 67%, and 83%.)


By contrast, in Figure 2b we present the power curves obtained when using the $\chi^2$ statistic under the same set of conditions. Note that the power increases as the truncation level increases with maximum power observed at either 67% or 83% truncation. This is due to the high dropout rate (33%) in this study; the median of the combined sample will approach but not exceed 83% at maximum truncation since the 33% dropout rate will depress the overall median of the two groups. The low power at 0% truncation is an artifact of the truncation procedure since there can only be one observation in one of the cells for the control group and a high $\chi^2$ will result only if multiple values of the treated group are less than the typical value (which is the opposite effect from that expected).

The effects of pattern of dropout on the power curves are presented in Figure 3a and b for the Mann-Whitney and $\chi^2$ statistics, respectively. For all patterns the number of observations in each sample is 50 and median truncation is used. Since there are more observed values when there are no dropouts ($d = 0$) and fewest when the dropout rate is greatest, the power curves are ordered by dropout rate. Within a dropout rate, the highest power is for quadratic loss and the lowest is for uniform loss. This is as expected since quadratic loss has relatively more dropouts in the "worse" subjects than in the "better" subjects whereas uniform loss evenly divides the dropouts between the better and worse subjects. Therefore, with quadratic loss there is less misclassification of dropouts as failures than with uniform loss.

In Figure 4a and b we compare the Mann-Whitney statistic to the $\chi^2$ statistic that is created by dichotomizing each sample at the truncation value. We plot power as a function of $\Delta$ for three sample sizes ($N = 20, 50, 100$) using the median of the first group as the truncation value for both groups. In Figure 4a the dropout rate is 33% and the quadratic pattern was used to generate the dropouts; in Figure 4b the dropout rate is 20% and uniform loss was used. As expected, the power increases as the sample size increases. The Mann-Whitney z statistic is more powerful than the $\chi^2$ statistic for all levels of truncation and patterns of generating the dropouts since it includes more information from the observed data.

If it were known that the pattern of dropouts was uniform with respect to rank, then no bias would be induced by ignoring the dropouts and using only the observed respondents. In Figure 5a and b we compare the statistics that include dropouts as failures to the equivalent statistics if dropouts were ignored. In Figure 5a the pattern of dropout is quadratic and in 5b it is uniform. The powers of the statistics that include dropouts are lower than those that ignore dropouts since the assignment of dropouts to the truncation value induces misclassification bias, which affects the treatment group more than the control group. However, the proposed Mann-Whitney statistic has a power curve that is similar to that of the $\chi^2$ statistic that ignores dropouts in spite of the large dropout rate (33%). That is, the use in the Mann-Whitney statistic of the information in the ranks of the observed data above the truncation value compensates with respect to power for the misclassification induced by including the dropouts.

There is a need for methods that will enable the analysis of randomized clinical trials based on the original intent-to-treat groups. This can be difficult when there are substantial numbers of dropouts and the subjects are unwilling
