Research Article

Received 4 August 2013, Accepted 27 November 2014

Published online in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/sim.6394

Sample size calculation for the one-sample log-rank test

René Schmidt,a*† Robert Kwiecien,a Andreas Faldum,a Frank Berthold,b Barbara Herob and Sandra Liggesa

An improved method of sample size calculation for the one-sample log-rank test is provided. The one-sample log-rank test may be the method of choice if the survival curve of a single treatment group is to be compared with that of a historic control. Such settings arise, for example, in clinical phase-II trials if the response to a new treatment is measured by a survival endpoint. Present sample size formulas for the one-sample log-rank test are based on the number of events to be observed; that is, in order to achieve approximately a desired power for allocated significance level and effect, the trial is stopped as soon as a certain critical number of events is reached. We propose a new stopping criterion to be followed. Both approaches are shown to be asymptotically equivalent. For small sample sizes, though, a simulation study indicates that the new criterion might be preferred when planning a corresponding trial. In our simulations, the trial is usually underpowered and the aspired significance level is not fully exploited if the traditional stopping criterion based on the number of events is used, whereas a trial based on the new stopping criterion maintains power with the type-I error rate still controlled. Copyright © 2014 John Wiley & Sons, Ltd.

Keywords: one-sample log-rank test; phase-II trial; power calculation; sample size

1. Introduction

The primary aim of phase-II trials is to test a large range of potentially useful treatments against a standard, accepting only those for a phase-III trial that exceed the standard, while rejecting the majority of them as inferior to the standard. Traditional designs in phase-II cancer trials are single-arm designs with a binary outcome as primary endpoint, for example, tumor response. The binary outcome is then defined by whether or not a response to treatment is observed after a fixed time span. The response rate of the (small) population under the new treatment is then compared with that of a historic control. There are trial settings where a binary endpoint is not desirable or appears inadequate and where a survival endpoint is more appropriate in order to account, for example, for potential loss to follow-up. For such settings, Finkelstein et al. [1] as well as Sun et al. [2] propose single-stage designs based on the one-sample log-rank test. The one-sample log-rank test was first proposed by Breslow [3] and subsequently considered by Gail and Ware [4] and Woolson [5]. This test allows the survival curve of a new treatment to be compared with that of a historic control. Finkelstein et al. [1] and Sun et al. [2] both give a sample size formula (Equations [7] and [2], respectively, in their notation) for power requirements of the one-sample log-rank test based on the number of events D to be observed. Finkelstein et al. [1] consider a superiority setting, whereas Sun et al. [2] also allow a noninferiority setting. In this paper, we provide a detailed derivation of the (asymptotic) power formula for the one-sample log-rank test. Whereas Finkelstein et al. [1] obtain their sample size formula using properties of the Poisson distribution, our starting point is a counting process approach. It turns out that there are actually two stopping criteria that can be followed in order to achieve approximately a desired power for the one-sample log-rank test.
a Institute of Biostatistics and Clinical Research, University of Münster, Münster 48149, Germany
b Department of Pediatric Oncology and Hematology, University of Cologne, Cologne 50937, Germany
*Correspondence to: René Schmidt, Institute of Biostatistics and Clinical Research, University of Münster, Münster 48149, Germany.
†E-mail: [email protected]

Copyright © 2014 John Wiley & Sons, Ltd.

Statist. Med. 2014

Next to the number of events D, the sum of cumulative hazards of the patients EH can be monitored, and the analysis is to take place as soon as D or EH reaches an a priori determined critical value d or e, respectively. We will prove that the two underlying power approximations of the one-sample log-rank test are asymptotically equivalent. However, a simulation study indicates that the two approximations perform differently well for small sample sizes, with the procedure based on EH to be favored.

The clinical example for this paper is the METRO-NB 2012 trial (EudraCT number: 2011-004593-29). This is a prospective multicenter phase-II trial to determine the efficacy and feasibility of celecoxib in combination with low-dose metronomic application of oral cyclophosphamide, oral etoposide, and i.v. vinblastine in children and adolescents with recurrent or progressive neuroblastoma. The duration of treatment is 12 months. The primary endpoint is event-free survival (EFS), defined as the time from start of treatment up to progression (emerged from residual tumor and progressive disease), recurrence (developing from complete remission achieved by metronomic treatment), drop-out for unacceptable toxicity, secondary malignant neoplasm, or death from any cause. In the setting of the METRO-NB 2012 trial, the historic control is based on 232 patients from the trial described in Simon et al. [6]. Of this cohort, 14 patients had to be excluded, as no information on the second recurrence had been documented. The data of the remaining 218 patients, which are representative of the underlying population, allowed for the first time a Kaplan–Meier estimate of EFS after the first recurrence of pediatric patients with relapsed, high-risk neuroblastoma and will be used as historic control within the METRO-NB 2012 trial.

This paper is structured as follows. Section 2 is concerned with the one-sample log-rank test. In Section 2.1, the test statistic and the sample size formulas based on the number of events D and the sum of the cumulative hazards EH are given, respectively.
Formulas for the expected calendar time of analysis and the expected number of patients to be included in order to reach these stopping criteria are stated in Section 2.2. The differences between the two sample size approaches are explored by means of a simulation study in Section 3. In Section 3.1, the choice of scenarios is described. The results of the simulation study are presented in Section 3.2. In particular, the power performance and ability to control the type-I error rate for small sample sizes are investigated. A detailed summary of the results of the simulations is provided in Tables 1–4 and Figures 1–5, which are available online as supplementary material. We conclude with a discussion in Section 4. Proofs of mathematical results including the derivation of the approximate power formula for the one-sample log-rank test are given in the Appendix (available online as supplementary material). In the sequel, we will distinguish between calendar time and study time, which in general do not coincide. On the study time scale, time zero for each individual corresponds to the time of some initiating event. Depending on the issues to be studied, the initiating event could be time of start of treatment, time of remission, and so on. Unless otherwise specified, the study time will be denoted by Latin t, while the calendar time (which is usually defined as zero at the start of the trial) will be denoted by Greek 𝜏 in the sequel. Equation numbers such as (1), (2), … refer to this paper, whereas numbers such as (A.1), (A.2), … refer to equations from the supplementary material.

2. One-sample log-rank test

Consider a phase-II clinical trial where a new treatment is to be compared with a treatment of a historic control by means of a time-to-event endpoint. Let λH(t) denote the known hazard function for the historic control population, let λ(t) denote the (unknown) hazard function for the patients receiving the new treatment, and let R(t) = λ(t)/λH(t) be the ratio of the hazard function for patients in the phase-II trial to that of the historic control. We consider testing hypotheses of the form

H0: R(t) ⩾ γ0  vs.  H1: R(t) < γ0  for all t ⩽ tmax   (1)

for some fixed γ0 ∈ ℝ and some fixed maximal study time tmax < ∞ (i.e., each patient will be censored not later than study time tmax). The planning alternative hypothesis underlying power calculations is H1: R(t) = γ1 for some 0 < γ1 < γ0. We explicitly exclude situations where R(t′) ⩾ γ0 and R(t′′) < γ0 for some t′, t′′ < tmax. Instead, we assume that either R(t) ⩾ γ0 for all t < tmax or R(t) < γ0 for all t < tmax. Then, the parameter space defined by H1 in Equation (1) is the complement of the parameter space defined by H0. This assumption is fulfilled, for example, in the case of proportional hazards, that is, if R(t) ≡ R does not depend on study time t. Intuitively, γ0 may be interpreted as an upper bound for the aforementioned hazard ratio R(t), above which the treatment is considered inadequate and is abandoned. Likewise, γ1 denotes another bound, below which the treatment is judged as potentially useful.


2.1. Test statistic and sample size formulas

Assume that n patients are enrolled in the phase-II clinical trial introduced previously. Let Ti and Ci, i = 1, …, n, denote their true unobservable survival and noninformative censoring times (on the study time scale), which are assumed to be mutually independent. Observable random variables are the pairs of right-censored survival times Xi and event indicators Δi indicating whether patient i was censored or experienced an event:

Xi = min(Ti, Ci),  Δi = I(Ti ⩽ Ci),  i = 1, …, n.

Based on these random variables, we define D := ∑_{i=1}^n Δi as the number of events observed in the new treatment arm. Likewise, let EH := ∑_{i=1}^n ΛH(Xi), where ΛH(x) = ∫_0^x λH(t)dt is the cumulative hazard function of the historic control. We may interpret EH as the expected number of events if patients were treated with the treatment of the historic control (cf. Aalen et al. [7], p. 129). Likewise, γ0 EH and γ1 EH may be interpreted as the expected number of events in the new trial if the planning null hypothesis H0: R(t) = γ0 or the planning alternative hypothesis H1: R(t) = γ1 is true, respectively. As a consequence of martingale arguments (Aalen et al. [7], p. 129), the statistic (D − γj EH)/√(γj EH) is approximately standard normally distributed if the planning hypothesis Hj: R(t) = γj, j = 0, 1, holds true (for details, see Appendix A.1 of the supplementary material). Thus, a one-sample test for H0: R(t) ⩾ γ0 at approximately one-sided significance level α can be defined by rejecting H0 if and only if

Z(n) := (D − γ0 EH)/√(γ0 EH) = (∑_{i=1}^n Δi − γ0 ∑_{i=1}^n ΛH(Xi)) / √(γ0 ∑_{i=1}^n ΛH(Xi)) ⩽ −Φ⁻¹(1 − α),   (2)
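The rejection rule in Equation (2) is straightforward to evaluate from the observed data. The following Python sketch (function and argument names are ours, not from the paper) computes Z(n) for given censored times Xi, event indicators Δi, and a user-supplied cumulative hazard ΛH of the historic control:

```python
import math
from statistics import NormalDist

def one_sample_logrank(x, delta, Lambda_H, gamma0=1.0, alpha=0.025):
    """Z(n) of Equation (2) and the one-sided level-alpha rejection decision.

    x        -- right-censored survival times X_i (study time scale)
    delta    -- event indicators Delta_i (1 = event, 0 = censored)
    Lambda_H -- cumulative hazard function of the historic control
    """
    d = sum(delta)                           # number of observed events D
    e_h = sum(Lambda_H(xi) for xi in x)      # E_H = sum of cumulative hazards
    z = (d - gamma0 * e_h) / math.sqrt(gamma0 * e_h)
    reject = z <= -NormalDist().inv_cdf(1 - alpha)   # reject H0: R(t) >= gamma0
    return z, reject
```

For instance, with ΛH(x) = x (unit hazard), four patients with censored times 1, 2, 3, 4 and events for the first and third give D = 2, EH = 10, and Z = −8/√10 ≈ −2.53, so H0 would be rejected at the one-sided 2.5% level.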

with Φ(⋅) denoting the standard normal distribution function. This test is referred to as the one-sample log-rank test. As illustrated in Appendix A.1 (supplementary material), the asymptotic power of the one-sample log-rank test is essentially determined by EH. In order to achieve approximately a power of at least 1 − β for the one-sample log-rank test (2) for allocated significance level α and effect θ := γ1/γ0, the analysis of data is to be performed as soon as EH reaches the critical value e given by

γ0 e ⩾ ((Φ⁻¹(1 − α) + √θ Φ⁻¹(1 − β)) / (1 − θ))².   (3)

Monitoring of EH = ∑_{i=1}^n ΛH(Xi) may be performed in the course of the trial as follows. The trial starts at calendar time τ0, say. For any τ ⩾ τ0, let n(τ) denote the total number of patients in the trial at calendar time τ. Let x1, …, xn(τ) denote the observed censored survival times from these n(τ) patients, censored not later than at τ; that is, for patients with event or censoring before τ, xi is the time to event or censoring, whichever came first, and those patients who are still event-free and uncensored infinitesimally before calendar time τ are censored at calendar time τ. From these data, ∑_{i=1}^{n(τ)} ΛH(xi) may be computed for any τ ⩾ τ0. For τ = τ0, we have ∑_{i=1}^{n(τ0)} ΛH(xi) = 0 by definition. Because the value of ∑_{i=1}^{n(τ)} ΛH(xi) observed at calendar time τ is increasing in τ, there is usually a calendar time τ = τend where condition (3) with e = ∑_{i=1}^{n(τend)} ΛH(xi) is fulfilled for the first time. We usually have τend < ∞. In case of severe overestimation of the true treatment effect, however, it may happen that the critical value e is not reached based on the a priori planned number of n patients. This aspect is studied in more detail in Section 3. Therefore, it appears recommendable to additionally fix a maximal trial duration in the trial protocol, that is, a calendar time τmax at which the analysis of data is performed at the latest if the critical value e has not been reached until then. All in all, the analysis of data is performed at the (random) calendar time τfinal := min{τmax, τend}, and the test statistic Z(n) from (2) based on these n := n(τfinal) patients is calculated. So, the value ∑_{i=1}^{n(τ)} ΛH(xi) has to be monitored during the trial.

Alternatively, the schedule of the analysis may be based on the number of observed events D.
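The calendar-time bookkeeping behind either stopping rule can be sketched as follows; this is a minimal illustration under our own naming, assuming the calendar entry time and the study-time to event (if any) are known for every enrolled patient:

```python
def monitor(tau, accrual, event_time, Lambda_H):
    """Compute (D(tau), E_H(tau)) from the data available at calendar time tau.

    accrual    -- calendar entry times u_i of the patients
    event_time -- study time from entry to event (float('inf') if event-free)
    A patient still event-free at tau is censored at study time tau - u_i.
    """
    d, e_h = 0, 0.0
    for u, t in zip(accrual, event_time):
        if u > tau:
            continue                     # patient not yet enrolled at tau
        if u + t <= tau:                 # event already observed by tau
            x, delta = t, 1
        else:                            # administratively censored at tau
            x, delta = tau - u, 0
        d += delta
        e_h += Lambda_H(x)
    return d, e_h
```

Evaluating this at each monitoring date and comparing e_h against the critical value e from Equation (3) reproduces the EH-based stopping rule; the D-based rule compares d against its own critical value instead.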
In order to achieve approximately a power of at least 1 − β for the one-sample log-rank test for allocated significance level α and effect θ, the analysis of data is to be performed as soon as D reaches the critical value d given by

d ⩾ θ ((Φ⁻¹(1 − α) + √θ Φ⁻¹(1 − β)) / (1 − θ))²   (4)
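Equations (3) and (4) are easy to evaluate numerically. The following sketch (our own helper; d is rounded up to the next integer event count) computes both critical values for given design parameters:

```python
import math
from statistics import NormalDist

def critical_values(theta, gamma0=1.0, alpha=0.025, beta=0.2):
    """Critical values e (Equation (3)) and d (Equation (4)), theta = gamma1/gamma0."""
    q = NormalDist().inv_cdf
    core = ((q(1 - alpha) + math.sqrt(theta) * q(1 - beta)) / (1 - theta)) ** 2
    e = core / gamma0               # analyze as soon as E_H >= e
    d = math.ceil(theta * core)     # analyze as soon as D >= d
    return e, d
```

For θ = 0.4 this gives e ≈ 17.25 and d = 7; for θ = 0.8, e ≈ 183.97 and d = 148, matching the values used in the simulation study of Section 3.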

(Finkelstein et al. [1]). Because the number of events D(τ) at calendar time τ is increasing in τ, there is usually a calendar time τ = τend where condition (4) with d = D(τend) is fulfilled for the first time. We usually have τend < ∞. In case of severe underestimation of the true treatment effect, it may happen that the critical value d is not reached based on the a priori planned number of n patients. Therefore, also for this stopping criterion, it appears recommendable to additionally fix a maximal trial duration in the trial protocol, that is, a calendar time τmax at which the analysis of data is performed at the latest if the critical value d has not been reached until then. All in all, the analysis of data is performed at the (random) calendar time τfinal := min{τmax, τend}, and the test statistic Z(n) from Equation (2) based on these n := n(τfinal) patients is calculated. As proven in Appendix A.1 (supplementary material), the power approximations of Equations (3) and (4) are asymptotically equivalent. However, both criteria may perform differently well for small sample sizes, as will be shown in Section 3 by a simulation study.

2.2. Expected times of analysis

If the schedule of analysis is given by the sum of cumulative hazards EH or the number of events D, the calendar time of analysis and the number of patients to be included are in general random, because the calendar dates of reaching the critical values given by Equation (3) or (4), respectively, are random. In this section, we derive the expected calendar time τend of the analysis under the planning alternative hypothesis H1: R(t) = γ1. Because the accrual process is known, the expected number n of patients to be included in order to reach the critical values defined by Equation (3) or (4), respectively, may be computed on this basis. For any point of study time t, let SH(t) = exp(−∫_0^t λH(s)ds) denote the known survival function of the historic control.
Then, under the planning alternative hypothesis H1: R(t) = γ1, the survival function of the new treatment arm is S(t) = SH(t)^γ1. Let F(t) := 1 − S(t). We assume that patients enter the trial according to a uniform distribution between calendar times τ0 and τ0 + a (in years) and are followed for a further f years after the end of the accrual period until the calendar time τend = τ0 + a + f of the final analysis of data. Let r denote the annual accrual rate of the phase-II trial. As shown in Appendix A.2 (supplementary material), τend is related to e and d, respectively, as follows:

γ0 e = (r/θ) ∫_f^{a+f} F(s)ds,   d = r ∫_f^{a+f} F(s)ds   (5)

where θ = γ1/γ0. In the case of exponentially distributed survival times, we have SH(t) = exp(−λH t) for the historic control. With λ := γ1 λH, the equations in Equation (5) yield

γ0 e = (r/θ) (a + [exp(−λ(a + f)) − exp(−λf)]/λ)  and  d = r (a + [exp(−λ(a + f)) − exp(−λf)]/λ).   (6)

The expected calendar time τend of the final analysis can be determined from these equations, although a and f are not uniquely determined. In order to fulfill either of the equations, the length of the follow-up period f has to be prolonged if the length of the accrual period a is shortened (and vice versa). The corresponding total number n of patients to be included is n = r ⋅ a.
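Under the exponential model, Equation (6) can be solved numerically for the accrual length a once the split between accrual and follow-up is fixed. The sketch below (our own helper; it fixes f = 0.5 ⋅ a as in the simulation scenarios of Section 3 and rounds n = r ⋅ a up to a whole patient) recovers a, f, and n from the planning parameters:

```python
import math
from statistics import NormalDist

def design(gamma1, gamma0=1.0, lam_H=-math.log(0.5), r=50.0,
           alpha=0.025, beta=0.2, ratio_f=0.5):
    """Solve Equation (6) for the accrual length a with f = ratio_f * a.

    Returns (a, f, n), where n = ceil(r * a) is the number of patients.
    """
    q = NormalDist().inv_cdf
    theta = gamma1 / gamma0
    # critical value e from Equation (3)
    e = ((q(1 - alpha) + math.sqrt(theta) * q(1 - beta)) / (1 - theta)) ** 2 / gamma0
    lam = gamma1 * lam_H                      # hazard of the new arm under H1
    target = gamma0 * e * theta / r           # Equation (6), rearranged

    def integral(a):                          # int_f^{a+f} F(s) ds
        f = ratio_f * a
        return a + (math.exp(-lam * (a + f)) - math.exp(-lam * f)) / lam

    lo, hi = 1e-9, 100.0                      # bisection: integral increases in a
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if integral(mid) < target:
            lo = mid
        else:
            hi = mid
    a = 0.5 * (lo + hi)
    return a, ratio_f * a, math.ceil(r * a)
```

With the defaults of Section 3 (λH = −log 0.5, r = 50), γ1 = 0.4 yields n = 38 and γ1 = 0.8 yields n = 177, in line with the sample sizes reported there.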

3. Simulation study

The clinical motivation for this paper is the METRO-NB 2012 trial (EudraCT number: 2011-004593-29) introduced in Section 1, where the one-sample log-rank test is applied. In order to explore the performance of the procedures described in this paper, we performed a simulation study. In Section 2.1, two stopping criteria are given that can be followed in order to achieve approximately a desired power for


the one-sample log-rank test. Either EH = ∑_{i=1}^n ΛH(Xi) or the number of events D can be monitored, and the analysis is to be performed as soon as EH or D reaches an a priori determined value e given by Equation (3) or d given by Equation (4), respectively. The two underlying power approximations of the one-sample log-rank test are asymptotically equivalent (for details, see Appendix A.1 of the supplementary material) but may perform differently well for (small) finite sample sizes, as will be shown in the following. Among others, we will compare the power performance, type-I error control, and appropriateness of the normal approximation of the one-sample log-rank test statistic in a range of scenarios when following the stopping criterion EH or D, respectively.

3.1. Choice of the main simulation scenarios

We assume the setting of a phase-II trial in oncology where the activity of a new chemotherapy regimen is to be assessed as compared with a treatment of a historic control with regard to EFS by means of the one-sample log-rank test. So, we are interested in testing hypotheses of the form in Equation (1) for the hazard ratio R(t) = λ(t)/λH(t), with λ(t) and λH(t) denoting the hazard function of the new treatment arm and the historic control, respectively. We set γ0 = 1 (superiority setting). The planning alternative hypothesis underlying power calculations is H1: R(t) = γ1 for some 0 < γ1 < 1. Let S(t) and SH(t) denote the survival function of the new treatment arm and the historic control, respectively. Further conditions underlying our main simulation scenarios are as follows:

• The EFS times are exponentially distributed (for the new treatment as well as in the historic control) with λH = −log(0.5), that is, the 1-year EFS rate of the historic control is 50%.
• For the given θ = γ1, the critical values e or d for EH or D to stop at are calculated according to Equations (3) and (4), respectively, where the one-sided significance level is set to α = 0.025 and the aspired power to 1 − β = 0.8.
• The length of the accrual period a for allocated critical values e or d is calculated according to Equation (6) under the assumption of f = 0.5 ⋅ a (i.e., the length of the follow-up period is half the length of the accrual period) and under the assumption of an annual accrual rate of r = 50 patients.
• Patients' accrual times are uniformly distributed on the interval [0, a].

Moreover, the following procedure is implemented, which is in step with actual practice according to our medical cooperation partners:

• Monthly monitoring: it is monitored at monthly intervals whether the statistics EH or D have reached their respective critical value to stop at.
• A priori fixed maximal trial duration: let τ0 denote the calendar time when the trial starts. The final analysis of data is performed as soon as EH or D reaches its respective critical value e or d to stop at, but we additionally demand that the final analysis is performed not later than calendar time τ0 + 1.25 ⋅ (a + f) (rounded up to integer months) if EH or D fails to reach its respective critical value until then.

The latter case of an enforced analysis at calendar time τ0 + 1.25 ⋅ (a + f) will in the sequel be called the case of a delayed analysis. We use the label delayed analysis in this context because an enforced analysis at calendar time τ0 + 1.25 ⋅ (a + f) is delayed as compared with the expected calendar time τ0 + (a + f) of the final analysis. The situation of a delayed analysis, that is, the situation that EH or D does not reach its individual critical value e or d in time, especially occurs in the case that the assumed treatment effect deviates from the true treatment effect.
If the sample size rule based on EH is used, the trial tends to take longer than anticipated if the true effect is smaller than assumed. Conversely, if the sample size rule based on D is used, the trial takes longer than anticipated if the true effect is bigger than assumed. In case of overestimation of the true effect, the procedure based on D tends to a premature analysis of data. In the sequel, we will strictly distinguish between the true hazard ratio R of the new treatment to the historic control and the assumed hazard ratio γ1 under the planning alternative hypothesis. Given the aforementioned conditions, the sample size n is uniquely determined by Equation (6) for allocated effect size θ = γ1. Thus, in order to reflect the effect of different sample sizes, we consider the following six values of γ1 for the planning alternative hypothesis, ranging from small to big effect sizes:

γ1 = 0.8, 0.75, 0.67, 0.57, 0.5, 0.4.   (7)


These values of γ1 correspond to the following critical values and sample sizes: e = 183.97, 115.68, 64.43, 36.43, 26.11, 17.25; d = 148, 87, 44, 21, 14, 7; n = 177, 124, 80, 58, 48, 38. For each of these six values of the assumed hazard ratio γ1 of the new treatment to the historic control, simulations were carried out for the following six values of the true hazard ratio R:

R = γ1 + i ⋅ (1 − γ1)/4,   i = −1, …, 4.
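The EH-based design can be simulated directly. The sketch below is our own simplified implementation of the main scenarios (exponential EFS, uniform accrual, monthly checks of EH, forced analysis at 1.25 ⋅ (a + f); function and argument names are hypothetical) and estimates the rejection probability for a given true hazard ratio R:

```python
import math
import random
from statistics import NormalDist

def simulate_power(R, n, e, a, f, lam_H=-math.log(0.5),
                   alpha=0.025, n_sim=4000, seed=1):
    """Estimate the rejection rate of the E_H-stopped design (gamma0 = 1)."""
    rng = random.Random(seed)
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    lam = R * lam_H                        # true hazard of the new treatment
    tau_max = 1.25 * (a + f)               # a priori fixed maximal duration
    rejections = 0
    for _ in range(n_sim):
        u = [rng.uniform(0.0, a) for _ in range(n)]      # accrual times
        t = [rng.expovariate(lam) for _ in range(n)]     # true EFS times
        tau = 1.0 / 12.0                   # first monthly monitoring date
        while True:
            # censored study times at calendar time tau (0 if not yet enrolled)
            x = [min(ti, max(tau - ui, 0.0)) for ui, ti in zip(u, t)]
            e_h = lam_H * sum(x)           # Lambda_H(x) = lam_H * x here
            if e_h >= e or tau >= tau_max:
                break
            tau += 1.0 / 12.0
        d = sum(ti <= tau - ui for ui, ti in zip(u, t))  # events observed by tau
        z = (d - e_h) / math.sqrt(e_h)     # statistic of Equation (2)
        rejections += z <= -z_alpha
    return rejections / n_sim
```

For γ1 = R = 0.4 (n = 38, e ≈ 17.25, a ≈ 0.75, f ≈ 0.37), the estimated power can be compared with the 82.2% reported in Table 1; with R = 1, the rejection rate stays below the nominal 2.5% in our runs.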

In this way, potential deviations of the true treatment effect from the assumed treatment effect are studied. The choice of i = 0, that is, R = γ1, means that the planning alternative hypothesis is true. Likewise, i = 4, that is, R = 1, corresponds to the scenario that the null hypothesis is true. The choices of i = 1, 2, 3 represent scenarios of overestimation of the true effect, while i = −1 represents underestimation of the true effect. The step size in R between the different scenarios is chosen equidistantly and proportional to the difference 1 − γ1 in the hazard ratio between the null and the planning alternative hypothesis. The true hazard ratio R determines the EFS in the active treatment arm. The corresponding true hazard rate λ of the new treatment is λ = RλH. In particular, the survival function for patients in the active treatment arm is S(t) = exp(−λt) = exp(−RλH t). In each of 10,000 runs for every constellation, the test statistic of the one-sample log-rank test was calculated twice, once when EH reached its critical value e for the first time and once when D reached its critical value d for the first time.

3.2. Description of results and further simulation scenarios

The results of the simulations for the main simulation scenarios as defined in Section 3.1 are summarized in Table 1 (supplementary material). According to Table 1, the power of the test procedure based on following EH is close to the expected power, while the power of the test procedure based on following D is below the expected one. This is particularly striking for bigger effect sizes, that is, small sample sizes. For instance, for γ1 = 0.4 and an expected power of 80%, the test procedure based on following EH reaches a power of 82.2%, while the one based on following D reaches a power of only 66.5%. For smaller effect sizes (i.e., bigger sample sizes), the power performance of both procedures becomes more aligned, as would be expected from the asymptotic theory.
But still, the test procedure based on following EH proceeds closer to the expected power. For instance, for γ1 = 0.8 and an expected power of 80%, the test procedure based on following EH reaches a power of 80.3%, while the one based on following D reaches a power of 76.9%. Besides, the test procedure based on following D tends to be more conservative as compared with the test procedure based on following EH, which exhausts the aspired one-sided significance level of 2.5% much better. Again, this is particularly striking for bigger effect sizes, that is, small sample sizes. For instance, for γ1 = 0.4, the significance level of the test procedure based on following EH is estimated at 1.8%, while, with a rejection rate of 1.2%, the test procedure based on following D appears quite conservative. For smaller effect sizes (i.e., bigger sample sizes), both procedures become more aligned. For instance, for γ1 = 0.8, the significance level of the test procedure based on following D is estimated at 1.9%, while, with a rejection rate of 2.1%, the test procedure based on following EH again exhausts the aspired significance level of 2.5% better. Let us now consider how many times the statistics D and EH do not reach their respective critical value prior to the maximal calendar date τ0 + 1.25 ⋅ (a + f) of the end of trial, that is, how often the event of a delayed analysis is observed. As expected, EH tends to miss its critical value in case of overestimation of the true effect, while D tends to miss its critical value in case of underestimation of the true effect. For EH, this tendency becomes increasingly pronounced for smaller effect sizes: while, for γ1 = 0.4 and R = 1, EH does not reach its critical value in time in 10.2% of the simulations, EH does not reach its critical value in time in 89.7% of the simulations for γ1 = 0.8 and R = 1. Conversely, for D, the tendency to miss its critical value becomes increasingly pronounced for bigger effect sizes.
For instance, for γ1 = 0.4 and R = 0.25, D does not reach its critical value in time in 56.3% of the simulations. In case of overestimation of the true effect, D tends to reach its critical value prematurely, that is, considerably before the expected


calendar date τ0 + (a + f) of the end of trial. So, the failure to reach the critical value in time is more pronounced for the criterion EH but is rather observed in the case of dramatic deviations of the expected treatment effect from the true one. But even in these cases, power performance and type-I error control were more adequate for the test procedure based on following EH than for the one based on following D. More details on the distribution of the actual trial duration in the presence of a maximal trial duration defined by the calendar date τ0 + 1.25 ⋅ (a + f) are provided in Figures 1 and 2 (supplementary material), where the scenarios from Table 1 with planned hazard ratio γ1 = 0.4 and γ1 = 0.8 are displayed, respectively, in order to reflect the settings of small and big sample sizes. For the bigger effect size (γ1 = 0.4), we see from Figure 1 that the procedure based on EH leads to an actual trial duration close to the desired one. The median actual trial duration is always close to the desired one, and the variability of the actual trial duration is relatively small irrespective of the true effect. In contrast, the median actual trial duration for the procedure based on D is only close to the desired trial duration if the expected effect is true, and the variability of the actual trial duration is relatively large in all considered cases. For the small effect size (γ1 = 0.8) and if the assumed treatment effect is true, we see from Figure 2 that both procedures become more aligned, as expected from the asymptotic theory. For both approaches, the variability in the actual trial duration is comparable, and the median actual trial duration is close to the desired one. But with increasing overestimation of the true effect, we again see that the procedure based on EH increasingly fails to reach its critical value before the maximal trial duration, while the procedure based on D increasingly leads to a premature analysis of data.
Additionally, we assessed the robustness of the normal approximation of the one-sample log-rank test statistic if we fix the values of either EH or D to stop at. The approximate normality of the test statistic of the one-sample log-rank test is crucial for the derivation of the power formula (A.8) as well as for the application of the test, because the test procedure is based on comparing the observed test statistic with the quantiles of the standard normal distribution (for details, see supplementary material). For the scenarios with planned hazard ratio γ1 = 0.4 from Table 1, the empirical distributions of the one-sample log-rank test statistic for the two different stopping criteria are displayed in Figures 3 and 4 (supplementary material). These scenarios represent the setting of small critical values and sample size (e = 17.25, d = 7, n = 38). Reassuringly, for both sample size approaches, the histograms support the validity of the normal approximation if the true treatment effect coincides with the expected one or is overestimated. Only in the case of underestimation of the true effect is a slight skewness in the empirical distributions observed in our simulations. For larger critical values and sample sizes, this effect disappears, and the quality of the normal approximation is uniformly adequate, as is evident from Figure 5, where the empirical distributions are displayed for the procedure based on EH for the scenarios with γ1 = 0.8 from Table 1 (supplementary material). One potential reason for the observed slight deviation from the normal approximation in the case of severe underestimation of the true effect might be that the assumption of smallness of 1 − θ underlying the approximation (A.7) becomes increasingly violated as θ decreases (for details, see supplementary material).
So far, we have concentrated on the main simulation scenarios described in Table 1, where monitoring of EH and D at monthly intervals is considered together with a maximal trial duration of 1.25-fold the expected trial duration a + f (i.e., if the critical values e or d from Equation (3) or (4) for EH or D to stop at were not reached until then, the trial is stopped at calendar date τ0 + 1.25 ⋅ (a + f), and the final analysis is forced). In order to assess the stability of our simulation results with regard to the frequency of monitoring as well as the choice of the maximal trial duration, we additionally performed simulations with

• quasi-continuous monitoring, that is, monitoring of EH and D at weekly and daily intervals;
• different choices for the maximal trial duration, that is, final analysis not later than the onefold, twofold, threefold, and fivefold of the desired trial duration (rounded up to integer days, weeks, or months depending on the chosen frequency of monitoring of D or EH).

In our simulations, the impact of performing monitoring of EH and D at daily, weekly, or monthly intervals was negligible. In Table 2 (supplementary material), the scenarios with monitoring at daily intervals and final analysis not later than the 1.25-fold of the planned trial duration are displayed. Apart from practical limitations of daily monitoring due to logistic efforts, a direct comparison of Tables 2 and 1 reveals that the benefit of daily monitoring as compared with monthly monitoring with regard to power and significance level performance is negligible. Let us next consider the effect of the choice of the maximal trial duration. Clearly, prolongation of the trial results in increased power. Remarkably, as apparent from Table 3, where the scenarios with monitoring at monthly intervals and final analysis not later than the fivefold of the planned trial duration

Statist. Med. 2014

R. SCHMIDT ET AL.

are displayed (supplementary material), the impact of prolongation of the trial on power and significance level is here rather small. Even in these simulation scenarios with potentially extreme prolongation of the trial up to the fivefold of the planned trial duration, power and significance level performance is close to the ones observed in the simulation scenarios with final analysis not later than the 1.25-fold of the desired trial duration (Table 1). The prolongation of the trial until the fivefold of the desired trial duration might appear practically irrelevant. However, this case is theoretically interesting because it illustrates power and significance level performance of the two sample size approaches in effective absence of an a priori fixed maximal trial duration. In contrast, the direct comparison of the scenarios with final analysis not later than the onefold and the 1.25-fold of the desired trial duration reveals a slight loss in power for the procedure based on EH in case of small effect sizes, if no prolongation of the trial beyond the planned schedule is allowed (for details, see Table 4 of the supplementary material). Impact on the type-I error rate is not observed. But for all scenarios displayed in Tables 1–4, power performance of the test procedure based on following EH is still better than power performance of the test procedure based on following D. Likewise, the test procedure based on following D is observed to be quite conservative in case of small sample size while the test procedure based on following EH exhausts the aspired significance level much better. Anticonservativeness is not observed in any of the scenarios from Tables 1–4. Finally, other choices for the 1-year EFS rate SH (1) for the historic control and other choices for the annual accrual rate r were made in order to further study the impact of sample size on the performance of the two sample size approaches. 
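The monitoring and stopping logic studied above can be sketched as follows. This is an illustrative reimplementation, not the paper's simulation code: the accrual pattern, follow-up period f, helper names, and thresholds (here the e = 17.25, d = 7 values of the 𝛾1 = 0.4 scenarios) are assumptions made for the example.

```python
import math
import random

random.seed(2)
lam_H = -math.log(0.5)          # historic hazard (exponential, S_H(1) = 50%)
gamma, n, r = 0.4, 38, 50       # planned effect, sample size, annual accrual rate
a = n / r                       # accrual period under uniform accrual (years)
f = 1.0                         # additional follow-up period (illustrative choice)
cap = 1.25 * (a + f)            # maximal trial duration: 1.25-fold of a + f

entry = [random.uniform(0, a) for _ in range(n)]              # calendar entry times
death = [random.expovariate(gamma * lam_H) for _ in range(n)] # survival times from entry

def trial_state(tau):
    """Observed number of events D(tau) and E_H(tau) at calendar time tau."""
    d = e_h = 0.0
    for u, t in zip(entry, death):
        if tau <= u:
            continue                      # patient not yet accrued
        x = min(t, tau - u)               # follow-up, administratively censored at tau
        d += (t <= tau - u)               # event observed by tau?
        e_h += lam_H * x                  # Lambda_H(x) = lam_H * x for exponential hazard
    return d, e_h

def stop_time(threshold, use_eh, step=1 / 12):
    """First monthly monitoring time at which the criterion is met, else the cap."""
    tau = step
    while tau < cap:
        d, e_h = trial_state(tau)
        if (e_h if use_eh else d) >= threshold:
            return tau
        tau += step
    return cap  # final analysis forced at the maximal trial duration

print(stop_time(17.25, use_eh=True), stop_time(7, use_eh=False))
```

Both EH(𝜏) and D(𝜏) are nondecreasing in calendar time, so the first crossing is well defined; changing `step` to 1/52 or 1/365 reproduces the weekly and daily monitoring variants.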
Under the assumption of exponential survival and uniform patient accrual, the sample size n for allocated power and significance level is uniquely determined by the effect size 𝜃 = 𝛾1, the hazard rate 𝜆H of the historic control, and the annual accrual rate r. The sample size may be varied by changing any of these three quantities. In the simulation scenarios considered so far, the impact of different sample sizes was studied by varying the value of 𝛾1 while keeping 𝜆H and r fixed. Alternatively, 𝜆H (that is, SH(1)) or r might be varied as well. Therefore, in addition to the choice of SH(1) = 50% and r = 50 considered previously, the choices SH(1) = 10%, SH(1) = 90% and r = 10, r = 200 were made for exemplary scenarios, in order to reflect extreme settings of bad/good survival and low/high accrual. All these scenarios confirmed the stability of the results displayed in Table 1. The whole simulation study underlying this paper, including Figures 1–5 of the supplementary material, was carried out using R 3.0.1 [8]. All results are reproducible.
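Two of the design quantities mentioned above follow from elementary relations: under exponential survival, 𝜆H = −ln SH(1), and under uniform accrual at annual rate r, recruiting n patients takes a = n/r years. A minimal sketch (the values of n and f shown are illustrative placeholders, not results from the paper):

```python
import math

# Exponential survival: S_H(t) = exp(-lam_H * t), so lam_H = -ln S_H(1).
for s1 in (0.10, 0.50, 0.90):
    lam_H = -math.log(s1)
    print(f"S_H(1) = {s1:.0%}  ->  lam_H = {lam_H:.3f} per year")

# Uniform accrual at annual rate r: recruiting n patients takes a = n / r years,
# so the expected trial duration is a + f for a follow-up period f.
n, r, f = 38, 50, 1.0   # n and f are illustrative; r = 50 as in the main scenarios
print(f"a = {n / r:.2f} years, planned duration a + f = {n / r + f:.2f} years")
```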

4. Discussion

Traditional designs in phase-II cancer trials are single-stage designs using a binary outcome as the primary endpoint, for example, tumor response. In the latter case, the binary outcome is defined by whether or not a response to treatment is observed after a fixed time span. In the literature, the investigator can find several proposals for designs specific to phase-II trials. In the single-stage design of Fleming [9], a predetermined number of patients is recruited into the trial, and a decision about the activity of the treatment under study is obtained from the number of responses among these patients. Gehan [10] suggests a two-stage design, where the sample size of the second stage depends on the number of responses observed in the first stage. A popular phase-II design is the two-stage procedure of Simon [11]. However, these designs are limited to trials with a binary outcome. For time-to-event endpoints in phase-II trials, Jennison and Turnbull [12] suggest monitoring the trial using Kaplan–Meier estimates [13] and the 𝛼-spending function method described by Lan and DeMets [14]. Lin et al. [15] describe a two-stage design where Nelson–Aalen estimates [16] are used in the interim analysis. Case and Morgan [17] develop a two-stage design for evaluating survival probabilities and show how results described in Lin et al. [15] may be used to assess hypotheses of the type H0 ∶ S(x∗) = S0(x∗), where x∗ is the survival time of interest and where S(⋅) and S0(⋅) denote the survival functions of the new treatment arm and the historic control, respectively. However, it may sometimes be difficult to specify the time point x∗ of interest, for example, if it is unknown at which time point the survival curves of the new treatment arm and the historic control begin to differ.
In this case, a comparison of the survival curve of the new treatment with a fixed reference curve representing the historic control treatment may be more appropriate. For such settings, Finkelstein et al. [1] and Sun et al. [2] propose single-stage designs based on the one-sample log-rank test and give a sample size formula based on the number of events D = ∑i Δi.


In this paper, we proposed an alternative sample size formula for the one-sample log-rank test based on the sum of cumulative hazards EH = ∑i ΛH(xi). Thus, either EH or D can be monitored, and the analysis is to be performed as soon as EH or D reaches an a priori determined critical value e or d, respectively. We further showed that the two underlying power approximations are asymptotically equivalent. Clearly, the approach based on D appears more practicable, because it merely requires counting the observed events. However, in a simulation study, we found that the one-sample log-rank test tends to be conservative under the null hypothesis and underpowered under the alternative hypothesis if the time point of analysis is planned following the required D. In contrast, if the schedule of analysis is set up following the required EH, the one-sample log-rank test meets the power requirements more exactly and exhausts the aspired significance level much better. This behavior is particularly striking for smaller sample sizes. For bigger sample sizes, both procedures become more aligned with regard to power and significance level performance.

Both sample size approaches require the analysis to be performed as soon as a critical value is reached for the first time. Our simulations clearly show that EH or D may fail to reach its respective critical value e or d in time. This mainly happens if the assumed treatment effect deviates from the true one. If the sample size rule based on EH is used, the trial tends to take longer than anticipated if the true effect is smaller than assumed. If the sample size rule based on D is used, the trial tends to take longer than anticipated if the true effect is bigger than assumed. Therefore, the procedure based on EH might not reach its critical value in time in the vicinity of the null hypothesis, whereas the procedure based on D tends to reach its critical value prematurely if the null hypothesis holds true.
Particularly in early phase-II settings, the latter property of the procedure based on D is clearly a desirable feature, because if there is no effect or only a clinically irrelevant one, it would be futile to run a longer trial. However, early phase-II trials are typically also characterized by small sample size. In case of small sample size, our simulations indicate that a trial based on the stopping criterion D is usually underpowered even if the planning assumptions on the true treatment effect are correct, whereas a trial based on EH usually maintains power. Because both procedures may fail to reach their respective critical value e or d to stop at, it appears advisable to fix a maximal trial duration, that is, to fix in advance a maximal calendar date at which the trial is to be stopped and the final analysis performed at the latest if the respective critical value has not been reached by then. In our simulations, we studied several potential choices for the maximal trial duration: the onefold, 1.25-fold, twofold, threefold, and fivefold of the expected trial duration. For the procedure based on EH, the results of our simulations indicate that power is maintained if a moderate prolongation of the trial up to the 1.25-fold of the expected trial duration is allowed. More extensive prolongation of the trial shows no further relevant effect, whereas prohibiting any prolongation of the trial beyond the expected schedule results in a slight loss of power for the criterion based on EH in case of a small treatment effect. Even in the latter case, however, the power performance of the test procedure based on following EH is still superior to that based on following D. In contrast, a trial based on the stopping criterion D is observed to be underpowered in all simulation scenarios irrespective of the choice of the maximal trial duration. This effect is particularly striking for scenarios with small sample size.
According to the present paradigm (as proposed in [1] and [2]), the time point of analysis is planned following the required number of events D. However, against the background of our simulation results, and in spite of the potentially higher logistic effort, it appears more favorable to plan the schedule of analysis according to the sum of cumulative hazards EH and, in addition, to fix a maximal trial duration. In the present paper, single-stage procedures for the one-sample log-rank test were studied. Because the focus of phase-II trials is on discarding ineffective treatments, an extension to designs with confirmatory interim analyses appears desirable. The implementation of the one-sample log-rank test within the framework of (adaptive) group-sequential designs may canonically be based on the (asymptotic) independent increment structure of the one-sample log-rank test statistic, with the information time approximated by either the sum of cumulative hazards EH or the number of events D. Against the background of this paper, it might be suspected that the criterion based on EH is preferable in a sequential design setting as well. However, an extension of the proposed single-stage procedure to (adaptive) group-sequential designs and a substantial exploration of its properties in a sequential trial setting will be the subject of future research.

Acknowledgements

We would like to thank two anonymous referees, an anonymous associate editor and the editor for very constructive suggestions and comments that resulted in an improved version of this manuscript.


References

1. Finkelstein DM, Muzikansky A, Schoenfeld DA. Comparing survival of a sample to that of a standard population. Journal of the National Cancer Institute 2003; 95:1434–1439.
2. Sun X, Peng P, Tu D. Phase II cancer clinical trials with a one-sample log-rank test and its corrections based on the Edgeworth expansion. Contemporary Clinical Trials 2011; 32:108–113.
3. Breslow NE. Analysis of survival data under the proportional hazards model. International Statistical Review 1975; 43:45–58.
4. Gail MH, Ware JH. Comparing observed life table data with a known survival curve in the presence of random censorship. Biometrics 1979; 35:385–391.
5. Woolson RF. Rank tests and a one-sample log-rank test for comparing observed survival data to a standard population. Biometrics 1981; 37:687–696.
6. Simon T, Berthold F, Borkhardt A, Kremens B, De Carolis B, Hero B. Treatment and outcomes of patients with relapsed, high-risk neuroblastoma: results of German trials. Pediatric Blood & Cancer 2011; 56(4):578–583.
7. Aalen OO, Borgan O, Gjessing HK. Survival and Event History Analysis. Springer Science + Business Media: New York, 2008.
8. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing: Vienna, Austria, 2013. Available from: http://www.R-project.org/. Accessed on 1 October 2013.
9. Fleming TR. One-sample multiple testing procedure for phase II clinical trials. Biometrics 1982; 38:143–151.
10. Gehan EA. The determination of the number of patients required in a preliminary and follow-up trial of a new chemotherapeutic agent. Journal of Chronic Diseases 1961; 13:346–353.
11. Simon R. Optimal two-stage designs for phase II clinical trials. Controlled Clinical Trials 1989; 10:1–10.
12. Jennison C, Turnbull BW. Group Sequential Methods with Applications to Clinical Trials. Chapman and Hall: London, 2000.
13. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association 1958; 53:457–481.
14. Lan KKG, DeMets DL. Discrete sequential boundaries for clinical trials. Biometrika 1983; 70:659–663.
15. Lin DY, Shen L, Ying Z. Group sequential designs for monitoring survival probabilities. Biometrics 1996; 52:1033–1042.
16. Nelson W. Hazard plotting for incomplete failure data. Journal of Quality Technology 1969; 1:27–52.
17. Case LD, Morgan TM. Design of phase II cancer trials evaluating survival probabilities. BMC Medical Research Methodology 2003; 3(6):1–12.

Supporting information

Additional supporting information may be found in the online version of this article at the publisher's web site.

Copyright © 2014 John Wiley & Sons, Ltd.

Statist. Med. 2014
