Research Article

Received 14 May 2014, Accepted 8 December 2014, Published online 23 December 2014 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/sim.6400

A critical look at prospective surveillance using a scan statistic

Thais R. Correa,a*† Renato M. Assunçãob and Marcelo A. Costac

The scan statistic is a very popular surveillance technique for purely spatial, purely temporal, and spatial-temporal disease data. It was extended to the prospective surveillance case, and it has been applied quite extensively in this situation. When the usual signal rules, as those implemented in SaTScan™ (Boston, MA, USA) software, are used, we show that the scan statistic method is not appropriate for the prospective case. The reason is that it does not adjust properly for the sequential and repeated tests carried out during the surveillance. We demonstrate that the nominal significance level $\alpha$ is not meaningful and there is no relationship between $\alpha$ and the recurrence interval or the average run length (ARL). In some cases, the ARL may be equal to ∞, which makes the method ineffective. This lack of control of the type-I error probability and of the ARL leads us to strongly oppose the use of the scan statistic with the usual signal rules in the prospective context. Copyright © 2014 John Wiley & Sons, Ltd.

Keywords: average run length; p-value; recurrence interval; scan statistic

1. Introduction

Epidemiologists typically perform geographical surveillance of diseases to detect statistically significant temporal, spatial, or space–time disease clusters. Looking for hints about unknown risk factors, they also want to test whether a disease is randomly distributed over space, over time, or over space and time. This procedure can help in evaluating the statistical significance of disease cluster alarms. The scan statistic method proposed by Kulldorff [1] is a very popular method for these purposes. For retrospective surveillance and cluster detection, it is one of the most popular methods among public health officials and researchers, and the SaTScan™ software ([2]) is a widely used free implementation of the method with users from many countries.

The main reason for the widespread popularity of the scan statistic method for retrospective surveillance and cluster detection is its control of the type-I error probability over multiple tests. When searching for temporal or geographical disease clusters, there is a huge number of potential candidates because of the overwhelming number of combinations of areas and times that can form a cluster. Carrying out a statistical significance test for each potential cluster candidate leads to a large number of false positives, an undesirable situation that Kulldorff [1] solved in a simple way. The scan statistic is based on the maximum likelihood ratio attained in a simple comparison of disease occurrence between two groups, within and outside the candidate cluster. This maximum is obtained by scanning all possible candidate clusters, and hence the multiple tests situation is reduced to a single test situation. Its statistical significance is evaluated by means of Monte Carlo replications. This false positive control feature, coupled with its good power performance in simulation studies, transformed the scan statistic into a standard test for retrospective surveillance and cluster detection problems.
The scan statistic methodology has also been proposed in the prospective context, in which repeated time-periodic surveillance is performed for early detection of disease outbreaks (Kulldorff [3]). In this

a Departamento de Estatística, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
b Departamento de Ciência da Computação, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
c Departamento de Engenharia de Produção, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

*Correspondence to: Thais R. Correa, Departamento de Estatística, Universidade Federal de Minas Gerais, Av. Antônio Carlos, 6627, Belo Horizonte, Minas Gerais CEP 30123-970, Brazil.
† E-mail: [email protected]

Copyright © 2014 John Wiley & Sons, Ltd.

Statist. Med. 2015, 34 1081–1093


case, a stream of new data is regularly added to the database as time accrues. Periodically, the scan statistic is applied to the updated database searching for evidence of emerging clusters. In contrast with the retrospective analysis, the prospective scan delivers a time series of inferential outputs such as successive p-values or test statistics. Recent reviews covering the space–time situation include Woodall et al. [4] and Unkel et al. [5].

Prospective surveillance is an important issue for public health agencies. Among the prospective space–time surveillance models available, we have the methods proposed by Kulldorff [3], Kulldorff et al. [6], Takahashi et al. [7], Tango et al. [8], and Correa and Assunção [9]. Most of these proposals are based on the repeated application of the scan statistic over time. The methods proposed by Kulldorff [3] and Kulldorff et al. [6] are widely adopted. As of September 16, 2014, Kulldorff [3] and Kulldorff et al. [6] had been cited 226 and 260 times, respectively, according to the Web of Science site [10]. This largely underestimates the use of the prospective space–time scan because many papers make use of the methods through the SaTScan™ software without directly citing the original papers. Furthermore, many regular surveillance agencies use the method without publishing results in scientific journals.

Nonetheless, the prospective scan method has been used often in papers published after 2012 in many different subject areas. Gao et al. [11] applied the prospective space–time scan statistic to detect terrorism outbreaks at an early stage. They analyzed the terrorist incidents in the Consortium for the Study of Terrorism and Responses to Terrorism’s Global Terrorism Database from 1998 to 2004. The space–time scan statistic was used by Briggs et al. [12] to look for smaller water supply-associated cryptosporidiosis outbreaks.
All reported cases of cryptosporidiosis in a population of 2.2 million in the south of England between January 1, 2009 and December 31, 2010 were analyzed simulating monthly prospective investigation. They identified small outbreaks, most of them associated with swimming pool use. Their results indicate that frequent small-scale transmission in swimming pools is a relevant factor for disease burden. Glass-Kaastra et al. [13] applied both retrospective and prospective temporal scan statistics to data on multiple-class antimicrobial resistance present in clinical isolates of Escherichia coli F4 and Pasteurella multocida from Ontario swine between 1998 and 2010. Madouasse et al. [14] adopted the prospective space–time scan statistic to detect clusters of low milk production, because this decrease might indicate an outbreak of two culicoides-borne diseases: Bluetongue and Schmallenberg. They used the data from the 2007 Bluetongue epizootic in France. The analysis was repeated on a weekly basis so that, for a given week, data from this week and the 4 previous weeks were analyzed together. Hughes and Gorton [15] applied the space–time scan statistic as a cluster detection method to routine laboratory surveillance data related to Campylobacteriosis, the commonest cause of bacterial enteritis in England. They analyzed laboratory-confirmed cases of Campylobacteriosis in the North East of England between 2008 and 2011.

The usual retrospective scan statistic method controls the type-I error level for the multiple testing because of the large and a priori fixed number of comparisons undertaken to find the most likely cluster. This control is made by means of Monte Carlo simulation of the maximum likelihood over all purely spatial or space–time cylinders, and the ease with which this is carried out is a major component of its theoretical and practical success.
In addition to the multiple testing implied in the fixed database associated with retrospective analysis, the prospective scenario introduces an additional difficulty. The prospective surveillance is meant to be repeated periodically and indefinitely as new data arrive sequentially. In the original proposal of the prospective scan, Kulldorff [3] already noted that the usual scan method would require adjustments for the number of analyses carried out up to a certain point and implemented these adjustments in the SaTScan™ software. Kleinman [16] and Kleinman et al. [17] proposed another type of adjustment on the scan statistic to control for the repeated and indefinite character of the prospective situation. They used a recurrence interval (RI) as a measure to correct each p-value issued periodically for the number of previous analyses. Woodall et al. [4] criticized the use of p-values and RIs in the prospective case, because these measures do not reflect appropriately the statistical performance of procedures repeated indefinitely. Limitations of the RI are also pointed out by other authors. According to Kleinman et al. [17], the RI is conservative, because it assumes independent tests. Fraker et al. [18] state that methods with the same RI value can have largely different in-control time-to-signal features.

Joner et al. [19] compared the simplified version of the temporal scan statistic with the cumulative sum (CUSUM) technique when used in prospective surveillance for Bernoulli observations. In contrast with Kulldorff’s scan statistic, which allows for an arbitrary window size, they consider the simplest case of a fixed size for the scanning window, where the test statistic at a given time is the number of events in the window. A signal or alarm is given if the test statistic exceeds a fixed threshold. This scan statistic is known as an unweighted moving average in the statistical process control literature. Han et al.
[20] compared this same simplified and fixed size temporal scan statistic for detection of increases in

Poisson rates, considering a larger set of alternative techniques. Note that Naus and Wallenstein [21] provide analytical calculation of the p-value in this simplified fixed size window. Sparks [22] also compared the performance of the space–time scan statistic for detection of outbreaks for the Poisson model, also using a fixed window, with that of a space–time CUSUM statistic. These authors found a better performance for the CUSUM-based methods in the prospective case. These results hint at the possibility that the scan statistic with the usual signal rules, as those implemented in SaTScan™, is not appropriate to the prospective situation. Also using a fixed window size, Neill [23] found many more false alarms than the nominal setting in a use of the scan statistic.

Takahashi et al. [7] and Tango et al. [8] suggest modifications for the variable window size scan statistic proposed by Kulldorff [3]. In Takahashi et al. [7], arbitrarily shaped clusters are searched with a restriction that avoids the so-called ‘octopus effect’ on the most likely cluster found in Duczmal and Assunção [24] and Costa et al. [25]. This name is due to the presence in the most likely cluster of elongated branches resulting from the maximization procedure when carried out without restriction. The work in Tango et al. [8] is aimed at lifting the conditional aspect of the scan statistic method, where the total number of cases in the Monte Carlo null distribution is kept equal to the observed number. However, they do not question the use of the scan statistic in the prospective situation.

The prospective situation brings a difficulty for the scan statistic method. One of the main advantages of the scan test statistic, its type-I error probability control, is not guaranteed over the multiple sequential tests performed in the prospective situation.
Kulldorff [3] presents a weak recognition of this difficulty, and he proposed an adjustment by establishing critical thresholds for the test statistic at each time based on its previous values. We will show that this adjustment does not solve the problem.

In this paper, we present a critique of the prospective use of the scan statistic. We show that, when applied in a prospective way to detect emerging clusters, either adjusted or not, the scan statistic methods from Kulldorff [3] and Kulldorff et al. [6] are not appropriate. More specifically, the scan statistic does not adjust properly for the sequential and repeated tests carried out during the surveillance. Although this is not the first time a critique of this method appears in the literature (for example, [4, 9, 19, 20, 22]), none of these previous works points out the defect as directly as we do here. We show in great detail the behavior of the main performance measures as the time frame evolves. Furthermore, the continued and high frequency of use of this technique makes clear that those previous criticisms were not sufficient. We believe that another more direct critique is needed, as we have done here. We reached a much stronger conclusion than [19, 20, 22]. While they found only an inferior performance of the scan technique as compared with the CUSUM, we found an ineffective technique. Our conclusion is that the prospective scan statistic with the SaTScan™ signal rules should not be used despite its popularity. We insist on the specificity of our comments: they are directed toward the use of the scan testing methodology in the prospective scenario. We think that the scan statistic is an excellent technique for cluster detection in the retrospective situation.

The essence of this difficulty is that the scan statistic is based on a statistical hypothesis testing framework that is not adequate to the prospective situation.
Statistical hypothesis testing uses type-I and type-II error probabilities as performance measures. These concepts are not meaningful when a statistical test is applied sequentially with no predefined number of repeated applications. This is different from the retrospective situation, in which we scan over a number of potential candidates generating a large but preestablished number of statistics. When the number of test statistics is undefined, the type-I and type-II error probabilities may be equal to 1 and 0, respectively, rendering these performance measures unhelpful in the prospective case. For example, consider the simplest prospective surveillance method for a sequence of i.i.d. random variables $Z_1, Z_2, \ldots$ in which $Z_i \sim N(0, 1)$ in the in-control state and $Z_i \sim N(1, 1)$ in the out of control state. A Shewhart chart procedure declares that the system is out of control the first time that $Z_i$ is larger than a threshold $c$. If the system is run indefinitely under the in-control state, it is clear that

$$P\left(\min\{i : Z_i > c,\ i = 1, 2, \ldots\} < \infty\right) = 1.$$

Therefore, any meaningful type-I error probability definition will be equal to 1 in this case. In the same way, under the out of control state, the type-II error probability is equal to 0 for any $c$. Note that it is possible to redefine events in such a way that error rates can be meaningful [18, 26]. However, the prospective scan statistic uses the traditional definition in which the binary outcome is the final result.

It has been shown by [27] and [28] that, under some conditions, the scan statistic can be viewed as a CUSUM statistic. Therefore, at least in principle, appropriate prospective signal rules could be adopted with the scan statistic as a monitoring tool. However, in practice, the usual signal rules are those implemented in SaTScan™.
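The Shewhart example in this paragraph can be checked numerically. The sketch below is only an illustration under the stated i.i.d. normal model; the function names and the threshold value c = 3 are assumptions introduced here, not part of the original study. It computes the probability that the chart has signaled by time n, which tends to 1 as n grows, and the in-control ARL, which is the mean of a geometric waiting time.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_signal_by(n: int, c: float, mean: float = 0.0) -> float:
    """P(at least one Z_i > c among Z_1..Z_n) for i.i.d. Z_i ~ N(mean, 1)."""
    p_exceed = 1.0 - std_normal_cdf(c - mean)
    return 1.0 - (1.0 - p_exceed) ** n


def average_run_length(c: float, mean: float = 0.0) -> float:
    """Expected time of the first exceedance (geometric waiting time)."""
    return 1.0 / (1.0 - std_normal_cdf(c - mean))


if __name__ == "__main__":
    c = 3.0  # hypothetical threshold, chosen only for illustration
    for n in (10, 1_000, 100_000):
        print(n, prob_signal_by(n, c))
    print("in-control ARL:", average_run_length(c))
    print("out-of-control ARL:", average_run_length(c, mean=1.0))
```

For any finite c the in-control signal probability approaches 1 as n increases, which is why the ARL, rather than a type-I error probability, is the natural performance measure for such a chart.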


In Section 2, we review both the retrospective and the prospective scan statistics and discuss in more detail the difficulties faced by the scan statistic in the prospective situation. In Section 3, we present simulation results that highlight the problems with the prospective scan statistic. We analyzed some important aspects of the scan statistic in the retrospective and prospective cases. For the prospective situation, we evaluated the behavior of the p-values and the proportion of alarms when the scan statistic is applied sequentially as data become available. We considered different possibilities, with and without adjustments for earlier analyses, showing that these adjustments do not work as expected. We also verified the relationship between the run length (RL) and the RI, as considered by Kleinman [16] and Kleinman et al. [17] in the spatial-temporal aggregated case of the prospective scan method. We close with final considerations in Section 4.

2. The scan statistic for emerging outbreaks

In this section, we review the spatial scan statistic of Kulldorff [1] and its extension to the prospective situation proposed by Kulldorff [3]. We keep the notation used in [3].

2.1. Retrospective purely spatial scan statistic

The purely spatial scan statistic uses a circular window that moves on the map, including different sets of neighboring areas. The radius of each circle increases gradually until the circle includes a maximum proportion of the population at risk, usually 50%. Circles containing more than half the population at risk would be more suitably interpreted as a negative cluster of lower risk. The number of events can be considered either Poisson or Bernoulli distributed. The spatial scan statistic $S$ is the maximum likelihood ratio over all possible circles, conditioning on the observed total number of cases $N$,

$$S = \frac{\max_Z [L(Z)]}{L_0} = \max_Z \left( \frac{L(Z)}{L_0} \right), \qquad (1)$$

where $L_0$ is the likelihood function under the null hypothesis of a purely random Poisson process and $L(Z)$ is the likelihood for circle $Z$. Define $n_Z$ as the number of cases inside circle $Z$. Assuming the Poisson model, $\mu(Z)$ is the expected number of cases under the null hypothesis. Kulldorff [1] showed that

$$\frac{L(Z)}{L_0} = \left( \frac{n_Z}{\mu(Z)} \right)^{n_Z} \left( \frac{N - n_Z}{N - \mu(Z)} \right)^{N - n_Z}$$

if $n_Z > \mu(Z)$ and $L(Z)/L_0 = 1$ otherwise.

2.2. Prospective space–time scan statistic


The extension of the scan statistic for the space–time situation basically consists of considering cylinders instead of circles. The space–time scan statistic uses a cylindrical window in three dimensions. The base of the cylinder represents the space, and the third dimension is the time. The statistic is the same given in (1), but now $Z$ is a cylinder. Let $s$ and $t$ be the start and end dates of a cylinder $Z$, respectively. Let $[Y_1, Y_2]$ be the time interval for which data exist. The prospective space–time scan statistic considers all cylinders for which $Y_1 \leq s \leq t = Y_2$. That is, in the prospective context, the space–time scan statistic considers only alive clusters, that is, clusters that reach the current time. To evaluate the statistical significance for a cluster, the distribution of the statistic given in (1) under the null is constructed by Monte Carlo replications. For the random data sets, cases are generated so that space and time are independent. To solve the problem of multiple testing, the likelihood for the random data sets is maximized over all cylinders used in the previous analyses in addition to the current cylinders, that is, those cylinders for which $Y_1 \leq s \leq t \leq Y_2$ and $t \geq Y_m$, where $Y_m$ is the time in which the surveillance began. We will refer to this situation as prospective scan adjusting for all previous analyses. At a given moment, if the probability of having detected a cluster with higher likelihood during any of the previous analyses or the present analysis is at most $\alpha$, then the observed cluster is statistically significant at the $\alpha$ level. That means that using the random data sets, it is possible to find the critical value for the $\alpha$ level of significance.
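As an illustration of the statistic in (1) restricted to alive clusters, the following sketch handles the purely temporal case only. It is not the SaTScan implementation: the function names, the constant-rate null (so that the expected count of a window is N times its share of the study period), and the 50% maximum window length are assumptions made for this example. The likelihood ratio is computed on the log scale.

```python
import math


def log_lr(n_z: float, mu_z: float, total: float) -> float:
    """Log of L(Z)/L0 under the Poisson model; 0 unless n_Z > mu(Z)."""
    if n_z <= mu_z:
        return 0.0
    value = n_z * math.log(n_z / mu_z)
    outside = total - n_z
    if outside > 0:  # the second factor equals 1 when no case lies outside Z
        value += outside * math.log(outside / (total - mu_z))
    return value


def alive_scan(counts, max_frac=0.5):
    """Maximize log L(Z)/L0 over windows [s, n] that end at the current time n.

    Returns the maximum and the 1-based start date s of the best window
    (None when no window has more cases than expected).
    """
    n = len(counts)
    total = sum(counts)
    max_len = max(1, int(max_frac * n))
    best, best_start = 0.0, None
    in_window = 0
    for length in range(1, max_len + 1):
        in_window += counts[n - length]   # extend the window one step back
        mu = total * length / n           # expected cases under a constant rate
        value = log_lr(in_window, mu, total)
        if value > best:
            best, best_start = value, n - length + 1
    return best, best_start


if __name__ == "__main__":
    series = [3] * 20 + [10] * 5  # hypothetical counts with a recent excess
    print(alive_scan(series))
```

Only windows ending at the last observation are visited, so a past excess that has already ended contributes nothing to this prospective statistic.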


According to the author, adjusting for all previous analyses might be too conservative if the surveillance has been in place for a long period. He suggests handling this problem by including only those cylinders analyzed during the preceding $v$ times for the random data sets, that is, those cylinders for which $Y_1 \leq s \leq t$ and $Y_2 - v < t \leq Y_2$. We will refer to this situation as prospective scan adjusting for a fixed number of analyses.

We will also consider two other situations: retrospective scan and prospective scan with no adjustment. In the first case, all cylinders are evaluated, alive or not. In the prospective scan with no adjustment, only alive cylinders are evaluated. In both cases, there is no adjustment for the previous analyses: the likelihood for the random data sets is maximized over all cylinders used in the current analysis.

In the prospective situation, it is more common to use measures other than the type-I and type-II error probabilities. An alarm goes off at time $t$ when there is empirical evidence that the system under surveillance has changed from the in-control state to the out of control state. In this situation, a frequently used measure is the RL, the waiting time until the alarm goes off. The average RL (ARL) is the expected RL when the process is under control. One tries to fix a target ARL and aims for a procedure that minimizes the expected waiting time for a true alarm when the process is out of control. This is called the conditional expected delay in the quality control literature. The RI is also a measure related to the frequency of false alarms. RI is defined, under the in-control state, as the length of time for which the expected number of alarms is 1. In the next section, we study the p-value, the test statistic, the type-I error probability, the ARL, and the RI for the prospective scan statistic.
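The four situations differ only in the set of windows over which the likelihood of the random data sets is maximized. A minimal sketch of these sets for the purely temporal case follows, assuming data on times 1 to n (so that Y1 = 1 and Y2 = n); the function name and the mode labels are hypothetical, and the maximum cluster size restriction is left out for brevity.

```python
def candidate_windows(n, mode, y_m=None, v=None):
    """Return the (s, t) windows, 1-based and inclusive, for data on [1, n].

    mode is one of:
      "retrospective" - all windows, alive or not (1 <= s <= t <= n)
      "prospective"   - alive windows only (t = n), no adjustment
      "adjust_all"    - windows with y_m <= t <= n (all previous analyses)
      "adjust_fixed"  - windows with n - v < t <= n (last v analyses)
    """
    if mode == "retrospective":
        ends = range(1, n + 1)
    elif mode == "prospective":
        ends = range(n, n + 1)
    elif mode == "adjust_all":
        if y_m is None:
            raise ValueError("adjust_all needs y_m, the start of surveillance")
        ends = range(y_m, n + 1)
    elif mode == "adjust_fixed":
        if v is None:
            raise ValueError("adjust_fixed needs v, the number of analyses")
        ends = range(max(1, n - v + 1), n + 1)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [(s, t) for t in ends for s in range(1, t + 1)]


if __name__ == "__main__":
    for mode, kwargs in [("retrospective", {}), ("prospective", {}),
                         ("adjust_all", {"y_m": 4}), ("adjust_fixed", {"v": 2})]:
        print(mode, len(candidate_windows(5, mode, **kwargs)))
```

In the two adjusted situations the observed data are still scanned over alive windows only; the larger sets apply to the Monte Carlo data sets, which is what raises the critical value.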

3. Simulation study

In this section, we present simulation results of the prospective and retrospective scan statistics for purely temporal analysis. The issues we want to discuss are present in both space–time and purely temporal contexts. The purely temporal case is simpler to understand and simulate, and it is enough to show the problems implied by the prospective situation. All results came from the SaTScan™ software, using several different options given there. The specific choices for these options are clarified in the next subsections.

3.1. Simulation design

We generated 1000 time series of i.i.d. random variables $Y_1, \ldots, Y_{400}$ with Poisson distribution with mean equal to 3. For each time series, a sequential analysis is carried out. We began with a time series of length $n = 100$ and increased it sequentially until the maximum length $n = 400$. That is, the first $n$ observations for a time series of length $n + 1$ are the same as the $n$ observations for the series of length $n$. For each time series and each length $n$, we used the SaTScan™ software, always selecting 999 Monte Carlo replications in each analysis to obtain the critical value for the $\alpha = 0.05$ significance level and the p-value. In all analyses, we used the maximum temporal cluster size as 50% of the study period, as suggested by Kulldorff [3]. We considered four different situations, all described previously: retrospective scan, prospective scan with no adjustment, prospective scan adjusting for all previous analyses, and prospective scan adjusting for a fixed number of analyses.

3.2. Empirical results


3.2.1. Results for retrospective scan statistic. Figure 1 shows the results for the retrospective scan (here named situation (a)). The horizontal axis in all three graphs represents the length $n$ of the time series. In this retrospective situation, the scan statistic analysis is carried out at each $n$ without any concern for the analyses carried out at other time series lengths. That is, the analysis at length $n$ is run as a single analysis, as if no other analysis would be carried out in later moments. For each time series length $n$, we used the SaTScan™ software to calculate the scan test statistic given in (1). The set of candidate clusters comprised all the time intervals up to time $n$, including the nonalive ones. We denote this test statistic by $S_n$. We also took the critical value associated with the 5% significance level and the p-value from the SaTScan™ software. This p-value is correctly obtained by evaluating the test statistic $S_n$ in many Monte Carlo replications in each time series and each length $n$. That is, for each $n$ and each realized time series $Y_t$, we replicated 999 time series with the same mean as the observed series, evaluated the test statistic $S_n$ considering all possible clusters, alive at $n$ or not, and obtained the p-value by ranking the observed $S_n$ with respect to these replicated $S_n$ values. We repeated this procedure 1000 times. Then, for each time series length $n$, we can calculate summary statistics for the critical value and p-value.
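The ranking step described above can be sketched as follows for a single purely temporal series. This is a simplified stand-in and not the SaTScan procedure: the function names are hypothetical, and the null data sets are generated here by conditioning on the observed total and assigning each case to a uniformly chosen time, which is an assumption made for this example.

```python
import math
import random


def log_lr(n_z, mu_z, total):
    """Log of L(Z)/L0 under the Poisson model; 0 unless n_Z > mu(Z)."""
    if n_z <= mu_z:
        return 0.0
    value = n_z * math.log(n_z / mu_z)
    outside = total - n_z
    if outside > 0:
        value += outside * math.log(outside / (total - mu_z))
    return value


def retrospective_scan(counts, max_frac=0.5):
    """Maximum log likelihood ratio over all windows, alive or not."""
    n, total = len(counts), sum(counts)
    max_len = max(1, int(max_frac * n))
    prefix = [0]
    for c in counts:
        prefix.append(prefix[-1] + c)
    best = 0.0
    for start in range(n):
        for length in range(1, min(max_len, n - start) + 1):
            n_z = prefix[start + length] - prefix[start]
            best = max(best, log_lr(n_z, total * length / n, total))
    return best


def monte_carlo_p_value(counts, n_rep=999, rng=None):
    """Rank the observed statistic among n_rep replicates under the null."""
    rng = rng or random.Random()
    n, total = len(counts), sum(counts)
    observed = retrospective_scan(counts)
    at_least = 0
    for _rep in range(n_rep):
        sim = [0] * n
        for _case in range(total):        # each case falls at a uniform time
            sim[rng.randrange(n)] += 1
        if retrospective_scan(sim) >= observed:
            at_least += 1
    return (at_least + 1) / (n_rep + 1)


if __name__ == "__main__":
    series = [3] * 20 + [12] * 5          # hypothetical series with an excess
    print(monte_carlo_p_value(series, n_rep=99, rng=random.Random(1)))
```

With 999 replicates the smallest attainable p-value is 1/1000, and a cluster is declared significant at the 5% level when the observed statistic ranks among the 50 largest of the 1000 values.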

[Figure 1: plot residue only; the recoverable axis labels are "length n" (horizontal) and "p-value" (vertical).]
