Article

Prospective analysis of infectious disease surveillance data using syndromic information

Statistical Methods in Medical Research 2014, Vol. 23(6) 572–590 ! The Author(s) 2014 Reprints and permissions: sagepub.co.uk/journalsPermissions.nav DOI: 10.1177/0962280214527385 smm.sagepub.com

Ana Corbera´n-Vallet1 and Andrew B Lawson2

Abstract In this paper, we describe a Bayesian hierarchical Poisson model for the prospective analysis of data for infectious diseases. The proposed model consists of two components. The first component describes the behavior of disease during nonepidemic periods and the second component represents the increase in disease counts due to the presence of an epidemic. A novelty of our model formulation is that the parameters describing the spread of epidemics are allowed to vary in both space and time. We also show how syndromic information can be incorporated into the model to provide a better description of the data and more accurate one-step-ahead forecasts. These real-time forecasts can be used to identify high-risk areas for outbreaks and, consequently, to develop efficient targeted surveillance. We apply the methodology to weekly emergency room discharges for acute bronchitis in South Carolina. Keywords Infectious disease, prospective analysis, epidemic prediction, targeted surveillance, Bayesian hierarchical space–time model, syndromic information

1 Introduction Infectious diseases pose a growing threat to population health. Due to the emergence and reemergence of infectious diseases with pandemic potential, there has recently been much interest in their analysis. Currently, a large amount of infectious disease data is routinely collected by laboratories, health care providers, and government agencies in an effort to increase the understanding of their evolution and prevent, detect, and manage infectious disease outbreaks. Numerous statistical models have been proposed to describe the dynamics of infectious diseases. Ideally, spatiotemporal modeling of infectious diseases would be done at individual level. This type of models provides an accurate description of the spread of epidemics through time and space at the level of individuals in a population. In addition, these models, which involve the probability of a susceptible individual being infected at a particular time period by coming into contact with 1 2

Department of Statistics and Operations Research, University of Valencia, Burjassot, Spain Department of Public Health Sciences, Medical University of South Carolina, Charleston, U.S.A.

Corresponding author: Ana Corbera´n-Vallet, Department of Statistics and Operations Research, University of Valencia, Doctor Moliner 50, 46100 Burjassot, Spain. Email: [email protected]

Downloaded from smm.sagepub.com at BRIGHAM YOUNG UNIV on March 15, 2015

Corbera´n-Vallet and Lawson

573

infectious individuals, allow for heterogeneity in the population via individual-level covariates.1,2 However, information regarding individual movement and contact behavior is scarcely ever available. In practice, infectious disease data often consist of small-area counts of the number of infected individuals at regular time intervals. The discrete susceptible-infectious-recovered (SIR) model and other compartmental models, such as the susceptible-exposed-infectious-recovered (SEIR) or susceptible-infectious-recovered-susceptible (SIRS) models, are commonly used to describe the progression of disease when aggregated data are available.3–5 Within this scenario, hierarchical Poisson models have been proposed as an alternative modeling framework when the number of susceptibles is unknown and the number of disease cases is small relative to the population size. For example, Muggling et al.6 proposed a Bayesian hierarchical model where the evolution of epidemics is described through changes in the relative risks of disease, which are modeled as multivariate Gaussian autoregressive AR(1) processes to capture the space–time dependence structure of epidemics. This retrospective modeling approach requires the prespecification of the change points determining the increasing and decreasing phases of epidemics. Knorr-Held and Richardson7 proposed a hierarchical model where the log of the relative risks is modeled through latent spatial and temporal components. An autoregressive term describing the dependence on the previous numbers of cases is incorporated into the relative risk model during epidemic periods, which are differentiated through latent indicators modeled as a two-stage hidden Markov model. More recent studies propose a two-component model that can be seen as a Poisson branching process model with immigration.8–10 In these models, the first component relates the incidence of disease to latent parameters describing endemic seasonal patterns. The second one, which captures occasional outbreaks beyond seasonal epidemics, is modeled with an autoregression on the previous number of cases, which affect the Poisson intensity additively rather than multiplicatively. Most of the models suggested in the literature have been developed for retrospective analyses of complete data sets. However, surveillance data in public health accumulate over time and sequential estimation of the model using all the data collected so far is essential to provide the state of infection in real time. Prospective analysis can also be used to predict the course of epidemics, in particular their onset. In a surveillance setting, quick detection of disease outbreaks is fundamental to implement timely public health response and mitigate the impact of epidemics. Within this context, syndromic surveillance has recently been introduced as an efficient tool to detect outbreaks of disease at the earliest possible time, possibly even before definitive disease diagnosis is obtained. The idea is to monitor syndromes or health-related data that precede diagnosis, such as visits to medical facilities or over-the-counter medication sales.11 Alarms are triggered when aberrations associated with syndromes are detected. However, as Chan et al.12 emphasized, alerts based on syndrome aberrations surely contain uncertainty, and so they should be evaluated with a proper probabilistic measure. Our goal in this paper is to develop a flexible methodology to prospectively analyze infectious disease surveillance data. Hierarchical Poisson models with two components describing the pattern of disease during nonepidemic and epidemic stages have proved fruitful in describing the progression of infectious diseases,9,10,13 and so we adopt here a similar modeling framework. During nonepidemic states, we decompose the log of the relative risk into additive components representing spatial and space–time interaction random effects, which allow the small-area relative risks to vary slightly around some stationary mean. The epidemic component, which represents the expected additive increase in disease counts due to the presence of an epidemic, is modeled as a function of previous counts of disease. A novelty of our model formulation is that the parameters describing the spread of epidemics are allowed to vary in both space and time. This feature allows us to closely reproduce the progression of infectious diseases.

Downloaded from smm.sagepub.com at BRIGHAM YOUNG UNIV on March 15, 2015

574

Statistical Methods in Medical Research 23(6)

We present then a joint modeling approach that incorporates information from a syndromic disease, that is a sentinel disease that precedes the disease of interest or other disease predisposing to the infection analyzed. The proposed multivariate model relates increases in the number of cases of the disease of interest to increases previously observed in the syndromic disease. So, instead of analyzing syndromes associated with disease, we model the infectious disease of interest and the syndromic disease jointly and quantify the effect that changes in the behavior of the syndromic disease have on the incidence of the infection under study. The multivariate model provides a better description of the data. In addition, it provides more accurate one-stepahead forecasts, especially at the epidemic onset if the syndromic disease outbreaks early enough. These forecasts can be used to detect high-risk areas for outbreaks and, consequently, to develop efficient targeted surveillance. The remainder of this paper is organized as follows. In Section 2 we present our modeling framework. In Section 3 we describe a simulation study designed to evaluate the performance of the proposed methodology. Section 4 provides an application to weekly emergency room discharges (ERDs) for bronchitis in South Carolina. Finally, we conclude with a general discussion of the proposed methodology and provide directions for future research.

2 Model specification 2.1 The univariate model Let yit be the count of disease in area i, i ¼ 1, 2, . . . , m, and time period t, t ¼ 1, 2, . . . , T. At the first level of the model hierarchy, we assume that the observed counts are Poisson distributed yit  Poðit þ Iit Þ

ð1Þ

where it is the nonepidemic component and Iit is the epidemic component, which represents the expected additive increase in disease counts due to epidemics. We assume here that the nonepidemic component is given by it ¼ eit it

ð2Þ

where eit being the expected count of disease representing the background population effect and yit the unknown area-specific relative risk at time t. The logarithm of the relative risk is usually decomposed into additive components representing spatial, temporal, and space–time interaction random effects.14,15 We adopt here a similar approach and model the log of the relative risk as logðit Þ ¼  þ ui þ vi þ it

ð3Þ

where  is the overall level of the relative risk in the study region; ui and vi represent, respectively, spatially correlated and uncorrelated extra variation; and it is the space–time interaction random effect. As a prior distribution for the intercept we assume the conventional zero-mean Gaussian distribution with variance 2 . We use the improper conditional autoregressive model16 as a prior for the spatially correlated heterogeneity, that is ! 1 X u2 ui juðiÞ  N uj , jni j j2ni jni j

Downloaded from smm.sagepub.com at BRIGHAM YOUNG UNIV on March 15, 2015

Corbera´n-Vallet and Lawson

575

where uðiÞ ¼ ðu1 , u2 , . . . , ui1 , uiþ1 , . . . , um Þ0 , ni is the set of spatial neighbors of the ith region, jnij is the cardinality of ni, and u2 is the correlated spatial component variance. Here the neighborhood is assumed to consist of spatially adjacent areas, but more general definitions are also possible. The prior distributions for the unstructured spatial heterogeneity and the space–time interaction effect are, respectively, vi  Nð0, v2 Þ and it  Nð0, 2 Þ. If the time series data present an overall time trend, a component  t may be included in model (3) to describe the temporal variation. Otherwise, the inclusion of adaptive time components is not advisable, since they are likely to track changes in risk due to epidemics. Hence, to better distinguish between nonepidemic and epidemic stages, we assume that each small-area relative risk varies slightly around its stationary mean  þ ui þ vi during nonepidemic states. Persistent increases in disease incidence are then attributable to epidemics. During an epidemic, counts of disease will trend upward over time until reaching the epidemic’s peak, which will be followed by a gradual downward trend. The component Iit in equation (1) represents the mean of the additional cases due to an epidemic. To capture both the temporal dependence in disease counts and the spatial spread of infectious diseases, we assume here that the expected increase in disease counts in a particular small area and time period depends on the previous number of cases in the area as well as in the neighboring areas. In particular, we model Iit as Iit ¼ it yi,t1 þ i

X

! yj,t1

ð4Þ

j2ni

where it is the autoregressive parameter and i measures, for the ith small area, the magnitude of the neighborhood effect, with higher values implying a higher degree of dependence of disease counts on the neighbors’ past counts. A Beta prior distribution is assumed for parameter i (see Section 4). A key feature of our model formulation is that the autoregressive parameter it is allowed to vary in both space and time. This is fundamental to properly capture the pattern of infectious disease data in the different regions under study: stability around the baseline disease level during nonepidemic periods (where it should be close to zero) and the different phases of an epidemic (growth and decline). Specifically, we model it as ( )     K  X 2k t 2k t bi,2k1 sin þ bi,2k cos þ it it ¼ exp bi,0 þ ! ! k¼1 it  Nð i,t1 ,  2 Þ;

i1  Nð0,  2 Þ

ð5Þ

where K is the number of sine–cosine waves necessary to model the seasonal variation; ! denotes the cycle (for instance, ! ¼ 52 for weekly data); each bi,k is assumed to follow a zero-mean Gaussian distribution, bi,k  Nð0, b2k Þ; and it allows for extra variation at each time period. We might have modeled it in different ways. For instance, we could have used a Bayesian change-point model with unknown number of change points,9 which are appropriate to capture sudden changes in infectiousness. Unlike this approach, model (5) allows for a smooth change in infectiousness. Finally, the standard deviation parameters are assumed to follow a uniform prior distribution on the interval (0,5).17 It is important to emphasize that, unlike previous formulations,8–10 the epidemic component Iit explains changes in disease incidence due to either seasonal epidemics or occasional outbreaks beyond seasonal patterns. The advantage of this formulation is that, by comparing Iit to it, we

Downloaded from smm.sagepub.com at BRIGHAM YOUNG UNIV on March 15, 2015

576

Statistical Methods in Medical Research 23(6)

can assess at any time period the proportional increase in disease counts relative to the baseline disease level.

2.2

The multivariate model

Let us now assume that, for each small area and time period, we observe the number of cases of both the disease of interest (yit) and a disease that is likely to precede the infection under study (ysit ). Following model (1), we consider the Poisson distribution with two components to model the within area variability of the counts yit  Poðit þ Iit Þ ysit  Poðsit þ Isit Þ

ð6Þ

During nonepidemic states, the two diseases may be influenced by common confounding factors, and so they are likely to be correlated. The use of a multivariate model accounting for risk correlations across both diseases and neighboring spatial units is then advisable to provide a better description of the data and enhance comprehension of disease dynamics. Knorr-Held and Best18 introduced a shared component model (SCM) for the joint spatial analysis of two related diseases where the underlying risk surface for each disease is separated into a shared and a diseasespecific component. These components can be interpreted as surrogates for spatially structured unobserved covariates that are either shared by both diseases or specific to one of the diseases. We use here the following SCM to describe the nonepidemic behavior of the diseases it ¼ eit it sit ¼ esit its

ð7Þ

logðit Þ ¼  þ ui þ vi þ it logðits Þ ¼ s þ ui þ vsi þ sit

ð8Þ

where

Note that we do not use the original model proposed in Knorr-Held and Best,18 but a generalized common spatial model,19,20 where a single spatially correlated component is used to model the correlation between diseases and locations. The scaling parameter  Nð0,  2 Þ allows for a different risk gradient for the syndromic disease. Similar to the univariate model, Iit and Isit represent the expected additive increases in disease counts due to epidemics. Epidemics of the syndromic disease will generally precede those of the infection of interest, and so we can benefit from this information to provide a better description of the infection under study. To relate increases in the number of cases of the disease of interest to increases previously observed in the syndromic disease, we model the epidemic components as Iit ¼ it yi,t1 þ i

X

! yj,t1 þ i Isi,t1

j2ni

Isit

¼

sit

ysi,t1

þ

is

X

! ysj,t1

j2ni

Downloaded from smm.sagepub.com at BRIGHAM YOUNG UNIV on March 15, 2015

ð9Þ

Corbera´n-Vallet and Lawson

577

where parameters it, i, sit , and is are defined as those in equation (4). For each spatial unit, the parameter i quantifies the effect that changes in the behavior of the syndromic disease have on the incidence of the infection of interest. As a prior distribution for this parameter we assume here the Beta distribution (see Sections 3 and 4). Note that this model formulation would also be appropriate for jointly modeling the infection of interest and any syndrome which is available in the form of aggregated counts in space and time; for instance, the number of children who do not attend school or units of medication sold. On those occasions, ysit would represent counts of the syndrome.

3 Simulation study In this section, we present a summary of the main results obtained in a simulation study that was carried out to assess the performance of the proposed methodology to both predict the onset of epidemics and describe their dynamics. In order to assess to what extent the multivariate model provides a better understanding of epidemic dynamics when the infection of interest depends on a syndromic disease, we simulated the data from the multivariate model as follows. The US state of South Carolina, which consists of m ¼ 46 counties, was used as the base map to generate counts of disease at county level for T ¼ 52 weeks. We used total weekly ERDs for influenza in South Carolina in 2009 to calculate weekly expected counts for the study region. In particular, the expected counts were calculated as popi eit ¼ r ð10Þ 100000 where popi is the population in county i and r represents the weekly influenza rate per 100,000 population during nonepidemic periods. Note that constant expected counts over time were assumed. The weekly disease rate for the syndromic disease was simulated as rs ¼ r þ Gað3, 1Þ, and the corresponding expected counts esit were calculated as those in equation (10). True background relative risks for the infection of interest and the syndromic disease during nonepidemic periods were simulated following equation (8), that is it ¼ expf þ ui þ vi þ it g and

its ¼ expfs þ ui þ vsi þ sit g

ð11Þ

the values of the standard deviances being  ¼ s ¼ 0:02, u ¼ 0:2, v ¼ vs ¼ 0:15, and  ¼ s ¼ 0:05; the scaling parameter was simulated as  Nð1, 0:252 Þ. Epidemics of the syndromic disease were generated for six counties in the northwest and five counties in the southeast of South Carolina (see Figure 1). We assumed that epidemics of the syndromic disease start at week t ¼ 48 and finish in the 7th week of the following year. Expected increases in the number of disease counts due to epidemics were simulated as Isit ¼ sit ysi,t1

ð12Þ

for the corresponding counties and epidemic periods t ¼ 1, 2, . . . , 7, 48, . . . , 52. In order to simulate small increases at the beginning of epidemics, which are more difficult to detect, we assumed si1  Gað13, 0:1Þ si7  Gað1, 0:1Þ si2  Gað13, 0:1Þ si48  Gað5, 0:1Þ si3  Gað10, 0:1Þ si49  Gað7, 0:1Þ

Downloaded from smm.sagepub.com at BRIGHAM YOUNG UNIV on March 15, 2015

578

Statistical Methods in Medical Research 23(6)

Figure 1. Simulation study. Counties undergoing disease outbreaks.

si4  Gað7, 0:1Þ si50  Gað9, 0:1Þ si5  Gað5, 0:1Þ si51  Gað10, 0:1Þ si6  Gað2, 0:1Þ si52  Gað12, 0:1Þ

ð13Þ

so that EðIsi48 Þ ¼ 0:5ysi,47 , EðIsi49 Þ ¼ 0:7ysi,48 , and so on. For these 11 counties, we also simulated epidemics of the infection of interest, which were assumed to start at week t ¼ 49. Increases in the number of cases for the epidemic periods were simulated as Iit ¼ it yi,t1 þ i Isi,t1

ð14Þ

where it were simulated in accordance to equation (13) with a 1-week delay, that is i49  Gað5, 0:1Þ, i50  Gað7, 0:1Þ, and so on. To better assess the effect that the use of syndromic information has on the description and prediction of disease dynamics, we did not consider here neighborhood effects. Also, to simulate different levels of dependence, we set i ¼ 0:8 for the six counties in the northwest and i ¼ 0:2 for the five counties in the southeast of South Carolina undergoing disease outbreaks. Once the values for the expected counts and relative risks were specified, we sequentially generated the expected increases in disease counts due to epidemics according to equations (12) and (14) and the observed counts in the study region as ysit  Poðesit its þ Isit Þ and yit  Poðeit it þ Iit Þ. To allow for sampling variability, we simulated 100 data sets. As an example of one of these data sets, Figure 2 displays the time plots of the simulated data for two counties undergoing disease outbreaks: Beaufort (located in the southeast) and Laurens (located in the northwest). The results presented here are averaged over the 100 realizations. At each time point t ¼ 47, 48, . . . , 51, we estimated both the univariate and the multivariate model using all the data observed up to time t and calculated one-step-ahead forecasts for week t þ 1. Because disease counts were simulated over a small number of time periods, we cannot properly estimate the seasonal variation of the autoregressive parameter it. Therefore, we substituted it ¼ expfbi þ it g for equation (5), so that the autoregressive parameter for each county is allowed to vary around the stationary level expfbi g. In the multivariate scenario, we modeled sit

Downloaded from smm.sagepub.com at BRIGHAM YOUNG UNIV on March 15, 2015

Corbera´n-Vallet and Lawson 60 20

30

40

disease of interest syndromic disease

0

10

Counts of disease

50

(a)

579

20

30

40

50

disease of interest syndromic disease

0

Counts of disease

(b)

10

10 20 30 40 50 60 70

0

0

10

20

30

40

50

Figure 2. Simulation study. Time plots for the counties of (a) Beaufort and (b) Laurens corresponding to one simulated data set.

in a similar way. As with many statistical models, the posterior distribution of the model parameters is not analytically tractable. Here parameter estimates were obtained via Markov chain Monte Carlo techniques with the free statistical software WinBUGS.21 To reduce the correlation in the sample, we kept one posterior sample in 20 iterations (i.e. thin ¼ 20) after the burn-in period (first 150,000 iterations) until a set of 2500 iterations was obtained. We experimented with a range of different prior specifications, all of them providing similar results. The results presented here correspond to the case where the noninformative Be(1,1) prior distribution is used for parameters i . The prior distributions for the remaining parameters in the model were specified as those in Section 2. To assess the appropriateness of the models to fit the data, we used the deviance information criterion (DIC).22 Table 1 presents the DIC values (together with the pD) obtained in the analysis of the simulated data for the infection of interest with both the univariate and the multivariate model using information from the syndromic disease. As can be seen, the multivariate model leads to an improved goodness of fit as judged by a lower DIC value. Finally, Table 2 presents the overall mean absolute forecast errors obtained with both models for weeks t þ 1 ¼ 48, 49, . . . , 52. We divided the counties in three groups. Group 1 and group 2 are composed, respectively, of the six counties in the northwest and the five counties in the southeast of South Carolina where disease epidemics were simulated. Group 3 corresponds to the remaining counties of South Carolina where no epidemics were simulated. As can be seen, when no epidemics

Downloaded from smm.sagepub.com at BRIGHAM YOUNG UNIV on March 15, 2015

580

Statistical Methods in Medical Research 23(6)

Table 1. Simulation study. DIC (pD) values for both the univariate and the multivariate model. Only the DIC value corresponding to the infection of interest is shown in the multivariate scenario. Data fitted

Weeks 1–47

Weeks 1–48

Weeks 1–49

Weeks 1–50

Weeks 1–51

Univariate model Multivariate model

4036.78 (179.80) 4027.64 (176.22)

4116.20 (182.09) 4109.18 (175.66)

4206.85 (186.87) 4198.69 (179.29)

4305.52 (195.85) 4292.51 (182.67)

4410.94 (207.96) 4390.04 (188.49)

Table 2. Simulation study. Overall mean absolute forecast errors obtained with both the univariate and the multivariate model. Predicted week Univariate model Multivariate model

Group Group Group Group Group Group

1 2 3 1 2 3

48

49

50

51

52

0.54 0.68 0.55 0.55 0.67 0.54

0.93 0.96 0.57 0.88 0.92 0.56

1.79 1.51 0.55 1.60 1.30 0.55

3.09 2.48 0.55 2.91 1.89 0.54

5.18 3.32 0.55 4.13 2.71 0.55

Group 1 and group 2 are composed, respectively, of the six counties in the northwest and the five counties in the southeast of South Carolina undergoing disease outbreaks. Group 3 corresponds to the remaining counties of South Carolina.

occur, the univariate and the multivariate model give similar one-step-ahead forecasts, since no additional information is provided by the syndromic disease. However, the multivariate model provides more accurate forecasts on those occasions where epidemics of the infection of interest are preceded by epidemics of the syndromic disease.

4 Case study: ERDs for bronchitis in South Carolina In this section, we analyze weekly ERDs for acute bronchitis in South Carolina in 2009. Bronchitis and pneumonia are the two most common lower respiratory infections, which are the third leading cause of death worldwide (WHO 2008, updated 2011). Even though these data have a fixed past time, we show here the results obtained in the prospective analysis of the data with the univariate model previously described. Because bronchitis may closely follow an upper respiratory infection, such as the cold or influenza, we also apply the multivariate model to ERD for bronchitis using as syndromic information the number of ERD for acute upper respiratory infections (AURI). Figure 3 displays total weekly ERD in South Carolina. The left and right Y axes correspond, respectively, to ERD for bronchitis and AURI, which are considerably larger throughout the year. Bronchitis and AURI can occur anytime, but they are more frequent during the winter months. The unusual behavior observed in 2009 is due to the H1N1 influenza virus, which arrived in South Carolina in April 2009.

Downloaded from smm.sagepub.com at BRIGHAM YOUNG UNIV on March 15, 2015

Corbera´n-Vallet and Lawson

0

10

20

30

40

ERD AURI

1500 2000 2500 3000

800 600 400

ERD bronchitis

1000

581

50

Week

Figure 3. Weekly emergency room discharges for bronchitis (left Y axis, solid line) and acute upper respiratory infections (right Y axis, dotted line) in South Carolina in 2009.

These data were also analyzed in Corbera´n-Vallet.23 With the objective of detecting outbreaks of disease at the very moment of their onset, the author proposed a multivariate surveillance technique that, by integrating information from multiple diseases, improves the sensitivity and timeliness of outbreak detection for events that are present in more than one data set. As expected, outbreaks of more serious diseases of the respiratory system, such as bronchitis and pneumonia, were occasionally preceded by outbreaks of milder diseases such as AURI. In addition to identifying possible increases in disease incidence, we also aim here to describe and predict the spatiotemporal behavior of bronchitis. Initially, we consider as historical data the observations for the first 30 weeks. We calculate the expected counts for each county by internal standardization using data from nonepidemic periods; in particular, we use the data from week beginning July 12 to week beginning July 26 (weeks 28–30 in Figure 3). At each time point t ¼ 30, 31, . . . , 51, we estimate the model using all the data observed up to time t and calculate one-step-ahead forecasts for week t þ 1. As in the simulation study, we substitute it ¼ expfbi þ it g for equation (5). As a prior distribution for parameter i we consider here the Be(1.5,1.5) distribution, which provides reasonable noninformativeness. Figure 4 displays the estimated relative risks of bronchitis at weeks beginning March 8, May 17, and July 26 (t ¼ 10, 20, 30) obtained with the univariate model when it is fitted to all the data up to week 30. These relative risks represent the baseline risks of bronchitis in each county (see equations (2) and (3)) which, as expected, are practically constant over time. Figure 5 shows the temporal profiles for a selection of five counties: Charleston, Cherokee, Florence, Greenville, and Richland, which are some of the counties that experience a major increase in the number of ERD for bronchitis due to epidemics. This figure also includes the estimated mean levels when the univariate model is initially fitted to the data from the first 30 weeks and one-step-ahead forecasts for weeks 31, 32, . . . , 52. Figure 6 displays the posterior mean of the regression parameters fit g30 t¼2 for these five counties. As it is shown, the estimated regression parameters change over space and time to properly capture the observed pattern of disease counts in each county. Posterior average estimates and 95% credible intervals for some of these parameters and parameter i are given in Table 3. Finally, Figure 7 shows the estimated mean level up to week 30 together with one-step-ahead forecasts and 95% prediction intervals for the total number of ERD for bronchitis in

Downloaded from smm.sagepub.com at BRIGHAM YOUNG UNIV on March 15, 2015

582

Statistical Methods in Medical Research 23(6)

(a)

(b)

1.6

(c)

1.6

1.6

Figure 4. Posterior estimates of the relative risks of bronchitis at week beginning (a) March 8, (b) May 17, and (c) July 26 obtained with the univariate model.

South Carolina. These results have been obtained by adding up those corresponding to the 46 counties. As can be seen, the model provides an adequate fit to the observed data. Because of the complexity of epidemic processes and the noise present in the data, prediction of the number of cases for the next week is complicated. Nevertheless, the proposed model provides quite reasonable onestep-ahead forecasts, the biggest forecast errors being those associated with the onset of epidemics. It is important to emphasize that the limited amount of available data prevents us from estimating the seasonal variation in the number of ERD for bronchitis. Consequently, the model cannot properly predict the beginning of epidemics unless additional information is available. We next investigate how information regarding changes in the incidence of AURI can be used to provide a better description of the data. Similar to the analysis carried out in the univariate setting, we sequentially estimate the multivariate model using the data observed up to week t ¼ 30, 31, . . . , 51 and calculate one-step-ahead forecasts for week t þ 1. As prior distributions for parameters i and is we assume the Be(1.5,1.5) distribution. For parameter i we consider here the Be(2,5) prior distribution. Note that the multivariate model incorporates a large amount of parameters describing the effect of previous counts of disease and changes in the behavior of the syndromic disease. Identification and estimation of all these effects can be difficult on some occasions. We performed a sensitivity analysis and experimented with a range of different prior specifications (including noninformative prior distributions), all of them providing similar results: a small dependence on increases in the incidence of the syndromic disease and posterior mean estimates of parameters i and is close to 0.5. Hence, to facilitate convergence of Monte Carlo simulations in the sequential analyses of the data, we considered these slightly informative prior distributions. Table 4 displays the posterior average estimates and 95% credible intervals for some parameters of the multivariate model. In general, posterior estimates for parameter i are similar to those obtained with the univariate model (see Table 3). However, the estimates obtained with the multivariate model for parameter it are smaller than those of the univariate model. This is due to the fact that, in the multivariate model, increases in the number of ERD for bronchitis are also explained by increases previously observed in the number of ERD for AURI. Consequently, the dependence on the previous numbers of cases is smaller. Table 5 presents, for a selection of time periods, the DIC values (together with the pD) obtained in the analysis of ERD for bronchitis with both the univariate and the multivariate model using as syndromic information the number of ERD for AURI. As can be seen, the multivariate model provides a better description of the data.

Downloaded from smm.sagepub.com at BRIGHAM YOUNG UNIV on March 15, 2015

Corbera´n-Vallet and Lawson

583

60 40 20

ERD bronchitis

(a)

ERD bronchitis

(b)

10

20

30

40

50

10 20 30 40 50

0

0

10

20

30

40

50

0

10

20

30

40

50

0

10

20

30

40

50

0

10

20

30

40

50

35 25 15 5

ERD bronchitis

(c)

20 40 60 80 100

ERD bronchitis

(d)

10 20 30 40 50 60

ERD bronchitis

(e)

Week

Figure 5. Temporal profiles for a selection of five counties together with the estimated mean level (dotted line) and one-step-ahead forecasts for weeks 31,32,. . . ,52 (points on dotted line) obtained with the univariate model. (a) Charleston, (b) Cherokee, (c) Florence, (d) Greenville, and (e) Richland.

Downloaded from smm.sagepub.com at BRIGHAM YOUNG UNIV on March 15, 2015

Statistical Methods in Medical Research 23(6)

0.5

584

0.0

0.1

0.2

0.3

0.4

Charleston Cherokee Florence Greenville Richland

0

5

10

15

20

25

30

Week

Figure 6. Estimated regression parameters from week beginning January 11 to week beginning July 26 (weeks 2–30) for a selection of five counties.

Table 3. Posterior mean estimates and 95% credible intervals for some parameters of the univariate model.

Charleston Cherokee Florence Greenville Richland

i2

i5

i20

i30

i

0.22 [0.11,0.38] 0.21 [0.10,0.40] 0.30 [0.15,0.54] 0.31 [0.18,0.51] 0.34 [0.19,0.58]

0.32 [0.16,0.56] 0.18 [0.09,0.36] 0.31 [0.16,0.54] 0.44 [0.26,0.74] 0.49 [0.28,0.80]

0.21 [0.08,0.43] 0.20 [0.09,0.38] 0.08 [0.02,0.20] 0.23 [0.10,0.42] 0.25 [0.08,0.48]

0.08 [0.02,0.20] 0.09 [0.02,0.23] 0.06 [0.01,0.17] 0.14 [0.04,0.35] 0.10 [0.03,0.26]

0.57 [0.10,0.96] 0.59 [0.18,0.95] 0.50 [0.08,0.92] 0.55 [0.10,0.94] 0.55 [0.09,0.95]

Figure 8 compares the one-step-ahead forecasts of ERD for bronchitis in Charleston, Cherokee, Florence, Greenville, and Richland from week 31 to week 52 obtained with both the univariate and multivariate models. For illustrative purposes, the number of ERD for AURI, which is used to calculate forecasts in the multivariate scenario, is also shown. For the county of Cherokee, a significant increase in the number of ERD for AURI occurs at week 34. This increase affects the prediction of the multivariate model for the number of ERD for bronchitis in week 35, which is larger than that of the univariate model and closer to the real value. So, the multivariate model provides a more accurate forecast at the beginning of the bronchitis epidemic in Cherokee. Although one-step-ahead forecasts obtained for weeks 36–52 with both models are similar, the multivariate model, which takes into account the variation observed in the number of ERD for AURI, provides, in general, more accurate forecasts (see Table 6). For the other four counties in Figure 8, bronchitis and AURI epidemics start at the same time period, and so one-step-ahead forecasts obtained with both models before and at the onset of epidemics of bronchitis are similar. Once a significant

Downloaded from smm.sagepub.com at BRIGHAM YOUNG UNIV on March 15, 2015

Corbera´n-Vallet and Lawson

200 400 600 800 0

ERD bronchitis

1200

585

0

10

20

30

40

50

Week

Figure 7. Estimated mean level (dotted line) from week beginning January 4 to week beginning July 26 (weeks 1–30) together with one-step-ahead forecasts (points on dotted line) and 90% prediction intervals for the total number of ERD for bronchitis in South Carolina from week 31 to week 52, 2009. Real values are represented by a solid line.

Table 4. Posterior mean estimates and 95% credible intervals for some parameters of the multivariate model.

Charleston Cherokee Florence Greenville Richland

i2

i5

i20

i30

i

i

0.10 [0.00,0.28] 0.16 [0.01,0.44] 0.13 [0.00,0.34] 0.21 [0.09,0.39] 0.24 [0.07,0.46]

0.11 [0.00,0.34] 0.09 [0.00,0.26] 0.12 [0.00,0.33] 0.33 [0.16,0.61] 0.28 [0.06,0.57]

0.04 [0.00,0.20] 0.04 [0.00,0.15] 0.03 [0.00,0.12] 0.12 [0.02,0.32] 0.09 [0.01,0.27]

0.04 [0.00,0.19] 0.02 [0.00,0.08] 0.02 [0.00,0.09] 0.11 [0.01,0.32] 0.05 [0.00,0.21]

0.51 [0.06,0.95] 0.48 [0.05,0.94] 0.50 [0.07,0.94] 0.52 [0.08,0.94] 0.54 [0.10,0.94]

0.29 [0.14,0.44] 0.57 [0.25,0.81] 0.20 [0.05,0.33] 0.15 [0.05,0.28] 0.20 [0.08,0.32]

Table 5. DIC (pD) values for both the univariate and the multivariate model. Only the DIC value corresponding to ERD for bronchitis is shown in the multivariate scenario. Data fitted Weeks 1–30 Weeks 1–33 Weeks 1–36 Weeks 1–39

Univ model

Multiv model

6746.64 (363.27) 7229.30 (274.11) 7983.41 (392.86) 8607.89 (316.69)

6538.69 (139.73) 7117.22 (168.75) 7763.08 (140.25) 8398.41 (85.62)

Data fitted Weeks 1–42 Weeks 1–45 Weeks 1–48 Weeks 1–51

Downloaded from smm.sagepub.com at BRIGHAM YOUNG UNIV on March 15, 2015

Univ model

Multiv model

9347.27 (396.19) 10,018.80 (373.52) 10,692.20 (256.32) 11,445.30 (359.79)

9317.18 (322.82) 9938.77 (262.01) 10,685.90 (343.46) 11,149.30 (66.69)

Statistical Methods in Medical Research 23(6)

250 200 150 100 36

38

40

42

44

46

48

50

52

90 80 70 60 40

50

ERD AURI

40 30 20 10

ERD bronchitis

34

50

32

(b)

ERD AURI

60 50 40 30 20

ERD bronchitis

(a)

70

586

32

34

36

38

40

42

44

46

48

50

52

36

38

40

42

44

46

48

50

52

250 200 150

ERD AURI

80 60

100

20

40

ERD bronchitis

34

100 120

32

(d)

ERD AURI

40

60

80

100

120

10 15 20 25 30 35 5

ERD bronchitis

(c)

34

36

38

40

32

34

36

38

40

42

44

46

48

50

52

42

44

46

48

50

52

300

32

150 100

ERD AURI

250 200

50 40 30 20

ERD bronchitis

60

(e)

Week

Figure 8. One-step-ahead forecasts of ERD for bronchitis from week beginning August 2 to week beginning December 27 (weeks 31–52) obtained with both the univariate model (points on dotted line) and the multivariate model (diamonds on dotted line) for a selection of five counties. Real values are represented by thick solid red lines. Black lines represent ERD for AURI. (a) Charleston, (b) Cherokee, (c) Florence, (d) Greenville, and (e) Richland.

Downloaded from smm.sagepub.com at BRIGHAM YOUNG UNIV on March 15, 2015

Corbera´n-Vallet and Lawson

587

Table 6. Overall mean absolute forecast errors for the weeks of prediction (weeks 31–52) obtained with both the univariate and multivariate model for a selection of five counties.

Charleston Cherokee Florence Greenville Richland

Univariate model

Multivariate model

11.17 6.54 6.53 16.11 6.58

10.71 5.57 6.09 14.56 6.42

increase in the number of ERD for AURI is detected, the multivariate model provides forecasts for the next week greater than those obtained with the univariate model. It is important to emphasize here that we are dealing with complex data that present a lot of variability. It may be the case that the dependence between increases in the incidence of the syndromic disease (AURI) and the disease of interest (bronchitis) changes during the forecast horizon and, as a consequence, forecasts obtained with the multivariate model are larger than the real values. This is the case, for instance, of the Richland county at week 36, where the increase in the number of ERD for bronchitis is not as high as expected based on the number of ERD for AURI observed in the previous week. Nevertheless, forecasts obtained with the multivariate model are usually closer to the real values (see Table 6). In addition, in a surveillance context where the main objective is to rapidly detect and follow outbreaks of disease, it is better to err on the side of caution and warn of possible increases in disease incidence that may need further investigation. These results apply to all the remaining counties in South Carolina. Finally, as we have mentioned before, an important feature of the proposed methodology is that it can provide a sense of the magnitude of the outbreak by comparing the epidemic component to the nonepidemic one. To illustrate this, Figure 9 depicts the estimated mean endemic levels obtained with the multivariate model for the county of Cherokee when it is fitted to data observed up to week 34 together with the proportional increases in disease counts relative to the baseline level, that is Iit =it . In addition, the figure also shows the predicted endemic level for week 35 and the corresponding expected proportional increase in disease counts. Increases at weeks 33 and 34 slightly greater than the ones observed in previous weeks indicate that the model has detected a small rise in the number of ERD for bronchitis during these weeks. Taking into account these increases and the high increase observed at week 34 in the number of ERD for AURI, the model alerts us of an epidemic that is rising at week 35, the predicted increase in disease counts due to the epidemic being approximately six times the endemic level.

5 Discussion In this paper we have proposed a model to prospectively analyze infectious disease data. Disease counts are assumed to follow a Poisson distribution whose mean is decomposed into a nonepidemic and an epidemic component, which captures increases in disease incidence due to epidemics. A novelty of the proposed model is that the regression parameters that explain the spread of disease are allowed to vary over space and time. Consequently, the model provides an accurate description of the time series data in each small area. We have also introduced a multivariate extension of the

Downloaded from smm.sagepub.com at BRIGHAM YOUNG UNIV on March 15, 2015

Statistical Methods in Medical Research 23(6)

4

Proportional increase

8 6

30 20

2

10

ERD bronchitis

40

588

0

5

10

15

20

25

30

35

Week

Figure 9. ERD for bronchitis in Cherokee from week 1 to week 34 (points on solid thick line) together with the estimated nonepidemic level (solid line) obtained with the multivariate model and the corresponding proportional increases in disease counts relative to the endemic level (right Y axis, dotted line). Predicted values for week 35 are also included.

model that incorporates information from a syndromic disease. As shown in a small simulation study and in the case study, the multivariate model provides a more accurate fit and, in general, more accurate one-step-ahead forecasts. A main feature of the proposed methodology is that, by comparing the epidemic component to the nonepidemic one, it allows us to assess if an epidemic is occurring at any small area and time period and, if so, quantify the proportional increase in disease counts relative to the baseline disease level during nonepidemic stages. It is not our intention to replace current surveillance techniques, but to develop a flexible modeling strategy to describe the progression of infectious diseases and provide situational awareness. The proposed methodology can be envisaged as a helpful tool to enhance existing surveillance techniques and develop targeted surveillance based on one-step-ahead forecasts. For instance, Corbera´n-Vallet and Lawson24 introduced a Bayesian model-based surveillance technique to detect small areas of increased disease incidence. The technique has proved to be a powerful strategy to detect sudden jumps in risk, but it may fail to detect a gradual change in risk because evidence against the in-control situation is not accumulated over time. For each small area, gradually increasing values of the epidemic component can help develop an aggregated surveillance technique which, by incorporating information from consecutive counts of disease above the endemic level, may improve outbreak detection capability. Also, an extension of the surveillance technique incorporating information from the spatial neighborhood has proved to improve outbreak detection when changes in disease incidence affect neighboring areas simultaneously.24 However, it is not clear yet how disease counts should be aggregated over space. The proposed model might help decide how information should be aggregated over time and space. It is important to emphasize that the proposed multivariate model incorporates information from a syndromic disease, that is a sentinel disease that may precede the disease of interest or other disease predisposing to the infection analyzed. We have analyzed here the number of ERD for acute bronchitis using as syndromic information the number of ERD for AURI. As shown in the case study, acute bronchitis may closely follow an upper respiratory tract infection, but it may also occur

Downloaded from smm.sagepub.com at BRIGHAM YOUNG UNIV on March 15, 2015

Corbera´n-Vallet and Lawson

589

on its own, and so AURI outbreaks do not always precede bronchitis epidemics. We are currently extending our methodology to incorporate information from common syndromes of disease such as school and work absenteeism, over-the-counter medication sales, or physician telephone calls.25 We are convinced that the integration of specific disease syndromes indicating early illness will provide more satisfactory results, in particular, more accurate one-step-ahead forecasts, especially at the onset of epidemics. This information will be fundamental to identify the start of epidemics in advance and assess their intensity. Another very fruitful area for further research is the use of syndromic information that is not available in the form of counts, for instance the percentage of influenza-like illness visits or meteorological factors such as temperature and vapor pressure.12,26 A different modeling framework should be used on those occasions, since the proposed multivariate model, which is based on the Poisson distribution, is no longer appropriate. A possibility may be to incorporate increases in the syndromic factor relative to a mean level directly into the epidemic component associated with the disease of interest. Here the neighborhood is assumed to consist of spatial neighbors defined by a common boundary. However, other choices are also possible. For instance, Paul et al.10 use air travel information in a retrospective analysis of the weekly number of deaths from influenza and pneumonia in 122 cities across the USA. Schro¨dle et al.27 use cattle trade network information to analyze Coxiellosis incidence on Swiss farms. This line of research may be particularly useful, since a neighborhood structure based on population movement may be more appropriate for modeling the spatiotemporal spread of infectious diseases. Finally, it would also be valuable to investigate alternative strategies to estimate the posterior distribution of the model parameters, since sequential estimation of the model using MCMC simulation techniques can be time consuming. Integrated nested Laplace approximations have been recently used to approximate posterior marginals in latent Gaussian models.28 This methodology gives very precise estimates in short computational time. In addition, it can be run in a user-friendly R environment. However, within the INLA setting, it is not possible to allow the autoregressive parameters to vary over space and time.27 An alternative approach would be the development of an adaptive MCMC algorithm.29 Funding This research was supported by Grant Number R03CA162029 from the National Cancer Institute.

References 1. Lawson AB and Leimich P. Approaches to the space-time modelling of infectious disease behaviour. IMA J Math Appl Med Biol 2000; 17: 1–13. 2. Deardon R, Brooks SP, Grenfell BT, et al. Inference for indivual-level models of infectious diseases in large populations. Stat Sin 2010; 20: 239–261. 3. Vynnycky E and White RG. An introduction to infectious disease modelling. Oxford: Oxford University Press, 2010, pp.13–40. 4. Morton A and Finkensta¨dt BF. Discrete time modelling of disease incidence time series by using Markov chain Monte Carlo methods. J R Stat Soc Ser C 2005; 54: 575–594. 5. Lawson AB and Song HR. Bayesian hierarchical modeling of the dynamics of spatio-temporal influenza season outbreaks. Spat Spatio-temporal Epidemiol 2010; 1: 187–195.

6. Mugglin AS, Cressie N and Gemmell I. Hierarchical statistical modelling of influenza epidemic dynamics in space and time. Stat Med 2002; 21: 2703–2721. 7. Knorr-Held L and Richardson S. A hierarchical model for space-time surveillance data on meningococcal disease incidence. J R Stat Soc Ser C 2003; 52: 169–183. 8. Held L, Ho¨hle M and Hofmann M. A statistical framework for the analysis of multivariate infectious disease surveillance counts. Stat Model 2005; 5: 187–199. 9. Held L, Hofmann M and Ho¨hle M. A two-component model for counts of infectious diseases. Biostatistics 2006; 7: 422–437. 10. Paul M, Held L and Toschke AM. Multivariate modeling of infectious disease surveillance data. Stat Med 2008; 27: 6250–6267. 11. CDC. Overview of syndromic surveillance What is syndromic surveillance?. MMWR 2004; 53: 5–11.

Downloaded from smm.sagepub.com at BRIGHAM YOUNG UNIV on March 15, 2015

590

Statistical Methods in Medical Research 23(6)

12. Chan T-C, King C-C, Yen M-Y, et al. Probabilistic daily ILI syndromic surveillance with a spatio-temporal Bayesian hierarchical model. PLoS One 2010; 5: e11626. 13. Heaton MJ, Banks DL, Zou J, et al. A spatio-temporal absorbing state model for disease and syndromic surveillance. Stat Med 2012; 31: 2123–2136. 14. Knorr-Held L. Bayesian modelling of inseparable spacetime variation in disease risk. Stat Med 2000; 19: 2555–2567. 15. Lawson AB. Bayesian disease mapping: Hierarchical modeling in spatial epidemiology, 2nd edn. Boca Raton, FL: Chapman & Hall, 2013, chap. 12. 16. Besag J, York J and Mollie´ A. Bayesian image restoration, with two applications in spatial statistics. Ann Inst Stat Math 1991; 43: 1–59. 17. Gelman A. Prior distributions for variance parameters in hierarchical models. Bayesian Anal 2006; 1: 515–533. 18. Knorr-Held L and Best NG. A shared component model for detecting joint and selective clustering of two diseases. J R Stat Soc Ser A 2001; 164: 73–85. 19. Wang F and Wall MM. Generalized common spatial factor model. Biostatistics 2003; 4: 569–582. 20. Ma H and Carlin BP. Bayesian multivariate areal wombling for multiple disease boundary analysis. Bayesian Anal 2007; 2: 281–302. 21. Lunn DJ, Thomas A, Best N, et al. WinBUGS – a Bayesian modelling framework: concepts, structure, and extensibility. Stat Comput 2000; 10: 325–337.

22. Spiegelhalter DJ, Best NG, Carlin BP, et al. Bayesian measures of model complexity and fit (with discussion). J R Stat Soc Ser B 2002; 64: 583–616. 23. Corbera´n-Vallet A. Prospective surveillance of multivariate spatial disease data. Stat Methods Med Res 2012; 21: 457–477. 24. Corbera´n-Vallet A and Lawson AB. Conditional predictive inference for online surveillance of spatial disease incidence. Stat Med 2011; 30: 3095–3116. 25. Kavanagh K, Robertson C, Murdoch H, et al. Syndromic surveillance of influenza-like illness in Scotland during the influenza A H1N1v pandemic and beyond. J R Stat Soc Ser A 2012; 175: 939–958. 26. Lowe R, Bailey TC, Stephenson DB, et al. The development of an early warning system for climatesensitive disease risk with a focus on dengue epidemics in Southeast Brazil. Stat Med 2013; 32: 864–883. 27. Schro¨dle B, Held L and Rue H. Assessing the impact of a movement network on the spatiotemporal spread of infectious diseases. Biometrics 2012; 68: 736–744. 28. Rue H, Martino S and Chopin N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations (with discussion). J R Stat Soc Ser B 2009; 71: 319–392. 29. Jewel CP, Kypraios T, Neal P, et al. Bayesian analysis for emerging infectious diseases. Bayesian Anal 2009; 4: 465–496.

Downloaded from smm.sagepub.com at BRIGHAM YOUNG UNIV on March 15, 2015

Prospective analysis of infectious disease surveillance data using syndromic information.

In this paper, we describe a Bayesian hierarchical Poisson model for the prospective analysis of data for infectious diseases. The proposed model cons...
553KB Sizes 1 Downloads 3 Views