Spatial and Spatio-temporal Epidemiology 11 (2014) 59–77


Spatial and Spatio-temporal Epidemiology journal homepage: www.elsevier.com/locate/sste

Original Research

Supervised learning and prediction of spatial epidemics

Gyanendra Pokharel ⇑, Rob Deardon

Department of Mathematics and Statistics, University of Guelph, ON N1G2W1, Canada

Article info

Article history:
Received 27 June 2013
Revised 22 July 2014
Accepted 11 August 2014
Available online 16 September 2014

Keywords:
Supervised learning
Spatial epidemic
Random forests
Spatial stratification

Abstract

Parameter estimation for mechanistic models of infectious disease can be computationally intensive. Nsoesie et al. (2011) introduced an approach for inference on infectious disease data based on the idea of supervised learning. Their method involves simulating epidemics from various infectious disease models and using classifiers built from the epidemic curve data to predict which model was most likely to have generated an observed epidemic curve. They showed that the classification approach could identify underlying characteristics of the disease system fairly well, without fitting various transmission models via, say, Bayesian Markov chain Monte Carlo. We extend this work to the case where the underlying infectious disease model is inherently spatial. Our goal is to compare the use of global epidemic curves for building the classifier with the use of spatially stratified epidemic curves. We demonstrate these methods on simulated data and apply them to a tomato spotted wilt virus epidemic dataset.

© 2014 Elsevier Ltd. All rights reserved.

⇑ Corresponding author at: Department of Mathematics and Statistics, University of Guelph, 50 Stone Road East, Guelph, ON N1G2W1, Canada. Tel.: +1 5197424025; fax: +1 519 837 0221. E-mail addresses: [email protected] (G. Pokharel), rdeardon@uoguelph.ca (R. Deardon).
http://dx.doi.org/10.1016/j.sste.2014.08.003

1. Introduction

Mathematical models are often useful to better understand infectious diseases, determine their population-level dynamics, and evaluate the impact of interventions. Mechanistic models that try to capture the underlying transmission mechanisms have received increasing attention for modeling space-time data. For example, Chis-Ster and Ferguson (2007) applied a space-time survival process (Cressie, 1993) to study the 2001 foot and mouth disease (FMD) outbreak in Great Britain. With respect to the same disease outbreak, Deardon et al. (2010) proposed a class of discrete-time individual-level models (ILMs) that can be used for modeling the spread of infectious disease when transmission depends on different individual-level risk factors. In such ILMs, the spatio-temporal aspects of infectious disease transmission can be easily included. However, parameter estimation for such models can be highly computationally intensive. Kwong and Deardon (2012) explored a methodology to reduce the time needed to calculate the likelihood in ILMs via linearization of parts of the model. Keeling and Rohani (2008) suggested that simulation-based models can be used for estimating the parameters via Monte Carlo methods. Data assimilation techniques for initializing real-time forecasts of seasonal influenza outbreaks were used by Shaman and Karspeck (2012). Inference without likelihoods for homogeneously mixing susceptible-exposed-infectious-removed (SEIR) models is described by, for example, McKinley et al. (2009).

Here, we focus on a method introduced by Nsoesie et al. (2011) that infers underlying infectious disease model parameters for ongoing outbreaks based on epidemic simulation and classification techniques. In a typical supervised classification scenario, the goal is to form, from a set of training samples, a description (classifier) that can be used to predict previously unseen examples. Each training data point consists of input features (Hastie et al., 2009), called predictors (independent or explanatory variables), and a qualitative dependent (response) variable that gives the correct classification of that data point. The primary purpose of supervised classification is to build a model from a set of categorized data points that can predict the correct class of unlabeled future data (Knights et al., 2011). Supervised classification techniques have been successfully applied to human microbiota and microarray data by Knights et al. (2011) and Lee et al. (2005), respectively, as well as to epidemic curve data by Nsoesie et al. (2011). The latter method is based on building a library of simulated epidemics using various underlying infectious disease models, and using the epidemic curves characterizing these simulations to train a classifier for use on observed data.

Nsoesie et al. (2011) used simulation studies to determine which classification methods perform better at predicting the underlying epidemic models with various sets of parameters, as an alternative to computationally burdensome Bayesian methods. They compared the performance of eight classification methods (three nearest neighbor methods, support vector machines (SVM), linear discriminant analysis (LDA), flexible discriminant analysis (FDA), random forests (RF), and a combined classifier) and suggested that random forest was the best performer. Following this general conclusion, we focus here on classification methods based upon random forests.

Among the various methods of supervised classification, random forests offer a powerful classification tool and are well established in numerous fields. They are highly accurate, efficient, and interpretable (Hastie et al., 2009). Bootstrap aggregation, also called bagging, is a technique for reducing the variance of an estimated prediction function using a combination of several predictors (Breiman, 1996; Hastie et al., 2009).
Random forest is an extension of bagging in which a predictor based upon a series of identically distributed trees is formed (Hastie et al., 2009). Random forests have several advantages. They work efficiently on large datasets in a short amount of time (Nsoesie et al., 2011). They estimate the importance of variables, avoid overfitting the data (Hastie et al., 2009), and can be used in clustering (Breiman, 2001). They are fairly insensitive to the values of tuning parameters such as the number of variables considered at each split and the subsample size (Breiman, 2001). This is in contrast to some other machine learning methods, such as support vector machines, which require an accurate choice of tuning parameters in order to obtain good results (Hastie et al., 2009).

Nsoesie et al. (2011) used an underlying agent-based network transmission model that assumes the existence of heterogeneities in the spread of an infectious disease, because human contact patterns are indeed heterogeneous. Here, we focus on disease systems in which the underlying transmission mechanism is spatial. The first objective is to test whether random forest classification methods can estimate the parameters and the underlying spatial mechanism using global epidemic curve data. The second objective is to compare the use of global epidemic curves versus spatially stratified epidemic curves for building the classifier. The third objective is to explore how easily the correct infectious disease model can be identified from the early part of the epidemic curves (partial epidemic curves). Here, spatial stratification refers to a method of providing a geographically explicit epidemic dataset from which we can obtain data about small population groups that could be more informative than data for the global population.

We simulate the epidemics using three special cases of the ILMs described by Deardon et al. (2010) and then observe the simulated epidemics in the global and spatially stratified populations. We then apply these methods to a tomato spotted wilt virus (TSWV) epidemic dataset (Hughes et al., 1997). We find that these classification techniques are able to identify models, and thus characterize disease dynamics, reasonably well. We also observe that spatial stratification of the population tends to improve the classification error rates. Finally, we see that classification rates improve as more data are observed, although there is relatively little improvement once data from after the epidemic peak are included.

The remainder of the paper is outlined as follows. The infectious disease transmission model framework described by Deardon et al. (2010), and some specific models used in this study, are discussed in Section 2. The epidemic classification study is outlined in Section 3, with results detailed in Section 4. Application of these methods to the TSWV data is discussed in Section 5. A discussion of results and further work is presented in Section 6.

2. Infectious disease transmission models

2.1. Model framework

The disease model framework that we use here to simulate the epidemics is presented in Deardon et al. (2010).
The model is a discrete-time SEIR individual-level model (ILM), which means an individual $i$ can be in one of four states at any time point: $i \in S$ means individual $i$ is susceptible to the disease, $i \in E$ implies individual $i$ has been exposed to the disease (infected but not able to transmit the disease), $i \in I$ implies individual $i$ is infectious (infected and able to transmit the disease to others), and $i \in R$ means individual $i$ has been removed from the population (e.g., through immunization, death, or quarantine) and is no longer able to infect other individuals. An individual contracting the disease goes through all four stages $S \to E \to I \to R$ in order. The numbers of time points required to move from $E \to I$ (latent period) and from $I \to R$ (infectious period) are denoted by $\gamma_E$ and $\gamma_I$, respectively. At time point $t$, the sets of susceptible, exposed, infectious, and removed individuals are denoted by $S(t)$, $E(t)$, $I(t)$, and $R(t)$, respectively. The complete epidemic history consists of information with respect to those four sets in the time frame $t = 1, 2, \ldots, t_{\max}$, where $t = 1$ is the time when the first infection is observed and $t_{\max}$ is the time when the last infection is observed. We suppose the latent period ($\gamma_E$) is equal to zero, so that the model in this context falls within a susceptible-infectious-removed (SIR) compartmental framework.
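The compartment bookkeeping described above can be illustrated with a minimal Python sketch. The function name `status` and the timing convention (an individual infected at `infection_time` spends `gamma_E` time points exposed, then exactly `gamma_I` time points infectious) are assumptions made for illustration; the text does not fix this convention.

```python
def status(t, infection_time, gamma_I, gamma_E=0):
    """Return the compartment ('S', 'E', 'I', or 'R') of one individual at time t.

    infection_time is None for an individual who is never infected.
    Assumed convention: exposed for gamma_E time points starting at
    infection_time, then infectious for gamma_I time points, then removed.
    """
    if infection_time is None or t < infection_time:
        return "S"
    if t < infection_time + gamma_E:
        return "E"
    if t < infection_time + gamma_E + gamma_I:
        return "I"
    return "R"


# With gamma_E = 0 the exposed state is skipped, giving the SIR special case.
history = [status(t, infection_time=3, gamma_I=2) for t in range(1, 7)]
print(history)
```

Setting `gamma_E=0`, as the paper does, means no time point is ever spent in the exposed state, so the sequence reduces to susceptible, infectious, removed.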


The general ILM presented by Deardon et al. (2010) describes the probability of susceptible individual $i$ being infected in the time interval $[t, t+1)$. It is given by

$$P(i,t) = 1 - \exp\left[-\left\{\phi_S(i) \sum_{j \in I(t)} \psi_T(j)\,\kappa(i,j) + \varepsilon(i,t)\right\}\right], \tag{1}$$

where $\phi_S(i)$ represents a function of risk factors associated with susceptible individual $i$ contracting the disease (i.e., susceptibility); $\psi_T(j)$ represents a function of risk factors associated with infectious individual $j$ passing on the disease (i.e., transmissibility); and $\kappa(i,j)$ represents a function of risk factors involving both susceptible and infected individuals, $i$ and $j$, and is called an infection kernel. Often, the infection kernel is a function of the Euclidean distance $d_{ij}$ between individuals $i$ and $j$. Other random behaviors of the epidemic are represented by $\varepsilon(i,t)$, which might be a function of risk factors associated with external agents or otherwise not addressed by the basic model framework. These behaviors could be purely random in nature or a function of susceptibility factors or the current size of the epidemic (Deardon et al., 2010). Here, we suppose $\phi_S(i)\,\psi_T(j) = \alpha$ and $\varepsilon(i,t) = \varepsilon$, and then $P(i,t)$, the probability that previously uninfected individual $i$ becomes infected at time $t$, is given by

$$P(i,t) = 1 - \exp\left[-\left\{\alpha \sum_{j \in I(t)} \kappa(i,j) + \varepsilon\right\}\right], \tag{2}$$

where $\alpha$ is an infectivity parameter reflecting the overall strength of the epidemic. Now, based on model (2), we define the following models.

2.2. Geometric model

This model is characterized by the geometric distance kernel $\kappa(i,j) = d_{ij}^{-\beta_G}$, where $\beta_G$ is a geometric decay or spatial parameter. It determines how the risk of infection varies with distance. The probability, $P(i,t)$, that susceptible individual $i$ is infected at time $t$ is then given by

$$P(i,t) = 1 - \exp\left[-\left\{\alpha \sum_{j \in I(t)} d_{ij}^{-\beta_G} + \varepsilon\right\}\right], \quad \alpha > 0,\ \beta_G > 0,\ \varepsilon \ge 0. \tag{3}$$

2.3. Exponential model

In this model, the kernel $\kappa(i,j)$ is defined by $\kappa(i,j) = \exp(-\beta_E d_{ij})$, where $\beta_E$ is a spatial parameter dictating how the distance between individuals will affect the probability of infection. The probability that individual $i$ becomes infected at time $t$ is then given by

$$P(i,t) = 1 - \exp\left[-\left\{\alpha \sum_{j \in I(t)} \exp(-\beta_E d_{ij}) + \varepsilon\right\}\right], \quad \alpha > 0,\ \beta_E > 0,\ \varepsilon \ge 0. \tag{4}$$

2.4. Neighborhood model

In this model, for some $r \in \mathbb{R}$, the kernel is given by $\kappa(i,j) = I[d_{ij} < r]$, where $r$ is the maximum distance that the disease can spread from infectious individual $j$ and $I[d_{ij} < r]$ is an indicator function defined by

$$I[d_{ij} < r] = \begin{cases} 1, & \text{if } d_{ij} < r, \\ 0, & \text{otherwise.} \end{cases}$$

The probability that susceptible individual $i$ becomes infected at time $t$ is then given by

$$P(i,t) = 1 - \exp\left[-\left\{\alpha \sum_{j \in I(t)} I[d_{ij} < r] + \varepsilon\right\}\right], \quad \alpha > 0,\ r > 0,\ \varepsilon \ge 0. \tag{5}$$

3. Epidemic classification study

Following Nsoesie et al. (2011), our goal here is to build a digital library of epidemic curves based on thousands of simulated epidemics. Those epidemics are then used to train a random forest-based classifier, which is in turn used to classify observed (in this case, other simulated) epidemics. First, we describe the epidemic simulation.

3.1. Epidemic simulation

A population of 625 individuals on the vertices of a uniform grid of size 25 × 25 is considered. Thus, the individuals are located at the points $(1,1), (1,2), \ldots, (1,25), (2,1), (2,2), \ldots, (2,25), \ldots, (25,25)$. One individual, approximately at the center of the population, is set as the initial seed for each simulation at time point $t = 1$. The epidemic then propagates via one of the three spatial ILMs described in Section 2 (geometric, exponential, or neighborhood models). The epidemic runs from $t = 1$ to $t_{\max}$, where $t_{\max}$ is either the time the epidemic dies out (i.e., no infectious individuals exist, and the spark term is zero so there cannot be any more) or the time the whole population becomes infected (i.e., the last day on which an infection occurs). Thus, $t_{\max}$ varies between simulations.

The parameter settings used in the simulation study are described in Table 1, and were defined in the following way. First, the parameter settings were assigned to the geometric model (all combinations of $\alpha = 0.3$; $\varepsilon = 0.00, 0.01, 0.02, 0.05$; $\gamma_I = 2, 3$; and $\beta_G = 1.5, 2.5, 3.0$). These 24 settings were chosen in a reasonably arbitrary manner, such that 'informative epidemics' (i.e., epidemics that do not die out or infect the whole population too quickly) were produced and the distributions of epidemics produced under each setting were discernibly different to the eye.

Table 1
Set of parameters used to simulate epidemics for the geometric, exponential, and neighborhood models. Global parameters are parameters common to all models; local parameters are spatial parameters that are different for each model. A total of 24 combinations of parameter values were used to simulate the epidemics for each of the models.

                     Global parameters                                                  Local parameters
Model                Infectivity ($\alpha$)  Spark ($\varepsilon$)     Infectious period ($\gamma_I$)  Spatial
Geometric model      0.30                    0.00, 0.01, 0.02, 0.05    2, 3                            $\beta_G$: 1.50, 2.50, 3.00
Exponential model    0.30                    0.00, 0.01, 0.02, 0.05    2, 3                            $\beta_E$: 0.30, 0.65, 0.90
Neighborhood model   0.30                    0.00, 0.01, 0.02, 0.05    2, 3                            $r$: 7.50, 3.50, 2.40

We then define the global epidemic curve of a given simulated epidemic to be $X = (x_1, x_2, \ldots, x_{t_{\max}})$, where $x_t$ is the number of infectious individuals at day $t$ and $t_{\max}$ is the last observed time point. We then define the epidemic length to be $L = \max\{t : x_t > 0\}$ and $\bar{L}_G$ to be the average length over the 400 epidemics simulated using the geometric model for one set of parameters.

In defining parameters for the exponential and neighborhood models, the geometric model with parameter values $\alpha = 0.3$, $\varepsilon = 0.0$, and $\gamma_I = 2$ was arbitrarily used as a baseline and 400 epidemic realizations were simulated. These same values of $\alpha$, $\varepsilon$, and $\gamma_I$ were used in the equivalent 'baseline' exponential and neighborhood models. The goal was then to select parameter values for the exponential and neighborhood kernels that were in some sense equivalent to the geometric kernel parameter settings. This was done by simulating sets of 400 epidemics from each model, for various settings of $\beta_E$ and $r$, respectively, and then calculating $\bar{L}_E$ and $\bar{L}_N$, the average lengths of the set of 400 epidemics simulated under a given value of $\beta_E$ or $r$. That is, $\beta_E$ was set equal to

$$\chi_E = \min_{\beta_E} \left(\bar{L}_G - \bar{L}_E\right)^2$$

and $r$ was set equal to

$$\chi_N = \min_{r} \left(\bar{L}_G - \bar{L}_N\right)^2,$$

as these values, respectively, minimize the squared difference between the average length of the epidemics simulated using the exponential or neighborhood model and the average length of the epidemics simulated using the geometric model under $\beta_G = 1.5$, $2.5$, and $3.0$ in turn. This led to the values of $\beta_E$ and $r$ shown in the last column of Table 1. Samples of epidemic curves simulated with one set of parameters for each model are given in Fig. 1.

Fig. 1. Epidemic curves simulated using three ILMs (geometric, exponential, and neighborhood). The parameter values used to simulate these curves are $\alpha = 0.3$, $\varepsilon = 0.05$, $\gamma_I = 3$, $\beta_G = 3.0$, $\beta_E = 0.9$, and $r = 2.4$. Each group of epidemic curves has 200 replications. (Horizontal axis: day; vertical axis: daily infectious individuals.)

3.2. Spatially stratified epidemic curves

As previously defined, $X = (x_1, x_2, x_3, \ldots, x_\tau)$ is a simulated epidemic curve recorded on the global population. These data can be used to build an epidemic classifier, but we also want to consider the performance of classifiers built on spatially stratified epidemic curves. Let the population be split into $s$ strata. Then, we denote the epidemic curve recorded in the $k$th stratum as $X_k = (x_{k1}, x_{k2}, x_{k3}, \ldots, x_{k\tau})$, where $x_{k\tau}$ is the number of infectious individuals at time point $\tau$ in stratum $k$. The full epidemic curve dataset for that epidemic is then given by $Y = (X_1, X_2, \ldots, X_k, \ldots, X_s)$, where $X_1 = (x_{11}, x_{12}, \ldots, x_{1\tau})$, $X_2 = (x_{21}, x_{22}, \ldots, x_{2\tau})$, ..., $X_s = (x_{s1}, x_{s2}, \ldots, x_{s\tau})$. The global epidemic curve data have $\tau$ explanatory variables, and the stratified epidemic curve data have $T = s\tau$. (The response variables for this study are the epidemic models for the different parameter combinations given in Table 1; see Section 3.1.) Two stratification methods were used in this study: rectangular and circular.

(1) Rectangular stratification. Two variants of this method were used. In the first, regular stratification, square strata of equal size were formed by imposing a regular grid of squares over the global population. In the second, irregular stratification, the same type of grid was imposed upon the population to form the strata but was shifted to ensure that the first infection was approximately at the center of one of the strata. Note that this gives strata of different sizes when imposed on the observed finite global population. We considered both forms of stratification under 2 × 2, 3 × 3, 4 × 4, 5 × 5, 6 × 6, and 7 × 7 degrees of resolution (i.e., 4, 9, 16, ..., 49 strata). With the initial infection placed in the center of a square population, under, say, the 5 × 5 resolution, the regular and irregular rectangular stratifications are equivalent. There is also less and less difference between regular and irregular stratification as the resolution increases (i.e., as the number of strata increases). However, for resolutions such as 2 × 2, 3 × 3, and 4 × 4, the epidemic curve data under the two forms of rectangular stratification can be quite different.

(2) Circular stratification. In this method, strata were formed using concentric circles of increasing radius centered upon the initial infection. The circles were defined by an increase in the radius of equal size for each new, larger concentric circle. One stratum would therefore be the central, smallest circle and the others would consist of rings, some of which would be truncated by the edge of the square population observed. Once again, various degrees of resolution were considered, with strata formed from 2, 3, 4, 5, 6, or 7 concentric circles, giving a central circular stratum with radius 8.49, 5.65, 4.24, 3.39, 2.83, or 2.42 units, respectively, and every other stratum defined by increasingly larger rings of the same respective widths.

Some examples of rectangular and circular stratifications are given in Fig. 2.

3.3. Within- and Between-kernel Analyses

Simulation studies of two broad themes were carried out: Within-kernel and Between-kernel Analyses.

(1) Within-kernel Analyses. In such analyses, the underlying spatial mechanism (geometric, exponential, or neighborhood) was assumed known and the goal was then to identify the model with the correct parameter values (i.e., purely parameter estimation).


We defined three Within-kernel Analyses (I, II, and III). Within-kernel Analysis I consisted of identifying epidemics simulated using the geometric model with the 24 sets of parameter combinations shown in Table 1. Within-kernel Analyses II and III similarly consisted of an attempt to identify epidemics simulated using the exponential and neighborhood models, respectively, each with the 24 sets of parameter combinations given in Table 1.

(2) Between-kernel Analyses. In these analyses, the aim was to identify both the parameter values and the kernel. Thus, these analyses consisted of epidemics simulated using each of the geometric, exponential, and neighborhood models in combination. Four Between-kernel Analyses (I, II, III, and IV) were carried out, each using a different set of values of the infection spark parameter ($\varepsilon$) together with the other parameter combinations shown in Table 1; see Table 2. Between-kernel Analyses I, II, and III consisted of the epidemics simulated using the geometric, exponential, and neighborhood models, each with 12 sets of parameter combinations. Therefore, Between-kernel Analyses I, II, and III each involved 36 models in total. Between-kernel Analysis IV consisted of the epidemics simulated using the geometric, exponential, and neighborhood models, each with 18 sets of parameter combinations. Therefore, Between-kernel Analysis IV involved a total of 54 models.

Finally, the four Between-kernel Analyses were also implemented for partial epidemic curves rather than the entire curve. Separate analyses were carried out for data over 5, 7, 10, and 15 days.

Fig. 2. Rectangular and circular spatial stratifications of the global population. The small black circle approximately at the center of the population is an infectious individual at time point $t = 1$. The top left panel represents 4 × 4 irregular stratification and the top right panel represents 7 × 7 regular stratification. The bottom left and right panels represent circular stratification of 4 rings and 7 rings with equal width, respectively.

Table 2
Set of values of the spark term for the four Between-kernel Analyses.

Between-kernel analysis   Values of spark term ($\varepsilon$)
I                         0.00, 0.01
II                        0.00, 0.02
III                       0.00, 0.05
IV                        0.00, 0.01, 0.02

3.4. Random forest-based epidemic classification

For each analysis carried out, training and test epidemic sets, each consisting of 200 epidemic realizations per generating model (i.e., parameter and, for Between-kernel Analyses, kernel combinations), were simulated. For example, test and training sets consisting of 3 kernels × 12 parameter combinations × 200 realizations = 7200 epidemics each were used in Between-kernel Analysis I. Each epidemic was converted to a set of global or stratified epidemic curves. A random forest classifier was then trained on the training set using the R package randomForest (Liaw and Wiener, 2002). The global or stratified epidemic curves in the test set were then used as input to the trained random forest classifier, with the model that generated them (i.e., one of 36 possible models for Between-kernel Analysis I) as the output class.

The epidemics were simulated until they died out or there were no more susceptibles in the population. More than 50% of the epidemics ended before 12 days (examples shown in Fig. 1). Therefore, only the first 20 days of the epidemic curve were used to build the classifier. Breiman (2001) suggested that setting the number of explanatory variables randomly sampled at each split equal to the square root of the total number of explanatory variables in the dataset generally gives near-optimum results. Thus, following Breiman (2001), the random forest classifier was built such that the number of randomly sampled explanatory variables ($x$'s) at each split was $\sqrt{\tau}$ for the global epidemic curve data and $\sqrt{T}$ for the stratified epidemic curve data. Other settings of the number of randomly sampled explanatory variables were tested, but no obvious differences in results were observed. For each analysis, the confusion matrix and prediction error rate were calculated.

The fact that the explanatory variables are randomly chosen when each classification tree is constructed means that some significant variables may not be selected and, as such, individual trees may be quite poor classifiers. In such cases, only time points at the beginning or end of the epidemic, when there is relatively little information in the epidemic curves, might be chosen for a single tree. However, combining a large number of such classifiers, as happens in a random forest, can result in a very high performance classifier (Hastie et al., 2009).

4. Results

4.1. Within-kernel Analyses

For classifiers based on global epidemic curves, there was a comparatively high risk of misclassification, often with a preference for models with the same infectivity, spatial, and infectious period parameters but different spark values. This proved to be the case for geometric, exponential, and neighborhood kernel models alike. For example, in Within-kernel Analysis I, among 200 epidemics simulated using the geometric kernel with parameter set $(\alpha, \varepsilon, \gamma_I, \beta_G) = (0.30, 0.01, 3, 1.5)$, 43 were classified as epidemics simulated from the geometric model with parameter set $(\alpha, \varepsilon, \gamma_I, \beta_G) = (0.30, 0.00, 3, 1.5)$ and 61 as epidemics from parameter set $(\alpha, \varepsilon, \gamma_I, \beta_G) = (0.30, 0.02, 3, 1.5)$ (Table 3). The misclassification rate was higher for epidemics simulated from models with $\varepsilon = 0.00$, $0.01$, and $0.02$ than for those simulated using $\varepsilon = 0.05$. Misclassification also tended to be higher when the underlying model had a spatially diffuse kernel (e.g., small $\beta_G$ or $\beta_E$, or large $r$). Similar results were observed for Within-kernel Analyses II and III. Values of the spatial parameters $\beta_G$, $\beta_E$, and $r$ for the geometric, exponential, and neighborhood models, respectively, were very well identified. In fact, the exponential kernel parameter, $\beta_E$, was identified correctly in each of the 200 test epidemics of Within-kernel Analysis II. Overall, epidemics produced under the geometric models were more successfully identified than those under the exponential and neighborhood models for the global epidemic data (Fig. 3). The prediction error in each Within-kernel Analysis using global epidemic curves was 13.72%, 16.08%, and 17.87% for the geometric, exponential, and neighborhood models, respectively.

As was the case for the global epidemic curves, classifiers built on both rectangular and circular stratifications had difficulty identifying the correct spark term for epidemics simulated using the geometric, exponential, or neighborhood models. The most difficulty was noted with respect to identifying the correct spark term for epidemics simulated from models with zero or very small spark terms ($\varepsilon$). The prediction error rates in Fig. 3 show that both rectangular and circular stratification offer an improvement over the equivalent global analysis. In both cases, this was most marked for the neighborhood Within-kernel Analysis; for example, the prediction error was 17.5% for the global epidemic curves, 12% for the 4 × 4 and 5 × 5 rectangular stratifications, and 10% for the 5-, 6-, and 7-ring circular stratifications. The improvement is smaller for the exponential models and smaller still for the geometric models, both of which have smaller initial prediction errors for the global curves than the neighborhood model analysis.
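To put the stratification resolutions discussed above in context, the following Python sketch works through the number of explanatory variables, $\tau$ for global curves and $T = s\tau$ for stratified curves, and the corresponding square-root rule from Section 3.4. It assumes $\tau = 20$ (the text states that the first 20 days of each curve were used) and assumes the square root is rounded down to an integer; the paper only states the square root itself, and the function names are illustrative.

```python
import math


def n_explanatory(n_days, n_strata=1):
    """Explanatory variables: tau for global curves, T = s * tau when stratified."""
    return n_days * n_strata


def n_sampled_per_split(p):
    """Square-root rule; rounding down to an integer is an assumption of this sketch."""
    return max(1, math.isqrt(p))


TAU = 20  # assumed: the first 20 days of each epidemic curve are used

settings = [("global", 1), ("4 x 4 rectangles", 16), ("5 x 5 rectangles", 25), ("7 rings", 7)]
for label, n_strata in settings:
    p = n_explanatory(TAU, n_strata)
    print(f"{label}: {p} explanatory variables, {n_sampled_per_split(p)} sampled per split")
```

The sketch makes visible how quickly the feature dimension grows with resolution, which is one plausible reading of why very fine stratifications need not keep improving the error rate.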

Table 3
Confusion matrix and model key: Within-kernel Analysis I (Geometric Model). Rows give the predicted model and columns give the model that generated the epidemic; the final row gives the class error for each generating model.

Predicted       M1   M10   M11   M12   M13   M14   M15   M16   M17   M18   M19    M2   M20   M21   M22   M23   M24    M3    M4    M5    M6    M7    M8    M9
M1             162     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0    38     0     0
M10              0    96     0     0     0     0     0    30     0     0     0     0     0     0     0     0     0     0    33     0     0     0     0     0
M11              0     0   164     0     0     0     0     0    27     3     0     0     0     0     0     0     0     0     0     7     0     0     0     0
M12              0     0     0   162     0     0     0     0     0    18     0     0     0     0     0     0     0     0     0     2     1     0     0     0
M13              0     0     0     0   136     0     0     0     0     0     6     0     0     0     0     0     0     0     0     0     0    37     0     0
M14              0     0     0     0     0   162     2     0     0     0     0     0     2     2     0     0     0     0     0     0     0     0    24     0
M15              0     0     0     0     0     0   159     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     3    14
M16              0    61     0     0     0     0     0   158     0     0     0     0     0     0     9     0     0     0     1     0     0     0     0     0
M17              0     0    29     0     0     0     0     0   170     1     0     0     0     0     0     4     0     0     0     0     0     0     0     0
M18              0     0     2    32     0     0     0     0     0   177     0     0     0     0     0     0     3     0     0     0     0     0     0     0
M19              0     0     0     0    11     0     0     0     0     0   194     0     0     0     0     0     0     0     0     0     0     0     0     0
M2               0     0     0     0     0     0     0     0     0     0     0   194     0     0     0     0     0     0     0     0     0     0     9     5
M20              0     0     0     0     0     1     0     0     0     0     0     0   197     2     0     0     0     0     0     0     0     0     0     0
M21              0     0     0     0     0     3     0     0     0     0     0     0     1   196     0     0     0     0     0     0     0     0     0     0
M22              0     0     0     0     0     0     0     8     0     0     0     0     0     0   191     0     0     0     0     0     0     0     0     0
M23              0     0     0     0     0     0     0     0     1     0     0     0     0     0     0   195     1     0     0     0     0     0     0     0
M24              0     0     0     0     0     0     0     0     2     1     0     0     0     0     0     1   196     0     0     0     0     0     0     0
M3               0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0   198     0     0     0     0     0     2
M4               0    43     0     0     0     0     0     4     0     0     0     0     0     0     0     0     0     0   166     0     0     0     0     0
M5               0     0     5     5     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0   189     0     0     0     0
M6               0     0     0     1     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     2   199     0     0     0
M7              38     0     0     0    53     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0   125     0     0
M8               0     0     0     0     0    34     4     0     0     0     0     4     0     0     0     0     0     0     0     0     0     0   164     0
M9               0     0     0     0     0     0    35     0     0     0     0     0     0     0     0     0     0     2     0     0     0     0     0   179
class.error   0.19  0.52  0.18  0.19  0.32  0.19 0.205  0.21  0.15 0.115  0.03  0.03 0.015  0.02 0.045 0.025  0.02  0.01  0.17 0.055 0.005 0.375  0.18 0.105

Model key. The global parameter infectivity ($\alpha$) is 0.3 for all models; the local parameter is the spatial parameter ($\beta_G$).

Spark ($\varepsilon$)   Infectious period ($\gamma_I$)   $\beta_G = 1.5$   $\beta_G = 2.5$   $\beta_G = 3.0$
0.00                     2                                M1                M2                M3
0.00                     3                                M4                M5                M6
0.01                     2                                M7                M8                M9
0.01                     3                                M10               M11               M12
0.02                     2                                M13               M14               M15
0.02                     3                                M16               M17               M18
0.05                     2                                M19               M20               M21
0.05                     3                                M22               M23               M24
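As a worked check of how the class errors in Table 3 relate to the counts, the following Python sketch recomputes the class error for two generating models from their Table 3 counts. The nested-dictionary layout (`confusion[true][predicted]`) and the function names are choices made for illustration and are not taken from the paper or from the randomForest package.

```python
def class_error(predicted_counts, true_label):
    """Fraction of epidemics from one generating model that were misclassified."""
    total = sum(predicted_counts.values())
    wrong = total - predicted_counts.get(true_label, 0)
    return wrong / total


def overall_error(confusion):
    """Misclassified epidemics over all epidemics, for confusion[true][predicted]."""
    total = sum(sum(row.values()) for row in confusion.values())
    wrong = sum(
        count
        for true_label, row in confusion.items()
        for predicted, count in row.items()
        if predicted != true_label
    )
    return wrong / total


# Counts read from Table 3 for two generating models (only nonzero entries listed).
confusion = {
    "M10": {"M10": 96, "M16": 61, "M4": 43},
    "M1": {"M1": 162, "M7": 38},
}

print(class_error(confusion["M10"], "M10"))  # 104 of 200 misclassified
print(class_error(confusion["M1"], "M1"))    # 38 of 200 misclassified
print(overall_error(confusion))              # pooled over these two models only
```

The first two values agree with the class.error entries for M10 and M1 in Table 3; the pooled value covers only the two models included here, not the full 24-model analysis.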


Fig. 3. Prediction error rate of Within-kernel Analysis: comparison of the prediction error rates of the global and stratified epidemic curves for Within-kernel Analyses I, II, and III. The epidemic curve data were simulated using geometric, exponential, and neighborhood models for various degrees of stratification. The left panel (Rectangular Stratification; horizontal axis: number of rectangles, from global through 4, 9, 16, 25, 36, and 49) represents the rectangular stratification, in which the 2 × 2, 3 × 3, and 4 × 4 grids are irregular and the 5 × 5, 6 × 6, and 7 × 7 grids are regular. The right panel (Circular Stratification; horizontal axis: number of rings, from global through 2 to 7) represents the circular stratification. The vertical axis of both panels is the prediction error rate.
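The regular rectangular and circular stratifications compared in Fig. 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the seed is taken to sit at (13, 13) on the 25 × 25 grid, the ring width is taken to be the distance from the seed to the farthest grid corner divided by the number of rings (which appears consistent, up to rounding, with the radii quoted in Section 3.2, e.g. $12\sqrt{2}/2 \approx 8.49$), and the irregular (shifted) rectangular variant is not covered. All function names are hypothetical.

```python
import math

GRID = 25        # individuals sit on the integer points (1..25, 1..25)
SEED = (13, 13)  # assumed location of the central initial infection


def rectangular_stratum(x, y, m, grid=GRID):
    """Stratum index (0 .. m*m-1) of point (x, y) under a regular m x m grid."""
    col = (x - 1) * m // grid
    row = (y - 1) * m // grid
    return row * m + col


def circular_stratum(x, y, n_rings, seed=SEED, grid=GRID):
    """Ring index (0 = central circle) of point (x, y) around the seed."""
    corners = [(1, 1), (1, grid), (grid, 1), (grid, grid)]
    max_dist = max(math.dist(seed, c) for c in corners)
    width = max_dist / n_rings  # assumed: equal-width rings covering the whole grid
    ring = int(math.dist(seed, (x, y)) / width)
    return min(ring, n_rings - 1)  # the farthest corner belongs to the outer ring


def stratified_curves(infectious_by_day, assign, n_strata):
    """curves[k][t] = number of infectious individuals in stratum k on day t."""
    curves = [[0] * len(infectious_by_day) for _ in range(n_strata)]
    for t, points in enumerate(infectious_by_day):
        for x, y in points:
            curves[assign(x, y)][t] += 1
    return curves


# Toy input: positions of infectious individuals on two consecutive days.
days = [[(13, 13)], [(13, 13), (14, 13), (1, 1)]]
curves = stratified_curves(days, lambda x, y: circular_stratum(x, y, 4), 4)
print(curves)
```

Concatenating the rows of `curves` gives the stratified feature vector $Y = (X_1, \ldots, X_s)$ of length $T = s\tau$ described in Section 3.2.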

Overall, circular stratification seems to be preferable; under the best choice of stratification resolution, the prediction error for circular stratification was lower than, or almost as low as, that for the equivalent optimal rectangular stratification resolution. Further, the performance of circular stratification appears to be less dependent upon the stratification resolution itself. Finally, the prediction error rate under rectangular stratification was lower for irregular stratification (i.e., where the initial infection was centered in one of the stratification regions) than for regular stratification (results not shown).

4.2. Between-kernel Analyses

In Between-kernel Analysis I, out of 200 epidemics simulated using the geometric kernel model with parameter set (α, ε, γ_I, β_G) = (0.30, 0.01, 3, 2.5), 25 were classified as epidemics simulated from the exponential kernel model using the parameter set (α, ε, γ_I, β_E) = (0.30, 0.00, 3, 0.65), 26 were classified as epidemics simulated using the exponential kernel model with the parameter set (α, ε, γ_I, β_E) = (0.30, 0.01, 3, 0.65), and 11 were classified as epidemics simulated using the neighborhood kernel model with the parameter set (α, ε, γ_I, r) = (0.30, 0.01, 3, 2.4) (Table 4). An examination of Table 4 indicates that epidemics simulated using any one of the geometric, exponential, or neighborhood models with low spark values have a tendency to be matched to one or more of the remaining models. The same characteristic was observed in the case of stratified epidemic curves. A similar tendency was shown in

Between-kernel Analyses II and III for epidemics generated by models with a low spark term (results not shown). This tendency is also illustrated in Fig. 4. Between-kernel Analysis III, which consisted of the epidemics simulated using all three (geometric, exponential, and neighborhood) models and spark values of 0.00 or 0.05, has the lowest prediction error rate for the global, rectangularly stratified, and circularly stratified epidemic curve-based classifiers alike. Between-kernel Analysis IV, which consisted of epidemics simulated using all three models with spark values of 0.00, 0.01, or 0.02, has the highest prediction error rate of all four Between-kernel Analyses. For each Between-kernel Analysis, both rectangular and circular stratification techniques performed better than the global epidemic curves in identifying the models most likely to have generated epidemics. For example, the left panel of Fig. 4 shows that the minimum prediction error rate occurs for the 5 × 5 rectangular stratification for all Between-kernel Analyses; thereafter it starts to increase, presumably due to increased noise in the data. The potential improvement in prediction error for classifiers built on stratified epidemic curves appears to be greater when the model choice covers different spatial kernels than in the Within-kernel Analyses. For example, random forest classifiers gave the minimum (12.25%) prediction error rate for Within-kernel Analysis I at 3 × 3 resolution stratification; this is about 11% lower than the prediction error rate for the global epidemic curves (13.72%). However, the minimum (13.73%) error rate at 5 × 5 resolution stratification for Between-kernel Analysis

Table 4 Confusion matrix and model key: Between-kernel analysis I.

Model M1 M10 M11 M12 M13 M14 M15 M16 M17 M18 M19 M2 M20 M21 M22 M23 M24 M25 M26 M27 M28 M29 M3 M30 M31 M32 M33 M34 M35 M36 M4 M5 M6 M7 M8 M9
M1 162 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 39 0 0
M10 0 152 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 38 0 0 0 0 0
M11 0 0 134 0 0 0 0 0 20 0 0 0 0 0 0 26 0 0 0 0 0 0 0 0 0 0 0 0 0 10 0 1 0 0 0 0
M12 0 0 0 138 0 0 0 0 1 0 0 0 0 0 0 0 40 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 0 0 0 0
M13 0 0 0 0 138 0 0 0 0 0 54 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0
M14 0 0 0 0 0 143 0 0 0 0 0 23 6 0 0 0 0 0 2 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 20 0
M15 0 0 0 0 0 0 172 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 27 0 0 0 0 0 0 0 0 0 0 0 0 0
M16 0 6 0 0 0 0 0 153 0 0 0 0 0 0 47 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
M17 0 0 25 0 0 0 0 0 135 0 0 0 0 0 0 4 0 0 0 0 0 2 0 0 0 0 0 0 0 4 0 40 0 0 0 0
M18 0 0 0 0 0 0 0 0 0 170 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 16 0 0 0
M19 0 0 0 0 57 0 0 0 0 0 146 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
M2 0 0 0 0 0 26 0 0 0 0 0 168 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 6
M20 0 0 0 0 0 1 0 0 0 0 0 0 152 0 0 0 0 0 0 0 0 0 0 0 0 1 19 0 0 0 0 0 0 0 24 0
M21 0 0 0 0 0 0 0 0 0 0 0 0 0 165 0 0 0 0 0 1 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 35
M22 0 0 0 0 0 0 0 43 0 0 0 0 0 0 152 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
M23 0 0 26 0 0 0 0 0 1 0 0 0 0 0 0 154 0 0 0 0 0 0 0 0 0 0 0 0 8 23 0 0 0 0 0 0
M24 0 0 0 46 0 0 0 0 0 0 0 0 0 0 0 0 153 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 3 0 0 0
M25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 171 0 0 0 0 0 0 38 0 0 0 0 0 0 0 0 0 0 0
M26 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 195 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0
M27 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 195 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0
M28 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 172 0 0 0 0 0 0 43 0 0 0 0 0 0 0 0
M29 0 0 1 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 197 0 0 0 0 0 0 0 10 0 0 0 0 0 0
M3 0 0 0 0 0 0 27 0 0 1 0 2 0 2 0 0 0 0 0 4 0 0 163 0 0 0 0 0 0 0 0 0 0 0 0 1
M30 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 195 0 0 0 0 0 0 0 0 5 0 0 0
M31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 29 0 0 0 0 0 0 162 0 0 0 0 0 0 0 0 0 0 0
M32 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 193 11 0 0 0 0 0 0 0 0 0
M33 0 0 0 0 0 0 0 0 0 0 0 0 16 0 0 0 0 0 3 0 0 0 0 0 0 6 139 0 0 0 0 0 0 0 8 0
M34 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 28 0 0 0 0 0 0 157 0 0 0 0 0 0 0 0
M35 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 184 5 0 0 0 0 0 0
M36 0 0 11 7 0 0 0 0 2 0 0 0 0 0 0 13 0 0 0 0 0 1 0 0 0 0 0 0 6 144 0 0 0 0 0 0
M4 0 42 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 162 0 0 0 0 0
M5 0 0 2 5 0 0 0 0 41 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 155 0 0 0 0
M6 0 0 0 0 0 0 0 0 0 23 0 0 0 0 0 0 7 0 0 0 0 0 0 2 0 0 0 0 0 0 0 2 176 0 0 0
M7 38 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 158 0 0
M8 0 0 0 0 0 27 0 0 0 0 0 6 22 0 0 0 0 0 0 0 0 0 0 0 0 0 16 0 0 0 0 0 0 0 147 0
M9 0 0 0 0 0 1 0 0 0 0 0 1 0 31 0 0 0 0 0 0 0 0 1 0 0 0 6 0 0 0 0 0 0 0 0 158
class.error 0.19 0.24 0.33 0.31 0.31 0.285 0.14 0.235 0.325 0.15 0.27 0.16 0.24 0.175 0.24 0.23 0.235 0.145 0.025 0.025 0.14 0.015 0.185 0.025 0.19 0.035 0.305 0.215 0.08 0.28 0.19 0.225 0.12 0.21 0.265 0.21

Model key (global parameters: infectivity α = 0.3 and spark ε; local parameters: infectious period γ_I and spatial parameter)

Kernel        Spark (ε)  Infectious period (γ_I)  Spatial parameter        Models
Geometric     0.00       2                        β_G = 1.50, 2.50, 3.00   M1, M2, M3
Geometric     0.00       3                        β_G = 1.50, 2.50, 3.00   M4, M5, M6
Geometric     0.01       2                        β_G = 1.50, 2.50, 3.00   M7, M8, M9
Geometric     0.01       3                        β_G = 1.50, 2.50, 3.00   M10, M11, M12
Exponential   0.00       2                        β_E = 0.30, 0.65, 0.90   M13, M14, M15
Exponential   0.00       3                        β_E = 0.30, 0.65, 0.90   M16, M17, M18
Exponential   0.01       2                        β_E = 0.30, 0.65, 0.90   M19, M20, M21
Exponential   0.01       3                        β_E = 0.30, 0.65, 0.90   M22, M23, M24
Neighborhood  0.00       2                        r = 7.50, 3.50, 2.40     M25, M26, M27
Neighborhood  0.00       3                        r = 7.50, 3.50, 2.40     M28, M29, M30
Neighborhood  0.01       2                        r = 7.50, 3.50, 2.40     M31, M32, M33
Neighborhood  0.01       3                        r = 7.50, 3.50, 2.40     M34, M35, M36

I is about 29% lower than the error rate for the global epidemic curves (19.27%). Similar tendencies were also noticed for other analyses. Circular stratification generally appears to offer the greatest and most consistent improvement, in addition to being less dependent on the stratification resolution.
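The relative improvements quoted above can be checked directly from the reported error rates. The short sketch below does this arithmetic; the function name is ours and purely illustrative, and the four percentages are the ones reported in the text.

```python
def relative_reduction(global_error, stratified_error):
    """Percentage by which the stratified error rate is lower than the global one."""
    return 100.0 * (global_error - stratified_error) / global_error

# Error rates (in %) as reported in the text.
within_I = relative_reduction(13.72, 12.25)   # Within-kernel Analysis I, 3 x 3 grid
between_I = relative_reduction(19.27, 13.73)  # Between-kernel Analysis I, 5 x 5 grid

print(f"Within-kernel Analysis I:  {within_I:.1f}% lower than global")
print(f"Between-kernel Analysis I: {between_I:.1f}% lower than global")
```

These are relative reductions; the corresponding absolute differences in error rate are 1.47 and 5.54 percentage points.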

4.3. Partial epidemic curve-based classification

The prediction error for the global, rectangularly, and circularly stratified epidemic curves consisting of 5, 7, 10, 15, and 20 days of data is given in Figs. 5 and 6. Identifying the correct epidemic model appears to be more difficult in the early stage of the epidemic than in the latter part. Identification of the correct model improves markedly for the global, rectangularly stratified, and circularly stratified epidemic curves on moving from 5 to 7 days' worth of data. Comparison with Fig. 1 shows that identifying the correct epidemic model is easier after the peak of infection than before. The prediction error rate is stable for classifiers built from global and all forms of stratified epidemic curves after 10 days. Depending upon the analysis being carried out, the improvement in misclassification error when moving from 5 to 10 days' worth of data can be between 10% and 20%. However, some of the spatially stratified classifiers still only produce prediction error rates of 0.1–0.3 when 5 days' worth of data are considered. Such classification-based results could be used to inform a full Bayesian analysis (see Section 6). Lastly, these epidemics are relatively quick; in slower-moving epidemics, characterization of disease dynamics may improve using only short ranges of data.

5. Application to real data

5.1. Tomato spotted wilt virus (TSWV)

TSWV is one of the most widespread and economically damaging plant viruses, infecting over 1,000 different plant species (Goldbach and Peters, 1994). Symptoms of TSWV differ between species and can also be variable within a single host species. Data from a 1993 study of TSWV in pepper plants, as described in Hughes et al. (1997), are considered here. This dataset consists of the TSWV infection history for 520 pepper plants collected in an experiment conducted inside a greenhouse. The plants were located in 26 rows one meter apart from each other, with each row consisting of 20 plants separated by a distance of half a meter. Therefore, the plants were located at coordinates (x, y), for each pairing of x = 1.0, 2.0, ..., 26.0 and y = 0.5, 1.0, ..., 9.5, 10.0. The disease status of the plants was first observed on 26 May 1993 and followed biweekly. In our model framework, we set t = 1 to be 26 May, t = 2 to be 9 June, ..., and t = 7 to be the last observation on 16 August. Therefore, the epidemic runs for t = 1, 2, ..., 7 in increments of 14 days. A total of 327 infected pepper plants was reported over the 82 days of the experimental period. We set the

[Figure 4: Prediction Error Rate of Between-kernel Analysis. Left panel: Rectangular Stratification (x-axis: Number of Rectangles; Global, 4, 9, 16, 25, 36, 49). Right panel: Circular Stratification (x-axis: Number of Rings; Global, 3, 4, 5, 6, 7). y-axis: Prediction Error Rate. Lines: Analysis I, Analysis II, Analysis III, Analysis IV.]

Fig. 4. Comparison of prediction error rate of the global and stratified epidemic curves for Between-kernel Analyses I, II, III, and IV. The epidemic curve data were simulated using geometric, exponential, and neighborhood models for various degrees of stratification. The left panel represents the rectangular stratification, in which the 2 × 2, 3 × 3, and 4 × 4 grids are irregular and the 5 × 5, 6 × 6, and 7 × 7 grids are regular. The right panel represents the circular stratification.

infectious period to be γ_I = 3, i.e., 42 days. This is based on the study of Brown et al. (2005), in which the infectious period for TSWV was estimated to be six weeks.

5.2. Classifier training library

The same three models discussed in Section 2 (geometric, exponential, and neighborhood) were considered for this dataset. A total of 200 epidemic realizations were simulated for each of various parameter settings under each of the respective models. In each case, the spark term was set to 0 on the basis that the experiment was carried out in a greenhouse and, hence, the chance of infection from outside the population of interest would be negligible. Therefore, each model had two parameters to be estimated: the infectivity parameter, α, and a spatial parameter (β for the geometric and exponential models and r for the neighborhood model). The parameter values were chosen in the following way. First, one setting of the parameters for each model was chosen in such a way that the epidemic curves generated using those parameter values in the corresponding model looked "broadly similar" to the real (TSWV) epidemic curve. A large set of other parameter values was then chosen to generate epidemic distributions that were discernibly different to the eye from the original but still produced broadly plausible epidemic curves (e.g.,

no overwhelming tendency for complete disease saturation far before the end of the experiment, no tendency for epidemics to die out without much infection taking place). The settings used are shown in Table 5: four settings of the infectivity parameter and nine settings of the spatial parameter (36 combinations in total) for each respective model. (These settings were included because they were found to cover the area of the parameter space leading to the highest class matching probabilities under the random forest classifier.) The classification library thus consisted of 7200 (200 × 36 = 7200) epidemic curves for each model, and this provided the training set for the Within-kernel global epidemic classification model. We also generated spatially stratified epidemic curve training sets (both rectangular and circular) using the same parameter values as for the global epidemic classifiers, as discussed in Section 3.2. Three Within-kernel (i.e., conditioning on the kernel) analyses were done using the real epidemic curve as the test set and the relevant training set described above. The class probability pertaining to each of the candidate models was calculated for the TSWV epidemic under the random forest classifier built from the training library. A Between-kernel (i.e., treating the kernel as unknown) analysis was also carried out. For the Between-kernel Analysis, we chose from each of the three Within-kernel Analyses the 10 models with the highest class probabilities. Therefore,

[Figure 5: four panels (Analysis I, Analysis II, Analysis III, Analysis IV). x-axis: Day (5, 10, 15, 20). y-axis: Prediction Error Rate (0.0 to 0.5). Lines: Global, 4 Rectangles, 9 Rectangles, 16 Rectangles, 25 Rectangles, 36 Rectangles, 49 Rectangles.]

Fig. 5. The prediction error rates based on 5, 7, 10, 15, and 20 days of data for four Between-kernel Analyses. The lines represent the prediction error rate for global and rectangularly stratified epidemic curves with various degrees of resolution.

[Figure 6: four panels (Analysis I, Analysis II, Analysis III, Analysis IV). x-axis: Day (5, 10, 15, 20). y-axis: Prediction Error Rate (0.0 to 0.5). Lines: Global, 2 Circles, 3 Circles, 4 Circles, 5 Circles, 6 Circles, 7 Circles.]

Fig. 6. The prediction error rates based on 5, 7, 10, 15, and 20 days of data for four Between-kernel Analyses. The lines represent the prediction error rate for global and circularly stratified epidemic curves with various degrees of resolution.

the training set for the Between-kernel Analysis had 30 models, with the same 200 realizations of each model forming the training library. See Section 6 for further discussion on how parameter values for the classifier library might be set up in reality.

5.3. Classification results

Table 6 shows the class probability and corresponding infectivity parameter (α) and spatial parameter (β_G) values for Within-kernel Analysis I (geometric model). A clear positive correlation is evident between α and β_G, as would be expected. The most probable parameter values under the global classifier were found to be (α, β_G) = (0.065, 2.0) with probability 0.774. A sizeable mass of 0.174 was also apportioned to (α, β_G) = (0.022, 1.0) (Table 6). The remaining mass was apportioned to parameter values in the vicinity of these points. A qualitatively similar spread of mass was observed under the global classifier of Within-kernel Analysis II (exponential model) and Within-kernel Analysis III (neighborhood model) (Tables 7 and 8); however, a negative correlation was observed between the infectivity (α) and spatial (r) parameters of the neighborhood model. Overall, identifying the most probable parameter values for each analysis using the global epidemic classifier was reasonably easy.
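To make the notion of a class probability concrete, the sketch below assumes that it is taken as the fraction of trees in the forest that vote for each class, which is a common convention for random forests; the source does not spell out the computation. The vote counts, the forest size of 500 trees, and the helper names are hypothetical and chosen only for illustration. The same ranking step mirrors the selection of the highest-probability models for the Between-kernel Analysis.

```python
from collections import Counter

def class_probabilities(tree_votes):
    """Fraction of trees voting for each class label."""
    counts = Counter(tree_votes)
    n_trees = len(tree_votes)
    return {label: count / n_trees for label, count in counts.items()}

def top_k(probabilities, k):
    """The k classes with the highest class probability, best first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Hypothetical votes from a 500-tree forest; labels are (alpha, beta_G) pairs.
votes = [(0.065, 2.0)] * 380 + [(0.022, 1.0)] * 90 + [(0.065, 2.75)] * 30

probs = class_probabilities(votes)
best = top_k(probs, 2)
for (alpha, beta_g), p in best:
    print(f"alpha={alpha}, beta_G={beta_g}: class probability {p:.3f}")
```

Under this convention the class probabilities sum to one over the candidate models, so they can be read as a rough allocation of support across the parameter grid rather than as a formal posterior.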

Two things occur when using the rectangular and circular stratified epidemic classifiers in Within-kernel Analyses I and II. First, the parameter values allocated the highest and second-highest probability masses under the global epidemic classifier switch positions and, approximately, exchange the probabilities afforded them. Second, the probability mass is slightly more spread out overall, suggesting slightly less certain parameter identification when using the stratified epidemic classifiers. Given the extra spatial information incorporated when using stratified epidemic classifiers, the results obtained are arguably more trustworthy. We note that the results are broadly consistent under different spatially stratified schemes. In Within-kernel Analysis III, the same parameter values are selected by all classifiers tried, although identifiability is less easily attained with the stratified epidemic classifiers. For the Between-kernel Analysis (results shown in Table 9), the exponential model with parameters α = 0.05 and β_E = 0.5 is given the highest class probability (0.458) and the geometric model with parameters α = 0.065 and β_G = 2.0 is given the second highest class probability (0.330) under the global epidemic classifier. These two models were also the ones preferred in their respective Within-kernel Analyses. Relatively little weight is afforded to any of the neighborhood models. Under the spatially stratified epidemic classifiers, a similar switching to that observed in Within-kernel Analyses I

and II occurs, involving models that differ not only in their parameter values but also in their kernels. Broadly, the favoured model under stratified epidemic classification appears to be the geometric model with parameter values α = 0.022 and β_G = 1.0, with the exponential model with parameters α = 0.095 and β_E = 0.1 coming in second (third in the case of the 2 × 2 rectangular classifier). Once again, we would probably give more weight to the spatially stratified classification results on the basis that such classifiers do incorporate some spatial information missing from the global epidemic classifiers. Finally, note that little weight is apportioned to any of the neighborhood models.

Table 5 Set of parameters used to simulate epidemics for the geometric, exponential, and neighborhood models. A total of 36 combinations of parameter values (4 infectivity and 9 spatial) were used to simulate the epidemics for each of the models.

Model name          Infectivity (α)              Spatial parameters
Geometric model     0.001, 0.022, 0.065, 0.500   0.001, 0.02, 0.2, 1.0, 1.5, 2.0, 2.75, 2.85, 3.0
Exponential model   0.001, 0.0095, 0.05, 0.5     0.05, 0.1, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5
Neighborhood model  0.0005, 0.05, 0.5, 1.2       8.25, 7.0, 6.5, 6.25, 5.5, 3.5, 2.0, 1.5, 1.0

5.4. Comparative Bayesian analysis

For comparative purposes, we also fitted each of the three models to the TSWV data using a full Bayesian approach, with random walk Metropolis-Hastings Markov chain Monte Carlo (MCMC). Exponential priors with a mean of 10^5 were used for each of the model parameters. These priors were chosen so as to have only a negligible influence upon the final posterior results.
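As a rough illustration of the sampler described above, the following sketch runs a random walk Metropolis-Hastings chain for a single positive parameter under an exponential prior with mean 10^5. It is not the authors' implementation: the Poisson log-likelihood and the count data are hypothetical stand-ins for the ILM likelihood, and only one parameter is updated, whereas each model fitted here has two.

```python
import math
import random

def log_exponential_prior(theta, mean=1e5):
    """Log density of an exponential prior with the given mean."""
    if theta <= 0.0:
        return -math.inf
    return -math.log(mean) - theta / mean

def poisson_log_likelihood(theta, counts):
    """Toy stand-in likelihood (hypothetical): Poisson counts with rate theta,
    dropping the term that does not depend on theta."""
    return sum(c * math.log(theta) - theta for c in counts)

def random_walk_metropolis(log_posterior, theta0, step_sd, n_iter, seed=1):
    """Gaussian random walk proposals with the usual accept/reject step."""
    rng = random.Random(seed)
    theta, current = theta0, log_posterior(theta0)
    chain = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step_sd)
        candidate = log_posterior(proposal)
        # Accept with probability min(1, exp(candidate - current)).
        if math.log(1.0 - rng.random()) < candidate - current:
            theta, current = proposal, candidate
        chain.append(theta)
    return chain

counts = [3, 5, 4, 6, 2]  # hypothetical data

def log_posterior(theta):
    prior = log_exponential_prior(theta)
    if prior == -math.inf:  # outside the support: reject without evaluating the likelihood
        return prior
    return prior + poisson_log_likelihood(theta, counts)

chain = random_walk_metropolis(log_posterior, theta0=1.0, step_sd=0.8, n_iter=20000)
kept = chain[2000:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
print(f"posterior mean of theta: {posterior_mean:.2f}")
```

With a prior mean as large as 10^5 the prior is nearly flat over the plausible range of the parameter, which is why it has only a negligible influence on the posterior in a sketch like this one.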

Table 6 Set of parameter values and corresponding class probabilities for both global and spatially stratified (rectangular and circular) epidemics. There were 36 classes based on the set of parameters for Within-kernel Analysis I.

Within-kernel Analysis I (Geometric model): class probability by classifier (Global epidemic; Stratified epidemic, Rectangular: 2 × 2, 3 × 3, 4 × 4; Stratified epidemic, Circular: 2 to 6 Rings)

Infectivity parameter α = 0.001
β_G       0.001  0.020  0.200  1.000  1.500  2.000  2.750  2.850  3.000
Global    0      0      0      0      0      0      0      0      0
2 × 2     0      0      0      0      0      0      0      0      0
3 × 3     0      0      0      0      0      0      0      0      0
4 × 4     0.004  0      0.002  0      0      0.002  0      0      0
2 Rings   0      0      0      0      0      0      0      0      0
3 Rings   0      0      0      0      0      0      0      0      0
4 Rings   0      0      0      0      0      0      0      0      0
5 Rings   0      0      0      0      0      0      0      0      0
6 Rings   0      0      0      0      0      0      0      0      0

Infectivity parameter α = 0.022
β_G       0.001  0.020  0.200  1.000  1.500  2.000  2.750  2.850  3.000
Global    0      0      0.040  0.174  0.004  0      0      0      0
2 × 2     0      0      0      0.62   0.034  0      0      0      0
3 × 3     0      0      0      0.712  0.058  0.004  0      0      0
4 × 4     0      0      0      0.558  0.066  0.020  0      0      0
2 Rings   0      0      0      0.634  0      0      0      0      0
3 Rings   0      0      0      0.718  0.010  0      0      0      0
4 Rings   0      0      0      0.834  0.006  0      0      0      0
5 Rings   0      0      0      0.740  0.040  0      0      0      0
6 Rings   0      0      0      0.674  0.052  0      0      0      0

Infectivity parameter α = 0.065
β_G       0.001  0.020  0.200  1.000  1.500  2.000  2.750  2.850  3.000
Global    0      0      0      0      0.020  0.774  0.028  0      0
2 × 2     0      0      0.072  0.010  0.012  0.302  0.026  0.006  0
3 × 3     0      0      0      0      0.054  0.128  0.020  0.014  0.006
4 × 4     0.002  0      0      0      0.016  0.296  0.030  0.010  0.002
2 Rings   0      0      0      0      0.020  0.246  0.054  0.002  0.002
3 Rings   0.002  0      0      0      0.008  0.252  0      0.008  0
4 Rings   0      0      0      0      0.010  0.142  0.006  0      0
5 Rings   0      0      0      0      0.038  0.138  0.018  0.018  0.008
6 Rings   0      0      0      0      0.020  0.240  0.002  0.012  0

Infectivity parameter α = 0.500
β_G       0.001  0.020  0.200  1.000  1.500  2.000  2.750  2.850  3.000
Global    0      0      0      0      0      0      0      0.010  0.034
2 × 2     0      0      0      0      0      0      0      0      0
3 × 3     0      0      0      0      0      0      0      0.002  0.002
4 × 4     0.004  0      0      0.002  0      0      0      0.002  0.006
2 Rings   0      0      0      0      0      0      0      0      0.002
3 Rings   0      0      0      0      0      0      0      0      0.002
4 Rings   0      0      0      0      0      0      0      0      0
5 Rings   0      0      0      0      0      0      0      0      0
6 Rings   0      0      0      0      0      0      0      0      0

For the geometric model, the posterior mean (95% percentile intervals shown in parentheses) of α was found to be 0.0198 (0.0135, 0.0262) and of β_G was 1.3892 (1.1090, 1.6363). For the exponential model, the posterior mean (and 95% PIs) of α was found to be 0.0303 (0.0169, 0.0477) and of β_E was 0.501 (0.3549, 0.6511). Both sets of results are broadly in line with (although, obviously, not exactly the same as) those found using the classification methods, providing some evidence that the classifications are producing sensible results. However, for the neighborhood model, the posterior mean (and 95% PIs) of α was 0.00535 (0.0044, 0.0062) and of r was 6.690 (6.3324, 7.3459). These results are obviously less in line with those found using the classification methods, with a noticeably larger radius, r, and smaller α being estimated using the Bayesian MCMC approach. This divergence between estimates appears to occur as a result of problems with the

model itself. A neighborhood model with a low radius is, of course, essentially a nearest neighbor model, and would be a reasonable model for some plant diseases (e.g., soil borne) if the observation interval was relatively small. However, the TSWV data include some obvious longer distance infection jumps between observation times (Hughes et al., 1997) that a neighborhood model with low radius cannot capture well. In fact, in a likelihood-based method (e.g., Bayesian) with no spark term included in the model (so that infections can only occur within the radius of current infections), the radius must be at least as large as the largest of these infection jumps or else a likelihood of 0 is calculated. This leads to an estimate of r that is relatively large, resulting in what would appear to be a quite unrealistic model: uniform infection within a fairly large distance and zero infection beyond. This strict requirement does not exist for simulation-based methods such as our classification

Table 7 Set of parameter values and corresponding class probabilities for both global and spatially stratified (rectangular and circular) epidemics. There were 36 classes based on the set of parameters for Within-kernel Analysis II.

Within-kernel Analysis II (Exponential model): class probability by classifier (Global epidemic; Stratified epidemic, Rectangular: 2 × 2, 3 × 3, 4 × 4; Stratified epidemic, Circular: 2 to 6 Rings)

Infectivity parameter α = 0.001
β_E       0.050  0.100  0.500  1.000  1.500  2.000  2.500  3.000  3.500
Global    0      0      0      0      0      0      0      0      0
2 × 2     0      0      0      0      0      0      0      0      0
3 × 3     0      0      0      0      0      0      0      0      0
4 × 4     0.020  0.060  0      0.002  0.002  0      0      0      0
2 Rings   0      0      0      0      0      0      0      0      0
3 Rings   0      0      0      0      0      0      0      0      0
4 Rings   0      0      0      0      0      0      0      0      0
5 Rings   0      0      0      0      0      0      0      0      0
6 Rings   0.002  0      0      0      0      0      0      0      0

Infectivity parameter α = 0.095
β_E       0.050  0.100  0.500  1.000  1.500  2.000  2.500  3.000  3.500
Global    0.026  0.144  0      0      0      0      0      0      0
2 × 2     0.040  0.706  0      0      0      0      0      0      0
3 × 3     0.038  0.778  0      0      0      0      0      0      0
4 × 4     0.008  0.630  0.002  0      0      0      0      0      0
2 Rings   0.020  0.664  0      0      0      0      0      0      0
3 Rings   0.016  0.764  0      0      0      0      0      0      0
4 Rings   0.012  0.818  0      0      0      0      0      0      0
5 Rings   0.042  0.856  0      0      0      0      0      0      0
6 Rings   0.012  0.896  0      0      0      0      0      0      0

Infectivity parameter α = 0.050
β_E       0.050  0.100  0.500  1.000  1.500  2.000  2.500  3.000  3.500
Global    0      0      0.786  0      0      0      0      0      0
2 × 2     0      0      0.288  0      0      0      0      0      0
3 × 3     0      0      0.170  0      0      0      0      0      0
4 × 4     0.004  0.002  0.304  0.004  0      0.002  0      0      0
2 Rings   0      0      0.308  0      0      0      0      0      0
3 Rings   0      0      0.220  0      0      0      0      0      0
4 Rings   0      0      0.162  0      0      0      0      0      0
5 Rings   0      0.004  0.102  0      0      0      0      0      0
6 Rings   0      0.002  0.072  0      0      0      0      0      0

Infectivity parameter α = 0.500
β_E       0.050  0.100  0.500  1.000  1.500  2.000  2.500  3.000  3.500
Global    0      0      0      0.016  0.028  0      0      0      0
2 × 2     0      0      0      0.002  0      0      0      0      0
3 × 3     0      0      0.002  0.010  0      0.002  0      0      0
4 × 4     0      0      0.002  0.014  0      0      0      0      0
2 Rings   0      0      0      0.006  0.002  0      0      0      0
3 Rings   0      0      0      0.002  0      0      0      0      0
4 Rings   0      0      0      0.008  0      0      0      0      0
5 Rings   0      0      0      0      0      0      0      0      0
6 Rings   0      0      0.004  0.014  0      0      0      0      0

methods because only epidemic curves are compared. This would explain why a smaller radius is estimated under the classification methods and, in this case, we would suggest that the classification method is likely to capture the spatial dynamics more accurately than a Bayesian analysis for this particular model. The fact that α is estimated as being smaller when r is found to be larger also makes sense, as the stronger epidemics suggested by a larger r can be compensated for with a smaller α. The log likelihood values under the posterior means were −860.02, −868.29, and −882.06 for the geometric, exponential, and neighborhood models, respectively. Not only are these results in line with the suggestion above that the neighborhood model does not fit the data as well as the other models, but they are also in line with the spatially stratified Between-kernel Analysis results, in which the geometric models were afforded a higher probability mass than the exponential models, and the neighborhood models were afforded relatively little probability mass.
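The zero-likelihood issue noted above for the neighborhood model can be made concrete with a small sketch. It assumes an infection probability of the form 1 − exp(−(α × number of infectious plants within radius r + ε)), which is our reading of a neighborhood-kernel ILM rather than a formula quoted from this section; the plant positions and function names are hypothetical, and only the contribution of newly infected plants to the likelihood is computed.

```python
import math

def infection_probability(plant, infectious, alpha, radius, spark=0.0):
    """Assumed neighborhood-kernel ILM: each infectious plant within `radius`
    contributes alpha to the infection pressure; `spark` is added on top."""
    n_near = sum(1 for source in infectious if math.dist(plant, source) < radius)
    return 1.0 - math.exp(-(alpha * n_near + spark))

def log_likelihood_new_infections(new_infections, infectious, alpha, radius, spark=0.0):
    """Log-likelihood contribution of the newly infected plants only
    (a full likelihood would also include plants that escaped infection)."""
    total = 0.0
    for plant in new_infections:
        p = infection_probability(plant, infectious, alpha, radius, spark)
        if p == 0.0:
            return -math.inf  # an infection the model says is impossible
        total += math.log(p)
    return total

# Hypothetical positions: one infectious plant, one nearby new infection,
# and one long-distance jump (6.5 m away).
infectious = [(1.0, 0.5)]
new_infections = [(2.0, 0.5), (7.0, 3.0)]

small_radius = log_likelihood_new_infections(new_infections, infectious, alpha=0.05, radius=2.4)
large_radius = log_likelihood_new_infections(new_infections, infectious, alpha=0.05, radius=7.5)
with_spark = log_likelihood_new_infections(new_infections, infectious, alpha=0.05, radius=2.4, spark=0.01)
print(small_radius, large_radius, with_spark)
```

With no spark term, a single jump longer than r drives the likelihood to zero, so a likelihood-based fit is forced towards a radius that covers the longest jump; a simulation-based classifier that compares only epidemic curves faces no such constraint.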

This also indicates that stratified classifiers are producing more reliable results than non-stratified global classifiers.

6. Discussion

The study had three primary goals. The first was to identify whether the "simulate-and-classify" methods of Nsoesie et al. (2011) work well for identifying infectious disease models when the underlying disease system is spatial. The second was to compare the performance of random forest classifiers built on global epidemic curve data versus spatially stratified epidemic data. The third was to explore how easily the correct infectious disease model can be identified from the early part of the epidemic curves (partial epidemic curves). Overall, results suggest that these classification methods can be used to identify the underlying spatial mechanism and parameter estimates reasonably well. Misclassification rates for spatially stratified classifiers were in

Table 8 Set of parameter values and corresponding class probabilities for both global and spatially stratified (rectangular and circular) epidemics. There were 36 classes based on the set of parameters for Within-kernel Analysis III.

Within-kernel Analysis III (Neighborhood model): class probability by classifier (Global epidemic; Stratified epidemic, Rectangular: 2 × 2, 3 × 3, 4 × 4; Stratified epidemic, Circular: 2 to 6 Rings)

Infectivity parameter α = 0.0005
r         8.250  7.000  6.500  6.250  5.500  3.500  2.000  1.500  1.000
Global    0      0      0      0      0      0      0      0      0
2 × 2     0      0      0      0      0      0      0      0      0
3 × 3     0      0      0      0      0      0      0      0      0
4 × 4     0.020  0      0      0      0.002  0      0.004  0      0
2 Rings   0.004  0      0      0      0      0      0      0      0
3 Rings   0.002  0.004  0      0.002  0      0      0      0      0
4 Rings   0.004  0.008  0.002  0      0      0      0      0      0
5 Rings   0.008  0.002  0.004  0      0      0      0      0      0
6 Rings   0.014  0      0      0      0.002  0      0      0      0

Infectivity parameter α = 0.050
r         8.250  7.000  6.500  6.250  5.500  3.500  2.000  1.500  1.000
Global    0      0      0      0      0.096  0.872  0      0      0
2 × 2     0.002  0.008  0.004  0.012  0.106  0.480  0.058  0      0
3 × 3     0.016  0.014  0.018  0.004  0.068  0.598  0.032  0      0
4 × 4     0.020  0.028  0.048  0.056  0.124  0.646  0      0      0.002
2 Rings   0.006  0.010  0.014  0.026  0.206  0.528  0.026  0      0
3 Rings   0.008  0.008  0.002  0.018  0.106  0.636  0.056  0.002  0
4 Rings   0.016  0.008  0.008  0.022  0.128  0.588  0.038  0.002  0
5 Rings   0.008  0.002  0.032  0.026  0.068  0.594  0.140  0.002  0
6 Rings   0.002  0.010  0.030  0.018  0.158  0.520  0.124  0      0

Infectivity parameter α = 0.800
r         8.250  7.000  6.500  6.250  5.500  3.500  2.000  1.500  1.000
Global    0      0      0      0      0      0      0.030  0.002  0
2 × 2     0      0.002  0.002  0      0.044  0.084  0.016  0.034  0
3 × 3     0      0      0.006  0      0.012  0.016  0.060  0.050  0
4 × 4     0      0.004  0      0      0.026  0.006  0.008  0      0
2 Rings   0      0.002  0.008  0.006  0.006  0.012  0      0.062  0
3 Rings   0      0.004  0.004  0      0.018  0.004  0.018  0.020  0
4 Rings   0.004  0      0      0      0.010  0.020  0.050  0.026  0
5 Rings   0      0      0      0.002  0.014  0.024  0.014  0.002  0
6 Rings   0      0      0      0      0      0.018  0.032  0.016  0

Infectivity parameter α = 1.200
r         8.250  7.000  6.500  6.250  5.500  3.500  2.000  1.500  1.000
Global    0      0      0      0      0      0      0      0      0
2 × 2     0      0.004  0.004  0      0.056  0.078  0.002  0.002  0
3 × 3     0      0.016  0.012  0.004  0.014  0.008  0.018  0.032  0.002
4 × 4     0      0      0      0.002  0.016  0      0.018  0.010  0
2 Rings   0      0.006  0.008  0.006  0.024  0.020  0      0.020  0
3 Rings   0      0.010  0.002  0.002  0.028  0.010  0.004  0.026  0.006
4 Rings   0.008  0.004  0.002  0      0.020  0.006  0.018  0.006  0.002
5 Rings   0.002  0.004  0.004  0.002  0.022  0.010  0.002  0      0.008
6 Rings   0      0      0      0      0      0.022  0.016  0.014  0.004

Table 9 Set of parameter values and corresponding class probabilities for both global and spatially stratified (rectangular and circular) epidemics. There were 30 classes based on the set of parameters for the Between-kernel Analysis.

Between-kernel Analysis: class probability by classifier (Global epidemic; Stratified epidemic, Rectangular: 2 × 2, 3 × 3, 4 × 4; Stratified epidemic, Circular: 2 to 6 Rings)

α       Spatial       Global  2 × 2  3 × 3  4 × 4  2 Rings  3 Rings  4 Rings  5 Rings  6 Rings
0.022   β_G = 0.2     0       0      0      0      0        0        0        0        0
0.022   β_G = 1.0     0.068   0.430  0.488  0.582  0.472    0.662    0.614    0.572    0.622
0.022   β_G = 1.5     0       0.036  0.018  0.064  0.038    0.012    0.014    0.048    0.028
0.065   β_G = 0.2     0       0      0      0      0        0        0        0        0
0.065   β_G = 1.0     0       0      0      0      0        0        0        0        0
0.065   β_G = 1.5     0.014   0.002  0.012  0.016  0.008    0.006    0        0.014    0.040
0.065   β_G = 2.0     0.330   0.230  0.094  0.246  0.140    0.146    0.090    0.090    0.192
0.065   β_G = 2.75    0.002   0.038  0.016  0.026  0.058    0.006    0        0.020    0.012
0.065   β_G = 2.85    0       0.002  0.002  0.008  0.004    0        0        0.006    0
0.5     β_G = 3.0     0       0      0.004  0.006  0        0        0        0        0
0.001   β_E = 0.1     0       0      0      0      0        0        0        0        0
0.0095  β_E = 0.05    0       0      0.020  0.002  0.006    0.004    0.002    0.006    0.002
0.0095  β_E = 0.1     0.030   0.150  0.230  0.012  0.120    0.120    0.172    0.178    0.078
0.0095  β_E = 0.5     0       0      0      0      0        0        0        0        0
0.05    β_E = 0.1     0       0      0      0      0        0        0        0        0
0.05    β_E = 0.5     0.458   0.110  0.116  0.002  0.150    0.044    0.108    0.066    0.026
0.05    β_E = 1.0     0       0      0      0.002  0        0        0        0        0
0.5     β_E = 0.5     0       0      0      0      0.002    0        0        0        0.002
0.5     β_E = 1.0     0.002   0      0      0      0.002    0        0        0        0
0.5     β_E = 1.5     0.044   0      0      0.002  0.002    0        0        0        0
0.05    r = 6.25      0       0      0      0.004  0        0        0        0        0
0.05    r = 5.5       0       0      0      0.002  0        0        0        0        0
0.05    r = 3.5       0.014   0.002  0.002  0.010  0        0        0        0        0
0.05    r = 2.0       0.024   0      0      0.02   0        0        0        0        0
0.8     r = 5.5       0       0      0      0.002  0        0        0        0        0
0.8     r = 3.5       0       0      0      0      0        0        0        0        0
0.8     r = 2.0       0.014   0      0.002  0.010  0        0        0        0        0
0.8     r = 1.5       0       0      0      0.002  0        0        0        0        0
1.2     r = 5.5       0       0      0      0      0        0        0        0        0
1.2     r = 3.5       0       0      0      0      0        0        0        0        0

the range of 10–15% for the Within-kernel Analyses and 10–30% for the Between-kernel Analyses. Generally, misclassifications involved choosing a model that was not particularly dissimilar to the true model. For example, an epidemic generated by a model with a diffuse spatial kernel and low spark term could easily be mistaken for one generated by a model with a less diffuse kernel and higher spark term. Such mistakes are intuitive, as both types of model tend to generate spatially diffuse infection patterns, whereas models with a strong spatial pattern (i.e., high spatial decay and low spark term) are more easily identified from their data. This has important practical ramifications, as these results imply that models fitted using such methods could be used in simulation studies to predict the course of an epidemic with a reasonable degree of confidence. One way to reduce the remaining uncertainty could be to use multiple models to simulate future epidemics, weighted perhaps by the class matching probabilities obtained through the classification process.

Classifiers built on spatially stratified epidemic curve data certainly have the potential to outperform classifiers built on global epidemic curve data. Overall, circular stratification seems to be preferable to rectangular stratification, tending to lead to classifiers with smaller prediction error rates that are more robust to the stratification resolution chosen.

The use of the early part of the epidemic for identifying the correct epidemic model worked relatively well, but data up to at least the peak of the outbreak were required to get near-optimal classification rates. Of course, this could be a product of the characteristics of the simulation study conducted (which involved relatively fast, short-lasting epidemics), and it would be of interest to test these methods on other disease systems.

Further work could also be carried out to better determine how to derive satisfactory stratification schemes in practice. As the resolution of the stratification increases (i.e., a larger number of smaller stratification regions are used), the data should be able to capture more information on more intricate spatial characteristics of the epidemic. However, model classification success will suffer once the resolution increases beyond a certain level because the larger number of potential explanatory variables will tend to introduce more noise to the system. It may be possible to optimize such schemes in practice by deriving measures of the information contained in epidemic curves under various stratification schemes.
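As a rough illustration of how global and spatially stratified epidemic curves of the kind compared here might be assembled into classifier inputs, consider the following minimal Python sketch. The function names, the equal-width rings, the choice of ring centre, and the toy infection times are all assumptions made for illustration and are not taken from the analyses reported above.

```python
import numpy as np

def epidemic_curve(infection_times, t_max):
    """Number of new infections at each time 1..t_max (time 0 = never infected)."""
    times = np.asarray(infection_times, dtype=int)
    counts = np.bincount(times[times > 0], minlength=t_max + 1)
    return counts[1:t_max + 1]

def rectangular_strata(x, y, k):
    """Label each individual with its cell (0..k*k-1) in a k x k grid over the bounding box."""
    def bin_index(v):
        v = np.asarray(v, dtype=float)
        edges = np.linspace(v.min(), v.max(), k + 1)
        # Use interior edges only, so that the maximum value falls in the last cell.
        return np.digitize(v, edges[1:-1])
    return bin_index(y) * k + bin_index(x)

def circular_strata(x, y, n_rings, centre):
    """Label each individual with its ring (0..n_rings-1); rings have equal width (assumed)."""
    d = np.hypot(np.asarray(x, dtype=float) - centre[0],
                 np.asarray(y, dtype=float) - centre[1])
    edges = np.linspace(0.0, d.max(), n_rings + 1)
    return np.digitize(d, edges[1:-1])

def stratified_features(infection_times, strata, n_strata, t_max):
    """Concatenate the per-stratum epidemic curves into a single feature vector."""
    times = np.asarray(infection_times, dtype=int)
    curves = [epidemic_curve(times[strata == s], t_max) for s in range(n_strata)]
    return np.concatenate(curves)

# Toy population on a 5 x 5 grid with made-up infection times (illustration only).
xs, ys = np.meshgrid(np.arange(1, 6), np.arange(1, 6))
x, y = xs.ravel(), ys.ravel()
rng = np.random.default_rng(0)
infection_times = rng.integers(0, 6, size=x.size)  # 0 means never infected
t_max = 5

global_curve = epidemic_curve(infection_times, t_max)
rect = stratified_features(infection_times, rectangular_strata(x, y, 2), 4, t_max)
ring = stratified_features(infection_times, circular_strata(x, y, 3, (3, 3)), 3, t_max)
```

Each stratified feature vector is simply the per-stratum curves laid end to end, so its length grows with the stratification resolution (here 4 × 5 = 20 values for the 2 × 2 grid). This makes concrete the trade-off noted above between spatial detail and the number of potential explanatory variables.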


We chose to compare three disease transmission models with geometric, exponential, and neighborhood kernels, respectively. Although disease dynamics can appear quite different among the three models at a fine level of detail, epidemics under these three models could be induced to have broadly similar global spatial dynamics. A question therefore arises as to whether the kernel would be identifiable solely from information contained in the (global or stratified) epidemic curve(s); that is, our intention was to address a potentially difficult model identification problem. A corollary is that we would expect these methods to perform better in situations comparing very different models, e.g., comparing spatial, small-world, scale-free, and/or random network-based models. In practice, however, disease transmission models may consist of a weighting of multiple kernels (e.g., a spatial kernel to represent airborne spread and a network kernel to represent some form of direct contact). It is an open question as to whether these sorts of classification methods will be able to successfully identify the transmission characteristics of systems in which such models would be applicable.

Nsoesie et al. (2014) extended the methods of Nsoesie et al. (2011), building a classifier using a Dirichlet process under a nonparametric Bayesian paradigm that allows matching of observed epidemics to a library of simulated epidemics in a non-spatial modeling setting. Although they found that random forest classifiers required less computation time, they found the Dirichlet process classifiers better at forecasting the epidemic peak time in advance. In most circumstances classification was equally good under both methods, but in some scenarios the random forest-based method tended to perform better. One avenue of further research they suggest is the use of an ensemble method combining the random forest and Dirichlet process classification methods.
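To make the kernel comparison discussed above more concrete, the following Python sketch writes down one common form for each of the three kernels, together with a hypothetical weighted combination of two of them. The functional forms (power-law decay, exponential decay, and an indicator of distance at most r), the placement of the spark term inside the exponent, and all function names are assumptions for illustration; the parameter values are borrowed from Table 9 purely as examples.

```python
import numpy as np

def geometric_kernel(d, beta_g):
    """Power-law decay with distance, d ** (-beta_g) (assumed form)."""
    return np.asarray(d, dtype=float) ** (-beta_g)

def exponential_kernel(d, beta_e):
    """Exponential decay with distance, exp(-beta_e * d) (assumed form)."""
    return np.exp(-beta_e * np.asarray(d, dtype=float))

def neighborhood_kernel(d, r):
    """1 for infectious individuals within distance r, 0 otherwise (assumed form)."""
    return (np.asarray(d, dtype=float) <= r).astype(float)

def mixed_kernel(d, weight, beta_g, r):
    """Hypothetical convex combination of a geometric and a neighborhood kernel."""
    return (weight * geometric_kernel(d, beta_g)
            + (1.0 - weight) * neighborhood_kernel(d, r))

def infection_probability(alpha, kernel_values, spark=0.0):
    """Probability that a susceptible is infected in one time step:
    1 - exp(-(alpha * sum of kernel values over infectious individuals + spark))."""
    return 1.0 - np.exp(-(alpha * np.sum(kernel_values) + spark))

# Distances from one susceptible individual to three infectious individuals (made up).
d = np.array([1.0, 2.0, 4.0])

p_geometric = infection_probability(0.065, geometric_kernel(d, 2.0))
p_exponential = infection_probability(0.05, exponential_kernel(d, 0.5))
p_neighborhood = infection_probability(0.05, neighborhood_kernel(d, 3.5))
p_mixed = infection_probability(0.05, mixed_kernel(d, 0.5, 2.0, 3.5))
```

Under these assumed forms, a mixed kernel introduces an extra weight parameter alongside the parameters of its components, which is one reason such systems may be harder to identify from epidemic curves alone.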
The main advantage that classification schemes such as those used here would have over more traditional methods of model fitting (e.g., MCMC) is that of reduced computational burden. It is far quicker to simulate from an ILM-type infectious disease model than to calculate the likelihood, especially for large datasets. However, as suggested by Nsoesie et al. (2011), application of these classification methods to real data would likely require a vast library of simulated epidemics to capture epidemics under a very large range of models. Producing this library may itself present computational difficulties because it would likely need to be derived for a specific population once an epidemic has begun. A further exploration of the computational costs and benefits for large systems would therefore be of interest.

A related idea is that of working on larger real datasets. The TSWV data were chosen so that a comparative Bayesian MCMC analysis could be easily carried out. However, work on a larger dataset, so as to give a more realistic view of the computational demands that would be experienced in practice, would obviously be of use.

Here, our simulation studies were based on a population spread over a uniform grid. This was done to allow exploration of these methods over as simple and easily reproducible a population as possible. We have, however, tested these classification methods on data generated in other ways (e.g., X–Y locations simulated from a bivariate normal distribution), with pleasingly similar results to those reported here. Further, moving to populations with more heterogeneity in the spatial layout may provide a more natural way to spatially stratify; for example, a simple spatial clustering method could be used to identify strata.

As with any method, of course, there are some limitations and dangers to using a classification method to fit a disease transmission model to data. One problem inherent in any situation evaluating a model at a finite set of discrete points in an otherwise continuous parameter space is that vital parts of the parameter space may not be observed. For example, if observations are taken over a (high-dimensional) grid, there is the possibility that the span of the grid does not include areas that give the best model fit, or that the grid is too coarse, thus allowing areas of parameter space associated with the best model fit to fall between observed points. One way around this problem, in addition to the aforementioned approach of using as large a classification library as possible, would be to use some sort of adaptive scheme, where the grid grows in the direction of high matching probability, or homes in to finer detail on areas of the parameter space of interest.

Of course, these classification methods only provide an approximation of a full Bayesian analysis, and thus need to be considered in that light. The authors recommend the use of these methods only when speed is of the utmost importance and a non-approximate fitting procedure would be too time consuming to be practically useful. It may also be that these methods could provide a springboard to doing a full Bayesian analysis, simply providing a first-stage screening process on the model and parameter space.

Temporal dynamics are obviously key to any infectious disease system.
However, random forest classifiers ignore the temporal dependence between the explanatory variables and select them randomly. If the classifier works well and identifies a reliable transmission model, then that model can obviously be used to make predictions about the future course of an epidemic. It remains an open question, though, whether using temporal information in the classification process could lead to better classifiers. For example, it might be of use to consider the change in the number of cases between time points (as well as the number of cases at each time point) as explanatory variables. Alternatively, randomly sampling each x variable in the epidemic curve could be replaced by randomly sampling blocks of these variables. Both of these methods could introduce more correlation between trees, which would tend to have a detrimental effect on the overall random forest classifier. However, they could improve the predictive ability of individual trees, which would tend to have a positive effect. Further work would be required to ascertain whether the overall effect on classification accuracy would likely be positive or negative in practice.

Finally, a major advantage of the use of a Bayesian MCMC framework for fitting infectious disease models is that it enables us to build complex hierarchical models accounting for various sources of uncertainty (e.g., underreporting and reporting delay). It is, of course, possible to simulate epidemics incorporating such mechanisms for


the epidemic test set or library. However, the ease of identifying the mechanisms of observation models such as these is also an open question that could be the subject of further work.

Acknowledgements

This work was funded by the Ontario Ministry of Agriculture, Food and Rural Affairs (OMAFRA) and the Natural Sciences and Engineering Research Council of Canada (NSERC), and was carried out on equipment funded by the Canada Foundation for Innovation.

References

Breiman L. Bagging predictors. Mach Learn 1996;24:123–40.
Breiman L. Random forests. Mach Learn 2001;45:5–32.
Brown S, Csinos A, Díaz-Pérez JC, Gitaitis R, LaHue SS, Lewis J, et al. Tospoviruses in Solanaceae and other crops in the Coastal Plain of Georgia. Coll Agric Environ Sci Univ Ga Res Rep 2005:704–19.
Chis-Ster I, Ferguson N. Transmission parameters of the 2001 foot and mouth epidemic in Great Britain. PLoS One 2007;2(6).
Cressie N. Statistics for spatial data. Wiley Series in Probability and Mathematical Statistics; 1993.
Deardon R, Brooks S, Grenfell T, Keeling M, Tildesley M, Savill N, Shaw D, Woolhouse M. Inference for individual-level models of infectious diseases in large populations. Stat Sin 2010;20:239–61.
Goldbach R, Peters D. Possible causes of the emergence of Tospovirus diseases. Semin Virol 1994;5:113–20.
Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. New York, NY: Springer; 2009.
Hughes G, McRoberts N, Madden L, Nelson S. Validating mathematical models of plant-disease progress in space and time. IMA J Math Appl Med Biol 1997;14:85–112.
Keeling M, Rohani P. Modeling infectious diseases in humans and animals. Princeton University Press; 2008.
Knights D, Costello E, Knight R. Supervised classification of human microbiota. FEMS Microbiol Rev 2011;35:343–59.
Kwong G, Deardon R. Linearized forms of individual-level models for large-scale spatial infectious disease systems. Bull Math Biol 2012;74(8):1912–37.
Lee J, Lee J, Park M, Song S. An extensive comparison of recent classification tools applied to microarray data. Comput Stat Data Anal 2005;48:869–85.
Liaw A, Wiener M. Classification and regression by randomForest. R News 2002;2(3):18–22.
McKinley T, Cook A, Deardon R. Inference in epidemic models without likelihoods. Int J Biostat 2009;5(1).
Nsoesie E, Beckman R, Marathe M, Lewis B. Prediction of an epidemic curve: a supervised classification approach. Stat Commun Infect Dis 2011;3(1).
Nsoesie E, Leman SC, Marathe M. A Dirichlet process model for classifying and forecasting epidemic curves. BMC Infect Dis 2014;14(12).
Shaman J, Karspeck A. Forecasting seasonal outbreaks of influenza. Proc Natl Acad Sci USA 2012;109(50):20425–30.
