XML Template (2013) [17.12.2013–3:05pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

Original Article

Spatial generalised linear mixed models based on distances

Statistical Methods in Medical Research 0(0) 1–23 ! The Author(s) 2013 Reprints and permissions: sagepub.co.uk/journalsPermissions.nav DOI: 10.1177/0962280213515792 smm.sagepub.com

Oscar O Melo,1 Jorge Mateu2 and Carlos E Melo3

Abstract Risk models derived from environmental data have been widely shown to be effective in delineating geographical areas of risk because they are intuitively easy to understand. We present a new method based on distances, which allows the modelling of continuous and non-continuous random variables through distance-based spatial generalised linear mixed models. The parameters are estimated using Markov chain Monte Carlo maximum likelihood, which is a feasible and a useful technique. The proposed method depends on a detrending step built from continuous or categorical explanatory variables, or a mixture among them, by using an appropriate Euclidean distance. The method is illustrated through the analysis of the variation in the prevalence of Loa loa among a sample of village residents in Cameroon, where the explanatory variables included elevation, together with maximum normalised-difference vegetation index and the standard deviation of normalised-difference vegetation index calculated from repeated satellite scans over time. Keywords distance-based methods, epidemiological study, Markov chain Monte Carlo, spatial generalised linear mixed models, spatial interpolation

1 Introduction Risk models derived from environmental data are becoming increasingly available for use by decision makers. In particular, Loa loa has emerged as a filarial worm of significant public health importance in Cameroon. As high densities of Loa loa microfilariae are associated with high prevalences of Loa loa infection,1 the people living in an area of high prevalence are considered to be at relatively high risk of experiencing encephalopathic reactions to ivermectin.2 To address these issues, as they apply to Loa loa, detailed epidemiological data from Cameroon have been analysed using standard logistic regression modelling.1,3,4 These data were analysed by Diggle et al.,2 using spatial statistical tools and Bayesian inference to quantify the uncertainty in the predictions of 1

Department of Statistics, Faculty of Sciences, National University of Colombia, Bogota´, Colombia Department of Mathematics, University Jaume I, Castellon, Spain 3 Faculty of Engineering, District University Francisco Jose´ de Caldas, Bogota´, Colombia 2

Corresponding author: Oscar O Melo, Department of Statistics, Faculty of Sciences, National University of Colombia, Crr. 30 # 45-03, Bogota´, Colombia. Email: [email protected]

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

XML Template (2013) [17.12.2013–3:05pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

2

Statistical Methods in Medical Research 0(0)

a regional, environmental risk map for Loa loa. In this paper, we present an alternative solution to this problem. We describe a refined spatial generalised linear mixed model (GLMM) incorporating general measures of distance/dissimilarity that can be applied to explanatory variables: numerical, categorical or a mixture of them. Methods of inference for such models (including simulated-based maximum likelihood estimation for model fitting and a discussion of methods for model comparison and testing goodness-of-fit) are provided. To solve such a problem, classical generalised linear models (GLMs), introduced by Nelder and Wedderburn,5 provide a unifying framework for the analysis. Several different ways to extend the GLM class to dependent data have been proposed, among which perhaps the most widely used are marginal models6 and mixed models.7 What we call here a generalised linear geostatistical model is a GLMM of a form specifically oriented to geostatistical data.8 Several authors9–11 considered a natural generalisation of the GLMs to accommodate correlated non-normal data by incorporating random terms into the linear predictor parts. The resulting models are termed GLMMs, providing a convenient and flexible way to model multivariate non-normal data. In particular, GLMMs constitute a unified framework for modelling geostatistical non-normal data, using mixed terms to model the underlying spatial process. The particular application of GLMMs to geostatistical data is known as spatial GLMMs.12,13 In many disciplines related to spatial data analysis (epidemiology, mining, hydrogeology, ecology, earth sciences and environment, among others), researchers often have to deal with variables of distinct nature that are associated with a response variable: categorical and binary variables, such as type of soil or rock, and continuous variables, for example, the spatial coordinates or additional variables. In these cases, the geometric concept of distance between individuals or populations can be used.14 The distance concept is a useful tool in hypothesis testing, parameter estimation, regression, etc. Cuadras and Arenas15 proposed a multiple regression method based on distance analysis using different metrics to work with continuous and categorical explanatory variables. Cuadras et al.16 presented some additional results of a distance-based (DB) model for prediction with mixed variables and explored the problem of missing data giving a solution with a DB approach. For geostatistical models, the main interest is often prediction and the model parameters themselves are not of interest.17 However, in an initial stage of a geostatistical analysis, it is common to investigate the structure of the model, that is, whether data should be transformed, if anisotropy is present, which correlation function to use, and so on. In classical geostatistics, these investigations are done using graphics, but as suggested by Christensen et al.,18 one should in addition study the profile likelihood function. Additionally, in the spatial prediction problem, the only available explanatory variables are those that can be calculated for any location within the geographical region over which predictions are required. The above considerations motivate parameter estimation by maximum likelihood in a generalised linear spatial model. Monte Carlo maximum likelihood (MCML)17,19–21 is an alternative approach to the Monte Carlo expectation maximization (EM) gradient approach used by Zhang.13 We propose a new method for solving such problems using Euclidean distances between individuals. We present a new method based on distances for modelling spatial GLMMs. This model is useful when the explanatory variables are not necessarily continuous or they are a mixture of continuous and categorical variables. Our distance-based spatial generalised linear mixed model (DBSGLMM) method is used to predict the trend and to estimate the covariance structure when the explanatory variables are continuous, categorical, binary or a mixture of them. The parameters involved in the proposed model are estimated using MCML, which is a feasible and a useful technique. MCML is based on Markov chain Monte Carlo (MCMC) and provides a useful tool for model selection in complicated models. It allows a more complete inference than Monte Carlo EM gradient does

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

XML Template (2013) [17.12.2013–3:05pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

Melo et al.

3

because likelihood and profile likelihood functions can be obtained, not just maximum likelihood estimates (MLEs). The variation in the prevalence of Loa loa data studied by Thomson et al.22 and Diggle et al.2 is used to illustrate the usefulness of our approach. We used the DBSGLMM to predict the trend and to estimate the covariance structure. We used MCML to select the best model and to make inference about the principal coordinates obtained from the DB method. In this study, we also considered additional non-spatial variation which could not be attributed to the binomial error distribution, but it was not significant. The observed prevalences of Loa loa microfilaraemia against the prevalences predicted using the DBSGLMM approach show substantially less scatterness than the one fitted by Thomson et al.22 and Diggle et al.,2 suggesting that our method is adequate and provides good fitting and predicting results. The plan of the paper is the following. Section 2 develops the methodological proposal: spatial generalised mixed models based on distances, spectral representation, MCMC algorithm for DBSGLMM, spatial prediction of a new individual, goodness-of-fit measures and selection of variables. Section 3 compares the classical spatial GLMM and the proposed method, DBSGLMM. Section 4 develops the application that illustrates the proposed methodology. Then, the paper ends with some conclusions.

2 DB spatial generalised linear mixed models Let s 2 Rd be a location in a d-dimensional Euclidean space, and assume that the potential datum y(s) at each spatial location s is assumed to have a random field structure.23 If s is allowed to vary on a given set D  Rd , we have a stochastic spatial process fyðsÞ, s 2 Dg. A distribution belongs to the exponential family if it has a density function given by f ð yðsÞ; s Þ ¼ h1 ð yðsÞÞ exp½ðs Þtð yðsÞÞ  bðs Þ where ðs Þ, bðs Þ, tð yðsÞÞ and h1 ð yðsÞÞ are functions that take values in the real line. The proposed interpolation is built for a non-Gaussian random field model by specifically considering categorical, continuous and indicator variables in the trend model. The datagenerating mechanism conditional on the signal of the model follows a classical GLM as described by McCullagh and Nelder.24 Explicitly, conditional on zðÞ and eðsi Þ, the responses yðsi Þ at locations si with i ¼ 1, . . . , n are mutually independent random variables whose conditional expectations, ðsi Þ ¼ E½ yðsi Þjvðsi Þ, zðsi Þ, eðsi Þ, are determined as i ¼ gððsi ÞÞ ¼ 0 þ vt ðsi Þc þ zðsi Þ þ eðsi Þ

ð1Þ

where gðÞ is coined a link function which is invertible and continuous, 0 þ vt ðsi Þc is the trend,  0 is an unknown parameter associated to the intercept, vðsi Þ ¼ ðv1 ðsi Þ, . . . , vp ðsi ÞÞt is a vector containing explanatory variables associated to the spatial location si and c ¼ ð1 , . . . , p Þt is a vector of unknown spatial regression parameters. The conditional distribution of each yðsi Þ given zðsi Þ and eðsi Þ is called the error distribution. The zðsi Þ’s are assumed to have a random field structure, whose covariance function is characterised by a finite dimensional parameter h, termed the spatial variance component. In addition, the unobserved noise term eðsi Þ may or may not be included, depending on the application. Nominally, such an error term accommodates over-dispersion relative to the meanvariance relationship implied by the distribution under consideration. The error components are assumed to be mutually independent with mean 0 and variance  2 .

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

XML Template (2013) [17.12.2013–3:05pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

4

Statistical Methods in Medical Research 0(0)

Without significant loss of generality, we work here with model (1) without the noise term, eðsi Þ; so, model (1) can be expressed as i ¼ gððsi ÞÞ ¼ 0 þ vt ðsi Þc þ zðsi Þ

ð2Þ

Within the field of linear models, it is usual to work with the model in its canonical form (ðsi Þ ¼ si , tð yðsi ÞÞ ¼ yðsi Þ), which includes a dispersion parameter  4 0. Specifically, conditional on unobserved spatial random variables zðsi Þ, yðsi Þ follows a distribution of the exponential family, i.e.   1 ind 2 yðsi Þjðvðsi Þ, zðsi ÞÞ  exp 2 ½ yðsi Þsi  bðsi Þ þ cð yðsi Þ,  Þ  where  2 is an extra-variation parameter and cðÞ is a specific function. The conditional mean, @bð Þ ðsi Þ, is related to si through the identity ðsi Þ ¼ @ssi . It is to be modelled, after a proper i transformation, as the linear model given by (2) in both the fixed (ðsi Þ) and spatial random effects (zðsi Þ). We can reformulate model (2) in a compact vectorial form as gðEðys jV, zs ÞÞ ¼ gðls Þ ¼ 0 1 þ Vc þ Ezs

ð3Þ

where ls ¼ Eðys jV, zs Þ, ys ¼ ð yðs1 Þ, . . . , yðsn ÞÞt , V ¼ HV with H ¼ I  1n 11t , a centring matrix and I the identity matrix of size n  n. V is a matrix of original explanatory variables; note that V may involve continuous, categorical and binary variables, or even a mixture of them. Furthermore zs ¼ ðzðs1 Þ, . . . , zðsn ÞÞt , ls ¼ ððs1 Þ, . . . , ðsn ÞÞt , 1 is a vector of ones of size n  1, and E is a nonrandom design matrix which is compatible with the random effects zs.

2.1

DB mixed model

To build our model, we consider several general measures of distance/dissimilarity that can be applied to explanatory variables: continuous, categorical or a mixture of them. To this aim, we need to define some similarity or distance measures, which depend on the explanatory variable characteristics. According to Cuadras25 and Cuadras and Arenas,15 let  ¼ f!1 , . . . , !n g be a set consisting of n individuals. Let dii0 ¼ d ð!i , !i0 Þ ¼ d ð!i0 , !i Þ d ð!i , !i Þ ¼ 0 be a distance (or dissimilarity) function defined on  and denoted by " ¼ ðdii0 Þ, the n  n matrix of inter-distances (or inter-dissimilarities). Then, there exists a configuration of points vðs1 Þ, . . . , vðsn Þ 2 Rp (where vðsi Þ ¼ ðv1 ðsi Þ, . . . , vp ðsi ÞÞt , i ¼ 1, . . . , n, which is formed by binary, categorical and continuous variables), such that the similarity according to Gower26 can be defined for mixed variables as  Pp c  jvj ðsi Þvj ðsi0 Þj 1  þ c1ii0 þ ii0 j¼1 Gj mii0 ¼ ð4Þ pc þ ð pb  c4ii0 Þ þ pq where, for the fixed effects, pc is the number of continuous variables, c1ii0 ¼ c1 ðsi , si0 Þ and c4ii0 ¼ c4 ðsi , si0 Þ are the numbers of positive and negative matches, respectively, for the pb binary variables and ii0 ¼ ðsi , si0 Þ is the number of matches for the pq multistate variables. Gj is the range (or distance) of the jth continuous variable.

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

XML Template (2013) [17.12.2013–3:05pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

Melo et al.

5

In the case that the explanatory variables in (2) are binary or categorical, the similarity can be defined by mii0 ¼

c1ii0 þ c4ii0 c1ii0 ðSokal-MichenerÞ mii0 ¼ ðJaccard Þ c1ii0 þ c2ii0 þ c3ii0 þ c4ii0 c1ii0 þ c2ii0 þ c3ii0

where c1ii0 , c2ii0 ¼ c2 ðsi , si0 Þ, c3ii0 ¼ c3 ðsi , si0 Þ, c4ii0 are the frequencies of (1,1), (1,0), (0,1) and (0,0), respectively. Through the transformation pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi dii0 ¼ 1  mii0 it is possible to obtain Euclidean distances. If all the explanatory variables in (2) are continuous, the distance would be defined as qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi dii0 ¼ ðvðsi Þ  vðsi0 ÞÞt ðvðsi Þ  vðsi0 ÞÞ ð5Þ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi Pp or alternatively by the absolute distance dii0 ¼ j¼1 jvj ðsi Þ  vj ðsi0 Þj. Expressions for the Gower similarity, as given in equation (4), will be useful to the extent of having information associated with mixed variables, not only for the sampling sites but also for the non-sampled, which limits its use in unsampled areas. These distances satisfy that are close to 0 if the v measurements on si and si0 are very similar, i.e. 0 dii ffi 0 if vðsi Þ ffi vðsi0 Þ. After selecting one of the previous distances, define Ann ¼ ðd2ii0 =2Þ and B ¼ HAH. Then B is a semi-definite positive matrix27 of rank n – 1, and the principal coordinates matrix, X, are obtained from the spectral decomposition as B ¼ HAH ¼ U,Ut ¼ XXt where , is a diagonal matrix containing the eigenvalues of B, and X ¼ U1=2 is a n  n matrix of rank n–1 because it has an eigenvector equal to 1, and U contains the standardised coordinates. Additionally, the rows xt ðs1 Þ, . . . , xt ðsn Þ of X are the principal coordinates of B with respect to the distance matrix ". In the same way that vðsi Þ ffi vðsi0 Þ when an individual i is similar to another individual i0 in (3), it is clear that xðsi Þ ffi xðsi0 Þ also holds in this case. One of the potential pitfalls in DB prediction is the gross over-parameterisation since the rank of B can be as high as n – 1. In this case, the number of principal coordinates (columns of X) can be excessive, allowing for an arbitrarily well-fitted model. To avoid such a problem, we select only the most significative principal coordinates using any of the methods commented in Section 2.4. Therefore, in matrix form, the DBSGLMM in reduced form can be expressed by g ¼ gðls Þ ¼ 0 1 þ XðkÞ b þ Ezs

ð6Þ

bt

where ls ¼ EðyjXðkÞ , zs Þ, 0 and ¼ ð 1 , . . . , k Þ are the unknown parameters, b 2 Rk , k n  1, XðkÞ ¼ ðX1 , . . . , Xk Þ which contains the subset of k relevant columns of X that are significantly correlated coordinates with ys . Moreover, each Xj ( j ¼ 1, . . . , k) is a principal coordinate, i.e. a vector column of X. The proposed model in (6) can be written as g ¼ 0 1 þ

k X

j Xj þ Ezs

j¼1

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

ð7Þ

XML Template (2013) [17.12.2013–3:05pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

6

Statistical Methods in Medical Research 0(0)

or alternatively as i ¼ gððsi ÞÞ ¼

k X

xij j þ zðsi Þ ¼ xti b þ zðsi Þ

ð8Þ

j¼0

with i ¼ 1, . . . , n, and where xi0 ¼ 1, xti ¼ ðxi0 , . . . , xik Þ and bt ¼ ð 0 , bt Þ ¼ ð 0 , 1 , . . . , k Þ. Note that for the model presented in (7), we have Xtj 1 ¼ 0, Xtj Xj ¼ j , and Xtj Xj0 ¼ 0 for j 6¼ j0 , with j, j0 ¼ 1, . . . , k. Model (8) can be considered in a general Box-Cox class of link functions,17 which has the following form  ð ðsi ÞÞ= if 4 0 g ððsi ÞÞ ¼ logððsi ÞÞ if ¼ 0 As the link function g ððsi ÞÞ relates ls with X and zs, we have that ls ¼ ms g1 ðX, zs Þ where ms is a deterministic function. If 4 0, then g ðRÞ ¼ ð1= , 1Þ, and it is necessary to define g1 = g ðRÞ, and f ð yðsi Þjðsi ÞÞ ¼ 1fyðsi Þ¼0g when ðsi Þ ¼ 0 (see Christensen17). ððsi ÞÞ ¼ 0 when ðsi Þ 2 Model (6) is not a simple reformat because it accommodates more complex data structure beyond spatial data. For example, with properly defined E and random effects zs, it encompasses nonnormal clustered data and crossed factor data.7 When E is defined as a matrix indicating membership of spatial regions (e.g. counties or census tracts), (6) models areal data as well. If we assume that the spatial process zs has a multivariate normal (MN) distribution zs  MNð0, Dz Þ, we can then parameterise the process in an efficient way for large size problems. Rewrite zs in terms of a spectral decomposition (see details in Appendix 1), zs ¼ )

ð9Þ t

where )nn is a matrix of basis functions ) ¼ ðw1 , . . . , wn Þ with wi ¼ ð i ðs1 Þ, . . . , i ðsn ÞÞ , and d is an n  1 vector of spectral coefficients (projections of the process zs onto the basis functions), with distribution d  MNð0, D Þ. Note that if the basis functions are orthogonal, then d ¼ )t zs and D ¼ )t Dz ) because )t ) ¼ )t ¼ I. There are several advantages in writing the spatial process zs in terms of the spectral process d. For example, the spectral operator often acts as a decorrelator so that D is relatively sparse, as compared to Dz . Other benefits are its efficient computation, and that in some cases the basis formulation can be truncated28 to achieve dimension reduction. Finally, substituting (9) into (6), we have the following reduced version of a DBSGLMM g ¼ gðls Þ ¼ gðEðys jX, dÞÞ ¼ Xb þ E1 d where X ¼ ð1

2.2

ð10Þ

XðkÞ Þ and E1 ¼ E.

MCML algorithm for DBSGLMM

The application of likelihood-based methods to non-Gaussian spatial GLMMs based on distances is hampered by computational difficulties, which arise because of the high dimensionality of the

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

XML Template (2013) [17.12.2013–3:05pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

Melo et al.

7

unobserved random vector d ¼ f ðs1 Þ, . . . , ðsn Þg. This section considers MCMC maximum likelihood17,19,20,29 for generalised linear spatial models. Assuming that each Yðsi Þ in model (10) has a distribution of the exponential family and by independence of Yðs1 Þ, . . . , Yðsn Þ given X, d and g1 ðÞ, the conditional density function of Ys ¼ ys given the observed covariates X and d is f ðys jX, d; b, Þ ¼

n Y   f yðsi Þ; mi g1 ½xðsi Þ, ðsi Þ; b i¼1

where mi ¼ mðsi Þ, and determines the link function. Now, from a classical perspective, the likelihood function based on the observed random variables ys is obtained by marginalising with respect to the unobserved random variables d, leading to the mixed-model likelihood. Then, the likelihood function for a generalised linear spatial mixed model cannot be written in closed form but only as a high-dimensional integral Z Lðb, h, Þ ¼ f ðys jb, h, Þ ¼ f ðys jX, d; b, Þ f ðdjX; hÞdd Rn Z Y n   f yðsi Þ; mi g1 ð11Þ ¼ ½xðsi Þ, ðsi Þ; b f ðdjX; hÞd ðs1 Þ    d ðsn Þ Rn i¼1

where f ðdjX; hÞ denotes the joint distribution of d given the observed covariates X, with h the vector of parameters associated to d. The integral above is also the normalising constant in the conditional density of d given ys , f ðdj ys , b, h, Þ /

n Y   f yðsi Þ; mi g1 ½xðsi Þ, ðsi Þ; b f ðdjX; hÞ

ð12Þ

i¼1

MCMC provides a method for simulating from (12) and approximating (11). The integral has a high dimension, and consequently, it is intractable to find the MLEs by direct maximisation. Then, we consider the case where is fixed, and therefore, it is suppressed. The likelihood function (11) can thus be written as Z Z f ðys jX, d; bÞ f ðdjX; hÞ ~ Lðb, hÞ ¼ f ðys , dÞdd f ðys jX, d; bÞ f ðdjX; hÞdd ¼ n n f~ ðys , dÞ R R # " Z f ðys jX, d; bÞ f ðdjX; hÞ ~ f ðy jX, d; bÞ f ðdjX; hÞ s f ðdjys Þdd ¼ e / E ð13Þ ys ~ ~ n f ðys jX, d; b0 Þf ðdÞ f ðys jX, d; b0 Þf ðdÞ R where f~ ðys , dÞ ¼ f ðys jX, d; b0 Þf~ ðdÞ and f~ ðdÞ is some density function with support on Rn , the conditional density f~ ðdjys Þ / f ðys jdÞf~ ðdÞ, and e Eðjys Þ denotes expectation with respect to f~ ðjys Þ and depends on an initial guess of b, b0 . MLEs can be calculated by maximising the Monte Carlo approximation to (13), Lr ðb, hÞ ¼

r 1X f ðys jX, dð j Þ; bÞ f ðdð j ÞjX; hÞ r j¼1 f ðys jX, dð j Þ; b0 Þf~ ðdð j ÞÞ

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

ð14Þ

XML Template (2013) [17.12.2013–3:05pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

8

Statistical Methods in Medical Research 0(0)

where the dð j Þ’s (j ¼ 1, . . . , r) are sampled by MCMC from the distribution f~ ðjys Þ. As noted from (13), one should choose f~ ðÞ close to f ðjb hÞ, where b h is the MLE of h, because otherwise one or just a small number of the terms f ðdð j Þjb, hÞ=f~ ðdð j ÞÞ, j ¼ 1, . . . , r may dominate the others in Lr ðb, hÞ, which makes the approximation less useful. Numerical procedures for maximising the Monte Carlo approximation (14) and investigation of which link function (n) is appropriate are presented in Appendix 2. Having estimated the trend parameter b and the spatial correlation parameter h, we are ready to discuss the goodness-of-fit spatial measures.

2.3

Goodness-of-fit measures

After fitting the DB model, it is important to carry out a diagnostic analysis to verify the goodnessof-fit of the estimated model. A global measure of explained variation is obtained by computing the pseudo R2k defined as     l e b, e h l b b, b h 2   , 0 R2k 1 Rk ¼ l e b, e h   ee e e where   l b, h is the log-likelihood function for the saturated model evaluated at b and h, and b b l b, h  denotes the maximum value of the log-likelihood function for the model of interest. Note  that l e b, e h will be larger than any other likelihood function for these observations, with the same assumed distribution and link function, because it provides the most complete description of the data. The discrepancy of the fitted model can be determined through how well the fitted model is significantly different from the saturated model, which contains as many parameters as observations are in the model. For this, let h    i Dðys , b, Þ ¼ 2 l b b, b h l e b, e h An approximation to this quantity can be given by Dðys , b, hÞ ¼

n X

ðrD ðsi ÞÞ2

ð15Þ

i¼1

which is known as the deviance, and n h   io1=2 ^ i Þ 2 l yðsi Þ, b b, b h  l yðsi Þ, e b, e h rDi ¼ rD ðsi Þ ¼ sign½yðsi Þ  ðs   where l yðsi Þ, e b, e h is the maximum log-likelihood for the saturated model associated to the ith  observation, and l yðsi Þ, b b, b h is the maximum value of the log-likelihood function for the model of interest associated to the ith observation. rD ðsi Þ is the ith deviance residual because an observation with a large absolute value of rD ðsi Þ can be viewed as discrepant. As expected, the log-likelihood associated with the saturated model must be greater than a model with k 5 n parameters. As it is known, the residual analysis aims at identifying atypical observations and/or model misspecification. It can be based on ordinary residuals or on deviance residuals. In this way, residuals are measures of agreement between the data and the fitted model. Most residuals are

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

XML Template (2013) [17.12.2013–3:05pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

Melo et al.

9

based on the differences between the observed responses and the fitted conditional mean, for example the Pearson residuals are given by ^ iÞ yðsi Þ  ðs rPi ¼ rP ðsi Þ ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ^ i ÞÞ Varððs ^ i ÞÞ is the variance of ðs ^ i Þ. where Varððs To avoid the problem of having a pseudo R2k ’ 1 when the rank of X is k ¼ n – 1, it is necessary to consider only the most correlated eigenvectors of B, given by 1, X1 , . . . , Xk , with the regionalised variable ys , i.e. the most significantly correlated principal coordinates with ys .

2.4

Selection of principal coordinates for the reduced DBSGLMM

Initially, the number of explanatory variables can be chosen as k. A good approach to selecting the columns of X consists in ranking them according to their pseudo correlation coefficient with respect to ys , i.e. R21 ðX1 Þ 4    4 R21 ðXk Þ 4    4 R21 ðXn1 Þ where R21 ðXj Þ is the pseudo correlation coefficient between the Xj th principal coordinate (j ¼ 1, . . . , k, . . . , n  1) and g. This is done by leaving the same link function in g in the different fitted models. With this procedure, the less correlated variables with g in the principal coordinates matrix X are eliminated, i.e. n – k – 1 are not considered in the final model. A similar procedure is obtained using (15), but with ri ðXj Þ, which is the residual obtained from inclusion in the model of Xj , j ¼ 1, . . . , k, . . . , n  1. In this way, the pseudo deviances are sorted from lowest to highest; the first k pseudo deviances are       D ys ; X1 5 D ys ; X2 5    5 D ys ; Xk Another option for the selection of principal coordinates is performed in a similar way to the selection of the number of variables in multivariate regression, using the statistic called Cp  Mallows. That is, we draw a graph representing the points ð j, 1  cð j ÞÞ j ¼ 1, . . . , k, k þ 1, . . . , n  1, and then the point with a significant fall in the lack of predictability, given by 1  cð j Þ, is determined. The predictability c(j) is given by Pj 2 l¼1 R1 ðXl Þ l cð0Þ ¼ 0, cð j Þ ¼ Pn1 , j ¼ 1, . . . , k, . . . , n  1 2 l¼1 R1 ðXl Þ l where l is the lth eigenvalue associated to Xl , l ¼ 1, . . . , k, . . . , n  1 (see Cuadras et al.16 for more details). So, the Xkþ1 , . . . , Xn1 principal coordinates must be removed. Having chosen the k significant principal coordinates, we are ready to discuss spatial techniques to predict the value of a random field at a given location from nearby observations.

2.5

Spatial prediction of a new individual

The coordinates xðkÞ ðs0 Þ are obtained assuming that observations of the mixed explanatory variables are available for a new individual, that is, vðs0 Þ ¼ ðv1 ðs0 Þ, . . . , vp ðs0 ÞÞt is known. Then, the distances between the new individual and each of the individuals involved in the model proposed in (2) are to

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

XML Template (2013) [17.12.2013–3:05pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

10

Statistical Methods in Medical Research 0(0)

be calculated, i.e. d0i ¼ dðvðs0 Þ, vðsi ÞÞ, i ¼ 1, . . . , n. From these distances, a prediction can be performed using a result proposed by Gower26 and Cuadras and Arenas15(Section 3.3), which  2 t relates the vector d0 ¼ d01 , . . . , d20n of squared distances and the vector xðkÞ ðs0 Þ ¼ ðx1 ðs0 Þ, . . . , xk ðs0 ÞÞt of principal coordinates associated to the new individual as follows  t   ð16Þ d20i ¼ xðkÞ ðs0 Þ  xðkÞ ðsi Þ xðkÞ ðs0 Þ  xðkÞ ðsi Þ where xðkÞ ðsi Þ ¼ ðx1 ðsi Þ, . . . , xk ðsi ÞÞt with i ¼ 1, . . . , n. Then, we have that 1 Xt ðb  d0 Þ xðkÞ ðs0 Þ ¼ ,1 2 ðkÞ ðkÞ where b ¼ ðb11 , . . . , bnn Þt is a bii ¼ xtðkÞ ðsi ÞxðkÞ ðsi Þ, i ¼ 1, . . . , n.

vector

formed

by

the

ð17Þ diagonal

elements

of

B,

with

2.5.1 Spatial prediction Specifically, kriging is about interpolating the value y0 ¼ ð yðsnþ1 Þ, . . . , yðsnþn0 ÞÞt of a random field Y0 at l prespecified locations from observations yðsi Þ, i ¼ 1, . . . , n. We focus on interpolation of random effects over a continuous spatial area when the observations   are non-Gaussian. Now, let g0 ¼ ððsnþ1 Þ, . . . , ðsnþn0 ÞÞt be the functional prediction and let f g0 , g be the joint density function of g and a vector g0 . If we confine our interest to pseudo unbiased linear predictors of the form e g ¼ h þ Qg

ð18Þ

for some conformable vector h and matrix Q.30 Minimising the mean-squared-error of the prediction, we find the pseudo best linear unbiased predictor, which is given by (see details in Appendix 3)   e g ¼ X0 b þ Covt g, g0 D1  ½g  Xb   where D ¼  2 Rð#Þ þ R2 In , X0 is a matrix of k principal coordinates for the n0 new spatial subjects including a vector of ones, i.e. 1 of size n0  1. The covariance matrix for the prediction has the following general form (see details in Appendix 3) h i       g r ðgjy Þ D1 Cov g, g0 Var e gjys D0 þ Covt g, g0 1 Var s      0 g r is the covariance matrix based on the where D0 ¼ Varðg0 Þ  Covt g, g0 D1 and Var  Cov g, g sample gð1Þ, . . . , gðrÞ.

3 Relation with the classical spatial GLMM Model (6) depends on the chosen distance, dii0 (i, i0 ¼ 1, . . . , n), and it is particularly interesting under mixed variables and non-linear cases. Therefore, when the predictor variables are continuous and the Euclidean distance is used, the DBSGLMM yields exactly the same predictions than the spatial GLMM. This equivalence is also shown to hold for qualitative variables when a distance based on

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

XML Template (2013) [17.12.2013–3:05pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

Melo et al.

11

the matching coefficient is used. Additionally, the same is shown when there is a mixture of continuous, categorical and binary explanatory variables.

3.1

Continuous variables

If all the explanatory variables in (3), V ¼ ðV1 , . . . , Vp Þ, are continuous, the Euclidean distance is given by (5). Thus, the distance matrix " ¼ ðdij Þ is obtained, and 1 AC ¼  fdiagðV Vt Þ1t þ 1½diagðV Vt Þt  2V Vt g 2 where diagðV Vt Þ is a vector which contains the diagonal terms of the matrix V Vt . Therefore, BC ¼ HAC H ¼ HV Vt H ¼ VVt ¼ XXt because H½diagðV Vt Þ1t H ¼ 0, H½1ðdiagðV Vt ÞÞt H ¼ 0. Then, the DBSGLMM in (6) is a spatial generalised centred linear mixed model (3), i.e. it produces the same predictions (with p dimensions) than the model given in (3). However, it is not necessary to consider a Euclidean p-dimensional distance as given in equation (5) because any other distance could be used (e.g. absolute distance), provided that some properties are satisfied. Let El (k l n  1) be the space spanned by the columns of X, where X is a metric scaling solution obtained from a distance applied to the same data. Then, by taking k 4 p, i.e. the most suitable columns of X, the DB models improve the classical spatial GLMM when ðg  ^0 1Þ 2 El . Note that this is always true for l ¼ n – 1. To illustrate it, the pseudo R2k can be written as R2k ¼

k X

R21 ðXj Þ

j¼1

where R21 ðXj Þ is the pseudo correlation coefficient between g and the principal coordinate Xj . Note that R2k 5 R2kþ1 because g is uncorrelated with the principal coordinates Xj ’s ( j ¼ 1, . . . , k) and the principal coordinate Xkþ1 . Moreover, R2k is maximised for a given k provided that the first k-ordered principal coordinates are selected. On the other hand, we can represent as R2 ðg, V Þ the pseudo R2 between g and the explanatory variables V using the classical spatial GLMM. In this case, we have that R2 ðg, V Þ

p X

R2 ðg, Vj Þ

j¼1

where R2 ðg, Vj Þ is the pseudo correlation coefficient between g and the explanatory variable Vj . Without loss of generality, suppose p ¼ 1 and note that applying the DBSGLMM, the pseudo R2l can be written as R2l ¼

l X

R21 ðXj Þ,

1 k l n1

j¼1

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

XML Template (2013) [17.12.2013–3:05pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

12

Statistical Methods in Medical Research 0(0)

because the principal coordinates (X1 , . . . , Xn1 ) are uncorrelated. Also, we can represent as R2 ðg, V1 Þ the pseudo R2 between g and the explanatory variable V1 using the classical spatial GLMM. Therefore, if we take l ¼ k 4 1 (e.g. k ¼ 2) then we find that R2l ¼

l X

R21 ðXj Þ R2 ðg, V1 Þ

j¼1

So, the DBSGLMM improves the classical spatial GLMM. However, this inequality is not too relevant if R2 ðg, V1 Þ is close to 1, and so, the classical spatial GLMM is sufficiently good, and thus the DBSGLMM improvement is not necessary. This result holds when the explanatory variables are qualitative (see Section 3.2) and mixed (see Section 3.3).

3.2

Qualitative variables

Assume now that all the explanatory variables V ¼ ðV1 , . . . , Vp Þ are qualitative in (3), where now every Vj is a variable in the qj states, j ¼ 1, . . . , p. A measure of similarity between individuals i and i0 is the number of matches mii0 for the qualitative variables involved in the model. Note that 0 mii0 p, and when the explanatory variables are binary, the mii0 =p’s are the matching coefficients (see details in Cuadras and Arenas.15). The square distances between individuals i and i0 is defined as d2ii0 ¼ mii þ mi0 i0  2mii0 ¼ 2ð p  mii0 Þ where d2ij becomes a square Euclidean distance because Vj can be represented by qj (j ¼ 1, . . . , p) binary variables coded (0, 1). In a similar way, as it was shown for continuous variables, writing AQ ¼  12 ½Mr þ Mtr  2M, we find that BQ ¼ HAQ H ¼ HMH ¼ HV V tH where all rows of Mr are equal, M ¼ ðmii0 Þ and HMr ¼ Mtr H ¼ 0. Hence, there is no advantage over the classical spatial GLMM, except that the multicollinearity problem may be automatically solved by using distances. Therefore, the DBSGLMM reduces to the classical spatial GLMM for qualitative variables by scoring the states as 0 (absent) and 1 (present).

3.3

Mixed variables

Suppose now that we have in (3) the matrices V ¼ ðVc Vq Þ, where Vc and Vq are submatrices of continuous and qualitative variables of V , respectively. According to Cuadras et al.,16 it is appropriate using the similarities mii0 given by Gower31 between the individual i and i0 . So, the squared dissimilarities between individuals i and i0 are d2ii0 ¼ 1  mii0 for the model, and " ¼ ðdij Þ is a distance matrix on the set of n individuals.16 Consequently, if the similarity presented in (4) is chosen, the DBSGLMM is shown to be equivalent to the classical spatial GLMM by considering the quantitative variables as usual and by scoring the qualitative variables as described in the previous subsection.

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

XML Template (2013) [17.12.2013–3:05pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

Melo et al.

3.4

13

Non-linear DBSGLMM

The DB approach is also useful to perform a non-linear spatial GLMM. Let X ¼ ðX1 , . . . , Xk Þ be the principal coordinates, the non-linear DBSGLMM is given by i ¼ gððsi ÞÞ ¼

k X

fj ðxj ðsi ÞÞ j þ zðsi , hÞ

ð19Þ

j¼1

where fj (j ¼ 1, . . . , k) is a non-linear function (possibly unknown) depending on parameter j. If the original p explanatory variables are continuous, an alternative to the model (19) can be built using the absolute distance given by d2ii0 ¼

p X

jvl ðsi Þ  vl ðsi0 Þj

ð20Þ

l¼1

Cuadras et al.16 proved that if a single explanatory variable is considered and the points are equidistant, the DB approach is equivalent to an ordinary regression on orthogonal polynomials; specifically, these are the Chebyshev polynomials. The same result holds for the DBSGLMM when p ¼ 1. Then, the DB model with distance (20) in the non-linear spatial GLMM may be useful and shows a good performance because only a spectral decomposition of a symmetric matrix, followed by a projection, is needed under the DB approach.

4 Application We consider the epidemiological data analysed by Thomson et al.22 and Diggle et al.2 These data were obtained from a series of field studies undertaken by the joint Institut de Recherche pour le Dveloppement-Centre Pasteur du Cameroun unit in Cameroon between 1991 and 2001,1,3 and subsequently by other institutes.32,33 With the aim of modelling the spatial variation in Loa loa prevalence (presence/absence of Loa-loa microfilariae in the blood sample), a dataset of 21,938 individuals from 168 villages were studied.8 The explanatory variables included elevation, together with maximum normalised-difference vegetation index (NDVI; max(NVDI)) and the standard deviation of NDVI (SD (NDVI)) calculated from repeated satellite scans over time. According to Diggle and Ribeiro,8 inclusion of both the maximum values and SDs of the vegetation index allows discrimination between locations with different degrees of seasonal variation in greenness of methodologies are given by Thomson et al.22 In a first stage, we employed the distance between individuals given in (5) because all explanatory variables were continuous. Thus, the distances from the centring explanatory variables and the matrix B ¼ HAH were calculated. Then, the principal coordinates were built using the explanatory variables: elevation, maximum NDVI and the SD (NDVI). To choose the principal coordinates that are to be included in the model, the higher c( j)’s were selected, but any other procedure proposed in Section 2.4 could also be used. Three principal coordinates were chosen, say X1 , X2 and X3 . This was done here with the aim to keep the same number of explanatory variables. However, this is not necessary under the DB approach, as more coordinates could be included in the model, which might improve the goodness-of-fit of the proposed model. Residual spatial dependence was handled as follows: let yðsi Þ denote the number of positive samples out of ni individuals tested at location si. Besides, we postulated a set of village-specific

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

XML Template (2013) [17.12.2013–3:05pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

14

Statistical Methods in Medical Research 0(0)

random effects eðsi Þ, which are mutually independent and Gaussian, with mean 0 and variance R2 ¼  2 = 2 ; this relative nugget instead of  2 is assumed by computational convenience. Our model assumes that the yðsi Þ are conditionally independent binomial variables given an unobserved spatial stochastic process zðsi Þ and eðsi Þ, and that the mean response at si depends on the explanatory variables observed at location si, on zðsi Þ and eðsi Þ. Specifically, if pðsi Þ is the probability that a randomly selected person at a location si tests positive for Loa loa, then

pðsi Þ ð21Þ log ¼ 0 þ x1 ðsi Þ 1 þ x2 ðsi Þ 2 þ x3 ðsi Þ 3 þ zðsi Þ þ eðsi Þ 1  pðsi Þ where zðsi Þ is modelled as a zero mean Gaussian process with variance  2 and correlation structure described by Corðzðsi Þ, zðsi0 ÞÞ ¼ ðh; #Þ, i, i0 ¼ 1, . . . , n. We restrict our study to isotropic correlation functions (we did not find any evidence of anisotropy and thus we omit the corresponding details), where the correlation function ðh; #Þ depends only on the Euclidean distance h ¼ jsi  si0 j between locations. The roles of zðsi Þ and eðsi Þ in the model are to capture the residual spatial and non-spatial variations, respectively, after adjusting by the three principal coordinates. Diggle et al.2 used an exponential correlation function, ðh; #Þ ¼ expðh=#Þ. Here, we consider correlation functions from the Mate´rn class where the exponential model can be considered a particular case, and spherical correlation functions (see Cressie23). The MCMC algorithm was used to generate samples from the predictive distribution of the complete surface zðsi Þ at 1-km resolution, given the observed values of the response variable yðsi Þ at each sampled village location and of the three principal coordinates at 1-km resolution throughout the study region. We first fitted the data by a binomial mixed model based on distances using the logit link function with i.i.d. random effects. Applying the classical GLM and taking the principal coordinates X1 , X2 and X3 , we obtained the estimates ^00 ¼ 0:713, ^01 ¼ 0:497, ^02 ¼ 0:216 and ^03 ¼ 0:251. Then, the left-hand panel of Figure 1 shows the variogram cloud; its diffuse appearance is entirely typical. A more stable variant of the variogram cloud is the empirical variogram as illustrated on the righthand panel of Figure 1. From it, the estimates of the random effects were calculated, from which we 2 obtained the estimates b h0 ¼ ð^ 02 , #^0 , ^R0 Þ ¼ ð0:5, 5:6, 0:001Þ, using the least squares method.23 We then used the previous estimators as initial values to run the MCMC algorithm for the DBSGLMM. The Monte Carlo sample size was 1000, and the first 100,000 iterations were discarded as a burn-in period. A total of 1,000,000 simulations were run, taking samples every 1000 iterations. We studied the following models: H1 H2 H3 H4 H5

: No spatial correlation: : Exponential correlation function, # 4 0 and R2 ¼ 0: : Exponential correlation function, # 4 0 and R2 0: ´ correlation function, ¼ 1, # 4 0 and R2 0: : Matern : Spherical correlation function, # 4 0 and R2 0:

The results obtained after running the MCMC algorithm for the DBSGLMM are presented in ^ in H3 and H1 relative to a 0:52 -distribution Table 1. Comparing the difference of the log L’s ð2Þ (whose 95% quantile is 2.996), we note a significant spatial correlation. In addition, as in Diggle et al.,2 there was not a large nugget effect  2 ¼  2 R2 , which can be obtained by comparing the log L^ differences of H2 and H3 relative to a 0:52ð1Þ -distribution, whose 95% quantile is 1.921. Note that the

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

XML Template (2013) [17.12.2013–3:05pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

Melo et al.

15

Figure 1. Variogram cloud (left panel) and binned variogram (right panel) for the prevalence of Loa loa using a DB approach.

Table 1. Maximum likelihood estimates for the models H1, H2, H3, H4 and H5. Model

^0

^1

^2

^3

^ 2

#^

^R2

log L^

H1 H2 H3 H4 H5

1.870 1.862 1.863 1.836 1.820

0.595 0.489 0.484 0.482 0.500

0.193 0.254 0.209 0.190 0.205

0.335 0.273 0.288 0.278 0.277

0.795 0.801 0.769 0.684 0.699

* 21.609 26.195 12.697 35.566

0.00 0.00 0.04 0.11 0.04

94.0 148.9 149.5 147.7 145.4

*This value must not be obtained because it is a non-spatial model.

estimate #^ for model H2 is smaller than the estimate #^ for model H3, which is consistent with model H2 having no nugget effect and model H3 having a small nugget effect. The results in Table 1 also show that the choice of the correlation function is not important, although the exponential model with a small nugget effect is slightly better than other competitive models. Several runs of the MCMC algorithm for the DBSGLMM approach were performed with 2 different starting values ( 00 , 01 , 02 , 03 , 02 , #0 , R0 ) and different correlation functions; however, the results did not differ much, and the pattern was the same as in Table 1. Model comparison is here used in a somewhat informal way because some of the models that were compared are not nested. The analysis also assumes that the log-likelihood ratio can be reasonably approximated by a 2 -distribution, but it is often hard to prove this rigorously. However, the exponential model with a nugget effect shows the highest log-likelihood value

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

XML Template (2013) [17.12.2013–3:05pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

16

Statistical Methods in Medical Research 0(0)

which is a reasonable criterion when the models are not nested. Therefore, this model will be investigated further by simulation in the next subsection.

4.1

Inference on the parameters, diagnostic and prediction using DBSGLMM

After having completed an initial run of the MCMC algorithm in the DBSGLMM approach, and with the aim of performing statistical inference on the parameters involved in model (21), we repeated 200 times the process of running 1,000,000 simulations, obtaining samples every 1000 iterations after removing the first 100,000 iterations. The deviance was 112.23, which compared with 2ð0:05,161Þ ¼ 132:66 shows that the logit DBSGLMM appears to fit well. Furthermore, using the logit link function, we found that the pseudo R2k ¼ 0:93, suggesting an appropriate fitting. Table 2 shows the mean, SD and the 95% confidence interval for each of the parameters in model (21). Regarding the regression parameters, Table 2 shows that X1 and X3 principal coordinates reduce the prevalence of Loa loa, while it is increased with the X2 principal coordinate. These conclusions about the principal coordinates might not be relevant to the general goal, as the researcher could be more interested on information over the original variables; however, these are presented here with the aim of highlighting that these components are significative and must be considered in the prediction process. Figure 2 confirms these results, noting that parameters b1, b2 and b3 are different to zero, and therefore, the greenness of the surrounding vegetation and the elevation markedly affect the prevalence of Loa loa, as they play a role in the distance from which the principal coordinates X1 , X2 and X3 were built. To interpret the results for the distance scale parameter #, the minimum distance between the sampling locations is 0.56 km and the maximum is 715.7 km. In particular, the 95% interval for  2 is (0.70; 0.93) on the logit scale, the interval for # of (23.49; 53.47) km indicates that this residual spatial variation operates on a relatively medium scale, and the interval for R2 of (0; 0.20) shows that this effect is practically zero, and thus, the nugget effect is certainly mild. These results are also confirmed through Figure 2. Note that the estimated parameters for model H3 showed in Table 1 are inside the confidence intervals presented in Table 2; this was expected since the result is a realisation due to the randomisation of the process. For example, the estimation of R2 differs between Table 1 and Table 2, but the confidence interval contains both estimations, and they are not significant in both cases.

Table 2. Estimates and 95% confidence intervals for the parameters involved in DBSGLMM (model (21)) to fit the prevalence of Loa loa. Parameter

Mean

SD

2.5% quantile

97.5% quantile

0 1 ðX1 Þ 2 ðX2 Þ 3 ðX3 Þ 2 # R2

1.870 0.393 0.158 0.286 0.801 37.465 0.089

0.031 0.040 0.052 0.072 0.059 8.041 0.054

1.928 0.459 0.051 0.421 0.696 23.486 0.000

1.818 0.314 0.255 0.142 0.927 53.472 0.202

DBSGLMM: distance-based spatial generalised linear mixed model.

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

XML Template (2013) [17.12.2013–3:05pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

Melo et al.

17

In addition, Figure 3(c) plots village level Pearson residuals (rPi ) against predicted values and shows the desired absence of any obvious relationship, indicating an adequate first-order fit. The left panels (a) and (b) of Figure 3 show absence of any relationship between Pearson and deviance residuals and the index of the village, respectively. The right panel (d) of Figure 3 plots the observed

Figure 2. Distribution of the parameters involved in the proposed DBSGLMM approach.

(a)

(c)

(b)

(d)

Figure 3. (a) Pearson residuals against number of village, (b) deviance residuals against number of village, (c) Pearson residuals against predicted values for the DBSGLMM, model (21) and (d) relationship between observed prevalence of Loa loa microfilaraemia and predicted prevalence using DBSGLMM.

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

XML Template (2013) [17.12.2013–3:05pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

18

Statistical Methods in Medical Research 0(0)

Figure 4. (Left panel) Point estimates of the prevalence of Loa loa microfilaraemia using the DBSGLMM approach, over-laid with the prevalences observed in field studies. (Right panel) Square root of the prediction variances.

prevalences of Loa loa microfilaraemia against the prevalences predicted using the DBSGLMM approach; this panel shows substantially less scatterness than the one fitted by Thomson et al.22 and Diggle et al.2 Once the model was fitted, the predicted maps of the prevalence of Loa loa and their corresponding SD for the prediction error were obtained. The results are presented in Figure 4. Note that at locations that are sufficiently close to one or more villages, the predictive distribution shows a reduced variance around the observed villages in the sample. In left panel of Figure 4, areas within the pale-orange and yellow colour ranges are those in which there is a high prevalence of Loa loa exceeding the 20% threshold of the policy intervention. Likewise, areas in the red-orange colour range are those in which there is a low prevalence of Loa loa, not exceeding the 20% threshold.

5 Discussion and conclusion The Bayesian approach used by Diggle et al.12 and Christensen and Waagepetersen34 provides a natural way of incorporating parameter uncertainty in predictive inference. However, meaningful priors on structural parameters such as the type of correlation and choice of link function are laborious to construct, and further an MCMC-algorithm with updates of such parameters may not perform well.17 Informative priors are in practice difficult to elicit, and there seems to be no consensus on how to construct non-informative reference priors for these models. Our proposed DBSGLMM using MCMC proves to be a useful tool for model selection in complicated spatial models. However, it is well known that MCMC algorithms for spatial models (based on random fields) can be very slow to mix. This certainly seems to be the case for our model,

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

XML Template (2013) [17.12.2013–3:06pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

Melo et al.

19

where a thinning rate of 1 in 1000 is applied to the MCMC output. On the fitted model, inference can be used with the aim of performing better predictions at unobservable points. The technique presented here would also be useful for other types of geostatistical models far from the generalised linear spatial models. Additionally, increasing the number of principal coordinates, the proposed method produces better estimates of the regionalised variables. Whereas taking a number of principal coordinates equal to the number of original explanatory variables, the results obtained are similar to traditional methods. In this paper, we employed the early DB prediction papers proposed by Cuadras,25 Cuadras and Arenas15 and Cuadras et al.,16 in which the set of available Euclidean predictors was circumscribed to those obtained by principal coordinates (doing classical metric multidimensional scaling). Later, Boj et al.35 clarify the central role played by the column space of the doubly centred squared distance matrix. Thus, from a methodological perspective, predictions are intrinsic because they depend on the dissimilarities between themselves rather than on a given specific basis. The invariance allows use of other possible equivalent bases that could be numerically more stable than the principal coordinates solution presented here. Therefore, a work like Boj et al.35 could be implemented for the DBSGLMM. The proposed methodology has broad application to environmental risk models of disease. In general, when there is paucity of field data and inherent uncertainty in the model inputs, decision makers need to incorporate uncertainties into the model structure. With respect to the application, the variation in the prevalence of Loa loa data was studied, and additional non-spatial variation which cannot be attributed to the binomial error distribution was considered, but it was not significative. However, the greenness of the surrounding vegetation and the elevation appear markedly to affect the prevalence of Loa loa in Cameroon. Funding This work is partially funded and supported by grant MTM2010-14961 from the Spanish Ministry of Science and Education, Carolina Foundation, Applied Statistics in Experimental Research, Industry and Biotechnology (National University of Colombia), and Core Spatial Data Research (Faculty of Engineering, District University Francisco Jose´ de Caldas).

References 1. Boussinesq M, Gardon J, Kamgno J, et al. Relationships between the prevalence and intensity of Loa loa infection in the Central province of Cameroon. Ann Trop Med Parasitol 2001; 95: 495–507. 2. Diggle PJ, Thomson MC, Christensen OF, et al. Spatial modelling and the prediction of Loa loa risk: decision making under uncertainty. Ann Trop Med Parasitol 2007; 101: 499–509. 3. Kamgno J, Bouchite B, Baldet T, et al. Study of the distribution of human filariasis in West province of Cameroon. Bull Soc Pathol Exot 1997; 90: 327–330. 4. Thomson M, Connor S, D’Alessandro U, et al. Predicting malaria infection in Gambian children from satellite data and bednet use surveys: the importance of spatial correlation in the interpretation of results. Am J Trop Med Hyg 1999; 61: 2–8. 5. Nelder JA and Wedderburn RM. Generalized linear models. J R Stat Soc, Ser A 1972; 135: 370–384. 6. Liang KY and Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika 1986; 73: 13–22.

7. Breslow NE and Clayton DG. Approximate inference in generalized linear mixed models. J Am Stat Assoc 1993; 88: 9–25. 8. Diggle PJ and Ribeiro PJ. Model-based geostatistics. New York: Springer Series in Statistics, 2007. 9. Laird JH and Ware NM. Random-effects models for longitudinal data. Biometrics 1982; 38: 963–974. 10. Stiratelli R, Laird N and Ware JH. Random-effects models for serial observations with binary response. Biometrics 1984; 40: 961–971. 11. Schall R. Estimation in generalized linear models with random effects. Biometrika 1991; 78: 719–727. 12. Diggle P, Tawn A and Moyeed R. Model based geostatistics. Appl Stat 1998; 49: 299–350. 13. Zhang H. On estimation and prediction for spatial generalized linear mixed models. Biometrics 2002; 58: 129–136. 14. Arenas C and Cuadras C. Recent statistical methods based on distances. Contrib Sci 2002; 2: 183–191.

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

XML Template (2013) [17.12.2013–3:06pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

20

Statistical Methods in Medical Research 0(0)

15. Cuadras C and Arenas C. A distance based regression model for prediction with mixed data. Commun Stat A – Theory Methods 1990; 19: 2261–2279. 16. Cuadras C, Arenas C and Fortiana J. Some computational aspects of a distance-based model for prediction. Commun Stat –Simul Comput 1996; 25: 593–609. 17. Christensen OF. Monte Carlo maximum likelihood in model-based geostatistics. J Comput Graph Stat 2004; 13: 702–718. 18. Christensen OF, Diggle PJ and Ribeiro PJ. Analysing positive-valued spatial data: the transformed Gaussian model. In: Monestiez P, Allard D and Froidevaux R (eds) geoENV III – geostatistics for environmental applications. Boston: Kluwer Academic Publishers, 2001, pp.287–298. 19. Geyer CJ and Thompson EA. Constrained Monte Carlo maximum likelihood for dependent data (with discussion). J R Stat Soc, Ser B 1992; 54: 657–699. 20. Geyer CJ. On the convergence of Monte Carlo maximum likelihood calculations. J R Stat Soc, Ser B 1994; 56: 261–274. 21. Steinsland I. Parallel exact sampling and evaluation of Gaussian Markov random fields. Comput Stat Data Anal 2007; 51: 2969–2981. 22. Thomson MC, Obsomer V, Kamgno J, et al. Mapping the distribution of Loa loa in Cameroon in support of the African Programme for Onchocerciasis Control. Filaria J 2004; 3: 7. 23. Cressie N. Statistics for spatial data, Revised ed. New York: John Wiley & Sons Inc, 1993. 24. McCullagh P and Nelder J. Generalized linear models. London: Chapman Hall, 1989. 25. Cuadras CM. Distance analysis in discrimination and classification using both continuous and categorical

26. 27. 28. 29. 30.

31. 32.

33.

34.

35.

36.

variables. In: Dodge Y (ed.) Recent developments in statistical data analysis and inference. North-Holland, Amsterdam: Elsevier Sciencie Publisher, 1989, pp.459–474. Gower J. Adding a point to vector diagrams in multivariate analysis. Biometrika 1968; 55: 582–585. Mardia KV, Kent JT and Bibby JM. Multivariate analysis. London: Academic Press, Inc, 2002. Royle JA and Wikle CK. Efficient statistical mapping of avian count data. Environ Ecol Stat 2005; 12: 225–243. Hfjbjerre M. Profile likelihood in directed graphical models from BUGS output. Stat Comput 2003; 13: 57–66. McCulloch CE, Searle SR and Neuhaus JM. Generalized, linear, and mixed models, 2nd ed. New Jersey: John Wiley & Sons, 2008. Gower J. A general coefficient of similarity and some of its properties. Biometrics 1971; 27: 857–874. Takougang I, Meremikwu M, Wanji S, et al. Rapid assessment method for prevalence and intensity of Loa loa infection. Bull World Health Organ 2002; 80: 852–858. Wanji S, Tendongfor N, Esum M, et al. Heterogeneity in the prevalence and intensity of loiasis in five contrasting bioecological zones in Cameroon. Trans R Soc Trop Med Hyg 2003; 97: 182–187. Christensen O and Waagepetersen R. Bayesian prediction of spatial count data using generalized linear mixed models. Biometrics 2002; 58: 280–286. Boj E, Delicado P and Fortiana J. Distance-based local linear regression for functional predictors. Comput Stat Data Anal 2010; 54: 429–437. Shumway RH and Stoffer DS. Time series analysis and its applications. New York: Springer-Verlag, 2000.

Appendix 1 Spectral representation According to Royle and Wikle,28 the zs process can be expanded in terms of Fourier basis functions (i.e. sines and cosines). Furthermore, assuming that the spatial process is defined on a regular lattice, for locations si for i ¼ 1, . . . , n and spatial frequencies wq ¼ q=n for q ¼ 0, . . . , n=2 (n even), we have zðsi Þ ¼

n=2 X

ð1Þ ðqÞ cosð2si wq Þ þ

q¼0

n=21 X

ð2Þ ðqÞ sinð2si wq Þ

q¼1

 t  t ð2Þ ð1Þ ¼ wð1Þ d þ w dð2Þ ¼ wti d i i where wð1Þ i ¼ ð1Þ i ðwq Þ



ð1Þ i ðw0 Þ, . . . ,

¼ cosð2si wq Þ,

ð1Þ i ðwn=2 Þ

t

, wð2Þ i ¼



ð2Þ i ðw1 Þ, . . . ,

ð2Þ i ðwn=21 Þ

t

,

 t  t t ð2Þ ð1Þ ð2Þ ðw Þ ¼ sinð2s w Þ, w ¼ w , w , q i q i i i i

 t  t t  t  t dð1Þ ¼ ð1Þ ð0Þ, . . . , ð1Þ ðn=2Þ , dð2Þ ¼ ð2Þ ð1Þ, . . . , ð2Þ ðn=2  1Þ and d ¼ dð1Þ , dð2Þ It is well known that for second-order stationary random processes, the ðqÞ ’s coefficients are nearly uncorrelated and their variances at a given frequency are approximately equal to one-half the power spectral density at that frequency, except for frequencies w0 and wn=2 at which the variance is equal to the associated power spectral density.36 Thus, assuming zs is second-order stationary with

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

XML Template (2013) [17.12.2013–3:06pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

Melo et al.

21

covariance matrix Dz ¼  2 R, where R is the correlation matrix, then CovðdÞ  2 C, where C is a diagonal matrix with the diagonal elements given by ½ f ðw0 Þ, 12 f ðw1 Þ, . . . , 12 f ðwn=2 Þ, 1 1 2 f ðw1 Þ, . . . , 2 f ðwn=21 Þ, where f ðwq Þ is the spectral density at frequency wq corresponding to the correlation function used to construct R.

Appendix 2 Numerical procedures for maximising the Monte Carlo algorithms Numerical procedure for maximising the Monte Carlo approximation (14) is presented. We reparameterise and use the relative nugget R ¼  2 = 2 instead of  2 for computational convenience; so, the covariance matrix of g is  2 ðRð#Þ þ R2 In Þ. Therefore, write ðb, hÞ ¼ ðb,  2 , #, R2 Þ, the maximisation of Lr with respect to b and  2 given # and R2 is fairly straightforward, because the first- and second-order derivatives of the normal densities f ðgð j ÞjX, b,  2 , #, R2 Þ, j ¼ 1, . . . , r, with respect to these parameters are simple, which makes an iterative Newton–Raphson procedure feasible and computationally fast. For such, suitable initial values for the iterative procedure are h  1 i1  1 bð j Þ ¼ Xt Rð#Þ þ R2 In X Xt Rð#Þ þ R2 In gð j Þ  1 1  2 ð j Þ ¼ ½gð j Þ  Xbð j Þt Rð#Þ þ R2 In ½gð j Þ  Xbð j Þ n j ¼ 1, . . . , r, corresponding to the MLEs for the normal densities f ðgð j ÞjX; b,  2 , #, R2 Þ. The values of b and  2 that maximise Lr ðb, hÞ for a fixed value of # and R2 , b bð#, R2 Þ, and ^ 2 ð#, R2 Þ are plugged into Lr, and we obtain L~ r ð#, R2 Þ ¼ Lr ðb bð#, R2 Þ, ^ 2 ð#, R2 Þ, #, R2 Þ. This function is maximised with respect to # and R2 for a given correlation function using numerical optimisation. The parameters # and R2 enter L~ r via the matrix Rð#Þ þ R2 In , and since the inversion of this matrix is computationally demanding, the maximisation would be relatively slow. The maximisation can also be sensitive to starting values in this process because the approximation Lr can be multimodal. The result should be investigated carefully by considering a variety of starting values. On the other hand, when interest is on investigating which link functions are suitable, it is useful to integrate with respect to ls ¼ ððs1 Þ, . . . , ðsn ÞÞt , where ðsi Þ ¼ mi g1 ðxðsi Þ, ðsi ÞÞ, i ¼ 1, . . . , n. The determinant of the Jacobian for this transformation is n Y g0 ððsi Þ=mi Þ J ðls Þ ¼ mi i¼1 Assuming that the link function satisfies g ðRÞ ¼ R for every n of interest, and defining f ðls jX, d; b, hÞ ¼ J ðls Þ f ð g1 ðls ÞjX, d; b, hÞ we obtain Z Lðb, h, Þ ¼ Rn

f ðys jls Þ f ðls jX, d; b, hÞdls

# " f ðys jls Þ f ðls jX, d; b, hÞ ~ f ðls jX, d; b, hÞ e f 0 ðls jys Þdls ¼ E / ys ~f ðl Þ f ðy jl Þf~ ðl Þ Rn Z

s

s

0

s

0

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

s

ð22Þ

XML Template (2013) [17.12.2013–3:06pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

22

Statistical Methods in Medical Research 0(0)

~ ~ e where f~ 0 ðls Þ ¼ J 0 ðls Þf~ ð g1 0 ðls ÞÞ, the conditional density f 0 ðls jys Þ / f ðys jls Þf 0 ðls Þ, and Eðjys Þ ~ denotes expectation with respect to f 0 ðjys Þ. MLE can be calculated by maximising the Monte Carlo approximation to (22), Lðb, h, Þ ¼

r 1X f ðls ð j ÞjX, d; b, hÞ r j¼1 f~ 0 ðls ð j ÞÞ

ð23Þ

where the ls ð j Þ’s ( j ¼ 1, . . . , r) are sampled by MCMC from the distribution f~ ðjys Þ. If g ðRÞ 6¼ R, then (22) may still hold. Finally, a starting value for b can be chosen by fitting a DBSGLMM with i.i.d random effects. From the resulting estimates of the random effects, the empirical variogram can be calculated through 2 1 X^ ^ jÞ , h40 ^ ðsi Þ  ðs ðhÞ ¼ 2jNðhÞj NðhÞ where NðhÞ ¼ fðsi , sj Þ : jsi  sj j ¼ hg and jNðhÞj is the number of distinct pairs N(h). Then, this empirical variogram is plotted, which may help to gain some insight into the parametric form of the variogram and to choose initial values of the variogram parameters.

Appendix 3 Interpolation of random effects using unbiased linear predictors Minimising the mean-squared-error of the prediction is done through Z Z h   i  t   0 t 0 e e e E gg A gg g  g0 f ðg0 , gÞdgdg0 ¼ g  g0 A e

ð24Þ

where f ðg0 , gÞ is the joint density function of g0 and g, and A is a positive definite symmetric matrix. Using (18), the left-hand side of (24) can be written as h t  i q ¼ E h þ Q  g0 A h þ Q  g0 h   t  i ¼ ht Ah þ 2ht AE Q  g0 þ E Q  g0 A Q  g0 ð25Þ Differentiating partially (25) with respect to h and equating to 0, we have that   @q ¼ 2A h þ E Q  g0 ¼ 0 @h   h ¼ E Q  g0 Substituting (26) into (25) gives h   t   t  i q ¼  E Q  g0 AE Q  g0 þ E Q  g0 A Q  g0    ¼ tr AVar Q  g0         ¼ tr A QD Qt þ Var g0  QCov g, g0  Covt g, g0 Qt

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

ð26Þ

ð27Þ

XML Template (2013) [17.12.2013–3:06pm] [1–23] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/130237/APPFile/SG-SMMJ130237.3d (SMM) [PREPRINTER stage]

Melo et al.

23

      where D ¼  2 Rð#Þ þ R2 In , Var g0 is the covariance matrix of g0 , and Cov g, g0 is the crosscovariance matrix between g and g0 .   We now wish to minimise (27) with respect to Q. To do this, ignore A and Var g0 because they do not involve Q. Then, differentiating partially (27) with respect to Q and equating to 0, we have that   @q ¼ 2QD  2Covt g, g0 ¼ 0 @Q   Q ¼ Covt g, g0 D1 

ð28Þ

Thus, substituting (26) and (28) in (18), the pseudo best linear unbiased predictor is given by     e g ¼ E g0 þ Covt g, g0 D1  ½g  EðgÞ ð29Þ   ¼ X0 b þ Covt g, g0 D1  ½g  Xb where X0 is a matrix of k principal coordinates for the n0 new spatial subjects including a vector of ones, i.e. 1 of size n0  1. Taking expectation in (29), we obtain

  Eðe gjys Þ ¼ X0 b þ Covt g, g0 D1  Eðgjys Þ  Xb h i   e X0 b þ Covt g, g0 D1 Er ðgjys Þ  Xb  where e Er is the empirical mean vector based on the sample gð1Þ, . . . , gðrÞ. The covariance matrix for the prediction given in (29) is given by

1    0 Varðe gjys Þ ¼ D0 þ Covt g, g0 D1  Varðgjys Þ D Cov g, g h    i   g r gjy D1 Cov g, g0 D0 þ Covt g, g0 D1 Var s      0 g r is the covariance matrix based on the where D0 ¼ Varðg0 Þ  Covt g, g0 D1 and Var  Cov g, g sample gð1Þ, . . . , gðrÞ.

Downloaded from smm.sagepub.com at University of British Columbia Library on November 17, 2015

Spatial generalised linear mixed models based on distances.

Risk models derived from environmental data have been widely shown to be effective in delineating geographical areas of risk because they are intuitiv...
665KB Sizes 1 Downloads 0 Views