Biometrics 70, 110–120 March 2014

DOI: 10.1111/biom.12118

Fast Forward Selection for Generalized Estimating Equations with a Large Number of Predictor Variables Jakub Stoklosa,1, * Heloise Gibb,2 and David I. Warton1 1

School of Mathematics and Statistics and Evolution & Ecology Research Centre, The University of New South Wales, NSW 2052, Australia 2 Department of Zoology, La Trobe University, Victoria 3068, Australia ∗ email: [email protected]

SUMMARY. We propose a new variable selection criterion designed for use with forward selection algorithms; the score information criterion (SIC). The proposed criterion is based on score statistics which incorporate correlated response data. The main advantage of the SIC is that it is much faster to compute than existing model selection criteria when the number of predictor variables added to a model is large, this is because SIC can be computed for all candidate models without actually fitting them. A second advantage is that it incorporates the correlation between variables into its quasi-likelihood, leading to more desirable properties than competing selection criteria. Consistency and prediction properties are shown for the SIC. We conduct simulation studies to evaluate the selection and prediction performances, and compare these, as well as computational times, with some well-known variable selection criteria. We apply the SIC on a real data set collected on arthropods by considering variable selection on a large number of interactions terms consisting of species traits and environmental covariates. Key words: Ecology; Generalized estimating equations; Information criterion; Model selection; Multivariate analysis; Negative binomial regression; Score statistics; Trait/environment relationship. 1. Introduction For many studies involving longitudinal data with correlated responses, the well-known marginal regression method of generalized estimating equations (GEEs) is usually considered for analysis (Liang and Zeger, 1986; Hardin and Hilbe, 2002). In this context, likelihood functions are often undefined, hence common model selection techniques, such as Akaike information criterion (AIC, Akaike, 1974) and the Bayesian information criterion (BIC, Schwarz, 1978) cannot be used. Several alternative criteria have since been proposed, these include: quasi-likelihood information criterion (QIC, Pan, 2001b) and its variants (Hin and Wang, 2009), generalized Mallow’s Cp (GCp , Cantoni, Flemming, and Ronchetti, 2005), and BIC using the quadratic inference function (QIF) approach (Wang and Qu, 2009). More recently, model selection criteria have been proposed using penalized GEEs (Wang, Zhou, and Qu, 2012) and penalized QIFs (Cho and Qu, 2013), where both methods use the smoothly clipped absolute deviation (SCAD) penalty. We focus on the situation where the number of predictor variables (q) in the model is large. In this setting all-subsets is not practicably feasible and a stepwise approach (e.g., forward selection, see Figure 1 and Table 1) is commonly implemented, either explicitly (as in Burnham and Anderson, 1999) or implicitly (as in MARS, LARS/LASSO algorithms Friedman, 1991; Hastie, Tibshirani and Friedman, 2001). Moreover, even for stepwise approaches, the number of models to be fitted increases quadratically with the number of variables (see Table 1). For example, in the case study presented in Section 4, we are concerned with 81 interaction variables, hence 3321 model fits would be required in order to complete a full forward selection path.

110

In this article we propose the score information criterion (SIC): a much faster and generally applicable selection criterion which incorporates correlation into its quasi-likelihood term rather than the penalty term, giving better all-round performance for correlated data. The SIC is represented as a function of score statistics, and hence it only requires fitting a “null” model and not all candidate models. In combination with forward selection this means that many fewer models are fitted to the data, so a potentially huge reduction in computation time can be achieved. In addition, the proposed SIC statistic can be consistent or asymptotically efficient depending on the choice of penalty, in an analogous way to classical results for BIC and AIC respectively (Schwarz, 1978; Shibata, 1980). In Section 2.1 we introduce some notation and give a brief outline of GEEs. We review several existing model selection techniques for GEEs in Sections 2.2.1–2.2.2, and give the proposed SIC in Section 2.2.3. Consistency and prediction properties are presented for the SIC under different penalty terms. Simulations are presented in Section 3, where we investigate and compare: selection and prediction performances, as well as computational times. In Section 4 we analyse arthropod data by considering variable selection on the interactions between species trait and environmental covariates. 2.

Methods

2.1. Notation and Generalized Estimating Equations Consider a clustered data set with cluster sizes j = 1, . . . , ni and i = 1, . . . , N independent clusters. For simplicity, let ni = n for i = 1, . . . , N. Let yi = (yi1 , . . . , yin )T be a vector of responses, and Xi = (xTi1 , . . . , xTin )T be an n × (q + 1) matrix © 2013, The International Biometric Society

Fast Forward Selection for Generalized Estimating Equations

~1

+X1

+X1

+X1

+X1

+X2

+X2

+X3

+X3

+X3

+X3

+X5

+X5

2

3

111

+X3

+X4

+X5 k:

0

1

4

5

Figure 1. Forward and proposed selection paths with q = 5 number of variables. Forward selection fits all models (i.e., each node as above), whereas the proposed approach requires fitting models along the selection path only (shaded nodes). This figure appears in color in the electronic version of this article. where xij is a vector with q covariates and the intercept. Throughout we assume that the components of yi are correlated but yi and yl are independent for i = l. Suppose that g(μi ) = Xi θ, where g(·) is a specified link function, E(yi | Xi ) = μi , Var(yi | Xi ) = v(μi )φ for some known function v(·) and scale parameter φ, and θ ∈  ∈ Rq+1 contains the associated unknown regression parameters. Let R = R(α) = Corr(yi ) be the within-subject correlation matrix (parametrized by unknown α), also commonly referred to as the working correlation matrix. In general, R is unknown but some structure may be specified, for example, Table 1 Number of model fits required (as the number of variables q increases) for: all subsets variable selection; forward variable selection; and the proposed variable selection approach (see Section 2.2.3). Note that there is a huge reduction in the number of model fits when considering the proposed approach (3rd column) for large q. # of model fits when using # of vars.

All subsets

Forward selection

Proposed approach

5 10 50 100

31 1023 ≈ 1.13e15 ≈ 1.27e30

15 55 1275 5050

5 10 50 100

≈2

≈ q /2

q

q

2

q

compound symmetry (CS), an autocorrelation of order 1 (AR(1)) or independence. For now, assume that φ = 1 and let Ai = diag{v(μi1 ), . . . , v(μin )} and y = (yT1 , . . . , yTN ). The regression parameters θ are estimated by solving the GEE:

U θ (θ, R; y) =

N 

DTi V −1 i si = 0,

(1)

i=1

1/2

1/2

where Di = ∂μi /∂θ, V i = Ai RAi and si = yi − μi . The subscript on U is included to emphasise that U is a function of derivatives w.r.t. θ. A key benefit of using (1) is that if R is misspecified, then consistent estimators for θ are still obtainable (Liang and Zeger, 1986; Pan, 2001b). There are a number of estimation algorithms for β and α, we will use GEE1 (Hardin and Hilbe, 2002), however the ideas presented in this article are valid irrespective of choice of estimation algorithm. Let  θ be an estimate of θ. The covariance matrix for  θ can be consistently estimated using the sandwich or robust estimator (Liang and Zeger, 1986). For our analysis we consider Pan’s modification (Pan, 2001a) as it provides better efficiency when we have good knowledge of the mean/variance relationship:

 Vθ = lim N

 N 

N→∞

 N  i=1

−1 DTi V −1 i Di

i=1

−1 DTi V −1 i W i V i Di

 N  i=1

−1 DTi V −1 i Di

,

112

Biometrics, March 2014 1/2

N

−1/2

−1/2



1/2

where W i = Ai A si sTi Ai /N Ai . Recently, i=1 i Warton (2011) developed a regularization approach, which overcomes estimation instability issues when N is small or n > N, these extensions are easily applicable for the proposed methodology discussed in Section 2.2. 2.2. Model Selection Criteria Below, we review several variable selection criteria discussed in the first paragraph of Section 1. Denote by M the class of all candidate models that include the intercept model plus a subset of q covariate parameters. Define Mθ to be some candidate model with parameter space denoted by θ ∈  and denote its dimension by |θ|. 2.2.1. Prediction methods. As discussed in Section 1, likelihood functions are intractable when considering GEEs. The QIC replaces the likelihood function with a quasi-likelihood, and tries to minimize the Kullback–Leibler (KL) divergence between the true model and the estimated model. Denote I by an independent working correlation matrix. For some candidate model Mθ , the QIC is given as: θ , R; y) = −2 QIC(

 ni

N

 θ ), IV Q( θ ; yij ) + 2Tr(

(2)

i=1 j=1

θ ; yij ) is the quasi-likelihood function using  θ (evalwhere Q(   I = Ni=1 DTi V −1 uated using R), Tr is the trace and  i Di is estimated using both  θ and I. The GCp also avoids likelihood functions and is based on a measure of predictive error. Let rij = (yij − μ  ij )/v1/2 (μij )  ij are the fitted values and the v1/2 (μij ) are estimated where μ from the full model. The GCp is given as: GCp ( θ , R; y) =

ni N  

rij2 −

i=1 j=1

+

N  i=1

ni N 2  Tr(M −1 DTi A−2 i Di ), N

N→∞

N i=1

N 1  1 ¯N = g gi ≈ N N i=1

...,

N 



N 

DTi A−1 i si ,

i=1

−1/2 −1/2 DTi Ai BL Ai si



N 

−1/2

DTi Ai

−1/2

Bl Ai

si ,

i=1

.

i=1

A generalized method of moments is then applied, which yields the QIF (first term of (4) as given below), and minimization is carried to obtain estimates for θ. For variable selection, a penalty can be added to the QIF, defined as: ¯ −1 g ¯N + N ¯ TN C PQIF( θ , R; y, Pλ ) = N g

q 

Pλ (| θ j |),

(4)

j=1

N

¯ = (1/N) g gT is the sample covariance of gi and where C i=1 i i Pλ is the penalty function. We follow Cho and Qu (2013) by choosing the nonconvex SCAD penalty for Pλ and use a local quadratic approximation algorithm to carry out the estimation and variable selection. The PQIF leads to some very nice variable selection properties provided that q increases as N increases. For further details on the PQIF implementation, properties and performance, see Cho and Qu (2013) therein. 2.2.3. The score information criterion. In this section we will propose a new variable selection criterion, the score information criterion (SIC). This criterion is a function of score statistics, so we start by quickly reviewing their

formulation. Let θ = (γ T , δT )T and = Var U θ (θ, R; y) =

N

ni

(3)

i=1 j=1

where M = lim N −1

Bl for l = 1, . . . , L, which allows the GEE (1) to be approximated by the extended score vector:

 DTi sTi V −1 i si Di is estimated using θ .

Importantly, both (2) and (3) take summation over the ni correlated observations in the leading term, referred to throughout as the “quasi-likelihood” (using the term loosely). This in effect means that within-cluster correlation is not accounted for in the quasi-likelihood and it is only adjusted for later in the penalty term. Both criteria can be understood as coming from the AIC mould, in the sense that they try to maximize the expectation of some quasi-likelihood on new data, and under certain conditions their penalties would reduce to the AIC penalty of 2|θ| for some model Mθ . 2.2.2. Penalized quadratic information function. Qu, Lindsay, and Li, 2000 proposed the QIF to improve the efficiency of parameter estimation under the misspecification of the working correlation. Here, the working correlation is approximated by a linear combination of several basis functions

−1 DTi V −1 i W i V i Di . Consider testing H0 : δ = 0, under this i=1 null hypothesis we can also partition the GEE (1) and its δ γ , R; y) and variance matrix, denoted respectively by U δ ( (see Web Appendix A for the formula). The generalized score statistic (Rotnitzky and Jewell, 1990; Nuamah, Qu, and Amini, 1996) is thus given by:

−1

 δ U δ ( Tδ ( γ , R; y) = U δ ( γ , R; y)T γ , R; y),

(5)

γ , that is it requires pawhich only requires the estimates  rameter estimates under the null model only. If H0 were true, then U δ ( γ , R; y) (the score equations for δ) would be close to zero when evaluated under the null. Large values of Tδ ( γ , R; y) would argue against H0 . Now we will motivate the use of score statistics to approximate a likelihood-based criterion. For some candidate model Mθ and λ > 0, a typical likelihood-based information criterion (Burnham and Anderson, 1999) has the form: IC(θ; y) = −2(θ; y) + λ|θ|

(6)

for some defined log-likelihood function (θ; y). However, in our case (θ; y) is undefined, so we first re-express (6) as the

Fast Forward Selection for Generalized Estimating Equations likelihood ratio statistic with some penalty term:





γ ; y) − 2 (θ; y) − ( −2(θ; y) + λ|θ| = −2( γ ; y) + λ|θ|





= −2 (θ; y) − ( γ ; y) + λ|θ| + c

(7)

for some reference model Mγ (usually the intercept model) whose log-likelihood (denoted by c) is ignorable because it is not a function of Mθ . The key idea behind our new criterion is to replace the first term of (7) with a score statistic; this is motivated as a quadratic approximation to the log-likelihood ratio statistic (Rao, 1973). Score statistics not only have the computational advantage of being calculable from the null model, but in the context of GEEs, they are calculable via equation (1) even when the likelihood is undefined. First, we propose a one-step SIC procedure. Consider some set of candidate models Mθ ∈ M of the form θ = (γ T , δT )T for some γ that is common to all Mθ ∈ M, that is, Mγ can be treated as a common reference model. Then: SIC(1) (θ, R; y, λ) = −Tδ ( γ , R; y) + λ|θ|.

(8)

γ only, that is, only one Note that this criterion is a function of  model needs to be fitted, model Mγ . The best-fitting candidate model is determined via the steepness of the score equations U δ ( γ , R; y) at the parameter estimate  γ for this common reference model. The approximation of score statistics to likelihood ratio statistics can be poor when there is a significant departure from the null model. This is especially the case when there is a strong mean-variance relationship, because failure of the null model to capture mean trends implies a failure to capture variance (see Section 3.3). Hence an improved approximation might calculate the score statistic in one-parameter increments, resulting in: SIC(θ, R; y, λ) = −

|δ|  k=1

max

δ(k) ∈δ\k−1



Tδ(k) ( θ k−1 , R; y) + λ|θ|, (9)

where θ Tk = (θ Tk−1 , δ(k) ) and δ\k−1 = {δ ∩ θ ck−1 } where θ ck−1 is the complement set of θ k−1 . In other words, we sequentially add new parameters selected from δ, these are δ(k) for k = 1, . . . , |δ|, in the order that maximizes the score statistic Tδ(k) ( θ k−1 , R; y) at each step, and we use the sum of these score statistics in our criterion. This formulation leads to several desirable properties and advantages over other criteria: • If we consider a forward selection type procedure, this reduces the number of model fits from polynomial in q (the number of covariates) to linear in q; since we do not need to estimate every model in the forward selection path, instead only null hypothesis models need to be fitted (shaded nodes in Figure 1). This significantly reduces the computational cost (particularly when we consider a large number of predictor variables, as noted in Nuamah et al., 1996). We investigate computational times further in Sections 3 and 4.

113

• The SIC also incorporates the correlation between variables into the leading or “quasi-likelihood” term, rather than adjusting for it in the penalties as in (2) and (3). • Computation can be even faster (although the approximation to a likelihood ratio statistic cruder) if SIC is calculated using the one-step approximation (8). • As with AIC and BIC, a stopping rule can be easily implemented for SIC when using forward selection. Specifically, the selection path could cease at model Mθk−1 if the largest possible score statistic Tδ(k) ( θ k−1 , R; y) were less than λ. • Finally, we consider the penalties λ = 2 and λ = log N for SIC, denote by SIC2 and SIClog N respectively, as these are analogous to AIC and BIC and have similar properties which we show below. First, we demonstrate a consistency property for the SIC when λ = log N. Let Mt ⊂ M be the subset of true models and define the class of most parsimonious true models by PMt = {Mθ ∈ Mt : |θ| ≤ |θ ∗ |, ∀Mθ∗ ∈ Mt }. Let λ = log N in (9). Then we can prove consistency in variable selection as follows. Theorem 1. Consider a set of candidate models which includes an Mθ ∈ PMt . Denote the model selected by the ˜ θ . In the Appendix we show that minimum SIClog N (θ) by M SIClog N (θ) is consistent: ˜ θ ∈ PMt ) → 1 Pr(M as N → ∞. Theorem 1 is analogous to a classical result for BIC (Schwarz, 1978). We can similarly derive an analogue of the classical asymptotic efficiency result for AIC (Shibata, 1980), when λ = 2 in (9). As in Zhang, Li, and Tsai (2010), an asymptotically efficient criterion selects the model whose KL loss is asymptotically equivalent to the minimum possible value amongst the candidate models. Let Me ⊂ M be a subset of models which may not necessarily contain the true model but is from a set of candidate models in the neighborhood of the true model. Theorem 2. Let  θ be the parameter estimates from model Mθ ∈ Me , which has been selected by the minimum SIC2 (θ). Assume the log-likelihood for model Mθ exists and denote KL( θ ) as the KL loss for  θ . In the Appendix we show that θ ) is asymptotically efficient: SIC2 ( θ) KL(



inf

Mθ ∈Me

→1

KL( θ)

in probability as N → ∞. Note that in Theorem 2 we needed the restrictive assumption that the likelihood is defined: this was necessary in order to use the definition of efficiency in Zhang et al. (2010). Many GEE models do not satisfy this condition, for example, those

114

Biometrics, March 2014

Table 2 Simulation 1. For various variable selection criteria we give: the proportions of times the true model was selected (Ptrue ) and the logarithms of the scaled mean prediction error (log(MSE∗ )) for each N (number of clusters) for cluster size n = 10, we also give the median computational times (seconds) across each N. Note that in (b), the “-” indicates no information available (non-convergence). Ptrue

log(MSE∗ )

N

N

Selection criteria

15

30

50

100

200

15

30

50

100

200

Times (seconds) Median

(a) q = 5 QIC GCp PQIF SIC2 SIClog N (1) SIC2 (1) SIClog N

0.51 0.53 0.04 0.54 0.53 0.48 0.40

0.62 0.63 0.10 0.66 0.76 0.72 0.78

0.70 0.68 0.18 0.70 0.89 0.74 0.94

0.69 0.68 0.32 0.67 0.90 0.74 0.92

0.70 0.70 0.50 0.69 0.97 0.75 0.98

2.66 2.64 2.65 2.62 2.68 2.70 2.83

1.82 1.83 2.01 1.83 1.75 1.78 1.73

1.19 1.19 1.37 1.20 1.06 1.15 1.02

0.59 0.60 0.78 0.61 0.45 0.57 0.44

−0.15 −0.15 −0.04 −0.13 −0.31 −0.16 −0.32

2.15 2.53 10.26 0.48 0.48 0.18 0.18

(b) q = 20 QIC GCp PQIF SIC2 SIClog N (1) SIC2 (1) SIClog N

0.00 0.00 − 0.00 0.01 0.01 0.01

0.00 0.00 − 0.02 0.02 0.02 0.01

0.01 0.01 0.01 0.01 0.06 0.02 0.06

0.01 0.01 0.19 0.04 0.28 0.06 0.26

0.02 0.02 0.51 0.05 0.55 0.05 0.56

5.51 5.57 − 6.35 6.13 5.37 5.17

4.67 4.69 − 4.63 4.40 4.59 4.34

4.09 4.11 4.00 4.03 3.83 4.01 3.81

3.32 3.32 2.98 3.25 2.86 3.20 2.90

2.59 2.58 1.81 2.52 1.78 2.51 1.75

35.64 42.67 51.49 3.96 3.96 2.33 2.40

for a non-Gaussian response with a working correlation matrix that is not the identity. It may be possible to derive a similar result involving (possibly scaled) mean squared error (MSE) hence relax the assumption that the likelihood is defined. 3. Simulations To compare the computational speed to other criteria and further investigate the properties for the SIC (i.e., consistency, efficiency and properly accounting for correlation) we constructed three simulation studies. Specifically, we considered the QIC, GCp , SIC2 , SIClog N , the SIC using a one-step ap(1) proximation with λ = 2 (denote by SIC2 ) and λ = log N (de(1) note by SIClog N ) in a forward selection setting, and the PQIF which requires fitting the full model and selecting λ. For each simulation study we generated 250 data sets. 3.1. Simulation 1 The first part of our simulation study follows Cantoni, Flemming, and Ronchetti (2005), where we considered binary response data and a logistic link function. The linear predictor is given by g(μij ) = xTij θ where g(t) = log{t/(1 − t)}, xTij = (1, xijA , xijB , xijC , xijD , xijE ) and θ = (0.5, 1, 0, 0.5, 0.5, 0)T , thus q = 5. The covariates were generated independently from the following distributions: xijA ∼ Bin(1, 0.5), xijB ∼

 and xijC , xijD , xijE ∼ N(0, 1). Mult (1, 2, 3)T , (0.35, 0.15, 0.5)T B E Given θ, the covariates xij and xij are redundant from the true model. We generated correlated binary data with a CS structure (with α = 0.1) using the R-package mvtBinaryEP as in Wang et al. (2012). Predictive performance was measured by generating independent training and test data (here the

test data contained 200 clusters for each n considered below), where a scaled MSE of the linear predictor on the test data was used. The second part of our simulation study includes more covariates. Here, we constructed a similar simulation study design as Cho and Qu (2013) by generating true in(k) dependent covariates from xij ∼ Unif(0, 8) for k = 1, 2, 3 (k)

and xij ∼ Unif(0, 1) for k = 4, . . . , q = 20. We set θ = (1, 0.8, −0.7, −0.6, 0, . . . , 0)T , with a CS structure and α = 0.4. In Table 2 we report the proportions for the number of times the true model was chosen (Ptrue ) and the logarithms of the scaled predictive performance (log(MSE∗ )) for clusters N = 15, 30, 50, 100, 200 and cluster size n = 10. The median computational times (in seconds) across all N is also reported. In Web Figure 1 we give boxplots for the number of false excluded/included variables in the final selected model to assess which criteria under-fitted or over-fitted the data. We also considered cluster sizes n = 5 and n = 15 and found similar results (see Web Tables 1 and 2). It is evident from Table 2a that QIC, GCp and SIC2 select the true model with similar frequency. This may be expected as all three criteria are of the AIC-type. These criteria were (1) (1) slightly outperformed by SIClog N , SIC2 and SIClog N , with the one-step methods performing very well as the number of clusters increased. The PQIF performed the poorest in Table 2a, which could be a result of the small sample size used, however once we increased q and N (see Table 2b), the PQIF outperformed all criteria with the exception of SIClog N and (1) SIClog N which were comparable. All criteria performed poorly for large q and small N, and tended to over-fit the data as

Fast Forward Selection for Generalized Estimating Equations the number of clusters increased, although noticeably less for SIClog N , see Web Figure 1. The prediction performance was relatively similar for all criteria, where one-step methods once again performed best. The most significant difference was in the median computational times for completing the full forward selection path, where the SIC methods were much faster than QIC, GCp and PQIF. For example, in the latter simulation study, the SIC methods gave median computational times ranging between 2.3 and 4 seconds, compared with QIC, GCp and PQIF which gave 35.64, 42.67, 51.49 seconds, respectively (see last column of Table 2b). We further investigate computational times in Section 4, where more variables are included in the model, hence differences in computation time are more dramatic. 3.2. Simulation 2 In our second simulation study we explored the sensitivity of different criteria to within cluster correlation. Of particular interest was where the parameters lay in relation to the major and minor axes of ellipsoids representing confidence regions for parameters assuming within-cluster correlation (as in the SIC and QIF “quasi-likelihood”) and under the independence working assumption (as in the GCp and QIC “quasilikelihood”). As discussed in Warton (2008), the working independence assumption enforces a spherical confidence region (for suitably standardized variables), underestimating uncertainty along major axes, overestimating along minor axes. Hence we hypothesized (as in Warton, 2008) that when the parameters were located along the major axis of the confidence region, procedures using the working independence assumption (QIC and GCp ) would be more likely to include such terms in the model. Conversely, when the parameters were located along a minor axis, procedures using the working independence assumption would be less likely to include them in the model, see Figure 2 of Warton (2008). To test this idea we considered a Gaussian response and an AR(1) correlation structure with magnitudes α = 0, 0.1, . . . , 0.9. Covariates xiA , xiB and xiC were generated independently from a normal distribution with zero means and unit variance. Here, we considered a varying coefficient GEE, that is, let g(μij ) = xTij θ where θ = θ(j) is now a varying coefficient vector for j = 1, . . . , n. We then conducted an eigendecomposition on Var( θ ). Note that since R = Cov(y), we

N

−1

θ) = XTi R−1 Xi , which does not depend on have Var( i=1 θ. Values for θ were then chosen  as either the first or third eigenvectors of Var( θ ) along b jj , where jj are the eigenvalues and b is some specified constant. We considered a true model which only contains one covariate xijA (with θ constructed as described above) and we fixed the intercept coefficient vector to 0.2. The cluster size was set to n = 3 with N = 100 clusters and b = 3. Note that for each forward selection step, once a new covariate is introduced into the model, then the vector of coefficients are either all included or all left out of the model. The one-step SICs alternatives were not reported here as they gave similar results to their corresponding full path SICs. Results for the PQIF were also excluded here, which unexpectedly performed poorly; this could be due to the moderate sample being size used or a more appropriate penalty (Pλ ) should be considered

115

(e.g., a grouped-type LASSO penalty) since n-length parameters are now associated within each covariate. We included a na¨ive SIC2 (denoted SIC2 (na¨ive), and calculated assuming the working independence assumption), which we expected to behave similarly to QIC and GCp , and have little robustness to changes in the correlation structure. In Figure 2 we plot the Ptrue results for each criterion when located along the first and third eigenvectors against α. As expected, we see that all criteria gradually improve in selecting the true variables as α increases when located along the first eigenvector, see Figure 2a. When points are located along the third eigenvector, the SICs remain quite robust, however QIC, GCp and SIC2 (na¨ive) give progressively poorer results as α increases, see Figure 2b. This is because an increasing amount of sample variation is in a direction orthogonal to the vector of true parameters θ. The QIC, GCp and SIC2 (na¨ive) are less effective in this setting, overestimating sample variation in the direction of θ due to their independence assumption. 3.3. Simulation 3 As mentioned above, the one-step SIC can perform poorly when the mean-variance relationship differs. We investigated this further by considering a negative binomial distribution with an independent correlation structure, six covariates xijA , xijB ∼ Bin(1, 0.5), xijC , xijD ∼ Unif(−1, 1) and xijE , xijF ∼ N(0, 1), and coefficients set to θ = (−1, 0, 1, 0, −1, 0)T with the intercept set to 0.1. Thus, g(μij ) = xTij θ where g(t) = log(t) and v(μij ) = μij (1 + ψμij ), which allows us to control the strength of the mean-variance relationship through ψ. We set N = 30 and N = 100 with n = 5, and increasing values for ψ = 0.5, 1.25, . . . , 5.75. In Web Appendix B we also investigated the case when the dimension of the cluster size exceeds the number of clusters n > N (which also mimics our case study data). Here, we fixed ψ = 3.5 for N = 15 and N = 30, and allowed the cluster size to vary for n = 20, 25, . . . , 50. In Figure 3 we plot Ptrue for each variable selection criterion, where it is clear the one-step SICs gives poorer results (compared with the full path SICs) as the mean-variance relationship strengthens. The reference model used in simulations was the intercept model, which assigns all observations the same mean and hence the same variance. Thus a one-step SIC was in effect assuming homoscedastic data, clearly inappropriate, and leading to undesirable behavior when there was a strong mean-variance trend (see Figure 3). 4. Example: Arthropod Data The case study presented here is motivated by multivariate abundance data collected on ants which were identified to genus and morphospecies, with reference specimens identified to species (Gibb and Cunningham, 2013). The study was conducted in Eucalyptus woodlands in south-eastern Australia, at N = 30 sites within 150 km of the city of Canberra (35◦ 15’S 149◦ 08E). In addition to these observed count data, environmental covariates, such as shrub cover, slope, etc. were collected from each site, and trait covariates such as maximum eye width, number of spines, etc. were also recorded (see Figure 4 for the full covariate list). The mean abundance can therefore be represented as functions of environmental and trait covariates, and environmental × trait interactions.

116

Biometrics, March 2014

(b) Third eigenvector

ptrue 0.4

0.4

ptrue

0.6

0.6

0.8

0.8

1.0

1.0

(a) First eigenvector

QIC 0.2

0.2

GCp SIC2 SIClogN

0

0.1

0.2

0.3

0.4

0.5

α

0.6

0.0

0.0

SIC2(naive) 0.7

0.8

0.9

0

0.1

0.2

0.3

0.4

0.5

α

0.6

0.7

0.8

0.9

Figure 2. Simulation 2. Proportions of times the true model (Ptrue ) was chosen for each variable selection criterion as a function of correlation parameter (α) when θ is located along the (a) first and (b) third eigenvectors. Note that QIC and GCp give the exact same results, also QIC, GCp and SIC2 (na¨ive) give progressively poorer results as correlation is increased when θ is located along the third eigenvector (b). This figure appears in color in the electronic version of this article.

(a) N=30 and n=5

(b) N=100 and n=5 1.0

1.0

QIC GCp

0.8

0.8

SIC2 SIClogN (1) (1)

ptrue 0.4 0.2 0.0

0.0

0.2

0.4

ptrue

SIClog N

0.6

0.6

SIC2

0.5

1.25

2

2.75

ψ

3.5

4.25

5

5.75

0.5

1.25

2

2.75

ψ

3.5

4.25

5

5.75

Figure 3. Simulation 3. Proportions of times the true model (Ptrue ) was chosen for each variable selection criterion for (a) N = 30 and (b) N = 100 with the dispersion parameter set at ψ = 0.5, 1.25, . . . , 5.75. Notice how the Ptrue are smaller for the SIC(1) s compared with the full path SICs as ψ increases. This figure appears in color in the electronic version of this article.

Fast Forward Selection for Generalized Estimating Equations

117

Weber.s_length

Sculpturing

Standardized coefficient estimate

Trait variables

Scape_length

Pilosity

0.2 0.1

No._spines

0.0 Max_eye_width

−0.1

Mandible_length

−0.2

Femur_length

Vo l_L yin g

itte r Liv es toc k Mi d.s tor ey _c ov er Sh ru b_ co ve r Sl op e

Le af_ l

Ba re _R

oc k Ca no py _c ov er Fo rb

Eye_Position

Environment variables Figure 4. Case Study. The standardized coefficient estimates for each selected environmental×trait interaction term (see Section 4 for interpretation) using the arthropod data. This figure appears in color in the electronic version of this article. These data can be represented as clustered data (where the number of sites are represented as clusters and the cluster size is the number of species seen), and analysed using GEEs to account for possible correlation structures between species seen at particular sites (Warton, 2011). The total number of observed species (cluster size) was n = 35. To avoid any numerical instabilities and convergence issues (since n > N), we used the regularization approach of Warton (2011) and Moore–Penrose pseudo inverses to estimate the covariance matrix . Our interest is in the interactions between the environmental and species trait covariates. The matrix of interaction terms informs us about which species characteristics cause differences in environmental response across species, and how. We use nine environmental covariates measured at each site and nine species trait covariates, thus we have 81 different possible interaction variables to consider. The main effects design matrix (corresponding to Mγ ), consists of an intercept term and all covariates (i.e., their linear and quadratic terms); this always remained fixed in the analysis and variable selection is only conducted on the interaction terms, hence q = 81 in our analysis. We made the working assumption that the marginal distributions for each species are independent. We initially considered a Poisson distribution for the response, however upon examining the Dunn–Smyth residual plots (Dunn and Smyth, 1996), we identified a strong trend due to over-dispersion in the mean-variance relationship. Therefore we considered a negative binomial distribution for the response. We also inves-

tigated the computational times when considering a stopping rule in the variable selection analysis; once the current criterion value had increased to 5 above the minimum achieved criterion value, the forward selection algorithm would stop and the interaction terms with the corresponding minimum criterion value would be selected. In Table 3 we report the number of interaction terms selected by using QIC, GCp , (1) SIC2 , SIClog N and their one-step approximations, SIC2 and (1) SIClog N . We also report the computational times with and without using the stopping rule.

Table 3 Case Study. Total number of selected variables in the model and computational times for QIC, GCp , SIC2 , SIClog N and (1) (1) their one-step approximations, SIC2 and SIClog N , with and without stopping rule, using the arthropod data. Selection criteria QIC GCp SIC2 SIClog N (1) SIC2 (1) SIClog N

Computational time (minutes)

Total # of selected var.

No stopping rule

Stopping rule

64 42 45 9 8 3

177.12 225.91 5.84 5.90 2.19 2.09

161.10 122.99 2.82 1.18 1.14 0.38

118

Biometrics, March 2014

From Table 3 we see that QIC, GCp and SIC2 selected a similar number of interaction variables whereas SIClog N , (1) (1) SIC2 and SIClog N selected much fewer variables. The computational times were a lot faster for each SIC, reducing computation time from hours to minutes. This was especially the case when implementing the stopping rule. An obvious reason for this was that SIC tended to choose smaller models than competitors, hence required less steps along the forward selection path. A subtle secondary reason was that the main computational cost for GCp and QIC was in earlier steps (i.e., when the number of candidate variables to consider was largest), so a stopping rule was of less benefit for those methods. In Web Figures 3–5 we plot the selection paths, Dunn– Smyth residuals and coefficient estimates from each criterion. We specifically focus on SIClog N , where a minimum SIClog N value is evident (Web Figure 3) and the Dunn–Smyth residuals plots indicated a reasonable fit with no obvious pattern. The standardized coefficient estimates given in Figure 4 enable a biological interpretation of species traits and their effects on environmental response of different species. Broadly, the relationships accord with predictions based on the ecology of ants. For example, sculpturing of the cuticle of ants is negatively related to the volume of logs. Cuticle sculpturing or thickness is indicative of dehydration tolerance (Beament, 1961) and a greater volume of logs occurs in habitats with higher insolation (more litter and canopy cover), where ants are more protected from dehydration. Secondly, mandible length, which is indicative of how predatory a species is (Weiser and Kaspari, 2006), was lower in sites with more bare ground and shrubs. Bare ground is usually associated with disturbance and may therefore attract a more opportunistic and generalized set of species that are less likely to be specialist predators. 5. Discussion In this article we proposed a new forward selection criterion based on score equations. As score equations are used to define GEEs, it is natural to also use them to construct an information criterion for GEEs. In addition to providing much faster computational times than existing variable selection criteria, the SIC also incorporates correlation into its quasi-likelihood, as opposed to its penalty. The quasi-likelihood contains the information on whether variables should or should not be included in the model. As QIC and GCp assume independence in their quasilikelihood, they behave as if assuming independence in variable selection. As in Warton (2011), this does not necessarily mean low discriminatory ability, but it means variable ability; good in some scenarios (where parameters are aligned with the major axis of sample variation) and bad in others, whereas SIC finds a middle road. The same idea has been noted in multivariate analysis when comparing statistics that make different correlation assumptions, whether using a distance-based (Mielke and Berry, 2001) or a model-based (Warton, 2008) approach. We considered two criteria, SIC and the one-step approximation SIC(1) . One-step is quicker but doesn’t modify observation weights when a variable is included in the model. For example, if Mω is the intercept model, the mean, and hence,

the variance is the same for all observations irrespective of the mean-variance relationship. This was not a problem when there was little (simulation 1) or no (simulation 2) difference in variance across observations, however in simulation 3 using negative binomial data, it was a problem. We can only recommend the one-step method when the mean-variance relationship is not strong, if used with care. If one were to consider doing backward selection, then score statistics could be replaced with Wald statistics (Rotnitzky and Jewell, 1990) and an information criterion can be constructed in an analogous manner. The advantage of Waldbackward, as for SIC-forward, would be that candidate models would not need to be fitted. Although pruning from big models may take too long, some fast combination of forward and backward could be implemented using score and Wald, along the lines of Nuamah et al. (1996). Another major focus in this article was to investigate and show consistent variable selection and asymptotic efficiency in BIC and AIC-type contexts. The choice of which penalty to use for the SIC may depend on the type of analysis which is being conducted. For example, if we are interested in standard inference procedures and the sample size is large then considering λ = log N may be more favourable, however if prediction is the main goal then considering λ = 2 might be a better choice. In simulation 1, using λ = 2 did not actually provide better prediction performance, but this result is not expected in general. We proposed using SIC for forward selection with a large number of predictor variables—a generic problem encountered to many areas of research other than ecology, such as biomedical studies, economics, social sciences, genomics, etc. (Wang et al., 2012). Another important application is semi-parametric regression, where a large number of spline functions are included for each predictor, for example, MARS (Friedman, 1991). In future work we plan to extend MARS to correlated counts using our fast forward selection algorithm. 6. Supplementary Materials Web Appendix Sections A–C, Web Figures 1–5, Web Tables 1–2, R-code and the case data set are available with this paper at the Biometrics website on Wiley Online Library.

Acknowledgements We would like to thank two anonymous referees and the associate editor for their helpful comments in improving this manuscript. This work was supported by the Australian Research Council Discovery Project (project no. DP0985886, DP130102131), Future Fellow (project no. FT120100501) and the CSIRO Office of the Chief Executive Postdoctoral funding schemes. Appendix Proof of Theorem 1 Proof. The proof of Theorem 1 consists of two steps following along the lines of Wang and Qu (2009). In the first step we consider the class of candidate models that do not contain the true model, and show that asymptotically, models in this set can not minimise SIClog N . In the second step, we consider the class of candidate models which do contain the

Fast Forward Selection for Generalized Estimating Equations

119

true model, and show that these can not minimise SIClog N except when the true model is as parsimonious as possible.

Note that throughout we use U θ (θ, R; y) = E U θ (θ, R; y) +   Op N 1/2 where E(·) is the expectation conditional on Xi for i = 1, . . . , N. Step 1. Consider the model Mω ∈ Mt characterized by a vector of parameters ω. We can construct a model Mθ ∈ Mt such that θ = (ωT , ψT )T and hence Mω ⊂ Mθ . Under the usual regularity conditions (Rotnitzky and Jewell, 1990), for some θ 1 = (ωT , ψ(1) )T where ψ(1) is a suitably chosen element of ψ, we have E U ψ(1) (ω , R; y) = 0 and hence Tψ(1) (ω , R; y)/N → c for c > 0. Therefore,

since |ω| > |ψ|. Thus SIClog N will select the most parsimonious model Mψ , with probability approaching 1 as N → ∞.

Tψ(1) (ω , R; y) SIClog N (ω) SIClog N (θ) log N − > + (|ω| − |θ|) N N N N

θ ∗ be the parameter estimates for model Mθ ∈ Me seLet  lected by the minimum of (A.1). Under certain conditions, Zhang et al. (2010) showed that (A.1) is asymptotically loss efficient for  θ∗ :



→ c−O

log N N



Proof of Theorem 2 Proof.

We will use Theorem 2 of Zhang et al. (2010). Denote

θ by D( θ ; y) = −2 ( the scaled deviance for  θ ; y) − ( θ s ; y)

where  θ s is the full or saturated model. Zhang et al. (2010) considered the generalized information criterion (GIC) for Mθ with λ = 2: GIC2 (θ) = (1/N)D( θ ; y) + (2/N)|θ|.

KL( θ∗ )

> 0.



inf

Note that since the first term on the LHS asymptotically dominates the second term, SIClog N selects the model whose objective function is smallest, that is, SIClog N prefers Mθ with probability approaching 1 as N → ∞. Step 2. Now consider Mω ∈ Mt . We will compare Mω to a model Mψ ∈ PMt , which exists under the conditions of the theorem. First, we will construct Mθ such that θ = ω ∪ ψ, hence Mθ ⊂ Mt . Next, we have SIClog N (ω) SIClog N (ψ) − log N log N =

SIClog N (ω) SIClog N (θ) − log N log N



|ψ| 

Tψ(k) ( θ ω,k−1 , R; y)

k=1

=

log N |ω| 



log N

in probability as N → ∞. The conditions are stated as (C1)–(C3) in the supplementary materials of Zhang et al. (2010). Since the first few moments for y exist (condition (C2)), we consider a quadratic approximation to the deviance at (or in the neighborhood of) the true model. Note that the higher-order terms in this Taylor expansion will only vanish if in the neighborhood of the true model, such that  θ approaches the true θ. Provided this holds, we can show that the score statistic approximates the deviance, and the discrepancy between the two vanishes in the limit. Let Mω ∈ Me and θ = (ωT , ψ)T for (without loss of generality) scalar ψ. Then





θ ; y) − (ω ; y) + 2|θ| + op (1) = −2 ( = D( θ ; y) + 2|θ| − D(ω ; y) + op (1).

log N + |ω| log N

Tω(k) ( θ ψ,k−1 , R; y)

k=1

→1

KL( θ∗ )

, R; y) + 2|θ| SIC2 (θ) = −Tψ (ω

SIClog N (θ) SIClog N (ψ) − log N log N

+



∈Me

− |ψ|

(A.1)

log N , log N

(A.2)

The last term of (A.2) is not a function of  θ , when ignoring this constant and dividing by N, this gives (A.1), and we are now in the setting where Theorem 2 of Zhang et al. (2010) can be applied directly. Hence SIC2 ( θ ) is asymptotically efficient  as N → ∞.

References where θ ω,k = (ωT , ψ(1) , . . . , ψ(k) )T and θ ψ,k = (ψT , ω(1) , . . . , ω(k) )T , and the ψ(k) and ω(k) are entered sequentially to maximize the score statistic at each step, as in (9). Now, as Mω ∈ Mt , then E U ψ(k) ( θ ω,k−1 , R; y)

= 0 and hence

θ ω,k−1 , R; y) = Op (1) (and similarly for Mψ ) which gives: Tψ(k) ( → |ω| − |ψ| + op (1) > 0,

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19, 716–723. Beament, J. W. L. (1961). The water relations of insect cuticle. Biological Reviews 36, 281–320. Burnham, K. P. and Anderson, D. R. (1999). Model Selection and Inference: A Practical Information-Theoretic Approach. New York: Springer-Verlag. Cantoni, E., Flemming, J. M., and Ronchetti, E. (2005). Variable selection for marginal longitudinal generalized linear models. Biometrics 61, 507–514.

120

Biometrics, March 2014

Cho, H. and Qu, A. (2013). Model selection for correlated data with diverging number of parameters. Statistica Sinica 23, 901–927. Dunn, P. and Smyth, G. (1996). Randomized quantile residuals. Journal of Computational and Graphical Statistics 5, 236– 244. Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics 19, 1–67. Gibb, H. and Cunningham, S. A. (2013). Restoration of trophic structure in an assemblage of omnivores, considering a revegetation chronosequence. Journal of Applied Ecology 50, 449– 458. Hardin, J. W. and Hilbe, J. M. (2002). Generalized Estimating Equations. Boca Raton, Florida: Chapman and Hall. Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning. New York: Springer. Hin, L. Y. and Wang, Y. G. (2009). Working-correlation-structure identification in generalized estimating equations. Statistics in Medicine 28, 642–658. Liang, K. Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13–22. Mielke, P. W., Jr. and Berry, K. J. (2001). Permutation Methods: A Distance Function Approach. New York: Springer-Verlag. Nuamah, I. F., Qu, Y. and Amini, S. B. (1996). A SAS macro for stepwise correlated binary regression. Computer Methods and Programs in Biomedicine 49, 199–210. Pan, W. (2001a). On the robust variance estimator in generalised estimating equations. Biometrika 88, 901–906. Pan, W. (2001b). Akaike’s Information Criterion in generalized estimating equations. Biometrics 57, 120–125. Qu, A., Lindsay, B. G. and Li, B. (2000). Improving generalised estimating equations using quadratic inference functions. Biometrika 87, 823–836.

Rao, C. R. (1973). Linear Statistical Inference and its Applications, 2nd Edition. New York: Wiley. Rotnitzky, A. and Jewell, N. P. (1990). Hypothesis testing of regression parameters in semiparametric generalized linear models for cluster correlated data. Biometrika 77, 485–497. Schwarz, G. E. (1978). Estimating the dimension of a model. The Annals of Statistics 6, 461–464. Shibata, R. (1980). Asymptotically efficient selection of the order of the model for estimating parameters of a linear process. The Annals of Statistics 8, 147–164. Wang, L. and Qu, A. (2009). Consistent model selection and datadriven smooth tests for longitudinal data in the estimating equations approach. Journal of the Royal Statistical Society, Series B 71, 177–190. Wang, L., Zhou, J., and Qu, A. (2012). Penalized generalized estimating equations for high-dimensional longitudinal data analysis. Biometrics 68, 353–360. Warton, D. I. (2008). Penalized normal likelihood and ridge regularization of correlation and covariance matrices. Journal of the American Statistical Association 103, 340–349. Warton, D. I. (2011). Regularized sandwich estimators for analysis of high-dimensional data using generalized estimating equations. Biometrics 67, 116–123. Weiser, M. D., and Kaspari, M. (2006). Ecological morphospace of New World ants. Ecological Entomology 31, 131–142. Zhang, Y., Li, R., and Tsai, C. L. (2010). Regularization parameter selections via generalized information criterion. Journal of the American Statistical Association 105, 312–323.

Received February 2013. Revised September 2013. Accepted September 2013.

Fast forward selection for generalized estimating equations with a large number of predictor variables.

We propose a new variable selection criterion designed for use with forward selection algorithms; the score information criterion (SIC). The proposed ...
326KB Sizes 0 Downloads 0 Views