Biometrics 70, 920–931 December 2014

DOI: 10.1111/biom.12212

Robust Inference in the Negative Binomial Regression Model with an Application to Falls Data William H. Aeberhard,1,2, * Eva Cantoni,1 and Stephane Heritier2,3 1

Research Center for Statistics and Geneva School of Economics and Management, University of Geneva, Geneva, Switzerland 2 Sydney School of Public Health, University of Sydney, Sydney, Australia 3 Department of Statistics, Macquarie University, Sydney, Australia ∗ email: [email protected]

Summary. A popular way to model overdispersed count data, such as the number of falls reported during intervention studies, is by means of the negative binomial (NB) distribution. Classical estimating methods are well-known to be sensitive to model misspecifications, taking the form of patients falling much more than expected in such intervention studies where the NB regression model is used. We extend in this article two approaches for building robust M-estimators of the regression parameters in the class of generalized linear models to the NB distribution. The first approach achieves robustness in the response by applying a bounded function on the Pearson residuals arising in the maximum likelihood estimating equations, while the second approach achieves robustness by bounding the unscaled deviance components. For both approaches, we explore different choices for the bounding functions. Through a unified notation, we show how close these approaches may actually be as long as the bounding functions are chosen and tuned appropriately, and provide the asymptotic distributions of the resulting estimators. Moreover, we introduce a robust weighted maximum likelihood estimator for the overdispersion parameter, specific to the NB distribution. Simulations under various settings show that redescending bounding functions yield estimates with smaller biases under contamination while keeping high efficiency at the assumed model, and this for both approaches. We present an application to a recent randomized controlled trial measuring the effectiveness of an exercise program at reducing the number of falls among people suffering from Parkinsons disease to illustrate the diagnostic use of such robust procedures and their need for reliable inference. Key words: Bounded influence function; Negative binomial regression; Overdispersed count data; Redescending estimators; Weighted maximum likelihood.

1. Introduction Falls occur frequently among certain categories of patients such as elders who have recently been hospitalized or people suffering from Parkinson’s disease. As these falls may result in serious injuries and yield devastating consequences, researchers in musculoskeletal diseases regularly conduct intervention studies aiming at reducing their occurrence. In such studies, the main outcome is the number of falls reported during a specified period of time, and thus consists of a count, while covariates would typically include an intervention effect, mobility-related information such as gait speed, and a history of the individual’s falls. Among these studies, the recent randomized controlled trial called PD-FIT (Canning et al., 2009), featuring participants with Parkinson’s disease, represents the motivation of our methodological work. Counts such as falls often display overdispersion, a phenomenon observed when the variance of the data exceeds its mean, thus invalidating the traditional Poisson distributional assumption. To overcome this, a popular choice is to use the negative binomial (NB) distribution (see Anscombe, 1949) and fit a NB generalized linear model (GLM; Mccullagh and Nelder, 1989) to the data, given a fixed value for the overdispersion parameter. Maximum likelihood estimators (MLEs) have been thoroughly studied for the NB model (see Lawless, 1987; Piegorsch, 1990). However, these estimators cannot deal

920

with so-called “multiple fallers,” occasionally appearing in falls data. Such multiple fallers are not individuals that simply fall much more than others, they are individuals that fall much more than expected, given a model; hence they can be considered as outliers in the response with respect to the assumed model. These outliers can not only bias the regression parameters estimates and misguide inference but also lead to overestimating the overdispersion parameter, in turn overestimating the standard errors and thus inflating the computed required sample size when running power analyses. In the falls literature, ad hoc procedures that have been used so far consist in truncating the counts at a predefined value, and carrying on with the MLEs. Such winsorizing schemes may indeed limit somehow the impact of the multiple fallers but strongly depend on the cut-off value, which is arbitrary. Furthermore, the fact that this cut-off value does not depend on covariates and is thus constant across individuals is problematic as people vary a lot in terms of falls and may therefore not be comparable on a unique scale. For these reasons, these ad hoc procedures are not admissible and estimators with bounded influence function (Hampel et al., 1986) should be used instead. Cadigan and Chen (2001) proposed robust M-estimators for the NB distribution while Amiguet (2011) introduced three-steps estimators that combine high breakdown point (BDP) with reasonable efficiency, © 2014, The International Biometric Society

Robust Inference in the NB Model but these suggestions only apply to independent and identically distributed data, while we require linking a count response with covariates. In the GLM class of models, Cantoni and Ronchetti (2001, 2006) have proposed an approach for building robust estimators and inference procedures, but the NB distribution, with its specific nuisance parameter accounting for overdispersion, has not been treated by the authors. In parallel to their work, Bianco and Yohai (1996) have taken another approach for constructing robust estimators for logistic regression, this approach being recently extended by Bianco, Boente and Rodrigues (2013a, 2013b) to Poisson and gamma responses. Our goal here is to extend these two approaches to the NB model and compare them in this framework while at the same time to suggest a valid robust estimator for the overdispersion parameter. This article is organized as follows: Section 2 summarizes the framework and notation and gives the expressions of the classical MLEs of the NB model; Section 3 presents the robust estimators of the regression coefficients according to the two approaches, along with a unified notation that facilitates their comparison, and the robust M-estimator we introduce for the overdispersion parameter; Section 4 reports the comparisons of the estimators through simulations; Section 5 shows the application of our methods to the PD-FIT data; finally, Section 6 concludes this article and discusses the limits of this work as well as possible extensions.

921

where β (yi , xi , β, σ) = (yi − μi )V −1 (μi )

∂μi xi , ∂ηi

and



1 σ (yi , xi , β, σ) = − 2 σ





 yi + 1/σ − (1/σ)



σ(yi − μi ) , − log(σμi + 1) − σμi + 1 where (u) = ∂ log (u)/∂u denotes the digamma function. Solving (1) requires a starting value for β, usually computed through a Poisson GLM and plugged into the estimating equation for σ, as suggested by Breslow (1984), and then consists of an iterative scheme until numerical convergence is achieved. The method of moment estimator (MME) for σ is a popular alternative to the MLE (see Breslow, 1984; Lawless, 1987). It is defined as the value for σ that solves n

(yi − μi )2 V −1 (μi ) = df,

i=1

2. Framework and Notation We consider a GLM setting featuring n independent response variables Y1 , . . . , Yn , where each Yi , for i = 1, . . . , n, is linked with a vector of covariates xi ∈ Rp through its conditional mean: g(E[Yi |xi ]) = g(μi ) = ηi = xTi β, where g(·) is a link function and β = (β0 , . . . , βp−1 )T is the vector of regression parameters. The support of Yi |xi consists of the non-negative integers and we choose, without loss of generality, to use a logarithmic link function g(·) = log(·). We assume a NB distribution with mean parameter μi > 0 and overdispersion parameter σ > 0, denoted as NB(μi , σ), yielding the following probability mass function: (yi + 1/σ) f (yi ; μi , σ) = (σμi + 1)−1/σ (1/σ)(yi + 1)



σμi σμi + 1

yi .

Under this parameterization, which is based on Lawless (1987) and denoted as “NB2” by Hilbe (2011), the conditional variance is Var[Yi |xi ] = V (μi ) = μi + σμ2i , so that the Poisson distribution with mean μi is the limiting case when σ → 0. We choose to use this parameterization over the other common one involving κ = σ −1 because of the greater numerical stability of the estimation of its overdispersion parameter, as noted by Clark and Perry (1989) and Piegorsch (1990). We can derive the MLE for (β, σ)T by considering the joint estimation of both parameters through

 n i=1

β (yi , xi , β, σ)

i=1

σ (yi , xi , β, σ)

n

 = 0,

(1)

where df is the number of degrees of freedom remaining after having estimated the parameters required for replacing μi by ˆ i . On Web Figure 1, we display the boxplots of estimations μ of σ according to the maximum likelihood and the momentbased estimators. The response values are generated at the NB(μi , σ) distribution according to the simulation setting described in Section 4, with n = 250 and σ = 0.7. For the sake of simplicity, we use here the true value of β so that the variability is due only to the estimation of σ, and therefore df = n. By comparing the two boxplots, we immediately see that the MME has a much larger variability than the MLE, with efficiencies usually less than 40%. In view of the relatively large sample size used for this simulation, this is an issue. So, in spite of its ease of computation, we would not recommend the use of the MME for σ, and rather use the MLE. 3. Robust M-Estimators for β and σ In the next two subsections, we explore two different approaches for building robust M-estimators for β, given a value for σ. Unless otherwise explicitly stated, all the expectations are computed at the NB(μi , σ) distribution, conditionally on xi for i = 1, . . . , n. 3.1. Bounding the Pearson Residuals (PR Approach) We extend the approach of Cantoni and Ronchetti (2001, 2006) to the NB distribution, yielding a robust estimator for β ˆ . Roughly speaking, by this approach we which we denote β PR achieve robustness on one hand in the response by bounding the Pearson residual ri = (yi − μi )V −1/2 (μi ) (hence the name of the approach) that appears naturally in the score function β (yi , xi , β, σ), and on the other hand on the design by introducing weights w(xi ). The resulting Mallows quasi-likelihood

922

Biometrics, December 2014

estimator is defined as the solution to n i=1

UPR,i (β, σ) =

n

ψ(ri )V −1/2 (μi )

i=1

∂μi w(xi )xi − ai (β) = 0, ∂ηi (2)

where ψ(·) is a continuous and bounded function which depends on one or a few tuning constants, w(xi ) is a weight limiting   the impact i of possible leverage points, and ai (β) = E ψ(ri ) V −1/2 (μi ) ∂μ w(xi )xi is a correction term ensuring ∂ηi Fisher consistency at the model. We wish to discuss the choice of the ψ-function bounding the Pearson residual, where its boundedness condition ensures a bounded influence function for the resulting estimator (Hampel et al., 1986). A common choice is Huber’s ψfunction defined as ψ[Huber] (r; c) = max(−c, min(c, r)), with tuning constant c chosen to attain a specified trade-off between efficiency at the model and robustness under contamination. The Huber function for building robust estimators in the NB model has also been used in the framework of Mquantile regression by Chambers, Dreassi, and Salvati (2013). Nonetheless, other ψ-functions are available, including socalled redescending ones. In this article, we additionally consider Tukey’s biweight function

 ψ[Tukey] (r; c) =

2

(r/c)2 − 1 r

0

|r| ≤ c |r| > c

ψ[Hampel] (r; a, b, c) =

c − |r| ⎪ a sign(r) ⎪ ⎪ c−b ⎪ ⎩ 0

Bounding the Unscaled Deviance Components (UD Approach) Bianco and Yohai (1996) proposed an alternative way of constructing consistent robust estimators for logistic regression. By viewing the MLE as the value for β that minimizes the deviance, they achieve robustness in the response by applying a bounded function on the unscaled deviance component (up to a factor 2) 3.2.

di = log f (yi ; yi , σ) − log f (yi ; μi , σ) that appears in the estimating equation, hence the name of this approach. More recently, Bianco et al. (2013a, 2013b) extended this idea to the Poisson and gamma distributions in the case of missing values in the response. We extend further this approach to the NB distribution and define the correˆ , as the solution to the firstsponding estimator, denoted β UD order conditions n i=1

UUD,i (β, σ) =

n i=1

˜ i ) (yi − μi ) ∂μi w(xi )xi − ˜ ψ(d ai (β) = 0, V (μi ) ∂ηi (3)





a < |r| ≤ b

˜ i )(yi − μi ) V −1 (μi ) ∂μi w(xi )xi ensures where ˜ ai (β) = E ψ(d ∂ηi Fisher consistency, w(xi ) is a weight on the design space, ˜ and where ψ(·) (corresponding to the first derivative of the φt function, in the notation of Bianco et al., 2013a) plays ˆ . substantially the same role as ψ(·) in the definition of β PR The unscaled deviance component in the NB model is

b < |r| ≤ c

di = log(σμi + 1)/σ + yi log(yi /μi ) + log(σμi + 1)

and Hampel’s three part function

⎧ r ⎪ ⎪ ⎪ ⎪ ⎨ a sign(r)

using the MLE as starting value (see the simulation study in Section 4); if in doubt, we would then advocate starting with ˆ β PR with Huber’s function.

|r| ≤ a

|r| > c.

We have observed in practice that using Hampel’s function yield estimates very close to using Tukey’s biweight, the latter having the advantage of being tuned by a single constant. We will then focus more on the biweight function, although we implemented both and included both in the real data example in Section 5. We expect that using redescending ψfunctions may yield smaller biases under contamination than with ψ[Huber] , for a given efficiency at the model, and this being due to the asymmetric nature of the underlying NB distribution and of the contamination, appearing as high positive values; our simulations in Section 4 confirm this idea. Analogously to the MLE, the Fisher scoring method may be used for solving (2) and can be rewritten as an iteratively re-weighted least squares (IRWLS), leading to rather fast and stable calculations. Even though the algorithm requires the computation of several expectations for the NB distribution, such terms, including the one appearing in ai (β), only involve finite sums and are directly computable, see Web Appendix B. A unique solution to equation (2) is not guaranteed when using redescending ψ-functions, a reasonable starting value is therefore necessary. We did not experience any problems when





− log(σyi + 1)(yi + 1/σ),

(4)

when yi > 0 and simplifies to log(σμi + 1)/σ when yi = 0. ˜ Bianco and Yohai (1996) introduced a ψ-function defined as

 ˜ [BY] (di ; t) = ψ

1 − di /t

if di ≤ t

0

if di > t

for a tuning constant t playing an equivalent role as the tuning constant c found in Huber’s and Tukey’s biweight func˜ [BY] reaches 0 from the finite cut-off value t, the tions. Since ψ resulting estimator is redescending. Furthermore, Croux and ˜ Haesbroeck (2003) have introduced another ψ-function in the context of logistic regression which guarantees the existence of a solution to equation (3), whenever the MLE exists:

 ˜ [CH] (di ; t) = ψ

√ exp(− t) √ exp(− di )

if di ≤ t if di > t.

This function somehow mimics the behavior of the Huber weight ψ[Huber] (r)/r, yielding a non-redesecending estimator,

Robust Inference in the NB Model although it can only give a weight of 1 when the tuning constant t is 0, and this even when the observed deviance is exactly zero. Moreover, for values of t smaller than the smallest deviance observed in the sample, the function does not depend on t anymore. The consequence is that the resulting efficiency of the estimator is bounded and that this lower bound can potentially be close to 1, depending on the design. Because of these two features, we apply two rescalings in our implementation of the corresponding estimator: first, √ we rescale the original function by the constant term exp( t) so that we can ˜ [BY] ; second, we multiply interpret it on the same scale as ψ the deviance component di by 2 in order to increase the range of feasible efficiencies. These two rescalings do not affect the properties of the estimator. Bianco et al. (2013a) mention that equation (3) can be solved via a Newton–Raphson algorithm when there is a unique solution to the minimization problem. Using the Fisher scoring method, it is possible to rewrite the estimation scheme as an IRWLS for the NB model. The algorithm involves the computation of expectations whose expressions are given in Web Appendix C. Multiple solutions may appear when using ˜ [BY] , so we advise using Croux and Haesbroeck’s ψ-function ˜ ψ for a robust starting value. We did not however encounter any issues when starting with the MLE. 3.3. Unified Notation for Robust M-Estimators of β The two approaches defined in the previous subsections achieve robustness in the response by bounding different components that appear in the maximum likelihood estimating equations. They are therefore different in construction, but with a careful notation we can actually show how close they are in effect. Indeed, equations (2) and (3) can be written as a weighted version of the maximum likelihood estimating equations: n

w (yi , μi )β (yi , xi , β, σ)w(xi ) − ai (β, w , w) = 0,



i=1

where the new ingredients with respect to equation (1) are:

r r r

w (yi , μi ) ∈ [0, 1], a weight on the response, expressed as ˜ i ) for the UD apψ(ri )/ri for the PR approach and ψ(d proach; w(xi ), a weight on the design; and  a Fisher consistency correction term ai (β, w , w) = E w (yi , μi )β (yi , xi , β, σ)w(xi )].

The purpose of the w weight is to decrease toward 0 as yi moves away from μi , or, in other words, as the observed response behaves differently than what we would expect given our model, whether this may be represented as a large Pearson residual or a large deviance component. If this weight can reach 0 for finite yi and μi , then the corresponding estimator for β is redescending. Figure 1 displays the w functions under study against the response y, where both the expected value μ and the overdisperion parameter σ are kept fixed and the former ap˜ [CH] pears as a vertical straight line. The rescaled version of ψ yields weights w equal to unity for deviances close to zero,

923

which makes the comparison with other weight functions easier. The parallel, on one hand between the Huber weight (top left panel) and the rescaled Croux and Haesbroeck’s function (bottom left panel) and on the other hand between the Tukey’s biweight (top right panel) and the Bianco and Yohai’s function (bottom right panel), is notable. So, even though the ˜ functions are not applied to the same arguments, we ψ and ψ see that they can lead to weights looking quite alike, once they are represented against a common argument y, and thus may yield equivalent estimates as long as their tuning constants are chosen appropriately. The similarity of the weights as shown on Figure 1 is valid only for the NB distribution, through its definition of the deviance component as given in (4); we have however observed rather analogous patterns with the Poisson and the Gaussian distributions. 3.4. Robust M-Estimator for σ We introduce here a robust M-estimator for σ, given values for β, in the form of a weighted maximum likelihood estimator (WMLE). Its estimating equation writes n

Uσ,i (β, σ) =

i=1

n ψ(ri ) i=1

ri

σ (yi , xi , β, σ)w(xi ) − bi (σ) = 0, (5)

where ψ(ri )/ri plays the role of a weight, analogously to the ˆ , and bi (σ) = E ψ(ri ) σ (yi , xi , β, σ)w(xi ) is definition of β PR ri another Fisher consistency term. The ψ-function and weights on the design w(xi ) here need not be the same as the ones used for estimating β; we have implemented and studied the same Huber’s, Tukey’s biweight, and Hampel’s three part functions. For the M-quantile model, Chambers et al. (2013) have suggested using a robust version of the moment-based estimator for σ. It is a legitimate alternative to our WMLE, but the low efficiency of the classical moment estimator at the model, which we mentioned in Section 2, makes us prefer building upon the MLE. 3.5. Inference about β and σ Let the index k ∈ {PR, UD} distinguish between the two approaches for building a robust estimator for β and σˆ denote the WMLE we presented in the previous subsecˆ , σˆ )T can be viewed as an Mtion. The joint estimator (β k estimator which features an asymptotic normal distribution under mild conditions, such as the continuity and bounded

T ness of Uk,i (β, σ)T , Uσ,i (β, σ) (see Huber, 1967), which are satisfied in our framework. Expansions involving the influence function of the estimators lead us to √

ˆ n

βk − β σˆ − σ

 d





−T −→ N 0, M −1 , k Qk M k

(6)

n→∞

where the (p + 1) × (p + 1) matrices M k and Qk are detailed in Web Appendix A. We refer here to Huber and Ronchetti

924

Biometrics, December 2014

1.0 0.6

0.8

c=4 c=5 c=6

0.0

0.0

0.2

0.4

w * (y, m)

0.6

0.8

c=1 c = 1.5 c=2

0.2

w * (y, m)

PR approach, Tukey

0.4

1.0

PR approach, Huber

0

m 10

20

30

40

50

60

0

10

20

y

40

50

60

y

1.0

UD approach, Bianco and Yohai

0.6

0.8

t =3 t =4 t =5

0.2 0.0

0.0

0.2

0.4

w * (y, m)

0.6

0.8

t =1 t = 1.5 t =2

0.4

1.0

UD approach, Croux and Haesbroeck

w * (y, m)

30

0

m 10

20

30

40

50

60

y

0

m 10

20

30

40

50

60

y

Figure 1. Weight functions w (y, μ) with different tuning constant values, μ kept fixed at 6 and σ = 0.7: ψ(r)/r with ˜ Huber’s ψ-function (top left) and with Tukey’s biweight function (top right), ψ(d) with the rescaled Croux and Haesbroeck’s ˜ ψ-function (bottom left) and with Bianco and Yohai’s function (bottom right).

(2009) for general asymptotic results about M-estimators and to Bianco et al. (2013b) for a proof of the asymptotic normalˆ ity of β UD given any consistent estimator for σ. In practice, the −T analytical expressions in the “sandwich” formula M −1 k Qk M k are evaluated by plugging in estimates of β and σ and then can be used for approximating standard errors and building approximate confidence intervals. ˆ and σˆ are not orthogonal, not Result (6) implies that β k even asymptotically. However, our experience shows that ne−T glecting this and using M −1 k,11 Qk,11 M k,11 for the asymptotic ˆ variance of βk gives rise to little numerical difference. In the

−T data analysis in Section 5, we used the full version M −1 k Qk M k nonetheless.

4. Simulation Study We present here the results of a simulation study in which we compare the biases of the robust estimators we presented under different contamination schemes, while all being tuned to attain a given efficiency with respect to the MLE at the postulated model. All the implementations and simulations are done using R, version 3.0.1 (R Core Team, 2013). R code is available as a web supplement at the Biometrics website on

Robust Inference in the NB Model Wiley Online Library as well as on the authors’ personal web pages. 4.1. Design and Choice of Estimators We simulated response vectors of size n = 250 according to a NB distribution with mean parameter μi = exp(0.5 + 0.8x1i − 0.4x2i ), where x1i is a realization of a standard normal variable and x2i is a dummy variable, with half of the sample of zeros, and with overdispersion parameter σ = 0.7. Thus β = (β0 , β1 , β2 )T = (0.5, 0.8, −0.4)T . With such parameter values, most of the generated means are below 5 while the largest value is around 18. We believe that this design is realistic with respect to falls data, where the dummy would represent a treatment effect. Although we only include two covariates in the simulation setting, the proposed methods are general and the Fisher scoring algorithm we use does not suffer from higher dimensions. Other values for n and σ were investigated, see Section 4.2. Since the choice of the ψ-function used for estimating σ can ˜ used for β, many combinations be different from the ψ or ψ are possible; we chose to focus on a few:

r r r r

ˆ H/H: Huber’s function used for both β PR and the WMLE for σ; ˆ T/T: Tukey’s biweight function used for both β PR and the WMLE for σ; ˆ CH/H: β UD with the rescaled Croux and Haesbroeck’s function and ψ[Huber] for σ; ˆ BY/T: β UD with Bianco and Yohai’s function and ψ[Tukey] used for σ.

As for any GLM, the efficiency of such M-estimators has to be specified prior to the analysis. Its evaluation, given a tuning constant value, depends on the design and as such no universal tuning constant value exists. We evaluated the efficiency with the ratio of the traces of the variance–covariance matrices of the MLE and the robust estimators, at the true parameters. Some trial and error is often required in order to find the tuning constants values yielding the desired efficiency, up to a certain tolerance. We choose here to aim for efficiencies around 70% for both β and σ in an effort to mimic what we would do with real data (see Section 5). The resulting tuning constants are c = 0.4 for Huber’s function, c = 4 for Tukey’s biweight, t = 3 for the Bianco and Yohai function and t = 0.4 for the rescaled Croux and Haesbroeck function. This level of efficiency implies larger variances, but more robustness and we would advocate such precaution in real-life situations where the degree of contamination is unknown. Anyhow, the following results and remarks remain unchanged when allowing for higher efficiencies, as other simulations not presented here have confirmed. These tuning values are obtained with all weights on the design w(xi ) set to one; see Section 4.3 for a comment on the loss of efficiency such weights can cause. In addition to the robust estimators, we also computed the MLE for both β and σ as a reference. 4.2. Estimations under Contamination in the Response We first consider only contaminating the response, so as to emulate the behavior of multiple fallers. We contaminated the response vector by selecting a proportion of the sample

925

at random and then by adding to the corresponding original counts a predefined positive value add. The proportion , usually small, defines a neighborhood of distributions around the postulated model while the add argument controls for the magnitude of the contamination. Figure 2 shows how the five considered estimators evolve with the integer add moving from 0 to 20, with = 5% and all the weights on the design w(xi ) set to unity; for each value of add we generated 500 replications. For each estimator, we linearly interpolated the first and third quartile across the add values, thus creating a colored envelope around the line that consists of the interpolated medians of estimations. For each panel, the horizontal dashed line is the true parameter value. First, note that for add = 0, that is, at the model, all estimators are unbiased and that the MLE gets dragged away from the true parameter value as add increases, as expected. Next, all robust estimators tend to follow the MLE until add reaches 4; we believe that these low-magnitude contaminations are so subtle that the robust estimating procedures cannot distinguish between outlying and somehow slightly deviating yet genuine observations. Starting from add = 5, two patterns emerge among the robust estimators, in particular for σ and to some extent for β0 : the non-redescending ones (H/H and CH/H) tend to stabilize while the bias of the redescending ones (T/T and BY/T) decreases as add increases. This is a direct consequence of the redescending nature of T/T and BY/T: contaminated observations that are moved too far away from their original position get a zero weight and the estimator progressively “redescends” to the right value as the magnitude of contamination increases. In other words, the more obvious the outliers are, the best the redescending estimates will be. As a matter of fact, this also applies to nonredescending estimators but at a much slower rate, since a zero weight can only occur when the observed response tends to infinity. For β1 and β2 , all four robust estimators stay reasonably close to the true parameter values. In the overall, the robust estimators may exhibit some non-negligible bias, although always smaller than the bias of the MLE, but this goes along with the M-estimation theory: a bounded influence function only implies a bounded maximum bias under an arbitrary contamination, but not a maximum bias equal to zero. In conclusion, based on these observations and remarks, using ˜ [BY] over ψ ˜ [CH] seems preferable ψ[Tukey] over ψ[Huber] and ψ from the point of view of minimizing the bias due to contamination. We also ran simulations with other values for n and σ. Similar results were obtained when simulating data with high overdispersion (σ = 1.5) or low overdispersion (σ = 0.2, yielding outcomes close to a Poisson distribution). With sample size n = 500 the behaviors of the considered estimators are comparable with those under n = 250; however, with n = 100 the underlying variability is too large for observing the different patterns described above: all the robust estimators feature negligible bias under contamination whereas the MLE is slightly biased. Estimations under Contamination in the Response and in the Design We now consider contaminating both the response and the design. Here, we created a = 5% overall contamination by 4.3.

926

Biometrics, December 2014

Figure 2. Estimations under contamination in the response with varying magnitude for β0 (top left), β1 (top right), β2 (bottom left), and σ (bottom right). The simulation setting is: n = 250, σ = 0.7, = 0.05 and all weights w(xi ) equal to one.

randomly selecting 2.5% of the response and applying the same scheme with the add argument as in the previous subsection, and 2.5% of the x1i values and fixing them to −5, thus producing rather high leverage points. The weights in the design we use are hard-rejection (as in Bianco et al.,  weights  2 −1 2013a), defined as w(xi ) = 1 Di2 ≤ χp−1 (0.95) , where 1 is the indicator function, Di is the Mahalanobis distance based on robust estimates of the center and scatter matrix of the design matrix (excluding the column for the intercept) and 2 −1 χp−1 (0.95) stands for the 95% quantile of a χ2 distribution with p − 1 degrees of freedom. For robustly estimating the center and scatter matrix, we use the minimum volume ellipsoid (MVE; Rousseeuw, 1985) estimator, set to enclose a subset of 80% of the sample. Other choices of weights on the design are possible, see Heritier et al. (2009).

Web Figure 2 shows colored envelopes, in the same fashion as in Figure 2, where 500 replications were generated for each value of add. Now the MLE is biased even for add = 0, this being due to the leverage points we introduced. The robust estimators do not feature such a bias because the hardrejection weights effectively detect the outliers in the design and assign them a zero weight right from the beginning. With the magnitude add increasing, we observe the same pattern as in Figure 2: the bias of the MLE only gets larger, the nonredescending robust estimators stabilize while the redescending ones get closer to the true parameter values, and this particularly noticeable for σ and the intercept. The only major difference with the previous results, which does not appear in the figures, is the massive loss of efficiency the weights in the design involve. The efficiencies now fall around 60% and we

Robust Inference in the NB Model

927

Figure 3. Estimations under contamination in the response with varying contamination proportion for β0 (top left), β1 (top right), β2 (bottom left), and σ (bottom right). The simulation setting is: n = 250, σ = 0.7, add = 20 and all weights w(xi ) equal to one. have observed they do not exceed 80% even when setting all weights in the response w to unity. It is a high price to pay, thus we would recommend using weights on the design only when they seem really necessary, possibly after an exploratory analysis involving robust Mahalanobis distances. 4.4. Breakdown Point Study In this final subsection, we illustrate empirically the breakdown point of the estimators under study. We presume theoretical results could only be proven by explicitly specifying the contamination in very specific cases, this being out of the scope of this article. An upper bound exists in linear regression (Hampel et al., 1986), taking the form 1/(p + 1) in our notation; our non-linear framework here makes us believe we cannot hope for a better upper bound.

We contaminate only the response but now fix the magnitude to add = 20, while the contamination proportion varies. Figure 3 displays colored envelopes in which moves from 0 to 40% by increments of 5%, with 500 replications each time. The main result is that none of the considered estimators benefit from what is known as a “high” BDP, namely a value at or close to 50%. This is expected from such M-estimators. However, the envelopes hint that the redescending estimators (T/T and BY/T) may have a higher BDP than the nonredescending ones (H/H and CH/H).

5. Falls Data Analysis: The PD-FIT Study We illustrate here the use of the proposed robust estimators through the analysis of falls data coming from a recent

928

Biometrics, December 2014

Table 1 Estimations of the NB model on the PD-FIT data (standard errors are between parentheses): MLE stands for the MLEs for ˆ ˆ both β and σ; T/T stands for β PR and the WMLE for σ, both with Tukey’s biweight function; BY/T stands for βUD with ˆ ˜ Bianco and Yohai’s ψ-function and the WMLE for σ with Tukey’s biweight; HA/HA stands for β PR and the WMLE for σ, both with Hampel’s three part function; MLEc stands for the MLEs computed on the “cleaned” data.

Intercept log(pastf+1) severity intervention σ

MLE

T/T

BY/T

HA/HA

MLEc

−5.891 (0.261) 0.942 (0.067) 0.021 (0.009) −0.215 (0.162) 1.034

−5.600 (0.276) 0.790 (0.069) 0.013 (0.009) −0.297 (0.173) 0.715

−5.620 (0.264) 0.776 (0.067) 0.013 (0.009) −0.308 (0.166) 0.694

−5.618 (0.255) 0.783 (0.064) 0.013 (0.009) −0.295 (0.160) 0.675

−5.668 (0.230) 0.775 (0.056) 0.013 (0.008) −0.311 (0.145) 0.568

randomized controlled trial (Canning et al., 2014). This study was conducted between 2008 and 2012 in the metropolitan area of Sydney, Australia, and aimed at evaluating the effectiveness of an exercise intervention at reducing the number of falls among elder people suffering from Parkinson’s disease, from that follows the name of the study: Parkinson’s Disease Frailty Intervention Trial (PD-FIT). The 231 participants were randomly allocated either to a treatment group, undertaking exercise programs targeting balance and lower limbs muscle strength, or to a usual care control group. The response variable is the reported number of falls during the intervention period of 6 months. In addition to the intervention dummy variable, where the control group is the reference, two other covariates are available: pastf, the reported number of falls during the past 12 months; and severity, a score indicating the severity of the disease measured at baseline with the Unified Parkinson’s Disease Rating Scale (Fahn, Elton, and UPDRS Program Members, 1987). We fit a NB GLM with a logarithmic link to these data with all covariates and including the number of days of follow-up as an offset. We introduce the past falls in our model through a natural logarithm because this covariate shares a relationship very close to linearity with the response, so they need to be considered on the same scale. We compare four sets of estimations: MLE, T/T, BY/T, and HA/HA. We include the T/T and BY/T sets because of their good performance in the simulation study and introduce the HA/HA set conˆ sisting of β PR and the WMLE for σ, both using Hampel’s three part function, for illustrating how close it can be to the T/T set of estimates. We use the same tuning constant value as in the simulation study: c = 4 for Tukey’s biweight ˜ and t = 3 for Bianco and Yohai’s ψ-function and use (2, 3, 4) for Hampel’s three part function. With these data, these values provide high robustness, with an acceptable increase of variance we tolerate in order to be on the safe side. The estimated efficiencies are computed at the postulated model, by fixing the design and using the estimates as the true parameter values. As such, our estimated efficiencies, between 70% and 80%, have to be considered as approximations. We kept all the weights on the design equal to one, because our exploratory analysis ruled out the presence of leverage points. Table 1 reports the estimations, where standard errors are given between parentheses and all numbers are rounded to the third decimal. The standard errors for the robust estimates for

β are computed using the empirical “sandwich” variance expression given in Section 3.5. The last column of Table 1 gives the estimates according to the MLEs on the “cleaned” data, where the observations receiving very low weight from the three robust estimators were excluded (14 distinct observations, as seen on the plots on the first row of Figure 4). We do not report these estimates for inference purposes, but only to confirm the results found with the robust estimators. At first glance, we see that the three robust (and redescending) estimators yield rather similar estimations. We also notice that estimates are of the same order of magnitude between the classical and the three robust methods: the past falls apparently explain a large part of the variability of the prospective falls while there is not a significant difference between the intervention group and the control group. That said, a closer look reveals that the severity of the Parkinson’s disease appears as a good predictor of the falls according to the MLE whereas the robust estimates do not support this idea: the z-statistic for MLE equals 2.454 and thus exceeds the 97.5% standard normal quantile while the z-statistics for T/T, BY/T, and HA/HA are clearly below, with values 1.363, 1.472, and 1.494, respectively. We also note that the intervention dummy variable is close to be considered significant at a 5% level according to the robust estimates, with respective z-statistics equal to −1.714, −1.852, and −1.837. Moreover, we can suspect that the MLE overestimates σ when comparing its value to the robust counterparts, by analogy with the results seen in the simulations. In conclusion, the observed decrease in falls due to the intervention exercise versus the usual care program cannot be considered different from zero, though we cannot exclude the possibility of a lack of test power when considering robust estimators. The subtle differences between likelihood-based and robust estimates can be explained when looking at Figure 4, displaying the robustness weights w computed on the data on the first row of panels and representing the corresponding weight functions for varying y with μ kept fixed at 4 and σ fixed at 0.7 on the second row. The resemblance between T/T and BY/T is rather clear: most observations receive weights above 0.8, a few receives average weights, and 11 receive weights equal to zero. The HA/HA weights however look like hard-rejection weights: a great majority (more than 97%) of the observations receives a weight exactly equal to unity, while only a few are ˜ BY function tends between zero and one. In addition, the ψ to downweight observations more easily than the Tukey’s

Robust Inference in the NB Model BY/T

150

1.0 0.8 50

150

0

1.0

1.0

0.8 0.2 0.0

0.0 40

0.6

w * (y,m)

0.6 0.2

0.4

w * (y,m) 30

150

i

0.8

0.8 0.6 0.4 0.2 0.0

20

50

i

1.0

i

0 µ 10

0.6 0.2 0.0

0

0.4

50

0.4

w * (y i ,m i)

0.6 0.0

0.2

0.4

w * (y i ,m i)

0.8

1.0 0.8 0.6 0.4 0.0

0.2

w * (y i ,m i)

0

w * (y,m)

HA/HA

1.0

T/T

929

0 µ 10

y

20

y

30

40

0 µ 10

20

30

40

y

Figure 4. Weights w (yi , μi ) computed on the PD-FIT data: T/T estimations (top left panel), BY/T estimations (top middle panel), and HA/HA estimations (top right panel); corresponding weight functions with μ kept fixed at 4 and σ at 0.7: PR approach with Tukey’s biweight tuned with c = 4 (bottom left panel), UD approach with Bianco and Yohai’s function tuned with t = 3 (bottom middle panel), PR approach with Hampel’s three part function with tuning constants 2, 3, and 4 (bottom right panel). biweight function; this can be explained by the rather steep slope on the left-hand side of the weight function, indicating that so-called inliers can be downweighted as well. Across all three sets of estimations, an important feature is that the 11 observations receiving an exact zero weight are the same. By inspecting the data, we note that the common pattern among these patients is that their past falls are quite low (generally below 10) while their measured falls during the study are rather large (around 20). We can therefore argue that,

because the past falls explain very well the prospective falls for the vast majority of the data, a person whose past falls do not match the measured falls would receive a low robustness weight and appear as atypical. This situation illustrates well the use of robustness weights as diagnostic/exploration tools. In the original analysis of these data, Canning et al. (2014) used a Poisson-Inverse Gaussian specification for the response. They actually needed a more flexible distribution, allowing for even more overdispersion than the NB, because their pri-

930

Biometrics, December 2014

mary analysis was unadjusted, that is, they only considered the intervention dummy variable as covariate. With only an intercept and a dummy variable, there is not enough information for the NB model to account for the heavy right tail observed in the PD-FIT data, but with the covariates we considered the NB fit is totally satisfying. In fact, we suspect that the inclusion of the past falls as an explanatory variable is what really helps capturing the extra variability. 6. Discussion We have extended two approaches for building robust Mestimators in the class of GLM to the NB distribution and introduced a robust WMLE for the overdispersion parameter of this model. Using a unified notation, the two approaches can be viewed as weighted versions of the classical MLE and lead to rather similar estimates as long as the corresponding functions, bounding either the Pearson residual or the unscaled deviance component, are chosen and tuned accordingly. Our simulations show that redescending estimators feature smaller biases under contamination, and possibly benefit from a higher empirical breakdown point, than non-redescending ones, while retaining high efficiency at the model. Further research is needed to study these findings theoretically and to which extent they may be valid. Through the analysis of falls data coming from a recent randomized control trial, we have highlighted the usefulness of these robust estimators for safe inference and as diagnostic tools. That said, while most of the results we presented make sense for moderate to large sample sizes, we understand the need for equivalent techniques in small sample sizes. This represents future work as is the extension of the two robust approaches to more flexible family of distributions for counts, such as the Poisson-Inverse Gaussian. 7. Supplementary Materials Web Appendices A, B, and C and Web Figures 1 and 2, referenced in Sections 2, 3, and 4, as well as R code are available with this paper at the Biometrics website on Wiley Online Library.

Acknowledgements Part of this work was done thanks to the financial support of the Academic Society of Geneva. The authors wish to thank Ana Bianco for sharing R code and the PD-FIT investigators, in particular Colleen Canning and Catherine Sherrington, for providing the PD-FIT data. Finally, the authors thank the associate editor and reviewer for their relevant comments which improved the article.

References Amiguet, M. (2011). Adaptively weighted maximum likelihood estimation of discrete distributions. PhD thesis, Universit´e de Lausanne, Switzerland. Anscombe, F. J. (1949). The statistical analysis of insect counts based on the negative binomial distribution. Biometrics 5, 165–173. Bianco, A. M., Boente, G., and Rodrigues, I. M. (2013a). Resistant estimators in Poisson and Gamma models with missing re-

sponses and an application to outlier detection. Journal of Multivariate Analysis 114, 209–226. Bianco, A. M., Boente, G., and Rodrigues, I. M. (2013b). Robust tests in generalized linear models with missing responses. Computational Statistics & Data Analysis 65, 80–97. Bianco, A. M. and Yohai, V. J. (1996). Robust estimation in the logistic regression model. In Robust Statistics, Data Analysis, and Computer Intensive Methods, Rieder, H. (ed), 17–34. New York, NY: Springer-Verlag. Breslow, N. E. (1984). Extra-Poisson variation in log-linear models. Journal of the Royal Statistical Society, Series C 33, 38–44. Cadigan, N. G. and Chen, J. (2001). Properties of robust Mestimators for Poisson and negative binomial data. Journal of Statistical Computation and Simulation 70, 273–288. Canning, C. G., Sherrington, C., Lord, S. R., Fung, V. S. C., Close, J. C. T., Heritier, S., and Heller, G. (2014). A randomised controlled trial of the effect of an exercise and cueing program on fall prevention in Parkinson’s disease: The PD-FIT study. Submitted. Canning, C. G., Sherrington, C., Lord, S. R., Fung, V. S. C., Close, J. C. T., Latt, M. D., Howard, K., Allen, N. E., O’Rourke, S. D., and Murray, S. M. (2009). Exercise therapy for prevention of falls in people with Parkinsons disease: A protocol for a randomised controlled trial and economic evaluation. BMC Neurology 9, 1–7. Cantoni, E. and Ronchetti, E. M. (2001). Robust inference for generalized linear models. Journal of the American Statistical Association 96, 1022–1030. Cantoni, E. and Ronchetti, E. M. (2006). A robust approach for skewed and heavy-tailed outcomes in the analysis of health care expenditures. Journal of Health Economics 25, 198– 213. Chambers, R., Dreassi, E., and Salvati, N. (2013). Disease mapping via negative binomial M-quantile regression. Technical report, Dipartimento di Statistica “Giuseppe Parenti,” Universit` a Degli Studi di Firenze. Clark, S. J. and Perry, J. N. (1989). Estimation of the negative binomial parameter κ by maximum quasi-likelihood. Biometrics 45, 309–316. Croux, C. and Haesbroeck, G. (2003). Implementing the Bianco and Yohai estimator for logistic regression. Computational Statistics & Data Analysis 44, 273–295. Fahn, S., Elton, R. L., and UPDRS Program Members (1987). Unified Parkinsons disease rating scale. In Recent Developments in Parkinsons Disease, Fahn, S., Marsden, C. D., Goldstein, M., and Calne, D. B. (eds), 153–163. Florham Park, NJ: Macmillan Healthcare Information. Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (1986). Robust Statistics: The Approach Based on Influence Functions. New York, NY: Wiley. Heritier, S., Cantoni, E., Copt, S., and Victoria-Feser, M.-P. (2009). Robust Methods in Biostatistics. Chichester, UK: John Wiley & Sons Ltd. Hilbe, J. M. (2011). Negative Binomial Regression, 2nd edition. New York, NY: Cambridge University Press. Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. 221–233. Berkeley, CA: University of California Press. Huber, P. J. and Ronchetti, E. M. (2009). Robust Statistics, 2nd edition. New York, NY: Wiley. Lawless, J. F. (1987). Negative binomial and mixed Poisson regression. Canadian Journal of Statistics 15, 209–225.

Robust Inference in the NB Model McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd edition. Boca Raton, FL: Chapman and Hall/CRC. Piegorsch, W. W. (1990). Maximum likelihood estimation for the negative binomial dispersion parameter. Biometrics 46, 863–867. Rousseeuw, P. J. (1985). Multivariate estimation with high breakdwon point. In Mathematical Statistics and Applications, Grossman, W., Pflug, G., Vincze, I., and Wertz, W. (eds), 283–297. Dordrecht, Netherlands: Reidel Publishing.

931

R Core Team (2013). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Received July 2013. Revised May 2014. Accepted May 2014.

Robust inference in the negative binomial regression model with an application to falls data.

A popular way to model overdispersed count data, such as the number of falls reported during intervention studies, is by means of the negative binomia...
1MB Sizes 0 Downloads 13 Views