Click for updates To cite this article: Clemontina A. Davenport, Arnab Maity & Yichao Wu (2015) Parametrically guided estimation in nonparametric varying coefficient models with quasi-likelihood, Journal of Nonparametric Statistics, 27:2, 195-213, DOI: 10.1080/10485252.2015.1026903 To link to this article: http://dx.doi.org/10.1080/10485252.2015.1026903

Journal of Nonparametric Statistics, 2015 Vol. 27, No. 2, 195–213, http://dx.doi.org/10.1080/10485252.2015.1026903

Parametrically guided estimation in nonparametric varying coefficient models with quasi-likelihood

Clemontina A. Davenport†, Arnab Maity and Yichao Wu*


Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA

(Received 5 November 2013; accepted 25 February 2015)

Varying coefficient models (VCMs) allow us to generalise standard linear regression models to incorporate complex covariate effects by modelling the regression coefficients as functions of another covariate. For nonparametric varying coefficients, we can borrow the idea of parametrically guided estimation to improve asymptotic bias. In this paper, we develop a guided estimation procedure for nonparametric VCMs. Asymptotic properties are established for the guided estimators, and a method of bandwidth selection via the bias–variance tradeoff is proposed. We compare the performance of the guided estimator with that of the unguided estimator via both simulation and real data examples.

Keywords: generalised linear models; local polynomial smoothing; nonparametric regression; parametrically guided estimation; varying coefficient model

AMS Subject Classification: 62J12; 62G08

1. Introduction

A common scenario in applied studies is that a researcher observes some covariates and a response and wants to estimate the conditional mean function of the response given the covariates. A typical approach is to fit a generalised linear model (GLM) (Nelder and Wedderburn 1972), where a parametric assumption is made on a transformation of the conditional mean. This approach is easily interpreted and efficient if the correct parametric model is chosen, but can have serious consequences when the model is misspecified, a common problem in real-world applications. Alternatively, nonparametric methods make few or no assumptions about the model and are robust to model misspecification; however, they converge more slowly (Glad 1998) and can fail when the dimension of the covariates is too high (Fan and Zhang 2008). In this paper, we combine the benefits of both parametric and nonparametric methods by considering estimation of nonparametric varying coefficient models (VCMs) using a pre-specified parametric family of functions as a guide. We use a local-likelihood kernel smoothing based estimation procedure and investigate the asymptotic properties of the resulting estimates. We evaluate the finite sample performance of our method via a simulation study and demonstrate it by applying it to the AIDS Clinical Trials Group (ACTG) Protocol 315 data described in Section 5.

*Corresponding author. Email: [email protected]
†Current address: Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, NC 27710, USA.

© American Statistical Association and Taylor & Francis 2015

VCMs are nonparametric generalised additive models (Wood 2006) that increase the flexibility of linear models by modelling the regression coefficients as smooth functions. There are numerous advantages to using VCMs over their parametric linear model counterparts. First, a standard linear model is often too restrictive, because few real-world problems satisfy the assumption of linearity; a true, possibly nonlinear, relationship between the covariates is not well captured by polynomial fitting, which can lead to large estimation bias. Second, VCMs allow the interaction between covariates to be modelled in a nonparametric way; models with only main effects included may miss the effect of an interaction between covariates. Another advantage of VCMs is that the dimensionality issue is avoided, because the coefficient functions are allowed to vary as functions of another covariate (Hastie and Tibshirani 1993). VCMs are easy to interpret and arise when it is desirable to know how regression coefficients change over groups, or over time in longitudinal studies. These flexible models can be applied to a variety of data types, including longitudinal data (Brumback and Rice 1998; Hoover, Rice, Wu, and Yang 1998), time series data (Chen and Tsay 1993; Huang and Shen 2004), environmental data (Fan and Zhang 1999), and genetic data (Ma, Yang, Romero, and Cui 2011). VCMs consist of smooth functional parameters that need to be estimated, and this can be done using penalised spline approaches (Eilers and Marx 1996; Cao, Lin, Wu, and Yu 2010), basis expansion methods (Holdeman 1969), or by applying regression locally (Chen and Tsay 1993; Cai, Fan, and Li 2000). In this paper, we use the last of these methods, since such estimators are efficient and have nice sampling properties (Fan 1993). The method uses a kernel function to weight the likelihood and applies polynomial regression locally via Taylor series approximations.
In practice, the full likelihood may be unknown or difficult to construct, so we replace it with the quasi-likelihood introduced by Wedderburn (1974). With the quasi-likelihood, only the relationship between the mean and the variance needs to be specified, and the model still retains most of the efficiency of a maximum likelihood estimation procedure. Nonparametric estimators can be enhanced by using a parametric guide. In practice, prior information or exploratory analysis may give some insight into the shape of the unknown functions, and this information can be used to speed up convergence and reduce bias. The basic process is as follows: (1) identify a parametric family of functions that captures the shape, (2) remove the trend and carry out local polynomial estimation, and (3) add the trend back to obtain the final estimators of the functions. This guided estimation scheme was first studied in the density estimation framework, where Hjort and Glad (1995) showed that their estimator had better bias and similar variance compared to the traditional nonparametric estimator, even when the guide was superficial. It has also been studied in least squares regression (Glad 1998; Martins-Filho, Mishra, and Ullah 2008), quasi-likelihood models (QLMs) (Fan, Wu, and Feng 2009), and nonparametric additive models (Fan, Maity, Wang, and Wu 2013). In this paper, we propose a way to improve estimation of the coefficient functions in VCMs using guided techniques. Cai et al. (2000) first proposed local estimation techniques for VCM coefficient functions using local-likelihood equations and presented asymptotic properties of their estimators. Fan et al. (2009) proposed using QLMs as a general case when the likelihood is unavailable and presented guided estimation when a single covariate and response are observed. Fan et al. (2013) extended this methodology to an additive model where multiple covariates are observed but do not interact with each other.
The main contribution of this paper is that we allow for interaction between two observed covariates and thus extend previous work to VCMs, of which Fan et al. (2009) is a special case. We borrow the generality of QLMs and the idea of parametrically guided estimation to improve the bias of our estimators, and we extend the optimal bandwidth selection methodology for the kernel weight function used in local polynomial regression. We also give asymptotic theory for the guided estimators in the VCM framework, which differs from the GLM and additive model settings of Fan et al. (2009) and Fan et al. (2013), respectively. We show in simulations that the guided estimators have lower bias and similar variance when a fixed

bandwidth is used, and lower bias and variance when the optimal bandwidth is used. We then estimate the functional parameters in the ACTG Protocol 315 data. The rest of this paper is organised as follows. In Section 2 we give an overview of QLMs and the standard nonparametric estimation procedure using local polynomial fitting; we propose our parametrically guided estimation scheme in Section 2.2 using two different types of corrections and give some asymptotic properties of our estimators. In Section 3, we present a method of choosing the bandwidth parameter in local polynomial fitting. We evaluate the performance of our estimators against the standard ones in Section 4, finding that our estimators have lower bias when a fixed bandwidth is used, and lower bias and variability when the optimal bandwidth is chosen. We then apply our methodology to the ACTG data in Section 5 and provide concluding remarks in Section 6.

2. Guided estimation for VCMs

Assume that for each of n subjects we observe covariates Xi = (1, Xi1, ..., Xiq)^T and Ti, and a response Yi. The VCM for these covariates is defined as

    g(μi) = θ0(Ti) + Xi1 θ1(Ti) + · · · + Xiq θq(Ti) = Xi^T θ(Ti),    (1)

where μi = E(Yi | Xi, Ti) is the conditional mean of the response, g(·) is a link function from the GLM framework, and θ(·) = {θ0(·), θ1(·), ..., θq(·)}^T are unknown, smooth functions. The first term models the unique effect of T and the remaining terms model the interaction between X and T. This VCM is more flexible than a linear regression model because it allows the effect of X to vary smoothly with T, and the effect of T is not restricted to be linear. The goal is to estimate θ(·), and there are several ways to do this (Cleveland, Grosse, and Shyu 1991; Hastie and Tibshirani 1993). The method we adopt is local-likelihood kernel smoothing using a quasi-likelihood.

2.1. Framework and local likelihood estimation

QLMs, an extension of GLMs, are ideal because often the full likelihood may be unknown or difficult to construct. In QLMs, only the relationship between the conditional mean and variance of the responses needs to be specified, which is often doable in practice. The full conditional log-likelihood is replaced with a quasi-likelihood function Q(Yi, μi) and, if we define var(Yi | Xi, Ti) = V(μi), then Q satisfies

    ∂Q(y, μ)/∂μ = (y − μ)/V(μ).

Wedderburn (1974) shows that Q has similar properties to a log-likelihood and that Q is exactly the likelihood when the response comes from a one-parameter exponential family. The standard, unguided, nonparametric procedure for estimating θ(·) is local polynomial fitting: first approximate the functions using a Taylor series expansion, so that

    θj(Ti) ≈ βj0 + (Ti − t0)βj1 + · · · + (Ti − t0)^P βjP,    (2)

where βjp = θj^(p)(t0)/p! for p = 0, ..., P and j = 0, ..., q. Substituting this approximation into Equation (1) yields g(μi) = Gi^T β, where Gi = {1, (Ti − t0), ..., (Ti − t0)^P}^T ⊗ Xi and β = (β0^T, ..., βP^T)^T with βp = (β0p, ..., βqp)^T. The approximation in Equation (2) is only accurate when Ti is close to t0, so the quasi-likelihood is weighted so that observations with Ti close to t0 receive more weight and those far from t0 receive little or none. This is done using a kernel function Kh(·) = K(·/h)/h and defining the local quasi-likelihood as

    Σ_{i=1}^{n} Q{g^{-1}(Gi^T β), Yi} Kh(Ti − t0).    (3)

The parameter h is the bandwidth and needs to be estimated (see Section 3). We maximise Equation (3) with respect to β, and the solution β̂0 is the estimate of θ(t0).
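To make the local fit concrete, the following is a minimal sketch of maximising Equation (3) in the special case of a Gaussian response with identity link and V(μ) = 1, where the maximiser is a weighted least squares solution. The function names are ours; this is an illustration under stated assumptions, not the authors' implementation.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 (1 - u^2) on |u| <= 1."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def local_linear_vcm(X, T, Y, t0, h):
    """Local linear (P = 1) estimate of theta(t0) in a VCM with Gaussian
    response, identity link, and V(mu) = 1, so that maximising the local
    quasi-likelihood (3) is exactly weighted least squares on the local
    design G_i = {1, (T_i - t0)}^T (Kronecker) X_i."""
    n, q1 = X.shape                                # X includes the intercept column
    w = epanechnikov((T - t0) / h) / h             # K_h(T_i - t0)
    G = np.hstack([X, (T - t0)[:, None] * X])      # local linear design matrix
    WG = G * w[:, None]
    beta = np.linalg.solve(G.T @ WG, WG.T @ Y)     # weighted normal equations
    return beta[:q1]                               # beta_0 estimates theta(t0)
```

For a general link and variance function, the same weighted score equations would be solved iteratively (e.g. by local Fisher scoring) rather than in one linear solve.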


2.2. Guided estimation

The local likelihood estimators can be enhanced by using a parametric guide. Intuitively, more curvature in a function makes it more difficult to estimate. If, through exploratory analysis or prior information, one has some idea of the shape of the true function, then one can identify a parametric family that captures this trend. Using the parametric guide, the curvature of the function can be removed, yielding a flatter curve that is easier to estimate. Once this flatter curve is estimated, the guide can be used to add the trend back and obtain the final estimate of the original function. This type of guided estimation has been shown to reduce the bias of nonparametric estimators (Hjort and Glad 1995; Glad 1998; Martins-Filho et al. 2008) and to improve variance, since a larger bandwidth can be selected (Fan et al. 2009, 2013).

Define a parametric family that captures the trend of the functions as {θjg(t, αj) : αj = (αj1, ..., αjmj)^T ∈ A ⊂ R^mj} for j = 0, ..., q. The optimal guides can be found by maximising the quasi-likelihood

    Σ_{i=1}^{n} Q[g^{-1}{Xi^T θg(Ti, α)}, Yi]

with respect to α = (α0^T, ..., αq^T)^T, where θg(Ti, α) = {θ0g(Ti, α0), ..., θqg(Ti, αq)}^T. The best fit is denoted by θ̂g(t) = θg(t, α̂), where we suppress the dependence on α̂ in our notation. In this paper we present two methods of removing the trend: an additive correction and a multiplicative correction.

2.2.1. Additive correction

If the curvature of θj is well approximated by θ̂jg, then estimating the quantity θj(t) − θ̂jg(t) will yield more accurate and less variable estimates, since this quantity is close to flat. Once it is estimated, the guide is added back to give the final estimate of θj. This can be achieved in one step by defining ηj(t) = θj(t) − θ̂jg(t) + θ̂jg(t0) and estimating ηj at t0; this definition of ηj is known as the additive correction. Using this correction, the VCM in Equation (1) can be rewritten as g(μi) = Xi^T η(Ti) + Xi^T h(Ti), where η(t) = {η0(t), ..., ηq(t)}^T and h(t) = {θ̂0g(t) − θ̂0g(t0), ..., θ̂qg(t) − θ̂qg(t0)}^T. As in Section 2.1, and with a slight abuse of notation, a Taylor series approximation of ηj(Ti) about the point t0 gives

    ηj(Ti) ≈ βj0 + (Ti − t0)βj1 + · · · + (Ti − t0)^P βjP,

where βjp = ηj^(p)(t0)/p! for p = 0, ..., P. The local quasi-likelihood

    Q(β) ≡ Q(β; h, t0, α̂) = Σ_{i=1}^{n} Q[g^{-1}{Gi^T β + Xi^T h(Ti)}, Yi] Kh(Ti − t0)

is maximised with respect to β, and the estimate β̂0 corresponds to η̂(t0), which gives the final estimate θ̂(t0). Because h(t) is known for fixed T = t, our model can be fit using standard software with h(t) as an offset.
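The additive correction can be sketched in the same Gaussian identity-link setting used above, where the offset Xi^T h(Ti) is simply subtracted from the response; the function name and interface are ours, not the authors' code.

```python
import numpy as np

def local_linear_additive_guide(X, T, Y, t0, h, guide):
    """Guided local linear fit with the additive correction, sketched for the
    Gaussian identity-link case.  `guide` maps a time point (or vector) to the
    fitted parametric guide values theta_g_hat, one column per coefficient
    function.  The known offset X_i^T h(T_i), with
    h(t) = theta_g_hat(t) - theta_g_hat(t0), is subtracted from Y, and eta(t0)
    is estimated by weighted least squares; eta(t0) = theta(t0) by construction."""
    n, q1 = X.shape
    u = (T - t0) / h
    w = np.maximum(0.75 * (1.0 - u**2), 0.0) / h          # Epanechnikov K_h
    offset = np.sum(X * (guide(T) - guide(t0)), axis=1)   # X_i^T h(T_i)
    G = np.hstack([X, (T - t0)[:, None] * X])
    WG = G * w[:, None]
    beta = np.linalg.solve(G.T @ WG, WG.T @ (Y - offset))
    return beta[:q1]
```

With an exact guide, each ηj is constant in t, so even a moderate bandwidth gives nearly unbiased estimates; with a merely approximate guide, ηj is flatter than θj, which is the source of the bias reduction.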


2.2.2. Multiplicative correction

An alternative correction, which leads to a different guided estimator, is the multiplicative correction. As in the additive case, the ratio θj(t)/θ̂jg(t) will be flat if θ̂jg(t) captures the trend of θj(t), and estimating this ratio incurs less bias than estimating the unknown function directly. Once the ratio is estimated, it is multiplied by the guide to get the final estimate of θj. The multiplicative correction is defined as ηj(t) = θj(t){θ̂jg(t0)/θ̂jg(t)}, and the one-step solution requires estimating ηj at t0. Using the multiplicative correction, Equation (1) is written as

    g(μi) = {θ̂0g(Ti)/θ̂0g(t0)} η0(Ti) + {θ̂1g(Ti)/θ̂1g(t0)} Xi1 η1(Ti) + · · · + {θ̂qg(Ti)/θ̂qg(t0)} Xiq ηq(Ti).

Estimating ηj(t0) is achieved by first using a Taylor series expansion of ηj(t) about the point t0 and then maximising

    Q(β) = Σ_{i=1}^{n} Q{g^{-1}(Gi*^T β), Yi} Kh(Ti − t0)

with respect to β, where Gi* = {1, (Ti − t0), ..., (Ti − t0)^P} ⊗ (θ̂0g(Ti)/θ̂0g(t0), Xi1 θ̂1g(Ti)/θ̂1g(t0), ..., Xiq θ̂qg(Ti)/θ̂qg(t0)) and ⊗ denotes the Kronecker product between the two vectors. The solution β̂0 gives our final estimate θ̂(t0). By absorbing the guide into the design matrix, there is no offset for the multiplicative correction, and this model can easily be fit using standard GLM software.
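A matching sketch of the multiplicative correction in the Gaussian identity-link case follows; rescaling the design columns replaces the offset of the additive version. As before, names and details are ours.

```python
import numpy as np

def local_linear_multiplicative_guide(X, T, Y, t0, h, guide):
    """Guided local linear fit with the multiplicative correction (Gaussian
    identity-link sketch).  Each covariate column is rescaled by
    theta_g_hat(T_i)/theta_g_hat(t0), forming G_i^*; no offset is needed, and
    because the rescaling factor equals 1 at t0, beta_0 estimates theta(t0)."""
    n, q1 = X.shape
    u = (T - t0) / h
    w = np.maximum(0.75 * (1.0 - u**2), 0.0) / h   # Epanechnikov K_h
    Xs = X * (guide(T) / guide(t0))                # rescaled design columns
    G = np.hstack([Xs, (T - t0)[:, None] * Xs])    # G_i^*
    WG = G * w[:, None]
    beta = np.linalg.solve(G.T @ WG, WG.T @ Y)
    return beta[:q1]
```

As the Bernoulli example in Section 4 warns, this correction can be unstable when a fitted guide function passes near zero, since the correction involves dividing by the guide.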

2.3. Asymptotic properties

In this section, we investigate the asymptotic properties of the proposed guided estimators. For illustration purposes, we present the case where estimation is performed using a local linear approximation (P = 1) and an additive guide. Similar results for general P and the multiplicative guide can be obtained by straightforward but tedious algebra.

Define κd = ∫ u^d K(u) du and νd = ∫ u^d K²(u) du. Define M to be the 2 × 2 matrix with elements Mkl = κ_{k+l−2} and R the 2 × 2 matrix with elements Rkl = ν_{k+l−2}. Let ρ(x, t) = 1/[V{μ(x, t)} g'²{μ(x, t)}] and γ(x, t0) = var(Y1 | X1 = x, T1 = t0)/[V{μ(x, t0)} g'{μ(x, t0)}]². Make the definitions

    V1(t0) = ft(t0) M ⊗ E{ρ(X1, T1) X1 X1^T | T1 = t0},
    V2(t0) = ft(t0) R ⊗ E{γ(X1, T1) X1 X1^T | T1 = t0},
    B1(t0) = M^{-1}(κ2, κ3)^T ⊗ η''(t0)/2,

where ft(·) is the marginal density function of T. Then, for a fixed guide, we have the following result.

Theorem 2.1  Fix a point t0 and assume the guide is fixed. Under the conditions stated in Section A.1 of the appendix, as h → 0, nh → ∞, and nh⁵ → constant, we have

    (nh)^{1/2} {θ̂(t0) − θ(t0) − h² B1(t0)} →d N(0, Σ),

where Σ is the leading (q + 1) × (q + 1) submatrix of V1^{-1} V2 V1^{-1}.
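The kernel constants appearing in Theorem 2.1 are easy to compute. The sketch below approximates κd and νd numerically for the Epanechnikov kernel (the function name is ours); for a symmetric kernel κ1 = κ3 = 0, so M is diagonal and the leading bias coefficient in B1(t0) reduces to (κ2/κ0) η''(t0)/2.

```python
import numpy as np

def kernel_moment(K, d, squared=False, n=200_000):
    """Numerically approximate kappa_d = integral of u^d K(u) du
    (or nu_d, with K squared) for a kernel supported on [-1, 1],
    using the midpoint rule."""
    du = 2.0 / n
    u = -1.0 + (np.arange(n) + 0.5) * du       # midpoints covering [-1, 1]
    f = K(u) ** 2 if squared else K(u)
    return float(np.sum(u**d * f) * du)

# Epanechnikov kernel K(u) = 0.75 (1 - u^2) on |u| <= 1
epan = lambda u: np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)
```

For the Epanechnikov kernel these moments have closed forms: κ0 = 1, κ2 = 1/5, and ν0 = 3/5, which the numerical approximation reproduces.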

There are two noteworthy points about Theorem 2.1. First, if there is no (or a constant) guide and the model belongs to a one-parameter exponential family with the canonical link and a correctly specified variance function, then η''(t0) = θ''(t0), ρ(x, t) = γ(x, t) = V{μ(x, t)}, and our result reduces to Theorem 1 of Cai et al. (2000). Second, only the bias term B1(t0) is affected when a guide is used, not the asymptotic variance. To be specific, consider the integrated squared bias for each function θj(·). The integrated squared bias of the guided estimate is smaller than that of the unguided one when ∫{θj''(t) − θjg''(t)}² dt < ∫{θj''(t)}² dt, which is analogous to Remark 3 of Fan et al. (2009). Therefore, when a fixed bandwidth is used, the squared bias (and hence the mean squared error [MSE]) is reduced if an appropriate guide is chosen. Furthermore, finding the optimal bandwidth using the procedure of Section 3 allows a larger bandwidth to be selected than for the unguided estimate, which reduces the variance as well as the bias. We demonstrate this in our simulation study in Section 4.

Theorem 2.1 assumes a fixed guide, but in practice the guide needs to be estimated. Theorem 2.2 states that the expressions above remain valid for estimated guides as well.

Theorem 2.2  Let fjnt(x, t, y) = ft(t) exp(Q[g^{-1}{x^T θ(t)}, y]) be the true joint density and fprp(x, t, y, α) = ft(t) exp(Q[g^{-1}{x^T θg(t, α)}, y]) be the proposed joint density. Define α* to be the minimiser of the Kullback–Leibler distance between fprp and fjnt. Then, under the White (1982)-type conditions in the appendix, the result of Theorem 2.1 holds when an estimated guide θ̂g(·) is used in place of a fixed guide, with the modification that α is now replaced by α*.

The proofs of Theorems 2.1 and 2.2 are given in the appendix.
Remark 1  The asymptotic results presented in this section extend the results of Fan et al. (2009). Consider the special case where the covariate contains only the intercept, Xi = 1, and the parameter function is θ(·) ≡ θ0(·). Then ρ(x, t) = ρ(t) = 1/[V{μ(t)} g'²{μ(t)}], γ(x, t) = γ(t) = var(Y1 | T1 = t)/[V{μ(t)} g'{μ(t)}]², V1(t0) = ft(t0) ρ(t0) M, V2(t0) = ft(t0) γ(t0) R, and B1(t0) = M^{-1}(κ2, κ3)^T η''(t0)/2. With these simplified definitions, the result in Theorem 2.1 implies that

    (nh)^{1/2} {θ̂(t0) − θ(t0) − h² κ0^{-1} κ2 η''(t0)/2} →d N(0, σ²(t0)),

where σ²(t0) = ν0 var(Y1 | T1 = t0) g'²{μ(t0)}/ft(t0). This result is as in Theorem 1 of Fan et al. (2009) under the assumption that κ0 = ∫ K(z) dz = 1.


3. Optimal bandwidth selection

For simplicity of presentation, we describe our bandwidth selection method using the additive correction; the multiplicative correction follows by using the multiplicative definition of ηj(·), replacing Gi with Gi*, and omitting the offset term Xi^T h(Ti).

Once β̂ is obtained, the bias arises from the approximation error of the Taylor series expansion; hence, using more terms in the series should theoretically produce less bias. Let

    rj(Ti) = ηj(Ti) − Σ_{k=0}^{P} ηj^(k)(t0) (Ti − t0)^k / k!

be the approximation error. If a higher order Taylor approximation is substituted for ηj(Ti) in rj, then

    rj(Ti) ≈ Σ_{k=1}^{a} ηj^(P+k)(t0) (Ti − t0)^{P+k} / (P + k)! ≡ rji.

We then maximise the local quasi-likelihood including the approximation errors,

    Q*(β) = Σ_{i=1}^{n} Q[g^{-1}{Gi^T β + Xi^T h(Ti) + Xi^T ri}, Yi] Kh(Ti − t0),

with respect to β, where ri = (r0i, ..., rqi)^T. Define the maximiser as β̂*. Differentiating the local quasi-likelihood Q*(β) with respect to β gives the gradient vector

    Q*'(β) = Σ_{i=1}^{n} [Yi − g^{-1}{Gi^T β + Xi^T h(Ti) + Xi^T ri}] / V[g^{-1}{Gi^T β + Xi^T h(Ti) + Xi^T ri}] × (g^{-1})'{Gi^T β + Xi^T h(Ti) + Xi^T ri} Gi Kh(Ti − t0),
and the second derivative gives the Hessian matrix Q*''(β). A Taylor series expansion of Q*' around β̂ gives

    0 = Q*'(β̂*) ≈ Q*'(β̂) + Q*''(β̂)(β̂* − β̂),

and thus an approximation of the estimation bias is β̂* − β̂ ≈ −{Q*''(β̂)}^{-1} Q*'(β̂). To get an approximation of the variance, a Taylor series expansion of Q'(β̂) is taken about the true β, denoted β0. Note that

    0 = Q'(β̂) ≈ Q'(β0) + Q''(β0)(β̂ − β0),

which implies β̂ − β0 ≈ −{Q''(β0)}^{-1} Q'(β0), and the estimate of the conditional variance is

    var(β̂ − β0 | X, T) ≈ {Q''(β0)}^{-1} var{Q'(β0) | X, T} {Q''(β0)}^{-1},
where Q''(β0) can be approximated by Q''(β̂). To approximate the variance term, note that

    var{Q'(β0) | X, T}
      = Σ_{i=1}^{n} var{ (∂/∂β) Q[g^{-1}{Gi^T β + Xi^T h(Ti)}, Yi] | Xi, Ti } Kh²(Ti − t0)
      = Σ_{i=1}^{n} var[ (Yi − g^{-1}{Gi^T β + Xi^T h(Ti)}) / V[g^{-1}{Gi^T β + Xi^T h(Ti)}] × (g^{-1})'{Gi^T β + Xi^T h(Ti)} | Xi, Ti ] Gi Gi^T Kh²(Ti − t0)
      ≈ Σ_{i=1}^{n} [(g^{-1})'{Xi^T θ(t0)}]² / V[g^{-1}{Xi^T θ(t0)}] Gi Gi^T Kh²(Ti − t0).

The last approximation follows from the fact that Ti only carries significant weight in the neighbourhood of t0.

To compute the MSE, we denote the bias of θ̂ as B(t0; h) = {B0(t0; h), ..., Bq(t0; h)}^T, corresponding to the first q + 1 components of {Q*''(β̂)}^{-1} Q*'(β̂). The variance–covariance matrix V(t0; h) of θ̂ is the first (q + 1) × (q + 1) submatrix of var(β̂ − β0 | X, T), and the variance of the estimated VCM Xi^T θ̂ is Xi^T V(t0; h) Xi. The conditional MSE of X^T θ̂ given X = x is

    MSE(t0; h) = x^T {B(t0; h) B(t0; h)^T + V(t0; h)} x.

The sample MSE is derived as

    M̂SE(t0; h) = n^{-1} Σ_{i=1}^{n} Xi^T {B(t0; h) B(t0; h)^T + V(t0; h)} Xi
                = trace[{B(t0; h) B(t0; h)^T + V(t0; h)} n^{-1} Σ_{i=1}^{n} Xi Xi^T].

We propose to choose h as

    ĥ = argmin_h ∫ M̂SE(t; h) dt.

To summarise: we take a grid of t0's and a grid of candidate bandwidths. For each t0 and candidate h we fit the local quasi-likelihood and calculate the extended residual squares criterion (ERSC), defined as ERSC(x, t; h) = σ̂²{1 + (P + 1)/N}, where σ̂² is the weighted residual sum of squares after fitting a local Pth-order polynomial and N is the number of local data points (see Equation (5.6) of Fan, Farmen, and Gijbels (1998) for details). We sum the ERSC over t0, and the bandwidth with the lowest sum becomes the pilot bandwidth. Using this pilot bandwidth, we fit a local quasi-likelihood for each t0 to obtain β̂. For a new set of candidate bandwidths, we fit the local quasi-likelihood using the higher order Taylor series approximation to obtain β̂*, which is theoretically more accurate. We compute the bias and variance from the gradient and Hessian of the quasi-likelihood, compute the MSE, and take the candidate bandwidth with the lowest integrated MSE as our optimal bandwidth.
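The pilot-bandwidth step can be sketched as follows for a Gaussian working model with a local linear fit (P = 1). This is only an illustration of the ERSC grid search under those assumptions, not the authors' implementation; the function name and the minimum-local-sample guard are ours.

```python
import numpy as np

def ersc_pilot_bandwidth(X, T, Y, t_grid, h_grid):
    """Pilot bandwidth by summing the extended residual squares criterion,
    ERSC = sigma_hat^2 * {1 + (P + 1)/N}, over a grid of t0's
    (local linear fit, P = 1, Gaussian working model, Epanechnikov kernel)."""
    best_h, best_total = None, np.inf
    for h in h_grid:
        total = 0.0
        for t0 in t_grid:
            u = (T - t0) / h
            w = np.maximum(0.75 * (1.0 - u**2), 0.0) / h   # K_h(T_i - t0)
            N = int(np.count_nonzero(w))                   # local sample size
            if N < 3 * X.shape[1]:                         # too few local points
                total = np.inf
                break
            G = np.hstack([X, (T - t0)[:, None] * X])      # local linear design
            WG = G * w[:, None]
            beta = np.linalg.solve(G.T @ WG, WG.T @ Y)
            resid = Y - G @ beta
            sigma2 = np.sum(w * resid**2) / np.sum(w)      # weighted residual variance
            total += sigma2 * (1.0 + 2.0 / N)              # (P + 1)/N with P = 1
        if total < best_total:
            best_total, best_h = total, h
    return best_h
```

The second stage, which compares β̂ and β̂* to trade off bias against variance over a fresh bandwidth grid, would wrap this routine in the same double loop.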

4. Simulation

We conducted a simulation study to evaluate the performance of our estimators. We generated each observation (Xi, Ti, Yi) by first simulating the covariates Xi and Ti from a uniform distribution. Then, given Xi and Ti, the conditional mean μi of the response was generated as μi = g^{-1}{θ0(Ti) + Xi1 θ1(Ti)}, where g is the canonical link. We used a grid of equally spaced values tk for k = 1, ..., K = 100 to estimate the two functions. A cubic guide θ0g(t, α0) = α01 + α02 t + α03 t² + α04 t³ was used for estimating θ0, and a quadratic guide θ1g(t, α1) = α11 + α12 t + α13 t² was used for estimating θ1. We used local linear polynomial estimators with the Epanechnikov kernel weight and, for the bias calculations, chose the degree of the Taylor expansion to be a = 1.

For each of R = 1000 simulations, we generated the data and estimated the parameters of the cubic and quadratic guides as described above. To get the final estimates θ̂0 and θ̂1, we maximised the local quasi-likelihood with the appropriate distribution, canonical link, and Epanechnikov kernel weight. The design matrix of the local likelihood was constructed using both the additive and multiplicative corrections (Gi and Gi*, respectively). To find the optimal smoothing bandwidth, we first simulated 15 data sets and applied our method from Section 3; we took the median of these 15 values as the optimal bandwidth and held it fixed across the 1000 simulations. We also compared the methods using the optimal bandwidth from the unguided method as a common fixed bandwidth.

Once the two functions were estimated, we computed the marginal squared bias, marginal variance, and marginal MSE of each. Define

    Bjk = R^{-1} Σ_{r=1}^{R} [θ̂j,r(tk) − θj(tk)],
    Vjk = R^{-1} Σ_{r=1}^{R} [θ̂j,r(tk) − R^{-1} Σ_{r'=1}^{R} θ̂j,r'(tk)]²,
    MSEjk = Bjk² + Vjk,

where j = 0, 1 and r indexes the simulation. The average marginal squared bias of θ̂j is Bj² = K^{-1} Σ_{k=1}^{K} Bjk², the average marginal variance is Vj = K^{-1} Σ_{k=1}^{K} Vjk, and the average marginal MSE is MSEj = K^{-1} Σ_{k=1}^{K} MSEjk. Instead of averaging over all k, however, we used the 10% trimmed mean. Tables 1–3 show the results of our simulations. In the tables, 'Best h' is each estimation method's own optimal smoothing bandwidth obtained via the bias–variance tradeoff of Section 3, and 'Same h' is the fixed bandwidth obtained from the optimal bandwidth of the unguided estimator. All values for squared bias, variance, and MSE are multiplied by 100.

Table 1. Results of trimmed average bias, variance, and MSE for Example 1.

                          Bias²                                   Variance                                MSE
                  Naive   Additive  Multiplicative    Naive   Additive  Multiplicative    Naive   Additive  Multiplicative
  n = 100
  Same h  θ̂0(t)   0.129   0.022     0.022             0.179   0.173     0.174             0.308   0.195     0.196
          θ̂1(t)   0.003   0.001     0.001             0.552   0.522     0.522             0.556   0.522     0.522
  Best h  θ̂0(t)   0.129   0.098     0.096             0.179   0.104     0.106             0.308   0.203     0.202
          θ̂1(t)   0.003   0.001     0.001             0.552   0.294     0.296             0.556   0.295     0.297
  n = 200
  Same h  θ̂0(t)   0.080   0.014     0.015             0.090   0.088     0.089             0.171   0.103     0.104
          θ̂1(t)   0.003   0.001     0.001             0.269   0.261     0.262             0.272   0.262     0.262
  Best h  θ̂0(t)   0.080   0.026     0.026             0.090   0.074     0.075             0.171   0.100     0.101
          θ̂1(t)   0.003   0.001     0.001             0.269   0.217     0.217             0.272   0.217     0.217

Notes: 'Same h' refers to the fixed bandwidth and 'Best h' to the optimal bandwidth. All values are multiplied by 100.

4.1. Example 1: Poisson response

For the Poisson response, n = 100 or 200 covariates Xi and Ti were generated with Xi ∼ Unif[−1, 1] and Ti ∼ Unif[−2, 2]. The true functions were θ0(t) = sin(πt/2) + 4 and θ1(t) = sin(πt/4 − π/2)/2 + 1, and the response Yi was generated from a Poisson(μi) distribution. We used a grid of K = 100 equally spaced values in [−2, 2] for t0 to estimate the two functions. Table 1 gives the (trimmed) average marginal squared bias, variance, and MSE for the two functions estimated by the original method using no guide and by our method using guided estimation with additive and multiplicative corrections. When the same bandwidth is used, the guided estimation procedure reduces bias but has no effect on variance. When the optimal bandwidth is used, the guided estimates have lower bias and lower variance: the guides account for much of the trend in the true curve, so the nonparametric correction is flatter and easier to estimate. As the sample size increased, we saw a reduction in bias, variance, and MSE.

Table 2. Results of trimmed average bias, variance, and MSE for Example 2.

                          Bias²                                  Variance                                 MSE
                  Naive   Additive  Multiplicative    Naive   Additive  Multiplicative    Naive   Additive  Multiplicative
  n = 100
  Same h  θ̂0(t)   1.05    0.11      0.10              3.82    4.15      4.29              4.86    4.26      4.39
          θ̂1(t)   0.20    0.02      0.02              11.68   11.81     12.21             11.89   11.83     12.23
  Best h  θ̂0(t)   1.05    0.22      0.21              3.82    3.41      3.51              4.86    3.63      3.73
          θ̂1(t)   0.20    0.04      0.03              11.68   8.73      9.92              11.89   8.77      8.96
  n = 200
  Same h  θ̂0(t)   0.46    0.07      0.07              2.20    2.31      2.38              2.66    2.38      2.45
          θ̂1(t)   0.13    0.01      0.01              6.65    6.70      6.85              6.78    6.71      6.85
  Best h  θ̂0(t)   0.46    0.11      0.11              2.20    2.00      2.05              2.66    2.11      2.16
          θ̂1(t)   0.13    0.01      0.01              6.65    5.55      5.63              6.78    5.56      5.64

Note: All values are multiplied by 100.

Table 3. Results of trimmed average bias, variance, and MSE for Example 3.

                          Bias²                Variance               MSE
                  Naive   Additive    Naive   Additive    Naive   Additive
  n = 500
  Same h  θ̂0(t)   0.84    0.17        62.43   67.23       63.27   67.40
          θ̂1(t)   0.21    0.02        27.10   29.08       27.31   29.10
  Best h  θ̂0(t)   0.84    0.18        62.34   57.60       63.27   57.77
          θ̂1(t)   0.21    0.02        27.10   24.95       27.31   24.96

Note: All values are multiplied by 100.

4.2. Example 2: Normal response

For the Gaussian response, with n = 100 or 200, the covariates Xi and Ti were generated with Xi ∼ Unif[−1, 1] and Ti ∼ Unif[−2, 2]. The true functions were θ0(t) = sin(πt/2) − 2 and θ1(t) = 2 sin(πt/4 − π/2) + 3. The response Yi was generated from a Normal(μi, 1) distribution. Table 2 gives the (trimmed) average marginal squared bias, variance, and MSE for the two functions estimated by the original method using no guide and by our method using guided estimation with additive and multiplicative corrections. In this example, the guided estimates still have lower bias and variance when the optimal bandwidth is used. When the same bandwidth is used, the variances of the additive and multiplicative corrections are slightly higher than those of the unguided estimates. The gains in bias reduction are counteracted by the higher variance, so the MSE is approximately the same for both methods.

4.3. Example 3: Bernoulli response

For the Bernoulli response, we chose a larger sample size of n = 500, since estimating the success probability is more difficult than estimating the mean in the Gaussian and Poisson cases. The covariates Xi and Ti were generated with Xi ∼ Unif[1, 2] and Ti ∼ Unif[−1, 1]. The true functions were θ0(t) = sin(πt)/2 + 1 and θ1(t) = 0.7 sin{(t + 1)π/2} − 1. The response Yi was generated from a Bernoulli(μi) distribution. The results from this example are presented in Table 3. Using a multiplicative correction with Bernoulli data is very unstable because of the possibility of dividing by zero, so this correction was not used. We see that using the guides reduces bias and variance when the optimal bandwidth is used, and reduces bias but has little effect on variance when a fixed bandwidth is used.
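For concreteness, the three data-generating schemes above can be sketched in code. This is our own minimal reconstruction from the stated distributions, not the authors' code: the link functions are our assumption (the canonical choices of log for Poisson, identity for Gaussian, and logit for Bernoulli), as are the function names and the seed.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is our choice

def example1_poisson(n=100):
    # Example 1: Poisson response with (assumed) canonical log link
    X = rng.uniform(-1, 1, n)
    T = rng.uniform(-2, 2, n)
    theta0 = np.sin(np.pi * T / 2) + 4
    theta1 = np.sin(np.pi * T / 4 - np.pi / 2) / 2 + 1
    mu = np.exp(theta0 + theta1 * X)
    return X, T, rng.poisson(mu)

def example2_normal(n=100):
    # Example 2: Gaussian response with identity link and unit variance
    X = rng.uniform(-1, 1, n)
    T = rng.uniform(-2, 2, n)
    theta0 = np.sin(np.pi * T / 2) - 2
    theta1 = 2 * np.sin(np.pi * T / 4 - np.pi / 2) + 3
    return X, T, rng.normal(theta0 + theta1 * X, 1.0)

def example3_bernoulli(n=500):
    # Example 3: Bernoulli response with (assumed) logit link
    X = rng.uniform(1, 2, n)
    T = rng.uniform(-1, 1, n)
    theta0 = np.sin(np.pi * T) / 2 + 1
    theta1 = 0.7 * np.sin((T + 1) * np.pi / 2) - 1
    mu = 1.0 / (1.0 + np.exp(-(theta0 + theta1 * X)))
    return X, T, rng.binomial(1, mu)
```

Each Monte Carlo replicate would then fit the unguided and guided estimators over the t0 grid and accumulate the squared bias, variance, and MSE reported in the tables.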


5. HIV data analysis

In the ACTG Protocol 315, 48 individuals infected with HIV-1 were given potent antiviral medicine to evaluate the efficacy of treatment on the reduction of viral load (plasma HIV-1 RNA). The viral load was measured repeatedly over three months, and 31 baseline covariates were measured for each individual. Details of this study can be found in Lederman et al. (1998) and Liang, Wu, and Carroll (2003). Of these 31 covariates, Wu and Wu (2002) identified those that were significant predictors, and for this illustration we chose two covariates that appeared frequently in their models: baseline viral load and baseline CD4+ count. Viral load is the number of copies per milliliter and is measured on a logarithmic scale with base 10; the CD4+ count is the number of lymphocytes that are CD4+. The response was the change from baseline viral load measured at day 7. If an individual did not have a viral load measurement on day 7, then the measurement from the preceding or following day was used. The data for our illustration are presented in Figure 1. We would like to determine whether baseline viral load and baseline CD4+ count have an effect on the change from baseline viral load while adjusting for the interaction between them. Using a grid of size 25 for t0, the Epanechnikov kernel, and the Gaussian log-likelihood for Q, we fit the model in Equation (1) and estimated θ0 and θ1 using the original unguided

Figure 1. Scatterplot of the HIV Data variables of interest presented in Section 5. The response is change from baseline viral load at day 7.




Figure 2. Nonparametric estimate (solid red) and parametrically guided estimate (green dot-dashed) of θ0 (a) and θ1 (b) along with the cubic guide (black dashed).


Figure 3. Nonparametric estimate (solid red) of θ0 (a) and θ1 (c) and parametrically guided estimate (solid green) of θ0 (b) and θ1 (d), along with pointwise 95% bootstrap confidence intervals (black dashed). Figures available in colour online.

nonparametric method, represented by the solid red line in Figure 2. The pre-asymptotic bandwidth selector gave a bandwidth of 0.67. Based on the shape of these unguided estimates, we chose a cubic guide for our guided estimation of θ0 and θ1 and used the additive correction. The results from our fit are represented by the green dot-dashed lines in Figure 2, and the guides used are the black dashed lines. The bandwidth for our method was also 0.67. Our estimate θ̂0 had more curvature than its naive counterpart and followed the parametric guide very closely, suggesting that there is little model misspecification when using a cubic guide for θ0. For θ̂1, the naive estimate and our guided estimate had somewhat similar shapes, with the largest difference at the left endpoint.
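The workflow just described (fit a global cubic guide, then smooth what the guide misses) can be sketched as follows for a Gaussian response with identity link, in which case the local quasi-likelihood step reduces to weighted least squares on the guide residuals. This is our own simplified illustration of the additive correction, not the authors' code; all function names are hypothetical.

```python
# Sketch of additive-correction guided estimation for a Gaussian VCM:
# Y_i = theta0(T_i) + theta1(T_i) X_i + error, cubic parametric guide.
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def guided_fit(X, T, Y, t_grid, h):
    # Step 1: global cubic guide theta_jg(t) by least squares,
    # with coefficients cubic in T for both the intercept and the slope.
    D = np.column_stack([T**p for p in range(4)])   # 1, T, T^2, T^3
    Z = np.column_stack([D, D * X[:, None]])        # guide design matrix
    alpha, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    guide0 = lambda t: np.column_stack([t**p for p in range(4)]) @ alpha[:4]
    guide1 = lambda t: np.column_stack([t**p for p in range(4)]) @ alpha[4:]
    # Step 2: local-linear smoothing of the guide residuals at each t0,
    # then add the correction back onto the guide (additive correction).
    resid = Y - (guide0(T) + guide1(T) * X)
    est0, est1 = [], []
    for t0 in t_grid:
        w = epanechnikov((T - t0) / h)
        G = np.column_stack([np.ones_like(T), X, T - t0, (T - t0) * X])
        WG = G * w[:, None]
        # small ridge term guards against a singular local design
        beta = np.linalg.solve(G.T @ WG + 1e-8 * np.eye(4), WG.T @ resid)
        est0.append(guide0(np.array([t0]))[0] + beta[0])
        est1.append(guide1(np.array([t0]))[0] + beta[1])
    return np.array(est0), np.array(est1)
```

When the guide family contains the truth, the residual surface is flat and the local-linear correction is essentially zero, which is exactly the mechanism behind the bias reduction reported in Section 4.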




Figure 4. Nonparametric estimate (solid red) and parametrically guided estimate (green dot-dashed) of θ0 (a)–(c) and θ1 (d)–(f) for the day 14, day 21, and day 28 responses. A cubic guide was used in (a), (c), and (f), a quadratic guide was used in (b) and (e), and a quartic guide was used in (d).

The pointwise 95% bootstrap confidence intervals are given in Figure 3. The confidence intervals for the guided estimates were slightly wider near the boundaries but overall were similar to those of the naive estimate. In Figure 3(c) and 3(d), the confidence intervals for the functions contain zero everywhere. Recall that θ1 is the slope function and models the interaction between the two covariates. This indicates that there is no interaction between baseline viral load and baseline CD4+ count, and that the response can be adequately modelled as a cubic function of baseline viral load alone. We also used the VCM to separately model the viral load after day 14, day 21, and day 28 with the same two covariates, in order to compare the estimated functions on different days to day 7. Again, if an individual did not have a viral load measurement on these exact days, then the preceding or following day was used, and if all three days were missing, the individual was dropped from the analysis. This yielded 35 individuals for the day 14 analysis, 39 for day 21, and 38 for day 28. The estimated functions corresponding to these responses are given in Figure 4. The pre-asymptotic bandwidth selector gave bandwidths 0.53, 1.22, and 0.77 for the naive estimates on days 14, 21, and 28, respectively; the bandwidths for the guided estimates were 0.63, 1.26, and 0.77, respectively. The shape of the slope function θ1 changes drastically across the different days, but the entire confidence interval for all three θ1 functions (not presented) contains zero. Thus CD4+ count has a very different interaction effect with baseline viral load on different days, but the overall effect is not significant. The shape of the intercept function θ0 was similar for days 14 and 21, and had more of a cubic shape for day 28. As on day 7, the parametrically guided estimates of θ0 followed their respective guides very closely.
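The pointwise bands in Figure 3 come from a bootstrap. A generic sketch of pairs-resampling percentile bands is given below; this is our own illustration with a simple Nadaraya-Watson smoother standing in for the VCM fit (the actual analysis bootstraps the guided quasi-likelihood estimator), and all names are hypothetical.

```python
# Pointwise percentile bootstrap band for a curve estimator (sketch).
import numpy as np

def kernel_smooth(T, Y, t_grid, h):
    # Nadaraya-Watson smoother with an Epanechnikov kernel
    out = []
    for t0 in t_grid:
        u = (T - t0) / h
        w = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
        out.append(np.sum(w * Y) / max(np.sum(w), 1e-12))
    return np.array(out)

def bootstrap_band(T, Y, t_grid, h, B=200, level=0.95, seed=1):
    rng = np.random.default_rng(seed)
    n = len(Y)
    curves = np.empty((B, len(t_grid)))
    for b in range(B):
        idx = rng.integers(0, n, n)   # resample (T, Y) pairs with replacement
        curves[b] = kernel_smooth(T[idx], Y[idx], t_grid, h)
    lo = np.quantile(curves, (1 - level) / 2, axis=0)
    hi = np.quantile(curves, 1 - (1 - level) / 2, axis=0)
    return lo, hi
```

An interval that contains zero at every grid point, as for θ1 here, is the informal basis for declaring the corresponding effect not significant.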

6. Discussion

In this paper, we used parametric guides to enhance the performance of nonparametric estimators of the parameter functions in VCMs. We generalised to quasi-likelihoods since the true likelihood



is often unavailable. We presented two ways of using the guide and estimated the corrected functions using local polynomial fitting. We developed the asymptotic properties of the guided estimators and a method of selecting the optimal bandwidth parameter of the kernel function. We conducted a simulation study to compare our guided estimators to their standard nonparametric equivalents, and found that the guided estimators had lower bias when a fixed bandwidth was used, and lower bias and variance when the optimal bandwidth was used. In general, even if the shape of the parameter function is not captured by the guide, the guided estimator will still have lower bias than its unguided counterpart, and the two will have similar variability.

In this paper we presented the additive and multiplicative corrections, which are special cases of a unified family of guided estimators proposed by Fan et al. (2009). This work could be extended to include this unified family and its asymptotic properties. Other future work includes extending our methods to functional data, where the covariates of interest are smooth functions and the response can be functional or scalar. The functional covariates correspond to unknown functional parameters that need to be estimated, and this can be done using our guided estimation scheme. This work can also be extended to multivariate unknown functions (e.g. θ(Ti)) by using an empirical basis expansion of the function and reducing it to a VCM.

Acknowledgements

We thank two anonymous referees, an anonymous Associate Editor and the Editor for their constructive and helpful comments, which have improved the presentation of the paper.

Disclosure statement

No potential conflict of interest was reported by the authors.

Funding

Maity's research was supported by NIH grant [R00 ES017744] and an NCSU Faculty Research and Professional Development (FRPD) grant. Wu's research was partially supported by NSF grant [DMS-1055210] and NIH grant [R01-CA149569].

References

Brumback B.A., and Rice J.A. (1998), 'Smoothing Spline Models for the Analysis of Nested and Crossed Samples of Curves (with discussion)', Journal of the American Statistical Association, 93, 961–994.
Cai Z., Fan J., and Li R. (2000), 'Efficient Estimation and Inferences for Varying-coefficient Models', Journal of the American Statistical Association, 95, 888–902.
Cao Y., Lin H., Wu T.Z., and Yu Y. (2010), 'Penalized Spline Estimation for Functional Coefficient Regression Models', Computational Statistics and Data Analysis, 54, 891–905.
Chen R., and Tsay R.S. (1993), 'Functional-coefficient Autoregressive Models', Journal of the American Statistical Association, 88, 298–308.
Cleveland W., Grosse E., and Shyu W. (1991), 'Local Regression Models', in Statistical Models in S (2nd ed.), eds. J. Chambers and T. Hastie, Pacific Grove, CA: Wadsworth and Brooks/Cole, pp. 309–376.
Eilers P.H.C., and Marx B.D. (1996), 'Flexible Smoothing with B-splines and Penalties', Statistical Science, 11, 89–102.
Fan J. (1993), 'Local Linear Regression Smoothers and their Minimax Efficiencies', The Annals of Statistics, 21, 196–216.
Fan J., Farmen M., and Gijbels I. (1998), 'Local Maximum Likelihood Estimation and Inference', Journal of the Royal Statistical Society Series B, 60, 591–608.
Fan J., and Gijbels I. (1996), Local Polynomial Modelling and its Applications, London: Chapman and Hall.
Fan J., Maity A., Wang Y., and Wu Y. (2013), 'Parametrically Guided Generalised Additive Models with Application to Mergers and Acquisitions Data', Journal of Nonparametric Statistics, 25, 109–128.
Fan J., Wu Y., and Feng Y. (2009), 'Local Quasi-likelihood with a Parametric Guide', The Annals of Statistics, 37, 4153–4183.


Fan J., and Zhang W. (1999), 'Statistical Estimation in Varying Coefficient Models', The Annals of Statistics, 27, 1491–1518.
Fan J., and Zhang W. (2008), 'Statistical Methods with Varying Coefficient Models', Statistics and Its Interface, 1, 179–195.
Glad I.K. (1998), 'Parametrically Guided Non-parametric Regression', Scandinavian Journal of Statistics, 25, 649–668.
Hastie T., and Tibshirani R. (1993), 'Varying-coefficient Models', Journal of the Royal Statistical Society Series B, 55, 757–796.
Hjort N.L., and Glad I.K. (1995), 'Nonparametric Density Estimation with a Parametric Start', The Annals of Statistics, 23, 882–904.
Holdeman J.T. (1969), 'A Method for the Approximation of Functions Defined by Formal Series Expansions in Orthogonal Polynomials', Mathematics of Computation, 23, 275–287.
Hoover D.R., Rice J.A., Wu C.O., and Yang L.P. (1998), 'Nonparametric Smoothing Estimates of Time-varying Coefficient Models with Longitudinal Data', Biometrika, 85, 809–822.
Huang J.Z., and Shen H. (2004), 'Functional Coefficient Regression Models for Nonlinear Time Series: A Polynomial Spline Approach', Scandinavian Journal of Statistics, 31, 515–534.
Lederman M.M., Connick E., Landay A., Kuritzkes D.R., Spritzler J., St. Clair M., Kotzin B.L., Fox L., Chiozzi M.H., and Leonard J.M., et al. (2002), 'Immunologic Responses Associated with 12 Weeks of Combination Antiretroviral Therapy Consisting of Zidovudine, Lamivudine, and Ritonavir: Results of AIDS Clinical Trials Group Protocol 315', The Journal of Infectious Diseases, 179, 70–79.
Liang H., Wu H., and Carroll R.J. (2003), 'The Relationship between Virologic and Immunologic Responses in AIDS Clinical Research using Mixed-effects Varying-coefficient Models with Measurement Error', Biostatistics, 4, 297–312.
Ma S., Yang L., Romero R., and Cui Y. (2011), 'Varying Coefficient Model for Gene-environment Interaction: A Nonlinear Look', Bioinformatics, 27, 2119–2126.
Martins-Filho C., Mishra S., and Ullah A. (2008), 'A Class of Improved Parametrically Guided Nonparametric Regression Estimators', Econometric Reviews, 27, 542–573.
Nelder J.A., and Wedderburn R.W.M. (1972), 'Generalized Linear Models', Journal of the Royal Statistical Society Series A, 135, 370–384.
Wedderburn R.W.M. (1974), 'Quasi-likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method', Biometrika, 61, 439–447.
White H. (1982), 'Maximum Likelihood Estimation of Misspecified Models', Econometrica, 50, 1–25.
Wood S.N. (2006), Generalized Additive Models: An Introduction with R, Boca Raton, FL: Chapman and Hall/CRC.
Wu H., and Wu L. (2002), 'Identification of Significant Host Factors for HIV Dynamics Modelled by Non-linear Mixed-effects Models', Statistics in Medicine, 21, 753–771.

Appendix

A.1. Assumptions, definitions and facts

Recall that Xi = (1, Xi1, . . . , Xiq)^T and that Gi = {1, (Ti − t0)}^T ⊗ Xi for P = 1. Also recall that ρ(x, t) = 1/[V{μ(x, t)} g′{μ(x, t)}²]. Define Qk(u, v) = ∂^k Q{g^{−1}(u), v}/∂u^k. Since Q(·) is a quasi-likelihood function, Qk(u, v) is linear in v for each fixed u. Thus for fixed X = x, we have

Q1{x^T θ(t0), μ(x, t0)} = 0,  Q2{x^T θ(t0), μ(x, t0)} = −ρ(x, t0).

The following assumptions, which we use to prove Theorems 2.1 and 2.2, are standard in the nonparametric regression and nonparametric varying coefficient modelling literature; see, for example, Cai et al. (2000) and Fan et al. (2009).

• The kernel K(·) is a symmetric, positive, bounded function with [−1, 1] as support, satisfying ∫ uK(u) du = 0.
• E(|X|³ | T = t0) is continuous at t0.
• E(Y⁴ | X = x, T = t0) is bounded in a neighbourhood of t0.
• The function Q2(u, v) < 0 for each u and each v in the range of the response variable.
• The functions ft(·), V1(·), V{μ(x, ·)}, V′{μ(x, ·)} and g′{μ(x, ·)} are continuous at t0. Also, ft(t0) > 0 and V1(t0) > 0.
• The functions θj(·) for j = 0, . . . , q are continuous in a neighbourhood around t0.

For our results to be valid in the case when the guide is estimated, we use the following White (1982)-type conditions; see, for example, Fan et al. (2009). Recall that we denote the guides as θjg(t) = θjg(t, α) and similarly θ̂jg(t) = θjg(t, α̂), omitting the dependence on α to ease notation.

• We assume that E[log{fjnt(x, t, y)}] exists. Also, there exists a function m(x, t, y) such that |log{fprp(x, t, y)}| ≤ m(x, t, y).


• E[log{fjnt(x, t, y)} − log{fprp(x, t, y)}] has a unique minimiser α*.
• Each element of (∂/∂α) log{fprp(x, t, y)} is continuously differentiable in α.
• There are functions m2(x, t, y) and m3(x, t, y) such that, for any α, the absolute value of each element of [(∂/∂α) log{fprp(x, t, y)}][(∂/∂α) log{fprp(x, t, y)}]^T is bounded by m2(x, t, y), and the absolute value of each element of (∂²/∂α∂α^T) log{fprp(x, t, y)} is bounded by m3(x, t, y). Also, E{m2(x, t, y)} and E{m3(x, t, y)} exist.
• The matrix E([(∂/∂α) log{fprp(x, t, y)}][(∂/∂α) log{fprp(x, t, y)}]^T) is non-singular at α*, and α* is a regular point of the matrix E[(∂²/∂α∂α^T) log{fprp(x, t, y)}].
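The identities Q1{x^T θ(t0), μ(x, t0)} = 0 and Q2{x^T θ(t0), μ(x, t0)} = −ρ(x, t0) stated in Section A.1 can be verified directly in a concrete case. The following worked example is ours, not part of the original appendix; it takes the canonical Poisson quasi-likelihood with log link, for which V(μ) = μ and g(μ) = log μ.

```latex
% Worked check of $Q_1$ and $Q_2$ for the Poisson quasi-likelihood with log link.
% Here $Q(\mu, y) = y\log\mu - \mu$, $u = x^{T}\theta(t_0)$, and $\mu = g^{-1}(u) = e^{u}$.
\begin{align*}
Q\{g^{-1}(u), y\} &= yu - e^{u},\\
Q_1(u, y) &= \frac{\partial}{\partial u}\bigl(yu - e^{u}\bigr) = y - e^{u},\\
Q_2(u, y) &= -e^{u}.
\end{align*}
% Evaluating at $v = \mu(x, t_0)$ and $u = x^{T}\theta(t_0)$, so that $e^{u} = \mu(x, t_0)$:
\begin{align*}
Q_1\{x^{T}\theta(t_0), \mu(x, t_0)\} &= \mu(x, t_0) - \mu(x, t_0) = 0,\\
Q_2\{x^{T}\theta(t_0), \mu(x, t_0)\} &= -\mu(x, t_0) = -\rho(x, t_0),
\end{align*}
% since $\rho = 1/[V(\mu)\,g'(\mu)^{2}] = 1/[\mu \cdot \mu^{-2}] = \mu$.
```

The Gaussian and Bernoulli cases with their canonical links check out in the same way.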

A.2. Proof of Theorem 2.1


We first prove the result with a fixed guide and then address the case when the guide is estimated. Fix a point t0. Recall that ηj(t) = θj(t) − θjg(t) + θjg(t0). Define bn = (nh)^{−1/2} and

δ = bn^{−1} [β00 − θ0(t0), . . . , βq0 − θq(t0), h{β01 − η0′(t0)}, . . . , h{βq1 − ηq′(t0)}]^T.

Then it is straightforward to see that δ̂ maximises

Σ_{i=1}^{n} Q[g^{−1}{bn Gi^T δ + s(Ti, t0) + Xi^T θg(Ti)}, Yi] K{(Ti − t0)/h}

as a function of δ, where s(Ti, t0) = Xi^T {η(t0) + (Ti − t0) η′(t0)}. By a Taylor expansion we obtain

L(δ) ≡ Σ_{i=1}^{n} Q[g^{−1}{bn Gi^T δ + s(Ti, t0) + Xi^T θg(Ti)}, Yi] K{(Ti − t0)/h} = L0 + L1^T δ + (1/2) δ^T L2 δ + L3(δ),

where

L0 = Σ_{i=1}^{n} Q[g^{−1}{s(Ti, t0) + Xi^T θg(Ti)}, Yi] K{(Ti − t0)/h};
L1 = bn Σ_{i=1}^{n} Q1[{s(Ti, t0) + Xi^T θg(Ti)}, Yi] K{(Ti − t0)/h} Gi;
L2 = bn² Σ_{i=1}^{n} Q2[{s(Ti, t0) + Xi^T θg(Ti)}, Yi] K{(Ti − t0)/h} Gi Gi^T;
L3(δ) = (bn³/6) Σ_{i=1}^{n} Q3[{si* + Xi^T θg(Ti)}, Yi] K{(Ti − t0)/h} (Gi^T δ)³,

with si* lying between s(Ti, t0) and bn Gi^T δ + s(Ti, t0). We analyse each term separately. First we note that, using arguments similar to those in Cai et al. (2000) and Fan et al. (2009), E(|L3(δ)|) is bounded by a term of the order

O(n bn³ E[Q3(s1*, Y1)(X1^T δ)³ K{(T1 − t0)/h}]) = O(bn).   (A1)

The previous line makes use of the assumptions that K(·) is bounded, that Q3(u, v) is linear in v, and that E(|X|³ | T = t0) is continuous at t0. Next, we take up L2. For each Ti such that |Ti − t0| < h, a Taylor series expansion gives us

Xi^T η(Ti) = s(Ti, t0) + {(Ti − t0)²/2} Xi^T η′′(t0) + op(h²).

Thus we have

Q2[{s(Ti, t0) + Xi^T θg(Ti)}, μ(Xi, Ti)] = Q2[Xi^T θ(Ti), μ(Xi, Ti)] + op(1) = −ρ(Xi, Ti) + op(1).   (A2)


The expected value of the (j, k)th element of L2, with 1 ≤ j, k ≤ (q + 1), is

E(L2,jk) = h^{−1} ∫∫ Q2[{s(T1, t0) + X1^T θg(T1)}, μ(X1, T1)] X1j X1k K{(T1 − t0)/h} fx,t(X1, T1) dT1 dX1
         = ∫∫ Q2[{s(wh + t0, t0) + X1^T θg(wh + t0)}, μ(X1, wh + t0)] X1j X1k K(w) fx,t(X1, wh + t0) dw dX1
         = −∫∫ ρ(X1, wh + t0) X1j X1k K(w) fx,t(X1, wh + t0) dw dX1 + o(1)
         = −ft(t0) E[ρ(X1, T1) X1j X1k | T1 = t0] + o(1).

Similarly, it is fairly straightforward to derive that E(L2) = −ft(t0) M ⊗ E[ρ(X1, T1) X1 X1^T | T1 = t0] + o(1), where M is a 2 × 2 matrix with elements Mkl = κ_{k+l−2}. Using similar calculations, we can show that var(L2,jk) = O{(nh)^{−1}}. Thus we have that

L2 = −ft(t0) M ⊗ E[ρ(X1, T1) X1 X1^T | T1 = t0] + op(1) = −V1(t0) + op(1).   (A3)

Hence we can write L(δ) − L0 = L1^T δ − (1/2) δ^T V1(t0) δ + op(1). Note that δ̂ actually maximises L(δ) − L0, since L0 is free of δ. Now we use the quadratic approximation lemma (Fan and Gijbels, 1996) to derive that δ̂ = V1^{−1}(t0) L1 + op(1), provided that L1 is a sequence of stochastically bounded vectors. To prove the asymptotic normality of δ̂, it is therefore sufficient to show the same for L1. Note that using Equation (A2) we have

Q1[{s(Ti, t0) + Xi^T θg(Ti)}, μ(Xi, Ti)] = {ρ(Xi, Ti)(Ti − t0)²/2} Xi^T η′′(t0) + op(h²).

Denote the joint density of X and T by fx,t(·, ·). Using the assumption that nh⁵ → constant, we derive, for 1 ≤ j ≤ q + 1,

E(L1,j) = n bn ∫∫ Q1[{s(T1, t0) + X1^T θg(T1)}, μ(X1, T1)] K{(T1 − t0)/h} X1j fx,t(X1, T1) dX1 dT1
        = (nh)^{1/2} ∫∫ Q1[{s(wh + t0, t0) + X1^T θg(wh + t0)}, μ(X1, wh + t0)] K(w) X1j fx,t(X1, wh + t0) dX1 dw
        = {(nh⁵)^{1/2}/2} ∫∫ ρ(X1, wh + t0) w² X1^T η′′(t0) K(w) X1j fx,t(X1, wh + t0) dX1 dw + o{(nh⁵)^{1/2}}
        = {(nh⁵)^{1/2}/2} κ2 ft(t0) E[X1^T η′′(t0) ρ(X1, T1) X1j | T1 = t0] + o{(nh⁵)^{1/2}}.


Similarly we derive that E(L1,j) = {(nh⁵)^{1/2}/2} κ3 ft(t0) E[X1^T η′′(t0) ρ(X1, T1) X1j | T1 = t0] + o{(nh⁵)^{1/2}} for q + 2 ≤ j ≤ 2(q + 1). Hence we have that

E(L1) = (nh⁵)^{1/2} {ft(t0)/2} (κ2, κ3)^T ⊗ E[ρ(X1, T1) X1 X1^T | T1 = t0] η′′(t0) + o(1) = (nh⁵)^{1/2} B1(t0) + o(1).

Now we take up var(L1). Define Zi to be the term inside the sum in the definition of L1. Since, for any j and k, Zj is independent of Zk when j ≠ k, it is straightforward to derive that

cov(L1,j, L1,k) = n bn² [E(Z1j Z1k) − E(Z1j)E(Z1k)].

Using arguments similar to those in the computation of E(L1), we derive that E(Z1j) = O(h³) for any j, and thus n bn² E(Z1j)E(Z1k) = o(1). Next, for 1 ≤ j, k ≤ (q + 1),

n bn² E(Z1j Z1k) = n bn² E[Q1²[{s(T1, t0) + X1^T θg(T1)}, Y1] K²{(T1 − t0)/h} X1j X1k]
               = ν0 ft(t0) E{γ(X1, T1) X1j X1k | T1 = t0} + o(1),

where γ(x, t0) = var(Y1 | X = x, T1 = t0)/[V{μ(x, t0)} g′{μ(x, t0)}]². Similarly we obtain

var(L1) = ft(t0) R ⊗ E{γ(X1, T1) X1 X1^T | T1 = t0} + o(1) = V2(t0) + o(1),   (A4)

where R is a 2 × 2 matrix with elements Rjk = ν_{j+k−2}. The final step of the argument is to use the Cramér-Wold device, that is, to show that for any vector d,

{d^T var(L1) d}^{−1/2} {d^T L1 − d^T E(L1)} →d N(0, 1).

We only need to check Lyapounov's condition. To this end, recall that L1 = bn Σ_{i=1}^{n} Zi. It is sufficient to show that n bn³ E(|d^T Z1|³) → 0. This result follows from an argument similar to that in Equation (A1). Hence we obtain var(L1)^{−1/2} {L1 − E(L1)} →d N(0, I). Combining Equations (A3) and (A4) we obtain

δ̂ − (nh⁵)^{1/2} V1^{−1}(t0) B1(t0) →d N(0, V1^{−1} V2 V1^{−1}).

Theorem 2.1 now follows from the definition of δ.

A.3. Proof of Theorem 2.2

Recall the definition of Q(β; h, t, α) from Section 2.2 for the various guides. Denote by β̂(t, α) the maximiser of Q(β; h, t, α). When an estimated guide α̂ is used, it suffices to show that β̂(t, α̂) − β̂(t, α*) converges to zero in probability at a rate faster than (nh)^{1/2}, the convergence rate shown in Theorem 2.1 with a fixed guide. We start by observing two facts. First, n^{1/2} ‖α̂ − α*‖ = Op(1), which can be shown using the second set of assumptions at the end of Section A.1. Second, using the assumption that Q2(u, v) < 0 for each u and v in the range of the response variable, it is evident that both Q(β; h, t, α*) and Q(β; h, t, α̂) are strictly concave in β. Thus we have

n^{−1} Q(β; h, t, α̂) = n^{−1} Q(β; h, t, α*) + Op(n^{−1/2});   (A5)

n^{−1} (∂/∂β) Q(β; h, t, α̂) = n^{−1} (∂/∂β) Q(β; h, t, α*) + Op(n^{−1/2});   (A6)

n^{−1} (∂²/∂β∂β^T) Q(β; h, t, α̂) = n^{−1} (∂²/∂β∂β^T) Q(β; h, t, α*) + Op(n^{−1/2}),   (A7)

for any β, where the last two equations hold entry-wise. Thus it follows from Equation (A5) that β̂(t, α̂) − β̂(t, α*) = op(1).


Using Equation (A6) and a Taylor expansion, we have

0 = n^{−1} (∂/∂β) Q(β; h, t, α̂) |_{β=β̂(t,α̂)}
  = n^{−1} (∂/∂β) Q(β; h, t, α*) |_{β=β̂(t,α̂)} + Op(n^{−1/2})
  = n^{−1} (∂/∂β) Q(β; h, t, α*) |_{β=β̂(t,α*)} + n^{−1} (∂²/∂β∂β^T) Q(β; h, t, α*) |_{β=β̂(t,α*)} [β̂(t, α̂) − β̂(t, α*)] + Op(n^{−1/2}).

Recall that, by the definition of β̂(t, α*), we have (∂/∂β) Q(β; h, t, α*) |_{β=β̂(t,α*)} = 0. Also, using an argument similar to that used to derive Equation (A3), the Frobenius norm of the second derivative term in the expansion above can be shown to be Op(1). Thus we have that

β̂(t, α̂) − β̂(t, α*) = Op(n^{−1/2}).

The rate n^{1/2} is much faster than the rate (nh)^{1/2} shown in Theorem 2.1, and hence the result follows.
