Biometrika (2014), 101, 3, pp. 613–624 Printed in Great Britain

doi: 10.1093/biomet/asu022 Advance Access publication 4 August 2014

Downloaded from http://biomet.oxfordjournals.org/ at Temple University on November 12, 2014

Estimation of mean response via the effective balancing score

BY ZONGHUI HU, DEAN A. FOLLMANN
Biostatistics Research Branch, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland 20892, U.S.A.
[email protected] [email protected]

AND NAISYIN WANG
Department of Statistics, University of Michigan, Ann Arbor, Michigan 48109, U.S.A.
[email protected]

SUMMARY
We introduce the effective balancing score for estimation of the mean response under a missing-at-random mechanism. Unlike conventional balancing scores, the proposed score is constructed via dimension reduction free of model specification. Three types of such scores are introduced, distinguished by whether they carry the covariate information about the missingness, the response, or both. The effective balancing score leads to consistent estimation with little or no loss in efficiency. Compared to existing estimators, it reduces the burden of model specification and is more robust. It is a near-automatic procedure which is most appealing when high-dimensional covariates are involved. We investigate its asymptotic and numerical properties, and illustrate its application with an HIV disease study.

Some key words: Balancing score; Dimension reduction; Missingness at random; Nonparametric kernel regression; Prognostic score; Propensity score.

1. INTRODUCTION

In social and medical studies, of primary interest is usually the mean response, the estimation of which can be complicated by observations that are missing due to nonresponse, drop-out or death. Suppose that the data observed are triples {(Yi, δi, Xi), i = 1, . . . , n}, where Yi is the response, δi = 1 if Yi is observed and δi = 0 if Yi is missing, and Xi is the vector of covariates, which are always observed. Under a missing-at-random mechanism (Rosenbaum & Rubin, 1983), pr(δ = 1 | X, Y) = pr(δ = 1 | X), and estimation of E(Y) is typically developed using a parametric form for the missingness pattern π(X) = pr(δ = 1 | X) or the response pattern m(X) = E(Y | X). Important estimation methods include regression (Rubin, 1987; Schafer, 1997), inverse propensity weighting (Horvitz & Thompson, 1952), augmented inverse propensity weighting (Robins et al., 1994), and their modified versions (D'Agostino, 1998; Scharfstein et al., 1999; Little & An, 2004; Cao et al., 2009). Reviews of most of these methods can be found in Lunceford & Davidian (2004) and Kang & Schafer (2007). Consistency and efficiency of these estimators rely on correct model specification. Even for the doubly robust estimators, either π(X) or m(X) needs to be correctly specified for consistency, and both must be correctly specified for efficiency (Robins & Rotnitzky, 1995; Hahn, 1998). When X ∈ Rp is of high dimension, it is hard to choose a parametric specification that is sufficiently flexible to capture all the

important nonlinear and interaction effects, and yet parsimonious enough to maintain reasonable efficiency.

One family of estimators is built upon the balancing score. According to Rosenbaum & Rubin (1983), a balancing score b(X) satisfies E{Y | b(X)} = E{Y | b(X), δ = 1}, so E(Y) can be estimated via b(X) over the complete cases {(Yi, δi, Xi): δi = 1}. The best-known balancing scores include the propensity score (Rosenbaum & Rubin, 1983) and the prognostic score (Hansen, 2008). The mean response can be estimated via the balancing score by approaches such as stratification (Rosenbaum & Rubin, 1983) and nonparametric regression (Cheng, 1994). The naive balancing score X is subject to the curse of dimensionality when X is high-dimensional (Abadie & Imbens, 2006). Balancing scores have been estimated through parametric modelling. Balancing-score-based estimators are less sensitive to model misspecification than other estimators (Rosenbaum, 2002). One important property of the balancing-score-based estimator, which has rarely been used, is that full parametric modelling is unnecessary. For example, if π(X) = f{b(X)} for some function b(X) and unknown function f, then E(Y) can be estimated via b(X) through nonparametric regression or stratification, as subjects with similar b(X) values have similar π(X) values. If we can find such a function b(X), there is no need for the full parametric form of π(X).

In this work, we introduce the effective balancing score. Like the propensity and prognostic scores, our proposed score creates a conditional balance between the subjects with observed response and those with missing response. Unlike conventional balancing scores, estimation of the effective balancing score is free of model specification via dimension reduction (Cook & Weisberg, 1991; Li, 1991; Cook & Li, 2002; Li & Wang, 2007; Li & Zhu, 2007). The effective balancing score carries all covariate information about the missingness or the response in the sense that δ ⊥⊥ X | S or Y ⊥⊥ X | S, where S stands for the effective balancing score and ⊥⊥ stands for conditional independence. The use of the effective balancing score thus leads to consistent estimation of E(Y), with the potential for full efficiency. As a parsimonious summary of X, the effective balancing score has dimension much smaller than p, which enables effective use of nonparametric regression or stratification.

2. EFFECTIVE BALANCING SCORE

2·1. Definition of the effective balancing scores
Let R be any response depending on X ∈ Rp. For the missing data problem, R can be the response variable Y or the indicator variable δ. The response R relates to X only through a few linear combinations; that is, R ⊥⊥ X | (β1T X, . . . , βKT X), where the βk ∈ Rp (k = 1, . . . , K) are distinct vectors. This holds trivially for K = p and βk the vector with 1 as its kth element and all other elements zero. Let B = (β1, . . . , βK), where the βk are orthonormal and K is the smallest dimension needed to provide the conditional independence. Then B is a basis of the central dimension-reduction space SR|X, with K being the structural dimension (Cook, 1994). The columns of B are arranged in descending order of importance; that is, λ1 ≥ · · · ≥ λK > 0, where λk measures the amount of X information carried by βkT X, as explained in § 3·2. In general, K is much smaller than p. If we let S = BT X, then S ∈ RK is a parsimonious summary of X: it has lower dimension than X but carries all the X information about R. In this paper, we refer to B = (β1, . . . , βK) as the effective directions.

Let R = δ and denote by Bδ the effective directions of Sδ|X. Then

δ ⊥⊥ X | BδT X    (1)


and we refer to Sδ = BδT X as the effective propensity score. Let R = Y and let BY denote the effective directions of SY|X. Then

Y ⊥⊥ X | BYT X.    (2)

Obviously, BYT X is a prognostic score satisfying the definition of Hansen (2008). We refer to SY = BYT X as the effective prognostic score. Each effective score creates a conditional balance

Y ⊥⊥ δ | S,    (3)

where S is either Sδ or SY. For S = Sδ, (3) follows from Theorem 3 of Rosenbaum & Rubin (1983). For S = SY,

pr(δ = 1 | Y, S) = E{E(δ | Y, X) | Y, S} = E{E(δ | X) | Y, S} = E{E(δ | X) | S},    (4)

where the second equality is due to missingness at random and the last to (2). Since (4) equals E(δ | S) = pr(δ = 1 | S), (3) follows. It is immediate from (3) that both effective scores are balancing scores.

Here is an example. Suppose that Y given X is normal with mean m(X) = X1 + exp(X2 + X3) + X1X4 and variance σ²(X) = X1² + X4², and suppose that the probability of observing Y is π(X) = exp(0·1X1² + X2 + X3){1 + exp(0·1X1² + X2 + X3)}⁻¹. Then the prognostic score is {m(X), σ²(X)} and the propensity score is π(X). The effective prognostic score is {2^{−1/2}(X2 + X3), X1, X4}, and the effective propensity score is {2^{−1/2}(X2 + X3), X1}. The effective balancing scores may have higher dimensions than their conventional counterparts. However, estimation of the propensity score and the prognostic score requires correct model specification and is subject to the challenges discussed in § 1. The effective balancing scores, on the other hand, can be obtained without specification of a model.

We can also let R = (δ, Y) be a bivariate response. Let Bd denote the effective directions for S(δ,Y)|X. Then

δ ⊥⊥ X | BdT X,   Y ⊥⊥ X | BdT X.    (5)

In other words, BdT X carries all the X information about both δ and Y, and creates both propensity balance and prognostic balance. As shown by (1) and (2), as long as either of the independence conditions in (5) holds, Sd = BdT X is a balancing score satisfying the conditional balance (3). For this reason, we refer to Sd as the effective double balancing score. In the above example, Sd is the same as SY.

In summary, the effective prognostic score and the effective double balancing score have the properties that Y ⊥⊥ δ | S and Y ⊥⊥ X | S. The first property ensures unbiased estimation of E(Y) via S from the complete cases. The second property allows efficient estimation of E(Y) via S. The effective propensity score possesses only the first property and is thus not as efficient. Without loss of generality, we assume that E(X) = 0 and cov(X) = Ip, the identity matrix.

2·2. Estimation of the effective balancing scores
To find the effective balancing scores is to find the effective directions of Sδ|X for the effective propensity score and the effective directions of SY|X for the effective prognostic score. For the effective double balancing score, we need the effective directions of S(δ,Y)|X.

Remark 1. Under missingness at random, S(δ,Y)|X = SδY|X for a continuous response Y. The effective directions of S(δ,Y)|X can be estimated through the univariate response δY. A similar approach applies if Y is categorical. See the Appendix.
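To make the estimation of effective directions concrete, here is a minimal sketch of sliced inverse regression, one of the dimension reduction methods used in this paper. The function `sir_directions` and the toy model are our own illustration, not the paper's implementation; we assume X has already been standardized to mean zero and identity covariance as above.

```python
import numpy as np

def sir_directions(X, R, n_slices=10, n_dirs=1):
    """Sliced inverse regression (Li, 1991): estimate directions of S_R|X.

    Assumes X is standardized to mean zero and identity covariance.
    Slices the observations on R, forms the kernel matrix cov{E(X | R)}
    from the slice means, and returns its leading eigenvectors.
    """
    n, p = X.shape
    slices = np.array_split(np.argsort(R), n_slices)
    M = np.zeros((p, p))
    for idx in slices:
        mean_h = X[idx].mean(axis=0)          # slice mean of X
        M += (len(idx) / n) * np.outer(mean_h, mean_h)
    eigvals, eigvecs = np.linalg.eigh(M)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]         # descending importance
    return eigvecs[:, order[:n_dirs]]

# Toy check: R depends on X only through (X1 + X2)/sqrt(2).
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 5))
R = np.exp(X[:, 0] + X[:, 1]) + 0.1 * rng.standard_normal(5000)
b = sir_directions(X, R)[:, 0]
target = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2)
```

In this toy example the estimated direction b should be nearly proportional to target; taking R = δ would give a sketch of the effective propensity direction in the same way.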


There are many dimension reduction methods for estimating the effective directions, of which the most fundamental are sliced inverse regression (Li, 1991) and sliced average variance estimation (Cook & Weisberg, 1991). Both require that E(X | BT X) be a linear function of BT X, known as the linearity condition. Many improved methods have been developed. To improve estimation efficiency, there are the likelihood-based methods of Cook (2007) and Cook & Forzani (2008, 2009). To relax distributional assumptions, Li & Dong (2009) and Dong & Li (2010) proposed methods that remove the linearity condition, and Ma & Zhu (2012) eliminated all distributional assumptions. These estimators are root-n consistent under certain conditions. We adopt the principal fitted components method of Cook (2007) in the numerical studies unless stated otherwise. More information about the dimension reduction methods is given in § 6.

Remark 2. Under missingness at random, the effective directions of S(δ,Y)|X include both the effective directions of Sδ|X and those of SY|X (Chiaromonte et al., 2002); that is, S(δ,Y)|X = span(Sδ|X, SY|X).

In addition to the method in Remark 1, Remark 2 suggests a pooling method for estimating the effective directions of S(δ,Y)|X. Since δ and Y are usually related, there is likely overlap between Sδ|X and SY|X. Therefore, the pooling method should be followed by Gram–Schmidt orthogonalization to remove redundancy.

3. MEAN RESPONSE ESTIMATION VIA THE EFFECTIVE BALANCING SCORE

3·1. Nonparametric regression via the effective balancing score
In this section, let B be the matrix of effective directions and S the effective balancing score. As B has orthonormal columns, S = BT X has identity covariance. We use S ∈ RK instead of X ∈ Rp for the estimation of E(Y) through stratification or nonparametric regression. Here we focus on nonparametric regression. We first consider B to be known and later investigate the impact of estimating B.

Let m(S) = E(Y | S). Then E(Y) = E{m(S)} can be estimated through estimation of m(S). To obviate the need for model specification for m(·), we estimate it by nonparametric kernel regression (Silverman, 1986):

m̂(S) = Σ_{i=1}^n δ_i Y_i K_H(S_i − S) / Σ_{i=1}^n δ_i K_H(S_i − S),    (6)

where S_i = BT X_i and K_H(u) = det(H)⁻¹ K(H⁻¹u) for uT = (u_1, . . . , u_K), with H being the bandwidth matrix and K(·) the kernel function. Since S has identity covariance, we take H = h_n I_K with h_n a scalar bandwidth (Härdle et al., 2004). We then estimate E(Y) by

μ̂ = n⁻¹ Σ_{i=1}^n m̂(S_i).    (7)

We refer to μ̂ as the nonparametric balancing score estimator. By a result of Devroye & Wagner (1980), m̂(S) converges in probability to E(δY | S)/E(δ | S). It is immediate from (3) that E(δY | S) = E(δ | S)E(Y | S). Therefore, m̂(S) converges in probability to m(S) and, consequently, (7) converges in probability to μ = E(Y).
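The estimator in (6)–(7) can be sketched directly. The following minimal illustration assumes a Gaussian product kernel, which satisfies Condition A1, with a scalar bandwidth; the function name and the toy data-generating model are our own, not the paper's.

```python
import numpy as np

def np_balancing_score_estimator(S, Y, delta, h):
    """Nonparametric balancing score estimator, equations (6)-(7).

    S: (n, K) scores; Y: responses (values arbitrary where delta == 0);
    delta: 0/1 observation indicators; h: scalar bandwidth, H = h * I_K.
    """
    S = S.reshape(len(Y), -1)
    # Gaussian product kernel weights K_H(S_i - S_j), complete cases only.
    d2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-0.5 * d2 / h**2) * delta[None, :]
    m_hat = (W * Y[None, :]).sum(axis=1) / W.sum(axis=1)   # (6)
    return m_hat.mean()                                    # (7), average over all i

# Toy check: E(Y) = 0, but the complete-case mean is biased upwards
# because the probability of observing Y increases with S.
rng = np.random.default_rng(1)
n = 4000
S = rng.standard_normal((n, 1))
Y = 2 * S[:, 0] + rng.standard_normal(n)
delta = rng.binomial(1, 1 / (1 + np.exp(-S[:, 0])))
mu_hat = np_balancing_score_estimator(S, Y, delta, h=0.25)
cc_mean = Y[delta == 1].mean()    # naive complete-case average
```

Here the complete-case average is markedly biased away from E(Y) = 0, while the balancing score estimator is close to the truth up to kernel smoothing bias and Monte Carlo noise.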


THEOREM 1. Under Conditions A1–A3 in the Appendix, the nonparametric balancing score estimator μ̂ is asymptotically normally distributed: if h_n → 0 and n h_n^K → ∞ as n → ∞, then n^{1/2}(μ̂ − μ) → N(0, σ²) in distribution, where σ² = var(Y) + E[{π(S)⁻¹ − 1} var(Y | S)] with π(S) = E(δ | S).

For S = SY or S = Sd, by (2) and (5) we have Y ⊥⊥ X | S and thus var(Y | S) = var(Y | X). It follows that

σ² = var(Y) + E[{π(X)⁻¹ − 1} var(Y | X)],

which is the optimal efficiency for semiparametric estimators of E(Y) (Hahn, 1998). Hence the nonparametric balancing score estimator via SY or Sd is both consistent and optimally efficient. For S = Sδ, the optimal efficiency may not be reached, as var(Y | S) ≥ var(Y | X).

THEOREM 2. With B replaced by its root-n consistent estimator, the nonparametric balancing score estimators have the same asymptotic properties as in Theorem 1.

Due to Theorem 2, we will use Sδ, SY and Sd for the effective balancing scores whether B is known or estimated.

3·2. Dimension of the effective balancing score
To determine the dimension of the effective balancing score is to determine K, the number of effective directions. A simple approach is the sequential permutation test of Cook & Yin (2001). The dimension of the effective balancing score affects the performance of the proposed estimator through the nonparametric regression (6). Following Theorem 1, the impact of nonparametric regression is asymptotically negligible for h_n ∼ n^{−α} with 0 < α < 1/K. Thus, for larger K, selection of h_n is more constrained as α falls in a narrower range. More specifically, nonparametric regression introduces a bias of h_n² B and a variance of (n² h_n^K)⁻¹ V; see the Appendix. The mean squared error of μ̂ is minimized at h_opt ∼ n^{−2/(K+4)}. At h_opt, the asymptotic variance is n⁻¹σ² + n^{−8/(K+4)} V. If K ≤ 3, μ̂ is root-n consistent and the variance from nonparametric regression is asymptotically negligible. If K = 4, μ̂ is root-n consistent but the variance from nonparametric regression is not asymptotically negligible. If K > 4, μ̂ converges more slowly than at rate n^{−1/2}. Ideally, we would like S to have dimension no more than 3 to attain the minimum mean squared error, root-n consistency, and negligible impact from nonparametric regression. Without dimension reduction, i.e., S = X ∈ Rp, the proposed estimator reduces to the nonparametric regression estimator of Cheng (1994), which can perform poorly for large p.

We can compare the three types of effective balancing score. Both Sd and SY improve over Sδ by Theorem 1. The effective double balancing score Sd can improve over SY when SY|X is more than three-dimensional but Sδ|X is less than three-dimensional. Here is a hypothetical example. Suppose that SY|X has five effective directions, Sδ|X has one effective direction, and Bd = (Bδ, BY) has the effective propensity direction Bδ as the most important. To maintain the conditional balance (3), SY needs to be of dimension five. For Sd, we can use only the first three components: while the first component ensures conditional balance and thus consistency, the other two components enhance efficiency.

When K > 3, we use S∗ = (β1T X, β2T X, β3T X), which generally shows good performance in numerical studies. Most dimension reduction methods estimate βk as the eigenvector of a kernel matrix, and the corresponding eigenvalue λk reflects the amount of X information carried by βkT X. When the first three components carry enough X information in the sense that (λ1 + λ2 + λ3)/(λ1 + · · · + λK) ≥ 0·90, S∗ leads to good estimation. We refer to S∗ as the approximate score. If K > 3 and the first three components carry a low percentage of X information, which


rarely happens in practice, we can use generalized additive modelling for the estimation of E(Y | S). Instead of using the multivariate kernel regression (6), we estimate through the additive model E(Y | S) = g1(β1T X) + · · · + gK(βKT X), where each gk is estimated by smoothing on a single coordinate (Hastie & Tibshirani, 1990).
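The additive-model fallback can be sketched as follows. This is a minimal backfitting illustration with a Gaussian kernel smoother on each coordinate; the function names and the toy model are our own choices, and in the paper's setting the smoothing would be applied to the complete cases.

```python
import numpy as np

def kernel_smooth(x, y, h):
    """Univariate Nadaraya-Watson smoother evaluated at the data points."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (w * y[None, :]).sum(axis=1) / w.sum(axis=1)

def backfit_additive(S, Y, h=0.3, n_iter=20):
    """Fit E(Y | S) = alpha + sum_k g_k(S_k) by backfitting, smoothing the
    partial residuals on one coordinate at a time (Hastie & Tibshirani, 1990).
    """
    n, K = S.shape
    alpha = Y.mean()
    g = np.zeros((n, K))
    for _ in range(n_iter):
        for k in range(K):
            partial = Y - alpha - (g.sum(axis=1) - g[:, k])
            g[:, k] = kernel_smooth(S[:, k], partial, h)
            g[:, k] -= g[:, k].mean()       # centring for identifiability
    return alpha + g.sum(axis=1)            # fitted values of E(Y | S_i)

# Toy additive truth on a two-dimensional score.
rng = np.random.default_rng(2)
S = rng.standard_normal((1000, 2))
truth = S[:, 0] ** 2 + np.sin(S[:, 1])
Y = truth + 0.3 * rng.standard_normal(1000)
fit = backfit_additive(S, Y)
```

Each pass smooths the partial residuals on one coordinate, so only univariate smoothing is ever performed, which is the point of the additive fallback when K is large.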

3·3. Estimation procedure
Estimation via the effective balancing score proceeds in three steps. In step 1, one estimates the effective directions B and determines the dimension K. In step 2, the effective balancing score S = B̂T X is computed when K ≤ 3; if K > 3, one replaces S by the approximate score S∗ = (β1T X, β2T X, β3T X). In step 3, E(Y) is estimated by nonparametric regression via S or S∗, as determined in the previous step.

For bandwidth selection, the optimal bandwidth is h_opt ∼ n^{−2/(K+4)}, which can be estimated by a plug-in method (Fan & Marron, 1992); see the Appendix. This optimal bandwidth is smaller than the conventional bandwidth h_n ∼ n^{−1/(K+4)} for estimation of the conditional mean (Härdle et al., 2004). At the conventional bandwidth, although the proposed estimator does not attain the minimal mean squared error, the bias and variance from nonparametric regression are asymptotically negligible. Therefore, when the sample size is large we can use the conventional bandwidth, which is easier to determine (Sheather & Jones, 1991).

We can use the asymptotic variance formula in Theorem 1 for the estimation of variance. This, however, ignores the variability in the nonparametric estimates of B and m(S). We recommend bootstrapping for variance estimation: draw n samples from the original triples {(Yi, Xi, δi), i = 1, . . . , n} with replacement; compute the nonparametric balancing score estimator μ̂(b) over the bootstrapped data {(Yi, Xi, δi)(b), i = 1, . . . , n}; repeat these two steps many times and use the sample variance of the μ̂(b) to estimate var(μ̂). The bootstrap estimator includes all sources of variation.

4. NUMERICAL STUDIES

We investigate the numerical performance of the proposed estimators: μ̂δ uses the effective propensity score, μ̂Y uses the effective prognostic score, and μ̂d uses the effective double balancing score. Also computed are the commonly used model-based estimators: the parametric regression estimator μ̂reg, the inverse propensity weighting estimator μ̂ipw, and the augmented inverse propensity weighting estimator μ̂aipw. For the model-based estimators, we use linear regression for m(X) and linear logistic regression for π(X). In all simulations, 2000 datasets with n = 200 or n = 1000 are used.

In simulation 1, X = (X1, . . . , X10) has independent N(0, 1) components, π = exp(X1){1 + exp(X1)}⁻¹, and Y = 3X1 + 5X2 + ε with ε ∼ N(0, 1). With m(X) linear and π(X) logistic linear, both working models are correct for the model-based estimators. The estimation results are shown in the top portion of Table 1. Compared to model-based estimators, the nonparametric balancing score estimators have additional bias and variation arising from nonparametric procedures, but these diminish as the sample size increases. The estimators μ̂Y and μ̂d reach the optimal efficiency, while μ̂δ is less efficient, in agreement with the discussion following Theorem 1.

Simulation 2 is similar to simulation 1 except that π = exp{exp(X2)}[1 + exp{exp(X2)}]⁻¹ and Y = (X1 + X3) − 10X2⁴ + 5 exp(X4 + X5) − X3(X4 + X5) + ε. Estimation results are shown in the middle portion of Table 1. As m(X) is nonlinear and π(X) is log-logistic, the working models are incorrect and we see large bias in the model-based estimators. In this simulation,


Sδ|X has one effective direction, while both SY|X and S(δ,Y)|X have four effective directions. For μ̂Y and μ̂d, we use the approximate scores SY∗ and Sd∗, which consist of the first three components of the estimated SY and Sd, respectively. Of its first three components, Sd has X2 as the information conveyer for δ and the other two components as primary information conveyers for Y. Therefore, Sd∗ maintains conditional balance, and the small numerical biases reflect consistency of the estimator. The first three components of SY carry around 93% of the X information about Y. The approximate score SY∗ does not maintain conditional balance but still leads to good estimation: μ̂Y has much smaller bias and is more stable than the model-based estimators. This simulation also shows that μ̂d can outperform μ̂δ and μ̂Y.

Dimension reduction methods are usually developed under distributional restrictions on X, so we investigate the robustness of the proposed estimators to these restrictions. In simulation 3, Z1, . . . , Z4 are independent N(0, 1), π = exp(−Z1 + 0·5Z2 − 0·25Z3 − 0·1Z4){1 + exp(−Z1 + 0·5Z2 − 0·25Z3 − 0·1Z4)}⁻¹, and Y = 210 + 4Z1 + 2Z2 + 2Z3 + Z4 + ε. Suppose that the covariates actually observed are X1 = exp(Z1/2), X2 = Z2/{1 + exp(Z1)}, X3 = (Z1Z3/25 + 0·6)³, X4 = (Z3 + Z4 + 20)², X5 = X3X4, and X6, . . . , X10 independently distributed as Un(0, 1). This set-up mimics that of Kang & Schafer (2007). Here, to explore robustness, we use sliced inverse regression to estimate the effective directions, even though X does not satisfy the linearity condition. The estimation results are presented in the bottom portion of Table 1. It can be seen that the proposed estimators are quite robust with respect to mild violation of the linearity condition.

Table 1. Results of the numerical studies

Simulation 1        μ̂δ       μ̂Y       μ̂d       μ̂reg     μ̂ipw     μ̂aipw
n = 200    Bias     0·07     0·12     0·13     0·01     0·02     0·01
           SD       0·58     0·44     0·45     0·42     0·64     0·43
           RMSE     0·58     0·46     0·47     0·42     0·64     0·43
           CP.L     1·8      3·5      3·8      2·6      3·8      2·8
           CP.U     0·3      2·7      1·6      2·7      0·7      2·1
n = 1000   Bias     0·02     0·05     0·03     0        0·01     0
           SD       0·24     0·2      0·19     0·19     0·25     0·19
           RMSE     0·24     0·2      0·2      0·19     0·25     0·19
           CP.L     2·4      3·2      3·1      2·9      3·4      2·9
           CP.U     1        2·8      2·6      2·5      1·3      2·5

Simulation 2        μ̂δ       μ̂Y       μ̂d       μ̂reg     μ̂ipw     μ̂aipw
n = 200    Bias    −0·06     0·19    −0·11     2·29    −8·8     −8·31
           SD       9·37     8·62     7·94     8·87    19·42    17·91
           RMSE     9·37     8·62     7·94     9·16    21·32    19·74
           CP.L     4·2      6·3      5·4     15        4        3·8
           CP.U     1·5      1·3      1·3      0·2      1        0·6
n = 1000   Bias     0·05     0·08     0·14     2·32    −9·58    −9·75
           SD       3·5      3·42     3·84     4·01     9·28     9·04
           RMSE     3·5      3·42     3·84     4·63    13·34    13·3
           CP.L     3·6      3·4      3·6     17·6      0·4      0·3
           CP.U     1·2      2        2·4      0·1     16·2     18·5

Simulation 3        μ̂δ       μ̂Y       μ̂d       μ̂reg     μ̂ipw     μ̂aipw
n = 200    Bias    −0·2     −0·08    −0·12     0·34    15·95    −0·38
           SD       0·53     0·42     0·44     0·46   232·87     6·58
           RMSE     0·56     0·43     0·46     0·57   233·41     6·59
           CP.L     0        0·5      1·1     11·1      0·1      4·2
           CP.U     3·2      6·1      6·6      0·3      4·6      0·1
n = 1000   Bias    −0·13    −0·1      0·08     0·27    29·51    −1·69
           SD       0·2      0·19     0·18     0·2    165·74    16·48
           RMSE     0·24     0·21     0·2      0·34   168·34    16·57
           CP.L     0·2      1·2      4·1     26·8      5·4      0·7
           CP.U     4·2      4·5      1·4      0·1      0·3      1·8

Bias, Monte Carlo bias; SD, standard deviation; RMSE, root mean squared error; CP.L, the percentage μ lies below the 95% bootstrap normal confidence intervals; CP.U, the percentage μ lies above the 95% bootstrap normal confidence intervals.

5. APPLICATION

We demonstrate the proposed estimation method by applying it to an HIV study, where 820 infected patients received combination antiretroviral therapy and had baseline characteristics measured prior to therapy (Matthews et al., 2011). The baseline characteristics included weight, body mass index, age, CD4 count, HIV viral load, and haemoglobin, platelet, alanine aminotransferase and albumin levels. We are interested in the CD4 count at 96 weeks post therapy. Around 50% of the patients were lost to follow-up. It is plausible to assume missingness at random; that is, whether a patient stayed in the study depended on his or her baseline characteristics.

In this study, X is the vector of baseline characteristics and Y is the CD4 count at 96 weeks. Our interest is the mean CD4 count, E(Y). We first fit the response pattern m(X) by linear regression and the propensity score π(X) by linear logistic regression; both show poor fit. With X of dimension nine, it is nearly impossible to try out all possible higher-order terms. This casts doubt on the reliability of model-based estimators. We turn to the effective balancing scores: Sδ is one-dimensional, and both SY and Sd are two-dimensional, where the dimensions are determined by the sequential permutation test (Cook & Yin, 2001).

The estimates of E(Y) are given in Table 2. The nonparametric balancing score estimates are all quite close, because of the overlap between Sδ|X and SY|X. A diagnostic analysis indicates that the first effective direction of SY|X, which conveys about 70% of the X information about Y, is close to the effective direction of Sδ|X. The three model-based estimates are quite different from each other. The inverse propensity weighting estimate μ̂ipw is rather large and has substantial variability, probably due to the poor fit of π(X) from logistic regression.

Table 2. Estimates of mean CD4 count at 96 weeks

            μ̂δ       μ̂Y       μ̂d       μ̂reg     μ̂ipw     μ̂aipw
Estimate  322·7    323·3    323·0    322·4    680·6    328·5
SD          8·6      8·8      8·8      8·4     33·2      8·9

SD, bootstrap estimate of standard deviation.

Despite the poor fit of m̂(X), the regression estimator μ̂reg has a value close to the nonparametric balancing score estimates. Since the bias of μ̂reg is n⁻¹ Σ_{i=1}^n {m̂(Xi) − m(Xi)}, averaging over the samples can sometimes mitigate the pointwise bias in m̂(X).

6. DISCUSSION

Most dimension reduction methods recover B = (β1, . . . , βK) as the set of eigenvectors of a kernel matrix. Sliced inverse regression takes cov{E(X | R)} as the kernel matrix and sliced average variance estimation uses E[{I − cov(X | R)}²], both estimated by slicing the response R. The eigenvectors corresponding to the K largest eigenvalues are the estimates. Both methods


are root-n consistent under the linearity condition, which is satisfied if X has an elliptically symmetric distribution. Sliced average variance estimation additionally assumes that cov(X | BT X) is constant.

The principal fitted components method of Cook (2007) is an extension of sliced inverse regression. In this method, one first finds a basis function Fy = {f1(y), . . . , fr(y)} for the inverse regression X | Y, and then estimates the effective directions through PF X, the projection of X onto the subspace spanned by Fy. Although derived from a normal likelihood function, the method is not tied to normality. It has double robustness in the sense that root-n consistency is attained if X is normally distributed conditional on Y or if Fy is sufficiently correlated with E(X | Y). Appropriate selection of Fy allows more effective use of the inverse regression information than does sliced inverse regression. Approaches to finding Fy include the inverse response plot of X versus Y (Cook, 1998), spline bases, and inverse slicing. When the inverse regression X | Y has isotropic errors, the principal fitted component estimators of β1, . . . , βK are simply the eigenvectors corresponding to the K largest eigenvalues of cov(PF X). Cook (2007) and Cook & Forzani (2008, 2009) gave details about this method under various scenarios. Ding & Cook (2014) extended the method to matrix-valued covariates.

Recently, Ma & Zhu (2012) proposed a semiparametric dimension reduction method, which requires no distributional assumptions for root-n consistency. The estimation of β1, . . . , βK is from an estimating equation derived from a semiparametric influence function. By appropriately defining the terms in the influence function, this semiparametric method includes many dimension reduction methods as special cases.

Any root-n consistent dimension reduction method may be used to find the effective directions in the proposed method. We can pick a convenient method so long as the distributional assumptions are satisfied.

ACKNOWLEDGEMENT
We thank the editor, associate editor and two referees for their helpful comments. The third author's research was supported by the U.S. National Cancer Institute.

SUPPLEMENTARY MATERIAL
Supplementary material available at Biometrika online includes R code and additional results of the numerical studies in § 4 and § 5.

APPENDIX

Proof of S(δ,Y)|X = SδY|X

Let Bd denote the basis of S(δ,Y)|X and Bd∗ the basis of SδY|X. From (5), δ ⊥⊥ X | BdT X and Y ⊥⊥ X | BdT X. It follows that δY ⊥⊥ X | BdT X and SδY|X ⊂ S(δ,Y)|X.

For continuous Y, pr(δ = 0 | X) = pr(δY = 0 | X) − pr(Y = 0, δ = 1 | X) = pr(δY = 0 | X). Since Bd∗ is the basis of SδY|X, the right-hand side of the above equation is a function of Bd∗T X. Thus δ ⊥⊥ X | Bd∗T X and Sδ|X ⊂ SδY|X. Note that

pr(Y ≤ y, δ = 1 | X) = pr(δY ≤ y | X) − pr(δ = 0 | X)I(y ≥ 0),
pr(Y ≤ y, δ = 1 | X) = pr(Y ≤ y | X) pr(δ = 1 | X).

The second equation holds due to missingness at random. In the above equations, pr(δY ≤ y | X) is a function of Bd∗T X as Bd∗ is the basis of SδY|X, and pr(δ = 0 | X) and pr(δ = 1 | X) are functions of Bd∗T X as

Sδ|X ⊂ SδY|X. Therefore, pr(Y ≤ y | X) is a function of Bd∗T X. It follows that Y ⊥⊥ X | Bd∗T X and SY|X ⊂ SδY|X. As Sδ|X ⊂ SδY|X and SY|X ⊂ SδY|X, it follows from Remark 2 that S(δ,Y)|X ⊂ SδY|X.

If Y is categorical, we can perform a shift transformation Y∗ = Y + c such that Y∗ > 0. Then it follows that S(δ,Y)|X = S(δ,Y∗)|X = SδY∗|X.

Proof of Theorem 1

Theorem 1 is developed under the following regularity conditions.

Condition A1. The kernel function satisfies ∫ uK(u) du = 0, ∫ uuT K(u) du = γ_K I_K and ∫ K²(u) du = τ_K, with γ_K < ∞ and τ_K < ∞.

Condition A2. The propensity score π(X) is bounded away from 0.

Condition A3. The density of X is bounded away from 0.

We write n^{1/2}(μ̂ − μ) = n^{1/2}A_n + n^{1/2}B_n + n^{1/2}C_n, with

A_n = n⁻¹ Σ_{i=1}^n m(S_i) − μ,
B_n = n⁻¹ Σ_{i=1}^n E{m̂(S_i) − m(S_i) | O_i},
C_n = n⁻¹ Σ_{i=1}^n [m̂(S_i) − m(S_i) − E{m̂(S_i) − m(S_i) | O_i}],

where O_i = {(X_j, Y_j, δ_j): j ≠ i}. It is obvious that n^{1/2}A_n converges in distribution to N[0, var{m(S)}]. From (6), m̂(S_i) = n⁻¹ Σ_{j=1}^n δ_j Y_j K_H(S_i − S_j) / n⁻¹ Σ_{j=1}^n δ_j K_H(S_i − S_j), where n⁻¹ Σ_{j=1}^n δ_j K_H(S_i − S_j) = π(S_i) f(S_i) + o_p(1) and f(S) is the density of S. It follows that

B_n = n⁻¹ Σ_{i=1}^n E[ n⁻¹ Σ_{j=1}^n δ_j K_H(S_i − S_j){Y_j − m(S_i)} / {π(S_i) f(S_i)} | O_i ] {1 + o_p(1)}.

Similar to the argument for Theorem 2.1 of Cheng (1994), it can be shown that n^{1/2}(B_n − B_n∗) = o_p(1) with

B_n∗ = n⁻¹ Σ_{j=1}^n δ_j {Y_j − m(S_j)} / π(S_j).

Owing to the conditional independence (3), n^{1/2}B_n∗ converges in distribution to N[0, E{var(Y | S)/π(S)}]. For C_n, E(C_n) = 0 and nE(C_n²) ≤ E[{m̂(S) − m(S)}²] = O[{tr(HHT)}² + {n det(H)}⁻¹]; thus C_n = o_p(n^{−1/2}). As A_n and B_n∗ are independent, n^{1/2}(μ̂ − μ) is asymptotically normal with mean zero and variance var{m(S)} + E{var(Y | S)/π(S)}.

Let H = h_n I_K. Following Ruppert & Wand (1994), the negligible terms involving h_n are

E(B_n) = E{m̂(S) − m(S)} = h_n² B, with B = tr{m̈(S)}/2,
n var(C_n) = E[{m̂(S) − m(S)}²] = (n h_n^K)⁻¹ V, with V = E[var(Y | S){f(S)π(S)}⁻¹] τ_K.

Here m̈(S) = E{H_m(S) + 2∇_m^T(S) ∇_{πf}/πf(S)}γ_K, where ∇_m(S) and H_m(S) stand for the gradient and the Hessian matrix of m(S), respectively, and ∇_{πf}/πf(S) = {π(S)∇_f(S) + ∇_π(S)f(S)}/{π(S)f(S)}, with ∇_π(S) and ∇_f(S) being the gradients of π(S) and f(S), respectively. The mean squared error is

E{(μ̂ − μ)²} = h_n⁴ B² + (n² h_n^K)⁻¹ V + n⁻¹σ².

The optimal bandwidth, which minimizes the mean squared error, is h_opt = {KV/(4B²)}^{1/(K+4)} n^{−2/(K+4)} and can be estimated by the plug-in method.

Estimation via effective balancing score

623
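The estimator analysed in this proof is simple to compute. The sketch below is illustrative (hypothetical function name and simulated data; a Gaussian product kernel is assumed for K, whose normalizing constant cancels in the Nadaraya-Watson ratio): it forms m̂(S_i) from the complete cases as in (6) and averages it over all n units.

```python
import numpy as np

def mu_hat(S, y, delta, h):
    """Kernel estimate of E(Y) under missingness at random.

    S     : (n, K) array of balancing scores S = B^T X
    y     : (n,) responses; entries with delta == 0 are ignored
    delta : (n,) 0/1 missingness indicators (1 = observed)
    h     : scalar bandwidth, i.e. H = h I_K

    m_hat(S_i) is the Nadaraya-Watson regression of Y on S over the
    complete cases; mu_hat averages m_hat(S_i) over ALL n units.
    """
    diffs = (S[:, None, :] - S[None, :, :]) / h        # (n, n, K) pairwise scaled differences
    K = np.exp(-0.5 * (diffs ** 2).sum(axis=-1))        # Gaussian product kernel, constants cancel
    num = K @ np.where(delta == 1, y, 0.0)              # sum_j delta_j y_j K_H(S_i - S_j)
    den = K @ delta                                     # sum_j delta_j K_H(S_i - S_j)
    return float(np.mean(num / den))
```

In a simulation where the missingness probability increases with S, the naive complete-case mean is biased while this estimator recovers E(Y), consistent with the asymptotic analysis above.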

Proof of Theorem 2

Let μ̃ denote the proposed estimator under B̂, the root-n consistent estimate of B. Then μ̃ is as given in (7), except that

m̂(S_i) = {n^{-1} Σ_{j=1}^n δ_j Y_j K_H(Ŝ_i − Ŝ_j)} / {n^{-1} Σ_{j=1}^n δ_j K_H(Ŝ_i − Ŝ_j)},

with Ŝ = B̂^T X. The difference between μ̃ and μ̂ comes from the difference between K_H(S_i − S_j) and K_H(Ŝ_i − Ŝ_j). With H = h_n I_K, we have K_H(S_i − S_j) = h_n^{-K} K{(S_i − S_j)/h_n} and K_H(Ŝ_i − Ŝ_j) = h_n^{-K} K{(Ŝ_i − Ŝ_j)/h_n}. The latter can further be written as

h_n^{-K} K{ (S_i − S_j)/h_n + (B̂ − B)^T (X_i − X_j)/h_n }.

Using a bandwidth h_n ∼ n^{-1/(2+η)} with η > 0, and considering that B̂ − B = O_p(n^{-1/2}), the second term inside the kernel function is O_p{n^{-η/(4+2η)}} = o_p(1). It follows that n^{1/2}(μ̃ − μ̂) = o_p(1), so μ̃ is asymptotically equivalent to μ̂. For the optimal bandwidth h_n ∼ n^{-2/(K+4)}, this term goes to zero at the rate O_p{n^{-K/(2K+8)}}.

REFERENCES

ABADIE, A. & IMBENS, G. W. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica 74, 235–67.
CAO, W., TSIATIS, A. & DAVIDIAN, M. (2009). Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika 96, 723–34.
CHENG, P. E. (1994). Nonparametric estimation of mean functionals with data missing at random. J. Am. Statist. Assoc. 89, 81–7.
CHIAROMONTE, F., COOK, R. D. & LI, B. (2002). Sufficient dimension reduction in regressions with categorical predictors. Ann. Statist. 30, 475–97.
COOK, R. D. & LI, B. (2002). Dimension reduction for conditional mean in regression. Ann. Statist. 30, 455–74.
COOK, R. D. (1994). On the interpretation of regression plots. J. Am. Statist. Assoc. 89, 177–89.
COOK, R. D. (1998). Regression Graphics: Ideas for Studying Regressions Through Graphics. New York: Wiley.
COOK, R. D. (2007). Fisher lecture: Dimension reduction in regression. Statist. Sci. 22, 1–26.
COOK, R. D. & FORZANI, L. (2008). Principal fitted components for dimension reduction in regression. Statist. Sci. 23, 485–501.
COOK, R. D. & FORZANI, L. (2009). Likelihood-based sufficient dimension reduction. J. Am. Statist. Assoc. 104, 197–208.
COOK, R. D. & WEISBERG, S. (1991). Discussion of "Sliced inverse regression for dimension reduction". J. Am. Statist. Assoc. 86, 328–32.
COOK, R. D. & YIN, X. R. (2001). Dimension reduction and visualization in discriminant analysis. Aust. New Zeal. J. Statist. 43, 147–77.
D'AGOSTINO, R. B. (1998). Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Statist. Med. 17, 2265–81.
DEVROYE, L. P. & WAGNER, T. J. (1980). Distribution-free consistency results in nonparametric discrimination and regression function estimation. Ann. Statist. 8, 231–9.
DING, S. & COOK, R. D. (2014). Dimension folding PCA and PFC for matrix-valued predictors. Statist. Sinica 24, 463–92.
DONG, Y. & LI, B. (2010). Dimension reduction for non-elliptically distributed predictors: second-order moments. Biometrika 97, 279–94.
FAN, J. & MARRON, J. S. (1992). Best possible constant for bandwidth selection. Ann. Statist. 20, 2057–70.
HAHN, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66, 315–31.
HANSEN, B. B. (2008). The prognostic analogue of the propensity score. Biometrika 95, 481–8.
HÄRDLE, W., MÜLLER, M., SPERLICH, S. & WERWATZ, A. (2004). Nonparametric and Semiparametric Models. Heidelberg: Springer.
HASTIE, T. J. & TIBSHIRANI, R. J. (1990). Generalized Additive Models. Boca Raton: Chapman & Hall/CRC.
HORVITZ, D. G. & THOMPSON, D. J. (1952). A generalization of sampling without replacement from a finite universe. J. Am. Statist. Assoc. 47, 663–85.
KANG, J. D. Y. & SCHAFER, J. L. (2007). Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Statist. Sci. 22, 523–39.
LI, B. & DONG, Y. (2009). Dimension reduction for non-elliptically distributed predictors. Ann. Statist. 37, 1272–98.
LI, B. & WANG, S. (2007). On directional regression for dimension reduction. J. Am. Statist. Assoc. 102, 997–1008.
LI, K. C. (1991). Sliced inverse regression for dimension reduction. J. Am. Statist. Assoc. 86, 316–27.
LI, Y. & ZHU, L.-X. (2007). Asymptotics for sliced average variance estimation. Ann. Statist. 35, 41–69.
LITTLE, R. & AN, H. (2004). Robust likelihood-based analysis of multivariate data with missing values. Statist. Sinica 14, 949–68.
LUNCEFORD, J. & DAVIDIAN, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects. Statist. Med. 23, 2937–60.
MA, Y. & ZHU, L. (2012). A semiparametric approach to dimension reduction. J. Am. Statist. Assoc. 107, 168–79.
MATTHEWS, G. V., MANZINI, P., HU, Z., KHABO, P., MAJA, P., MATCHABA, G., SANGWENI, P., METCALF, J., POOL, N., ORSEGA, S., EMERY, S. & PHIDISA II STUDY TEAM (2011). Impact of lamivudine on HIV and hepatitis B virus-related outcomes in HIV/hepatitis B virus individuals in a randomized clinical trial of antiretroviral therapy in southern Africa. AIDS 25, 1727–35.
ROBINS, J. M. & ROTNITZKY, A. (1995). Semiparametric efficiency in multivariate regression models with missing data. J. Am. Statist. Assoc. 90, 122–9.
ROBINS, J. M., ROTNITZKY, A. & ZHAO, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. J. Am. Statist. Assoc. 89, 846–66.
ROSENBAUM, P. R. (2002). Observational Studies, 2nd ed. New York: Springer.
ROSENBAUM, P. R. & RUBIN, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41–55.
RUBIN, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
RUPPERT, D. & WAND, M. P. (1994). Multivariate locally weighted least squares regression. Ann. Statist. 22, 1346–70.
SCHAFER, J. L. (1997). Analysis of Incomplete Multivariate Data. Boca Raton: Chapman & Hall/CRC.
SCHARFSTEIN, D. O., ROTNITZKY, A. & ROBINS, J. M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. J. Am. Statist. Assoc. 94, 1096–120.
SHEATHER, S. J. & JONES, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. J. R. Statist. Soc. B 53, 683–90.
SILVERMAN, B. W. (1986). Density Estimation for Statistics and Data Analysis. Boca Raton: Chapman & Hall/CRC.

[Received February 2013. Revised April 2014]
