Biostatistics (2015), 16, 2, pp. 311–325 doi:10.1093/biostatistics/kxu042 Advance Access publication on September 3, 2014

A multivariate generalized C p and surface estimation RICHARD CHARNIGO∗ , CIDAMBI SRINIVASAN

SUMMARY We propose a new multivariate generalized C p (MGC p ) criterion for tuning parameter selection in nonparametric regression, applicable when there are multiple covariates whose values may be irregularly spaced. Apart from an asymptotically negligible remainder, the MGC p criterion has expected value equal to the sum of squared errors of a fitted derivative (rather than of a fitted mean response). Thus, unlike traditional criteria for tuning parameter selection, MGC p is not prone to undersmoothed derivative estimation. We illustrate a scientific application in a case study that explores the relationship among three measures of liver function. Since recent technological developments hold promise for assessing two of these measures outside of medical and laboratory facilities, better understanding of the aforementioned relationship may allow enhanced monitoring of liver function, especially in developing countries and among persons for whom access to medical and laboratory facilities is limited. Keywords: Bilirubin; Compound estimation; Curve estimation; Functional data analysis; Liver disease; Nonparametric regression; Tuning parameter.

1. INTRODUCTION Consider a nonparametric regression model Yi = μ(xi ) + i

for i ∈ {1, . . . , n},

(1.1)

where: (i) the mean response function μ(x) is real-valued, defined on a specified compact set X ⊂ R D , and unknown but assumed to have continuous derivatives of order (J + 1) on X for some known positive integers D and J ; (ii) the vectors of observed covariate values x1 , . . . , xn belong to X but are not required to be evenly spaced; and (iii) the errors 1 , . . . , n are independent with zero mean and common variance σ 2 ∈ (0, ∞), which may be unknown. When D = 1, μ(x) and its derivatives define curves; when D > 1, they define surfaces. The present manuscript is partially motivated by and will address, via new methodology that we develop for use with model (1.1), the question of how a person’s blood level of bilirubin (one measure of liver function) relates to his or her levels of aspartate aminotransferase (AST) and alanine aminotransferase (ALT) (two other measures of liver function). The publicly available data are at {http://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset) ∗ To

whom correspondence should be addressed. c The Author 2014. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected]. 

Downloaded from http://biostatistics.oxfordjournals.org/ at Ritsumeikan University on June 7, 2015

Department of Statistics, University of Kentucky, 725 Rose Street, Lexington, KY 40536, USA [email protected]

312

R. CHARNIGO AND C. SRINIVASAN

n 2  {μ 1. They minimize a data-based proxy for i=1 h (xi ) − μ(xi )} or its expected value, where h denotes the (vector of) tuning parameter(s). This is true of, for instance, the C p criterion (Mallows, 1973), generalized cross validation (Craven and Wahba, 1979), and the improved Akaike information criterion (Hurvich and others, 1998). 2. They are compatible only with a specific nonparametric regression method. For example, the factor technique of M¨uller and others (1987) is intended for kernel smoothing, whereas the REML technique of Ruppert and others (2003) is designed for spline smoothing. 3. Their derivations are straightforward when D = 1, but analogous formulations for arbitrary D are not obvious. This is true of, for instance, H¨ardle’s (1990) proposal to employ cross-validation in conjunction with difference quotients (Y2i − Y2i−1 )/(x2i − x2i−1 ) when estimating a first derivative by kernel smoothing. The first point above may not seem like a drawback. However, as Wahba and Wang (1990) note, estimating μ(x) well may not ensure that its derivatives are accurately recovered. If estimating the derivatives well helps to address the scientific questions motivating model (1.1), then the tuning parameters should be chosen accordingly. Practical applications calling for accurate recovery of derivatives include describing human growth patterns (Ramsay and Silverman, 2002), characterizing nanoparticles from scattering experiments (Charnigo and others, 2012), and quantifying relationships among tests of liver function (Ramana and others, 2011). Our goal for the present manuscript, in terms of statistical innovation, is to formulate and justify a multivariate generalized C p (MGC p ) criterion for tuning parameter selection that does not suffer from the

Downloaded from http://biostatistics.oxfordjournals.org/ at Ritsumeikan University on June 7, 2015

(Bache and Lichman, 2013)}. Originally, these data were used to evaluate classification algorithms for predicting which persons have liver disease based on their blood levels of bilirubin, AST, ALT, and other substances (Ramana and others, 2011). However, Pollock and others (2012) note that blood tests for monitoring liver function are sometimes limited by expense and logistics. This may be the case even for high-risk patients, especially in developing countries. Accordingly, Pollock and others (2012) have developed a paper-based assay that can assess AST and ALT in a fingerstick specimen. We envisage that, if a person’s bilirubin level relates to his or her AST and ALT levels, then such a specimen may also yield a prediction of that person’s bilirubin level. This may provide greater insight into which levels of AST and ALT warrant physician follow-up. While nonparametrically estimating the mean response for bilirubin as a function of AST and ALT is a helpful step to addressing the aforementioned question, there are at least three reasons that derivatives should be estimated as well. First, doing so explicitly quantifies the marginal rate of increase in predicted bilirubin as either AST or ALT is varied. Second, estimating derivatives provides a basis for some limited extrapolation, in that a Taylor polynomial can be constructed to obtain a predicted bilirubin level for a new patient whose AST and ALT values are slightly outside the range of AST and ALT levels previously observed. Third, and perhaps most importantly, an estimated derivative can be examined for pathological features that would not be obvious from the corresponding estimated mean response; if such features are present, then caution should be exercised in using this estimated mean response for prediction and perhaps new estimates should be acquired, for example by changing tuning parameters. Indeed, this last point provides a segue into our background on statistical challenges. A variety of nonparametric regression methods have been proposed for estimating μ(x) and its derivatives, including kernel smoothing (Gasser and M¨uller, 1984; H¨ardle, 1990), local regression (Cleveland, 1979; Loader, 1999), spline smoothing (Silverman, 1984; Wahba, 1990), and compound estimation (Charnigo and Srinivasan, 2011; Charnigo and others, 2013). However, each nonparametric regression method depends on one or more tuning parameters, and so techniques have also been developed for tuning parameter selection. Unfortunately, most such techniques possess one or more of the following limitations:

Multivariate Generalized C p

313

n 2  {Dμ aforementioned limitations. We will develop a data-based proxy for i=1 h (xi ) − Dμ(xi )} , rather n 2  than for i=1 {μ h (xi ) − μ(xi )} , where D is a fixed differentiation operator of order less than or equal to J . The MGC p criterion will place no restriction on D or on the nonparametric regression method, other than assuming the linear representation h = Sh Y, Dµ (1.2)

2. THE

MULTIVARIATE GENERALIZED

Cp

CRITERION

Recalling (1.2), we define F to be an n × n “empirical partial differentiation matrix” such that the n × 1 vector FY mimics a noise-corrupted version of Dµ := (Dμ(x1 ), Dμ(x2 ), . . . , Dμ(xn )) . We allow F to h 2 is like a residual sum of squares in depend on the covariate values but not on h or Y. Since FY − Dµ h 2 so that the expected estimating Dμ(x), we create the MGC p criterion by adding a penalty to FY − Dµ 2  value of MGC p is approximately the same as that of Dµh − Dµ . The detailed specification of F appears next, after which we identify the penalty and offer some remarks on practical aspects of using MGC p . 2.1 Empirical partial derivatives Suppose, to begin with, that D is the first-order differentiation operator with respect to covariate x p (1  p  D). Let FYi denote the ith component of FY, and put FYi :=

k1  j=1

 j

Ya( j,i, p) − Yb( j,i, p) x p,a( j,i, p) − x p,b( j,i, p)

 {k1 (k1 + 1)/2}

for i ∈ {1, . . . , n}.

Above, k1 is a positive integer chosen by the data analyst (cf. Remark 1) and a( j, i, p), b( j, i, p) are positive integers that index the jth observations “above” and “below” observation i. We refer to FYi , a weighted average of difference quotients, as an “empirical partial derivative”. When D = 1 (cf. top panel of supplementary material available at Biostatistics online, Figure S1) or the vectors of covariate values are evenly spaced in X (cf. middle panel), appropriate locations for observations a( j, i, p) and b( j, i, p) are clear. However, to accommodate D > 1 and unevenly spaced vectors of covariate values, we must define a( j, i, p) and b( j, i, p) generally. To see how this may be done, suppose that D = 2. The first observation above observation i with respect to x1 , namely a(1, i, 1), should have x1,a(1,i,1) slightly larger than x1,i but x2,a(1,i,1) similar to x2,i . Of course, “similar” must be made precise, and one way to do so is to require |x1,a(1,i,1) − x1,i | > |x2,a(1,i,1) − x2,i |. This is equivalent to requiring (x1,a(1,i,1) − x1,i )2 /xa(1,i,1) − xi 2 > 1/2. One may also replace 1/2 by some larger number to enforce greater similarity of x2,a(1,i,1) to x2,i . The larger the threshold, however, the larger the distance may be from

Downloaded from http://biostatistics.oxfordjournals.org/ at Ritsumeikan University on June 7, 2015

 h := (Dμ     where Dµ h (x1 ), Dμh (x2 ), . . . , Dμh (xn )) , Y := (Y1 , Y2 , . . . , Yn ) , and Sh denotes an n × n smoothing matrix depending on h and on the covariate values but not on Y. Assumption (1.2) is satisfied for many nonparametric regression methods, including kernel smoothing, local regression, spline smoothing, and compound estimation. The rest of this manuscript is organized as follows. Section 2 defines the MGC p criterion and related quantities called empirical partial derivatives. The GC p criterion (Charnigo and others, 2011) is recovered as a special case when D = 1. Section 3 employs MGC p , with compound estimation, to investigate how a person’s bilirubin level relates to his or her AST and ALT levels. We also present results from several other tuning parameter selection techniques. Section 4 presents findings from simulation studies comparing MGC p , with compound estimation and spline smoothing, to competing techniques for tuning parameter selection. Section 5 provides a theoretical justification for MGC p . We offer some conclusions in Section 6. See supplementary material available at Biostatistics online, Appendix.

R. CHARNIGO AND C. SRINIVASAN

314

x1,i to x1,a(1,i,1) . Visually, observation a(1, i, 1) is as close as possible to observation i within the right side of a “bow tie” (cf. bottom panel). Observation a(2, i, 1) is second closest to observation i within the right side of the bow tie, and so forth. More formally, and for arbitrary D, we create a set of “eligible” observations by L p,i := {m ∈ {1, . . . , n} : (x p,m − x p,i )2 > 0, (x p,m − x p,i )2 /xm − xi 2 > 1 − u}, where u ∈ (0, 1/2] is chosen by the data analyst (cf. Remark 2). Put a(1, i, p) := arg

min

x p,m ,

b(1, i, p) := arg

max

{m∈L p,i : x p,m x p,i } {m∈L p,i ∩{b(1,i, p),...,b( j−1,i, p)}c : x p,m x p,i }

Multivariate Generalized C p

315

for the tuning parameter(s). If σ 2 is unknown, then we may minimize MGC p,h (Y, σˆ 2 ), where σˆ 2 is an approximately unbiased estimator of σ 2 (cf. Remark 4). The rationale for the penalty term (2.2) is that MGC p,h (Y, σ 2 ) has approximately the same expected h − Dµ2 . More specifically, let E denote expected value and put value as Dµ  bi,h := E[Dμ h (xi )] − Dμ(xi ),

ri := E[FYi ] − Dμ(xi ).

Straightforward calculations show that n 

Dμ(xi )2 −

i=1

E[MGC p,h (Y, σ 2 )] = −

n  i=1

The “remainder”

n

2 i=1 ri



n i=1

n 

2

h ] and 2bi,h Dμ(xi ) + E[Dµ

i=1

Dμ(xi )2 −

n 

h 2 ] + 2bi,h Dμ(xi ) + E[Dµ

i=1

n 

ri2 −

i=1

n 

2ri bi,h .

i=1

2ri bi,h is asymptotically negligible (cf. Remark 5).

REMARK 1 The choice of k1 regulates a bias-variance tradeoff in using FYi as a proxy for Dμ(xi ). If k1 is small, then FYi has large variance and small bias. If k1 is large, then the reverse is true. Proposition 5.1 quantifies the bias-variance tradeoff asymptotically. Our simulation studies (Section 4) suggest that k1 = 5 works well in many scenarios; taking k1 too large may deplete the number of observations at which FYi is meaningfully defined (i.e. not simply fixed at 0). REMARK 2 For D > 1, a smaller u allows observations that differ substantially from observation i on covariate x p to contribute to FYi but does not permit observations that vary appreciably from observation i on the other covariates. The reverse is true for a larger u. Proposition 5.1 suggests a rate at which u should decrease with the sample size. Our simulation studies (Section 4) suggest that u = 1/2 works well in many scenarios; taking u too small may deplete the number of observations at which FYi is meaningfully defined. REMARK 3 One may prefer that FYi fixed at 0 not contribute to MGC p . This preference can be accommodated by defining w = (w1 , . . . , wn ) such that wi = 0 when FYi is fixed at 0 and wi = 1 otherwise. Then replace (2.3) by MGC p,h,w (Y, σ 2 ) := (FY − Sh Y) ww (FY − Sh Y) + σ 2

n  i=1

wi

n 

(2Fi j Si j,h − Fi2j ),

(2.4)

j=1

which is a data-based proxy for h − Dµ) ww (Dµ h − Dµ). (Dµ REMARK 4 By analogy to Loader’s (1999) advice on bandwidth selection in local regression, we suggest n 0 ∗ 2 defining σˆ 2 as i=1 {Yi − μ h∗ (xi )} /(n − tr [Sh∗ ]), where h is on the boundary of H. More specifically, ∗ take h to be the most extreme (combination of) tuning parameter value(s) under consideration, in the sense of potentially undersmoothing the data. Importantly, the same value of σˆ 2 should be used with all h at which MGC p is evaluated. REMARK 5 Neither bi,h nor ri is random, but they are unknown in practice. Typically, bi,h is nonzero because nonparametric regression estimators are biased, while ri is nonzero because difference quotients

Downloaded from http://biostatistics.oxfordjournals.org/ at Ritsumeikan University on June 7, 2015

h − Dµ2 ] = − E[Dµ

316

R. CHARNIGO AND C. SRINIVASAN

are only approximations to partial derivatives. Nonetheless, Proposition 5.2 shows that the remainder is asymptotically negligible under mild assumptions: Let h − Dµ) ww (Dµ h − Dµ)] and f 0 (h) := E[(Dµ

f 1 (h) := E[MGC p,h,w (Y, σ 2 )].

(2.5)

The ratio of f 0 evaluated at a minimizer of f 1 , divided by the minimum of f 0 , tends to 1.

The data set contains records for 583 persons, of which 416 had liver disease. Exploratory analysis revealed that the distributions of ALT, AST, and bilirubin were highly non-normal, with skewness coefficients from 4.88 to 10.49 and kurtosis coefficients from 39.7 to 152.1. While normally distributed data are not required for nonparametric regression, marked deviations from normality are potentially problematic, especially when we seek to estimate derivatives. Therefore, we logarithmically transformed ALT, AST, and bilirubin. After transformation, skewness coefficients ranged from 1.19 to 1.42 and kurtosis coefficients from 3.89 to 5.58. In the notation of (1.1), we defined Y to be log-transformed bilirubin. Since log-transformed ALT and AST had a correlation of 0.84, we defined X 1 to be proportional to their sum and X 2 to be proportional to their difference. The constants of proportionality were chosen so that X 1 and X 2 ranged from −1 to 1. These definitions facilitate examination of how bilirubin varies when ALT and AST move in the same direction versus when they de-couple. 3.1 Methodology We employed compound estimation with a boundary adjustment to recover μ(x), (∂/∂ x1 )μ(x), and (∂/∂ x2 )μ(x). Two tuning parameters were selected by MGC p and five competing approaches, identified below. The tuning parameters were h 1 ∈ {0.1, 0.2, 0.3, 0.4, 0.5}, the nearest neighbor fraction for local polynomial estimation, and h 2 ∈ {10, 20, 30, 40, 50}, the multiplier used to define the convolution matrix. The nearest neighbor fraction and multiplier were set to h 1 /2 and 2h 2 , respectively, during the auxiliary data generation phase of compound estimation. Details about compound estimation appear in Charnigo and others (2013). To make this manuscript self-contained, we mention that the basic idea of compound estimation is to: (i) acquire local polynomials at each of numerous points in X using the locfit package in R (Loader, 2013) and (ii) calculate a weighted average of these polynomials, the weights varying over X . The parameter h 1 is analogous to a bandwidth in kernel smoothing and governs the local polynomial acquisition, while the parameter h 2 controls how rapidly the weights vary over X . Small h 1 and large h 2 imply less smoothing, while large h 1 and small h 2 imply more. The approaches to tuning parameter selection were as follows: 1. “MGC p ” Minimize sum of MGC p criterion for recovery of (∂/∂ x1 )μ(x) (wi = 1 for 531 observations) plus MGC p criterion for recovery of (∂/∂ x2 )μ(x) (wi = 1 for 534 observations), with k1 = 5 and u = 1/2. 2. “C p ” Minimize C p criterion (Mallows, 1973). 3. “iAIC0 ” Minimize improved AIC criterion (Hurvich and others, 1998). 4. “GCV0 ” Minimize GCV criterion (Craven and Wahba, 1979). 5. “iAIC1 ” Define an altered version of the improved AIC criterion by using FYi , F, and Sh in place ˆ Minimize sum of altered improved AIC of Yi , the identity matrix, and the matrix that maps Y to µ.

Downloaded from http://biostatistics.oxfordjournals.org/ at Ritsumeikan University on June 7, 2015

3. CASE STUDY: LIVER FUNCTION TESTING

Multivariate Generalized C p

(b)

(c)

(d)

Fig. 1. Displayed are estimates of μ(x) for the case study in Section 3, corresponding to different choices of tuning parameters in compound estimation. The first tuning parameter (in parentheses, above each panel) is h 1 , the nearest neighbor fraction for the acquisition of local polynomials; see Section 3.1. The second tuning parameter is h 2 , the multiplier for the convolution matrix. Superimposed in small “∗” symbols are the covariate values.

criterion for recovery of (∂/∂ x1 )μ(x) plus that for recovery of (∂/∂ x2 )μ(x), with k1 , u, and w as defined for approach MGC p . 6. “GCV1 ” Define an altered version of the GCV criterion, as above for improved AIC. 3.2 Findings The six tuning parameter selection approaches yielded four distinct combinations of h 1 and h 2 : (0.20, 50) from C p and GCV0 ; (0.40, 10) from iAIC0 ; (0.50, 10) from iAIC1 ; and (0.50, 50) from GCV1 and MGC p . The corresponding estimates of μ(x), (∂/∂ x1 )μ(x), and (∂/∂ x2 )μ(x) are displayed via contour plotting in Figures 1–3. All three estimates of μ(x) in Figure 1 exhibit a general trend of increase from upper left to lower right. Since the trend is not simply from left to right, these results suggest that bilirubin tends to increase not only when ALT and AST are simultaneously large but also when AST is large compared with ALT. However, the contours deviate from diagonal lines. Thus, an ordinary parametric regression analysis would miss some details. The “0.5” and “1” contours change markedly from panel a (C p , GCV0 ) to panels b (iAIC0 ), c (iAIC1 ), and d (GCV1 , MGC p ). These differences are potentially important if, for instance, one of the contours

Downloaded from http://biostatistics.oxfordjournals.org/ at Ritsumeikan University on June 7, 2015

(a)

317

318

R. CHARNIGO AND C. SRINIVASAN x1 derivative (0.40,10)

x1 derivative (0.50,10)

x1 derivative (0.50,50)

Fig. 2. Shown are estimates of (∂/∂ x1 )μ(x) for the case study in Section 3, corresponding to different choices of tuning parameters in compound estimation. The arrangement of Figure 2 is similar to that of Figure 1.

defines combinations of ALT and AST at which patients receive physician follow-up. Which estimates of μ(x) are most plausible? We can gain some insight by examining the estimates of (∂/∂ x1 )μ(x) and (∂/∂ x2 )μ(x) in Figures 2 and 3. Panel a of Figure 2 exhibits considerably more variation than panels b, c, and d. More specifically, panel a reveals a region in which (∂/∂ x1 )μ(x) is estimated to be negative. While negativity of (∂/∂ x1 )μ(x) is not impossible a priori, the presence of such a region arouses suspicion because there are no data points in the vicinity. The other panels do not have this feature. Likewise, panel a of Figure 3 exhibits more variation than panels b, c, and d. In particular, while the estimates of (∂/∂ x2 )μ(x) are predominantly negative, panel a shows two sizable regions in which (∂/∂ x2 )μ(x) is estimated to be positive. To summarize, the panel a estimates (C p , GCV0 ) appear undersmoothed, while those in panels b (iAIC0 ), c (iAIC1 ), and d (GCV1 , MGC p ) are more plausible. A choice among panels b, c, and d is less clear. Indeed, one may reasonably assert that iAIC0 , iAIC1 , GCV1 , and MGC p all performed acceptably in this case study. That C p and GCV0 lead to undersmoothing is not surprising since these approaches are geared toward estimation of μ(x), while recovering derivatives may require more smoothing. The relatively good performance of iAIC0 is consistent with the findings of Charnigo and others (2011) as well as the simulation study results in Section 4. Even so, iAIC0 is geared toward recovery of μ(x), and there may be situations in which a good tuning parameter for the mean response is not good for a derivative (cf. H¨ardle, 1990,

Downloaded from http://biostatistics.oxfordjournals.org/ at Ritsumeikan University on June 7, 2015

x1 derivative (0.20,50)

Multivariate Generalized C p x2 derivative (0.20,50)

x2 derivative (0.40,10)

x2 derivative (0.50,50)

Fig. 3. Shown are estimates of (∂/∂ x2 )μ(x) for the case study in Section 3, corresponding to different choices of tuning parameters in compound estimation. The arrangement of Figure 3 is similar to that of Figure 1.

p. 163). There is no existing theory for GCV1 and iAIC1 to predict their behavior, and we do not anticipate that theoretical results such as those in Section 5 for MGC p will carry over to GCV1 and iAIC1 . Considering also their mixed performances in the simulation studies, these criteria do not supplant MGC p . That said, a data analyst may wish to employ multiple criteria simultaneously, keeping in mind Loader’s (1999, p. 33) suggestion that considering near-minimizers of selection criteria may be informative. For further insight in this case study, we re-scaled the 25 values for each criterion (corresponding to 25 combinations of h 1 and h 2 ) from 0 to 1 and identified combinations of h 1 and h 2 for which the re-scaled values were 0 such that un 2α and kr n −βr converge to positive constants as n → ∞ and 1 < βr + α < 1 + α D for r ∈ {1, . . . , |D|}. j 1/D n −α+(α−1)/D  x p,a( j,i, p) − x p,b( j,i, p)  2. There exist C1 , C2 > 0 such that C1 1/D −α+(α−1)/D n for p ∈ {1, . . . , D}, j ∈ {1, . . . , max{k1 , . . . , k|D| )}}, and i ∈ S. C2 j The first assumption implies that the “bowtie” in the bottom panel of supplementary material available at Biostatistics online, Figure S1 collapses to a horizontal line as n → ∞. The second assumption indicates that the observed covariate values do not deviate too much from the realization of a D-dimensional uniform sample. PROPOSITION 5.1 Under model (1.1) and Assumptions 1 and 2, there exists C3 > 0 such that  |D|

sup E[(FYi − E[FYi ])2 ]  C3 n 2|D|α−2|D|(α−1)/D−β1 −

r =1

i∈S

and

sup |E[FYi ] − Dμ(xi )| = C3 n −α+(maxr ∈{1,...,|D|} βr +α−1)/D . i∈S

2βr /D

Downloaded from http://biostatistics.oxfordjournals.org/ at Ritsumeikan University on June 7, 2015

In supplementary material available at Biostatistics online, Tables S1 and S2, MGC p handles most scenarios well. The main exception is with the second mean response function at low noise (D = 2). Supplementary material available at Biostatistics online, Tables S3 through S6 explores the performance of MGC p with different choices of k1 and u, for the same 96 scenarios considered in Table 1 through supplementary material available at Biostatistics online, Table S2. In some cases (e.g. when D = 3 and splines are used for estimation), the results are excellent regardless of k1 and u. In other cases, the results are sensitive to k1 , with k1 = 5 generally better than k1 = 10 or k1 = 15. The results are not as sensitive to u, but u = 1/2 appears preferable to u = 1/3 or u = 1/4. Averaging over all 96 scenarios, putting k1 = 5 and u = 1/2 yields a 32% depletion of the effective sample size directly contributing to MGC p (cf. Remark 3). In contrast, putting k1 = 15 and u = 1/4 yields an average 70% depletion. Thus, excessively large depletions appear to adversely affect performance of MGC p . Supplementary material available at Biostatistics online, Tables S7 through S10 displays the medians (over the 100 data sets) of the aggregated squared bias and variance associated with the 2 n 2  tuning parameter(s) selected by each criterion, j=1 i=1 (E[(∂/∂ x j )μh (xi )] − (∂/∂ x j )μ(xi )) and 2 n  ˆ j=1 i=1 V[(∂/∂ x j )μh (xi )], evaluated at h = h. Although there is no single pattern that describes all 96 scenarios, the primary impression from these results is that, whenever there are substantial differences among criteria, MGC p yields estimators that are smoother (i.e. with greater squared bias and lesser variance) than estimators obtained from its competitors. This is understandable since derivative estimation, toward which MGC p is geared, may warrant more smoothing than mean response estimation. Finally, we mention that the computation time for MGC p is driven mostly by the acquisition of Sh at h ∈ H. For example, obtaining Sh for spline estimation in Table 1 took ∼3 min 35 s. In contrast, acquiring F took ∼16 s. Once Sh and F were available, the matrix-vector multiplications to calculate MGC p were virtually instantaneous.

Multivariate Generalized C p

323

The proof of Proposition 5.1 appears in supplementary material available at Biostatistics online, Appendix. n 2 n ri − i=1 2ri bi,h is asymptotiNext we describe the circumstances under which the “remainder” i=1 cally negligible. Recall from Section 2 that the remainder is the difference between E[MGC p,h (Y, σ 2 )] and h − Dµ2 ]. If the remainder is small, then minimizing the former quantity should be like minimizE[Dµ ing the latter. In what follows, we consider a more general formulation of MGC p that allows some observations to be removed from consideration, as in (2.4). Let I denote an indicator function that equals 1 when its argument is true (0 otherwise), and recall the definitions of f 0 (h) and f 1 (h) from (2.5). To Assumptions 1 and 2 above, we add:

The third assumption asserts a lower bound on the bias in estimating Dμ(xi ) for some non-negligible fraction of the observations, uniformly over h ∈ H. Unless the mean response function is a polynomial, this assumption will be met by choosing γ1 slightly larger than (J + 1 − |D|)/(2J + 2 + D) and letting H vary with n such that any h ∈ H permits n (J +1−|D|)/(2J +2+D) -consistent estimation of Dμ(xi ) away from the boundary of X (cf. Stone, 1980). The fourth assumption identifies an upper bound for the bias in estimating Dμ(xi ) for all observations not near the boundary of X , uniformly over h ∈ H. The fourth assumption also requires that an empirical partial derivative have smaller bias asymptotically than  the fitted partial derivative Dμ h (xi ). This assumption will be met by choosing γ2 slightly smaller than (J + 1 − |D|)/(2J + 2 + D), α somewhat larger than this, and β1 , . . . , β|D| not too large. PROPOSITION 5.2 Under model (1.1) and Assumptions 1 through 4, f 0 (arg minh∈H f 1 (h)) → 1 as n → ∞. minh∈H f 0 (h) The proof of Proposition 5.2 appears in supplementary material available at Biostatistics online, Appendix.

6. CONCLUSIONS Bilirubin levels increase when both ALT and AST become large, but there is some asymmetry in the relationship. More specifically, large AST and moderate ALT seem to portend higher bilirubin levels than large ALT and moderate AST. One’s perception of nonlinearity in the relationship is influenced by the tuning parameters used in the nonparametric regression analysis. The tuning parameters chosen by our new MGC p criterion yield plausible estimates of the derivatives of the mean response function. In contrast, the tuning parameters selected by some competing criteria yield implausible estimated derivatives, and so the corresponding estimated mean response functions are also suspect. A plausible estimated mean response function could augment the informational value of ALT and AST measurements extracted from a fingerstick blood specimen. For instance, a decision on whether to classify the specimen as “negative” or “positive” for abnormal liver function could be based on the ALT and AST measurements as well as the

Downloaded from http://biostatistics.oxfordjournals.org/ at Ritsumeikan University on June 7, 2015

3. Let w = (w1 , . . . , wn ) be a vector,  not identically zero, such that wi ∈ {0, 1} for all i ∈ {1, . . . , n}. n I[ {wi = 1} ∩ {inf h∈H | bi,h |  n −γ1 } ]  A whenever n is There exist A, γ1 > 0 such that n −1 i=1 sufficiently large. 4. There exists γ2 ∈ (0, α − (maxr ∈{1,...,|D|} βr + α − 1)/D) such that I[wi = 1] suph∈H | bi,h |  n −γ2 for all i ∈ {1, . . . , n} whenever n is sufficiently large. Moreover, 2γ1 − γ2 − α + (maxr ∈{1,...,|D|} βr + α − 1)/D < 0.

R. CHARNIGO AND C. SRINIVASAN

324

SUPPLEMENTARY MATERIAL Supplementary Material is available at http://biostatistics.oxfordjournals.org.

ACKNOWLEDGMENTS This material is based upon work supported by the National Science Foundation under Grant no. DMS0706857 and the Army Research Office under Grant no. W911NF-12-1-0422. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the Army Research Office. We thank the editor, associate editor, and two referees for helpful comments. Conflict of Interest: None declared.

REFERENCES BACHE, K. AND LICHMAN, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. CHARNIGO, R., FENG, L. AND SRINIVASAN, C. (2013). Nonparametric and semiparametric compound estimation in multiple covariates (submitted for publication). CHARNIGO, R., FRANCOEUR, M., KENKEL, P., MENGUC, M. P., HALL, B. AND SRINIVASAN, C. (2012). Credible intervals for nanoparticle characteristics. Journal of Quantitative Spectroscopy and Radiative Transfer 113, 182–193. CHARNIGO, R., HALL, B. metrics 53, 238–253.

AND

SRINIVASAN, C. (2011). A generalized C p criterion for derivative estimation. Techno-

CHARNIGO, R. AND SRINIVASAN, C. (2011). Self-consistent estimation of mean response functions and their derivatives. Canadian Journal of Statistics 39, 280–299.

Downloaded from http://biostatistics.oxfordjournals.org/ at Ritsumeikan University on June 7, 2015

corresponding prediction of bilirubin, rather than being based (solely) on the ALT and AST measurements exceeding certain multiples of their upper limits of normal. The statistical innovation of MGC p lies in its identification of tuning parameters appropriate for derivative estimation, especially when there are multiple covariates whose values are irregularly spaced. Traditional criteria for tuning parameter selection emphasize estimation of the mean response function and may yield undersmoothed derivative estimates with implausible features. The favorable performances of MGC p in the case study and simulation studies, along with the supporting theory, make MGC p a suitable choice for tuning parameter selection when recovering derivatives is important. Moreover, as illustrated in the case study, a data analyst may employ MGC p in conjunction with other criteria if he/she does not wish to rely on it exclusively. If D is large, then estimating a mean response function and its derivatives nonparametrically may be difficult, regardless of whether MGC p is used, due to the curse of dimensionality. To address this difficulty, one may wish to consider either a semiparametric regression model with a nonparametric component of lower dimension D  < D or an additive nonparametric regression model in which the covariates do not interact. The former would express μ(x) in the form ν(x1 , . . . , x D ) + β D +1 x D +1 + · · · + β D x D , while the latter would represent μ(x) as f 1 (x1 ) + f 2 (x2 ) + · · · + f D (x D ). Then MGC p could be incorporated into an iterative backfitting procedure for estimating the nonparametric component(s). Supplementary material available at Biostatistics online, Appendix provides pseudo-code illustrating how MGC p could be used with an additive model.

Multivariate Generalized C p

325

CLEVELAND, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74, 829–836. CRAVEN, P. AND WAHBA, G. (1979). Smoothing noisy data with spline functions. Numerische Mathematik 31, 377–403. GASSER, T. AND MU¨ LLER, H. G. (1984). Estimating regression functions and their derivatives by the kernel method. Scandanavian Journal of Statistics 11, 171–185. HA¨ RDLE, W. (1990). Applied Nonparametric Regression. Cambridge: Cambridge University Press.

LOADER, C. (1999). Local Regression and Likelihood. New York: Springer. LOADER, C. (2013). locfit: Local Regression, Likelihood and Density Estimation. R package version 1.5-9.1. http://CRAN.R-project.org/package=locfit. MALLOWS, C. L. (1973). Some comments on C p . Technometrics 15, 661–675. MU¨ LLER, H.-G., STADTM U¨ LLER, U. AND SCHMITT, T. (1987). Bandwidth choice and confidence intervals for derivatives of noisy data. Biometrika 74, 743–749. POLLOCK, N., ROLLAND, J., KUMAR, S., BEATTIE, P., JAIN, S., NOUBARY, F., WONG, V., POHLMANN, R., RYAN, U. AND W HITESIDES , G. (2012). A paper-based multiplexed transaminase test for low-cost, point-of-care liver function testing. Science Translational Medicine 4, 152ra129. RACINE, J. AND NIE, Z. (2014). crs: Categorical Regression Splines. R package version 0.15-22. http://CRAN-R. project.org/package=crs. RAMANA, B., BABU, M. AND VENKATESWARLU, N. (2011). A critical study of selected classification algorithms for liver disease diagnosis. International Journal of Database Management Systems 3, 101–114. RAMSAY, J. O. AND SILVERMAN, B. W. (2002). Applied Functional Data Analysis. New York: Springer. RUPPERT, D., WAND, M. P. AND CARROLL, R. J. (2003). Semiparametric Regression. Cambridge: Cambridge University Press. SILVERMAN, B. W. (1984). Spline smoothing: the equivalent variable kernel method. Annals of Statistics 12, 898–916. STONE, C. J. (1980). Optimal rates of convergence for nonparametric estimators. Annals of Statistics 8, 1348–1360. WAHBA, G. (1990). Spline Models for Observational Data. Philadelphia: Society for Industrial and Applied Mathematics. WAHBA, G. AND WANG, Y. (1990). When is the optimal regularization parameter insensitive to the choice of the loss function? Communications in Statistics 19, 1685–1700. [Received September 4, 2013; revised July 28, 2014; accepted for publication July 30, 2014]

Downloaded from http://biostatistics.oxfordjournals.org/ at Ritsumeikan University on June 7, 2015

HURVICH, C. M., SIMONOFF J. S. AND TSAI, C.-L. (1998). Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. Journal of the Royal Statistical Society B 60, 271–293.

A multivariate generalized Cp and surface estimation.

We propose a new multivariate generalized Cp (MGCp) criterion for tuning parameter selection in nonparametric regression, applicable when there are mu...
606KB Sizes 2 Downloads 7 Views