c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 1 1 3 ( 2 0 1 4 ) 557–568

journal homepage: www.intl.elsevierhealth.com/journals/cmpb

ORTH: R and SAS software for regression models of correlated binary data based on orthogonalized residuals and alternating logistic regressions Kunthel By a,∗ , Bahjat F. Qaqish a , John S. Preisser a , Jamie Perin b , Richard C. Zink c a

Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, NC, USA b Department of International Health, Bloomberg School of Public Health, Johns Hopkins University, MD, USA c JMP Life Sciences, SAS Institute Inc., NC, USA

a r t i c l e

i n f o

a b s t r a c t

Article history:

This article describes a new software for modeling correlated binary data based on orthog-

Received 12 August 2012

onalized residuals, a recently developed estimating equations approach that includes, as a

Received in revised form

special case, alternating logistic regressions. The software is flexible with respect to fitting in

28 August 2013

that the user can choose estimating equations for association models based on alternating

Accepted 24 September 2013

logistic regressions or orthogonalized residuals, the latter choice providing a non-diagonal working covariance matrix for second moment parameters providing potentially greater effi-

Keywords:

ciency. Regression diagnostics based on this method are also implemented in the software.

Association models

The mathematical background is briefly reviewed and the software is applied to medical

Estimating equations

data sets.

Logistic regression

Published by Elsevier Ireland Ltd.

Permutation invariance Regression diagnostics

1.

Introduction

Statistical methods for the regression analysis of correlated binary data have been around for more than four decades [1]. These methods include likelihood-based methods such as generalized linear mixed models [2], marginalized likelihood [3], and semiparametric-based methods such as generalized estimating equation s (GEE)[4]. The appropriateness of these methods with respect to parameter interpretation have been discussed elsewhere [5]. We discuss software related to a new extension of GEE [6] for the analysis of correlated binary data applicable when the marginal mean and marginal withincluster association structure is of interest.



GEE provides a general approach for analyzing correlated data that surpasses the restricted capabilities of earlier methods, including weighted least squares [7]. For the situation where the association is not of interest, first-order GEE, also known as GEE1, provides a computationally fast approach for fitting marginal mean models under an assumed correlation structure. If the assumed correlation structure is incorrect, estimators for the regression parameters in the marginal mean model still maintain consistency although some efficiency may be lost. In general, however, the modeling of association parameters using GEE may or may not lead to an underlying multivariate Bernoulli distribution. In some cases, GEE asymptotic theory may break down when the correlation structure is misspecified [8,9]. Even when it is correctly

Corresponding author at: 10903 New Hampshire Avenue, Building 21, Room 4652, Silver Spring, MD 20993, USA. Tel.: +1 301 796 6729. E-mail address: [email protected] (K. By). 0169-2607/$ – see front matter. Published by Elsevier Ireland Ltd. http://dx.doi.org/10.1016/j.cmpb.2013.09.017

558

c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 1 1 3 ( 2 0 1 4 ) 557–568

specified, correlations for binary data are subject to constraints imposed on them by the marginal means. Practical strategies have been proposed to mitigate concerns due to infeasible correlation parameter estimates [10]. This article considers a GEE approach that allows careful modeling of the within-cluster association structure based on pairwise odds ratios. Since the odds ratio is not bounded by the marginal means, many concerns relating to infeasible estimates are ameliorated [11]. Several extensions of GEE allow modeling of the association structure [12,13]. The computational effort expended in these methods often restricts their usage to data structures characterized by small cluster sizes. Computational gains may be achieved by imposing extra conditions, but must be paid for by some degree of loss in efficiency [14,15]. Alternating logistic regressions (ALR) [16] addresses some of these concerns while in turn introducing some of its own complications [17]. Some of these complications were addressed by the method of orthogonalized residuals (ORTH) [6]. It is on this work that our software is based. We have written an R package and a SAS macro that fits logistic regression models for correlated binary data based on orthogonalized residuals. Our algorithm is a variation of iterative reweighted least squares. We have also incorporated cluster-diagnostics for marginal mean and association models [18,19]. For expositional convenience, when we write ORTH, we are referring to the method based on orthogonalized residuals, and when we write orth, we are referring to the software based on ORTH. To get an appreciation of what ORTH does, it is useful to briefly examine some mathematical background. This is done in the next section. Before that however, we need to introduce some notation. Let ni denote the size of the i-th cluster and let



Yi = Yi1

Yi2

···

Yini



i = 1, . . ., K

,

denote the vector of binary responses for cluster i where K is the number of clusters. The symbol i shall mean i = E[Yi ] and the symbol i shall mean

2.

Methods

2.1.

Background

We start with second-order GEE [13], denoted hereafter as GEE2. Define Wijk = Yij Yik and



Wi = Wi12

Wi13

···

Wi(ni −1)ni



.



 Let ıi = E[Wi ], Y∗i = (Y i , Wi ) , and  = (ˇ, ˛). The second-order generalized estimating equation is

 U,GEE2 =

Uˇ,GEE2



U˛,GEE2

=



 K  Di 0 i=1

Ai

Ci

˙ −1 i∗

Yi − i W i − ıi

= 0 (1)

where ˙ i∗ = cov (Y∗i ) and Ai =

∂ıi , ∂ˇ

Ci =

∂ıi , ∂˛

Di =

∂i . ∂ˇ

Subsequently, U␤,∗ denotes the estimating function for the marginal mean parameters and U␣,∗ denotes the estimating function for the marginal association parameters where a descriptor for the estimation method is substituted for ‘*’. For example, Uˇ,GEE2 and U˛,GEE2 correspond to the GEE2 estimating functions for ˇ and ˛ respectively. Note that ˙ i* involves 3rd and 4th order moments. Even with restrictions on these higher order moments, the amount of computation can still be prohibitive for large cluster sizes. Furthermore, misspecification of ˙ i* may lead to biased estimates of ˇ, even if the marginal mean model g1 (i ) is correctly specified. The reason for this is that Uˇ,GEE2 is a weighted sum of Yi − i and Wi − ıi where the weights depend upon correctly specified components of ˙ ∗i . Setting Ai = 0 and cov (Yi , Wi ) = 0 in expression (1) enables unbiased estimation of ˇ even if the model involving ˛ is misspecified [15]. The resulting estimating equation for ˇ is

Uˇ,GEE1 =

K 

−1 D i Vi (Yi − i ) = 0

(2)

i=1 i

=[

i12 ,

i13 ,

. . .,

i(ni −1)ni ]



,

where ijk denotes the pairwise odds ratio between the j-th and k-th responses in cluster i [16]. The symbol ˇ, of dimension p × 1, is the parameter vector associated with the marginal mean model. The symbol ˛, of dimension q × 1, is the parameter vector associated with the marginal association model. For link functions g1 and g2 , the marginal models are written as Mean Model : Association Model :

g1 (i )

= Xi ˇ

g2 (

= Zi ˛

i)

where Xi and Zi are design matrices   of dimensions ni × p and ni mi × q respectively with mi = . 2

where √ √ Vi = diag { ijj }RiYY (˛) diag { ijj }, ijk = cov (Yij , Yik ), and RiYY (˛) is a working correlation matrix for Y. Additionally specifying a working diagonal covariance matrix for cov (Wi ) gives

U˛ =

K 

C i [diag (var [Wi ])]

−1

(Wi − ıi ) .

(3)

i=1

Capitalizing on the diagonal structure of the working covariance matrix, expression (3) permits fast computations for large cluster sizes but potentially sacrificing efficiency. A similar procedure had been proposed earlier for fitting linear models to correlations among binary data [14].

559

c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 1 1 3 ( 2 0 1 4 ) 557–568

ALR takes a somewhat different approach. While keeping the estimating equation of GEE1, ALR’s estimating equation for ˛ is based on conditional residuals which are defined by Yij − ijk

√ √ Pi = diag ( vi )R∗iQQ () diag ( vi )

where, for j > k, ijk = E[Yij |Yik ] = ij +

ijk ikk

K  ∂  i

∂˛

−1 C i Pi Qi = 0

(8)

i=1

S−1 (Yi −  i ) = 0 i

(4)

where Si = diag [ i (1 −  i )], and the solution ˛ is invariant to permutations of the elements in Yi . In the above presentation, the expression for Wi is a function of Yi and hence necessarily correlates with Yi . In general, the actual covariance matrix of Wi is not diagonal. Specifying a diagonal matrix when in fact it is not may lead to efficiency loss [6]. ORTH, to some extent, addresses this issue but does so in a way that avoids some of the complications seen in expression (4). Because Si is a function of Yi , it is stochastic. Hence, it is not a covariance matrix in the usual sense. As such, it is not clear how one would go about introducing a non-diagonal Si if the goal is to improve efficiency. A further complication caused by a stochastic Si is that standard estimating equation theory cannot be applied to study the properties of U˛,ALR [6]. Lastly, the robust covariance estimator under alternating logistic regressions is not invariant to permutations of Yi . By casting the problem in terms of orthogonalized residuals, ORTH resolves each of these complications.

Orthogonalized residuals

The idea that led to the development of ORTH was to (1) minimize the correlation between Yi and the residual of the association estimating equation, which is forced to zero in the specification of separate estimating equations for ˇ and ˛; and (2) to approximate the covariance of the aforementioned residual with a non-diagonal matrix. This is accomplished as follows. For the i-th cluster, define elements of the mi × 1 vector of orthogonalized residuals Qi = [Qi12 , Qi13 , . . ., Qi(ni −1)ni ] by Qijk = Wijk − [ijk + bijk:j (Yij − ij ) + bijk:k (Yik − ik )]

(5)

where bijk:j

= ijk (1 − ik )(ik − ijk )/dijk ,

bijk:k

= ijk (1 − ij )(ij − ijk )/dijk ,

dijk

K 

U˛,ORTH =

where

i=1

2.2.

(7)

where vi = {vijk }. Putting all of these together, ORTH’s estimating equation for ˛ is

(Yik − ik ) .

Based on  i = [i12 , i13 , . . ., i(ni −1)ni , ] , ALR’s estimating equation for ˛ is defined as

U˛,ALR =

where Imi is an mi × mi identity matrix, 1 is an mi × 1 vector of ones and  is the exchangeable correlation parameter to be estimated. Letting vijk denote the variance of Qijk , we approximate cov (Qi ) by

Next, RiQQ = corr (Qi ) is approximated by an exchangeable working correlation matrix R∗iQQ := 11 + (1 − )Imi

(6)

∂Q i ∂˛

 ,

while that for ˇ is the same as (2). The correlation parameter  is estimated by a moment estimator:

ˆ () =

K 1

M i=1

M=

K 

⎡⎛

⎞2 ⎤ 2  Qijk  Qijk ⎣⎝ ⎦;  ⎠ − jChi) 3.684099 1.024464 12.93209 0.0003229965 distance

SAS Code Below is an example of SAS code used to analyze the arthritis data. Depending on the reader’s experience, he or she may prefer to create the data set for the Z matrix in another way. In fact, PROC IML may be used to create the Z data set in exactly the same manner as the R code. filename INF “APSTAT.DAT”; **; ** Reads in ASCII file containing the data **; ** Note y j = 0 denotes good self assessment and **; ** y j = 1 denotes bad self assessment. ** Gender: 1 if male **; : in years **; ** age ** treat : treatment (1 if auranofin, 0 if placebo)**; data temp; infile inf; input gender age trt y1-y5; patient = n ; /* Create patient ID */ run; %let y = y; %let x = int time gender age treat; %let z = distance;

c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 1 1 3 ( 2 0 1 4 ) 557–568

data XY (keep=patient &y &x) Z (keep = patient &z); set temp; int = 1; /* intercept */ array y [*] y1-y5; array t [*] t1-t5 (0 1 5 9 13); /* Creates the design matrix for the mean model */ /* as well as the response. */ do j = 1 to 5; y = y [j]; if (y ∧ =.) then do y = ∧ y; * model prob(good); time = t [j]; treat = trt * (j> 2); output XY; end; end; /* Creates the design matrix for the association model. */ do j = 1 to 4; if (y [j] ∧ =.) then do k = (j+1) to 5; if (y [k] ∧ =.) then do; distance = 1 / abs(t [j]-t [k]); output Z; end; end; end; run; ** Create weight matrix **; ** Each cluster has weight 1 **; data weight (keep=w); set temp; w=1; run; ** Invokes the macro **; %include “ORTH.macro”; ** Performs ALR using ORTH **; %ORTH(xydata=xy, yvar=&y, xvar=&x, id=patient, zdata=z, zvar=&z, wdata=weight, wvar=w, maxiter=20, epsilon=0.0000001, estlamb = NO, CLSOUT=clsout, OBSOUT=obsout, monitor=no, IBETA=, IALPHA=); quit; /* The following is partial output printed by the ORTH macro */ ********** RESULTS ********** Number of Clusters: Maximum Cluster Size: Minimum Cluster Size: Number of Iterations: Outcome Variable: y Note: Robust Standard Errors are Presented LAMBDA Note: Lambda is fixed at 0. Marginal Mean Parameter Estimates PARM STDERR CHISQ VAR 1.0933862 1.3780824 0.6295015 INT −0.027173 0.0296475 0.8400149 TIME GENDER 0.5956477 0.4784813 1.549705 −0.015363 0.0246821 0.3874335 AGE TREAT 1.4572118 0.450921 10.44346 VAR DISTANCE

PARM 3.6840989

Marginal Odds Ratio Parameter Estimates STDERR CHISQ 1.0244654 12.932068

567

51 5 2 9

PVALUE 0.4275382 0.3593925 0.213179 0.5336516 0.0012308 PVALUE 0.000323

568

c o m p u t e r m e t h o d s a n d p r o g r a m s i n b i o m e d i c i n e 1 1 3 ( 2 0 1 4 ) 557–568

references

[1] J.F. Pendergast, J.S. Gange, M.A. Newton, M.J. Lindstrom, M. Palta, M.R. Fisher, A survey of methods for analyzing clustered binary response data, International Statistical Review 64 (1996) 89–118. [2] P.J. Diggle, P. Heagerty, K.Y. Liang, S.L. Zeger, Analysis of Longitudinal Data, 2nd ed., Oxford University Press, Oxford, 2002. [3] P. Heagerty, S.L. Zeger, Marginalized multilevel models and likelihood inference, Statistical Science 15 (2000) 1–26. [4] K.Y. Liang, S.L. Zeger, Longitudinal data analysis using generalized linear models, Biometrika 73 (1986) 13–22. [5] B.A. Reboussin, J.S. Preisser, E.S. Song, M. Wolfson, Sample size estimation for alternating logistic regressions analysis of multilevel randomized community trials of under-age drinking, Journal of the Royal Statisticall Society (Series A) 175 (2012) 691–712. [6] B.F. Qaqish, R.C. Zink, J.S. Preisser, Orthogonalized residuals for estimation of marginally specified association parameters in multivariate binary data, Scandinavian Journal of Statistics 39 (2012) 515–527. [7] G.G. Koch, J.R. Landis, J.L. Freeman, D.H. Freeman, R.G. Lehnen, A general methodology for the analysis of experiments with repeated measurement of categorical data, Biometrics 33 (1977) 133–158. [8] M. Crowder, On the use of a working correlation matrix in using generalized linear models for repeated measures, Biometrika 82 (1995) 407–410. [9] N.R. Chaganty, H. Joe, Efficiency of generalized estimating equations for binary responses, Journal of the Royal Statistical Society (Series B) 66 (2004) 851–860. [10] J. Shults, Discussion of “Generalized estimating equations: Notes on the choice of the working correlation matrix” – continued, Methods of Information in Medicine 50 (2011) 96–99. [11] N.R. Chaganty, H. Joe, Range of correlation matrices for dependent Bernoulli random variables, Biometrika 93 (2006) 197–206. [12] L.P. Zhao, R.L. Prentice, Correlated binary regression using a quadratic exponential model, Biometrika 77 (1990) 642–648. [13] K. Liang, S. Zeger, B. Qaqish, Multivariate regression analysis for categorical data, Journal of the Royal Statistical Society, B 54 (1992) 3–40. [14] R. Prentice, Correlated binary regression with covariates specific to each observation, Biometrics 44 (1988) 1033–1048.

[15] S. Lipsitz, N. Laird, D. Harrington, Generalized estimating equations for correlated binary data: using the odds ratio as a measure of association, Biometrika 78 (1991) 153–160. [16] V. Carey, S. Zeger, P. Diggle, Modelling multivariate binary data with alternating logistic regression, Biometrika 80 (1993) 517–526. [17] A. Kuk, Permutation invariance of alternating logistic regressions for multivariate binary data, Biometrika 91 (2004) 758–761. [18] J. Preisser, B. Qaqish, Deletion diagnostics for generalized estimating equations, Biometrika 83 (1996) 551–562. [19] J.S. Preisser, K. By, J. Perin, B.F. Qaqish, Deletion diagnostics for alternating logistic regressions, Biometrical Journal (2012) 701–715. [20] J.S. Preisser, J. Perin, Deletion diagnostics for marginal mean and correlation model parameters in estimating equations, Statistics and Computing 17 (2007) 381–393. [21] K.E. Muller, M.C. Mok, The distribution of Cook’s D statistic, Communications in Statistics – Theory and Methods 26 (1997) 525–546. [22] J.A. Diaz-Garcia, G. Gonzalez-Farias, A note on the Cook’s distance, Journal of Statistical Planning and Inference 120 (2004) 119–136. [23] M. Vens, A. Ziegler, Generalized estimating equations and regression diagnostics for longitudinal controlled clinical trials: a case study, Computational Statistics and Data Analysis 56 (2012) 1232–1242. [24] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, http://www.R-project.org, ISBN 3-900051-07-0, 2008. [25] SAS Institute Inc., SAS/STAT 9.1 User’s Guide, SAS Institute Inc., Carey, NC, 2012. [26] K. By, B.F. Qaqish, J.S. Preisser, orth: Multivariate Logistic Regression Using Orthogonalized Residuals, R package version 1.5.1, 2011. [27] G. Fitzmaurice, S. Lipsitz, A model for binary time series data with serial odds ratio patterns, Applied Statistics 44 (1995) 51–61. [28] A. Ekholm, J. McDonald, P. Smith, Association models for a multivariate binary response, Biometrics 56 (2000) 712–718. [29] S.R. Lipsitz, G.M. Fitzmaurice, Estimating equations for measures of association between repeated binary responses, Biometrics 52 (1996) 903–912. [30] R. Zink, Correlated binary regression using orthogonalized residuals, University of North Carolina, Chapel Hill, 2003, Ph.D. thesis.

ORTH: R and SAS software for regression models of correlated binary data based on orthogonalized residuals and alternating logistic regressions.

This article describes a new software for modeling correlated binary data based on orthogonalized residuals, a recently developed estimating equations...
670KB Sizes 0 Downloads 0 Views