STATISTICS IN MEDICINE, VOL. 11, 2001-2016 (1992)

GRAPHS AND STOCHASTIC RELAXATION FOR HIERARCHICAL BAYES MODELLING

NICHOLAS LANGE
Department of Community Health, Division of Biology and Medicine, Brown University, Providence, RI 02912, U.S.A.

SUMMARY

This expository paper describes two useful tools for the statistical analysis of processes that generate repeated measures and longitudinal data. The first tool is a graph for a visual description of dependency structures. The second tool is a stochastic relaxation method ('Gibbs sampling') for fitting hierarchical Bayes models. Graphs are concise and accessible summaries of stochastic models. Graphs aid communication between statistical and subject-matter scientists, during which formulations of scientific questions are modified. An uncluttered picture of the dependency structure of a model effectively augments its corresponding formulaic description. Stochastic relaxation is a computationally intense method that allows experimentation with broader classes of models than were previously thought feasible because of analytic intractability. Stochastic relaxation is intuitive and easily described to non-statisticians. Several sample graphs show how hierarchical Bayes models can use stochastic relaxation to obtain their fits. An example based on estimating drug shelf-life demonstrates some uses of graphs and stochastic relaxation compared with several frequentist growth curve analyses that use restricted maximum likelihood and generalized estimating equations approaches.

1. MOTIVATION

Graphical representations of dependency structures in statistical models and dynamic Monte Carlo methods for obtaining fits of models to data are proving to be very useful tools for a wide variety of scientific investigations. This paper intends to give a somewhat informal and intuitive introduction to these tools’ basic ideas. Many references to the rapidly growing literature in these areas are provided. A concrete example of their use with data from an experiment to determine expected shelf-life of a pharmaceutical, an interesting non-trivial problem, helps to focus these ideas by applying them to a standard balanced and complete linear growth curve model. Results obtained from several different frequentist modelling and estimation approaches serve as benchmarks by which to assess the performance of these new tools. Although it can of course be argued quite convincingly that graphs and stochastic relaxation need not be used at all for the chosen example, interpretation of their performance in this simple setting can help build intuition regarding their expected performance in modelling more complex statistical systems. The paper is organized as follows. The next section describes the use of graphical representations and stochastic relaxation methods in hierarchical Bayes modelling. Sections 3 and 4 give an application of these tools. Section 5 gives some brief history, background and other issues. The final section is a brief discussion of what has been presented.

0277-6715/92/152001-16$13.00 © 1992 by John Wiley & Sons, Ltd.


2. STOCHASTIC RELAXATION IN HIERARCHICAL BAYES DATA ANALYSIS

A link between the image processing uses of the Gibbs sampler (see Section 5.2) as derived by Geman and Geman[1] and general hierarchical Bayes modelling was perhaps first considered by Clayton,[2] although Tanner and Wong[3] gave a version for two random variables; see also Li.[4] Gelfand and Smith[5] and Gelfand et al.[6] provided rigorous statistical detail and several examples; see also Tanner.[7] The key observation was that the pixels (picture elements) of the Gemans' setup could be replaced by the parameters of a Bayesian model for almost any data-analytic problem. Prior and posterior probability distributions on pixel values become distributions on parameters. Analysis proceeds by conditioning fully on observed data and visiting each of the parameters in turn to update their conditional distributions given the others, cycling until an estimate of the full joint posterior distribution of the parameters is obtained. Assessment of convergence is a difficult problem; see Section 5 for a very brief discussion of issues.

The Gemans' use of the Gibbs sampler was for high-dimensional problems with many local 'horizontal' cliques of pixels all at the same level (see Section 5.2). The full force of stochastic relaxation as applied in Bayesian data analysis is that a multilevel 'vertical' hierarchy of parameter distributions, each involved in many fewer cliques, can be updated using the Gibbs sampler. The dimensionality of the problems encountered in Bayesian data analysis is often much lower than that of image processing problems. An essential difference between the two is that the structures of Bayesian data analysis problems are often dominated by their vertical rather than their horizontal aspects.
Gelfand and Smith[5] and Casella and George[8] pointed out that uses of stochastic relaxation in hierarchical Bayesian data analysis often focus on particular features of marginal distributions of parameters of interest, for instance on posterior means and variances, rather than on the entire joint posterior distribution per se. They emphasized that a major attractive feature of the stochastic relaxation approach is that it enables one to obtain desired functionals of marginal posterior distributions, often very difficult or impossible to obtain by classical analytic methods, without actually calculating these distributions. One obtains a desired feature of a marginal posterior distribution indirectly by first obtaining the set of conditional posterior distributions which together determine the joint posterior distribution of all parameters given the data, and then integrating over all of these parameters except the parameter of interest. The posterior conditional distributions involved are often difficult to marginalize analytically, requiring intractable high-dimensional integration, yet easy to evaluate and update by dynamic Monte Carlo methods. Once convergence has been reached, a particular feature of a posterior marginal distribution of a chosen parameter is calculated indirectly through sampling from its estimated posterior conditional distribution given the others in the hierarchical ensemble.

The Monte Carlo draws described in the previous paragraph may be combined by simple averaging to form a mixture density estimate using Rao-Blackwellization, taking advantage of the structure of the model to reduce mean squared error by conditioning. As an example of this variance reduction technique, suppose that one is interested in the marginal posterior distribution of a mean parameter under standard Gaussian assumptions (as in the next section's example). After convergence, one draws a large number of times from the posterior distributions of the mean and variance of this parameter's posterior distribution, in pairs, conditioning on all of the other parameters in the model. Under Gaussian assumptions, each such pair determines a particular posterior Gaussian distribution for the parameter of interest. A mixture density estimate of the posterior distribution, with variation much lower than that of any of the single functions, is then obtained by averaging these Gaussian functions. (Imagine plotting the Gaussian functions on a common plot, discretizing the horizontal axis by a fine grid, averaging vertically at each discrete point and then connecting the dots.) Marginal posterior means, medians, modes, variances and many other functional summary measures are readily obtained from such a mixture density estimate.
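The grid-averaging construction just described can be sketched in a few lines. Here the conditional means and variances are synthetic stand-ins for converged Gibbs draws, and all variable names are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical converged Gibbs draws of (mean, variance) pairs for one parameter
means = rng.normal(-4.0, 0.5, size=50)
variances = 1.0 / rng.gamma(3.0, 0.5, size=50)   # inverse-gamma style draws

# discretize the horizontal axis by a fine grid
grid = np.linspace(-12.0, 4.0, 801)

def gauss(x, m, v):
    """Gaussian density with mean m and variance v."""
    return np.exp(-(x - m) ** 2 / (2.0 * v)) / np.sqrt(2.0 * np.pi * v)

# average the Gaussian curves vertically at each grid point
mixture = np.mean([gauss(grid, m, v) for m, v in zip(means, variances)], axis=0)

# functionals of the marginal posterior follow directly from the estimate
post_mean = (grid * mixture).sum() / mixture.sum()
```

Because each component is itself a density, the averaged curve is (up to grid truncation) a proper density estimate, from which means, medians, modes and variances can be read off.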

2.1. Longitudinal data analysis

Adopting a hierarchical Bayes approach to characterizing processes generating repeated measures and longitudinal data is advisable when one requires entire posterior distributions of key parameters in addition to their chosen features, and also when one is not satisfied with making standard asymptotic approximations and/or adjustments and plugging in parameter estimates. Attempts to accommodate such approximations often prove to be quite complex and seem to verge toward adopting a fully Bayesian outlook; see, for instance, Lange and Ryan[9] and Hulting and Harville[10] among many others. It is not of primary interest here to convince the reader of the benefits of a hierarchical Bayes approach; Breslow[11] provided an excellent discussion of pros and cons of frequentist/Bayes methods in several biostatistical contexts. The following example is one that integrates over conditional posterior distributions of parameters rather than over unseen data; resulting expressions for conditional posterior means and variances of interest are therefore different from their frequentist counterparts. The example helps to sharpen the foregoing ideas by tying them to a concrete data-analytic problem.

3. EXAMPLE: ESTIMATING DRUG SHELF-LIFE

Chow and Shao[12] and Shao and Chow[13] reported on the use of a classical balanced and complete growth curve model in a pharmaceutical study of a drug's expected shelf-life. Repeated measures over time are the percentages of claimed potency for a small number of batches of a drug. Not accounting for batch-to-batch variability can yield variance estimates that are artificially low. Their linear growth curve model is identical to that of Rao[14] and of Lange and Laird,[15] who analysed this model under repeated sampling assumptions. Five of the batches were packaged in bottles, the other five in blister packs; although easily included, the package type indicator is ignored in the present analysis. The model's structure is also identical to that of Racine-Poon and Smith,[16] and of Gelfand et al.,[6] who, in a different context, used the Gibbs sampler in a hierarchical Bayes approach. Figure 1(a) displays the data from Shao and Chow;[13] Figure 1(b) shows ten separate ordinary least-squares regression lines fit to each of the batches and extrapolated to thirty-six months. Figure 2 gives a picture of the hierarchical nature of the problem.

3.1. Population model

Figure 3 gives a chain independence graph (see Section 5.1.2) of a hierarchical Bayes model for the drug shelf-life problem. The graph shows the model structure for a single batch only; the structure is identical across the n = 10 independent batches. Parameters are denoted by Greek characters enclosed in circles, constants by letters of the English alphabet also enclosed in circles, and both are grouped when appropriate. Outcome random variables are denoted by Y; regressors X are fixed and thus not included in the graph. Directed line segments point from levels of the hierarchy to levels below, and 'θ ← η' is read as 'θ is a function of η' or 'η affects θ'. The outcome data at level 0 are potency assay results expressed as a percentage of claimed potency. These are the Yij, j = 1, ..., t = 6; i = 1, ..., n = 10. Batches are sampled at t = 6 times: 0, 3, 6, 9, 12 and 18 months. An X matrix of regressors including only intercept and time slope is given in canonical form using an orthonormalized version of the original time scale obtained by the Gram-Schmidt

Figure 1. The drug shelf-life data: (a) ten batches measured six times over eighteen months; (b) ordinary least-squares lines fit separately to the batches and extrapolated to 36 months (time in months on both horizontal axes)

procedure, and is defined as

    X = [ 0.408  -0.552 ]
        [ 0.408  -0.345 ]
        [ 0.408  -0.138 ]          (1)
        [ 0.408   0.069 ]
        [ 0.408   0.276 ]
        [ 0.408   0.690 ]

(Strictly speaking, the second column of X can thus no longer be interpreted as a design on 'slope' effects.) Level 1 of Figure 3 gives the simple linear model for the mean function, E(Yij) = 0.408α0 + tjα1, where α = (α0, α1)^T are population intercept and slope effects and tj is the jth element of the second column of X. The individual random intercepts and individual random slopes βi = (β0i, β1i)^T capture random batch effects, and the σi² are individual batch variances. Given the α, βi and σi² from level 1, the Yij are conditionally independent and thus no line segments connect the Yij; for simplicity, these are grouped together as a block in this chain independence graph (see Section 5.1).
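The canonical design (1) can be reproduced by orthonormalizing the raw intercept and time columns for the stated sampling times; a short sketch (variable names are mine):

```python
import numpy as np

t = np.array([0.0, 3.0, 6.0, 9.0, 12.0, 18.0])   # sampling times in months
n = len(t)

# Gram-Schmidt: orthonormalized intercept first, then the time column with
# its intercept component removed (i.e. centred) and normalized
c1 = np.ones(n) / np.sqrt(n)
c2 = t - (t @ c1) * c1
c2 = c2 / np.linalg.norm(c2)
X = np.column_stack([c1, c2])
```

Rounded to three decimals this reproduces the entries of (1), and X^T X is the 2 x 2 identity, which simplifies the conditional distributions below.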

Figure 2. The hierarchical structure of the drug shelf-life problem (level 2: between-batch variation; level 1: batch effects; level 0: potency assay results, as a percentage of claimed potency)

Figure 3. A chain independence graph for the hierarchical Bayes model

3.2. Prior specification

The βi are assumed independent and identically distributed (i.i.d.) and drawn from a bivariate Gaussian distribution (N) with mean 0 and arbitrary 2 x 2 variance-covariance matrix V, that is, βi ~ N(0, V). It is because of this assumption of zero-centred batch effects that the population mean parameter α does not feed directly into βi, appearing in level 1 instead of level 2. The σi² are assumed i.i.d. from an inverse gamma (IG) distribution. The assumption of separate, batch-specific variances is more general than the common variance assumption taken by the authors cited previously in this section, yet has been considered by Braun et al.,[17] who used a modified EM algorithm (Dempster et al.[18]), and also by Hui and Berger.[19] The inverse gamma distribution is parametrized by c1 and c2, its shape and scale parameters, respectively, specifying a prior mean of 1/(c2(c1 - 1)) and a prior variance of 1/(c2²(c1 - 1)²(c1 - 2)) for c1 > 2.

Level 2 specifies prior parameters feeding into level 1. The variance-covariance matrix V for the individual batch effects is assumed to have been drawn from an inverse Wishart distribution with mean (bB)^-1 and precision b, that is, V ~ IW((bB)^-1, b). Values for all constants enclosed in circles feeding from the uppermost level of Figure 3 into levels 1 and 2 have been derived from a preliminary analysis of external data gathered from a similar experiment described by Chow and Shao.[12] Constants a and A specify the prior mean and prior variance of a bivariate Gaussian distribution for the mean effects α. These are taken as

    a = (200.4, -5.9)^T,    A = [ 10.28  2.30 ]
                                [  2.30  2.35 ],

which are the estimated mean and variance-covariance matrix obtained from this same balanced and complete growth curve model described by Lange and Laird[15] using restricted maximum likelihood (REML; Patterson and Thompson[20]), yet with a common σ² term. The prior mean and variance of the σi² in the present model are each taken as unity, implying c1 = 3 and c2 = 1/2. Prior constants

    b = 8,    B = [  2.083  -0.016 ]
                  [ -0.016   0.003 ]

have been chosen, implying vague initial precision for a variance-covariance matrix for the batch effects determined by its REML estimate from the external data, inflated by a factor of eight.
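Under this parametrization, the choices c1 = 3 and c2 = 1/2 do give unit prior mean and variance, which is easy to check numerically. The Gamma-reciprocal construction below is a standard way to draw inverse gamma variates under this convention (variable names are mine):

```python
import numpy as np

c1, c2 = 3.0, 0.5   # shape and scale, as in Section 3.2

# closed-form prior moments under the paper's parametrization
prior_mean = 1.0 / (c2 * (c1 - 1.0))
prior_var = 1.0 / (c2**2 * (c1 - 1.0) ** 2 * (c1 - 2.0))

# if tau ~ Gamma(shape=c1, scale=c2), then 1/tau ~ IG(c1, c2) in this convention
rng = np.random.default_rng(1)
sigma2 = 1.0 / rng.gamma(c1, c2, size=200_000)
```

The sample mean of the simulated σ² draws agrees with the closed-form prior mean of one.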

3.3. Complete conditional distributions

Thus the full hierarchical Bayes model has been specified. Although its graphical representation helps one understand its conditional Gaussian likelihood and prior structure, explicit detail on the conditional distributions of its parameters is required in order to use stochastic relaxation to obtain its fit. Toward this end, let xj = [1, tj] be the jth row of X at (1) and write

    Yij = xj(α + βi) + εij,    i = 1, ..., n; j = 1, ..., t,

as the simple linear growth curve model. In vector form at the individual level the model is

    Yi = X(α + βi) + εi,    i = 1, ..., n,

with

    Yi = (Yi1, ..., Yit)^T,    X = (x1^T, ..., xt^T)^T,    εi = (εi1, ..., εit)^T.

Figure 4. Computational structure for updating the full conditional distributions by dynamic Monte Carlo

Further, define 'working residuals' specific to the parameters α, βi and σi²:

    Ri^(α) = Yi - Xβi,    Ri^(β) = Yi - Xα,    Ri^(σ²) = Yi - X(α + βi).

Employing the Gaussian-Gaussian conjugacy between likelihood and prior, as in Lindley and Smith,[21] one has the following complete conditional distributions: for α,

    α | β, σ², V, Y ~ N(Dα(A^-1 a + Σi σi^-2 X^T Ri^(α)), Dα),    Dα = (A^-1 + Σi σi^-2 X^T X)^-1;    (2)

for the βi,

    βi | α, σ², V, Y ~ N(Di σi^-2 X^T Ri^(β), Di),    Di = (V^-1 + σi^-2 X^T X)^-1;    (3)

for the σi²,

    σi² | α, β, V, Y ~ IG(c1 + t/2, (1/c2 + ½ Ri^(σ²)T Ri^(σ²))^-1);    (4)

and for V^-1,

    V^-1 | β ~ W((bB + Σi βi βi^T)^-1, b + n).    (5)

Draws from a Wishart distribution are obtained by the method of Odell and Feiveson;[22] see also Ripley.[23] Equations (2) to (5) give the conditional distributions that sit on the nodes of the graph shown in Figure 4, which is a schematic diagram of the computational structure of the updating scheme used in the dynamic Monte Carlo method (see Section 5.1.1). Individual subscripts i have been suppressed. The graph has two cliques, [α, βi, σi²] and [βi, V], as seen by inspection of (2) to (5). One can visit and update these nodes in any fixed order, or even in random order. More general conditional distributions are readily derived for incomplete data models and complex hierarchical Bayes models including additional covariates, non-Gaussian distributions and additional sources of stochastic variation; see, for instance, Zeger and Karim and Lange et al.
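The Wishart draws required for the update of V^-1 can be generated by the triangular (Bartlett-type) construction sketched below, which is in the spirit of the Odell-Feiveson method though not a transcription of it; all names are mine.

```python
import numpy as np

def wishart_draw(df, scale, rng):
    """One draw from W(scale, df): if scale = L L^T and A is lower triangular
    with sqrt(chi-square) diagonal entries and standard normal entries below
    the diagonal, then (L A)(L A)^T ~ W(scale, df)."""
    p = scale.shape[0]
    L = np.linalg.cholesky(scale)
    A = np.tril(rng.normal(size=(p, p)), k=-1)
    A[np.diag_indices(p)] = np.sqrt(rng.chisquare(df - np.arange(p)))
    M = L @ A
    return M @ M.T

rng = np.random.default_rng(2)
scale = np.array([[2.0, 0.3], [0.3, 1.0]])
df = 8
mean_draw = np.mean([wishart_draw(df, scale, rng) for _ in range(20_000)], axis=0)
# the Wishart mean is df * scale, so mean_draw should be close to that
```

Averaging many draws recovers the theoretical mean df times the scale matrix, a quick sanity check on the generator.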

Compressing notation by defining Y = (Y1, ..., Yn), β = (β1, ..., βn), σ² = (σ1², ..., σn²), the conditional likelihood function (given β) is

    L(α, β, σ²; Y) = ∏i ∏j N(Yij | α, βi, σi²),    i = 1, ..., n; j = 1, ..., t,

and the joint distribution of the βi is

    ∏i N(βi | 0, V),    i = 1, ..., n.

Thus the model for the posterior distribution is written compactly as

    L(α, β, σ²; Y) N(α | a, A) ∏i N(βi | 0, V) ∏i IG(σi² | c1, c2) W(V^-1 | (bB)^-1, b).    (6)
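To make the updating cycle concrete, here is a minimal Gibbs sampler for a model of the form (6), run on synthetic data (the actual assay data appear only in the paper's figures). The hyperparameter values follow Section 3.2; the update formulas are the standard Gaussian/inverse gamma/Wishart conjugate forms implied by the model, not a transcription of the paper's own equations, and all variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(3)

# design: canonical (orthonormalized) intercept and time columns, as in (1)
times = np.array([0.0, 3.0, 6.0, 9.0, 12.0, 18.0])
t, n = len(times), 10
c1col = np.ones(t) / np.sqrt(t)
c2col = times - times.mean()
c2col = c2col / np.linalg.norm(c2col)
X = np.column_stack([c1col, c2col])

# hyperparameters from Section 3.2
a = np.array([200.4, -5.9]); A = np.array([[10.28, 2.30], [2.30, 2.35]])
c1, c2 = 3.0, 0.5
b = 8; B = np.array([[2.083, -0.016], [-0.016, 0.003]])

# synthetic batches centred on the prior mean (illustration only)
beta_true = rng.multivariate_normal(np.zeros(2), np.diag([1.0, 0.05]), size=n)
Y = np.stack([X @ (a + beta_true[i]) + rng.normal(0.0, 1.0, t) for i in range(n)])

def wishart(df, S, rng):
    """Bartlett-type Wishart draw from W(S, df)."""
    p = S.shape[0]
    L = np.linalg.cholesky(S)
    Aa = np.tril(rng.normal(size=(p, p)), k=-1)
    Aa[np.diag_indices(p)] = np.sqrt(rng.chisquare(df - np.arange(p)))
    M = L @ Aa
    return M @ M.T

# initial values
alpha = a.copy()
beta = np.zeros((n, 2))
sig2 = np.ones(n)
V = np.linalg.inv(b * B)
XtX = X.T @ X               # the identity, by orthonormality of X
A_inv = np.linalg.inv(A)

slope_draws = []
for it in range(1500):
    # batch effects beta_i | rest
    Vinv = np.linalg.inv(V)
    for i in range(n):
        cov = np.linalg.inv(Vinv + XtX / sig2[i])
        mean = cov @ (X.T @ (Y[i] - X @ alpha) / sig2[i])
        beta[i] = rng.multivariate_normal(mean, cov)
    # batch variances sigma_i^2 | rest (inverse gamma via Gamma reciprocal)
    for i in range(n):
        r = Y[i] - X @ (alpha + beta[i])
        sig2[i] = 1.0 / rng.gamma(c1 + t / 2.0, 1.0 / (1.0 / c2 + 0.5 * (r @ r)))
    # population effects alpha | rest
    cov = np.linalg.inv(A_inv + XtX * np.sum(1.0 / sig2))
    mean = cov @ (A_inv @ a + sum(X.T @ (Y[i] - X @ beta[i]) / sig2[i] for i in range(n)))
    alpha = rng.multivariate_normal(mean, cov)
    # between-batch precision V^{-1} | rest
    S = b * B + beta.T @ beta
    V = np.linalg.inv(wishart(b + n, np.linalg.inv(S), rng))
    # one long run: burn-in of 1000 cycles, then an independence gap of 10
    if it >= 1000 and (it - 1000) % 10 == 0:
        slope_draws.append(alpha[1])

slope_draws = np.array(slope_draws)
```

In this synthetic run the retained draws of the population slope concentrate near the generating value; on the real assay data, analogous draws underlie the results reported next.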

3.4. Inverse prediction

Application of Fieller's method (see, for instance, Draper and Smith,[25] pages 47-51) to this inverse prediction problem would not be satisfactory in certain situations very relevant to the drug shelf-life problem: in addition to the difficulties mentioned in the next paragraph, the population curve may not be well determined and/or may be nearly horizontal, and could thus yield some very peculiar and unsatisfactory confidence interval endpoints. Consider instead the following approach. To obtain the posterior distribution of the time X0 it takes to reach a specified drug potency Y0, one simply inverts the monotonic mean regression function given in Section 3.1 to obtain X0 = (Y0 - 0.408α0)/α1. Interest is in the behaviour of a typical batch, so that individual batch effects are set to their prior expected value βi = 0 for all i. The posterior distribution of X0 is best summarized by its quantiles. A parametric summary would be unsatisfactory because the distribution of X0, being a ratio of two Gaussian random variables, is very heavy tailed and does not have a first moment. A mixture density estimate (see Section 2) of the posterior distribution of X0 can be obtained easily from the Gibbs sampling iterates after convergence. It is often required that no more than 5 per cent of the batches have measured potencies less than 90 per cent of their claimed potency at time X0 in order for the drug to receive a labelled shelf-life of X0. This same requirement is used when reporting the following results.

4. RESULTS

Table I gives the results of the application of several different statistical models and methods to the drug shelf-life inverse prediction problem. Key features of this problem are the estimated slope of the monotonically decreasing regression line, its estimated standard error and the estimated time at which no more than 5 per cent of the batches can be expected to have measured potencies less than 90 per cent. Some estimates of these three numbers are given in the columns of Table I. Results from four different models and methods are given in its rows: the Bayesian growth curve model with heterogeneous batch variances (Bayes-GC) as defined in Section 3; the frequentist version of Bayes-GC except with homogeneous batch variance (GC), as described by Lange and Laird[15] and mentioned in Section 3.2; a generalized estimating equations (Liang and Zeger[26]) result (GEE-AR1) using a first-order autoregressive working correlation matrix and a robust estimate of standard error; and the ordinary least-squares result (OLS). Different working correlation matrices for the GEE approach (for example, an exchangeable or compound symmetric pattern) were also tried and found to give very similar answers.

Table I. Estimated slopes and their estimated standard errors for four different models and methods, and an estimate of minimum labelled shelf-life (in a canonical time scale)

    Method     Estimated slope   Estimated SE   Estimated 5th percentile of X0
    Bayes-GC       -4.03             2.98                  2.383
    GC             -4.11             1.93
    GEE-AR1        -4.11             0.578
    OLS            -4.11             0.395

Bayes-GC models individual batch variances as heterogeneous, different from the three other approaches. Although their resulting estimates are not strictly commensurable, these models' results have been included for rough comparative purposes. Stochastic relaxation results are from one long run, sampling 50 times with an independence gap of 20 following a 'burn-in' to convergence after 1000 cycles (see Section 5.2.3). Multiple independent starts were also tried and yielded virtually identical results in this case. Estimated minimum labelled shelf-life is given only for Bayes-GC for reasons given in the preceding subsection. In Table I, the estimated labelled shelf-life of 2.383 on the canonical scale is the 5th percentile of the empirical distribution obtained by substituting each of the 50 independent draws α(m) from the converged distribution of α into X0(m) = (0.9 - 0.408α0(m))/α1(m), m = 1, ..., 50. Figure 5 gives the estimated posterior distributions of the slope estimates from GC and Bayes-GC, along with the prior distribution used for Bayes-GC. A Bayesian model simpler than (6) that assumed a common variance term σ² for all batches yielded an estimated marginal posterior distribution for the slope coefficient (not shown) that was virtually identical to that shown in Figure 5 for Bayes-GC. Note the close similarity of all four point estimates for the slope and the strong decreasing trend in their estimated standard errors. This trend is to be expected because Bayes-GC is purposely over-parameterized, GC models nine fewer parameters and assumes repeated sampling, and GEE-AR1 has no random effects and includes only a single nuisance parameter for the covariance of the serial observations. The Bayes-GC and GC comparison concurs qualitatively with a similar comparison made by Gelfand et al.[6] between EM and stochastic relaxation methods applied to a similar balanced and complete growth curve model for a different problem (see their Figure 5), yet is quantitatively more extreme.
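The substitution of converged draws into the inverse prediction formula and the extraction of the 5th percentile can be sketched as follows; the intercept and slope draws here are synthetic placeholders, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(4)

# hypothetical converged draws of the population intercept and slope
a0 = rng.normal(200.4, 1.5, size=50)
a1 = rng.normal(-4.0, 0.5, size=50)

# invert the mean regression function for each draw, as in Section 3.4
x0 = (0.9 - 0.408 * a0) / a1

# labelled shelf-life: 5th percentile of the empirical distribution of X0
shelf_life = np.percentile(x0, 5)
```

Because the summary is a quantile of the ratio draws rather than a moment, the heavy tails of the ratio distribution cause no difficulty.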

5. SOME HISTORY, BACKGROUND AND FURTHER READING

This section gives some history, background and additional references, past and present, on the use of graphs as model descriptors and on stochastic relaxation in statistical image analysis.

Figure 5. Distributions of the estimated population slope effect from the frequentist and Bayesian models for the drug shelf-life problem

5.1. Graphical representations of statistical models

The use of graphs to depict relations between sets of variables dates at least to the method of path coefficients of Sewall Wright.[27,28] Wright's method employed multiple linear regression models with closed-form solutions and emphasized the importance of proper estimation and interpretation of partial regression coefficients. Since Wright's work, the use of graphs in statistical analyses has been expanded and formalized; see, for instance, Dempster,[29] Speed[30] and Lauritzen and Spiegelhalter.[31] Current research interest in graphical models is high; see, for instance, Edwards,[32] Wermuth and Lauritzen[33] and the recent text by Whittaker.[34] Two main features of graphs are conditional and chain independence.

5.1.1. Conditional independence graphs

Two random variables X and Y are said to be conditionally independent given a third random variable Z if and only if p(X, Y|Z) = p(X|Z) p(Y|Z), where p(·|·) denotes a conditional probability density function. A conditional independence graph for three such variables is given in Figure 6(a). Conditional independence for more than three variables, as shown for instance by the graph in Figure 6(b), implies that in this case p(X, Y|W, Z) = p(X|Z) p(Y|Z) and also that p(W, X|Y, Z) = p(W|Z) p(X|Z). Such factorizations of joint densities imply the following useful

simplifications of these conditional densities: p(W|X, Y, Z) = p(W|Z), p(X|W, Y, Z) = p(X|Z) and p(Y|W, X, Z) = p(Y|Z), with p(Z|W, X, Y) unsimplified in this case.

Figure 6. Two simple conditional independence graphs

5.1.2. Chain independence graphs

It is often useful to indicate an ordering or hierarchy of variables in models and graphs, breaking the symmetry of the multivariate structure by supposing relations such as 'X affects Y' but not the other way around, as in regression. Such orderings are indicated by adding arrows to conditional independence graphs to form directed graphs. Additional levels of complexity in statistical models arise when the variables may be arranged in a hierarchy of blocks to form a chain. For instance, a set of variables W1, W2, ..., WI may jointly affect X1, X2, ..., XJ that in turn jointly affect Y1, Y2, ..., YK. In such cases, one has a chain independence graph consisting of a mixture of undirected and directed line segments connecting variables. A detailed example of a chain independence graph was given in Section 3. The next section shows how graphical models help one understand how a flexible yet computationally intense method may be used to obtain their fits.

5.2. Stochastic relaxation

The method of stochastic relaxation has been used successfully in a wide variety of contexts for quite a number of years, dating at least to the Metropolis algorithm[35,36] in statistical physics. Hastings[37] used stochastic relaxation to generate random variance-covariance matrices and derived some limit theorems for his method by using some known properties of Markov chains; Peskun[38] continued this work. Important theoretical results for stochastic relaxation in general, and in image processing applications in particular, were established by Geman and Geman,[1] who called their method the 'Gibbs sampler', as the distributions sampled in their context were Gibbs distributions. The next subsection takes a brief look at the original Gibbs sampler.


Figure 7. An 8-neighbourhood structure, with a clique indicated by shaded circles

5.2.1. What is the Gibbs sampler?

In statistical image analysis, a multivariate random variable X is often used to denote the joint distribution of possible pixel values in an image. The pixel ensemble is often broken down into manageable units called neighbourhoods. Figure 7 shows an 8-neighbourhood; there is one such neighbourhood for each pixel in an image. The support of X consists of pixel sites (denoted by open and shaded circles). On each of these pixel sites sits a distribution of possible pixel values (for example, 0-255 for an 8-bit image). A clique, denoted by shaded circles, is a subset of this neighbourhood that consists of those pixels that are assumed to affect one another directly. The distribution of values at the centre pixel depends only on the distributions of the other members of the cliques. Much of the work with Gibbs distributions in image processing problems is aimed at characterizing this local dependence and interaction structure in different contexts; see Lange et al.[39] for a recent application. A joint distribution of the components of X has Gibbs form if its probability density function can be written as

    pθ(x) = (1/Z(θ)) exp(-V(x, θ)).    (7)

In statistical physics, the exponent V(x, θ) in (7) is called the energy function and the normalizing constant Z(θ) is called the partition function. The exponent takes the form

    V(x, θ) = Σc Vc(x, θ),

where Vc(x, θ) is a contribution to the energy from the cth clique, such as Vc(x, θ) = θ(xi - xj)² for i, j ∈ c, for instance; see Besag[40] and other statistical image analysis references cited for additional examples of local energy functions. The parameter θ is an index to the family of possible pixel value distributions, and the dependence of the normalizing constant Z on θ had, until the Gemans' work, made finding maximum likelihood estimates of θ virtually impossible. Under mild conditions, the Gemans showed that by visiting each pixel in turn and updating current estimates of their conditional distributions given the others by dynamic Monte Carlo sampling, the estimated joint posterior distribution of the pixels is guaranteed to converge to its true theoretical value. The MAP (maximum a posteriori) estimate of the reconstructed image could thus be obtained in a conceptually simple albeit computationally demanding manner.

Many useful features of the stochastic relaxation process derive from its Markov properties. The process is (first-order) Markov because the conditional probability that the graph is in

a current state given the entire history of graph states is equal to the conditional probability that the graph is in a current state given only the previous graph state. The true joint posterior distribution can indeed be found by this method when it is known that the generated Markov chain of graph states is irreducible and aperiodic. A Markov chain is irreducible when one can generate a sequence of realizations that connects any one possible state to any other possible state. A Markov chain is aperiodic if, for every possible state, there is a positive probability that the process will return to that state in an arbitrary number of transitions. If, however, the chain is reducible and/or periodic, then the process may appear to have converged yet in fact be trapped in a region of the state space that may be very far from the 'truth'. If one can generate a long sequence from an irreducible aperiodic chain, then one is guaranteed that the true joint posterior distribution can be reached. (Assessment of convergence in data analytic uses of stochastic relaxation is a somewhat open and subtle question; a brief discussion of this issue is deferred until Section 5.2.3.) For a more technical treatment of these topics and an interesting 'animation' example, see Tierney.[41]
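A toy illustration of both points: for a three-pixel binary image with clique energies θ(xi - xj)² on adjacent pairs, the Gibbs distribution (7) can be enumerated exactly, and a single-site Gibbs sampler, which here generates an irreducible aperiodic chain, reproduces it empirically. Everything below, including the choice θ = 1, is illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)
theta = 1.0

def energy(x):
    # sum of clique contributions theta * (x_i - x_j)^2 over adjacent pairs
    return theta * sum((x[i] - x[i + 1]) ** 2 for i in range(len(x) - 1))

# exact Gibbs distribution (7) by enumerating all 2^3 configurations
states = list(itertools.product([0, 1], repeat=3))
weights = np.array([np.exp(-energy(s)) for s in states])
exact = weights / weights.sum()       # division by the partition function Z

# single-site Gibbs sampler: visit each pixel in turn and draw from its
# conditional distribution given its neighbours
x = [0, 0, 0]
counts = np.zeros(len(states))
for sweep in range(30_000):
    for i in range(3):
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < 3]
        w0 = np.exp(-theta * sum((0 - x[j]) ** 2 for j in nbrs))
        w1 = np.exp(-theta * sum((1 - x[j]) ** 2 for j in nbrs))
        x[i] = int(rng.random() < w1 / (w0 + w1))
    if sweep >= 5_000:                # discard a burn-in period
        counts[states.index(tuple(x))] += 1

empirical = counts / counts.sum()
```

The empirical state frequencies agree with the enumerated Gibbs probabilities to within Monte Carlo error, which is the convergence guarantee described above in miniature.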

5.2.2. Some variations of dynamic Monte Carlo

As pointed out by Green and Han,[42] the Gibbs sampler is a special case of the variant of the Metropolis algorithm, due to Hastings,[37] in which draws from the conditional distributions are accepted or rejected with a probability, less than one, that depends on the current and potential future states of the graph. The Gibbs sampler always accepts the draw. Barone and Frigessi[43] derived a class of stochastic relaxation processes in which the draw is always accepted that also includes the Gibbs sampler as a special case. Their generalization is for Gaussian distributed nodes and helps speed convergence by using an antithetic variable (see, for instance, Ripley,[44] pages 118-132) that is changed dynamically. Too little is known about this variant of dynamic Monte Carlo at this time for it to be applied here, yet it would appear to hold some useful future prospects in general Gaussian contexts.

5.2.3. Running a dynamic Monte Carlo algorithm

The goal of stochastic relaxation is to run the process until the converged state of the graph has been reached, at which time one can obtain independent draws from the posterior conditional distributions of the parameters and use these independent draws to make statistical inferences concerning the dependency structure of the underlying model. Draws from the conditional distributions may be obtained in two ways: by multiple independent parallel runs with random starting points or by one long run. Figure 8 gives a picture. Controversy continues as to the better of the two approaches under different circumstances; see, for instance, Green and Han,[42] Raftery and Lewis,[45] Ritter and Tanner[46] and Gelman and Rubin.[47] Both approaches were taken for the example in Section 3 and were found to yield virtually identical results in this simple case. Until convergence, such draws are correlated. One cannot actually determine when the stochastic relaxation process has converged unless one already knows the true dependency structure of the underlying model, to help separate components of simulation variance and covariance from the total. Some knowledge of convergence rate is required to avoid problems of long-range dependence, in which, due to simulation correlation, the parameters of interest can appear to be much more correlated than they actually are. Green and Han[42] pointed out that unless consideration is given to convergence issues, naive uses of stochastic relaxation can yield totally meaningless piles of numbers.
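The one-long-run recipe (burn-in, then draws separated by an independence gap) is easy to state in code; the autocorrelated 'sampler output' below is a stand-in AR(1) series rather than a real Gibbs chain, and all names are mine.

```python
import numpy as np

rng = np.random.default_rng(5)

# stand-in for correlated sampler output: an AR(1) series with lag-1
# correlation 0.9
chain = []
x = 0.0
for _ in range(2_000):
    x = 0.9 * x + rng.normal()
    chain.append(x)
chain = np.array(chain)

# one long run: discard a burn-in, then keep every gap-th draw
burn_in, gap = 1_000, 20
kept = chain[burn_in::gap]

lag1_full = np.corrcoef(chain[:-1], chain[1:])[0, 1]
lag1_kept = np.corrcoef(kept[:-1], kept[1:])[0, 1]
```

Thinning with a sufficiently wide gap leaves the retained draws nearly uncorrelated, which is the sense in which the 50 draws of Section 4 were treated as independent.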

[Figure 8: schematic of the two sampling strategies. Left panel, ‘many short runs’: each of many chains is burned in from a random start and contributes one draw to a sample of M. Right panel, ‘one long run’: a single chain is burned in once, then a sample of M is taken at a fixed independence gap.]

Figure 8. Multiple independent runs with random starting points versus one long run
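A minimal sketch of the two strategies of Figure 8, using an AR(1) chain as a stand-in for a Gibbs transition kernel; the chain and every tuning constant (burn-in length, gap, autoregressive coefficient) are illustrative assumptions, not part of the shelf-life analysis.

```python
import random

def ar1_step(x, phi=0.9):
    """One update of a toy sampler: an AR(1) chain whose stationary
    distribution is N(0, 1/(1 - phi**2)), standing in for a Gibbs kernel."""
    return phi * x + random.gauss(0, 1)

def many_short_runs(m, burn_in, phi=0.9):
    """Run m independent chains from random starts; keep only the final
    state of each, so the m draws are mutually independent."""
    draws = []
    for _ in range(m):
        x = random.uniform(-10, 10)  # random starting point
        for _ in range(burn_in):
            x = ar1_step(x, phi)
        draws.append(x)
    return draws

def one_long_run(m, burn_in, gap, phi=0.9):
    """Run a single chain: discard the burn-in iterations, then keep every
    gap-th state so successive kept draws are only weakly correlated."""
    x = random.uniform(-10, 10)
    for _ in range(burn_in):
        x = ar1_step(x, phi)
    draws = []
    for _ in range(m):
        for _ in range(gap):
            x = ar1_step(x, phi)
        draws.append(x)
    return draws
```

The many-short-runs strategy buys exact independence at the price of re-paying the burn-in for every draw; the one-long-run strategy pays the burn-in once but must choose the gap large enough that the retained draws are nearly uncorrelated.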

6. DISCUSSION

Graphs and stochastic relaxation are tools that have intuitive appeal and that provide a useful framework for hierarchical Bayes modelling in a variety of practical data analytic contexts. See Racine et al.48 for more examples of Bayesian approaches to biopharmaceutical problems. Common criticisms of the use of the Gibbs sampler for longitudinal data analysis are that it takes too much time and that one needs to adopt a Bayesian approach. The first criticism is unsupportable; the second, not entirely valid (for instance, see Geyer49). Stochastic relaxation does take time, and if one is concerned primarily with computational efficiency and/or time constraints, one should perhaps consider other techniques. Also, when prior and likelihood are not conjugate, the complete set of conditional distributions is not ‘available’ in the sense of Gelfand and Smith,5 and adaptive rejection sampling (for example, Gilks and Wild50) and other numerical integration methods are required. Similar to the research motivations of the computationally demanding bootstrap methods, there are currently no other methods besides stochastic relaxation that are able to accommodate all of the features and complexities of certain statistical models; see, for instance, Lange et al.24 As far as adopting a Bayesian perspective to longitudinal data analysis, the Gibbs sampler may be considered as a conceptual opposite of the bootstrap: for the parametric bootstrap the parameters are fixed and draws made from the empirical distribution of the data given the parameters; for stochastic relaxation the data are fixed and draws made from the conditional distributions of the parameters.

ACKNOWLEDGEMENTS

The author thanks Alan Gelfand and many other colleagues, too numerous to mention individually, for helpful discussions that greatly improved the paper. Thanks also to an anonymous reviewer for constructive comments and to Dr. Shein-Chung Chow, Bristol-Myers, for use of the drug shelf-life data.


REFERENCES

1. Geman, S. and Geman, D. ‘Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images’, IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741 (1984).
2. Clayton, D. G. ‘Monte Carlo interval estimation in Bayesian hierarchical models’, Technical report, Department of Community Health, Leicester University, 1988.
3. Tanner, M. A. and Wong, W. H. ‘The calculation of posterior distributions by data augmentation’ (with discussion), Journal of the American Statistical Association, 82, 528-550 (1987).
4. Li, K-H. ‘Imputation using Markov chains’, Journal of Statistical Computing and Simulation, 30, 57-79 (1988).
5. Gelfand, A. E. and Smith, A. F. M. ‘Sampling-based approaches to calculating marginal densities’, Journal of the American Statistical Association, 85, 398-409 (1990).
6. Gelfand, A. E., Hills, S. E., Racine-Poon, A. and Smith, A. F. M. ‘Illustration of Bayesian inference in normal data models using Gibbs sampling’, Journal of the American Statistical Association, 85, 972-985 (1990).
7. Tanner, M. A. Tools for Statistical Inference: Observed Data and Data Augmentation Methods, Springer-Verlag, New York, 1991.
8. Casella, G. and George, E. I. ‘An introduction to Gibbs sampling’, The American Statistician, 46, 167-174 (1992).
9. Lange, N. and Ryan, L. ‘Assessing normality in random effects models’, Annals of Statistics, 17, 624-642 (1989).
10. Hulting, F. L. and Harville, D. A. ‘Some Bayesian and non-Bayesian procedures for the analysis of comparative experiments and for small area estimation: computational aspects, frequentist properties and relationships’, Journal of the American Statistical Association, 86, 557-568 (1991).
11. Breslow, N. ‘Biostatistics and Bayes’ (with comments), Statistical Science, 5, 269-298 (1990).
12. Chow, S-C. and Shao, J. ‘Estimating drug shelf-life with random batches’, Biometrics, 47, 1071-1080 (1991).
13. Shao, J. and Chow, S-C. ‘Statistical inference in stability analysis’, Technical report, Bristol-Myers US Pharmaceuticals Group, Evansville, Indiana, 1990.
14. Rao, C. R. ‘Prediction of future observations in growth curve models’ (with comments), Statistical Science, 4, 434-471 (1987).
15. Lange, N. and Laird, N. M. ‘The effect of covariance structure on variance estimation in balanced growth-curve models with random parameters’, Journal of the American Statistical Association, 84, 241-247 (1989).
16. Racine-Poon, A. and Smith, A. F. M. ‘Population models’, in Berry, D. (ed.), Statistical Methodology in the Pharmaceutical Sciences, Marcel Dekker, New York, 1990.
17. Braun, H. I., Jones, D. H., Rubin, D. B. and Thayer, D. T. ‘Empirical Bayes estimation of coefficients in the general linear model from data of deficient rank’, Psychometrika, 48, 171-181 (1983).
18. Dempster, A. P., Laird, N. M. and Rubin, D. B. ‘Maximum likelihood from incomplete data via the EM algorithm’ (with discussion), Journal of the Royal Statistical Society, Series B, 39, 1-38 (1977).
19. Hui, S. L. and Berger, J. O. ‘Empirical Bayes estimation of rates in longitudinal studies’, Journal of the American Statistical Association, 78, 753-760 (1983).
20. Patterson, H. D. and Thompson, R. ‘Recovery of inter-block information when block sizes are unequal’, Biometrika, 58, 545-554 (1971).
21. Lindley, D. V. and Smith, A. F. M. ‘Bayes estimates for the linear model’ (with discussion), Journal of the Royal Statistical Society, Series B, 34, 1-41 (1972).
22. Odell, P. L. and Feiveson, A. H. ‘A numerical procedure to generate a sample covariance matrix’, Journal of the American Statistical Association, 61, 198-203 (1966).
23. Zeger, S. L. and Karim, M. R. ‘Generalized linear models with random effects: a Gibbs sampling approach’, Journal of the American Statistical Association, 86, 79-86 (1991).
24. Lange, N., Carlin, B. P. and Gelfand, A. E. ‘Hierarchical Bayes models for the progression of HIV infection using longitudinal CD4 T-cell numbers’ (with discussion), Journal of the American Statistical Association, 87, 615-632 (1992).
25. Draper, N. R. and Smith, H. Applied Regression Analysis, 2nd edn, Wiley, New York, 1980.
26. Liang, K-Y. and Zeger, S. L. ‘Longitudinal data analysis using generalized linear models’, Biometrika, 73, 13-22 (1986).
27. Wright, S. ‘The relative importance of heredity and environment in determining the piebald pattern of guinea pigs’, Proceedings of the National Academy of Sciences, 6, 320-332 (1920).


28. Wright, S. ‘The method of path coefficients’, Annals of Mathematical Statistics, 5, 161-215 (1934).
29. Dempster, A. P. ‘Covariance selection’, Biometrics, 28, 157-175 (1972).
30. Speed, T. P. ‘Relations between models for spatial data, contingency tables and Markov fields on graphs’, Supplement to Advances in Applied Probability, 10, 111-122 (1978).
31. Lauritzen, S. L. and Spiegelhalter, D. J. ‘Local computations with probabilities on graphical structures and their application to expert systems’ (with discussion), Journal of the Royal Statistical Society, Series B, 50, 157-224 (1988).
32. Edwards, D. ‘Hierarchical interaction models’ (with discussion), Journal of the Royal Statistical Society, Series B, 52, 3-20 (1990).
33. Wermuth, N. and Lauritzen, S. L. ‘On substantive research hypotheses, conditional independence graphs, and graphical chain models’ (with discussion), Journal of the Royal Statistical Society, Series B, 52, 21-50 (1990).
34. Whittaker, J. Graphical Models in Applied Multivariate Statistics, Wiley, New York, 1990.
35. Metropolis, N. and Ulam, S. ‘The Monte Carlo method’, Journal of the American Statistical Association, 44, 335-341 (1949).
36. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. ‘Equations of state calculations by fast computing machines’, Journal of Chemical Physics, 21, 1087-1092 (1953).
37. Hastings, W. K. ‘Monte Carlo sampling methods using Markov chains and their applications’, Biometrika, 57, 97-109 (1970).
38. Peskun, P. H. ‘Optimum Monte-Carlo sampling using Markov chains’, Biometrika, 60, 607-612 (1973).
39. Lange, N., O'Tuama, L. A., Manbeck, K. M., Zimmerman, R. E., McClure, D. E. and Geman, S. ‘Some sharpening and registration methods applied to SPECT images of pediatric brain tumors’, in Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, 542-549 (1991).
40. Besag, J. ‘Spatial interaction and the statistical analysis of lattice systems’, Journal of the Royal Statistical Society, Series B, 36, 192-225 (1974).
41. Tierney, L. ‘Exploring posterior distributions using Markov chains’, in Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, 563-570 (1991).
42. Green, P. J. and Han, X-L. ‘Metropolis methods, Gaussian proposals and antithetic variables’, Technical report, Department of Mathematics, University of Bristol, Bristol, UK, 1991.
43. Barone, P. and Frigessi, A. ‘Improving stochastic relaxation for Gaussian random fields’, Probability in the Engineering and Informational Sciences, 4, 369-389 (1989).
44. Ripley, B. D. Stochastic Simulation, Wiley, New York, 1987.
45. Raftery, A. E. and Lewis, S. ‘How many iterations in the Gibbs sampler?’, in Bernardo, J. M. et al. (eds), Bayesian Statistics 4, Oxford University Press (to appear).
46. Ritter, C. and Tanner, M. A. ‘Facilitating the Gibbs sampler: the Gibbs stopper and the griddy Gibbs sampler’, Journal of the American Statistical Association, 87, 861-868 (1992).
47. Gelman, A. and Rubin, D. B. ‘Inference from iterative simulation using multiple sequences’, Technical report, Department of Statistics, University of California, Berkeley, U.S.A., 1992.
48. Racine, A., Grieve, A. P., Flühler, H. and Smith, A. F. M. ‘Bayesian methods in practice: experiences in the pharmaceutical industry’, Journal of the Royal Statistical Society, Series C (Applied Statistics), 35, 93-120 (1986).
49. Geyer, C. J. ‘Markov chain Monte Carlo maximum likelihood’, in Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, 156-163 (1991).
50. Gilks, W. R. and Wild, P. ‘Adaptive rejection sampling for Gibbs sampling’, Journal of the Royal Statistical Society, Series C (Applied Statistics), 41, 337-348 (1992).
