Multivariate Behavioral Research

ISSN: 0027-3171 (Print) 1532-7906 (Online) Journal homepage: http://www.tandfonline.com/loi/hmbr20

Confirmatory Latent Class Analysis: Model Selection Using Bayes Factors and (Pseudo) Likelihood Ratio Statistics Herbert Hoijtink To cite this article: Herbert Hoijtink (2001) Confirmatory Latent Class Analysis: Model Selection Using Bayes Factors and (Pseudo) Likelihood Ratio Statistics, Multivariate Behavioral Research, 36:4, 563-588, DOI: 10.1207/S15327906MBR3604_04 To link to this article: http://dx.doi.org/10.1207/S15327906MBR3604_04

Published online: 10 Jun 2010.


Multivariate Behavioral Research, 36 (4), 563-588 Copyright © 2001, Lawrence Erlbaum Associates, Inc.

Confirmatory Latent Class Analysis: Model Selection Using Bayes Factors and (Pseudo) Likelihood Ratio Statistics Herbert Hoijtink


Department of Methodology and Statistics University of Utrecht

Inequality constraints among class specific probabilities can be used to assign a specific meaning to the classes in a latent class model. Different models arise if different sets of constraints are used. In this article, model selection using Bayes factors and (pseudo) likelihood ratio statistics evaluated using posterior predictive p-values will be discussed. It will be illustrated that these Bayesian selection criteria do not suffer from the same flaw as maximum likelihood based selection criteria. Using a small simulation study it will be shown that, in the context of that study, the Bayes factor and the pseudo likelihood ratio statistic have the best properties. The article will be concluded with an example.

Introduction

If a researcher has the responses x_ij ∈ {0, 1} of persons i = 1, ..., N to items j = 1, ..., J, latent class analysis can be used to determine the number of latent classes q = 1, ..., Q and the nature of each class. Each class is characterized by a weight ω_q, which is the proportion of persons belonging to class q, and a vector of class specific probabilities π_qj for j = 1, ..., J, which are the probabilities of responding 1 (subsequently to be called the ‘correct’ or ‘positive’ response, or ‘yes’) to each of the J items.

Usually latent class analyses are exploratory. This means that the researcher has little or no idea about the number or nature of the latent classes. Analyses are executed with different numbers of latent classes. Using information criteria (Lin & Dayton, 1997) or likelihood ratio statistics (Everitt, 1988) the researcher determines the number of classes that should be used. Subsequently the class specific probabilities are used to attach a meaning to each of the latent classes. Although exploratory analyses are useful and valuable, they are not without risk. The correct number of classes can often only approximately be determined (see also the comments given in the section entitled “Simulation Study”), and meaning is attached after inspection of the class specific probabilities, which may lead to overinterpretation.

(Please address correspondence regarding this manuscript to Herbert Hoijtink, Department of Methodology and Statistics, University of Utrecht, P.O. Box 80140, 3508 TC Utrecht, The Netherlands. E-mail: [email protected].)

This article will deal with confirmatory latent class analysis (see McCutcheon, 1987, pp. 37-38, for a description of the general idea). The focus is on confirmatory models that can be constructed using inequality constraints among the parameters of the model. In the next section the constraints will be described, and a number of examples will be given. In the section entitled “Model Selection Criteria” different model selection criteria will be introduced and explained. It will be shown that model selection criteria based on maximum likelihood estimates of the model parameters have an important drawback if the model is defined using inequality constraints among the parameters. It will also be shown that Bayesian model selection criteria do not have this drawback. To evaluate the performance of the Bayesian model selection criteria, a simulation study will be presented and discussed. The article will be concluded with an example.

Examples of Confirmatory Latent Class Analysis

The point of departure is a researcher who has two or more theories about the process by which the item responses are generated. Via a confrontation of each theory with the data the researcher wants to find out which theory is the best. To be able to confront a theory with the data, the theory has to be ‘translated’ into a latent class model. This translation will be based on constraints that have previously been considered by Croon (1990, 1991, 1993) and Hoijtink and Molenaar (1997) and subsequently generalized by Hoijtink (1998). Equations 1 through 5 give possible forms of constraints that can be placed on the parameters of latent class models (note that c_qj is restricted to values for which the left hand side of Equations 1 through 4 can attain values in the interval [0, 1]):

(1)

π_qj = c_qj;

(2)

π_qj > c_qj or π_qj < c_qj;

and, for q ≠ q′ and/or j ≠ j′,

(3)

π_qj > π_q′j′ + c_qj or π_qj < π_q′j′ + c_qj;

(4)

π_qj > π_q′j′ × c_qj or π_qj < π_q′j′ × c_qj;

and,

(5)

ω_q > ω_q′ or ω_q < ω_q′.


Example 1. Suppose an educational researcher wants to investigate how the population of students should be described. He has two competing theories. In theory one the population consists of three classes: those with a preference for languages and history (q = 1); those with a preference for the natural sciences (q = 2); and those with a preference for the social and behavioral sciences (q = 3). In theory two the population also consists of three classes: students without a real interest in either languages, history or sciences (q = 1); students with a moderate interest (q = 2); and students who are highly interested in languages, history and sciences (q = 3). Using nine statements (j = 1, 2, 3 with respect to languages and history, j = 4, 5, 6 with respect to the natural sciences, and j = 7, 8, 9 with respect to the social and behavioral sciences) the preferences of the students are measured (the response 1 indicates a positive attitude to the field involved in the statement). Theory one could be translated into a latent class model using

(6)

π_1j > π_1j′ for j = 1, 2, 3 and j′ = 4, ..., 9,

(7)

π_2j > π_2j′ for j = 4, 5, 6 and j′ = 1, 2, 3, 7, 8, 9,

and, (8)

π_3j > π_3j′ for j = 7, 8, 9 and j′ = 1, ..., 6.

It can be seen that in the first class the languages and history items are more ‘popular’ than the remaining items; similarly, in the third class the social and behavioral science items are more popular than the other items. Theory two could be translated into a latent class model using

(9)

π_1j < π_2j < π_3j for j = 1, ..., 9,

that is, the higher the class number the higher the interest in languages, history and the social and natural sciences.

Example 2. Suppose a psychiatrist wants to investigate the relation between depression and anxiety. To be able to do this he scores 200 persons with respect to 30 items. It is known that (positive responses to) items j = 1, ..., 10 are indicative of depression, j = 11, ..., 20 are indicative of anxiety, and j = 21, ..., 30 are indicative of both depression and anxiety. The psychiatrist has two plausible theories: (a) depression and anxiety occur jointly but persons suffer in different degrees from both phenomena; and (b) depression and anxiety may or may not occur jointly.


The first theory could be translated into a latent class model using the following constraints: (10)

π_1j < π_2j < π_3j < π_4j for j = 1, ..., 30,

which ensures that the degree to which persons are depressed and anxious is increasing with class number since for each item the probability of a positive response increases with class number. The second theory could be translated as follows:


(11)

π_1j < .2 for j = 1, ..., 30,

which renders a class of rather normal persons that do not suffer from depression and anxiety; (12)

π_2j > π_1j for j = 11, ..., 20,

and (13)

π_2j > π_1j + .2 for j = 1, ..., 10, 21, ..., 30,

which renders a class of persons that are somewhat more anxious than the ‘normal’ persons, and substantially more depressed than the ‘normal’ persons; (14)

π_3j > π_1j for j = 1, ..., 10,

and (15)

π_3j > π_1j + .2 for j = 11, ..., 30,

which renders a class of persons that are somewhat more depressed than the ‘normal’ persons, and substantially more anxious than the ‘normal’ persons; and, (16)

π_4j > π_2j and π_4j > π_3j for j = 1, ..., 30,

which renders a class of persons that are more depressed and anxious than the persons in classes 2 and 3.


It may be, with respect to the second theory, that another psychiatrist does not agree with the assumption that the anxiety level of depressed persons is larger than the anxiety level of normal persons. This would lead to a third theory in which Equation 12 is replaced by


(17)

π_2j < .2 for j = 11, ..., 20.
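Constraint sets like those of Equations 11 through 16 can be verified mechanically for any candidate set of class specific probabilities. The sketch below (Python; the function name, array layout and example values are illustrative, not taken from the article) encodes the second theory's constraints directly:

```python
# Sketch (not from the article): a direct check of the inequality
# constraints of Equations 11-16. pi[q][j] holds the probability of a
# positive response to item j + 1 in class q + 1 (zero-based indices).

def satisfies_theory_two(pi):
    dep = list(range(0, 10))     # items 1-10:  indicative of depression
    anx = list(range(10, 20))    # items 11-20: indicative of anxiety
    both = list(range(20, 30))   # items 21-30: indicative of both
    return (
        # Equation 11: class 1 consists of rather 'normal' persons
        all(pi[0][j] < .2 for j in range(30))
        # Equations 12-13: class 2 somewhat more anxious,
        # substantially more depressed than class 1
        and all(pi[1][j] > pi[0][j] for j in anx)
        and all(pi[1][j] > pi[0][j] + .2 for j in dep + both)
        # Equations 14-15: class 3 somewhat more depressed,
        # substantially more anxious than class 1
        and all(pi[2][j] > pi[0][j] for j in dep)
        and all(pi[2][j] > pi[0][j] + .2 for j in anx + both)
        # Equation 16: class 4 dominates classes 2 and 3 on every item
        and all(pi[3][j] > pi[1][j] and pi[3][j] > pi[2][j]
                for j in range(30))
    )

# An illustrative parameter set that is admissible under the second theory:
pi_admissible = [
    [.10] * 30,                             # class 1
    [.40] * 10 + [.15] * 10 + [.40] * 10,   # class 2
    [.15] * 10 + [.40] * 10 + [.40] * 10,   # class 3
    [.50] * 30,                             # class 4
]
print(satisfies_theory_two(pi_admissible))  # True
```

Replacing the Equation 12 check with its Equation 17 counterpart (`all(pi[1][j] < .2 for j in anx)`) would encode the third theory instead.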

Example 3. Suppose that a researcher has but one theory, and wants to find out if it makes any sense. The theory states that in the second week of their lives babies want to be (and consequently are) nourished every other two hour period. To investigate this theory the researcher observes the eating behavior of a sample of babies during one day in the second week of their lives. He breaks the day down into twelve two hour periods (the items; the first period starts around noon) and scores for each baby whether it was nourished (response 1) or not (response 0) during each of the periods. To ensure compatibility of the two hour periods among babies, two rules are used: (a) if a baby is fed before 2 PM, the first period starts precisely 1 hour before the actual time the baby is fed; and (b) if a baby is not fed before 2 PM, the second period starts precisely 1 hour before the actual time the baby is fed, and the first period starts 3 hours earlier. This theory can be translated into a latent class model consisting of two latent classes. For the first class

(18)

π_1j > π_1j′ + .5 for j = 1, 3, 5, 7, 9, 11 and j′ = 2, 4, 6, 8, 10, 12,

that is, the probability of being nourished in an odd-numbered period is at least .5 larger than the probability of being nourished in an even-numbered period. For the second class

(19)

π_2j > π_2j′ + .5 for j = 2, 4, 6, 8, 10, 12 and j′ = 1, 3, 5, 7, 9, 11.

To achieve his goal, the researcher should confront his theory (translated into a latent class model using Equations 18 and 19) with the data, and he should also confront an unrestricted latent class model with two or three classes with the data. If the constraints are (largely) appropriate, the data should provide more support for the constrained model than for the unconstrained model.

Model Selection Criteria

After translation of one or more theories into constrained latent class models, the support of the data for each latent class model has to be determined.


The ‘classical’ approach is to compute an information criterion (Akaike, 1987; Bozdogan, 1987; Lin & Dayton, 1997; Schwarz, 1978) for each model, or to use likelihood ratio tests (Everitt, 1988). These criteria are based on maximum likelihood estimates of the model parameters. When theories are translated into a latent class model using one or more of the restrictions (Equations 1 to 5) these criteria have an important drawback. This will be illustrated in the next subsection using a simple binomial model. In the section entitled “Bayesian Model Selection Criteria” the same binomial model will be used to explain the Bayesian counterparts of these criteria and to illustrate that these do not suffer from the drawback.

Maximum Likelihood Based Model Selection Criteria

Information criteria can be used to determine which of a number of competing models is superior. An information criterion can (loosely) be defined as the distance between the model at hand and the true model. The smaller the value of the information criterion, the smaller the distance. The best known information criteria are AIC (Akaike, 1987), CAIC (Bozdogan, 1987) and BIC (Schwarz, 1978). Each criterion consists of two parts. One part is based on the maximum of the log-likelihood of the parameters of model k = 1, ..., K: log L(θ_k^ML | X). Note that θ_k^ML denotes the maximum likelihood estimates of the parameters of model k, and X denotes the N × J matrix containing the item responses. The larger the maximum, the more ‘likely’ the parameter values given the data, that is, the better the model. Note that the maximum is usually multiplied by −2 when used in an information criterion, that is, the better the model the smaller the criterion. The second part is based on the number of parameters D_k used in the model. This part serves as a penalty function. The inclusion of more parameters in a model usually increases the maximum of the log-likelihood. The information criteria tax this increase via the addition of a function of D_k. To give an example:

(20)

AIC_k = −2 log L(θ_k^ML | X) + 2D_k.

Without modification, information criteria based on maximum likelihood estimates cannot be used to determine which of a number of models based on varying sets of inequality constraints is the best model. Consider two simple models:

(21)


u ~ Bin(N, p),


and,


(22)

u ~ Bin(N, p) with p > .6,

where p denotes a binomial probability, and u the number of ‘successes’ observed in N trials. The set of constraints H_1 used to construct Equation 21 is empty; the set of constraints H_2 for Equation 22 contains p > .6. Denoting a success by a 1 and a failure by a 0, a series of N = 10 Bernoulli experiments could render 0111111110, that is, u = 8. The maximum likelihood estimate of p is .8 for both Equations 21 and 22. The consequence is that criteria like Equation 20 cannot be used to distinguish Equation 21 from Equation 22: the maximum of the likelihood of both models and the number of parameters used are the same. In this specific example one could argue that consequently the most parsimonious model (Equation 22) should be preferred. However, note that it is not always possible to decide which model is the most parsimonious. In one model the constraint could be p > .6, in another model .5 < p < .9. Recently Anraku (1999) developed a maximum likelihood based information criterion that can be applied to a simple model based upon inequality constraints among means. However, it may take a while (if ever) before this approach becomes applicable to more complex models like the constrained latent class model. In the next section Bayes factors (Kass & Raftery, 1995) will be introduced. Bayes factors do not suffer from the same drawback as maximum likelihood based information criteria, and are easy to compute.

In the context of latent class analysis, information criteria are not the only means to determine which of a number of models is superior. A likelihood ratio test can be used to compare the likelihood of a (constrained) latent class model with the likelihood of a saturated multinomial model (see the section entitled “Likelihood Ratio Statistics Evaluated Using Posterior Predictive P-values” for a formal definition of the likelihood ratio test). If this is done for each of the K models, the model with the smallest likelihood ratio (in this article expressed by means of a p-value) is, in terms of this criterion, superior to the other models. Unlike the information criteria, this criterion does not involve a trade off between the likelihood and the number of parameters used. Likelihood ratio tests based on maximum likelihood estimates of the model parameters suffer from the same flaw as the information criteria: it is not always possible to use them to distinguish Equation 21 from Equation 22. The maximum of the likelihood of a model with constraints H_1 may be the same as the maximum of the likelihood of a model with constraints H_2.
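The binomial illustration can be made concrete in a few lines. The following sketch (Python; variable names are illustrative, not from the article) computes the AIC of Equation 20 for the models of Equations 21 and 22 with the data 0111111110, and shows that the two values coincide:

```python
from math import comb, log

# Sketch (not from the article) of the drawback just described, using the
# binomial models of Equations 21 and 22: with u = 8 successes in N = 10
# trials the unconstrained maximum likelihood estimate p = u/N = .8 already
# satisfies p > .6, so both models reach the same maximum of the likelihood
# and, via Equation 20, the same AIC.

N, u = 10, 8

def log_likelihood(p):
    return log(comb(N, u)) + u * log(p) + (N - u) * log(1 - p)

p_ml_21 = u / N               # MLE under Equation 21 (unconstrained)
p_ml_22 = max(p_ml_21, .6)    # MLE under Equation 22; here the constraint
                              # p > .6 is not binding
aic_21 = -2 * log_likelihood(p_ml_21) + 2 * 1   # Equation 20 with D_k = 1
aic_22 = -2 * log_likelihood(p_ml_22) + 2 * 1

print(aic_21 == aic_22)       # True: AIC cannot distinguish the two models
```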


For relatively simple models there exist likelihood ratio tests that can be used when the model parameters are restricted using inequality constraints (see, for example, Robertson, Wright & Dykstra, 1988, who present a comprehensive overview). However, it is questionable whether that approach can be generalized to constrained latent class models. The next section will introduce likelihood ratio tests evaluated using posterior predictive distributions (Gelman, Meng & Stern, 1996; Meng, 1994; Rubin, 1984). It will be shown that these are easy to compute and valid in the context of inequality constrained models.


Bayesian Model Selection Criteria

A Short Introduction to Bayesian Statistical Inference

Two concepts are central in Bayesian statistical inference: the prior distribution of the model parameters and the posterior distribution of the model parameters. The constraints used to translate a theory into a latent class model are incorporated in the prior distribution g(θ_k | H_k). In our applications the prior distribution is constant for all values of θ_k that are admissible given the constraints in H_k, and zero otherwise. The posterior distribution reflects what can be learned about the parameters from the data, taking into account the prior constraints. The posterior g(θ_k | H_k, X) is proportional to the prior multiplied with the density of the data given the value of the parameters of the model at hand, Pr(X | θ_k):

(23)

g(θ_k | H_k, X) ∝ g(θ_k | H_k) ∏_{i=1}^{N} [ ∑_{q=1}^{Q} ω_q ∏_{j=1}^{J} π_qj^{x_ij} (1 − π_qj)^{1 − x_ij} ].

Hoijtink (1998) shows that it is easy to obtain a sample from this posterior distribution using an application of the Gibbs sampler (see Appendix A for a short description) as presented by Zeger and Karim (1991). This sample θ_k^1, ..., θ_k^m, ..., θ_k^25000 can be used to obtain estimates of, and credibility intervals for, the model parameters taking into account the prior constraints. The expected a posteriori (EAP) estimate of a parameter is simply the average of the 25000 values of that parameter sampled from the posterior distribution. A 95% central credibility interval for this parameter is given by the 2.5th and 97.5th percentiles of the distribution of these 25000 sampled values. In the next two subsections it will be shown that it is easy to compute Bayes factors and posterior predictive p-values for likelihood ratio tests using this sample from the posterior distribution. Note that the computation of EAPs, Bayes factors and posterior predictive p-values involves integration with respect to Equation 23. This integral is approximated using the sample of 25000 parameter vectors, which gives an accurate approximation of the integral.
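As a minimal sketch of these posterior summaries (Python; the Beta(9, 3) stand-in sample is an assumption for illustration — under a uniform prior it is proportional to the posterior of p in the binomial example discussed later in this article, whereas a real latent class analysis would use the Gibbs sample):

```python
import random
import statistics

# Sketch: EAP estimate and 95% central credibility interval from a sample
# of parameter values, as described above. The 'posterior sample' here is
# drawn from a Beta(9, 3) distribution purely for illustration; in a latent
# class analysis the 25000 draws would come from the Gibbs sampler.

random.seed(1)
sample = sorted(random.betavariate(9, 3) for _ in range(25000))

eap = statistics.fmean(sample)               # average of the sampled values
lower = sample[int(.025 * len(sample))]      # 2.5th percentile
upper = sample[int(.975 * len(sample)) - 1]  # 97.5th percentile

print(round(eap, 2))   # close to the Beta(9, 3) mean of .75
```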


Bayes Factors and Marginal Likelihood

Kass and Raftery (1995) present a comprehensive review of Bayes factors. The Bayes factor is a statistic that compares the marginal likelihoods (see Equation 24) of two competing models; the model with the largest marginal likelihood is preferred. The basic idea behind the Bayes factor is the same as the basic idea behind the more familiar information criteria discussed in the section entitled “Maximum Likelihood Based Model Selection Criteria”. It can, for example, be shown (see Kass & Raftery, 1995) that the Bayesian information criterion (BIC; Schwarz, 1978) is an approximation of minus twice the logarithm of the marginal likelihood. Although not explicit in its formulation, the marginal likelihood, like the information criteria, contains a trade off between the likelihood of the parameters given the data and the number of parameters in the model. However, as will be shown in the sequel, the marginal likelihood does not suffer from the same drawback as the information criteria. In the remainder of this article minus twice the logarithm of the marginal likelihood will be used:

(24)

−2 log Pr_k^Bf(X | H_k) = −2 log ∫ Pr(X | θ_k) g(θ_k | H_k) dθ_k,

this brings comparisons of different models on the same scale as the familiar deviance statistics (Kass & Raftery, 1995) which should facilitate the interpretation of the differences that can be observed for different models. There are many different ways to compute Equation 24. One can obtain a sample from the prior of the model parameters and integrate with respect to this sample:

(25)

−2 log Pr_k^Bf(X | H_k) ≈ −2 log [ (1/25000) ∑_{l=1}^{25000} Pr(X | θ_k^l) ],

where θ_k^l denotes the lth parameter vector sampled from the prior of model k. However, as indicated by Kass and Raftery (1995) and verified in our research, the results may vary substantially for different samples of 25000 draws from the prior, that is, the approximation is rather unstable.


In this article the computation will be based on an idea that can be found in Newton and Raftery (1994) and Kass and Raftery (1995). They suggest to sample 99% of the parameter vectors from the posterior distribution (in our applications 99% equals 24750 parameter vectors), and to imagine that 1% (in our applications 250) of the parameter vectors is sampled from the prior distribution, each with a density equal to the marginal likelihood. According to the aforementioned authors an approximation −2 log P̂ of −2 log Pr_k^Bf(X | H_k) is then obtained via a simple iterative algorithm based on Equation 26:

(26)

−2 log P̂ = −2 log [ (250 P̂ + ∑_{m=1}^{24750} Pr(X | θ_k^m) / {.01 + .99 Pr(X | θ_k^m)/P̂}) / (250 + ∑_{m=1}^{24750} 1 / {.01 + .99 Pr(X | θ_k^m)/P̂}) ].

For the latent class models the simple iterative algorithm converged extremely slowly. Appendix B presents both the simple algorithm and a modification that usually converges within a few iterations. Our research shows that in the context of (constrained) latent class analysis this approximation of Equation 24 is very stable: results varied less than .05 for different samples of 24750 from the posterior distribution.

Returning to the binomial example from the previous section, it can be illustrated that the marginal likelihood can be used to distinguish Equation 21 from Equation 22. For both models Pr(u | p, N) is displayed in Figure 1. In Equation 21 p is unrestricted. In Equation 22 p is restricted to be larger than .6 (indicated by the vertical line). As can be seen, ‘on the average’ Pr(u | p, N) is larger for the model with p > .6 than for the model without a restriction on p. Note that these ‘averages’ are in fact the marginal likelihoods of the models with and without a restriction on p: Pr(u | p, N) is integrated with respect to a uniform prior restricted to be larger than .6, and an unrestricted uniform prior, respectively. Computation of Equation 25 (for simple models this approximation is very accurate) renders −2 log ∫_0^1 Pr(u | p, N) g(p | H_1) dp ≈ 4.79, which is larger than −2 log ∫_{.6}^1 Pr(u | p, N) g(p | H_2) dp ≈ 3.22. Stated otherwise, where maximum likelihood based information criteria fail, the Bayes factor is able to make a distinction between models that are based on different sets of inequality constraints.
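The two numbers can be approximated directly with the prior-sampling scheme of Equation 25. The sketch below (Python; function names are illustrative, not from the article) draws 25000 values of p from each prior and averages Pr(u | p, N):

```python
import random
from math import comb, log

# Sketch of the approximation in Equation 25 for the binomial example:
# -2 log of the marginal likelihood of u = 8 successes in N = 10 trials,
# with p sampled from an unrestricted uniform prior (Equation 21) and from
# a uniform prior restricted to p > .6 (Equation 22).

random.seed(1)
N, u = 10, 8

def pr_u_given_p(p):
    return comb(N, u) * p ** u * (1 - p) ** (N - u)

def minus_two_log_marginal(draw_from_prior, n_draws=25000):
    total = sum(pr_u_given_p(draw_from_prior()) for _ in range(n_draws))
    return -2 * log(total / n_draws)

m_h1 = minus_two_log_marginal(lambda: random.uniform(0, 1))   # H1: no restriction
m_h2 = minus_two_log_marginal(lambda: random.uniform(.6, 1))  # H2: p > .6

print(round(m_h1, 2), round(m_h2, 2))  # approximately 4.79 and 3.22
```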


Figure 1. The Pr(u | p, N).

Likelihood Ratio Statistics Evaluated Using Posterior Predictive P-values

As indicated in the section entitled “Maximum Likelihood Based Model Selection Criteria” it is not uncommon to use likelihood ratio statistics to determine which latent class model is superior. Using a simple binomial model it was shown that maximum likelihood based likelihood ratio statistics have a flaw when the goal is to find the best of a number of constrained latent class models. In this section, first of all, two likelihood ratio statistics will be introduced. Subsequently the simple binomial model will be used to explain the basic ideas behind posterior predictive p-values, and to illustrate that these p-values can be used to select the best of a number of constrained latent class models. The section will be concluded with the definition of posterior predictive p-values applied to latent class models.

First the two likelihood ratio statistics that will be used in this article will be introduced. Let P denote the number of different response vectors x_p that are possible. Since the data are dichotomous, this number is 2^J. Let x_pj denote the response given to the jth item in the pth response vector. The likelihood ratio statistic compares, for each response vector, M(x_p | θ_k) with N(x_p). Note that M(x_p | θ_k) denotes the expected number of persons with response vector x_p conditional on θ_k:

(27)

M(x_p | θ_k) = N ∑_{q=1}^{Q} ω_q ∏_{j=1}^{J} π_qj^{x_pj} (1 − π_qj)^{1 − x_pj}.


Note also that N(x_p) denotes the observed number of persons with the pth response vector. The larger the differences between corresponding M(x_p | θ_k) and N(x_p), the less the data support the latent class model, and consequently, the larger the value of the likelihood ratio statistic:

(28)

LR_k(X, θ_k) = −2 ∑_{p=1}^{P} N(x_p) log [ M(x_p | θ_k) / N(x_p) ].

Note that Equation 28 cannot be computed since θ_k is unknown. The classical solution is to replace θ_k by θ_k^ML. The Bayesian solution will be presented in the sequel. Hoijtink (1998) shows that Equation 28 is sensitive to the presence of outliers, that is, persons with a response vector for which M(x_p | θ_k) is (much) smaller than one. This sensitivity is smaller if a pseudo likelihood ratio statistic is used. In this statistic, for each pair of items, the expected number of each possible response pattern (00, 01, 10 and 11) is compared to the corresponding observed number:

(29)

PLR_k(X, θ_k) = −2 ∑_{j≠j′} ∑_{v=0}^{1} ∑_{w=0}^{1} N(x_ij = v, x_ij′ = w) log [ M(x_ij = v, x_ij′ = w | θ_k) / N(x_ij = v, x_ij′ = w) ],

where the expected number of persons with responses v and w to items j and j′, respectively, conditional on θ_k is denoted by

(30)

M(x_ij = v, x_ij′ = w | θ_k) = N ∑_{q=1}^{Q} ω_q π_qj^{v} (1 − π_qj)^{1 − v} π_qj′^{w} (1 − π_qj′)^{1 − w},

and the observed number by N(x_ij = v, x_ij′ = w). Since θ_k is unknown, Equations 28 and 29 cannot be computed. The classical solution is to substitute the unknown quantity with θ_k^ML. However, as illustrated using Equations 21 and 22, it is not always possible to distinguish


between models based on different sets of inequality constraints, since the maximum likelihood estimates of the model parameters may be the same for all models. The Bayesian solution is to compute the posterior predictive p-value (Gelman, Meng & Stern, 1996; Meng, 1994; Rubin, 1984) of both likelihood ratio statistics.

The simple binomial model will be used to illustrate the principles involved in the computation of posterior predictive p-values. The definition applied to latent class models will be given at the end of this section. Let D(·) denote any statistic measuring the discrepancy between the statistical model used and the data. To select the best model in the binomial example the following discrepancy will be used: D(u, p) = |u_odd − ½Np|, where u_odd denotes the number of successes observed during the odd trials of 10 Bernoulli experiments. It can be verified in the section entitled “Maximum Likelihood Based Model Selection Criteria” that in our example u_odd = 4. The larger D(u, p), the larger the discrepancy between the data u and the model used. Like Equations 28 and 29, D(u, p) cannot be computed because p, the parameter in the binomial model, is unknown. This problem can be solved using the posterior distribution of p given the data u and N. The posterior distribution summarizes the available information with respect to p. For Equation 21 this posterior is proportional to the Pr(u | p, N) presented in Figure 1. The posterior can accurately be represented using a sample p^1, ..., p^m, ..., p^25000 from this posterior. Now it becomes possible to obtain an idea about the data matrices that are to be expected if Equation 21 holds. Each of the 25000 p^m can be used to generate a data matrix that is in accordance with Equation 21. The procedure is simple: sample u^m from a binomial distribution with parameters p^m and N for m = 1, ..., 25000. For each p^m two discrepancies can be computed: (a) D(u, p^m), the discrepancy between the observed data and the model; and (b) D(u^m, p^m), the discrepancy between the generated data and the model. If D(u^m, p^m) ≥ D(u, p^m) the discrepancy between the observed data and the model is equal to or smaller than the discrepancy between the generated data and the model. The latter implies that the model accurately describes the observed data. The posterior predictive p-value is the proportion of the 25000 comparisons for which D(u^m, p^m) ≥ D(u, p^m).

The procedure described in the previous paragraph can also be applied to Equation 22. The only difference is that for Equation 22 the posterior distribution is proportional to the Pr(u | p, N) presented in Figure 1 truncated below at .6. The posterior predictive p-value for Equation 21 was .451, the p-value for Equation 22 was .501. It is clear that Equation 22 is preferred over Equation 21. Stated otherwise, posterior predictive p-values can be used to select the best of several models that are constructed using different sets of inequality constraints.
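A minimal sketch of this recipe for the binomial example (Python; function names are illustrative, not from the article). Under a uniform prior the posterior of p given u = 8 is proportional to p^8(1 − p)^2, a Beta(9, 3) density; under the restriction p > .6 the same density is truncated. Because the data are discrete, exact ties D(u^m, p^m) = D(u, p^m) occur with positive probability, so the proportions produced by this literal implementation depend on how ties are counted and need not match the article's .451 and .501:

```python
import random

# Sketch of the posterior predictive p-value computation for the binomial
# example, with discrepancy D(u, p) = |u_odd - (1/2)Np|, u_odd = 4, N = 10.

random.seed(1)
N, u_odd = 10, 4

def draw_posterior_h1():                 # posterior under Equation 21
    return random.betavariate(9, 3)

def draw_posterior_h2():                 # Equation 22: rejection sampling, p > .6
    while True:
        p = random.betavariate(9, 3)
        if p > .6:
            return p

def posterior_predictive_p(draw_posterior, n_draws=25000):
    count = 0
    for _ in range(n_draws):
        p = draw_posterior()
        d_observed = abs(u_odd - N * p / 2)
        # replicate the 5 odd trials of a data set generated under the model
        u_rep_odd = sum(random.random() < p for _ in range(N // 2))
        d_replicated = abs(u_rep_odd - N * p / 2)
        count += d_replicated >= d_observed
    return count / n_draws

print(posterior_predictive_p(draw_posterior_h1),
      posterior_predictive_p(draw_posterior_h2))
```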


Applied to latent class models, the definition of the posterior predictive p-value (Gelman, Meng & Stern, 1996; Meng, 1994; Rubin, 1984) is


(31)

p_k = Pr[D(X^rep, θ_k) ≥ D(X, θ_k) | X, H_k],

that is, the probability that the discrepancy between model k and a data matrix X^rep generated using model k is equal to or larger than the discrepancy between model k and the observed data matrix X. The discrepancy measure is either Equation 28 or Equation 29. The probability is computed over the joint distribution of the generated data X^rep and θ_k conditional on the posterior distribution of the model parameters. That is, using the Gibbs sampler (Hoijtink, 1998) 25000 parameter vectors θ_k^m are sampled from the posterior distribution of model k, and each parameter vector is used to generate a data matrix X_k^{m,rep}. The posterior predictive p-value is then computed as the proportion of times that D(X_k^{m,rep}, θ_k^m) ≥ D(X, θ_k^m). To generate a data matrix, for each person class membership is sampled from a multinomial distribution with probabilities ω_k^m. Subsequently the class specific probabilities π_{q1,k}^m, ..., π_{qJ,k}^m of the class to which a person is assigned are compared with a vector of pseudo random numbers sampled from a U(0, 1) distribution. If a class specific probability is larger than the corresponding random number, the person gives the response 1; otherwise the response 0 is given.

In this section three statistics that can be used for model selection were introduced: (a) minus twice the logarithm of the marginal likelihood (Equation 24); (b) the likelihood ratio statistic (Equation 28) evaluated using its posterior predictive p-value; and (c) the pseudo likelihood ratio statistic (Equation 29) evaluated using its posterior predictive p-value. In the next section a small simulation study will be executed to obtain an indication of the quality of the performance of these statistics.

Simulation Study

Introduction of the Simulation Study

In this section a small simulation study will be used to obtain an indication of the performance of the marginal likelihood, the likelihood and the pseudo likelihood ratio statistics.
There exist numerous examples of competing models where each of the statistics will almost always select the correct model. Imagine numerous data sets sampled from a population described by Theory 1 from Example 1 in the section entitled "Examples of Confirmatory Latent Class Analysis". Imagine further that each of these data sets is
analyzed using a latent class model based on Theory 1 and a latent class model based on Theory 2. Then it is almost certain that the marginal likelihood and posterior predictive p-values obtained for Theory 1 are larger than the corresponding quantities obtained for Theory 2. The theories are not nested, that is, it is not the case that one theory equals the other theory with extra restrictions added. It should be very easy to determine which theory is correct, since one of the sets of constraints is clearly wrong.

Now consider three nested models.

The Unconstrained Latent Class Model. This model will subsequently be referred to as U2 or U3, where 2 and 3 denote the number of latent classes.

The Latent Class Formulation of a Nonparametric Item Response Model. This model is obtained using the restriction

(32)

π1j < π2j < ... < πQj for j = 1, ..., J.

As can be seen, this model implies that the probability of a positive response increases as a function of class number, that is, the first class contains the least able persons and the last class contains the most able persons. Nonparametric item response models are an alternative to more familiar models like the Rasch or two-parameter logistic model. The interested reader is referred to, for example, Junker (1993) and Mokken (1996). Note that this model equals the unconstrained latent class model with extra restrictions added. This model will subsequently be referred to as M2 and M3; again the number denotes the number of latent classes.

The Latent Class Formulation of a Nonparametric Item Response Model in Which both the Persons and the Items are Ordered (Mokken, 1996). This model is obtained using restriction (Equation 32), which ensures that the persons are ordered from less to more able, and

(33)

πq1 > πq2 > ... > πqJ for q = 1, ..., Q,

which ensures that within each latent class the items are ordered from easy (high probabilities for the correct response) to difficult. This model will subsequently be referred to as D2 and D3. If a data set is analyzed with these models it should be rather difficult to determine which is the best model. If the third model is correct, the second and first models are also correct (but less parsimonious, less informative and thus less desirable). If the second model is correct, so is the first model. Stated otherwise, finding the best of a number of nested models is a real challenge for the model selection procedures proposed in this article. If the
statistics perform satisfactorily in this context, it is likely that they will also perform satisfactorily in the context of nonnested or partly nested models.
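The defining constraints of the three nested models can be checked mechanically. A minimal sketch (the function names are mine, not from the article), with the class specific probabilities stored as a Q × J array:

```python
import numpy as np

def satisfies_U(probs):
    """Unconstrained model (U2, U3): any valid probabilities will do."""
    return bool(np.all((probs > 0) & (probs < 1)))

def satisfies_M(probs):
    """Equation 32 (M2, M3): for every item j the probability of a positive
    response strictly increases with the class number."""
    return satisfies_U(probs) and bool(np.all(np.diff(probs, axis=0) > 0))

def satisfies_D(probs):
    """Equations 32 and 33 (D2, D3): persons ordered across classes and,
    within every class, items ordered from easy to difficult."""
    return satisfies_M(probs) and bool(np.all(np.diff(probs, axis=1) < 0))
```

Because the models are nested, satisfies_D implies satisfies_M, which implies satisfies_U, mirroring the remark that a correct third model makes the second and first models correct as well.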


Results of the Simulation Study

In Table 1 the populations used to generate data are described. As can be seen, population 1 can be described by an unconstrained latent class model with two latent classes. However, the most appropriate description is given by the (more restrictive) nonparametric item response model in which only the persons are ordered. A completely wrong description would be obtained using the (too restrictive) item response model in which both the persons and the items are ordered. An item response model in which only the persons are ordered with three latent classes describes population 2, and a three-class item response model in which both the persons and the items are ordered describes population 3. Each population was used to generate ten data matrices, each containing the responses of 500 persons to 10 items. The procedure by which data matrices can be generated was the same as the procedure described at the end of the previous section.

Table 1
Description of the Populations Used to Simulate Data

        Pop. 1         Pop. 2               Pop. 3
 i    πi1   πi2     πi1   πi2   πi3      πi1   πi2   πi3
 1    .30   .75     .20   .55   .75      .60   .70   .80
 2    .35   .70     .25   .50   .80      .55   .65   .75
 3    .30   .75     .20   .55   .75      .50   .60   .70
 4    .35   .70     .25   .50   .80      .45   .55   .65
 5    .30   .75     .20   .55   .75      .40   .50   .60
 6    .35   .70     .25   .50   .80      .35   .45   .55
 7    .30   .75     .20   .55   .75      .30   .40   .50
 8    .35   .70     .25   .50   .80      .25   .35   .45
 9    .30   .75     .20   .55   .75      .20   .30   .40
10    .35   .70     .25   .50   .80      .15   .25   .35
 ω    .30   .70     .20   .30   .50      .20   .30   .50
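As an illustration of the generation step, the following sketch (my own code, with the parameter values copied from Table 1) draws the ten data matrices for population 1:

```python
import numpy as np

rng = np.random.default_rng(2001)

# Population 1 of Table 1: two latent classes with weights .30 and .70;
# the class specific probabilities alternate .30/.35 (class 1) and
# .75/.70 (class 2) over the 10 items.
weights = np.array([0.30, 0.70])
probs = np.array([[0.30, 0.35] * 5,
                  [0.75, 0.70] * 5])

def simulate_data_matrix(n_persons=500):
    """Sample class membership per person, then compare the class specific
    probabilities with U(0,1) draws, as described in the text."""
    classes = rng.choice(2, size=n_persons, p=weights)
    u = rng.random((n_persons, probs.shape[1]))
    return (u < probs[classes]).astype(int)

data_matrices = [simulate_data_matrix() for _ in range(10)]
```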


Each of the ten data matrices generated using population 1 was analysed using U2, M2 and D2. The goal of this simulation was to determine if the three methods that can be used for model selection prefer M2 over U2 and disqualify D2. For each model -2 log Pr(X|Hk) and the posterior predictive p-values of the likelihood ratio statistics were computed. The results were the same for each of the three methods: D2 was disqualified, and M2 was preferred to U2 since it is the more parsimonious model and the support for both models was virtually the same. The last sentence needs some elaboration. For the first of the ten data matrices the p-values of the pseudo likelihood ratio statistic for U2 and M2 were .524 and .529, respectively. Does this imply that the data support M2 more than U2? We consider the support to be equal. The consequence is that M2 is preferred because it is more parsimonious than U2. Minus twice the logarithm of the marginal likelihood of U2 and M2 was also very similar for the first data matrix generated: 6451.93 and 6451.25, respectively. Kass and Raftery (1995) state that differences smaller than 2 are barely worth mentioning, that is, M2 is preferred because it is more parsimonious than U2.

Each of the ten data matrices generated using population 2 was analysed using U3, M3 and D3. The results are displayed in Table 2. The goal of this simulation was to determine if the three methods prefer M3 over U3 and disqualify D3. Again all three model selection methods agreed that D3 had

Table 2
Results of the Analyses of the Ten Data Matrices Simulated Using Population 2

          -2 log Pr(X|Hk)             Post. Pred. LR      Post. Pred. PLR
Data
Matrix    U3       M3       D3        U3    M3    D3      U3    M3    D3
 1      6378.63  6377.60  6407.40    .129  .197  .027    .511  .520  .028
 2      6355.92  6358.14  6379.49    .212  .332  .017    .507  .525  .010
 3      6245.43  6243.24  6290.79    .166  .260  .039    .467  .493  .004
 4      6409.31  6404.78  6433.55    .576  .722  .375    .534  .538  .010
 5      6330.90  6328.62  6356.75    .009  .017  .000    .419  .426  .001
 6      6327.59  6329.10  6352.61    .416  .575  .185    .495  .509  .056
 7      6376.44  6369.77  6407.83    .798  .862  .378    .503  .503  .000
 8      6229.12  6227.39  6279.20    .306  .406  .106    .521  .537  .005
 9      6288.69  6283.91  6303.45    .316  .523  .111    .447  .481  .006
10      6282.46  6285.38  6304.23    .026  .051  .004    .513  .523  .037


to be disqualified. To give an example: the support for each of the three models [-2 log Pr(X|Hk)] given by the first data matrix generated was 6378.63, 6377.60 and 6407.40, respectively. The posterior predictive p-values of the likelihood ratio statistic were .130, .198 and .027, respectively. The p-values of the likelihood and pseudo likelihood ratio statistics were always (although sometimes by a small amount) higher for M3 than for U3, that is, according to these statistics M3 has to be preferred. For 7 data matrices -2 log Pr(X|Hk) was between 2 and 7 smaller for M3 (indicating preference for M3); for the other 3 data matrices it was between 2 and 3 larger. For population 2 the likelihood ratio statistics appear to perform somewhat better than -2 log Pr(X|Hk).

Each of the ten data matrices generated using population 3 was analysed using M3 and D3. The goal was to find out if D3 is preferred over M3. The posterior predictive p-value of the likelihood ratio statistic was always (between .005 and .09) higher for M3. The likelihood ratio statistic is not able to select the correct model. The posterior predictive p-value of the pseudo likelihood ratio statistic was always (between .02 and .15) higher for D3, that is, an excellent performance. Using -2 log Pr(X|Hk), D3 was preferred over M3 for eight of the data matrices (5 differences were 3 or more in favor of D3, 3 differences were about 0, which leads to a preference for D3 since it is more parsimonious). For two data matrices M3 was preferred over D3 (differences of 2 or more in favor of M3).

The main results of the simulation study can be summarized as follows.

1. The likelihood ratio statistic evaluated using posterior predictive p-values should not be used. It was not able to select the correct model for population 3.

2. The performance of the pseudo likelihood ratio statistic evaluated using posterior predictive p-values was rather good. It always selected the correct model.

3. The performance of -2 log Pr(X|Hk) was not bad; the correct model was chosen in 25 out of 30 trials.

Discussion of the Simulation Study

The question that remains is how well the pseudo likelihood ratio statistic and the marginal likelihood perform outside the context of the simulation study. The problem that had to be solved in the simulation study was both difficult and easy. It was difficult because the best of a number of nested models had to be selected (as indicated in the introduction to this section, selecting the best of a number of nonnested models is much easier), easy because the populations were chosen such that it was relatively easy to


recognize which set of restrictions was the best. As illustrated by Examples 1 and 2 below, the latter is not always the case.

Example 1. Imagine a population equally distributed over two latent classes. Suppose there are 10 items. In class 1 (π11, ..., π1J) = (.10, ..., .10) and in class 2 (π21, ..., π2J) = (.11, ..., .11). Imagine another population, also equally distributed over two latent classes, with (π11, ..., π1J) = (.10, ..., .10) and (π21, ..., π2J) = (.70, ..., .70). Finally imagine that a data set is sampled from both populations, and that both data sets are analysed using an unconstrained latent class model with two classes and a model constrained using Equation 32, also with two classes. If the class specific probabilities are clearly different in both classes (i.e., .10 and .70) there is a very high probability that the correct model (Equation 32) will be preferred over the unconstrained latent class model (see also population 1 in the section entitled "Simulation Study"). However, if the class specific probabilities are rather similar in both classes (i.e., .10 and .11) it is very unlikely that Equation 32 will be preferred over the unconstrained latent class model (due to sample fluctuations about half of the class specific probabilities in the sample will be smaller in class 1 than in class 2). A very large sample size is needed before the sample fluctuations are reduced to such a degree that in the sample most of the class specific probabilities are smaller in class one than in class two.

Example 2. Consider again the first population described in the previous example. Suppose a researcher wants to use the data set to determine if this population consists of one, two or three unconstrained latent classes. His conclusion will probably be that one latent class is sufficient. Very large samples are needed before it will become clear that the population consists of two latent classes. Now consider the second population described in the previous example.
If the sample size is around 500, the researcher will probably conclude that the population should be described using three latent classes: one class with class specific probabilities of about .10, one class with probabilities of about .40, and one class with probabilities of about .70.

A situation like Example 1, in which both statistics are likely to fail, may never occur in practice. The development of a theory implies either or both of experience based on the analysis of other data sets and careful thinking. There are probably no previous data sets that suggest that the first population from Example 1 should be described using 2 latent classes. It is also unlikely that careful thinking would direct a researcher in the direction of 2 latent classes if both classes are almost the same. In the behavioral sciences there are not many theories that are very minute and detailed. Most theories describe clear cut tendencies and differences. When these theories are translated into latent class models, both the pseudo likelihood ratio statistic and Bayes factors should be able to determine which theory receives the most support from the data.
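The point Example 1 makes about sample fluctuations is easy to reproduce. In this sketch (hypothetical code, not from the article) two classes of 250 persons each are sampled repeatedly; with π1j = .10 and π2j = .11 a large share of the item-wise sample proportions come out in the wrong order, whereas with .10 versus .70 essentially none do:

```python
import numpy as np

rng = np.random.default_rng(7)

def reversal_rate(p1, p2, n_per_class=250, n_items=10, reps=500):
    """Fraction of item-wise sample proportions with the wrong order
    (class 1 proportion >= class 2 proportion) although p1 < p2 holds
    for every item in the population."""
    f1 = rng.binomial(n_per_class, p1, size=(reps, n_items)) / n_per_class
    f2 = rng.binomial(n_per_class, p2, size=(reps, n_items)) / n_per_class
    return float(np.mean(f1 >= f2))
```

Here reversal_rate(.10, .11) comes out close to the "about half" mentioned in the text, while reversal_rate(.10, .70) is essentially zero.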


A situation like Example 2 will occur in practice. Imagine an arithmetic reasoning test consisting of 10 items (the responses are either correct or incorrect). Data obtained using this test can be analysed using the nonparametric item response model (Equation 32). The result is Q ordered classes: class 1 contains the persons whose arithmetic reasoning ability is rather small, class Q contains the persons with a relatively large ability. Usually a researcher wants to know how many ability classes can be distinguished. As indicated in Example 2, it may very well be that the number of ability classes is under- or over-estimated. The best course of action open to a researcher is (a) to use the pseudo likelihood ratio test and Bayes factors to select the values of Q that are reasonable, and (b) to use the class specific probabilities to determine which of the reasonable values of Q is to be preferred. For example, if the pseudo likelihood ratio test and Bayes factors indicate that both Q = 2 and Q = 3 appear to be reasonable, then Q = 3 should be preferred if the class specific probabilities are clearly different for each of the three classes. Note that it is up to the researcher to define 'clearly different' or, maybe more appropriately, 'relevant from the perspective of the researcher'.

A situation like Example 2 may also occur in the context of exploratory latent class analysis, that is, when determining how many unconstrained classes should be used to describe the data. A study by Everitt (1988) indicates that it may be difficult to determine the number of classes that should be used. It may very well be that the approach suggested in the previous paragraph is also appropriate in the context of exploratory latent class analysis. However, since exploratory analyses are not the subject of this article, the latter will not be elaborated.
In summary it can be concluded that the pseudo likelihood ratio statistic and the marginal likelihood are expected to perform satisfactorily with: (a) non-nested models (with or without an equal number of classes); and (b) nested models when the parameter values are clearly different for each of the classes in the population. Examples 1 and 2 illustrate situations in which both statistics are likely to fail: (a) nested models when, in the population, the parameter values are rather similar for some of the classes; and (b) when for a specific model (i.e., the unconstrained latent class model or Equation 32) the number of classes has to be determined. As indicated before, one should not rely only on the pseudo likelihood ratio test and the marginal likelihood to select the correct model. Inspection of the estimates of the class specific probabilities for a number of models that have reasonable values for both statistics should also be used when selecting a model. Finally, since this simulation study is rather limited, it would be premature to conclude that the pseudo likelihood ratio statistic is superior. There may be situations not covered by this study where -2 log Pr(X|Hk) outperforms the pseudo likelihood ratio statistic. It seems wise to use both methods when


the best of a number of competing models has to be selected. Further research could be aimed at a more elaborate comparison of the performance of the pseudo likelihood ratio statistic and Bayes factors. The simulation design is rather limited with respect to variations in the number of items and persons. However, it is expected that the results hold for 5 ≤ J ≤ 15 and 200 ≤ N ≤ 800, which usually covers the dimensions of data sets that are analysed using latent class models. Further research should mainly be aimed at the evaluation of applications of the proposed methods to empirical data. The latter will (a) indicate if these methods are empirically viable, and (b) lead to new research questions related to the topic of this article.

Example

The data to be analysed in this section concern the responses of 400 Dutch persons (181 male, 219 female) to 13 items with respect to marriage properties (see Table 3 for the phrasing of the items [translated from Dutch]). The responses were coded 0 if the item at hand was not considered a desirable property of a marriage, and 1 otherwise. The 400 persons are a random sample from a larger sample that was interviewed during the 1981 European value studies. The data can be obtained from the author.

The first theory states that persons differ with respect to the degree to which they consider the marriage properties to be important. This theory can be translated into a latent class model using Equation 32 with Q = 3 (S3): the higher the class number, the higher the probability that a marriage property is considered to be important. The second theory states that persons differ with respect to the relative importance they attach to the marriage properties.
Three types of persons are distinguished: persons who consider items 2, 6, 7, 9 and 10 to be most important (these persons might be labelled 'yuppies'); persons who consider items 1, 4, 5, 7, 8 and 12 to be most important (these persons might be labelled 'traditional'); and persons who consider items 3, 4, 8, 11, 12 and 13 to be most important (these persons might be labelled 'of age'). The relative importance attached by the yuppies can be translated into a latent class model (T3) using

(34)

π1j ≥ π1j′ for j = 2, 6, 7, 9, 10 and j′ = 1, 3, 4, 5, 8, 11, 12, 13.

For the traditional persons (35)

π2j ≥ π2j′ for j = 1, 4, 5, 7, 8, 12 and j′ = 2, 3, 6, 9, 10, 11, 13,

can be used, and for the 'of age' persons


Table 3
Analysis of Marriage Properties, N = 400

                                                   U3               S3               T3
 i  Phrasing                              %    πi1  πi2  πi3    πi1  πi2  πi3    πi1  πi2  πi3
 ω                                             .22  .58  .19    .24  .61  .14    .22  .53  .24
 1  Faithfulness                          83   .66  .84  .93    .65  .86  .93    .81  .78  .85
 2  An adequate income                    31   .12  .24  .77    .10  .28  .81    .93  .14  .51
 3  Being of the same social background   21   .07  .12  .70    .06  .14  .78    .34  .08  .96
 4  Mutual respect and appreciation       88   .63  .94  .95    .66  .93  .97    .82  .80  .95
 5  Shared religious beliefs              20   .05  .13  .59    .04  .16  .67    .42  .17  .38
 6  Good housing                          42   .04  .40  .89    .03  .46  .92    .92  .13  .64
 7  Agreement on politics                 12   .02  .06  .46    .01  .07  .52    .95  .23  .23
 8  Understanding and tolerance           86   .64  .90  .95    .66  .90  .97    .81  .78  .96
 9  Living apart from your in-laws        56   .31  .64  .59    .36  .60  .68    .86  .06  .67
10  Happy sexual relationships            68   .27  .76  .91    .31  .76  .95    .89  .06  .86
11  Sharing household chores              29   .07  .32  .48    .09  .32  .51    .30  .13  .94
12  Children                              47   .10  .50  .84    .09  .54  .86    .65  .30  .92
13  Tastes and interests in common        32   .12  .28  .66    .11  .31  .70    .43  .15  .94


(36)

π3j ≥ π3j′ for j = 3, 4, 8, 11, 12, 13 and j′ = 1, 2, 5, 6, 7, 9, 10.
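Constraints 34 through 36 all have the same 'relative importance' form, so a single helper can verify any of them. A sketch (my own code; the example row in the usage check below uses the T3 class 1 estimates as read from Table 3):

```python
def satisfies_importance(probs_row, important):
    """Equations 34-36: every 'most important' item must have a class
    specific probability at least as large as every remaining item.
    Items are numbered 1..13, as in the text."""
    rest = [j for j in range(1, len(probs_row) + 1) if j not in important]
    return (min(probs_row[j - 1] for j in important)
            >= max(probs_row[j - 1] for j in rest))

# Most important items per T3 class (Equations 34, 35 and 36).
T3_IMPORTANT = {"yuppies": [2, 6, 7, 9, 10],
                "traditional": [1, 4, 5, 7, 8, 12],
                "of age": [3, 4, 8, 11, 12, 13]}
```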

Since the researcher is not quite certain whether any of his theories makes sense, he also analyzed the data with an unconstrained latent class model with 3 classes (U3). To be taken seriously, the support for the best theory should be about equal to or higher than the support for the unconstrained latent class model. Looking at Table 4 it can be seen that T3 receives little support from the data, that according to -2 log Pr(X|Hk) S3 receives the most support, and that the posterior predictive p-value of the pseudo likelihood ratio test is about equal for S3 and U3, that is, S3 appears to be a sensible theory. The EAP estimates of the class weights and the class specific probabilities are given for each of the three models in Table 3. As can be seen, for S3 and T3 the estimates are in accordance with the constraints for these models.

Table 4
Evaluation of Different Models for the Data with Respect to Marriage Properties

Model    -2 log Pr(X|Hk)    Post. Pred. PLR
U3           5277.04             .188
S3           5272.36             .182
T3          11118.95             .000
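The decision rule applied throughout the article — differences in -2 log marginal likelihood below the Kass and Raftery (1995) threshold of 2 count as equal support, in which case the more parsimonious model wins — can be sketched as follows (the function and argument names are mine):

```python
def prefer(model_a, model_b, threshold=2.0):
    """Each model is (name, minus2_log_marginal_likelihood, parsimony),
    where a LOWER parsimony number means a MORE parsimonious (more
    constrained) model. A smaller -2 log ML means more support."""
    name_a, m2_a, pars_a = model_a
    name_b, m2_b, pars_b = model_b
    if abs(m2_a - m2_b) < threshold:          # equal support: parsimony decides
        return name_a if pars_a <= pars_b else name_b
    return name_a if m2_a < m2_b else name_b
```

Applied to Table 4, prefer(("S3", 5272.36, 0), ("U3", 5277.04, 1)) returns "S3": the difference of 4.68 exceeds 2, so S3 wins on support alone.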

References

Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52, 317-332.
Anraku, K. (1999). An information criterion for parameters under a simple order restriction. Biometrika, 86, 141-152.
Bozdogan, H. (1987). Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52, 345-370.
Croon, M. A. (1990). Latent class analysis with ordered latent classes. British Journal of Mathematical and Statistical Psychology, 43, 171-192.
Croon, M. A. (1991). Investigating Mokken scalability of dichotomous items by means of ordinal latent class analysis. British Journal of Mathematical and Statistical Psychology, 44, 315-331.
Croon, M. A. (1993). Ordinal latent class analysis for single-peaked items. Kwantitatieve Methoden, 42, 128-142.
Everitt, B. S. (1988). A Monte Carlo investigation of the likelihood ratio test for number of classes in latent class analysis. Multivariate Behavioral Research, 23, 531-538.


Gelman, A., Meng, X. & Stern, H. S. (1996). Posterior predictive assessment of model fitness via realized discrepancies (with discussion). Statistica Sinica, 6, 733-807.
Hoijtink, H. (1998). Constrained latent class analysis using the Gibbs sampler and posterior predictive p-values: Application to educational testing. Statistica Sinica, 8, 691-711.
Hoijtink, H. & Molenaar, I. W. (1997). A multidimensional item response model: Constrained latent class analysis using the Gibbs sampler and posterior predictive checks. Psychometrika, 62, 171-189.
Junker, B. W. (1993). Conditional association, essential independence and monotone unidimensional item response models. The Annals of Statistics, 21, 1359-1378.
Kass, R. E. & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
Lin, T. H. & Dayton, C. M. (1997). Model selection information criteria for non-nested latent class models. Journal of Educational and Behavioral Statistics, 22, 249-264.
McCutcheon, A. L. (1987). Latent class analysis. London: Sage.
Meng, X. (1994). Posterior predictive p-values. The Annals of Statistics, 22, 1142-1160.
Mokken, R. J. (1996). Nonparametric models for dichotomous responses. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 351-368). New York: Springer.
Newton, M. A. & Raftery, A. E. (1994). Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society, Series B, 56, 3-48.
Robertson, T., Wright, F. T. & Dykstra, R. L. (1988). Order restricted statistical inference. New York: John Wiley.
Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12, 1151-1172.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461-464.
Zeger, S. L. & Karim, M. R. (1991). Generalized linear models with random effects: A Gibbs sampling approach. Journal of the American Statistical Association, 86, 79-86.

Accepted October, 2000.

Appendix A
A Short Description of the Gibbs Sampler

This appendix gives a short description of the application of the Gibbs sampler that was used to obtain a sample θ1, ..., θm, ..., θ25000 from Equation 23. The interested reader is referred to Hoijtink (1998) for a detailed description of the procedure. The Gibbs sampler is an iterative procedure. In iteration r = 0 initial values have to be provided for the class weights and the class specific probabilities. Any set of values that is in agreement with the constraints imposed upon the parameters can be used. Each iteration r = 1, ..., R consists of three steps.

1. For i = 1, ..., N sample class membership ζi,r ∈ {1, ..., Q} from its posterior distribution given the current values of the class weights, the class


specific probabilities and the data. This posterior is a multinomial distribution with probabilities


(37)

ωq,r−1 ∏j=1,...,J πqj,r−1^xij (1 − πqj,r−1)^(1−xij)  /  Σq′=1,...,Q ωq′,r−1 ∏j=1,...,J πq′j,r−1^xij (1 − πq′j,r−1)^(1−xij)

for q = 1, ..., Q.

2. For q = 1, ..., Q and j = 1, ..., J sample πqj from its posterior distribution given the current values of ζi for i = 1, ..., N and the data. This posterior is a (truncated) beta distribution with parameters sqj,r + 1 and Nq,r − sqj,r + 1, where Nq,r denotes the number of persons allocated to class q in iteration r, and sqj,r denotes the number of persons allocated to class q in iteration r that respond positively to item j. Note that the beta distribution is truncated because the sampled value for πqj is only acceptable if it is in accordance with the inequality constraints involving πqj.

3. Sample the class weights from their posterior distribution given the current values of ζi for i = 1, ..., N. This posterior is a (truncated) Dirichlet distribution with parameters N1,r + 1, ..., NQ,r + 1. Note that the Dirichlet distribution is truncated because the values sampled for the weights are only acceptable if they are in accordance with the inequality constraints involving the weights.

For all analyses executed in the article the Gibbs sampler was run for 110000 iterations. After a burn-in period of 10000 iterations the values sampled in the second and third step of each fourth iteration were saved. The result is θ1, ..., θm, ..., θ25000.

Appendix B
An Algorithm for Computation of the Logarithm of the Marginal Likelihood

For the latent class model

(38)

log P̂t+1 = log [ (250 P̂t + Σk=1,...,24750 Pr(X|θk) / (.01 + .99 Pr(X|θk)/P̂t)) / (250 + Σk=1,...,24750 1 / (.01 + .99 Pr(X|θk)/P̂t)) ]


log P̂ cannot be obtained using the following algorithm (implicitly suggested by Newton & Raftery, 1994, and Kass & Raftery, 1995).

1. In iteration t = 0 assign an initial value to P̂t.
2. Compute log P̂t+1.
3. Set P̂t = P̂t+1.
4. Repeat the second and the third step until the difference between log P̂t and log P̂t+1 is very small.

For the latent class model the convergence rate of this algorithm is so slow that it is impossible to determine when differences between log P̂t and log P̂t+1 are so small that the algorithm has converged. The following algorithm converges fast and with known precision.

1. In iteration 0 set log P̂0 = 0 and set the 'stepsize' S = 100.
2. Compute log P̂1 (which is smaller than log P̂0) and replace log P̂1 by log P̂0 − S.
3. For t = 1, ... compute log P̂t+1 using Equation 38. If (a) log P̂t+1 < log P̂t and log P̂t < log P̂t−1, then log P̂t+1 = log P̂t − S; (b) log P̂t+1 > log P̂t and log P̂t < log P̂t−1, then S = .6 × S and log P̂t+1 = log P̂t + S; (c) log P̂t+1 > log P̂t and log P̂t > log P̂t−1, then log P̂t+1 = log P̂t + S; and (d) log P̂t+1 < log P̂t and log P̂t > log P̂t−1, then S = .6 × S and log P̂t+1 = log P̂t − S.
4. Repeat the third step until S is smaller than the required precision (in this article .00001).
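The second algorithm, together with the Equation 38 update it repeatedly evaluates, can be sketched as follows. This is an illustrative reimplementation under my reading of Equation 38, not the author's program; the likelihood values fed to it are toy numbers, not from the article.

```python
import math

def eq38_update(log_p, log_liks):
    """One evaluation of Equation 38: a new log P-hat from the current
    log P-hat and the log-likelihood of each saved parameter draw."""
    p = math.exp(log_p)
    num, den = 250.0 * p, 250.0
    for ll in log_liks:
        lik = math.exp(ll)
        d = 0.01 + 0.99 * lik / p
        num += lik / d
        den += 1.0 / d
    return math.log(num / den)

def log_marginal(log_liks, tol=1e-5):
    """Appendix B step-size search: move log P-hat by S in the direction
    Equation 38 points to, shrinking S by .6 whenever that direction
    reverses, until S drops below the required precision."""
    prev, cur, S = 0.0, -100.0, 100.0        # steps 1 and 2
    while S > tol:
        nxt = eq38_update(cur, log_liks)
        if (nxt > cur) != (cur > prev):      # cases (b) and (d): reversal
            S *= 0.6
        prev, cur = cur, cur + (S if nxt > cur else -S)
    return cur
```

A convenient check on the Equation 38 reading: when every draw has the same log-likelihood c, its fixed point is exactly c, and the search settles there.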

