PSYCHOMETRIKA

2013 DOI: 10.1007/s11336-013-9379-4

HIERARCHICAL BAYESIAN MODELING FOR TEST THEORY WITHOUT AN ANSWER KEY

ZITA ORAVECZ, ROYCE ANDERS, AND WILLIAM H. BATCHELDER

UNIVERSITY OF CALIFORNIA, IRVINE

Cultural Consensus Theory (CCT) models have been applied extensively across research domains in the social and behavioral sciences in order to explore shared knowledge and beliefs. CCT models operate on response data, in which the answer key is latent. The current paper develops methods to enhance the application of these models by developing the appropriate specifications for hierarchical Bayesian inference. A primary contribution is the methodology for integrating the use of covariates into CCT models. More specifically, both person- and item-related parameters are introduced as random effects that can respectively account for patterns of inter-individual and inter-item variability.

Key words: Cultural Consensus Theory, Bayesian statistics, hierarchical model, covariate modeling.

1. Introduction

Cultural Consensus Theory (CCT) explores the shared knowledge of respondents while relying on formal cognitive and measurement models (Batchelder & Romney, 1988; Romney & Batchelder, 1999; Romney, Weller, & Batchelder, 1986). A typical data set for CCT analysis involves a set of respondents answering questions that pertain to some shared knowledge domain. The knowledge domain need not be factual, and may often be a cultural knowledge domain. For example, domains have involved illness beliefs (Hruschka, Kalim, Edmonds, & Sibley, 2008; Baer, Weller, de Alba Garcia, Glazer, Trotter, Pachter, et al., 2003; Weller, Baer, Pachter, Trotter, Glazer, de Alba Garcia, et al., 1999), judgments of personality traits (Iannucci & Romney, 1990), and national consciousness (Yoshino, 1989). In each application, CCT models assess the assumption of a single consensus answer key, and infer this consensus-based truth that is shared by the respondents, while also accounting for their level of knowledge and guessing tendencies. When compared to simple, commonly practiced aggregation techniques, CCT-based information aggregation has proven to be superior (see, e.g., Weller, Pachter, Trotter, & Baer, 1993).

The focus of the present paper is to enhance application techniques for the General Condorcet Model (GCM, Batchelder & Romney, 1988; Karabatsos & Batchelder, 2003), which is a CCT model for dichotomous True/False data. A special case of the GCM, in which only the latent answer key and person-specific abilities are estimated, while the rest of the parameters are set to fixed values, is currently the most widely used CCT model (e.g., recent applications in Bimler, 2013; Miller, 2011; Hopkins, 2011). However, these other parameters can model important aspects of decision making, such as person-specific acquiescence (guessing) bias and differential item difficulty. For instance, some people may be biased to more often respond 'True' rather than 'False' when guessing.
Secondly, it is unrealistic to assume that a questionnaire composed for any given topic has all equally difficult items, or that all items are equally salient for the group of respondents.

Requests for reprints should be sent to Zita Oravecz, UCI, Department of Cognitive Sciences, 3213 Social & Behavioral Sciences Gateway Building, Irvine, CA 92697-5100, USA. E-mail: [email protected]

© 2013 The Psychometric Society


The present paper demonstrates how GCMs can be extended hierarchically so that inter-individual and inter-item differences can be analyzed in terms of random effects. Models that can capture individual differences in various psychological traits have been extensively applied in the field of personality psychology (hierarchical/multilevel models, see Raudenbush & Bryk, 2002; Snijders & Bosker, 1999). However, individual variability in psychologically interpretable parameters of cognitive models has only been in focus since Batchelder and Riefer (1999) and Riefer, Knapp, Batchelder, Bamber, and Manifold (2002). Consequently, the number of cognitive psychometric models has only recently begun to increase. For example, Klauer (2010), Lee (2011), Rouder and Lu (2005), Scheibehenne, Rieskamp, and Wagenmakers (2013), Smith and Batchelder (2010), and Wetzels, Vandekerckhove, Tuerlinckx, and Wagenmakers (2010) have shown how cognitive models can benefit from taking into account individual variation. Advantages of modeling items as random effects are demonstrated by De Boeck (2008). Generally speaking, in random-effect modeling the item- or person-specific traits are considered to be a sample from a population to which one aims to generalize the results. Additionally, we introduce the methodology for incorporating person and item covariates into the GCM. While covariate modeling is often utilized by models in fields such as personality psychology and educational measurement, adding covariate information to cognitive models has not been a mainstream practice. We aim to demonstrate this technique and its merits so that it can later be generalized to other types of CCT models, or incorporated into other cognitive models. Finally, two techniques are described and compared for modeling the population distribution of probability variables in CCT models. The first is a hierarchical Gaussian approach, which involves the logit-transformation of the unit-scale parameters in the GCM.
This hierarchical Gaussian structure is newly formulated for CCT models and it is shown to provide a flexible framework. The second technique avoids the logit-transformation of probability parameters by using the beta distribution as a population distribution (instead of the Gaussian) to directly model the parameters which are between 0 and 1, as described for parameters in similar models (Batchelder & Anders, 2012; Smith & Batchelder, 2010). In addition, the incorporation of person and item covariates is developed for each of these approaches, and illustrated with an application to real data. The paper is organized as follows. First, a summary of the most important properties of the GCM is provided. Then the extension of adding a population level is developed to provide the hierarchical General Condorcet Model (HGCM). Next, a straightforward way of introducing covariate information for the HGCM is shown. Then statistical inference for the HGCM is derived in the Bayesian framework. Following these specifications, the usefulness of these extensions is demonstrated on a real data set pertaining to judgments of grammaticality for various types of English phrases. Finally, the discussion reflects on the results and the properties of the two hierarchical modeling choices that were applied.

2. Properties of the General Condorcet Model

In a GCM setting, a set of respondents answer a number of True/False questions which measure aspects of the same underlying knowledge space. While in the present paper the focus is on dichotomous data, the GCM has been extended to other response formats as well (e.g., Batchelder, Strashny, & Romney, 2010).

2.1. Data

Assume that each person i = 1, . . . , N answers 'True' or 'False' to each of a set of k = 1, . . . , M questions. Then the data set Y consists of N × M answers, typically coded in the

ZITA ORAVECZ, ROYCE ANDERS, AND WILLIAM H. BATCHELDER

following way:

    Yik = 1 if i responds 'True' to item k,
          0 if i responds 'False' to item k.

It is important to note that the GCM works with the complete N × M data of dichotomous responses, not just the marginals.

2.2. The Parametrization of the General Condorcet Model

For an axiom-based description of the GCM please consult Karabatsos and Batchelder (2003). Here the model is instead described by a thorough interpretation of its parameters. First, the model specifies item-specific model parameters, Zk ∈ {0, 1}, which represent the 'correct' (consensus) answers to the items, and they are dichotomously coded to correspond to the response data:

    Zk = 1 if item k is 'True',
         0 if item k is 'False'.

Based on the answer key Z = (Zk), latent hit and false-alarm rates can be defined as Hik = Pr(Yik = 1 | Zk = 1) and Fik = Pr(Yik = 1 | Zk = 0). So far, the model can be viewed as a general signal detection model, where the correct answers are latent parameters and the hit- and false-alarm parameters are heterogeneous over both items and respondents. The GCM reparameterizes the hit and false-alarm rates with the double high threshold (DHT) model (see, e.g., Macmillan & Creelman, 2005; Morey, 2011) to describe the cognitive processes of the respondents. The DHT model assumes that a person either knows and responds with the correct answer to an item k with probability Dik, or, with probability 1 − Dik, the response is made by guessing. The guess is driven by the person-specific guessing bias parameter gi, which is the probability of guessing 'True' when the answer is not known. Based on the DHT model, the hit and false-alarm rates are parameterized, for all i and k, as

    Hik = Dik + (1 − Dik)gi,    Fik = (1 − Dik)gi,    (1)

where 0 < Dik, gi < 1. As pointed out in Batchelder and Romney (1988), it is necessary that Hik ≥ Fik for all i, k to identify the model. The DHT model described in Equation (1) satisfies this constraint by definition. Moreover, in Appendix A the formal proof of identifiability for the GCM is derived. The underlying process model of the GCM is illustrated by a tree diagram in Figure 1. The first split in the tree represents the latent state of item k and the remaining splits describe the response model. There are two detection thresholds, one for items with Zk = 1 and one for items with Zk = 0; in this way, the items are never incorrectly identified (or 'Detected') as 'True' or 'False'. A key assumption of the GCM is that the response random variables, Y = (Yik), satisfy a special conditional independence assumption given by

    Pr(Y = y_{N×M} | Z, H, F) = ∏_{i=1}^{N} ∏_{k=1}^{M} Pr(Yik | Zk, Hik, Fik),    (2)


FIGURE 1. Processing tree of the General Condorcet Model for an item k.

for all possible realizations of y. Given the parameterizations in Equation (1), it is easily seen that the likelihood function of the model becomes

    L(Z, D, G | Y) = ∏_{i=1}^{N} ∏_{k=1}^{M} [Dik + (1 − Dik)gi]^{Yik Zk} [(1 − Dik)gi]^{Yik(1−Zk)}
                       × [(1 − Dik)(1 − gi)]^{(1−Yik)Zk} [Dik + (1 − Dik)(1 − gi)]^{(1−Yik)(1−Zk)}.    (3)
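As a concrete illustration, the likelihood in Equation (3) can be evaluated numerically. The sketch below is our own (not the paper's appendix code); the function name and the simulated inputs are hypothetical, and the Bernoulli form used here is equivalent to Equation (3) because, for each response cell, exactly one of the four exponents equals 1.

```python
# Hypothetical sketch of the GCM log-likelihood in Equation (3).
# Names (gcm_loglik, etc.) are ours, not from the paper.
import numpy as np

def gcm_loglik(Y, Z, D, g):
    """Log-likelihood of responses Y (N x M) given answer key Z (M,),
    knowledge probabilities D (N x M), and guessing biases g (N,)."""
    g = np.asarray(g)[:, None]              # broadcast guessing bias over items
    H = D + (1.0 - D) * g                   # hit rate, Equation (1)
    F = (1.0 - D) * g                       # false-alarm rate, Equation (1)
    p = np.where(Z[None, :] == 1, H, F)     # Pr(Yik = 1) for each cell
    return np.sum(Y * np.log(p) + (1 - Y) * np.log(1 - p))

rng = np.random.default_rng(2)
N, M = 4, 6
Z = rng.integers(0, 2, size=M)              # an arbitrary answer key
D = rng.uniform(size=(N, M))                # person-by-item knowledge probabilities
g = rng.uniform(size=N)                     # person-specific guessing biases
Y = rng.integers(0, 2, size=(N, M))         # arbitrary response data
ll = gcm_loglik(Y, Z, D, g)
```

Note that H − F = D, so the identifiability constraint Hik ≥ Fik of Equation (1) holds by construction in this parameterization.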

Note that since the terms in the exponent positions of Equation (3) are dichotomous 0-1 variables, for every combination of them only one of the exponents equals 1 and the remaining equal 0. A direct consequence of the single answer key assumption is that the correlation between two respondents over items equals the product of each respondent's correlation with the answer key; formally, for all 1 ≤ i ≠ j ≤ N:

    ρ(YiK, YjK) = ρ(YiK, ZK) ρ(YjK, ZK),    (4)

where K is a random variable that selects a random item index, so that ∀k, Pr(K = k) = 1/M, and

    ρ(YiK, YjK) = [E(YiK YjK) − E(YiK) E(YjK)] / √(Var(YiK) Var(YjK)).    (5)

The terms in Equation (5) and their counterparts for ρ(YiK, ZK) and ρ(YjK, ZK) are easily calculated using the properties of conditional expectation (see, e.g., Batchelder & Anders, 2012, p. 318 for details). Equation (4) leads to the consequence that for all distinct respondents denoted as i, j, m, n,

    ρ(YiK, YjK) ρ(YmK, YnK) = ρ(YiK, YnK) ρ(YmK, YjK),    (6)

which is a form of Spearman's law of tetrads (Spearman, 1904). The law of tetrads is the basis for Spearman's two-factor model of intelligence. In our case, this result occurs because there is a single answer key, Z = (Zk)_{1×M}, behind the respondent-by-respondent correlations over items, and the two factors are the respondents' abilities and the shared consensus answer key. In a later section on posterior predictive model checking we use this result to test whether the single latent answer key assumption of the GCM is met in an example application.
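The factorization in Equation (4) can be checked numerically by simulation. The sketch below is our own illustration (not from the paper): two respondents are generated under a single answer key with constant knowledge probabilities, and the inter-respondent sample correlation is compared with the product of each respondent's correlation with the key; the two agree up to Monte Carlo error.

```python
# Hypothetical numerical check of Equation (4) under a single latent answer key.
# All settings (abilities, biases, item count) are illustrative.
import numpy as np

rng = np.random.default_rng(3)
M = 200_000                                  # many items, so sample correlations are stable
Z = rng.integers(0, 2, size=M)               # single shared answer key

def simulate_respondent(theta, gbias):
    """Respondent knows each answer w.p. theta; otherwise guesses 'True' w.p. gbias."""
    knows = rng.random(M) < theta
    guess = (rng.random(M) < gbias).astype(int)
    return np.where(knows, Z, guess)

Yi = simulate_respondent(0.7, 0.4)
Yj = simulate_respondent(0.5, 0.6)

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

lhs = corr(Yi, Yj)                           # inter-respondent correlation
rhs = corr(Yi, Z) * corr(Yj, Z)              # product of correlations with the key
# lhs and rhs agree up to sampling noise, as Equation (4) predicts.
```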


Following a proposal for introducing item difficulty by Batchelder and Romney (1988), Karabatsos and Batchelder (2003) specified Dik, the probability that a person i knows the correct answer to item k, as a function of both the person's ability and the item's difficulty in the following way:

    Dik = θi(1 − δk) / [θi(1 − δk) + δk(1 − θi)],    (7)

where θi ∈ [0, 1] is the ability parameter belonging to informant i, and δk ∈ [0, 1] denotes the item difficulty of question k. From a parameter interpretation point of view, formulating the probability of knowing the correct answer as a function of ability and item difficulty (Equation (7)) for the GCM shows a correspondence to one of the Rasch model parameterizations (Fischer & Molenaar, 1995; Rasch, 1960). If the GCM did not have the guessing bias parameter1 (i.e., if the lowest level of the decision tree in Figure 1 were deleted), it would result in a model similar to the Rasch model, with the exception of having an unknown answer key. However, in our formulation, the item difficulty and ability parameters are kept on the unit scale, which is not typical in psychometric test theory. Since the GCM has a person-specific guessing bias probability, keeping the ability and item difficulty also on the unit scale results in a convenient parameter interpretation framework typical of threshold models of signal detection. In the proposed model, the ability, item-difficulty, guessing bias and answer key parameters are estimated at the same time. The ability and item-difficulty scales directly relate to the probability of knowing the culturally accepted consensus-based answer, which is represented by the answer key. That is to say that the ability of a person is relative to how well his/her responses generally match the consensus truth of the group. Certainly, items tapping into some knowledge domain would be heterogeneous with respect to how difficult it is to know the consensus on different aspects of the knowledge domain represented by the items. Even if the consensus is weak, the model specifies and estimates an answer key, in terms of which we can interpret the ability and the item-difficulty parameters.
This way the level of consensus knowledge informs us about how reliable the answer key estimate is (higher consensus yields a more stable estimate), while the posterior probability of the answer key estimate for each item reflects the degree of uncertainty item-wise. To summarize, the model introduced above has 2 × N person-specific parameters, namely the ability parameters, θ = (θi)_{1×N}, and the guessing bias parameters, G = (gi)_{1×N}. Also, it has 2 × M item-specific parameters: the answer key for each item, Z = (Zk)_{1×M}, and the item-difficulty parameter for each item, δ = (δk)_{1×M}. Except for the answer key, which takes discrete values of 0 or 1, all of the parameters are on the unit scale. Originally, statistical inference for the GCM was carried out in the classical inferential framework (Batchelder & Romney, 1988; Batchelder, Kumbasar, & Boyd, 1997). Later, Karabatsos and Batchelder (2003) derived inference for different versions of the GCM (including one with item difficulty, see later) in the non-hierarchical Bayesian framework. Recently, Oravecz, Vandekerckhove, and Batchelder (in press) developed a user-friendly software package with a graphical user interface that can estimate non-hierarchical GCMs in the Bayesian framework. An extension of the GCM allowing cultural truth to be continuous as in fuzzy logic rather than two-valued can be found in Batchelder and Anders (2012).
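The Rasch-like form of Equation (7) has two easily checked properties: at neutral item difficulty (δk = 0.5) the knowledge probability reduces to the ability itself, and when ability exactly matches difficulty (θi = δk) the probability is 0.5. A minimal sketch, with a function name of our own choosing:

```python
# Hypothetical sketch of Equation (7); not from the paper's appendix code.
def knowledge_prob(theta, delta):
    """Probability that a person of ability theta knows the consensus answer to
    an item of difficulty delta, both on the unit scale (Equation (7))."""
    return theta * (1 - delta) / (theta * (1 - delta) + delta * (1 - theta))

# Neutral difficulty (0.5) returns the ability unchanged; matched ability and
# difficulty give 0.5; the probability increases with ability for a fixed item.
```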

1 The guessing bias parameter in the GCM is a person-specific, cognitive latent variable, which should not be confused with the item-specific guessing rate parameter of the three-parameter logistic model (Birnbaum, 1968).


3. Hierarchical Model Formulation for the GCM

The modeling framework proposed for the GCM involves a hierarchical model structure that allows for the pooling of information across participants as well as items. In this setting, the person and item parameters are treated as random variables. These unit-scale parameters can be transformed onto the real line, and normal (Gaussian) distributions can be chosen to model their population distributions. Although this proposed technique is generally used when dealing with unit-scale parameters (see, e.g., De Boeck & Wilson, 2004), an alternative method that does not involve transforming the parameters, namely the use of the beta distribution, is also described in this section. The advantages and disadvantages of these two modeling approaches will also be discussed. The hierarchical modeling framework allows for a straightforward way to involve covariate information in the model, without needing to resort to a two-stage analysis (i.e., first estimating model parameters, and then exploring their association with the covariates through correlation or regression analysis). Incorporating predictors in models is not yet a widely explored technique in the field of cognitive psychology, although in many cases it can be done in a straightforward manner and can result in interesting findings; this will be demonstrated later in the Application section.

3.1. Assigning Normal Population Distributions

Unit-scale parameters can be transformed onto the real line by a link function. For the GCM, the suggested link function is the logit transform (other types, like the probit transform, are also possible), which is defined by logit(x) = log[x/(1 − x)], where x ∈ (0, 1). On the logit scale, pre-transformed values less than 0.5 correspond to negative values, while values greater than 0.5 correspond to positive values. The population distributions are modeled on the transformed scale.
Person-Specific Parameters. In the case of the logit-transformed ability parameters (θi) and guessing biases (gi), a bivariate normal population distribution can be assigned as their population distribution:

    (logit(θi), logit(gi))^T ∼ Normal2(μ_l(θg), Σ_l(θg)),    (8)

where Σ_l(θg) is a 2 × 2 covariance matrix representing the variation and covariation in the person-specific ability (θi) and guessing bias (gi) parameters of the population sample on the logit scale, while the vector μ_l(θg) contains the population means of these two variables. An easily derivable and scale-independent parameter based on Σ_l(θg) is the correlation between ability and guessing bias, which will be denoted as ρ_l(θg). The l in the subindex indicates that these parameters are on the logit scale.

Item-Specific Parameters. Since the item-difficulty parameter, δk, is also on the unit scale, as are the informant response-probability parameters discussed above, one can apply the logit transformation and assign a population distribution on it in the same manner. These two procedures are formally written as

    logit(δk) ∼ Normal(μ_l(δ), σ²_l(δ)).    (9)

In order to handle model identification in Equation (7), the population mean of the logit-scaled item difficulties can be set to zero (μ_l(δ) = 0) in Equation (9). Then the population mean of the ability parameters would represent the average probability (on the logit scale) that a respondent would know the correct answer. Alternatively, the population mean of the abilities can also be


set to 0, and then the population mean of the item difficulty would be estimated. In any case, the ability and item-difficulty parameters are interdependent and should be interpreted relative to each other. Finally, the answer key parameters can have a hierarchical structure as well. Typically in the hierarchical application, it is assumed that the answer key items, Zk, are generated hierarchically by a Bernoulli process with a specific hyperprior, π. That is,

    Zk ∼ Bernoulli(π),    (10)

where π ∈ [0, 1] is the probability of an answer key item being 'True'. The majority of GCM applications (see summary in Weller, 2007), which are not hierarchical, fix the probability parameter of this Bernoulli process, π, to 0.5; this setting is less flexible, as it designates an a priori equal chance for every latent answer key parameter to be either 'True' or 'False'.

Incorporating Covariates. In some applications, covariate information on the participants, such as demographic data on age, gender, nationality, education, and/or responses on personality questionnaires, is available in the data. Adding predictors can enhance model application, as this information can be used to explore connections between these covariates and the latent cognitive parameters of the GCM. Covariate modeling is already a well-established technique in the areas of personality psychology and psychometric test theory (see, e.g., De Boeck & Wilson, 2004; Gelman & Hill, 2007). The score of respondent i on covariate c (c = 1, . . . , C) is denoted as x_ci. All respondent-specific covariate scores are collected into a vector of length C + 1, denoted as x_i = (x_i0, x_i1, x_i2, . . . , x_iC)^T, typically with intercept x_i0 = 1. Continuous covariate scores should always be standardized for the GCM to improve the numerical stability of the model. Categorical variables should be dummy-coded. In order to avoid identification issues and to preserve the interpretation of the intercept as a population grand mean, we consider it good practice to also standardize categorical and binary dummy-coded variables.
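The hierarchical specification in Equations (8)-(10), combined with Equations (7) and (1), defines a full generative process for a response matrix. The sketch below is our own forward simulation of that process; all numeric settings (population means, covariance, sample sizes) are illustrative, not estimates from the paper.

```python
# Hypothetical forward simulation of the hierarchical GCM, Equations (8)-(10):
# person parameters from a bivariate normal on the logit scale, item
# difficulties from a normal with mean fixed at 0, answer key from Bernoulli(pi).
import numpy as np

rng = np.random.default_rng(4)
N, M = 50, 30

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

mu = np.array([0.5, 0.0])                    # population means on the logit scale
Sigma = np.array([[1.0, 0.3],                # covariance of ability and guessing bias
                  [0.3, 1.0]])
person = rng.multivariate_normal(mu, Sigma, size=N)        # Equation (8)
theta, g = inv_logit(person[:, 0]), inv_logit(person[:, 1])
delta = inv_logit(rng.normal(0.0, 1.0, size=M))            # Equation (9), mean fixed at 0
Z = rng.binomial(1, 0.5, size=M)                           # Equation (10), pi = 0.5

# Equations (7) and (1) then yield per-cell response probabilities.
D = theta[:, None] * (1 - delta) / (theta[:, None] * (1 - delta)
                                    + delta * (1 - theta[:, None]))
p = np.where(Z == 1, D + (1 - D) * g[:, None], (1 - D) * g[:, None])
Y = rng.binomial(1, p)                                     # simulated N x M data
```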
If there are categorical variables that represent exclusive categories, for example if different exclusive groups are modeled, care should be taken that the dummy-coded variable does not itself form an intercept variable (i.e., either the intercept should be omitted entirely, or the categorical variable should be represented by a number of dummy variables equal to the number of exclusive categories minus one; the dummy variables then indicate contrasts with the reference category). Also, it should be noted that in the current formulation of the model, it is implicitly assumed that the variance-covariance structure is equal across different groups of respondents (or items; see below). Assume that all regression coefficients (including an intercept) for the ability are collected in a vector, β_l(θ), and all regression coefficients for the guessing bias are collected in another vector, β_l(g). Then the population mean receives a person index, i, and can be written as a dot product of the predictors and regression coefficients:

    μ_l(θ)i = x_i^T β_l(θ),    (11)
    μ_l(g)i = x_i^T β_l(g).    (12)
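Equations (11) and (12) can be sketched as a small numerical example. The covariate values and regression weights below are made up for illustration; the point is only the mechanics of standardizing the covariates, prepending the intercept, and forming person-specific population means by a dot product.

```python
# Hypothetical sketch of Equations (11)-(12): person-specific population means
# on the logit scale from standardized covariates plus an intercept.
import numpy as np

rng = np.random.default_rng(5)
N, C = 100, 2
raw = rng.normal(loc=[35.0, 12.0], scale=[10.0, 3.0], size=(N, C))  # e.g. age, education
std = (raw - raw.mean(axis=0)) / raw.std(axis=0)   # standardize, as advised in the text
X = np.column_stack([np.ones(N), std])             # prepend intercept x_i0 = 1

beta_theta = np.array([0.4, 0.25, -0.1])   # intercept = population grand mean (logit scale)
mu_theta = X @ beta_theta                  # Equation (11): one mean per person
```

With no covariates, X reduces to the intercept column and every person shares the grand mean beta_theta[0]; because the covariates are standardized, the average of the person-specific means equals that intercept.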

This is a general formulation that includes covariates, but in their absence the same reasoning holds and β_l(θ) and/or β_l(g) reduce to a scalar, which is interpreted as a population mean. Moreover, covariate information on the item side can be introduced as well; although item covariates are less often collected, one may gather this information by examining the characteristics of the items in the questionnaire that gave rise to the data. For example, in the case of a questionnaire exploring beliefs about a certain illness, the items might belong to topic groups such as causes, symptoms, and treatments. Other possible predictors may come from the grammatical form of the question, for example whether or not it is written as a negation. This information can be added as an item covariate, denoted by h (h = 1, . . . , H). Then the value for the


covariate h on item k will be denoted as v_hk, and these H covariates are included in the vector v_k = (v_k1, v_k2, . . . , v_kH)^T. Note that there is no intercept in this formulation, in order to preserve model identification. As the goal is to keep the population mean fixed to 0, all types of covariates should be standardized. Similarly to the response-probability parameters, the population mean for item difficulty can be written as

    μ_l(δ)k = v_k^T β_l(δ).    (13)

The population mean is again decomposed into regression coefficients, (β l(δ) ), and covariates, (v k ). If there is no covariate information included or available on the items, then β l(δ) should be set to 0. With the novel model specification to include covariates, it is suggested here to apply the model as specified above for handling both cases of data with or without covariate information. In this way, in cases without covariate information, the population grand mean is represented either by an intercept, (β0l(θ) and β0l(g) ), for the person-specific parameters, or by 0 for the mean of item difficulties. 3.2. An Alternative: Using the Beta Distribution to Model the Population A previous hierarchical extension of the GCM in Batchelder and Anders (2012), without item difficulty and covariate information, used the beta as the population distribution for each parameter on the unit-scale. In their application, the beta distribution allows one to carry out inference on the unit-scale, which is the original scale of the probability parameters, and its interpretation is related to that. While statistical inference practices in psychometric test theory typically involve transforming the unit-scale parameters and employing the Gaussian distribution, some authors have instead argued for and implemented the use of the beta distribution (e.g.; Merkle, Smithson, & Verkuilen, 2011; Batchelder & Anders, 2012). In this subsection, the betadistribution-based formulation of the HGCM with the inclusion of covariates and item difficulty is presented and compared to the normal distribution-based approach. Person-Specific Parameters The person-specific parameters on the unit-scale, θi and gi , can be assigned a beta population distribution. In standard notation, the beta distribution is usually parameterized by two non-negative parameters, a and b, which respectively indicate whether the mass of the distribution is toward 1 or toward 0, as Beta(a, b). 
If the probability variable θi is assumed to be beta distributed, its probability density function is written as

    f(θi; aθ, bθ) = [Γ(aθ + bθ) / (Γ(aθ) Γ(bθ))] θi^{aθ−1} (1 − θi)^{bθ−1},

where Γ stands for the gamma function. For the sake of modeling a population mean and spread, the distribution can be reparameterized in terms of a mean μθ ∈ [0, 1] and a parameter that acts as a precision, or inverse variance, τθ > 0 (e.g., Kruschke, 2011). In this setting, the mean and precision are μθ = aθ/(aθ + bθ) and τθ = aθ + bθ. The mean and variance of this beta distribution are then, respectively,

    μθ and μθ(1 − μθ)/(1 + τθ).

Using this parameterization of the beta, the hierarchical distributions of the untransformed person-specific parameters can be set as

    θi ∼ Beta(μθ τθ, (1 − μθ)τθ).    (14)
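The mean-precision reparameterization used in Equation (14) is easy to verify numerically: converting (μ, τ) to the standard shapes a = μτ and b = (1 − μ)τ recovers the stated mean and variance formulas. A minimal sketch, with a function name of our own choosing:

```python
# Hypothetical sketch of the mean-precision reparameterization of the beta
# distribution in Equation (14): a = mu*tau, b = (1 - mu)*tau.
def beta_shapes(mu, tau):
    """Convert a unit-scale mean mu and precision tau to standard shapes (a, b)."""
    return mu * tau, (1.0 - mu) * tau

mu, tau = 0.7, 8.0
a, b = beta_shapes(mu, tau)
mean = a / (a + b)                              # recovers mu
var = a * b / ((a + b) ** 2 * (a + b + 1))      # equals mu*(1 - mu)/(1 + tau)
```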


The distribution for the guessing bias parameter is obtained by substituting subscript θ with g in the above equation.

Item-Specific Parameters. Using similar developments as discussed for the person-specific parameters, the item parameter on the unit scale, item difficulty δk, can be modeled with a beta distribution in the same way:

    δk ∼ Beta(μδ τδ, (1 − μδ)τδ).    (15)

In the unit-scale setting, neutral item difficulty is a value of 0.5, and in order to identify the model at the hierarchical level, one sets μδ = 0.5.

Incorporating Covariates. In the case of introducing covariates for the beta distribution, a similar design as discussed for the normal distribution is employed, except with the modification that the regression structure is positioned on the unit scale to properly locate each population mean in [0, 1]. As the logit was used earlier to transform values from the unit scale to the real line, its inverse, logit⁻¹(x) = exp(x)/(1 + exp(x)), transforms real-line values back onto the unit scale. In this way, the covariate information for the person- and item-specific parameters is set similarly as before (see details in the part on modeling with normal distributions), with the exception of the inverse logit transformation:

    μθi = logit⁻¹(x_i^T β_θ).    (16)

To derive the population means for the guessing bias and item-difficulty parameters, subscript θ in Equation (16) is substituted with g or δ (with the additional change of i to k).

3.3. Summary of the Two Population Modeling Approaches

Modeling with beta distributions has some potential disadvantages. First, estimating large values of the τ parameter is problematic. A characteristic property of the beta distribution parameterized by μ and τ is that sharp changes in the variance result from changes in τ between low values, such as 0 to 8, with increasingly limited returns from further increases. As a result of this characteristic, large underlying τ values are easily overestimated at high magnitudes when a diffuse prior is used. This overestimation issue may be handled with the use of a tighter prior for the τ parameter around a lower range of values (see, e.g., Batchelder & Anders, 2012), such as a Gamma(4, 2), which is used here in the demonstrated application of the model to a real data set. Second, dependencies between the ability and guessing bias parameters at the hierarchical level cannot easily be incorporated into a hierarchical beta population distribution. However, applying the beta distribution has the advantage of modeling the parameter values on the probability scale. GCM parameters are interpreted cognitively as probability parameters, as originated in the theory of signal detection, and their psychological interpretation relates to this framework. With regard to modeling the logit transform of the person-specific parameters with the normal distribution, the population parameter estimates of these variables are only interpretable on the transformed variable scale, not on the original unit scale.2 Also, the beta distribution exhibits somewhat greater flexibility in fitting population trends existent in the data. While
While 2 Although if returning to an interpretation of the estimated values on their original unit-scale is of interest, then these values can be approximated through the transformation of variables technique (see e.g., Mood, Graybill, & Boes, 1974). However, these approximations do not account for the limits of the unit scale, and therefore in the case of large variances on the transformed scale, these approximations are biased towards the extremes.


the normal distribution has zero skewness, the beta distribution has the potential advantage that it can accommodate population distributions that may be skewed in one direction. To summarize, we propose the hierarchical normal-distribution-based GCM (HGCMN) as the more advantageous alternative compared to the GCM with beta population distributions (HGCMB), especially when incorporating covariation among parameters. In order to relate to previous developments in the area of GCM modeling, we will also demonstrate statistical inference, including covariate modeling, with the HGCMB.

4. Bayesian Statistical Inference

We analyze the hierarchical models within the Bayesian framework (Gelman, Carlin, Stern, & Rubin, 2004; Kruschke, 2011). Classical, maximum likelihood statistical inference for hierarchical models is not a trivial task. For the model presented here, statistical inference with maximum likelihood would involve a high-dimensional integration over the numerous random-effect distributions. Since most of these integrals have no closed-form solutions, they would have to be approximated by finite sums, which is computationally prohibitive for models with a large number of parameters. In the Bayesian paradigm, explicit integration over the random effects is avoided because the inference is based on the full joint posterior distribution of the parameters (and not on the marginal). Parameters in the Bayesian framework have a probability distribution, which offers an intuitively appealing way of thinking about uncertainty and the knowledge one has about the parameters. Moreover, the Bayesian framework offers a coherent method for making decisions. An advantage of Bayesian statistical inference is that sampling algorithms may easily be applied to sample from the posterior density of the parameters. The posterior density represents the probability distribution of the parameters given the data, and it is directly proportional to the product of the likelihood of the data (given the parameters) and the prior distribution of the parameters. Formally, p(ξ | Y) ∝ p(Y | ξ)p(ξ), where ξ stands for the vector of all parameters in the model and Y for the data (the normalization constant, p(Y), does not depend on the parameters and is therefore not considered). The prior distribution incorporates prior knowledge about the parameters. In the absence of unambiguous prior knowledge, there are a number of default priors that have been suggested in the Bayesian literature.
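The proportionality p(ξ | Y) ∝ p(Y | ξ)p(ξ) can be illustrated with a toy grid approximation for a single probability parameter, such as a guessing bias. This sketch is our own illustration, not the paper's JAGS implementation; the data (7 'True' guesses out of 10) and the flat prior are made up.

```python
# Hypothetical grid-approximation illustration of posterior ∝ likelihood × prior
# for one unit-scale parameter (e.g., a guessing bias g).
import numpy as np

grid = np.linspace(0.001, 0.999, 999)      # candidate parameter values
h = grid[1] - grid[0]                      # grid spacing
prior = np.ones_like(grid)                 # flat prior on the unit interval
heads, n = 7, 10                           # observed 'True' guesses out of n guesses
like = grid**heads * (1 - grid)**(n - heads)   # Bernoulli likelihood
post = like * prior                        # unnormalized posterior
post /= post.sum() * h                     # normalize to a density on the grid

post_mean = np.sum(grid * post) * h
# With a flat prior this approximates the Beta(8, 4) posterior mean, 8/12.
```

In practice the models in this paper have far too many parameters for a grid, which is why MCMC sampling (discussed next) is used instead.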
The more data one acquires, the less influential the prior becomes on the posterior. Since the presented models yield high-dimensional posteriors (due to the large number of parameters), we opt for Markov chain Monte Carlo (MCMC; see, e.g., Robert & Casella, 2004) methods to draw values from the posteriors. These algorithms perform iterative sampling: values are drawn from approximate distributions, and the approximation to the posterior improves as the number of samples increases. There are two freely available software packages, namely JAGS (Plummer, 2011) and WinBUGS (Lunn, Thomas, Best, & Spiegelhalter, 2000), that perform such computations. Appendix C.1 provides scripts for the estimation of the model parameters, written for JAGS; these can easily be translated into WinBUGS as well. With JAGS, results may be easily investigated using complementary software, such as R or MATLAB, both of which contain freely accessible packages for the programs to communicate.

4.1. Prior Specifications

In the Bayesian context, the priors on the person- and item-specific parameters (θi, gi, δk, and Zk) of the HGCMN are defined by their population distributions in Equations (8), (9), and (10). These population distributions have free parameters that are estimated from the data, and here


diffuse or vague prior distributions for these population parameters are assigned (Gelman et al., 2004). More specifically, a diffuse normal prior is assigned for the vector of regression weights:

β_l(θ) ∼ Normal_{J+1}(0, 10 I_{J+1}),    (17)

where I stands for the identity matrix. Subscript l(θ) can be substituted with l(g) to denote the guessing bias population distribution priors, or with g when the beta population distribution is used. The population covariance matrix for the bivariate normal distribution of the logit transforms of θi and gi is modeled through an inverse-Wishart density with the identity matrix as scale matrix and 3 degrees of freedom:

Σ_l(θg) ∼ Inverse-Wishart(I_2, 3).    (18)

The prior for the item-specific l(δk) follows the same principle:

β_l(δ) ∼ Normal_{H+1}(0, 10 I_{H+1}).    (19)

The population variance of the item difficulty is assigned an inverse-gamma prior distribution:

σ_δ^2 ∼ Inverse-Gamma(0.01, 0.01).    (20)

As for the prior on the Bernoulli probability for the answer key, Zk, a uniform prior distribution is assigned:

π ∼ Uniform(0, 1).    (21)

The conditional posterior density of all model parameters can be found in Appendix B.
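Prior predictive draws from the specifications in Equations (17)-(21) can be generated directly. A sketch in Python (standard library only; J = 5 mirrors the five person covariates of the later application, the inverse-Wishart draw is built by inverting a Wishart draw, and all variable names are ours):

```python
import random

random.seed(42)

# Regression-weight prior (17): each component beta_j ~ Normal(0, variance 10)
J = 5                                    # number of person covariates (example value)
beta = [random.gauss(0.0, 10 ** 0.5) for _ in range(J + 1)]

# Item-difficulty variance prior (20): Inverse-Gamma(0.01, 0.01), drawn as the
# reciprocal of a Gamma(shape=0.01, scale=1/0.01) variate
g = random.gammavariate(0.01, 1.0 / 0.01)
sigma2_delta = 1.0 / max(g, 1e-300)      # guard against underflow to zero

# Answer-key base-rate prior (21): pi ~ Uniform(0, 1)
pi = random.uniform(0.0, 1.0)

# Covariance prior (18): Inverse-Wishart(I_2, 3), sampled by inverting a
# Wishart(I_2, 3) draw built from 3 standard bivariate normal vectors
W = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(3):
    z = [random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)]
    for r in range(2):
        for c in range(2):
            W[r][c] += z[r] * z[c]
det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
Sigma = [[W[1][1] / det, -W[0][1] / det],
         [-W[1][0] / det, W[0][0] / det]]

print(beta, sigma2_delta, pi, Sigma)
```

Inspecting such prior draws is a quick sanity check that the chosen hyperparameters are indeed diffuse on the scale of the model.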

5. Application: Judging Grammaticality of Sentences

To illustrate the amount of information one can gain by applying the HGCMs to real data, the models are applied to a data set from Sprouse, Fukuda, Ono, and Kluender (2011), which involved dichotomous judgments of whether a sentence is grammatically acceptable. This topic may be considered especially fit for HGCM analysis: although there are rules to determine the grammaticality of phrases in English, the language is constantly evolving, it varies somewhat from region to region, and it is the users of the language who form its rules. Thus a model that can estimate the underlying consensus answers, while taking into account abilities, item difficulties, and guessing tendencies, is especially well suited to investigate shared syntactic rules. Participants examined a variety of sentences, classified into several grammatical classes by linguists, and responded whether or not each sentence type is grammatical. A major focus of the study was to measure so-called syntactic 'island' effects on assessments of grammaticality. In particular, syntactic islands relate to what is known as 'wh-movement' or 'wh-extraction': the ability to introduce a 'wh' question word such as 'who', 'what', 'where', or 'which' at the beginning of a sentence, and still retain grammaticality by rearranging the other words. For example, 'She should stop talking about syntax' can be rearranged with a wh-extraction as 'What should she stop talking about __?' The underscore represents the canonical position of the word that the 'wh' replaced by extraction. In many cases, one can introduce a 'wh' question away from its canonical position, or even manipulate this length further by introducing more words, yet still retain grammaticality, such as 'What do you think she should stop talking about __?' Now in contrast, a syntactic 'island' is a phrase in a sentence where generally,


one cannot make a wh-extraction away from its canonical position while still retaining grammaticality. For example, 'She should stop talking about syntax because it is confusing to me' is grammatical, but when introducing any 'wh' question word, the resultant 'island' clause is ungrammatical: for instance, 'What should she stop talking about __ because it is confusing to me'.3 It is noted that some cases of wh-extraction out of particular island types are accepted as grammatical by some speakers and not by others. The question for consensus analysis here by the HGCMs is to determine the consensus belief in the grammaticality of wh-extractions out of islands. In the study described in Sprouse et al. (2011), data pertaining to the sentence types described above were obtained from a survey containing two conditions: sentences with an 'island' and without an 'island'. Respondents were recruited from the Amazon Mechanical Turk website (Buhrmester, Kwang, & Gosling, 2011) and were paid $3 for their participation. The total sample size was N = 102 (one out of 103 respondents was dropped due to missing covariates), while the total number of items was M = 64: 32 items for each condition of non-island versus island, with various combinations explained below.4 An additional factor that was studied was the distance from the 'wh' word to its canonical position, such as 'Who thinks that John bought a car?' (short) versus 'What do you think that John bought __?' (long). In both conditions, half of the sentences were short and half were long. In the analysis below, these two indicators ('island' and 'long') are used as predictors of item difficulty. There were four versions of the questionnaire, containing different tokens for each of the conditions, but from a linguistic point of view the tokens share exactly the same properties.
Thus the questionnaire types were collapsed to retain a non-sparse response matrix; however, the questionnaire version was added as an indicator variable in the analysis to test whether respondents replying to different versions of the questionnaire exhibit different levels of ability or different guessing tendencies, which might indicate inequality among the questionnaire versions. Finally, two other person covariates, namely gender and age, were also added to the analysis. With the HGCM, 'island' effects can be investigated in terms of the consensus answer key as well as item difficulty. The former is determined by examining the model's latent answer key estimates on both 'island' and 'non-island' items, and these estimates take into account the cognitive variables of decision making such as ability and guessing bias. Second, while differential item difficulty is also accounted for in the HGCMs through a parameter estimated in the context of the full model, these island versus non-island effects are well summarized by an indicator on the item side: that is, a covariate coded by whether an item contains an 'island' or not.

5.1. Results of Fitting the HGCMs

Both the HGCMN and the HGCMB, with five person-specific standardized covariates (age, gender, and three dummy-coded variables indicating questionnaire version) and two item-specific covariates (indicating island/no-island and long/short), were fit to the data. The results are based on analysis with JAGS running six chains, in which the retained number of iterations from each chain was 4000, resulting in a final posterior sample size of 24000. All results were based on chains that passed the R̂ convergence test (Gelman et al., 2004) and visual assessment.

3 This is an example of an 'adjunct' island. In this data set, there were four types of islands investigated: adjunct, subject, whether, and complex noun-phrase islands (see Sprouse et al., 2011).
4 In particular, there are four types of phrases: adjunct, subject, whether, and complex noun-phrases. The 64 items in the questionnaire are composed of each phrase type having eight tokens as islands and eight tokens as non-islands. In each eight-token set, four tokens had long distances between the wh-word and its canonical position, while four had short distances. Since the data set is used here for demonstration purposes, for the sake of simplicity these four tokens were collapsed over each condition (see Sprouse et al., 2011 for details).


With respect to the answer key estimates (Zk), the posterior mean estimate of their population distribution hyperparameter, π, was 0.75 (posterior std: 0.05), indicating that items with a consensus answer "grammatically acceptable" were more likely in the questionnaires than ungrammatical items. The Zk posterior median estimates indeed showed this ratio: all (short and long) 'non-islands' (32 items) were classified as grammatical according to the model, as well as all short 'islands' (16 items). For each of these estimates, the posterior standard deviation was practically 0, indicating a high level of certainty in the estimates. The long 'island' items (16) were all classified as non-grammatical. In the HGCMN, in eight out of 16 cases the posterior standard deviation was again 0, and in the remaining eight cases there was a very small amount of uncertainty (posterior standard deviation smaller than 0.08 for each item). As for the HGCMB, there was likewise no uncertainty in the 48 grammatical estimates and just a very small amount of uncertainty (posterior standard deviation smaller than 0.09) in the 16 ungrammatical estimates. These findings are interesting because the raw data showed a lack of agreement on many items (there was only one item on which all respondents agreed). For example, one of the largest disagreements was on a long island structure: the response split was 44 'True' and 58 'False' in terms of grammaticality judgments. Despite the nearly even split, the HGCMN classified the underlying answer key as "non-grammatical" with high certainty (std ≤ 0.04). The advantage of fitting an HGCM is that the full response pattern is evaluated in the model; the information in the responses is therefore aggregated using endogenously estimated differential weights on each respondent's answers.
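The effect of such differential weighting can be illustrated outside the full model. Assuming, purely hypothetically, that the person parameters were known, the posterior log-odds that an item's consensus answer is 'True' sum each respondent's log-likelihood-ratio contribution; a near-even split can then still be classified decisively when the more competent informants agree:

```python
import math

def answer_log_odds(responses, D, g, prior=0.5):
    """Log posterior odds that the consensus answer is 'True' (Z_k = 1),
    given each informant's competence D_i and guessing bias g_i."""
    lo = math.log(prior / (1.0 - prior))
    for y, d, gi in zip(responses, D, g):
        p1 = d + (1.0 - d) * gi        # Pr(respond True | Z_k = 1)
        p0 = (1.0 - d) * gi            # Pr(respond True | Z_k = 0)
        lo += math.log(p1 / p0) if y == 1 else math.log((1 - p1) / (1 - p0))
    return lo

# Hypothetical item: five near-guessers answer True, four competent informants
# answer False; a majority vote would pick 'True'.
resp = [1, 1, 1, 1, 1, 0, 0, 0, 0]
D = [0.05] * 5 + [0.9] * 4
g = [0.5] * 9
print(answer_log_odds(resp, D, g))     # strongly negative: consensus 'False'
```

The competent minority dominates because each of their responses carries a much larger log-likelihood-ratio weight than a near-guesser's response.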
To summarize, the consensus of the respondents was that short/long non-island and short island clauses were grammatical, and long island clauses were ungrammatical (the short island clauses retained their grammaticality as the wh-extraction involved no distance from the canonical location). These findings are consistent with previous findings based on simple statistics (e.g., Sprouse et al., 2011). With respect to the predictors, the results from both models are shown in Table 1. As can be seen, the models are consistent in their findings. The intercept of the ability regression is 1.75 (β_0,l(θ)) for the HGCMN and 1.33 (β_0,θ) for the HGCMB, and these estimates suggest that the population is quite knowledgeable and shows a high level of consensus on the grammaticality of these sentences. For an item with difficulty 0 (that is, the population mean difficulty in our model), the probability of any given informant knowing the correct answer can be approximated by taking the inverse logit of the population mean of the abilities (β_0,l(θ)). However, this approximation works well only when the population variance is small. In our case this probability turns out to be approximately 0.8, which suggests that the population is rather knowledgeable in general. We investigated whether the questionnaire types could affect performance. As mentioned previously, questionnaire 1 was coded as the baseline, and three person-specific covariates indicate whether a different questionnaire was filled out by the participant. As shown in Table 1, while all population posterior mean estimates for these coefficients are negative (−0.21, −0.06, −0.32), the magnitude of these negative effects is low and the corresponding 95 % credible intervals (CI) are comparatively wide, providing no substantial evidence that the ability or guessing bias estimates differ markedly as a function of these indicators.
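The inverse-logit approximation, and its sensitivity to the population variance, can be checked by simulation. In the sketch below, the population standard deviation σ = 1 is a hypothetical value chosen for illustration, not an estimate from the paper:

```python
import math, random

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
mu = 1.75                      # posterior mean of the ability intercept (HGCMN)
sigma = 1.0                    # hypothetical population SD, for illustration only
plug_in = inv_logit(mu)        # approximation that ignores the population variance
draws = [inv_logit(random.gauss(mu, sigma)) for _ in range(100000)]
pop_mean = sum(draws) / len(draws)
print(plug_in, pop_mean)       # the population mean sits below the plug-in value
```

The Monte Carlo average is the actual population-average probability of knowing the answer; the gap between it and the plug-in value grows with σ, which is why the approximation is reliable only when the population variance is small.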
In addition to the questionnaire type, the two other person-specific covariates used to predict ability and guessing tendencies were age and gender. From Table 1 we can see that, with respect to age and ability, the posterior mean estimate is clearly positive (0.61), with a relatively narrow 95 % CI of (0.29, 0.93), suggesting that age is positively related to ability, with older respondents performing better. The rest of the coefficients for ability and guessing bias have very low magnitudes and high posterior variance, providing no evidence of effects. As for the item covariates, the two predictors respectively corresponded to whether the item was an island (structural effect) and whether the wh-word distance from the canonical position was long (length effect). As can be seen, both of these predictors have large magnitudes (0.51,

TABLE 1.
Results on the regression coefficients based on the HGCMN and HGCMB.

                                            HGCMN                        HGCMB
Model           Covariate          Posterior  CI 2.5 %  CI 97.5 %  Posterior  CI 2.5 %  CI 97.5 %
parameter                          mean                            mean

Ability         Intercept             1.75      1.32      2.20       1.33      0.99      1.68
                Age                   0.61      0.29      0.93       0.45      0.19      0.77
                Gender               −0.10     −0.43      0.21      −0.09     −0.33      0.14
                Questionnaire 2      −0.21     −0.61      0.19      −0.16     −0.48      0.15
                Questionnaire 3      −0.06     −0.46      0.34      −0.06     −0.38      0.25
                Questionnaire 4      −0.32     −0.73      0.08      −0.24     −0.55      0.05
Guessing bias   Intercept            −0.37     −1.13      0.22      −0.66     −1.01      0.06
                Age                  −0.15     −0.57      0.28      −0.16     −0.49      0.19
                Gender                0.06     −0.33      0.45       0.08     −0.21      0.36
                Questionnaire 2      −0.18     −0.72      0.36      −0.16     −0.54      0.23
                Questionnaire 3      −0.30     −0.83      0.23      −0.39     −0.68      0.12
                Questionnaire 4      −0.08     −0.61      0.43      −0.06     −0.44      0.31
Item-difficulty Structure ('island')  0.51      0.13      0.93       0.47      0.16      0.85
                Length ('long')       1.11      0.73      1.52       0.92      0.63      1.29

1.11) and while their corresponding CIs are not especially narrow, they do indicate a connection between these predictors and item difficulty: the wh-word distance has the larger positive effect on item difficulty, while the island structure has a smaller, but still positive, effect. As can be seen from the various estimates, the HGCMN and HGCMB delivered very similar results. The models functionally differ from one another with respect to their population distribution types: in the HGCMN, the logit transformation is applied to all unit-scale parameters and a normal distribution is assumed as the population distribution, whereas the HGCMB models unit-scale parameters on their original scale. Despite these differences, in this case the same conclusions could be drawn from the estimates of either model. Also, when a large number of random samples was generated based on the population parameter estimates of the HGCMB and of the HGCMN (applying the inverse logit transformation to the latter values), the actual shapes of the two types of population distributions did not differ remarkably. Another noteworthy difference between the two hierarchical models is the ability of the HGCMN to directly model the dependence between the person-specific parameters via the multivariate normal parametrization. In this application, this dependency was captured by ρθg, estimated as −0.18 (std = 0.14); through this parameter, the possible dependency between a person's ability (θ) and guessing bias (g) is expressed directly in the estimation.

6. Model Fit

Model fit can be investigated in both absolute and relative terms. For the former, posterior predictive model checks (see, for example, Gelman et al., 2004) can be used. For the latter, the Deviance Information Criterion (DIC; Spiegelhalter, Best, Carlin, & van der Linde, 2002) is suggested, though other criteria may be used.

6.1. Posterior Predictive Model Checking

Posterior predictive model checks (PPC) are set up by selecting a statistic that reflects an important feature of the real data, calculating that same statistic for many replicated data sets (based


on the model and parameter estimates), and then comparing the statistic of the real data with the ones generated from the replicated data. If the real-data statistic does not appear to be consistent with the distribution of statistics generated from the replicated data, the proposed model is considered to provide a poor description of the data. Posterior predictive checks have been criticized for being too optimistic (Dey, Gelfand, Swartz, & Vlachos, 1998); therefore, although we offer two tests here, further improvements are desirable in this area.

One Answer Key. An important assumption of the model is that all persons share the same latent answer key. As discussed earlier, this is expressed by the single-factor structure of the correlation matrix obtained from correlating the responses of informants across items. This property follows from the proven theorem that the correlation between any two informants is equal to the product of each informant's correlation with the latent answer key, as formally written in Equation (6). Thus the model check for the assumption that all persons share the same latent answer key involves verifying whether the person-by-person correlation matrix has a single-factor structure; for the GCM this is achieved with a factor-analytic approach. Typically, standard minimum residual factor analysis (MINRES, Comrey, 1962) is utilized, from which one obtains the eigenvalues of the person-by-person correlation matrix. The single-factor structure is checked by observing the pattern of the first and subsequent eigenvalues of the correlation matrix and seeing how they decline. A one-factor solution is supported by a sharp decline after the first eigenvalue, with the rest of the eigenvalues following a linearly decreasing trend.
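The eigenvalue screen described above can be sketched with simulated single-answer-key data (assuming numpy is available; MINRES is replaced here by the raw eigenvalues of the correlation matrix, which suffices to show the sharp first-eigenvalue drop):

```python
import numpy as np

rng = np.random.default_rng(7)
n_persons, n_items = 100, 60

# Simulate single-answer-key data: each person matches a common latent key with a
# person-specific probability, which induces a one-factor correlation structure.
key = rng.integers(0, 2, n_items)
ability = rng.uniform(0.6, 0.95, n_persons)
X = np.array([[k if rng.random() < a else 1 - k for k in key] for a in ability])

R = np.corrcoef(X)                           # person-by-person correlation matrix
eig = np.sort(np.linalg.eigvalsh(R))[::-1]   # eigenvalues, descending
print(eig[0] / eig[1])                       # sharp drop after the first eigenvalue
```

The PPC then repeats this computation on each replicated data set and overlays the resulting eigenvalue series against the real-data series, as in Figure 2.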
The posterior predictive check for the single answer key property of the data is carried out by assessing how closely the eigenvalue series of the correlation matrices of the posterior predictive data resemble the eigenvalue series of the correlation matrix of the real data. In particular, this graphical PPC involves plotting the many series of eigenvalues from the posterior predictive data against the series from the real data, in order to assess whether the posterior predictive data mimic the eigenvalue trend of the real data. Figure 2 displays the results of this check for the grammaticality data set for both the HGCMN (left panel) and the HGCMB (right panel). The continuous gray lines depict the eigenvalue

FIGURE 2. PPC: Eigenvalue curves for both the HGCMN (left) and the HGCMB (right) (grammaticality data set). The gray lines depict the eigenvalue series generated from the posterior predictive data, while the black line is the series from the real data.


series generated from 1000 sets of the posterior predictive data, while the black line is the series from the real data. Since the black line closely overlaps the gray area in each plot, the figure shows an appropriate fit of the eigenvalue trend in both cases; thus the posterior predictive data exhibit a single-factor structure similar to that of the real data.

Item Heterogeneity. Another important property of the data concerns the marginal response frequencies across items. We use a measure called the Variance Dispersion Index (VDI; see, e.g., Batchelder & Anders, 2012) to represent the variation in responses over informants on each item. The VDI is calculated by first computing the variance in responses over informants on each item (the variance of each kth column) and then taking the variance of these column variances. Formally,

VDI(X) = (Σ_{k=1}^{M} V_k^2)/M − ((Σ_{k=1}^{M} V_k)/M)^2,    (22)

where V_k = P_k(1 − P_k) and P_k = (1/N) Σ_{i=1}^{N} X_ik.

The VDI-based PPC involves calculating the VDI of the real data and checking whether it falls within the central 95 % of the distribution of VDI statistics calculated from model-based simulated data sets. To further illustrate the strength of the VDI check, an HGCMN with homogeneous items (all δk set to 0.5) was also fit to the grammaticality data. In Figure 3 the VDI model checks are displayed for the HGCMN (left panel) with heterogeneous (continuous line) and homogeneous (dotted line) item-difficulty, and for the HGCMB (right panel) with heterogeneous item-difficulty. The black lines depict the VDI statistic calculated from the real data. The VDI checks are satisfied for both models with heterogeneous item-difficulty. As the dotted line in Figure 3 illustrates, the VDI check was not passed by the item-homogeneous version of the model.
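Equation (22) and the percentile check translate directly into code. In the sketch below, the 'replicated' data sets are simulated from hypothetical item probabilities rather than from actual posterior draws:

```python
import random

def vdi(X):
    # Variance Dispersion Index: the variance of the item-wise response variances
    n, m = len(X), len(X[0])
    V = []
    for k in range(m):
        p_k = sum(X[i][k] for i in range(n)) / n   # item marginal P_k
        V.append(p_k * (1.0 - p_k))                # V_k = P_k (1 - P_k)
    mean_v = sum(V) / m
    return sum(v * v for v in V) / m - mean_v ** 2

random.seed(3)
n, m = 100, 64
# Hypothetical heterogeneous items: per-item probability of a 'True' response
probs = [random.uniform(0.1, 0.9) for _ in range(m)]
real = [[1 if random.random() < probs[k] else 0 for k in range(m)] for _ in range(n)]

# PPC: compare the real-data VDI against VDI values from replicated data sets
reps = sorted(
    vdi([[1 if random.random() < probs[k] else 0 for k in range(m)] for _ in range(n)])
    for _ in range(200)
)
ok = reps[4] <= vdi(real) <= reps[194]    # within the central 95 % of replicates
print(vdi(real), ok)
```

With homogeneous items the V_k values cluster tightly, so the VDI collapses toward zero, which is exactly why the item-homogeneous model fails this check.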

FIGURE 3. PPC: VDI distributions based on data generated by the HGCMN (left panel), with heterogeneous (continuous line) and homogeneous (dotted line) item-difficulty, and by the heterogeneous HGCMB (right panel), are depicted with gray curves, while the black lines indicate the VDI value of the real data set.

TABLE 2.
DIC results on different HGCMs fit to the grammaticality data set.

Model type                               DIC
HGCMN with covariates                   3842
HGCMB with covariates                   3869
HGCMN no covariates                     3876
HGCMN with neutral bias                 4116
HGCMN with neutral item-difficulty      4309

6.2. Deviance Information Criterion

The DIC evaluates goodness of fit in relative terms. As it measures model fit in terms of the deviance between the model and the data, models with smaller DIC values should predict data of the same type better. Table 2 shows the DIC values for different versions of the HGCM on the grammaticality data: the HGCMN, the HGCMB, the HGCMN in which the guessing bias was set to the same, neutral level (gi = 0.5) for all persons, and finally the HGCMN in which item homogeneity (neutral item-difficulty) was assumed. Covariate information was included in all of these models; in addition, the HGCMN was also estimated without covariate information. As can be seen, the HGCMN with its full parametrization seems to be the best-fitting model for this data set. The second-best fitting model is the HGCMB, while the HGCMN without covariate information still appears to do better than the versions in which random effects were turned into fixed ones.
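For concreteness, the DIC can be computed from posterior draws as the mean deviance plus the effective number of parameters, p_D = D̄ − D(ξ̄). A sketch for a toy one-parameter Bernoulli model (the draws and data are illustrative, not the paper's models):

```python
import math

def deviance(p, data):
    # D = -2 log L for a single Bernoulli rate p
    return -2.0 * sum(math.log(p if y else 1.0 - p) for y in data)

def dic(posterior_draws, data):
    # DIC = mean deviance + effective number of parameters p_D,
    # where p_D = mean deviance - deviance at the posterior mean
    d_bar = sum(deviance(p, data) for p in posterior_draws) / len(posterior_draws)
    p_bar = sum(posterior_draws) / len(posterior_draws)
    p_d = d_bar - deviance(p_bar, data)
    return d_bar + p_d

# Hypothetical posterior draws for a Bernoulli rate, and a small data set
draws = [0.40, 0.42, 0.45, 0.47, 0.50]
data = [1, 0, 0, 1, 0, 1, 0, 0]
print(dic(draws, data))
```

Since the deviance is convex in the parameter here, p_D is nonnegative, so the DIC penalizes posterior spread in addition to rewarding fit.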

7. Discussion

This paper provides multiple extensions to the estimation techniques of the popular General Condorcet Model published originally by Batchelder and Romney (1988). Through the developments of this paper, the GCM can be embedded in a hierarchical inference framework in which covariates of both persons and items can be incorporated into the estimation; the accompanying code for these developments is provided as online supplements. We developed two approaches to estimating the model parameters: employing the normal distribution on the transformed parameters, the HGCMN, or employing the beta distribution on the unit scale, the HGCMB. A number of parameter recovery studies (performed, though not reported in this paper) suggested that both models do well in recovering parameters from data simulated by the same model as well as by the other model: that is, the HGCMN did well in recovering equivalent parameters generated by the HGCMB, and vice versa. A possible disadvantage of the beta-distribution-based modeling approach is that possible connections or dependencies between abilities and guessing biases cannot be modeled directly, whereas they are incorporated in the normal-based model. However, in the Bayesian modeling framework, such dependencies can still be discovered by post-processing the person-specific posterior distributions of these two parameters (such post-hoc parameters are sometimes called structural or derived parameters; see Jackman, 2009; Congdon, 2003). For example, the directly modeled correlation of ability and bias for the informants in the HGCMN was ρθg = −0.18 (std = 0.14). Approximate measures can ultimately be obtained with the beta model by correlating the posterior samples of the person-specific ability and willingness-to-guess parameters at each iteration, which results in a posterior distribution of the derived correlation between the two parameters for the HGCMB.
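This post-processing step can be sketched as follows, with synthetic draws standing in for real MCMC output (the sizes and the built-in dependence ρ = −0.3 are illustrative):

```python
import random, math

def pearson(xs, ys):
    # Sample Pearson correlation of two equal-length sequences
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

random.seed(11)
n_persons, n_iter, rho = 200, 200, -0.3   # hypothetical sizes and true dependence

# Synthetic stand-ins for per-iteration MCMC samples of ability (theta) and
# guessing bias (g), generated with a built-in negative dependence.
traits = [random.gauss(0.0, 1.0) for _ in range(n_persons)]
derived = []
for _ in range(n_iter):
    theta = [t + random.gauss(0.0, 0.3) for t in traits]
    g = [rho * t + math.sqrt(1.0 - rho * rho) * random.gauss(0.0, 1.0) for t in traits]
    derived.append(pearson(theta, g))     # one derived correlation per iteration

post_mean = sum(derived) / len(derived)   # posterior mean of the derived correlation
print(post_mean)                          # recovers a negative correlation
```

Correlating within each iteration, rather than pooling all samples, is what yields a full posterior distribution for the derived correlation rather than a single point value.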
The posterior mean of this derived correlation parameter turns out to be −0.15 (std = 0.08), which is very close to the measure delivered by the HGCMN . When estimating parameters in the HGCMB , the constrained prior information of independent ability


and guessing bias affected the estimation process, which can mitigate the effect of a possibly existing dependence between these two parameters: since the correlation is not modeled, independence is assumed, and the parameter estimates (especially the population parameters) are biased toward it. While the current application relied on a data set with a relatively large number of respondents and items, we emphasize that the HGCM also does well in recovering the answer key and other parameters with a relatively small number of informants. For example, Batchelder and Anders (2012) show excellent parameter recoveries using a model similar to the HGCMB with as few as N = 6 informants.

Acknowledgements

Work on this paper was supported by grants to the authors from the Army Research Office (ARO) and from the Oak Ridge Institute for Science and Education (ORISE). We would like to thank Jon Sprouse for making his grammaticality data set available to us. We would also like to thank the four anonymous reviewers and Joachim Vandekerckhove for their useful comments.

Appendix A. Proof of Identifiability for the General Condorcet Model

Let the parameters of the GCM be (D, G, Z), with parameter spaces, respectively, (0, 1)^N, (0, 1)^N, and {0, 1}^M. Let Ω be the parameter space of (D, G, Z). Let Y be an N × M matrix of 1s and 0s, and let S be the space of all Y. Let h : Ω → Π, where Π is the space of all probability distributions over Y. Let p(Y | (D, G, Z)) be a particular probability density function in Π.

Definition. In this context, a model is identified in case h is one-to-one, meaning that two different sets of parameters necessarily produce different probability density functions over Y.

Observation 1. The model is not identified if we allow all items to be false or all items to be true. Let Z be all 1s and Z′ be all 0s; then, so long as

∀i: D_i + (1 − D_i)g_i = (1 − D_i′)g_i′,

the model gives identical probabilities.

Observation 2. If we exclude these two extremes from the space of possible Zs, the model is identified. We need to show that, ∀Y ∈ S,

p(Y | D, G, Z) = p(Y | D′, G′, Z′) ⇒ D = D′, G = G′, Z = Z′.

Suppose Z = Z′. Then there are k and l with Z_k = 1 = Z_k′ and Z_l = 0 = Z_l′. From these we have, for all informants i, D_i + (1 − D_i)g_i = D_i′ + (1 − D_i′)g_i′ and (1 − D_i)g_i = (1 − D_i′)g_i′, from which D_i = D_i′ and g_i = g_i′. So if the model is not identified, it must be that Z ≠ Z′. Pick k (without loss of generality, since primed and unprimed can be swapped) with Z_k = 1 and Z_k′ = 0. For all informants i we then have

D_i + (1 − D_i)g_i = (1 − D_i′)g_i′.    (A.1)

Next pick j such that Z_j′ = 1; this is possible because we have eliminated the case where Z′ is all 0s. If Z_j = 1, then for all i,

D_i + (1 − D_i)g_i = D_i′ + (1 − D_i′)g_i′,

and coupled with Equation (A.1), this is not possible. On the other hand, if Z_j = 0, then for all informants i we have

(1 − D_i)g_i = D_i′ + (1 − D_i′)g_i′,

and coupled with Equation (A.1), this is not possible. Thus the model is identified.

Appendix B. Posterior Distributions of the HGCMs

The posterior distribution given the data for the HGCM with Gaussian population distributions can be derived in the following way. For notational convenience, all of the person- and item-specific parameters are respectively collected into corresponding vectors (i.e., θ, g, δ, and Z). First, the conditional posterior for the model in which the probability variables have population distributions assigned on the transformed variable scale is derived as

Pr(θ, β_l(θ), g, β_l(g), Σ_l(θg), δ, β_l(δ), σ_δ^2, Z, π | Y)
  ∝ ∏_{i=1}^{N} ∏_{k=1}^{M} [Z_k D_ik + (1 − D_ik)g_i]^{1(Y_ik = 1)} [−Z_k D_ik + D_ik + (1 − D_ik)(1 − g_i)]^{1(Y_ik = 0)}
  × ∏_{i=1}^{N} Normal_2((logit(θ_i), logit(g_i))′ | (β_l(θ), β_l(g)), Σ_l(θg))
  × ∏_{k=1}^{M} Normal(logit(δ_k) | β_l(δ), σ_δ^2) ∏_{k=1}^{M} Bernoulli(Z_k | π)
  × Normal_{J+1}(β_l(θ) | 0, 10 I_{J+1}) Normal_{J+1}(β_l(g) | 0, 1000 I_{J+1})
  × Normal_{H+1}(β_l(δ) | 0, 10 I_{H+1}) Uniform(π | 0, 1)
  × Inverse-Wishart(Σ_l(θg) | I_2, 3) Inverse-Gamma(σ_δ^2 | 0.01, 0.01).

After the proportionality sign, the first double product is the likelihood of the parameters given the data, based on Equation (3). It is followed by the product of the population densities of the person-specific parameters, as specified in Equation (8). The next line gives the population densities of the item-specific parameters as in Equations (9) and (10). Finally, the last two lines multiply all of the above by the prior densities chosen in Equations (17), (19), (21), and (20).

The only modification for the prior settings of the HGCM with the beta population distributions concerns the variance parameters. As θ_i and g_i are sampled univariately, priors have to be set for their 'precision' parameters (which determine their population variances). As is typically done for precision parameters, a moderately diffuse gamma distribution can be chosen, where

τ_θ ∼ Gamma(1, 0.1).    (B.1)

Then τ_g, as well as τ_δ for the item difficulty, is set similarly. Note that the HGCMB differs only in terms of the population distributions for the item- and person-specific parameters. As discussed earlier, the beta distribution is parameterized in terms of regression coefficients and a precision parameter, as specified in Equations (16) and (B.1), and the posterior is written as

Pr(θ, β_θ, τ_θ, g, β_g, τ_g, δ, β_δ, τ_δ, Z, π | Y)
  ∝ ∏_{i=1}^{N} ∏_{k=1}^{M} [Z_k D_ik + (1 − D_ik)g_i]^{1(Y_ik = 1)} [−Z_k D_ik + D_ik + (1 − D_ik)(1 − g_i)]^{1(Y_ik = 0)}
  × ∏_{i=1}^{N} Beta(θ_i | β_θ, τ_θ) ∏_{i=1}^{N} Beta(g_i | β_g, τ_g)
  × ∏_{k=1}^{M} Beta(δ_k | β_δ, τ_δ) ∏_{k=1}^{M} Bernoulli(Z_k | π)
  × Normal_{J+1}(β_θ | 0, 10 I_{J+1}) Normal_{J+1}(β_g | 0, 1000 I_{J+1})
  × Normal_{H+1}(β_δ | 0, 10 I_{H+1}) Uniform(π | 0, 1)
  × Gamma(τ_θ | 1, 0.1) Gamma(τ_g | 1, 0.1) Gamma(τ_δ | 1, 0.1).

Appendix C. JAGS Code for the HGCMs

C.1. Normal Population Distributions

model{
  for (i in 1:n){
    for (k in 1:m){
      D[i, k]
