psychometrika doi: 10.1007/s11336-014-9427-8

A RACE MODEL FOR RESPONSES AND RESPONSE TIMES IN TESTS

Jochen Ranger MARTIN-LUTHER-UNIVERSITY HALLE-WITTENBERG

Jörg-Tobias Kuhn UNIVERSITY OF MÜNSTER

José-Luis Gaviria UNIVERSIDAD COMPLUTENSE DE MADRID

Latent trait models for responses and response times in tests are often purely statistical models without a close connection to features of the assumed response process. In the present paper, a new model is presented that is more closely related to assumptions about the response process. The model is based on two increasing stochastic processes. Each stochastic process represents the accumulation of knowledge with respect to one of two response options, the correct and the incorrect response. Both accumulators compete, and the accumulator that first exceeds a critical level determines the response. General assumptions about the accumulators result in a race between two response times that follow a bivariate Birnbaum Saunders distribution. The model can be calibrated with marginal maximum likelihood estimation. Feasibility of the estimation approach is demonstrated in a simulation study. Additionally, a test of model fit is proposed. Finally, the model is used for the analysis of an empirical data set.

Key words: item response model, response time model, process model, Birnbaum Saunders distribution.

Correspondence should be made to Jochen Ranger, Martin-Luther-University Halle-Wittenberg, Halle, Germany. Email: [email protected]

© 2014 The Psychometric Society

1. A Race Model for Responses and Response Times in Tests

In computerized tests it is common practice to record not only the responses but also the response times. Usually only the responses are used for psychological assessment, but recently the question has been raised whether it is possible to include the response times as well. Several measurement models have been developed that allow for the joint analysis of responses and response times in tests; see van der Linden (2009) and Schnipke and Scrams (2002) for an overview. Some of these models have been capable of accurately representing the structure of the data. However, most of the proposed models are purely statistical models and do not have a foundation in cognitive psychology. From an applied perspective, this is not necessarily a weakness: As long as the estimated latent traits are useful for predicting relevant quantities, there is no need for cognitive theory. Nevertheless, even from an applied perspective one can ask whether a more detailed modeling of the response process might improve the assessment of individual differences. With respect to the data structure, experiments in cognitive psychology bear a striking resemblance to applications of psychological tests. Individuals have to react to stimuli or items and the reactions are scored as correct or incorrect. So it is natural to ask whether the statistical models used in cognitive psychology for binary choice experiments can be transferred to the analysis of data from psychological tests. Several such statistical models for binary responses
and response times have been developed in cognitive psychology so far; for an overview of the proposed approaches see, for example, Ratcliff and Smith (2004). Three main classes of models can be distinguished: diffusion models, counter models and accumulator models. In the diffusion model, a stochastic process moves randomly between two absorbing boundaries that represent different response options, usually the correct and the incorrect response. Individuals respond as soon as one of the two boundaries is hit for the first time. The most popular diffusion model is the Wiener diffusion model, which can be traced back to Ratcliff (1978). This model assumes that information is accumulated continuously over time. The accumulation is thereby governed by a Wiener process with linear drift. Since its first applications the Wiener diffusion model has been extended with respect to parameter variability between trials (Ratcliff & Rouder, 1998; Ratcliff & Tuerlinckx, 2002) and has been integrated into a multilevel framework (Vandekerckhove, Tuerlinckx, & Lee, 2011). An alternative to the Wiener diffusion model is the Ornstein–Uhlenbeck process (Busemeyer & Townsend, 1992). Counter models, on the other hand, assume that individuals generate evidence for the given response options simultaneously in the form of discrete units of constant size. Evidence is generated at randomly varying points of time and summed by a response specific counter. The counter that first reaches a decision boundary triggers the response. The first counter models were based on a homogeneous Poisson process (Pike, 1973; Townsend & Ashby, 1983), but were later generalized to a nonhomogeneous Poisson process (Smith & Van Zandt, 2000). The assumption of independent counters was relaxed by Tuerlinckx (2004) and Ruan, MacEachern, Otter, and Dean (2008). Finally, in accumulator models the evidence for the different response options arrives cyclically at fixed points of time. The amount of evidence generated per time unit for each response option is not constant, but fluctuates randomly. For each response option an accumulator is assumed that sums the periodically arriving amounts of evidence. A response is made as soon as one of the accumulators crosses a decision criterion; see Vickers (1970) and Smith and Vickers (1988) for examples of such models. More recent developments relax some of the assumptions made in the first accumulator models, such as the independence of the different accumulators (Usher & McClelland, 2001), or allow for parameter variability between trials (Brown & Heathcote, 2005). Although computer-based testing also provides response and response time data, there have been only a few attempts to use the process models described above as measurement models in order to assess individual differences. Vandekerckhove et al. (2011) proposed a hierarchical version of the diffusion model by relating latent traits to the drift rate and the absorbing boundaries of the diffusion model. Similar latent trait versions of the diffusion model have been suggested for achievement tests by van der Maas, Molenaar, Maris, Kievit, and Borsboom (2011) and for self-report personality questionnaires by Tuerlinckx and De Boeck (2005). A counter model has been used by Tuerlinckx and De Boeck (2005) for data from psychological tests and by Otter, Allenby, and Van Zandt (2008) for data from conjoint experiments. Recently, Rouder, Province, Morey, Gomez, and Heathcote (2014) proposed a race model based on the linear ballistic accumulator that is useful for applications to high-stakes tests.
The reason for the slow propagation of process models in psychological assessment may be that, despite the superficial similarity of the data provided by cognitive experiments and psychological tests, the cognitive process models cannot simply be transferred in a one-to-one manner. First, in cognitive psychology one usually has hundreds of repetitions of the same task for each individual. In psychological assessment individuals respond only to a few items. This restricts the complexity of the process models that can be fitted to such limited data sets; only simple process models that at best crudely approximate the solution process come into consideration. The challenge is to find a model that is flexible enough to reflect the most important features of the solution process and yet simple enough to be estimable with the data sets available in psychological assessment. The EZ-diffusion model by Wagenmakers, van der Maas, and Grasman (2007) and the accumulator model by Rouder et al. (2014) are both motivated
by this tradeoff between model complexity and statistical tractability. In this manuscript, a new model is suggested that likewise is supposed to represent components of the solution process but can be calibrated with small samples. Second, although the data are identical in experimental psychology and psychological assessment, the tasks are not, and the response process triggered by simple perceptual tasks might be fundamentally different from the response process triggered by complex test items. Typical experiments are about mapping perceptions into a given set of response options, thereby eliciting processes of response selection. The components of the response time models can thereby be related to the neural activity in perceptual areas of the brain. This close connection between neural activity and a stochastic process of information accumulation is not given in more complex problems, and one can ask which aspect of the solution process a process model is supposed to represent. In typical intelligence items a test taker has to discover a regularity behind some stimulus material. Thereby, the test taker has to actively solve an item instead of choosing a response from a limited set of options. After presentation of the item, a test taker usually inspects the item, tries to identify the relevant dimensions and starts to accumulate knowledge by interrelating the given information until he/she is able to generate a response. The response options might provide information about the relevant dimensions at this stage. Over time, the test taker passes through the necessary solution steps until a potential solution has been reached. In a free response format, the test taker would simply write down the solution. In multiple choice items, the test taker would search for the generated solution in the given set of response options and select the one corresponding to the generated solution. This steady advance toward a response can be interpreted as an accumulation process. For each possible response one can assume an accumulator, which reflects the generated knowledge with respect to the response and represents the test taker's progress toward the corresponding solution. The accumulation process ends whenever a critical level of knowledge has been reached that allows the generation of the response. As the threshold reflects the knowledge necessary for generating the response, this threshold is not under the control of the individual but set by the specific response. This conception of the threshold is similar to the concept of labor intensity in the model of van der Linden (2009). Note the similarity of this idea to the motivation of standard item response models, which can be derived from the assumption that a latent level of knowledge is dichotomized into a binary response at a fixed hurdle (Wirth & Edwards, 2007). Other items might elicit different response processes. Suppose an individual has to name the person who discovered America and four response options are given, like Marco Polo, Americo Vespuzio, Cristobal Colón and Hernán Cortés. In this case it is reasonable to assume that individuals access stored information and evaluate the plausibility of the given response options based on the partial knowledge retrieved so far. Thereby, individuals generate evidence for each of the response options and respond as soon as the evidence has reached a critical limit.
This is a more passive response style, as it is a process of response selection and individuals do not have to go through a solution path in order to solve the item. In this framework, the threshold represents the decision criterion used by the individual for all response options. Although the decision criterion is under the control of the individual and depends on his/her motivation, it is reasonable to assume that the individual differences are negligible in high-stakes testing with sufficient time. It is obvious that the solution process, being a mixture of several strategies, cannot be modeled in its entirety. It is the task of the psychometrician to identify the most relevant element of the process and derive the consequences for the chosen process model. The relevant questions are the number and nature of the stochastic processes and the conceptualization of the thresholds as subject or item specific. The diffusion model represents the preference for one of two response options, and the boundaries are thus supposed to be under the control of the test taker. However, one
must keep in mind that the motivation of the diffusion model was the explanation of the speed-accuracy tradeoff, the observation that individuals can respond faster but then make more errors. This naturally implies variable bounds. This perspective on the solution process is misleading in high-stakes testing. It is true that test takers can be forced to select response options faster, but that does not mean that they can solve an item faster. Speeding up the solution process puts more weight on shortcut strategies, which rely on processes of preference formation and response selection. However, one can question whether such processes are dominant in high-stakes testing with generous time limits. Contrary to the diffusion model, it is not clear how to interpret the components in counter models and accumulator models. Ruan et al. (2008) proposed a counter model in order to model preference formation in conjoint experiments, assuming specific accumulators for each response option and individual specific thresholds. The race model of Rouder et al. (2014) is based on the linear ballistic accumulator model, but does not explicitly state what the accumulators represent or how the thresholds have to be interpreted. In this manuscript, we take the view that the accumulators reflect the process of knowledge acquisition. The proposed accumulator model is based on a bivariate version of the Birnbaum Saunders distribution. The model assumes two processes that reflect the stochastic accumulation of knowledge with respect to two response options, the correct response and an incorrect response. Each accumulator is accompanied by a response specific threshold. This threshold reflects the level of knowledge necessary to generate the corresponding response. Both processes of knowledge acquisition may be dependent, such that mutual influences between the processes can be accounted for. Note, however, that the proposed model can also be used to represent the process of preference formation underlying the selection of response options. In this case one would probably set the thresholds to the same value, so this is a special case of the more general version of the model with option specific thresholds. The outline of the manuscript is as follows. First, the model is derived and implications of the model are investigated. Then, the calibration of the model with marginal maximum likelihood estimation is illustrated and an approach to the evaluation of model fit is described. Finally, the model is used for the analysis of a real data set.

2. The Bivariate Birnbaum Saunders Distribution

The Birnbaum Saunders distribution was introduced by Birnbaum and Saunders (1968) and has since become popular in reliability research. The Birnbaum Saunders distribution is a univariate, two-parameter distribution with positive support proposed for modeling the time to fatigue failure. As demonstrated by Park and Padgett (2005), the Birnbaum Saunders distribution yields a good approximation to the first hitting time of a geometric Brownian motion process and a gamma process. The distribution has been generalized to the bivariate Birnbaum Saunders distribution by Kundu, Balakrishnan, and Jamalizadeh (2010). Pan and Balakrishnan (2011) used this distribution for a race model constructed from the first hitting times of two correlated gamma processes. Multivariate versions of the Birnbaum Saunders distribution have been proposed by Kundu, Balakrishnan, and Jamalizadeh (2013), Caro-Lopera, Leiva, and Balakrishnan (2001) and Lemonte, Martinez-Florez, and Moreno-Arenas (2013). Although the Birnbaum Saunders distribution was developed in reliability engineering as a model for damage accumulation, the distribution can also be used as a model for knowledge acquisition, as will be explicated in the following. The derivation is based on the assumption that individuals acquire knowledge over time. For the sake of simplicity, the momentary knowledge X(t) with respect to the correct response will be denoted as information and the momentary knowledge Y(t) with respect to the incorrect response as misinformation. Individuals start at t = 0 without knowledge, such that X(0) = 0 and Y(0) = 0, although this assumption could be relaxed.


When working on an item of a test, the knowledge of the test taker increases randomly at fixed points of time t_i, i = 1, 2, ..., ∞. The points of time are assumed to be equidistant, separated by the time lag Δt. The increments X_i and Y_i at the distinct points of time t_i are supposed to be positive, independently and identically distributed random variates with expectations E[X_i] = μ_K and E[Y_i] = μ_M, variances var[X_i] = σ_K² and var[Y_i] = σ_M², and a support corresponding to the positive real line. For the moment the two accumulation processes are supposed to be independent, but this assumption will be relaxed later. The total amount of information X(t_i) and misinformation Y(t_i) accumulated until point of time t_i, or alternatively, after t_i/Δt steps, is just the sum of the single increments, X(t_i) = Σ_{j=1}^{t_i/Δt} X_j and Y(t_i) = Σ_{j=1}^{t_i/Δt} Y_j, in case individuals start from zero. As a consequence of the assumptions stated above, the realized level of information and misinformation is positive and strictly increasing over time. Knowledge is accumulated until an accumulator specific threshold is reached, which will be denoted as C_K for the information accumulator and C_M for the misinformation accumulator. Denote T_K and T_M as the times at which the two accumulators reach their corresponding thresholds. The probability that the thresholds are reached before some points of time t_K and t_M is identical to the probability that the critical levels of knowledge have been reached before t_K and t_M:

$$
P(T_K \le t_K,\, T_M \le t_M)
= P\bigl(X(t_K) > C_K,\, Y(t_M) > C_M\bigr)
= P\Biggl(\sum_{j=1}^{t_K/\Delta t} X_j > C_K,\; \sum_{j=1}^{t_M/\Delta t} Y_j > C_M\Biggr)
$$
$$
= P\Biggl(\frac{\sum_{j=1}^{t_K/\Delta t} (X_j - \mu_K)}{\sqrt{t_K/\Delta t}\,\sigma_K} > \frac{C_K - (t_K/\Delta t)\,\mu_K}{\sqrt{t_K/\Delta t}\,\sigma_K},\;
\frac{\sum_{j=1}^{t_M/\Delta t} (Y_j - \mu_M)}{\sqrt{t_M/\Delta t}\,\sigma_M} > \frac{C_M - (t_M/\Delta t)\,\mu_M}{\sqrt{t_M/\Delta t}\,\sigma_M}\Biggr).
\qquad (1)
$$
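To make the accumulation process concrete, the following minimal simulation sketch (Python with numpy; the gamma choice for the increments, the parameter values and the function name are illustrative assumptions, not taken from the paper) generates first-passage times of the two accumulators and an empirical counterpart of the joint probability in Equation 1.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_finishing_times(mu, sd, C, dt=0.1, n=2000):
    """First-passage times of one accumulator built from i.i.d. positive increments.

    The increments are gamma distributed with mean mu and standard deviation sd
    and arrive at equidistant steps of length dt; accumulation stops once the
    threshold C is crossed.
    """
    shape, scale = (mu / sd) ** 2, sd ** 2 / mu  # gamma parameters matching the first two moments
    times = np.empty(n)
    for i in range(n):
        level, t = 0.0, 0.0
        while level < C:
            t += dt
            level += rng.gamma(shape, scale)
        times[i] = t
    return times

# finishing times T_K and T_M of two independent accumulators
T_K = simulate_finishing_times(mu=1.0, sd=0.8, C=100.0)
T_M = simulate_finishing_times(mu=0.8, sd=0.8, C=100.0)
# empirical counterpart of the joint probability in Equation 1
print("P(T_K <= 11, T_M <= 13) ~", np.mean((T_K <= 11.0) & (T_M <= 13.0)))
```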

The probability in Equation 1 can be approximated with the distribution function of the bivariate standard normal distribution, as has been shown by Kundu et al. (2010) and by Pan and Balakrishnan (2011) using a central limit theorem originally given in Hunter (1974). Using the reparametrization α_K = σ_K/√(μ_K C_K) and β_K = C_K Δt/μ_K as well as α_M = σ_M/√(μ_M C_M) and β_M = C_M Δt/μ_M, the probability can be expressed as



$$
P(T_K \le t_K,\, T_M \le t_M) =
\Phi_2\Biggl(\frac{1}{\alpha_K}\Biggl(\sqrt{\frac{t_K}{\beta_K}} - \sqrt{\frac{\beta_K}{t_K}}\Biggr),\;
\frac{1}{\alpha_M}\Biggl(\sqrt{\frac{t_M}{\beta_M}} - \sqrt{\frac{\beta_M}{t_M}}\Biggr)\Biggr),
\qquad (2)
$$

where Φ₂ denotes the cumulative distribution function of the bivariate standard normal distribution with a coefficient of correlation of zero. The distribution in Equation 2 is referred to as the bivariate Birnbaum Saunders distribution. Note that while the parameters α_K, α_M, β_K and β_M of the bivariate Birnbaum Saunders distribution in Equation 2 are identified, the parameters of the accumulator model in Equation 1 are not. This follows from the reparameterization. Multiplying both the thresholds and all increments by a constant leads to exactly the same parameters of the bivariate Birnbaum Saunders distribution. This is identical to the identification problem in the diffusion model, which is identified except for multiplicative transformations.

Accumulator models often assume independence between the accumulation processes (Rouder et al., 2014; Otter et al., 2008), although this assumption has also been relaxed (Ruan et al., 2008; Tuerlinckx & De Boeck, 2005). The finishing times of the accumulators are dependent in case of mutual influences between the single accumulation processes like inhibition or fortification. One approach for modeling dependencies refers to the accumulation process itself, by assuming that the increments X_i and Y_i at each time step are correlated. Although the resulting distribution of the finishing times can be approximated with the same rationale as the one leading to Equation 2, this distribution is rather cumbersome and difficult to use for applications in psychometrics (Pan & Balakrishnan, 2011). Another approach consists in accounting for dependencies between the finishing times of the accumulation processes without close reference to the accumulation process itself, which is a more statistical way of dealing with this problem. One means to capture dependencies between the accumulators is to resort to copulas. The most natural choice of a specific copula in the present case is the normal copula, which is simple to implement. One only has to allow the bivariate standard normal distribution in Equation 2 to have a coefficient of correlation ρ unequal to zero. This increases the flexibility of the distribution, allowing expected response times for correct responses to be faster than expected response times for incorrect responses and vice versa. Note that this extension of the model is not fully supported by assumptions about the solution process, such that the interpretability of the extended model is somewhat impaired. However, there is no need to use the extended model. When interpretability is preferred over flexibility, one can simply adhere to the original version of the model proposed in Equation 2.

The observed response and response time result from a race between the two accumulators. In detail, the accumulator that first reaches its corresponding threshold determines the response. Hence, the observed response time in an item is given by T = min(T_K, T_M) and the observed response is X = I(T_K ≤ T_M), where I(·) is the indicator function. Note that X = 1 denotes a correct and X = 0 an incorrect response. The joint distribution of the observed response X and observed response time T follows from the subdensities implied by the race model based on the response time distribution given in Equation 2. Defining u₁ = (√(t/β_K) + √(β_K/t))/α_K, u₂ = (√(t/β_K) − √(β_K/t))/α_K, v₁ = (√(t/β_M) + √(β_M/t))/α_M and v₂ = (√(t/β_M) − √(β_M/t))/α_M, the joint distribution can be stated as





$$
f(t, x) =
\Biggl[\frac{u_1}{2t\sqrt{2\pi}}\, \exp\Bigl(-\tfrac{1}{2} u_2^2\Bigr)
\Biggl(1 - \Phi\biggl(\frac{v_2 - \rho u_2}{\sqrt{1-\rho^2}}\biggr)\Biggr)\Biggr]^{x}
\Biggl[\frac{v_1}{2t\sqrt{2\pi}}\, \exp\Bigl(-\tfrac{1}{2} v_2^2\Bigr)
\Biggl(1 - \Phi\biggl(\frac{u_2 - \rho v_2}{\sqrt{1-\rho^2}}\biggr)\Biggr)\Biggr]^{1-x}.
\qquad (3)
$$
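A direct transcription of the density in Equation 3 (Python with numpy/scipy; the function name and the parameter values in the usage lines are illustrative assumptions) that can be used, for instance, to obtain the solution probability by numerical integration:

```python
import numpy as np
from scipy.stats import norm

def bs_race_density(t, x, alpha_K, beta_K, alpha_M, beta_M, rho):
    """Joint density f(t, x) of Equation 3; x = 1 denotes a correct response."""
    u1 = (np.sqrt(t / beta_K) + np.sqrt(beta_K / t)) / alpha_K
    u2 = (np.sqrt(t / beta_K) - np.sqrt(beta_K / t)) / alpha_K
    v1 = (np.sqrt(t / beta_M) + np.sqrt(beta_M / t)) / alpha_M
    v2 = (np.sqrt(t / beta_M) - np.sqrt(beta_M / t)) / alpha_M
    s = np.sqrt(1.0 - rho ** 2)
    # subdensity of a correct response: the information accumulator finishes at t
    # while the misinformation accumulator has not yet finished
    f1 = u1 / (2 * t) * norm.pdf(u2) * norm.sf((v2 - rho * u2) / s)
    # subdensity of an incorrect response, with the roles of the accumulators exchanged
    f0 = v1 / (2 * t) * norm.pdf(v2) * norm.sf((u2 - rho * v2) / s)
    return np.where(x == 1, f1, f0)

# solution probability P(x = 1) as the integral of the correct-response subdensity over t
t_grid = np.linspace(0.001, 300.0, 30000)
p_correct = np.sum(bs_race_density(t_grid, 1, 0.6, 10.0, 0.6, 10.0, 0.0)) * (t_grid[1] - t_grid[0])
print("P(x = 1) =", p_correct)  # symmetric parameters and rho = 0 give about 0.5
```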

The race model based on the bivariate Birnbaum Saunders distribution is flexible enough to generate a wide range of response and response time distributions, as demonstrated in Figure 1. In Figure 1 the distribution of the response times f(t), as well as the conditional distributions for correct responses f(t|x = 1) and incorrect responses f(t|x = 0), are given for different values of the parameters. The parameters were chosen to obtain an expected response time of E(t) = 12 s and a solution probability of P(x = 1) = 0.5. The distributions differ in the expected response time for correct and incorrect responses, that is, in E(t|x = 1) and E(t|x = 0). By choosing different values for the copula parameter ρ one can make correct responses either faster than incorrect responses or the converse. Note that the shapes of the distributions resemble the typical distributions of response time data in psychological tests.

Figure 1. Density of the response time distribution implied by the race model based on Equation 3 for different values of the parameters. In all distributions the expected response time is E(t) = 12 s and the solution probability is P(x = 1) = 0.5. The distributions differ in the expected response time for correct and incorrect responses; the difference E(t|x = 1) − E(t|x = 0) ranges from −2 to 2 s.

The proposed approach has similarities with known models for test data, but differs in several aspects. As highlighted by Leiva, Riquelme, Balakrishnan, and Sanhueza (2008), there are
different assumptions about the response process that lead to different response time distributions. Assuming cumulative information acquisition justifies the Birnbaum Saunders distribution. The Weibull distribution can be derived from an extreme value type argument. And the Gamma distribution results as the waiting time of a counting process for the k-th event. With the exception of the Birnbaum Saunders distribution, all these distributions have already been employed in psychology, so this manuscript might fill a gap in the existing literature. Recently, there have been efforts to model dependencies between the competing components in the race model. Probably the most elegant approach consists in modeling the dependencies on the level of information accumulation by assuming dependencies between the two stochastic processes (Kallsen & Tankov, 2006). However, as the two processes of information accumulation are not observed, but only the time when one of the accumulators crosses its corresponding threshold, it is impossible to fit such complex models with the data typically available in psychological assessment. An alternative approach is to account for the dependencies on a higher level, by permitting dependencies between the results of the accumulation processes, that is, between the finishing times corresponding to the different response options. This strategy was chosen by Tuerlinckx and De Boeck (2005), who used a copula for the marginal survivor functions. The specific copula used by these authors allows for an interpretation in terms of a frailty model, mimicking the effects of unaccounted latent traits. In the present manuscript the accumulators were joined by a Gaussian copula. Contrary to the approach of Tuerlinckx and De Boeck (2005), this allows for modeling inhibition between the accumulation processes.

Response time distributions can be characterized by scale, shape and shift, and all these aspects should be considered when modeling response times (Rouder, 2005). Shift parameters shift the response time distribution on the time axis, such that the minimal response time is greater than zero in the shifted distribution. A shift is regarded to be caused by lower levels of the response process, such as reading the item or the motor execution of the response, and denotes the part of the total response time one does not want to model. In experiments with simple decisions the response times are usually short and the time needed for the motor execution has a strong influence on the time to react. Whether a fixed shift parameter or even subject specific shift parameters are needed in psychological assessment is a controversial topic and probably depends on the test. Some authors emphasize its necessity (Rouder, 2005). Nevertheless, most response time models for test data suggested so far do not include a shift parameter. The good fit of models without a shift parameter suggests that shift might be ignorable in psychological assessment. A good experimental setting reduces the importance of a shift parameter even further (Casey & Tyron, 2001). In typical intelligence items, which are graphically presented and do not require long motor reactions, little might be gained by using a shift parameter.

3. A Latent Trait Model Based on the Bivariate Birnbaum Saunders Distribution

In psychological tests one usually observes that the solution probabilities and the mean response times differ over individuals and items. These differences in the distribution of the observed responses and response times can be modeled by relating the parameters of the Birnbaum Saunders distribution, see Equation 3, to characteristics of the individual and the item. This assumption provides the foundation for a latent trait model. In this latent trait model it is assumed that the expected increase in information μ_Kg in a specific item g is related to a first latent trait θ of an individual, while the expected increase in misinformation μ_Mg is related to a second latent trait ω. The remaining quantities, the thresholds C_Kg and C_Mg as well as σ²_Kg and σ²_Mg and the copula parameter ρ_g, are supposed to depend on the item solely. As the expected increase in information is restricted to be positive, it is natural to model the relation between μ_Kg and θ with a log-link and a linear model as log[μ_Kg] = b*_0g + b_1g θ. The same assumption leads to log[μ_Mg] = a*_0g + a_1g ω. Given the reparameterization used for Equation 2, the relation between μ_Kg and θ as well as between μ_Mg and ω implies that the parameters of the Birnbaum Saunders distribution are also related to the two latent traits. The log-link yields the relations

$$
\log[\alpha_{Kg}] = \log\bigl[\sigma_{Kg}/\sqrt{\mu_{Kg} C_{Kg}}\bigr] = b_{01g} - \tfrac{1}{2} b_{1g}\,\theta,
\qquad
\log[\beta_{Kg}] = \log\bigl[C_{Kg}\,\Delta t/\mu_{Kg}\bigr] = b_{02g} - b_{1g}\,\theta,
$$
$$
\log[\alpha_{Mg}] = \log\bigl[\sigma_{Mg}/\sqrt{\mu_{Mg} C_{Mg}}\bigr] = a_{01g} - \tfrac{1}{2} a_{1g}\,\omega,
\qquad
\log[\beta_{Mg}] = \log\bigl[C_{Mg}\,\Delta t/\mu_{Mg}\bigr] = a_{02g} - a_{1g}\,\omega,
\qquad (4)
$$

where the terms unrelated to θ and ω have been combined in the intercepts.
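A minimal sketch of the link functions in Equation 4 (Python; the function name and the way the parameters are bundled are my own conventions, and the printed example uses the illustrative item parameters given for Figure 2 below):

```python
import numpy as np

def item_bs_parameters(theta, omega, b01, b02, b1, a01, a02, a1):
    """Map latent traits and item parameters to (alpha_K, beta_K, alpha_M, beta_M), Equation 4."""
    alpha_K = np.exp(b01 - 0.5 * b1 * theta)
    beta_K  = np.exp(b02 - b1 * theta)
    alpha_M = np.exp(a01 - 0.5 * a1 * omega)
    beta_M  = np.exp(a02 - a1 * omega)
    return alpha_K, beta_K, alpha_M, beta_M

print(item_bs_parameters(theta=0.0, omega=0.0,
                         b01=-0.5, b02=2.5, b1=0.5,
                         a01=-0.3, a02=3.0, a1=0.5))
```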


Intercept b01g, for example, subsumes b01g = log σ_Kg − ½ log C_Kg − ½ b*_0g, and intercept b02g represents b02g = log[C_Kg Δt] − b*_0g. Note that all parameters with subscript g in Equation 4 are regarded as item parameters and differ over different items. The assumption of a log-linear relationship between the expectation of a positive random variate and a set of predictors is standard practice in generalized linear models, and in the gamma regression model especially. Whenever a key quantity in a model is required to be positive, the log-link is the preferred link function; see the application of the log-link by Tuerlinckx and De Boeck (2005) to model the number of decision nodes and by Otter et al. (2008) to relate the intensity function of a Poisson counter model to predictors.

In the proposed latent trait model, the expected increase of information and misinformation depends on two different latent traits. As an anonymous reviewer suggested, one could also use a different parameterization, which, in combination with the log-link, amounts to log[μ_Kg] = b*_0g + θ* + ω* and log[μ_Mg] = a*_0g + θ* − ω*. The first latent trait θ* represents the general response speed or work pace of an individual and accounts for the fact that some individuals answer fast, irrespective of the response, while others always take long. The second trait determines the relative preference for one of the response options. Note that in case of b*_0g = a*_0g the second trait ω* determines the ratio μ_Kg/μ_Mg while the first trait θ* drops out. This parameterization is very interesting as it distinguishes two aspects of the solution process, general work pace and ability. In personality tests and attitudinal scales this might be the best approach, as only the relative preference could be related to the strength of the individual characteristic (although strong self schemata could also increase work pace, of course). In achievement testing, it is less clear which approach is superior. In the present paper, we chose the model suggested in Equation 4 and give some justification for this choice in the section with the empirical application of the model. Besides, from a mathematical point of view, the differences are small. It is always possible to substitute θ = θ* + ω* and ω = θ* − ω* and allow for a correlation between the latent traits. This is similar to rotation in factor analysis, and it might depend on the area of application which parameterization is more reasonable.

As is standard practice in regression analysis and in psychological research, only the expectations of the knowledge increments are related to the latent traits, and not their variances. This is a limitation of the approach. In personality assessment, it might be the case that the stability of the response process has diagnostic value. Whether this is also the case in achievement testing is unknown. Nevertheless, the calibration of latent trait models is always a challenge, and it probably would not be wise to include additional latent traits related to the variance.

The solution probability P(x = 1|θ, ω) implied by the model in Equation 4 is illustrated in Figure 2 for an item with parameters b01g = −0.5, b02g = 2.5, b1g = 0.5 and a01g = −0.3, a02g = 3.0, a1g = 0.5 and ρ = 0.0 for different values of the latent traits. Figure 2 additionally depicts the relation of the expected response time E(t|θ, ω) to the latent traits.
Note the similarity of P(x = 1|θ, ω) to the standard item characteristic curves used in item response theory.

Figure 2. Solution probability P(x = 1|θ, ω) and expected response time E(t|θ, ω) for different combinations of θ and ω. Item parameters are b01 = −0.5, b02 = 2.5, b1 = 0.5, a01 = −0.3, a02 = 3.0, a1 = 0.5 and ρ = 0.0.

The speed-accuracy tradeoff denotes the phenomenon that individuals can increase their speed at the cost of accuracy in simple decision tasks. This phenomenon can easily be elicited by rewarding fast answers instead of accurate answers and vice versa. The speed-accuracy tradeoff is usually explained by the fact that individuals use more severe decision criteria when accuracy is important and more lenient ones when speed is sought after. The speed-accuracy tradeoff is very important in experimental psychology, where the tasks are usually so simple that accuracy is very high, provided that individuals have enough time to answer. How this concept can be transferred to data from psychological tests and the present model is less clear. The thresholds in the present model are supposed to reflect task demands, which cannot be lowered or increased arbitrarily. This is different from the diffusion model, where the stochastic process represents the relative preference

for a response option and the thresholds correspond to the decision criteria employed by the individual. Nevertheless, stimulating speed or accuracy by rewards does have an effect on the solution process. Either manipulation should increase the motivation of an individual, his/her tendency to concentrate on the task and work hard. However, the effects could also be more drastic, changing the whole response process instead of simply adjusting some parameters. Requesting speed seduces individuals to respond on partial knowledge by undermining persistence, that is, to approximate the response by intelligent guessing. This mode of responding might be very atypical for individuals in high-stakes tests. The empirical evidence supports the view that speed is more or less stable during the test (van der Linden, Breithaupt, Chuah, & Zhang, 2007), so at least there is no speed-accuracy tradeoff within a test. Nevertheless, some sort of speed-accuracy tradeoff can be modeled within the present approach. To this end, one has to assume that individuals dispose of a fixed processing capacity κ, which is distributed over the two traits θ and ω, such that κ = θ + ω. A similar position concerning the speed-accuracy tradeoff, namely the idea that it has to be caused by changes in the latent traits, can be found in van der Linden (2009). Although it can be a deliberate choice or the effect of a strong schema to selectively process certain information more intensely than other information, this processing preference could also be stimulus induced, as some cues might attract more attention than others. Different allocations of processing capacity induce a speed-accuracy tradeoff, as demonstrated in Figure 3.

Figure 3. Speed-accuracy tradeoff generated by redistributing a fixed processing capacity κ = θ + ω on the single accumulators. Item parameters are b01 = −0.5, b02 = 2.5, b1 = 0.5, a01 = −0.3, a02 = 3.0, a1 = 0.5 and ρ = 0.0.

Figure 3 represents the relation between the response accuracy P(x = 1|θ, ω) and

the expected response time E(t|θ, ω) for nine individuals with processing capacities between κ = θ + ω = −4 and κ = θ + ω = 4. The tradeoff results from allocating a large amount of the available processing capacity to the information accumulator, namely θ (right side of each line), versus allocating most of the available processing capacity to the misinformation accumulator, namely ω (left side of each line). For example, the bottom line in Figure 3 denotes the tradeoff for an individual with a processing capacity of κ = θ + ω = 4, who systematically redistributes his/her fixed capacity from θ = 1, ω = 3 to θ = 3, ω = 1. As can be seen, the relation between the response accuracy and the expected response time has an inverted-U shape, which is a direct result of the θ-ω tradeoff enforced by θ + ω = const. In fact, a different relation between θ and ω would have implied a different speed-accuracy tradeoff. Whether the assumption of a fixed capacity, which is redistributed over the response options, is a plausible idea or not is hard to evaluate given the limited data available in psychological assessment. From adaptive testing we know that adaptive tests usually gather more information with fewer items than nonadaptive tests. However, this does not necessarily reduce the test time, as informative items (of intermediate difficulty) usually take longest (Wild, 1989). Ferrando and Lorenzo-Seva (2007) have demonstrated an inverted-U relationship between the response probability and the response time in personality scales; however, the authors conducted a between-person analysis. So in general, more research is needed in this field.

The joint distribution of the responses and response times in a test with several items follows from the assumption of conditional independence. The conditional independence assumption states that the responses and response times from different items are independent when conditioning on the latent traits of the individual. This is a common assumption in latent trait models and can be justified with the claim that the latent traits account for all systematic influences on the response process. In case of conditional independence, the joint distribution of all responses and response times given by an individual is just the product of the item specific distributions implied by Equation 3 and Equation 4. Let x = [x_1, ..., x_G] and t = [t_1, ..., t_G] be the observed responses and response times of an individual in a test of G items. The joint distribution follows from the conditional independence assumption as

$$
f(\boldsymbol{x}, \boldsymbol{t} \mid \theta, \omega) = \prod_{g=1}^{G} f\bigl(t_g, x_g;\, \alpha_{Kg}(\theta),\, \beta_{Kg}(\theta),\, \alpha_{Mg}(\omega),\, \beta_{Mg}(\omega),\, \rho_g\bigr),
\qquad (5)
$$

where f(t_g, x_g; α_Kg(θ), β_Kg(θ), α_Mg(ω), β_Mg(ω), ρ_g) is the density function given in Equation 3 and the parameters α_Kg(θ), ..., β_Mg(ω) follow from Equation 4.
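Under conditional independence, the person-level log density in Equation 5 is just a sum of item-wise log densities. A self-contained sketch (Python with numpy/scipy; the function names and the bundling of item parameters into tuples, including an item-specific rho, are my own conventions):

```python
import numpy as np
from scipy.stats import norm

def bs_race_logdensity(t, x, aK, bK, aM, bM, rho):
    """Log of the joint density in Equation 3 for one response and response time."""
    u1 = (np.sqrt(t / bK) + np.sqrt(bK / t)) / aK
    u2 = (np.sqrt(t / bK) - np.sqrt(bK / t)) / aK
    v1 = (np.sqrt(t / bM) + np.sqrt(bM / t)) / aM
    v2 = (np.sqrt(t / bM) - np.sqrt(bM / t)) / aM
    s = np.sqrt(1.0 - rho ** 2)
    if x == 1:
        return np.log(u1 / (2 * t)) + norm.logpdf(u2) + norm.logsf((v2 - rho * u2) / s)
    return np.log(v1 / (2 * t)) + norm.logpdf(v2) + norm.logsf((u2 - rho * v2) / s)

def person_log_density(x, t, items, theta, omega):
    """log f(x, t | theta, omega) of Equation 5.

    x, t  : arrays of length G with responses in {0, 1} and response times > 0
    items : iterable of G tuples (b01, b02, b1, a01, a02, a1, rho)
    """
    total = 0.0
    for g, (b01, b02, b1, a01, a02, a1, rho) in enumerate(items):
        aK = np.exp(b01 - 0.5 * b1 * theta)   # link functions of Equation 4
        bK = np.exp(b02 - b1 * theta)
        aM = np.exp(a01 - 0.5 * a1 * omega)
        bM = np.exp(a02 - a1 * omega)
        total += bs_race_logdensity(t[g], x[g], aK, bK, aM, bM, rho)
    return total
```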

4. Model Calibration and Evaluation of Model Fit

The proposed model can be calibrated with marginal maximum likelihood estimation. Equation 5 denotes the distribution of the responses and response times when conditioning on the latent trait values of an individual. The marginal distribution of the responses and response times follows after integration of the conditional distribution given in Equation 5 over the distribution of the latent traits in the population of potential test takers. Usually, this distribution f(θ, ω) is a bivariate normal distribution with zero means, unit variances and coefficient of correlation ρθ,ω. Taking the logarithm of the integral and summing the terms over the subjects in the calibration sample, one obtains the marginal log-likelihood function Σ_{i=1}^{N} log[∫∫ f(x_i, t_i|θ, ω) f(θ, ω) dθ dω], where the summation is over the i = 1, ..., N subjects in the sample. As the latent traits have been integrated out, the marginal log-likelihood function does not depend on the unknown latent traits of the individuals and is a function of the item parameters only. It is this function that is maximized over the item parameters in marginal maximum likelihood estimation. In short tests of up to twenty items, the best approach to maximization consists in using the Newton–Raphson algorithm or a quasi-Newton method (Nash, 1990).

In order to demonstrate the feasibility of model estimation, a small scale simulation study was conducted. Item responses were generated for a test of 10 items. Sample sizes of 500 and 1,000 subjects were considered. Responses and response times were generated as follows. First, two values were drawn for every fictitious test taker from the bivariate normal distribution with a coefficient of correlation of ρθ,ω = 0.00 as his/her latent traits. Then, the responses and response times were generated according to the proposed model; see Equations 3 and 4. For all items, the same item parameters were chosen. The employed values are given in Table 1. These values resembled the values obtained in the empirical application of the model; see the next section for a detailed description. The chosen item parameters implied mean response times of about 11.7 s and mean solution probabilities of about 0.67. Correct responses were slightly faster (E(t|x = 1) = 11.9) than incorrect responses (E(t|x = 0) = 12.9), as was the case in the real data example.

Table 1. Parameter recovery of the item parameters in the simulation study: true value (TV), average estimate (m), standard error of estimation (SD) and coverage frequency (CF) of confidence intervals for confidence level c = 0.95.

Sample  Statistic    b01       b02      b1       a01       a02      a1       ρ        ρθ,ω
500     TV          −0.500     2.500    0.500   −0.300     3.000    0.500    0.000    0.000
        m           −0.509     2.492    0.494   −0.319     2.976    0.492    0.040    0.047
        SD           0.059     0.057    0.045    0.109     0.147    0.079    0.261    0.125
        CF           0.950     0.947    0.952    0.920     0.913    0.944    0.931    0.972
1000    TV          −0.500     2.500    0.500   −0.300     3.000    0.500    0.000    0.000
        m           −0.504     2.496    0.496   −0.312     2.984    0.493    0.029    0.027
        SD           0.041     0.040    0.031    0.075     0.099    0.055    0.180    0.084
        CF           0.947     0.943    0.956    0.943     0.932    0.944    0.942    0.980

The reported values are averaged over the ten items. Results are based on 250 replications.

Table 2. Empirical rejection rate of two tests for two sample sizes and different nominal Type-I error rates α in the simulation study.

            Test for item fit                    Test for independence
Sample    α = 0.01   α = 0.05   α = 0.10     α = 0.01   α = 0.05   α = 0.10
500       0.012      0.047      0.080        0.008      0.072      0.144
1000      0.010      0.057      0.112        0.004      0.052      0.104

The reported values are averaged over the ten items. Results are based on 250 replications.

The model was calibrated with marginal maximum likelihood estimation as described before. Even though the item parameters were the same in every item, they were estimated freely without restricting them to the same value. The integral over the latent traits, which is required for the marginal likelihood function, was approximated according to the so called Mislevy histogram solution (Mislevy & Stocking, 1989), with 23 nodes equally spread from −5 to 5. This approximation of the integrand was chosen because the likelihood function was sometimes concentrated in a small area of the latent traits, a situation where Gauss Hermite quadrature tends to perform poorly (Schilling & Bock, 2005). However, the difference to Gaussian Hermite Quadrature were negligible when using 23 nodes. The marginal likelihood function was optimized with the procedure optim as provided by the software package R (R Development Core Team, 2009). Altogether, 250 simulation samples were generated for every simulation condition. The results are given in Table 1. As can be seen, the estimates of most parameters are virtually unbiased. In order to study the large sample properties of the estimates, Wald based confidence intervals were determined. The standard errors of estimation of the parameter estimates were calculated with the observed information matrix. The empirical coverage frequencies of the intervals are given in Table 1 for a confidence level of c = 0.95. As can be seen, the coverage probability is generally rather good. The coverage rate is slightly too small for the intercept parameters a01g and b02g of the second accumulator and for the association between the accumulators ρg when the sample size is 500. However, in general the coverage rates should be sufficient for most applications. In addition to the confidence intervals two tests were calculated. The first test was a likelihood ratio test of the hypothesis that all accumulators are independent. This test compares the likelihood of the restricted model with independent accumulators (ρg = 0 for all items) to the likelihood of the extended model with dependent accumulators (ρg = 0 for some items). The results of the test can be found in Table 2. The likelihood ration test of independence is liberal in samples of 500 subjects, but adheres to the nominal Type-I error rate in samples of 1,000 subjects. The second test was a test of item fit. This test assessed the fit of the model for each item separately. The rational of the test is similar to the χ 2 test of goodness of fit (Chernoff & Lehmann, 1954). Response times are binned for correct and incorrect responses. Then the observed number of binned responses is compared to the expected number implied by the model. This yields a χ 2 statistic which follows approximately a χ 2 distribution. The degrees of freedom have to be corrected though to account for the fact that the item parameters of the model have been estimated. This test of item fit was implemented for the following reasons. First, to provide a tool for item selection in test development, similar to the ones used in item response theory. And second, to complement the more traditional graphical checks of model fit like Q–Q-plots. Although undoubtfully useful, these graphical checks do not account for the fact that the same data was used for model calibration and model estimation and that the latent traits can not be estimated with high precision in short scales. The empirical rejection rate of the test of

PSYCHOMETRIKA

item fit is given in Table 2. Note that the results have been averaged over the items. The empirical rejection rate of the test is close to the Type-I error level, especially in samples of 1000 subjects. Having calibrated the model defined by Equations 3 and 4 it can be used as a measurement model in order to estimate the latent traits of the individuals. A standard approach in test theory is the so called maximum a posteriori (MAP) estimator (Baker & Seock-Ho, 2004). The MAP estimator is the maximum of the conditional distribution of the latent traits when the responses and response times have been observed and the latent traits are assumed to be distributed  according   to the  bivariate normal distribution. This conditional distribution is proportional to f x, t|θ, ω f θ, ω , where the first distribution denotes the joint distribution of the responses and response times given in Equation 5 and the second distribution is the bivariate standard normal distribution. A second simulation study was conducted in order to explore the performance of the MAP estimator. In this simulation study the data, that is the latent traits, the responses and response times, were generated as in the first simulation study. Then, the corresponding MAP estimator was calculated for every subject, using the true item parameters when determining the maximum of the posterior distribution. The usage of the true item parameters can be justified by the fact that with a large calibration sample there is only a small difference between the estimates and the true values. For the first latent trait θ , the correlation between the estimated trait value and the true latent trait was r θ ,θ = 0.91. For the second latent trait ω the correlation between the estimate and the true value was r ω,ω = 0.76. The reliability was lower for the second trait as the probability of a correct response was higher than the probability of an incorrect response. Altogether, taking into account that the test was rather short, the reliability is acceptable.
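A minimal sketch of the MAP scoring step just described (Python with scipy; person_log_density is again the hypothetical Equation 5 helper from the earlier sketch, and the unconstrained start at (0, 0) is an arbitrary choice):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

def map_estimate(x_i, t_i, items, rho_trait=0.0):
    """Maximum a posteriori estimate of (theta, omega) for one person.

    Maximizes log f(x, t | theta, omega) plus the log of a bivariate standard
    normal prior with correlation rho_trait, as described in the text.
    """
    prior = multivariate_normal(mean=[0.0, 0.0],
                                cov=[[1.0, rho_trait], [rho_trait, 1.0]])

    def neg_log_posterior(z):
        theta, omega = z
        return -(person_log_density(x_i, t_i, items, theta, omega) + prior.logpdf(z))

    return minimize(neg_log_posterior, x0=np.zeros(2), method="BFGS").x
```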

5. Empirical Application

In order to assess the value of the model for psychological assessment, a real data set was analyzed. The data used for this investigation consisted of responses and response times in a test of chess playing proficiency. For a detailed description of the test and the data see van der Maas and Wagenmakers (2005). Here, the items from the subscales 'positional move' and 'end move' were used. Each subscale consists of 10 items displaying a chess position for which the best move has to be stated. In all items a single move is clearly superior to all other moves. The items from the subscale 'positional move' are intended to assess positional knowledge, while the items from the subscale 'end move' tap endgame knowledge. No fixed set of response options was given, as the subjects had to indicate the right move on the chess board. Each response was scored as either true or false and no distinction between the different erroneous responses was made. A time limit of 30 s was set for each item. Overall, 259 subjects completed the test.

First, the item difficulties and average response times were calculated and suitable items were chosen for the further analysis. Two items of the subscale 'end move' were excluded as they were regarded as not suitable. The first item from this subscale had a very high solution probability of 0.984, thus providing only little information about the misinformation accumulator. The last item was excluded because a large proportion of the subjects could not finish the item within the given time limit, so no valid information was available about the response time. In the subscale 'positional move' three items had to be excluded due to severe effects of speededness, namely item 5, item 7 and item 10. After the item selection, the items from the two subscales were analyzed with the accumulator model as described before. Each subscale was analyzed separately. The estimated item parameters can be found in Table 3. Having calibrated the model, the proposed test of item fit was calculated. The resulting p-values of the single tests are given in Table 3 for each item.

Table 3. Estimated item parameters (standard error of estimation) and results of the tests of item fit for two scales of the chess test.

Choose a move: endgame knowledge
Item   b01           b02           b1            a01           a02           a1            ρ             p-value
1      −1.09 (.09)    1.36 (.04)   −0.47 (.03)    0.10 (.44)    3.33 (.72)    0.88 (.31)   −0.45 (.42)    0.14
2      −0.79 (.08)    2.09 (.05)   −0.56 (.05)    0.25 (.44)    4.12 (.73)    0.55 (.50)   −0.69 (.29)    0.24
3      −0.68 (.10)    2.51 (.07)   −0.59 (.07)   −0.75 (.15)    3.14 (.15)    0.37 (.09)    0.12 (.37)    0.63
4       0.25 (.33)    4.47 (.41)   −0.78 (.28)   −0.67 (.07)    2.84 (.07)    0.42 (.05)   −0.99 (.00)    0.02
5      −0.07 (.11)    2.89 (.10)   −0.30 (.10)    0.02 (.12)    3.11 (.16)    0.74 (.14)   −0.86 (.12)    0.58
6       0.92 (.40)    5.65 (.65)   −1.92 (.36)   −0.55 (.05)    2.73 (.06)    0.40 (.05)   −0.99 (.00)    0.32
7      −0.48 (.13)    2.90 (.17)   −0.78 (.14)   −0.49 (.05)    2.51 (.08)    0.32 (.07)    0.34 (.29)    0.20
8       0.12 (.30)    4.09 (.42)   −1.20 (.31)   −0.75 (.08)    2.79 (.06)    0.24 (.05)   −0.27 (.33)    0.49

Choose a move: positional knowledge
Item   b01           b02           b1            a01           a02           a1            ρ             p-value
1      −0.83 (.07)    2.25 (.05)   −0.44 (.04)    0.07 (.35)    4.16 (.55)    0.31 (.39)   −0.81 (.22)    0.16
2      −0.55 (.18)    2.75 (.17)   −0.39 (.06)   −0.61 (.11)    2.53 (.11)    0.42 (.06)    0.00 (.49)    0.33
3      −0.50 (.10)    2.60 (.08)   −0.68 (.05)   −1.23 (.20)    3.09 (.11)    0.41 (.05)   −0.44 (.38)    0.03
4      −0.62 (.10)    2.88 (.11)   −0.45 (.06)   −0.73 (.07)    2.61 (.06)    0.34 (.04)    0.95 (.06)    0.05
5       0.58 (.42)    5.33 (.65)   −1.88 (.33)   −0.88 (.06)    2.81 (.05)    0.25 (.05)   −0.39 (.30)    0.06
6      −0.45 (.17)    3.30 (.23)   −0.59 (.14)   −0.89 (.07)    2.76 (.04)    0.23 (.03)    0.60 (.24)    0.08
7       0.53 (.62)    5.13 (1.01)  −1.46 (.52)   −0.84 (.06)    2.70 (.05)    0.30 (.04)    0.56 (.21)    0.16

The two latent traits correlated with ρθ,ω = −0.01 in the 'end move' subscale and with ρθ,ω = −0.72 in the 'positional move' subscale. The estimated item parameters are more or less reasonable. However, there are three exceptions. According to the estimates, the two accumulators are nearly perfectly correlated in item four and item six of the 'end move' subscale and in item four of the 'positional move' subscale. This clearly indicates convergence problems of the marginal maximum likelihood estimator. Similar problems occurred in the simulation study with bad starting values and small data sets. The sample size of 259 subjects is rather small for fitting a complex model. Note that the precision of the correlation estimates is generally rather low. However, simplifying the model by restricting the correlation to zero was not possible. Testing ρg = 0 in all items with the likelihood ratio test proposed above revealed that the accumulators are not independent, both in the 'end move' subscale (χ² = 58.9, df = 8, p < .01) and in the 'positional move' subscale (χ² = 87.5, df = 7, p < .01).

The results of the tests of item fit were promising. Overall, the model was capable of representing the response and response time distributions in the single items.

Figure 4. Q–Q-plots for the marginal response time distribution in the seven items of the 'positional move' subscale.

Only one item was marked
as misfitting in the subscale 'end move'. The results are not as good for the subscale 'positional move', where two items stood out at α = 0.05. However, an inspection of the Q–Q-plots contrasting the empirical quantiles with the theoretical quantiles implied by the model did not indicate a severe misfit of the model. The Q–Q-plots for the items of the subscale 'positional move' are given in Figure 4. This subscale was chosen as its items had, in general, the worse model fit. As can be seen, the misfit is mostly a result of the larger response times, which cannot be represented by the model. This is due to the time limit imposed by the test. But apart from this violation, the lower part of the empirical distribution matches the theoretical distribution implied by the model rather well.

The diagnostic value of the model was assessed by investigating the predictive power of the latent traits. To this end, the elo score of an individual was predicted from the estimated trait values. The elo score determines the position of a chess player in the ranking list and summarizes the number of wins and defeats in the past. In a first step of the analysis, the latent traits were estimated for the individuals using the proposed accumulator model as a measurement model with the item parameters given in Table 3. Estimation was based on the maximum of the posterior distribution of the latent traits given the response and response time pattern of an individual. The usage of the so-called MAP estimator is standard practice in item response theory. Using a linear regression model, the estimates from the subscale 'end move' could predict the elo score with a coefficient of determination of R² = 0.528. The estimates from the subscale 'positional move' could predict the elo score with a coefficient of determination of R² = 0.513. The predictive power of the accumulator model was compared to the predictive power of a two-parameter logistic model for the responses in combination with a standard factor model for the log response times. To this end, the two-parameter logistic model was fitted to the responses and the ability of the individuals was estimated from their responses alone with the corresponding MAP estimator. Then, the log response times were analyzed with a standard factor model and the underlying factor value, the speed of responding, was determined. Both quantities, ability and speed, were used for predicting the elo score via a linear regression model. In the 'end move' subscale the coefficient of determination was R² = 0.513 and in the 'positional move' subscale the coefficient of determination was R² = 0.516. One can notice that the predictive power of the accumulator model is higher in the first subscale, although the difference is small.

The two latent traits of the accumulator model were correlated with further external variables, which were assessed in the chess test in addition to the two subscales. These variables comprised an individual motivation score (Mot), the ability to predict the next move in a chess game (Predict), a measure of chess knowledge (Knowl) and the ability to remember chess positions (Recall). The results can be found in Table 4. The table also contains the correlations of the external variables with the latent trait of the two-parameter logistic model (ability) and the factor score of the factor model (speed). Note that ability and speed both correlate substantially with all external variables. This indicates that both traits share some content related to chess playing proficiency. In the accumulator model, only the first latent trait θ, which determines the acquisition of information, is correlated with the external criteria. The second trait ω, related to the acquisition of misinformation, has only very weak correlations (subscale 'end move') or weak correlations (subscale 'positional move') with the external criteria. As both estimators had about the same reliability, this finding cannot be explained by different amounts of attenuation. This suggests the interpretation that the accumulator model succeeds in concentrating the chess playing proficiency more purely in one latent trait. The model allows one to measure productive work pace and not just general processing speed. This might be advantageous in the present area of application, as chess playing proficiency consists in making good decisions fast.
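The predictive comparison reported above amounts to two linear regressions of the elo score on the respective trait estimates and a comparison of the resulting R². A minimal sketch (Python with numpy; all variable names are illustrative):

```python
import numpy as np

def r_squared(y, predictors):
    """R^2 of an OLS regression of y on the given predictors (with intercept)."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

# r_squared(elo, [theta_hat, omega_hat])       # accumulator model trait estimates
# r_squared(elo, [ability_2pl, speed_factor])  # 2-PL ability plus log-RT factor score
```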

6. Discussion

Since their introduction in the 1960s, item response models have become a popular method in psychological assessment. Item response models postulate latent traits that are related to the observable responses given by an individual. This relation between the latent traits and their observable manifestations allows the inference of the latent traits from the behavior of an individual. Item response models have been used successfully in educational testing, but also for attitudinal scales and personality tests. The possibility of using the same model in different areas of application has been an important reason for the success of these models. Their general applicability is, however, also the weakness of item response models. Due to their generality, they are only weakly connected to the response process. This is in contrast to the process models used in cognitive psychology, which are derived from detailed assumptions about the response process. However, these models are usually not designed to measure individual differences and may be difficult to use in psychological assessment. The present manuscript is an attempt to combine both approaches, although the focus is clearly on psychometrics. The proposed model is based on more pronounced assumptions about the response process and can be used for psychological assessment, similar to item response models. This approach might be advantageous.

Table 4.
Correlations between the two latent traits θ and ω as defined by the accumulator model, the ability estimate from the two-parameter logistic model, the factor score of the standard factor model and several external criteria.

Choose a move: endgame knowledge
Model      Trait     Elo      Mot      Predict   Knowl    Recall
AC-Model   θ         0.669    0.324    0.613     0.437    0.581
AC-Model   ω        −0.187    0.083   −0.156    −0.217   −0.095
2-PL       Ability   0.662    0.106    0.554     0.497    0.497
FA         Speed    −0.502   −0.337   −0.457    −0.298   −0.478

Choose a move: positional knowledge
Model      Trait     Elo      Mot      Predict   Knowl    Recall
AC-Model   θ         0.626    0.211    0.466     0.426    0.470
AC-Model   ω         0.360    0.192    0.240     0.182    0.302
2-PL       Ability   0.637    0.131    0.502     0.525    0.430
FA         Speed    −0.503   −0.237   −0.383    −0.309   −0.417

Elo = Elo score, Mot = motivation questionnaire, Predict = predict-a-move test, Knowl = verbal knowledge questionnaire, Recall = recall test of chess positions. The traits θ and ω were estimated with the accumulator model, using the responses and response times in the corresponding subscale. The ability level (Ability) of an individual was estimated by the two-parameter logistic model (2-PL) using the responses. The speed trait (Speed) was estimated by factor analyzing (FA) the log response times. Results are reported separately for estimates from the ‘end move’ subscale and the ‘positional move’ subscale.

By modeling elements of the response process, the latent traits derive their meaning from the sub-processes they are related to. Thereby, different components of the response process can be separated. This might result in better predictors than the usual latent traits, which mingle very different attributes of an individual such as ability, processing speed or motivation; see van der Maas et al. (2011) for a more profound discussion of the benefits of process models. In the proposed model, two accumulators are assumed that represent the acquisition of information and misinformation. Both accumulators are associated and depend on different latent traits. Contrary to standard accumulator models, the presented approach allows for dependencies between the accumulators and can account for mutual inhibition between the two processes of knowledge acquisition. Rather general assumptions about the process of knowledge accumulation give rise to the bivariate Birnbaum–Saunders distribution, from which the distribution of the responses and response times in a test can be derived (the race mechanism is illustrated by the simulation sketch at the end of this section). The proposed model meets all demands on measurement models in psychological assessment. It can be calibrated with the data typically available in psychological assessment and it allows for estimation of the latent traits. A test of item fit was also proposed. There are thus no more obstacles to using the present model than to using a standard item response model. The applicability of the model is wider than it may seem at first sight. The model could also be used to account for the motivation of an individual when the second accumulator is assumed to represent an increasing level of demotivation (Cousineau, 2004; Dufau, Grainger, & Ziegler, 2012). This feature makes the present model applicable to open questions, where individuals usually stop working after some time even though they have not found the right solution. The model can also be used for attitudinal scales and personality tests by relating the two accumulators to the endorsement and rejection of an item. This even allows the modeling of graded responses, when the extent of agreement over disagreement is related to the difference of the two accumulators at the time of responding. Summing up, the approach offers an interesting alternative to standard item response models.
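The following R sketch makes the race mechanism tangible by simulating responses and response times from a race between two dependent Birnbaum–Saunders completion times, using the standard normal representation of the Birnbaum–Saunders distribution. The function name r_bs_race and the parameter values are illustrative assumptions; how the shape and scale parameters depend on the latent traits and item parameters is specified by the model itself and is not reproduced here.

```r
## A minimal Monte Carlo sketch of the race idea (not the authors' code):
## two dependent Birnbaum-Saunders distributed completion times, where the
## smaller one determines the response and the response time. alpha1, beta1,
## alpha2, beta2 and rho are generic placeholders.
set.seed(1)

r_bs_race <- function(n, alpha1, beta1, alpha2, beta2, rho) {
  # bivariate standard normal variates with correlation rho
  z1 <- rnorm(n)
  z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)
  # Birnbaum-Saunders variates via the usual normal representation
  t1 <- beta1 * (alpha1 * z1 / 2 + sqrt((alpha1 * z1 / 2)^2 + 1))^2
  t2 <- beta2 * (alpha2 * z2 / 2 + sqrt((alpha2 * z2 / 2)^2 + 1))^2
  data.frame(
    correct = as.integer(t1 < t2),  # accumulator 1 (correct answer) wins
    rt      = pmin(t1, t2)          # observed response time = first finisher
  )
}

sim <- r_bs_race(10000, alpha1 = 0.6, beta1 = 8, alpha2 = 0.8, beta2 = 12, rho = -0.3)
mean(sim$correct)                     # implied probability of a correct response
quantile(sim$rt, c(0.25, 0.5, 0.75))  # implied response time quantiles
```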

References

Baker, F., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques. New York, NY: Marcel Dekker.
Birnbaum, Z., & Saunders, S. (1968). A new family of life distributions. Journal of Applied Probability, 6, 319–327.
Brown, S., & Heathcote, A. (2005). A ballistic model for choice response times. Psychological Review, 112, 117–128.
Busemeyer, J., & Townsend, J. (1992). Fundamental derivations from decision field theory. Mathematical Social Sciences, 23, 255–282.
Caro-Lopera, F., Leiva, V., & Balakrishnan, N. (2001). Connection between the Hadamard and matrix products with an application to matrix-variate Birnbaum–Saunders distributions. Journal of Multivariate Analysis, 104, 126–139.
Casey, M., & Tryon, W. (2001). Validating a double-press method for computer administration of personality inventory items. Psychological Assessment, 13, 521–530.
Chernoff, H., & Lehmann, E. (1954). The use of the maximum likelihood estimates in χ²-tests for goodness of fit. Annals of Mathematical Statistics, 25, 579–586.
Cousineau, D. (2004). Merging race models and adaptive networks: A parallel race network. Psychonomic Bulletin and Review, 11, 807–825.
Dufau, S., Grainger, J., & Ziegler, J. (2012). How to say no to a nonword: A leaky competing accumulator model of lexical decision. Journal of Experimental Psychology, Online First.
Ferrando, P., & Lorenzo-Seva, U. (2007). An item response theory model for incorporating response time data in binary personality items. Applied Psychological Measurement, 31, 525–543.
Hunter, J. (1974). Renewal theory in two dimensions: Asymptotic results. Advances in Applied Probability, 6, 546–562.
Kallsen, J., & Tankov, P. (2006). Characterization of dependence of multidimensional Levy processes using Levy copulas. Journal of Multivariate Analysis, 97, 1551–1572.
Kundu, D., Balakrishnan, N., & Jamalizadeh, A. (2010). Bivariate Birnbaum–Saunders distribution and associated inference. Journal of Multivariate Analysis, 101, 113–125.
Kundu, D., Balakrishnan, N., & Jamalizadeh, A. (2013). Generalized multivariate Birnbaum–Saunders distributions and related inferential issues. Journal of Multivariate Analysis, 116, 230–244.
Leiva, V., Riquelme, M., Balakrishnan, N., & Sanhueza, A. (2008). Lifetime analysis based on the generalized Birnbaum–Saunders distribution. Computational Statistics and Data Analysis, 52, 2079–2097.
Lemonte, A., Martinez-Florez, G., & Moreno-Arenas, G. (2013). Multivariate Birnbaum–Saunders distribution: Properties and associated inference. Computational Statistics and Data Analysis, Online First. doi:10.1080/00949655.2013.823964.
Mislevy, R., & Stocking, M. (1989). A consumer's guide to LOGIST and BILOG. Applied Psychological Measurement, 13, 57–75.
Nash, J. (1990). Compact numerical methods for computers: Linear algebra and function minimisation. Bristol: Adam Hilger.
Otter, T., Allenby, G., & Van Zandt, T. (2008). An integrated model of discrete choice and response time. Journal of Marketing Research, 45, 593–607.
Pan, Z., & Balakrishnan, N. (2011). Reliability modeling of degradation of products with multiple performance characteristics based on gamma processes. Reliability Engineering and System Safety, 96, 949–957.
Park, C., & Padgett, W. (2005). Accelerated degradation models for failure based on geometric Brownian motion and gamma processes. Lifetime Data Analysis, 11, 511–527.
Pike, R. (1973). Response latency models for signal detection. Psychological Review, 80, 53–68.
R Development Core Team. (2009). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. http://www.R-project.org. ISBN 3-900051-07-0.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59–108.
Ratcliff, R., & Rouder, J. (1998). Modeling response times for two-choice decisions. Psychological Science, 9, 347–356.
Ratcliff, R., & Smith, P. (2004). A comparison of sequential sampling models for two-choice reaction time. Psychological Review, 111, 333–367.
Ratcliff, R., & Tuerlinckx, F. (2002). Estimating parameters of the diffusion model: Approaches to dealing with contaminant reaction times and parameter variability. Psychonomic Bulletin and Review, 9, 438–481.
Rouder, J. (2005). Are unshifted distributional models appropriate for response time? Psychometrika, 70, 377–381.
Rouder, J., Province, J., Morey, R., Gomez, P., & Heathcote, A. (2014). The lognormal race: A cognitive-process model of choice and latency with desirable psychometric properties. Psychometrika, Online First.
Ruan, S., MacEachern, S., Otter, T., & Dean, A. (2008). The dependent Poisson race model and modeling dependence in conjoint choice experiments. Psychometrika, 73, 261–288.
Schilling, S., & Bock, R. (2005). High-dimensional maximum marginal likelihood item factor analysis by adaptive quadrature. Psychometrika, 70, 533–555.
Schnipke, D., & Scrams, D. (2002). Exploring issues of examinee behavior: Insights gained from response-time analyses. In C. Mills, M. Potenza, J. Fremer, & W. Ward (Eds.), Computer-based testing: Building the foundation for future assessments (pp. 237–266). Mahwah, NJ: Lawrence Erlbaum.
Smith, P., & Van Zandt, T. (2000). Time-dependent Poisson counter models of response latency in simple judgment. British Journal of Mathematical and Statistical Psychology, 53, 293–315.
Smith, P., & Vickers, D. (1988). The accumulator model of two-choice discrimination. Journal of Mathematical Psychology, 32, 135–168.
Townsend, J., & Ashby, F. (1983). Stochastic modeling of elementary psychological processes. Cambridge: Cambridge University Press.
Tuerlinckx, F. (2004). A multivariate counting process with Weibull-distributed first-arrival times. Journal of Mathematical Psychology, 48, 65–79.
Tuerlinckx, F., & De Boeck, P. (2005). Two interpretations of the discrimination parameter. Psychometrika, 70, 629–650.
Usher, M., & McClelland, J. (2001). On the time course of perceptual choice: The leaky competing accumulator model. Psychological Review, 108, 550–592.
van der Linden, W. (2009). Conceptual issues in response-time modeling. Journal of Educational Measurement, 46, 247–272.
van der Linden, W., Breithaupt, K., Chuah, S., & Zhang, Y. (2007). Detecting differential speededness in multistage testing. Journal of Educational Measurement, 44, 117–130.
van der Maas, H., Molenaar, D., Maris, G., Kievit, R., & Borsboom, D. (2011). Cognitive psychology meets psychometric theory: On the relation between process models for decision making and latent variable models for individual differences. Psychological Review, 118, 339–356.
van der Maas, H., & Wagenmakers, E. (2005). A psychometric analysis of chess expertise. American Journal of Psychology, 118, 29–60.
Vandekerckhove, J., Tuerlinckx, F., & Lee, M. (2011). Hierarchical diffusion models for two-choice response times. Psychological Methods, 16, 44–62.
Vickers, D. (1970). Evidence for an accumulator model of psychophysical discrimination. Ergonomics, 13, 37–58.
Wagenmakers, E., van der Maas, H., & Grasman, R. (2007). An EZ-diffusion model for response time and accuracy. Psychonomic Bulletin and Review, 14, 3–22.
Wild, B. (1989). Neue Erkenntnisse zur Effizienz des tailored-adaptiven Testens [New insights into the efficiency of tailored-adaptive testing]. In K. Kubinger (Ed.), Moderne Testtheorie [Modern test theory] (2nd ed., pp. 179–187). Weinheim: Beltz.
Wirth, R., & Edwards, M. (2007). Item factor analysis: Current approaches and future directions. Psychological Methods, 12, 58–79.

Manuscript Received: 26 APR 2012
