JSLHR

Research Article

Item Response Theory Modeling of the Philadelphia Naming Test

Gerasimos Fergadiotis,ᵃ Stacey Kellough,ᵇ and William D. Hulaᵇ,ᶜ

Purpose: In this study, we investigated the fit of the Philadelphia Naming Test (PNT; Roach, Schwartz, Martin, Grewal, & Brecher, 1996) to an item-response-theory measurement model, estimated the precision of the resulting scores and item parameters, and provided a theoretical rationale for the interpretation of PNT overall scores by relating explanatory variables to item difficulty. This article describes the statistical model underlying the computer adaptive PNT presented in a companion article (Hula, Kellough, & Fergadiotis, 2015).

Method: Using archival data, we evaluated the fit of the PNT to 1- and 2-parameter logistic models and examined the precision of the resulting parameter estimates. We regressed the item difficulty estimates on three predictor variables: word length, age of acquisition, and contextual diversity.

Results: The 2-parameter logistic model demonstrated marginally better fit, but the fit of the 1-parameter logistic model was adequate. Precision was excellent for both person ability and item difficulty estimates. Word length, age of acquisition, and contextual diversity all independently contributed to variance in item difficulty.

Conclusions: Item-response-theory methods can be productively used to analyze and quantify anomia severity in aphasia. Regression of item difficulty on lexical variables supported the validity of the PNT and interpretation of anomia severity scores in the context of current word-finding models.

ᵃPortland State University, OR
ᵇGeriatric Research Education and Clinical Center, VA Pittsburgh Healthcare System, PA
ᶜUniversity of Pittsburgh, PA

Correspondence to Gerasimos Fergadiotis: [email protected]
This article is a companion to Hula et al., “Development and Simulation Testing of a Computerized Adaptive Version of the Philadelphia Naming Test,” JSLHR, doi:10.1044/2015_JSLHR-L-14-0297
Editor: Rhea Paul
Associate Editor: Swathi Kiran
Received September 5, 2014
Revision received January 6, 2015
Accepted February 15, 2015
DOI: 10.1044/2015_JSLHR-L-14-0249
Disclosure: The authors have declared that no competing interests existed at the time of publication.
Journal of Speech, Language, and Hearing Research • Vol. 58 • 865–877 • June 2015 • Copyright © 2015 American Speech-Language-Hearing Association

The ability to name objects reflects cognitive processes that are typically impaired in aphasia, and successful language production depends heavily on these processes. Indeed, naming impairment is a core feature of aphasia (Goodglass & Wingfield, 1997; Kohn & Goodglass, 1985; Nickels, 2002), and naming deficits often continue to affect individuals with aphasia even if other symptoms of aphasia remit (Goodglass & Wingfield, 1997). Naming tests can provide useful information regarding the severity of naming impairment and a means of assessing the effects of treatment and recovery. Naming accuracy is a good indicator of overall aphasia severity (Schuell, Jenkins, & Jimenez-Pabon, 1964; Walker & Schwartz, 2012), and improvement in naming has been linked to improvement in overall communicative functioning (Carragher, Conroy, Sage, & Wilkinson, 2012; Herbert, Hickin, Howard, Osborne, & Best, 2008).

Models of picture naming generally describe it as a complex multistage process in which earlier stages are dominated by semantic information and later stages by phonological information (Caramazza, 1997; Dell, 1986; Levelt, Roelofs, & Meyer, 1999). The most widely accepted models of naming involve a spreading-activation mechanism that acts across this sequence of stages (Dell, 1986; Levelt et al., 1999). Successful recognition of the stimulus leads to activation of a conceptual semantic representation of the target, which in turn activates a lexical representation that is specified for meaning and syntax. This is followed by retrieval of the word’s phonological form and sequencing of its constituent phonemes. Accounts differ in their details, but within such models aphasic naming errors are generally considered to result from reduced or insufficiently persistent activation of target representations relative to competing nontarget representations and/or noise in the system (Dell, Schwartz, Martin, Saffran, & Gagnon, 1997; Schwartz, Dell, Martin, Gahl, & Sobel, 2006). Depending on the severity of the lesion and its locus within the model, different error patterns may result. For example, reduced activation of lexical-semantic representations may result in a larger proportion of semantic errors (e.g., “dog” for the target
“cat”), whereas reduced activation of phoneme representations may result in more frequent phonological errors (e.g., “tat” for the target “cat”; Foygel & Dell, 2000). Evidence suggests that overall naming severity is associated with different error-type distributions, with increasing severity associated with higher proportions of phonological, unrelated, and nonword errors (Foygel & Dell, 2000; Roach, Schwartz, Martin, Grewal, & Brecher, 1996).

The Philadelphia Naming Test (PNT; Roach et al., 1996) is a prominent naming test that was developed as part of a larger set of studies investigating models of lexical retrieval in normal processing and aphasia (e.g., Dell & O’Seaghdha, 1992; Dell et al., 1997; Martin, Dell, Saffran, & Schwartz, 1994). The PNT is a 175-item picture confrontation-naming test, with each item containing a black-and-white line drawing depicting a common noun of high, medium, or low frequency, on the basis of the Francis and Kučera (1982) counts (Roach et al., 1996). The naming targets are one to four syllables in length and were selected from a larger set of 277 items on the basis of the responses of 30 healthy control subjects. One hundred thirty-six items were named correctly by all 30 control subjects, and all items had a passing rate of at least 85%. More recently, the test–retest reliability for overall accuracy on the PNT has been estimated at .99 in a sample of participants with aphasia (Walker & Schwartz, 2012). Finally, in addition to providing an assessment of overall naming accuracy, the PNT scoring system includes procedures for coding types of response errors (e.g., semantic, formal, unrelated, etc.).

Item Response Theory

Item response theory (IRT; Lord, Novick, & Birnbaum, 1968) was first applied to language performance in aphasia by Willmes (1981) and has increasingly been used in clinical and rehabilitation measurement over the past decade (e.g., Eadie et al., 2006; Jette et al., 2007; Reeve et al., 2007). In IRT, statistical models are used to make inferences about test takers’ underlying latent traits on the basis of their observed patterns of behaviors. In contrast to observable behaviors such as confrontation-naming responses, a hypothetical latent trait such as naming ability cannot be directly observed, but must be inferred on the basis of indicator variables instead. For example, the one-parameter logistic (1-PL) IRT model defines the probability that an examinee responds correctly to an item (an observed behavior), given person ability and item difficulty (both latent variables on the same scale). The 1-PL model can be represented mathematically as

\[
P(x_i = 1 \mid \theta_j) = \frac{e^{a(\theta_j - \delta_i)}}{1 + e^{a(\theta_j - \delta_i)}}, \tag{1}
\]

where P(xi = 1 | θj) is the probability of a correct response xi = 1 by examinee j given her latent trait θj, a is the item discrimination parameter (assumed to be equal for all items), and δi is item i’s difficulty parameter. Item difficulty describes the location of an item on the ability spectrum. Items with higher difficulty values are more likely to elicit
incorrect responses, and examinees with higher ability are more likely to respond correctly to a given item. The 1-PL model further assumes all of the items are equally discriminating. For all practical purposes, the 1-PL model is mathematically equivalent to the Rasch model¹ that was developed independently (Andrich, 2004; de Ayala, 2009; Rasch, 1980). In both cases, the latent trait θj is a mathematical instantiation of the construct measured by the test. Even though θj is not directly observed, it can be estimated on the basis of the test taker’s observed responses. For an introduction to IRT concepts and applications in the context of speech-language pathology, interested readers are directed to Baylor et al. (2011). For a more general and complete presentation, see de Ayala (2009), Embretson and Reise (2000), and Hambleton and Swaminathan (1985).

¹Despite their mathematical similarity, the IRT 1-PL and Rasch models are associated with very distinct approaches to latent-trait measurement, which can be summarized as descriptive versus prescriptive (Andrich, 2004; Massof, 2011). In IRT the model may be modified to better describe the observed data. In the Rasch approach, the model represents substantive claims about the requirements of measurement, and the data are typically edited to conform to it.

One of the costs of IRT modeling is that the data must meet strong assumptions. The most commonly used IRT models assume that the item set is unidimensional, meaning that the items all respond to a single common underlying ability or trait. A second assumption concerns the specific form of the model. The simplest IRT model, the 1-PL model, assumes that all of the items are equally discriminating, that is, that each item is related to the latent trait with the same strength. The two-parameter logistic (2-PL) model relaxes this assumption, permitting items to vary in their discrimination. This complicates both the estimation and the interpretation of the model. The 2-PL model also requires larger sample sizes for stable estimation of item discrimination parameters (de Ayala, 2009). Both the 1-PL and 2-PL models also assume that guessing does not affect performance, that is, that persons of very low ability are very unlikely to respond correctly. Given the low potential for correct guessing of confrontation-naming items, this assumption is likely to be met and will not be discussed further. A final assumption is that the items are locally independent, that is, that all responses are unrelated, conditional on the latent trait. Put differently, this assumption states that items have the same probability of a correct response after accounting for differences in their location (or difficulty) along the dimension being measured. One typical example of a situation that would cause items to violate this assumption would be sets of comprehension questions that refer to particular passages of text. Item sets referring to the same passage would be expected to show local dependence because they are related not only by reading-comprehension ability but also by their connection to the same text.

If the data usefully approximate IRT model assumptions, substantial benefits can accrue to test users. One of these benefits is person ability estimates that are in theory
independent of the particular items administered and therefore directly comparable across different tests measuring the same construct (Embretson & Reise, 2000). By contrast, total correct (or percent correct) scores on the PNT and the Boston Naming Test (Kaplan, Goodglass, & Weintraub, 1983) that are based on classical test theory cannot be directly compared, because the two tests are not equated for difficulty. Likewise, norm-referenced scores on aphasia tests such as the Comprehensive Aphasia Test (Swinburn, Porter, & Howard, 2004) and the Boston Diagnostic Aphasia Exam (Goodglass, Kaplan, & Barresi, 2001) cannot be directly compared, because of their dependence on the particular characteristics of their respective standardization samples. A second benefit of IRT models is that they represent the precision of score estimates as a sample-independent function of individual ability level (Embretson & Reise, 2000). This feature permits one to model the fact that an easy test given to a person with severe impairment provides a better score estimate than the same test given to a person with mild impairment. In classical test theory, score precision is expressed as a single average value that is dependent on the variability in the sample at hand. A third advantage of IRT models is that they provide scores with a stronger claim to interval, as opposed to ordinal, status (Embretson & Reise, 2000). Thus, a change of 1 on the latent-trait scale has the same meaning (in terms of the log odds of a correct response) regardless of where on the scale it occurs. By contrast, the difference in PNT raw scores between one and six items correct has very different implications than the difference between 75 and 80 items correct. IRT has been successfully applied to confrontation naming in aphasia by del Toro et al. (2011). They used IRT to develop a short form of the Boston Naming Test for people with stroke-induced aphasia and compared it to short forms developed on the basis of data from individuals with dementia (Graves, Bezeau, Fogarty, & Blair, 2004; Mack, Freed, Williams, & Henderson, 1992). One of the main goals of their study was to quantify the amount of statistical information captured by each of the short forms. Statistical information in an IRT framework is inversely related to measurement precision, and importantly, information can be quantified as a function of the level of ability. This framework allowed the authors to (a) place the different short forms on the same ability scale, (b) identify the range of ability for which each short form was informative, and (c) identify ability ranges for which each form maximized precision. Such a task would not have been possible using classical test theory, under which the notion of information as a function of ability level does not exist. One of the recommendations del Toro et al. (2011) gave for future research was comparison of short forms with computerized adaptive applications of the Boston Naming Test. In addition to providing practical benefits to test users, IRT modeling can also support investigation of test validity. In the current conceptualization of validity theory reflected in the Standards for Educational and Psychological Testing (American Educational Research Association [AERA], American Psychological Association [APA],

& National Council on Measurement in Education [NCME], 1999), there are different sources of validity evidence that yield information relevant to a specific intended interpretation of test scores (Lissitz, 2009; Mislevy, 2006; Zumbo, 2007). Because the interpretation of the PNT scores depends on premises about the cognitive processes used by test takers, one source of construct-related validity evidence is the use of cognitive theory to evaluate aspects of any proposed psychometric model that purportedly measures the latent trait of interest. As discussed earlier, in IRT models θj represents the unobserved measured construct. However, the construct that is captured by a measurement tool may not be the construct the test developer had in mind when developing the test. For that reason, the former is also referred to as the enacted construct, and the latter as the intended construct (Gorin & Embretson, 2006). The extent to which the intended and the enacted constructs are aligned may provide clear evidence for answering the central question of construct validity, which is whether the processes measured by items are those intended by the test designer.

Item difficulty modeling is one approach to evaluating the alignment of the intended and enacted constructs and strengthening the substantive interpretation of a test (Embretson, 1998; Gorin, 2006; Gorin & Embretson, 2006). As discussed earlier, the cognitive processes that underlie confrontation naming are well described in the psychological literature. According to cognitive theory, the probability of naming an item correctly, given a certain trait level, depends on several lexical characteristics of a stimulus that affect the outcome of specific cognitive subprocesses necessary for accurate word retrieval and production. For example, word frequency (Dell, 1990; Kittredge, Dell, Verkuilen, & Schwartz, 2008), age of acquisition (Ellis & Morrison, 1998), and word length (Nickels & Howard, 2004) are associated with accurate word retrieval and production. This theoretical understanding of the intended construct and its relationship to other factors creates expectations about how test items should behave. Specifically, the estimated difficulty parameters that determine the probability of successful naming for a given latent-trait score should be a function of the factors that cognitive theory predicts will affect the difficulty of the items. For example, long, low-frequency words that are acquired later in life should be more difficult to name than short, high-frequency, early-acquired words. Thus, the quality of the model can be assessed by its ability to predict variance in item difficulty parameters.

The purposes of the current study were to (a) investigate whether the PNT could be adequately fitted to an appropriate IRT model, (b) evaluate the precision of the resulting person score (naming ability) and item difficulty estimates, and (c) provide a theoretical rationale for the interpretation of the resulting scores by relating explanatory variables to item difficulty. Specifically, we evaluated the assumptions of unidimensionality, equal discrimination, and local dependence, and also conducted tests of item-level model fit. Next, we computed the sample reliability for the item difficulty and person score estimates, and plotted the conditional standard error for the latter. Finally, we asked
whether three lexical variables (length, age of acquisition, and frequency) were significant predictors of item difficulty. Because the PNT items are relatively homogeneous—that is, they are all confrontation-naming items with targets that are concrete, relatively familiar common nouns—we predicted that the data would approximate the assumptions of unidimensionality and local independence. We had no a priori predictions regarding the assumption of equal discrimination or item-level fit. We expected that the precision of person ability and item difficulty estimates would be high. Finally, we hypothesized that length, age of acquisition, and lexical frequency would each account for a significant proportion of the variance in item difficulty.

Method

Participants

Archival PNT response data from 276 individuals were obtained from the Moss Aphasia Psycholinguistics Project Database (MAPPD; Mirman et al., 2010) on May 6, 2012, for use in the current study. In preparing the data for analysis, we excluded healthy control participants (n = 20) and participants with missing data in their first complete administration of the PNT (n = 5). In order to verify the validity of the data, we compared the final prepared data set to a similar data set that we had independently downloaded from the MAPPD in November 2011. Finally, we recoded a small number of responses (1.7% of the data), to which a lenient coding rule for participants with apraxia of speech had been applied, in order to make the response coding consistent for participants with and without apraxia of speech. A full description of the data preparation steps and a rationale for the recoding of leniently scored responses are provided in the online supplemental materials (see Supplemental Text A).

All 251 participants included in the current analyses were people living in the community who had experienced a left-hemisphere stroke resulting in aphasia, were right-handed native English speakers, and had no comorbid neurologic illness or history of psychiatric illness (Mirman et al., 2010). Demographic data and descriptive statistics are provided in Table 1. Of note, the current data set includes the 150 participants used by Walker and Schwartz (2012) in their initial development and validation of the static PNT short forms.

Table 1. Demographic and clinical characteristics of the participants.

Characteristic                                  Value
Ethnicity
  African American                              34%
  Asian                                         0.4%
  Hispanic                                      1.2%
  White                                         44%
  Missing                                       20%
Education (years)
  M                                             13.6
  SD                                            2.8
  Minimum                                       7
  Maximum                                       21
  Missing                                       20%
Age (years)
  M                                             58.8
  SD                                            13.2
  Minimum                                       22
  Maximum                                       86
  Missing                                       20%
Months since aphasia onset
  M                                             32.9
  SD                                            51.0
  Minimum                                       1
  Maximum                                       381
  Missing                                       20%
Western Aphasia Battery Aphasia Quotientᵃ
  M                                             73.4
  SD                                            16.6
  Minimum                                       27.2
  Maximum                                       97.8
  Missing                                       51%
Philadelphia Naming Test (% correct)
  M                                             61%
  SD                                            28%
  Minimum                                       1%
  Maximum                                       98%

ᵃKertesz, 2007.

Dimensionality Assessment

To assess the dimensionality of the PNT, we fitted the dichotomous (correct vs. incorrect) response data for all 251 participants to a unidimensional confirmatory item-level factor model using NOHARM 4.0 (Fraser & McDonald, 1988). We examined three indices of model fit: the root-mean-square of residuals, Tanaka goodness of fit, and an approximate chi-square statistic (χ²GD; De Champlain & Gessaroli, 1998). The root-mean-square of residuals is an indicator of the size of the average residual item covariances, with smaller values indicating better overall fit; a value equal to 4 divided by the square root of the sample size is taken as the criterion below which fit is considered acceptable (de Ayala, 2009). The Tanaka goodness of fit indexes the residual item variances, with values ≥ .90 indicating acceptable fit and values ≥ .95 indicating good fit (McDonald, 1999). The approximate chi-square statistic tests the null hypothesis that the residual interitem correlations are equal to zero and is evaluated by a significance test where p values > .05 are taken to indicate adequate fit (De Champlain & Gessaroli, 1998).

IRT Model Assessment

We tested the 1-PL model assumption of equal item discrimination using the R package ltm version 0.9-9 (Rizopoulos, 2006). We fitted both a 1-PL and a 2-PL model and examined four indicators of relative model fit. We first conducted a likelihood ratio (ΔG²) significance test of the difference in overall model fit between the 1-PL and 2-PL models (de Ayala, 2009). For this test, a significant result indicates that permitting discrimination to vary by item improves model fit. A nonsignificant result, on the other hand, indicates that the simpler 1-PL model should be retained.
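To make the likelihood-ratio comparison concrete, a minimal Python sketch is given below. The deviance values, the function name, and the parameter counts are illustrative assumptions rather than output from the models fitted in this study.

```python
from scipy.stats import chi2

def likelihood_ratio_test(deviance_1pl, deviance_2pl, n_items):
    """Compare nested 1-PL and 2-PL models fitted to the same responses.

    The 1-PL model estimates one difficulty per item plus a single common
    discrimination; the 2-PL model adds (n_items - 1) free discrimination
    parameters, which gives the degrees of freedom for the test.
    """
    delta_g2 = deviance_1pl - deviance_2pl   # difference in -2 log-likelihood
    df = n_items - 1                         # extra parameters in the 2-PL model
    p_value = chi2.sf(delta_g2, df)          # upper-tail chi-square probability
    return delta_g2, df, p_value

# Hypothetical deviances for a 175-item test (values are made up for illustration).
delta_g2, df, p = likelihood_ratio_test(40000.0, 39600.0, n_items=175)
print(f"Delta G2 = {delta_g2:.1f}, df = {df}, p = {p:.4g}")
```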




It is often the case that models with more parameters demonstrate significantly better model fit, even when the improvement in model fit may not be practically meaningful (de Ayala, 2009). To examine the magnitude of difference in model fit, we used an R²Δ statistic, which is calculated as the relative reduction in G² caused by fitting a more complex model and indicates the increase in proportion of variance accounted for by the 2-PL model relative to the 1-PL model (de Ayala, 2009). We also evaluated the Akaike information criterion (AIC), which takes into account model complexity in addition to goodness of fit, and the Bayesian information criterion (BIC), which is similar to the AIC but carries a larger penalty for model complexity (de Ayala, 2009).

We tested the assumption of local independence using Yen’s (1984) Q3 statistic, which is based on interitem residual correlations.

Our final evaluation of model fit was conducted using item-level information-weighted (infit) and outlier-sensitive mean-square (outfit) and standardized fit statistics (Smith, 1991), which are based on the squared standardized differences between the model expectations and the observed responses. High values of these statistics indicate the presence of unexpected responses or response strings and thus identify underfitting items. Low values indicate responses and response strings that are overly predictable and thus identify items that are potentially more discriminating than the model suggests. Indeed, mean-square fit values typically correlate highly with 2-PL model discrimination, with underfitting items tending to have low discrimination estimates (Wright, 1996). Whereas Wright and Linacre (1994) have proposed cutoff values for the infit and outfit mean-squares that are frequently used to drive item reduction under the Rasch model, we chose not to use this strategy for three primary reasons. First, simulation studies have indicated that the distributional properties of these statistics are influenced by a number of factors including sample size, test length, information weighting, and item difficulty, meaning that application of a single cutoff value will lead to inconsistent type I and type II error rates (Smith, Schumacker, & Bush, 1998; Wang & Chen, 2005). Second, although the present sample size is adequate for estimating item difficulty and overall evaluation of model fit, it is only minimally adequate for estimating unique item discrimination values. Given the strong relationship between item discrimination and item fit to the 1-PL (or Rasch) model, we are thus reluctant to draw strong conclusions about the fit of particular items. Finally, item exclusion on the basis of infit or outfit statistics is an iterative process that, when conducted on a single data set of the present modest size, has high potential to capitalize on chance variation in the data, leading to reduced predictive utility. Instead of applying fixed cutoffs, we—following Massof (2011)—compared the distributions of the infit and outfit statistics to their expected distributions: chi-square divided by its degrees of freedom for the mean-square and standard normal for the standardized statistic. We also adopted a conservative strategy with respect to item exclusion.
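As an illustration of how these diagnostics can be computed, the following Python sketch applies the standard formulas for Yen's Q3 and the infit/outfit mean-squares to a dichotomous response matrix. It is a simplified sketch under assumed 1-PL parameters, not the implementation used for the analyses reported here.

```python
import numpy as np

def one_pl_prob(theta, delta, a=1.0):
    """Expected probability of a correct response under the 1-PL model."""
    return 1.0 / (1.0 + np.exp(-a * (theta[:, None] - delta[None, :])))

def item_fit_statistics(responses, theta, delta, a=1.0):
    """Return infit/outfit mean-squares per item and Yen's Q3 matrix.

    responses : (n_persons, n_items) array of 0/1 scores
    theta     : (n_persons,) ability estimates
    delta     : (n_items,) difficulty estimates
    """
    p = one_pl_prob(theta, delta, a)
    resid = responses - p                    # raw residuals
    var = p * (1.0 - p)                      # model variance of each response
    z2 = resid**2 / var                      # squared standardized residuals

    outfit = z2.mean(axis=0)                 # unweighted (outlier-sensitive) mean-square
    infit = (resid**2).sum(axis=0) / var.sum(axis=0)  # information-weighted mean-square
    q3 = np.corrcoef(resid, rowvar=False)    # inter-item residual correlations (Yen's Q3)
    return infit, outfit, q3

# Toy example with simulated data (sizes and parameter values are illustrative only).
rng = np.random.default_rng(0)
theta = rng.normal(0.0, 1.0, size=251)
delta = rng.normal(-0.5, 0.75, size=175)
x = (rng.random((251, 175)) < one_pl_prob(theta, delta, a=1.26)).astype(float)
infit, outfit, q3 = item_fit_statistics(x, theta, delta, a=1.26)
print(infit.mean(), outfit.mean())           # both should be close to 1.0 for well-fitting data
```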

Precision of Naming Ability Estimates

The item information function quantifies the degree to which an item or set of items decreases the uncertainty in an ability estimate. Item information for the 1-PL model is computed as Ij(θ) = a²Pj(1 − Pj), and information is summed across items to represent the certainty of an estimate provided by a test or set of items. The inverse square root of the total information function provides the conditional standard error of measurement for a test. The average reliability for a sample can be computed as r = 1 − σ²ε/σ²θ, where σ²θ is the total variance of the ability estimates and σ²ε is their average error variance. Thus, under typical IRT scaling conventions, where the person scores are scaled to M = 0 and SD = 1, conditional standard errors ≤ 0.31 are associated with reliability coefficients ≥ .9. A sample reliability coefficient for item difficulty estimates may be computed in a similar manner.
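The sketch below turns these definitions into code: it computes the test information function, the conditional standard error of measurement, and the sample reliability for the 1-PL model. The item difficulties and discrimination used in the example are placeholders, not the calibrated PNT values.

```python
import numpy as np

def test_information(theta, delta, a=1.0):
    """Total information at ability theta for items with difficulties delta (1-PL)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - delta)))   # P_j for each item
    return np.sum(a**2 * p * (1.0 - p))               # I(theta) = sum of a^2 * P * (1 - P)

def conditional_sem(theta, delta, a=1.0):
    """Conditional standard error of measurement: 1 / sqrt(total information)."""
    return 1.0 / np.sqrt(test_information(theta, delta, a))

def sample_reliability(theta_hat, sem):
    """r = 1 - (average error variance) / (total variance of the ability estimates)."""
    return 1.0 - np.mean(sem**2) / np.var(theta_hat)

# Illustrative use with placeholder difficulties for a 175-item test.
delta = np.linspace(-2.5, 1.5, 175)
print(conditional_sem(theta=0.0, delta=delta, a=1.26))  # SEM for a person of average ability
```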

Item Difficulty Modeling

Three explanatory variables were prepared and entered as predictors of item difficulty: (a) length in number of phonemes, (b) rated age of acquisition in years (Kuperman, Stadthagen-Gonzalez, & Brysbaert, 2012), and (c) Lg10CD frequency norms from the SUBTLEX-US corpus (Brysbaert & New, 2009). Lg10CD is estimated using a corpus of subtitles on the basis of 8,388 films and television episodes that included 51.0 million words. It is a function of the number of films in which the word appeared, that is, its contextual diversity (CD). Values for each of the PNT items on these variables can be found in the online supplemental materials (see Supplemental Text B).
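A regression of this kind can be set up in a few lines; the following sketch (using the statsmodels package, with hypothetical input arrays) standardizes the predictors, fits the linear model, and recovers each predictor's squared semipartial correlation as the drop in R² when that predictor is removed. The function and variable names are illustrative, not part of the original analysis code.

```python
import numpy as np
import statsmodels.api as sm

def difficulty_regression(difficulty, phonemes, aoa, lg10cd):
    """Regress 1-PL item difficulties on length, age of acquisition, and Lg10CD.

    All inputs are 1-D arrays of length n_items. Predictors and outcome are
    standardized so the coefficients can be read as standardized betas.
    """
    X = np.column_stack([phonemes, aoa, lg10cd])
    Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yz = (difficulty - difficulty.mean()) / difficulty.std(ddof=1)
    model = sm.OLS(yz, sm.add_constant(Xz)).fit()

    # Squared semipartial correlation for each predictor: drop in R^2 when it is removed.
    semipartial_sq = []
    for k in range(Xz.shape[1]):
        reduced = sm.OLS(yz, sm.add_constant(np.delete(Xz, k, axis=1))).fit()
        semipartial_sq.append(model.rsquared - reduced.rsquared)
    return model, semipartial_sq
```

With the observed PNT values for these variables (see Supplemental Text B), a model of this kind is what yields the standardized coefficients and unique variance estimates of the sort reported in the Results section.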

Analysis and Results

Dimensionality

Results of the confirmatory factor analysis suggested that the one-factor model demonstrated good fit, root-mean-square of residuals = .0101 vs. criterion = .2525 (i.e., 4/√251), Tanaka goodness of fit = .9840, χ²GD (df = 15050) = 8182.055, p > .99. The M and SD of the residual correlations were 0.0003 and 0.046, respectively (minimum = −0.22, maximum = 0.18), with 3.2% taking absolute values > 0.1. From these results, we concluded that the data satisfied the assumption of unidimensionality.

IRT Model Assessment

IRT model-fit assessment revealed that the data showed significantly better fit to the 2-PL model than to the 1-PL model, ΔG² = 384.26, df = 174, p < .001. At the same time, the R²Δ value (.0098) indicated that the 2-PL model provided an improvement in model fit of slightly less than 1%. The AIC favored the 2-PL model (AIC1-PL = 39,432, AIC2-PL = 39,395), whereas the BIC, which carries a larger penalty for model complexity, indicated better fit for the
1-PL model (BIC1-PL = 40,052, BIC2-PL = 40,629). Our interpretation of these results is that the 2-PL model showed better fit, but that the improvement relative to the 1-PL model was small.

Evaluation of local independence using the Q3 statistic after fitting the 1-PL model suggested that the data approximated this assumption. Less than 5% of item pairs obtained residual correlations greater than two SDs from the mean.

Plots of the item-fit statistics for the 1-PL model, presented in Figure 1, suggested that there were differences between the observed and expected distributions. The Kolmogorov–Smirnov test found significant differences between the information-weighted (p = .028) and outlier-sensitive (p < .001) mean-squares and the expected chi-square distribution divided by its degrees of freedom, and marginally significant differences between the information-weighted (p = .062) and outlier-sensitive (p = .051) z-standardized values and the expected normal distribution. The individual item-fit values are provided in the online supplemental materials (see Supplemental Text B). Plots
of the modeled and empirical item characteristic curves for examples of underfitting (with elevated mean-square and standardized fit values), overfitting (with low fit values), and well-fitting (with values close to expectation) items are provided in Figure 2. Because the improvement in fit conferred by the 2-PL model was relatively small and the sample size was not large enough for stable estimation of item discrimination, we proceeded with the 1-PL model without excluding any items. We discuss the rationale for this decision in more detail later.
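As a back-of-the-envelope check on how the relative-fit indices reported above relate to one another, the sketch below recovers the deviances implied by the reported AIC values (assuming AIC = −2LL + 2k and treating G² as −2LL, with k counting item difficulties plus discrimination parameters) and reproduces the reported ΔG², R²Δ, and BIC values to a close approximation. It is illustrative arithmetic, not output from the fitted models.

```python
import numpy as np

n_persons = 251
k_1pl, k_2pl = 175 + 1, 175 * 2        # assumed free parameter counts for each model
aic_1pl, aic_2pl = 39432.0, 39395.0    # AIC values as reported above

dev_1pl = aic_1pl - 2 * k_1pl          # deviance implied by AIC = -2LL + 2k
dev_2pl = aic_2pl - 2 * k_2pl
delta_g2 = dev_1pl - dev_2pl           # ~385, close to the reported 384.26
r2_delta = delta_g2 / dev_1pl          # ~.0098, matching the reported value
bic_1pl = dev_1pl + k_1pl * np.log(n_persons)   # ~40,052
bic_2pl = dev_2pl + k_2pl * np.log(n_persons)   # ~40,629
print(round(delta_g2), round(r2_delta, 4), round(bic_1pl), round(bic_2pl))
```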

Figure 1. Relative frequency plots of one-parameter logistic model item-fit statistics and their expected distributions. The histogram bars in each plot represent the observed fit values, and the curves represent the expected distributions. For the upper plots, the expectation is a chi-square divided by its degrees of freedom (df = 250); for the lower plots, it is a standard normal distribution.

Figure 2. Modeled (solid) and empirical (dotted) item characteristic curves and item difficulty estimates (δ) for examples of underfitting, well-fitting, and overfitting items. The shaded areas represent 95% confidence intervals about the empirical curves.

Precision of Naming Ability and Item Difficulty Estimates

The M (SD) of the ability estimates was 0.10 (1.44), and the sample reliability coefficient was .98. The conditional standard error of measurement ranged from 0.13 to 0.48, and the curve for standard error of measurement is presented in Figure 3. The M (SD) of the item difficulty
estimates was −0.45 (0.75), and the sample item reliability coefficient was .97. The common item discrimination estimate was 1.258 (95% confidence interval [1.210, 1.306]). The item–person map in Figure 4 displays the distributions of item difficulty and person ability estimates relative to one another, with selected examples of item content. The complete set of item difficulty estimates is provided in the online supplemental materials (see Supplemental Text B).

Figure 3. Plot of the conditional standard error curve for naming ability estimates provided by the PNT. The 95% confidence interval (CI) around any θj can be constructed based on the curve and the following equation: CI(θj) = θj ± 1.96 × SEM(θj). For example, the 95% CI for θj = 0 is [−.14, .14].

Item Difficulty Modeling

Table 2. Correlations, means, and standard deviations for item difficulty and explanatory variables.

Variable                  1        2        3        4
1. Item difficulty        —
2. Number of phonemes     .63      —
3. Lg10CD                 −.62     −.50     —
4. Age of acquisition     .64      .38      −.50     —
M                         −0.45    4.49     2.71     4.90
SD                        0.75     1.77     0.53     1.33

Note. All correlations are significant at the .001 level.

The means, standard deviations, and correlations for item difficulty and the explanatory variables are presented in Table 2. All variables were significantly correlated with item difficulty. Age of acquisition and number of phonemes were strongly related to Lg10CD but less strongly to each other. The explanatory variables significantly predicted the item difficulty parameters (see Table 3). Age of acquisition
and number of phonemes were the strongest predictors, β = .379, t(170) = 6.98, p < .001, and β = .361, t(170) = 6.60, p < .001, respectively. The set of predictors also explained a substantial proportion of variance in item difficulty parameters, R² = .63, adjusted R² = .62, F(3, 171) = 96.94, p < .001. In addition, both variables had a significant unique contribution to the explanatory power of the linear model on the basis of their respective squared semipartial correlation coefficients (10% and 11%, respectively; see Table 3). The unique proportion of variance shared between Lg10CD and item difficulty was approximately 4%.

Figure 4. Map of items and persons showing selected item content. The figure orders the difficulty of the items on the left side and the level of naming ability of the participants on the right side. Items at the bottom of the scale are easiest to name. Participants with the least naming ability are at the bottom of the scale and are expected to have difficulty even with the easiest items.
Table 3. Summary of regression analyses for variables predicting item difficulty.

                                                               Correlations
Variable               B        SD (B)    β          t        Zero-order   Partial   Semipartial
Constant               −1.22    0.35                 −3.50
Phonemes               0.15     0.02      .36***     6.60     .63          .45       .31
Lg10CD                 −0.36    0.08      −.25***    −4.36    −.62         −.32      −.20
Age of acquisition     0.21     0.03      .38***     6.98     .64          .47       .33

***p < .001.

Discussion

The purpose of this study was to evaluate the potential utility of employing IRT techniques to analyze data from a popular naming test, the PNT (Roach et al., 1996), for quantifying the severity of word-finding difficulties. To do this, we used data obtained from 251 participants with aphasia archived in the MAPPD online database (Mirman et al., 2010). First, we evaluated whether PNT response data met the following assumptions: (a) The PNT items are related to a common ability (naming), (b) the PNT items are related to that ability with equal strength (equal item discriminations), and (c) naming responses on the PNT are independent of each other, conditional on examinee ability. We examined overall fit of the 1-PL and 2-PL models and item-level fit of the former in a calibration sample composed of the 251 participants from the MAPPD data set.

The analyses suggest that the PNT closely approximates the assumption of unidimensionality. Comparison of the overall fit of the 1-PL and 2-PL models suggests that the 1-PL assumption of equal item discrimination was not tenable, but the relative improvement in fit associated with estimating unique item discriminations in the 2-PL model was minimal. Examination of information-weighted (infit) and unweighted (outlier-sensitive; outfit) mean-square and standardized item fit revealed local misfit that was consistent with the analyses of overall model fit. A number of items obtained fit values that were higher or lower than expected, suggesting the presence of responses or response strings that were poorly predicted by the 1-PL model. The most egregious misfit was demonstrated by the outlier-sensitive mean-square (see Figure 1, upper right panel). This statistic is elevated when individuals give responses that are unexpected, for example when a person with a high overall score answers an easy item incorrectly or a person with a low overall score answers a difficult item correctly. This statistic is sensitive to small numbers of very-off-target observations. By contrast, the information-weighted statistics are most sensitive to unexpected or overly predictable response strings occurring when item difficulty and person ability are more closely matched, and are less sensitive to individual responses. Elevated item-fit values are also typically associated with lower estimated discrimination, and depressed values are associated with higher discrimination. In this sense, the item-fit analyses are consistent with the tests of overall fit that favored the 2-PL model.

We chose to proceed with the 1-PL model despite the results of the fit analyses for a number of reasons. First, we note that the overall increase in variance accounted for

by the 2-PL model was small, < 1%. Second, in comparing the item-fit values with their expected distributions, the largest misfit was shown by the outlier-sensitive mean-square. As noted before, this statistic can be affected by a small number of unexpected observations occurring when item difficulty and person ability are poorly matched. This concern can be further alleviated with the use of IRT-based computer adaptive testing with the PNT that is currently being developed (Hula, Kellough, & Fergadiotis, 2015). Computer adaptive testing is designed to avoid these situations by targeting items to persons on the basis of their previous responses (Meijer & Nering, 1999). Third, even though the information-weighted mean-squares also revealed some misfit, the values were within ranges typically considered acceptable (Wright & Linacre, 1994), and we were reluctant to exclude items from a well-studied test solely on the basis of these statistics estimated from a relatively small sample (Chen et al., 2014).

We did examine some of the individual response strings in order to determine if there were obvious reasons for the observed misfit, with limited results. For items with elevated outlier-sensitive mean-squares, we found that relatively small numbers of unexpectedly correct or incorrect responses (e.g., three to five) were sufficient to provoke misfit. For the information-weighted statistics, the items with the highest values (wig, bowl, train) did seem to be either more susceptible to visual confusion (hair for wig, cup for bowl) or more visually complex (the stimulus for train shows two disconnected portions of the same train traveling along a track whose connection is not within the frame of the drawing). If the poor fit of these items to the 1-PL model is replicated in future samples, it will be useful to consider the relative advantages of dropping them from the test versus generalizing the measurement model to account for their lower discrimination. In any case, the presence of unexpected responses is consistent with the frequent finding (and common clinical observation) of high intraindividual trial-to-trial variability in language performance by persons with aphasia (Freed, Marshall, & Chuhlantseff, 1996; Howard, Patterson, Franklin, Morton, & Orchard-Lisle, 1984; Kolk, 2007). The use of IRT methods to study within-person variability in aphasic language performance is a potentially productive area for future research.

Consistent with previous findings (Walker & Schwartz, 2012), the reliability of the score estimates provided by the PNT (.98) was excellent. Although the PNT items were not
optimally targeted to the present sample, as indicated by the difference between the mean item difficulty and person ability (−0.45 vs. 0.10), the large number of items in the PNT provided for good precision across the range of ability estimates. Likewise, the present sample of 251 participants with aphasia provided precise estimates of item difficulty, as indicated by the item reliability coefficient of .97. In a companion article (Hula, Kellough, & Fergadiotis, 2015), we report the results of a simulation study investigating the potential for IRT-based computer adaptive testing procedures to reduce the length of the PNT while minimizing the loss of measurement precision. The high precision of score estimates provided by the full PNT and the excellent reliability of the present item difficulty estimates make those procedures well suited for this purpose. The item difficulty estimates can also be used to construct probabilistic criterion-referenced interpretations of individual score estimates. For example, one can use the 1-PL model equation (Equation 1) to estimate that a person with a naming ability score of 0 on the present scale has an approximately 50% chance of responding correctly to items with difficulty values close to 0 (seal, strawberries, waterfall, pumpkin, mountain) and an approximately 78% chance of responding correctly to items with difficulty values close to −1 (pencil, clock, chair, snake, fork). If these results prove sufficiently replicable and generalizable, such interpretations have clear practical benefits for both clinicians and researchers with respect to stimulus selection for assessment and treatment. The common scaling of persons and items afforded by IRT models also facilitates the potential cross-calibration of different naming tests to a common scale. We discuss this possibility further below. The regression analysis conducted in this study provides important information regarding the substantive meaning of the construct underlying confrontation-naming items on the PNT. Naming ability can be conceptualized as a composite skill that represents a constellation of several latent cognitive processes that are activated for word access and retrieval. The results provide an explanatory model that gives meaning to the construct of confrontation naming, including a list of construct-relevant item features that could be used for the development of future test items. First, as predicted by models of word retrieval, a substantial portion of variance in item difficulty can be accounted for by number of phonemes in the target word and age of acquisition. This was evident in both the magnitude of the zero-order correlations between the aforementioned variables and item difficulty and their significant unique contributions to the overall explanatory power of the regression model. The number of phonemes is directly related to the demands placed on the phonological assembly during Step 2 of Dell’s (1986) model. With respect to age of acquisition, the analysis in this study does not help pinpoint the locus of the effect of this variable within Dell’s model. Nevertheless, our findings are in agreement with studies that suggest that age of acquisition affects, at least in part, distinct processes such as word retrieval or prelexical conceptual steps (Kittredge et al., 2008). These results combined suggest that


the vector of the IRT item difficulties reflects the sum of the demands that each item places on the cognitive subsystems that serve word production.
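The probabilistic, criterion-referenced interpretations offered earlier in this section follow directly from Equation 1 with the common discrimination estimate of 1.258; the short sketch below simply reproduces the approximately 50% and 78% values quoted above.

```python
import math

def p_correct(theta, delta, a=1.258):
    """Probability of a correct response under the 1-PL model (Equation 1)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - delta)))

# A person with naming ability 0 facing items of difficulty 0 and -1:
print(round(p_correct(0.0, 0.0), 2))   # 0.50 -> ~50% for items such as seal or pumpkin
print(round(p_correct(0.0, -1.0), 2))  # 0.78 -> ~78% for items such as pencil or fork
```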

Future Directions

Consistent with the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999), Gorin (2007) defined validity as “the extent to which test scores provide answers to targeted questions” (p. 456). Following from this view of validity, the focus of this study was on justifying the use of the PNT as an indicator of anomia severity, which is a clinically useful property. However, there are planned extensions to this work that will have broader significance. One such extension will involve using IRT methods to calibrate scores for all of the commonly used naming tests for aphasia to a common scale. This will permit direct comparison of scores obtained using different tests, with obvious benefits for both clinical practice and research. Furthermore, because of the statistical properties of IRT models (and the 1-PL and Rasch models in particular), the item calibrations, unlike traditional test norms, are not dependent on the ability distribution of the calibration sample (Embretson & Reise, 2000). In theory, the same calibrations can be obtained from separate groups of participants with differing levels of anomia severity. Following this line of reasoning, it becomes possible to think of constructing a scale of naming ability that is independent of any particular naming test and any particular set of sample norms, much as the scale of temperature depends on neither any particular thermometer nor any particular set of temperature measurements (Bond & Fox, 2007).

A further extension of this particular line of research will build upon the regression results presented in this article. A strong relationship between item difficulty and cognitive features was established, and approximately two thirds of the variance in item difficulty was explained on the basis of three readily available, theoretically cogent variables. The three variables used here constitute an initial list to consider as potential sources of item difficulty in construction of word-finding items. Further, it is important to note the potential for further improving the predictive model by developing and utilizing variables that tap into semantic processing, which is not directly reflected in the variables used in this study. In addition to its implications for construct validity, the regression model can form the basis for generating additional confrontation-naming items and calibrating them to the established scale. Such algorithmic item generation has already been successful in several other content domains (Bennett, 1999; Gierl & Lai, 2012), and in theory it could open up the entire set of picturable nouns to use for both norm-referenced and criterion-referenced naming assessment.

A second critical future direction for this line of research involves developing a multidimensional diagnostic measurement model (e.g., Embretson, 1998) motivated by current cognitive models of normal and aphasic naming
(e.g., Dell, 1986). Even though scores derived from unidimensional models are useful in scaling and ordering persons along a severity continuum, these proficiency scores contain limited diagnostic information necessary for the identification of those persons’ specific strengths and weaknesses. A test with reference to a multidimensional model informed by current cognitive theory could provide useful information about the processes underlying naming performance and individual differences among persons with aphasia relevant for clinicians and researchers alike. Given that such models require large sample sizes to produce stable parameter estimates (e.g., Sheng, 2010), such an exploration would be premature at this point. However, clinicians who are interested in delineating the combination of abilities that individuals use can draw from the relevant literature that is available at this point and that is not contingent upon the IRT modeling presented here. For example, if clinicians are interested in using the PNT to create profiles of clients, in addition to obtaining a severity score, then they could utilize the instructions from Roach et al. (1996) to build error distributions on the basis of which to draw inferences about the underlying cognitive deficits of people with aphasia.

Limitations

Despite the utility and relatively large size (at least in terms of aphasia research) of the archival data set that we used in this study, there are a few notable limitations of the present work associated with it. First, because the data were primarily collected in a single metropolitan region (Philadelphia, PA), they are not demographically representative of the population of individuals with aphasia in the United States. For example, African Americans appear to be overrepresented in the sample, and other ethnic groups, especially Whites, are likely underrepresented. These conclusions about the demographic composition of the sample are necessarily approximate, because descriptive data are missing for many cases.

A second limitation is related to the archival nature of the data set and concerns the potential conflation of phonemic paraphasias and apraxic speech errors. It is likely that some observations of incorrect naming responses are attributable to apraxic speech errors rather than aphasic errors of language processing. From a perspective focused on assessment of the severity of naming impairment in aphasia (with or without concomitant apraxia of speech), this may be considered a secondary issue that does not negate the utility of the present analysis. From another perspective, one focused on diagnostic assessment of the cognitive processes underlying naming performance, it is an important issue that must be addressed. However, the present data set is not suitable for this purpose, given that both the received diagnostic criteria for apraxia of speech and the lenient scoring rule applied to cases with apraxia were evolving over the same time period that the data were collected. The disambiguation of phonemic and apraxic errors will be an important topic for future studies, especially as
we progress to consider measurement models that predict not only whether a response is correct or incorrect, but also error type. A third limitation concerns the size of the sample. As we noted earlier, the current sample of 251 participants was large enough to support stable estimation of item difficulty parameters and global evaluation of model fit. However, it was not large enough to permit firm conclusions about the fit of particular items to the 1-PL (or, equivalently, to the Rasch) model nor reliably estimate item discriminations under the 2-PL model. It will be important to evaluate the replicability of the item-fit results reported here in a larger and more representative sample so that the most appropriate decisions about model choice and item selection can be made. A final limitation concerns the generality of the unidimensional 1-PL model for different subgroups of persons with aphasia. The present analysis explicitly assumes that the item difficulty estimates are invariant for all demographic and clinical subpopulations. Thus, the model makes the same predictions about performance regardless of whether a given participant has primarily semantic naming impairment, phonological impairment, or both. The planned multidimensional model extension discussed earlier is one way of addressing this issue for these clinical subgroups. An alternative approach that can be applied to demographic subgroupings involves tests of differential item functioning across targeted subgroups. Testing the assumption of measurement invariance is a critical part of model and test validation that is beyond the scope of the present investigation but should be addressed in future studies.

Acknowledgments This research was supported by VA Rehabilitation Research and Development Career Development Award C7476W (awarded to William Hula) and the VA Pittsburgh Healthcare System Geriatric Research Education and Clinical Center. The authors would like to acknowledge the helpful assistance of Daniel Mirman and Myrna Schwartz. The contents of this article do not represent the views of the Department of Veterans Affairs or the United States government.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association. Andrich, D. (2004). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(Suppl. 1), I7–I16. Baylor, C., Hula, W., Donovan, N. J., Doyle, P. J., Kendall, D., & Yorkston, K. (2011). An introduction to item response theory and Rasch models for speech-language pathologists. American Journal of Speech-Language Pathology, 20, 243–259. Bennett, R. E. (1999). Using new technology to improve assessment. Educational Measurement: Issues and Practice, 18(3), 5–12.




Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). Mahwah, NJ: Erlbaum.
Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41, 977–990.
Caramazza, A. (1997). How many levels of processing are there in lexical access? Cognitive Neuropsychology, 14, 177–208.
Carragher, M., Conroy, P., Sage, K., & Wilkinson, R. (2012). Can impairment-focused therapy change the everyday conversations of people with aphasia? A review of the literature and future directions. Aphasiology, 26, 895–916.
Chen, W.-H., Lenderking, W., Jin, Y., Wyrwich, K. W., Gelhorn, H., & Revicki, D. A. (2014). Is Rasch model analysis applicable in small sample size pilot studies for assessing item characteristics? An example using PROMIS pain behavior item bank data. Quality of Life Research, 23, 485–493.
de Ayala, R. J. (2009). The theory and practice of item response theory. New York, NY: Guilford.
De Champlain, A., & Gessaroli, M. E. (1998). Assessing the dimensionality of item response matrices with small sample sizes and short test lengths. Applied Measurement in Education, 11, 231–253.
Dell, G. S. (1986). A spreading-activation theory of retrieval in sentence production. Psychological Review, 93, 283–321.
Dell, G. S. (1990). Effects of frequency and vocabulary type on phonological speech errors. Language and Cognitive Processes, 5, 313–349.
Dell, G. S., & O'Seaghdha, P. G. (1992). Stages of lexical access in language production. Cognition, 42, 287–314.
Dell, G. S., Schwartz, M. F., Martin, N., Saffran, E. M., & Gagnon, D. A. (1997). Lexical access in aphasic and nonaphasic speakers. Psychological Review, 104, 801–838.
del Toro, C. M., Bislick, L. P., Comer, M., Velozo, C., Romero, S., Gonzalez Rothi, L. J., & Kendall, D. L. (2011). Development of a short form of the Boston Naming Test for individuals with aphasia. Journal of Speech, Language, and Hearing Research, 54, 1089–1100.
Eadie, T. L., Yorkston, K. M., Klasner, E. R., Dudgeon, B. J., Deitz, J. C., Baylor, C. R., . . . Amtmann, D. (2006). Measuring communicative participation: A review of self-report instruments in speech-language pathology. American Journal of Speech-Language Pathology, 15, 307–320.
Ellis, A. W., & Morrison, C. M. (1998). Real age-of-acquisition effects in lexical retrieval. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24, 515–523.
Embretson, S. E. (1998). A cognitive design system approach to generating valid tests: Application to abstract reasoning. Psychological Methods, 3, 380–396.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Foygel, D., & Dell, G. S. (2000). Models of impaired lexical access in speech production. Journal of Memory and Language, 43, 182–216.
Francis, W. N., & Kučera, H. (1982). Frequency analysis of English usage: Lexicon and grammar. Boston, MA: Houghton Mifflin.
Fraser, C., & McDonald, R. P. (1988). NOHARM: Least squares item factor analysis. Multivariate Behavioral Research, 23, 267–269.
Freed, D. B., Marshall, R. C., & Chuhlantseff, E. A. (1996). Picture naming variability: A methodological consideration of inconsistent naming responses in fluent and nonfluent aphasia. Clinical Aphasiology, 24, 193–205.
Gierl, M. J., & Lai, H. (2012). The role of item models in automatic item generation. International Journal of Testing, 12, 273–298.
Goodglass, H., Kaplan, E., & Barresi, B. (2001). The assessment of aphasia and related disorders. Philadelphia, PA: Lippincott, Williams & Wilkins.
Goodglass, H., & Wingfield, A. (Eds.). (1997). Anomia: Neuroanatomical and cognitive correlates. San Diego, CA: Academic Press.
Gorin, J. S. (2006). Test design with cognition in mind. Educational Measurement: Issues and Practice, 25(4), 21–35.
Gorin, J. S. (2007). Reconsidering issues in validity theory. Educational Researcher, 36, 456–462. doi:10.3102/0013189X07311607
Gorin, J. S., & Embretson, S. E. (2006). Item difficulty modeling of paragraph comprehension items. Applied Psychological Measurement, 30, 394–411.
Graves, R. E., Bezeau, S. C., Fogarty, J., & Blair, R. (2004). Boston Naming Test short forms: A comparison of previous forms with new item response theory based forms. Journal of Clinical and Experimental Neuropsychology, 26, 891–902.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer-Nijhoff.
Herbert, R., Hickin, J., Howard, D., Osborne, F., & Best, W. (2008). Do picture-naming tests provide a valid assessment of lexical retrieval in conversation in aphasia? Aphasiology, 22, 184–203.
Howard, D., Patterson, K., Franklin, S., Morton, J., & Orchard-Lisle, V. (1984). Variability and consistency in picture naming by aphasic patients. Advances in Neurology, 42, 263–276.
Hula, W. D., Kellough, S., & Fergadiotis, G. (2015). Development and simulation testing of a computerized adaptive version of the Philadelphia Naming Test. Journal of Speech, Language, and Hearing Research, 58, 878–890.
Jette, A. M., Haley, S. M., Tao, W., Ni, P., Moed, R., Meyers, D., & Zurek, M. (2007). Prospective evaluation of the AM-PAC-CAT in outpatient rehabilitation settings. Physical Therapy, 87, 385–398.
Kaplan, E., Goodglass, H., & Weintraub, S. (1983). Boston Naming Test. Philadelphia, PA: Lea and Febiger.
Kertesz, A. (2007). Western Aphasia Battery–Revised. New York, NY: Grune & Stratton.
Kittredge, A. K., Dell, G. S., Verkuilen, J., & Schwartz, M. F. (2008). Where is the effect of frequency in word production? Insights from aphasic picture-naming errors. Cognitive Neuropsychology, 25, 463–492.
Kohn, S. E., & Goodglass, H. (1985). Picture-naming in aphasia. Brain and Language, 24, 266–283.
Kolk, H. (2007). Variability is the hallmark of aphasic behaviour: Grammatical behaviour is no exception. Brain and Language, 101, 99–102.
Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44, 978–990.
Levelt, W. J. M., Roelofs, A., & Meyer, A. S. (1999). A theory of lexical access in speech production. Behavioral and Brain Sciences, 22, 1–38.
Lissitz, R. W. (Ed.). (2009). The concept of validity: Revisions, new directions, and applications. Charlotte, NC: Information Age.
Lord, F. M., Novick, M. R., & Birnbaum, A. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Mack, W. J., Freed, D. M., Williams, B. W., & Henderson, V. W. (1992). Boston Naming Test: Shortened versions for use in Alzheimer's disease. Journal of Gerontology, 47, P154–P158.
Martin, N., Dell, G. S., Saffran, E. M., & Schwartz, M. F. (1994). Origins of paraphasias in deep dysphasia: Testing the consequences of a decay impairment to an interactive spreading activation model of lexical retrieval. Brain and Language, 47, 609–660.
Massof, R. W. (2011). Understanding Rasch and item response theory models: Applications to the estimation and validation of interval latent trait measures from responses to rating scale questionnaires. Ophthalmic Epidemiology, 18, 1–19.
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Erlbaum.
Meijer, R. R., & Nering, M. L. (1999). Computerized adaptive testing: Overview and introduction. Applied Psychological Measurement, 23, 187–194.
Mirman, D., Strauss, T. J., Brecher, A., Walker, G. M., Sobel, P., Dell, G. S., & Schwartz, M. F. (2010). A large, searchable, web-based database of aphasic performance on picture naming and other tests of cognitive function. Cognitive Neuropsychology, 27, 495–504.
Mislevy, R. J. (2006). Cognitive psychology and educational assessment. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 257–305). Westport, CT: American Council on Education/Praeger.
Nickels, L. (2002). Therapy for naming disorders: Revisiting, revising, and reviewing. Aphasiology, 16, 935–979.
Nickels, L., & Howard, D. (2004). Dissociating effects of number of phonemes, number of syllables, and syllabic complexity on word production in aphasia: It's the number of phonemes that counts. Cognitive Neuropsychology, 21, 57–78.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests (expanded ed.). Chicago, IL: University of Chicago Press.
Reeve, B. B., Hays, R. D., Bjorner, J. B., Cook, K. F., Crane, P. K., Teresi, J. A., . . . Cella, D. (2007). Psychometric evaluation and calibration of health-related quality of life item banks: Plans for the Patient-Reported Outcomes Measurement Information System (PROMIS). Medical Care, 45(5, Suppl. 1), S22–S31.
Rizopoulos, D. (2006). ltm: An R package for latent variable modeling and item response theory analyses. Journal of Statistical Software, 17(5), 1–25.
Roach, A., Schwartz, M. F., Martin, N., Grewal, R. S., & Brecher, A. (1996). The Philadelphia Naming Test: Scoring and rationale. Clinical Aphasiology, 24, 121–133.
Schuell, H., Jenkins, J. J., & Jimenez-Pabon, E. (1964). Aphasia in adults. New York, NY: Harper & Row.
Schwartz, M. F., Dell, G. S., Martin, N., Gahl, S., & Sobel, P. (2006). A case-series test of the interactive two-step model of lexical access: Evidence from picture naming. Journal of Memory and Language, 54, 228–264.
Sheng, Y. (2010). A sensitivity analysis of Gibbs sampling for 3PNO IRT models: Effects of prior specifications on parameter estimates. Behaviormetrika, 37(2), 87–110.
Smith, R. M. (1991). The distributional properties of Rasch item fit statistics. Educational and Psychological Measurement, 51, 541–565.
Smith, R. M., Schumacker, R. E., & Bush, M. J. (1998). Using item mean squares to evaluate fit to the Rasch model. Journal of Outcome Measurement, 2, 66–78.
Swinburn, K., Porter, G., & Howard, D. (2004). Comprehensive Aphasia Test. New York, NY: Psychology Press.
Walker, G. M., & Schwartz, M. F. (2012). Short-form Philadelphia Naming Test: Rationale and empirical evaluation. American Journal of Speech-Language Pathology, 21(Suppl.), S140–S153.
Wang, W.-C., & Chen, C.-T. (2005). Item parameter recovery, standard error estimates, and fit statistics of the WINSTEPS program for the family of Rasch models. Educational and Psychological Measurement, 65, 376–404.
Willmes, K. (1981). A new look at the Token Test using probabilistic test models. Neuropsychologia, 19, 631–646.
Wright, B. D. (1996). Comparing Rasch measurement and factor analysis. Structural Equation Modeling: A Multidisciplinary Journal, 3, 3–24.
Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370.
Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8, 125–145.
Zumbo, B. D. (2007). Validity: Foundational issues and statistical methodology. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics 26: Psychometrics (pp. 45–79). Amsterdam, the Netherlands: Elsevier Scientific.
