Psychological Reports, 1991, 69, 739-743. © Psychological Reports 1991
ITEM DIFFICULTY AND DISCRIMINATION AS A FUNCTION OF STEM COMPLETENESS¹

CLAUDIO VIOLATO
The University of Calgary

Summary.—The effects of stem completeness (complete or incomplete stem) on item difficulty and discrimination for multiple-choice items were studied experimentally. Subjects (166 junior education students) were classified into three achievement groups (low, medium, high), and one of two forms of a multiple-choice test was randomly assigned to each subject. A two-way factorial design (completeness × achievement) was used as the experimental model. Analysis indicated that stem completeness had no effect on either item discrimination or difficulty, and there was no interaction effect with achievement. It was concluded that multiple-choice items may be very robust in measuring knowledge in a subject area irrespective of variations in stem construction.
Although multiple-choice items continue to be very popular for measuring academic achievement (Aiken, 1987), their design and use have not been adequately investigated (Tollefson, 1987; Violato & Harasym, 1987). A great deal of further empirical work needs to be done on the construction and use of multiple-choice items (Aiken, 1987; Green, 1984; Violato & Marini, 1989). The effects that several general characteristics of multiple-choice items have on the difficulty and discrimination of the item continue to be of concern (e.g., Green, 1984). These include the number of options of the items, the number of correct responses, the use of inclusive alternatives (e.g., "all of the above"), the completeness of the stem (e.g., an open-ended phrase or a complete question), and the orientation of the stem (i.e., positive or negative). A number of studies have now been conducted investigating these factors; for recent reviews of these studies, see Aiken (1987) and Violato and Marini (1989). The effect of stem completeness on item difficulty and discrimination continues to be particularly troublesome. Testing experts (Gronlund & Linn, 1990; Oosterhof, 1990) generally advocate a closed-stem format based on the assumption that confusion and ambiguity will be minimized when the problem is presented complete in the stem. It is reasoned that an open stem makes the problem more difficult to comprehend than a closed one. The empirical evidence, however, does not unequivocally support this contention. Dunn and Goldstein (1959), for example, were unable to show
¹Requests for reprints should be sent to C. Violato, Ph.D., Department of Teacher Education and Supervision, University of Calgary, Calgary, Alberta, Canada T2N 1N4.
any systematic effects for type of stem format (open or closed). The findings of Dudycha and Carpenter (1973), however, did indicate significant effects. Employing 1,124 students in introductory psychology, they found that open stems were more difficult than closed ones but item discrimination was unaffected. Violato and Harasym (1987) reported similar findings with senior students in education, as did Violato and Marini (1989). These conflicting findings and suggestions may be a result of the "robustness" of multiple-choice items. McMorris, Brown, Snyder, and Pruzek (1972) have found that violating item-construction principles (such as using open vs closed stems) does not necessarily influence item difficulty and discrimination. Similarly, Green (1984) and Weiten (1984) have noted that "flaws" in the stem do not invariably affect item difficulty and discrimination. Aiken (1982) has also suggested that multiple-choice items are very "robust" with respect to variations in stem construction.

Rarely have empirical studies of item difficulty and discrimination as functions of stem format considered the possible interaction of testees' ability or achievement with stem format. It is possible that higher achieving testees might be less confused by incomplete stems than lower achieving testees. Reynolds (1979), for example, found it necessary to simplify multiple-choice items at least for test takers at the extreme end of the ability distribution (mildly mentally retarded) to minimize confusion. In a review of factors which affect test performance in the cognitive domain, Frederiksen (1986) concluded that, "what a test measures, then, depends in part on . . . its match to the expertise of the test takers" (p. 450). Clearly, the ability, expertise, or achievement of the test taker may interact with item format. The equivocal and contradictory nature of the findings reviewed above, therefore, may be due, in part, to the effect of achievement.
That is, the failure to take into account the testees' different achievement levels may have confounded the results as achievement interacted with stem format. To explore this possibility further, achievement was included as a variable in the present study. The possible effects of stem completeness (open or closed) on two dependent variables, item difficulty and discrimination, formed the focus of the present study. The following experiment was conducted, then, to determine the effects of stem completeness on item difficulty and discrimination, with provisions to test for possible interactions with achievement level.

METHOD
Experimental Design
A two-way factorial design was used as the experimental model. Accordingly, a 2 × 3 fixed-effects, repeated-measures analysis of variance was
employed. The first independent variable, Factor A, was stem completeness (complete or incomplete), and the second, Factor B, was level of achievement (low, medium, high).

Subjects
Junior students enrolled in a course on developmental psychology taught by the present author served as subjects. Of the total 166 students, 63 were men (38%) and 103 were women (62%). The final examination for the course served as the test instrument. In this way, a normal examination environment was maintained. Subjects were randomly assigned to take either Form A or Form B of the test.

The Instrument
Two parallel forms of a test, Forms A and B, were created from a bank of multiple-choice test items on developmental psychology. The specific content of the test included the development of cognition, affect and emotion, language, memory, and sex roles, as well as learning, socialization, heredity and environment, exceptionality, and adolescence. Of the total 100 items on the final examination, 24 were randomly selected as the experimental items. Each experimental item was written in both open format (e.g., "Human development is BEST characterized as . . .") and closed format (e.g., "Which of the following BEST characterizes human development?"). Half (12) of the experimental items on Form A were closed and the other half (12) were open. The closed-open pattern was reversed on Form B, so that items that were open on Form A were closed on Form B and vice versa.

Achievement
Subjects were rank-ordered based on their total composite scores earned in the course. This rank-ordered distribution was divided into three equal-sized groups, and each subject was classified as belonging to one achievement group: high (n = 55), medium (n = 55), or low (n = 56).
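The rank-and-split classification just described can be sketched as follows. This is an illustrative reconstruction, not the authors' procedure; the scores in the example are hypothetical.

```python
def classify_achievement(scores):
    """Split subjects into low/medium/high achievement groups of
    (nearly) equal size, based on rank-ordered composite scores.
    Returns a dict mapping subject index -> group label."""
    # Rank subjects from lowest to highest composite score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(ranked)
    third = n // 3
    groups = {}
    for rank, subject in enumerate(ranked):
        if rank < third:
            groups[subject] = "low"
        elif rank < 2 * third:
            groups[subject] = "medium"
        else:
            groups[subject] = "high"
    return groups

# Illustrative use: nine hypothetical composite scores.
demo = classify_achievement([55, 72, 64, 80, 49, 91, 60, 77, 68])
```

With 166 subjects this simple cut yields groups of 55, 55, and 56 (the remainder falling into one group), matching the group sizes reported above.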
RESULTS
The over-all performance on Forms A (M = 63.3, SD = 7.3) and B (M = 62.8, SD = 9.0) showed no significant difference between the two forms (t = 1.25, p > .05). Internal consistency reliability (Kuder-Richardson Formula 20) was 0.63 for Form A and 0.78 for Form B. Item difficulties and discriminations were computed for both open-stem and closed-stem formats across achievement groupings. The mean difficulties and discriminations are summarized in Table 1. First, a two-way (completeness × achievement) repeated-measures, fixed-effects analysis of variance was run with item difficulty as the dependent variable. There was a main effect for achievement (F = 9.78, p < .01) but not for completeness (F = 0.06, p > .05). Also, there was no interaction for achievement by completeness (p > .05).

TABLE 1
Mean Difficulty and Discrimination Indices by Achievement Groups (Low, Medium, High)

Second, a similar two-way analysis was run with item discrimination as the dependent variable. There were no main effects (for completeness, F = 0.04, p > .05; for achievement, F = 0.56, p > .05) and no interaction of completeness × achievement (F = 0.84, p > .05). From these results, it can be concluded that item discrimination was unaffected by completeness and did not vary across the achievement groups. Neither was difficulty affected by completeness, although difficulty varied across achievement groups; see Table 1. There were no interactions for either difficulty or discrimination as dependent variables.
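The item statistics reported above can be computed directly from a 0/1 response matrix. The sketch below uses the conventional definitions — difficulty as the proportion of examinees answering correctly, discrimination as the item-total point-biserial correlation, and Kuder-Richardson Formula 20 for internal consistency. It is a generic illustration of these indices, not the authors' analysis code.

```python
import math

def item_statistics(responses):
    """responses: list of subject rows, each a list of 0/1 item scores.
    Returns (difficulties, discriminations, kr20)."""
    n_subj = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]

    # Difficulty: proportion of subjects answering each item correctly.
    difficulties = [sum(row[j] for row in responses) / n_subj
                    for j in range(n_items)]

    # Discrimination: point-biserial correlation of each item with the
    # total score (a Pearson r where one variable is dichotomous).
    def pearson(x, y):
        mx, my = sum(x) / len(x), sum(y) / len(y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = math.sqrt(sum((a - mx) ** 2 for a in x))
        sy = math.sqrt(sum((b - my) ** 2 for b in y))
        return sxy / (sx * sy) if sx and sy else 0.0

    discriminations = [pearson([row[j] for row in responses], totals)
                       for j in range(n_items)]

    # KR-20: (k/(k-1)) * (1 - sum(p*q) / total-score variance),
    # using the population variance of the total scores.
    mean_total = sum(totals) / n_subj
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_subj
    pq = sum(p * (1 - p) for p in difficulties)
    kr20 = (n_items / (n_items - 1)) * (1 - pq / var_total) if var_total else 0.0

    return difficulties, discriminations, kr20
```

Note that the item-total correlation here includes the item itself in the total; a corrected discrimination index would subtract each item from the total before correlating.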
DISCUSSION
The present results indicate that incomplete stems do not increase item difficulty, nor do they affect item discrimination. These findings, however, contrast with the assertions of some textbook writers (e.g., Gronlund & Linn, 1990; Oosterhof, 1990) and some empirical findings (Dudycha & Carpenter, 1973; Violato & Harasym, 1987). The present results also did not support Frederiksen's (1986) conclusion that a test's format interacts with the expertise of the test taker, since there were no interactions between item format and achievement groupings. Dunn and Goldstein (1959), in accord with the present results, found no systematic effect on item difficulty or discrimination as a function of stem completeness. Similarly, McMorris, et al. (1972) concluded that item difficulty and discrimination are fairly "robust" with different stem formats. Aiken (1987) also concluded that multiple-choice items are robust even when item-construction guidelines (such as using open instead of closed stems) are violated. These assertions are supported by the present findings as well as research on other aspects of stem format. Violato and Harasym (1987) and Violato and Marini (1989) found that negative stems were not more difficult than positive ones, although Wason (1961) and Zern (1967) reported that negation affected the difficulty of "binary" statements (i.e., true-false). These findings and suggestions may indeed indicate that the items are robust. The main conclusion from the present results, therefore, is that item difficulty and discrimination may be very robust in measuring examinees'
knowledge of subject matter even when item-writing guidelines are violated. At least this robustness extends to stem completeness, although the evidence reviewed above indicates this might be true for positive or negative stems as well. Quite clearly, a number of characteristics of multiple-choice items may affect item discrimination and difficulty. The present study focused on the effects of stem completeness and supported the view of the "robustness" of multiple-choice items. Nevertheless, given the widespread use of multiple-choice items in both standardized and classroom tests, much more research is required.

REFERENCES

AIKEN, L. R. (1982) Writing multiple-choice items to measure higher-order educational objectives. Educational and Psychological Measurement, 42, 803-806.
AIKEN, L. R. (1987) Testing with multiple-choice items. Journal of Research and Development in Education, 20, 44-58.
DUDYCHA, A. L., & CARPENTER, J. R. (1973) Effects of item format on item discrimination and difficulty. Journal of Applied Psychology, 58, 116-121.
DUNN, T. F., & GOLDSTEIN, L. G. (1959) Test difficulty, validity and reliability as functions of selected multiple-choice item construction principles. Educational and Psychological Measurement, 19, 171-179.
FREDERIKSEN, N. (1986) Toward a broader conception of human intelligence. American Psychologist, 41, 445-452.
GREEN, K. (1984) Effects of item characteristics on multiple-choice item difficulty. Educational and Psychological Measurement, 44, 551-561.
GRONLUND, N. E., & LINN, R. L. (1990) Measurement and evaluation in teaching. (6th ed.) New York: Macmillan.
MCMORRIS, R. F., BROWN, J. A., SNYDER, G. W., & PRUZEK, R. M. (1972) Effects of violating item construction principles. Journal of Educational Measurement, 9, 287-295.
OOSTERHOF, A. C. (1990) Classroom applications of educational measurement. Columbus, OH: Merrill.
REYNOLDS, W. M. (1979) Utility of multiple-choice test formats with mildly retarded adolescents. Educational and Psychological Measurement, 39, 325-331.
TOLLEFSON, N. (1987) A comparison of the item difficulty and item discrimination of multiple-choice items using the "none of the above" and one correct response options. Educational and Psychological Measurement, 47, 377-383.
VIOLATO, C., & HARASYM, P. H. (1987) Effects of structural characteristics of stem format of multiple-choice items on item difficulty and discrimination. Psychological Reports, 60, 1259-1262.
VIOLATO, C., & MARINI, A. E. (1989) Effects of stem orientation and completeness of multiple-choice items on item difficulty and discrimination. Educational and Psychological Measurement, 49, 287-296.
WASON, P. C. (1961) Response to affirmative and negative binary statements. British Journal of Psychology, 52, 133-142.
WEITEN, W. (1984) Violation of selected item construction principles in educational measurement. Journal of Experimental Education, 52, 174-178.
ZERN, D. (1967) Effects of variations in question-phrasing on true-false answers by grade-school children. Psychological Reports, 20, 527-533.
Accepted October 4, 1991.