Research in Nursing & Health, 1991, 14, 165-168

Focus on Psychometrics: Aspects of Item Analysis

Sandra Ferketich

During the course of instrument development, investigators face the challenge of developing a psychometrically sound instrument that has a minimal number of items or components. Although instrument developers may encounter specific problems with different types of tests, three areas of concern are frequently encountered: (a) instrument length, (b) scale homogeneity, and (c) instrument sensitivity. The purpose of this article is to discuss selected aspects of item analysis in relation to these three commonly encountered and interrelated areas of concern.

Many scales or instruments used in nursing research are composites made up of the sum of scores from a series of components or items. Although there are many criteria for judging the efficacy of a given test for a specific population and purpose, a general goal is to develop a test of minimal length while maintaining or achieving acceptable support for its reliability and validity. In other words, the challenge is to develop an instrument composed of the "best set" of items. In locating that "best set" among all possible sets, the test developer generally encounters three areas of concern: (a) test length, (b) scale homogeneity, and (c) test sensitivity. The purpose of this article is to describe how selected aspects of item analysis can be used to assist the investigator in addressing these concerns.

Item analysis focuses on the individual item in a composite instrument. The investigator can legitimately focus on an individual item because the characteristics of the composite are a direct function of the characteristics of the individual items (Ghiselli, Campbell, & Zedeck, 1981). Additionally, problems regarding instrument performance are often more easily discerned when dealing with individual items. Although most of the discussion in this article focuses on numerical estimates, none is a substitute for the logical arguments that are an integral element of validity assessments. Keeping that in mind, analysis of individual items is still a reasonable approach to addressing questions about total instrument performance.

Assessment of individual item characteristics is best done after a large-scale field test of the pool from which the final instrument will be constructed. Nunnally (1978) recommends that this initial pool contain 1.5 to 2 times as many items as the final instrument. In this way there will be enough items available to discard while still retaining an adequate number in the instrument. Further, Nunnally (1978) states that it is probably difficult to achieve a Cronbach's alpha (actually a KR-20) of .80 with fewer than 30 dichotomous items. If the items are multipoint, considerably fewer than 30 may be adequate to reach this level of alpha. How large a sample is needed depends partly on the number of items in the instrument. Because of the number of computations done to produce the statistics needed for analyzing the characteristics of a given item, there are many opportunities for chance to operate. There should be at least 5 times as many subjects as items, or at least 200 to 300 subjects, whichever is greater, to minimize the probability of misleading results based on chance (Crocker & Algina, 1986; Nunnally, 1978). Furthermore, the subject group should reflect the population for which the test is designed.

This is the third article in the series, "Focus on Psychometrics," edited or contributed by Sandra Ferketich, PhD, RN. Dr. Ferketich is an associate professor in the College of Nursing and the Director of the Predoctoral and Postdoctoral Clinical Nursing Research Instrumentation Fellowship Program (National Research Service Award Institutional Grant No. 5T32 NR07029) at the University of Arizona. Requests for reprints can be addressed to Dr. Sandra Ferketich, College of Nursing, University of Arizona, Tucson, AZ 85721.

© 1991 John Wiley & Sons, Inc. 0160-6891/91/020165-04 $04.00


However, finding this number of subjects may be difficult if the instrument is designed for vulnerable or rare populations. Given the constraints of clinical populations, item analysis is frequently performed with far fewer subjects. The prudent investigator needs to be cautious about discarding items based on a single small-sample test of an instrument.
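A minimal Python sketch of the rules of thumb above follows; the function name and return structure are illustrative assumptions, not part of the original article.

```python
# A trivial planning helper (names are illustrative assumptions) encoding
# the rules of thumb above for a large-scale field test of an item pool.

def field_test_targets(n_final_items):
    """Rules of thumb for field-testing an initial item pool."""
    # Nunnally (1978): pool 1.5 to 2 times the final item count.
    pool = (round(1.5 * n_final_items), 2 * n_final_items)
    # At least 5 subjects per administered item, and at least 200 to 300
    # subjects overall (Crocker & Algina, 1986; Nunnally, 1978).
    min_subjects = max(5 * pool[1], 200)
    return pool, min_subjects

pool, n = field_test_targets(30)
print(pool, n)  # (45, 60) 300
```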

INSTRUMENT LENGTH

The length of an instrument is affected by a number of factors, including clinical constraints, theoretical considerations, and construct complexity. When addressing parsimony in an instrument, it is useful to think of both length and psychometric properties. Although a short instrument may be desirable, it must be of adequate length to represent the universe of interest. Such arguments about adequacy of representation are more properly the basis of discussions regarding content validity. However, when striving to decrease the length of the instrument, it is crucial to keep such arguments in mind.

When analyzing individual items from the pool, a common practice is to cull the pool and retain those items that have the best interitem correlations. This can substantially shorten the test, increase the scale homogeneity, and improve reliability estimates. However, there may be a concomitant effect on scale sensitivity, as discussed later.

At first appearance, increasing scale homogeneity by decreasing the number of items seems to directly refute the argument that the longer the test, the greater its reliability. It can be shown that Cronbach's alpha, as a measure of internal consistency, increases as the number of items increases if the average interitem correlation remains constant. A quick examination of the formula for standardized alpha (Cronbach, 1951) illustrates the relationship. Using a correlation matrix, the formula for alpha is:

    alpha = (a)(r̄) / [1 + (a - 1)(r̄)],   with r̄ = 2b / [a(a - 1)]

where
a = the number of items
b = the sum of the interitem correlations (each pair counted once)
and r̄ is the average interitem correlation. As can be seen, only a and b can vary. If the average interitem correlation is held constant, then b changes only as a function of a, and an increase in the number of items can exert its full effect on alpha. With the average interitem correlation held at .20, the alpha for 5 items is .556, whereas it is .714 for 10 items. Zeller and Carmines (1980) provide further information on the relationship of the number of indicants, the average interindicant correlation, and the change in alpha.
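The behavior of the formula is easy to check numerically. A minimal Python sketch follows (an illustration, not part of the original article); it reproduces the alpha values cited here and in the next paragraph.

```python
# A minimal sketch (illustrative, not from the article) of standardized
# alpha as a function of test length and average interitem correlation.

def standardized_alpha(n_items, avg_r):
    """Standardized Cronbach's alpha (Cronbach, 1951)."""
    return (n_items * avg_r) / (1 + (n_items - 1) * avg_r)

# Holding the average interitem correlation at .20, alpha rises with length:
print(round(standardized_alpha(5, .20), 3))    # 0.556
print(round(standardized_alpha(10, .20), 3))   # 0.714
# Culling two weak items so that the average correlation rises from .20
# to .40 raises alpha despite the shorter test (see the next paragraph):
print(round(standardized_alpha(8, .20), 2))    # 0.67
print(round(standardized_alpha(6, .40), 2))    # 0.8
```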

However, because the process used in item analysis is to remove the items that have the lowest interitem correlations, the average interitem correlation is not held constant. In fact, items that do not substantially meet the level of correlation that other items meet are the first to be discarded. Thus, under these circumstances, substantially shortening the instrument increases the homogeneity and the internal consistency estimates of reliability. Alpha for an 8-item scale with an average interitem correlation of .20 is .67. If two items are removed and an average interitem correlation of .40 is obtained, the alpha is .80. Therefore, increasing the average interitem correlation by culling poor items can be powerful. The magnitude of the change in alpha decreases as the number of items increases, however.

Information regarding individual item characteristics can be obtained from a number of sources. Perhaps one of the easiest ways to accumulate the data is to use the results from a reliability routine in one of the major software packages. No matter what routine is used, the following pieces of information will help in making a decision about whether any given item should be retained or deleted: (a) an interitem correlation matrix, (b) a corrected average interitem correlation coefficient, (c) a corrected item-to-total correlation coefficient, and (d) the alpha estimate if the item is dropped from the scale. The corrected correlation coefficients should be obtained particularly when there is a small number of items in the instrument or scale. When a correlation coefficient is computed between an item and the total, the total contains information about the item. Thus, if the item is a large part of the total, the estimate of the correlation will appear much stronger than is warranted, because the item's correlation of 1.00 with itself will be included in the calculations. Of course, as the number of items increases, this effect is minimized.

The correlation matrix can offer invaluable information about the way a particular item relates to other items. A constructed example of a correlation matrix is provided in Table 1. The average interitem correlations are all between .34 and .48. Although there is no hard and fast criterion for level of correlation, a rule of thumb is that items that correlate below .30 are not sufficiently related, and therefore do not contribute to measurement of the core factor, and that items that correlate over .70 are redundant and probably unnecessary.


Table 1. Correlation Matrix Among Items

Item    2     3     4     5     6     7     8     9    10      a      b
 1     .45   .30   .79   .56  -.04   .45   .62   .26   .48    7/9    .44
 2           .26   .15   .49  -.27   .34   .38   .47   .76    6/9    .40
 3                 .25   .33   .79   .33   .44   .54   .13    5/9    .37
 4                       .35   .67   .53   .34   .51   .24    5/9    .43
 5                             .23   .42   .33   .14   .60    6/9    .37
 6                                   .43   .?5   .?9   .19    2/9    .34
 7                                         .48   .44   .38    9/9    .39
 8                                               .57   .68    8/9    .48
 9                                                     .46    6/9    .39
10                                                             6/9    .47

a: The proportion of times the item correlates between .30 and .70 with other items in the scale.
b: The average correlation of the item with all others in the scale.

It would appear from the summary statistic of average interitem correlation that each item is within those parameters. On closer inspection, however, it can be seen that of the nine possible times that Item 6 could correlate, it meets the criteria only twice; tracing Item 6's column and row in the matrix makes this easy to follow. Thus, the majority of times that the item is correlated with other items in the scale, the correlations are less than .30 or over .70. The correlation between Item 6 and Item 3 is .79, indicating redundancy. This information would not have been readily available if only the summary statistic were used. The investigator also may want to examine the correlations between Item 1 and Item 4 and between Item 4 and Item 6 to determine why there are such high correlations between these items.

The summary statistic of the corrected average interitem correlation is also provided by most statistical packages. This summary statistic, when combined with the detailed information above about each item, provides a more complete picture of item performance. Interitem correlations above .30 are desired. In practice, however, it is possible to achieve an alpha of .71 with an average interitem correlation of .20, provided the test has a length of 10 or more items (Zeller & Carmines, 1980).

Corrected item-to-total correlations are also provided by most statistical packages. The higher the corrected correlation between the item and the total, generally the better the item. Again, correlations above .30 are good (Nunnally, 1978). If there is a pool of items with correlations this high or higher, the investigator is in good shape. If not, sets of 5 or 10 items with the next highest item-to-total correlations can be added in an attempt to obtain the instrument length needed to improve internal consistency reliability estimates. The process of adding items from this pool should stop at any point where alpha fails to improve, becomes worse, or item homogeneity decreases.

The fourth piece of information is the revised alpha if the item were dropped from the instrument. This is readily available in some packaged programs, such as the Statistical Package for the Social Sciences (SPSS). Perhaps because this information is so readily available, some researchers have abused it by making it the sole determinant of whether to retain or drop an item. As part of the overall information amassed on a particular item, the alpha-if-item-dropped estimate can be invaluable. The investigator does need to keep in mind, however, that the cumulative effect of dropping more than one item may be inadequately reflected in the estimate: the estimate is based on dropping only the one item, not a combination of items.
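The four pieces of information described above can also be computed outside a packaged reliability routine. A sketch with NumPy follows; the function and variable names are illustrative assumptions, not the SPSS output or anything from the original article.

```python
# A sketch (assumed, not from the article) of the item statistics discussed
# above, computed with NumPy from a subjects-by-items matrix of scores.
import numpy as np

def item_analysis(scores):
    """scores: (n_subjects, n_items) array of item responses."""
    n_subjects, k = scores.shape
    r = np.corrcoef(scores, rowvar=False)          # (a) interitem correlation matrix
    off_diag = r[~np.eye(k, dtype=bool)].reshape(k, k - 1)
    avg_r = off_diag.mean(axis=1)                  # (b) average correlation per item
    in_band = ((off_diag >= .30) & (off_diag <= .70)).mean(axis=1)  # Table 1, column a

    total = scores.sum(axis=1)
    corrected_r = np.array([                       # (c) corrected item-to-total r:
        np.corrcoef(scores[:, i], total - scores[:, i])[0, 1]  # total excludes the item
        for i in range(k)
    ])

    def alpha(s):
        """Cronbach's alpha from the item covariance matrix."""
        c = np.cov(s, rowvar=False)
        m = s.shape[1]
        return (m / (m - 1)) * (1 - np.trace(c) / c.sum())

    alpha_if_dropped = np.array([                  # (d) alpha with item i removed
        alpha(np.delete(scores, i, axis=1)) for i in range(k)
    ])
    return r, avg_r, in_band, corrected_r, alpha_if_dropped
```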

SCALE HOMOGENEITY

Obviously, much of the prior discussion directly relates to development of a scale composed of a homogeneous set of items. The higher the interitem correlations, the more homogeneous the scale; deleting items with low interitem correlations increases scale homogeneity. The argument for achieving similarity among items is central to the estimates of both reliability and validity. In the formula for alpha shown above, if the total number of indicants is kept constant, then the only other term that can vary is the sum of the interitem correlations. Alpha will increase as the sum of the interitem correlations increases. Thus reliability, as assessed by measures of internal consistency, is directly related to the degree of interitem correlation. Given the argument that scale validity is a function of its adequate measurement of an attribute, each item in that scale also should be an adequate measure of that attribute. Therefore, the more the items relate to each other, the more they measure the same attribute. Of course, examination of coefficients beyond internal homogeneity is warranted. Item validity estimates can be obtained by correlating the score on an individual item with an outside criterion. Test validity estimates can be obtained by correlating the score on the composite with an outside criterion, but again this is a direct function of the item validity estimates. Thus the importance of scale homogeneity is apparent for both reliability and validity arguments.

TEST SENSITIVITY

Test sensitivity is defined here as the ability of the scale to discriminate among individuals with varying levels of the attribute of interest. The sensitivity of a given test may vary across populations. For example, certain fatigue instruments were designed to measure that attribute among healthy working populations. When given to populations in which fatigue is a daily occurrence of some magnitude, the items, and thus the scale, may fail to detect small differences among subjects. Consider chronically ill populations, those with chronic fatigue syndrome, or those undergoing treatments such as those for cancer. In these populations, finer discrimination may be necessary than among healthy populations, for whom fatigue may be a transient effect brought on by lack of sleep or exertion.

As already stated, item performance is a direct index of scale performance. As such, each item can be examined for the distribution of subjects among the available options for that item. If inadequate variance is observed on most or all of the items, then little variance among subjects will be observed on the composite scale. One approach to solving this problem is to increase the number of components in the composite. In general, as the number of items increases, the standard deviation of the total score increases. The higher the interitem correlations, the greater this effect on the standard deviation of the scale: when items are perfectly intercorrelated, the standard deviation increases in direct proportion to the number of items. However, since there is little reason to have a scale composed of perfectly correlated items, in practice increasing the number of items has a more modest impact on the scale standard deviation (Ghiselli, Campbell, & Zedeck, 1981).
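A small worked sketch in Python, under the assumed simplification of unit-variance items sharing a common interitem correlation, illustrates how the composite standard deviation grows with test length.

```python
# A numerical sketch (assumed: unit-variance items sharing a common
# interitem correlation r) of how the composite SD grows with test length.
# Var(total) = k + k(k - 1)r, so SD(total) = sqrt(k * (1 + (k - 1) * r)).
import math

def total_sd(k, r):
    return math.sqrt(k * (1 + (k - 1) * r))

for k in (5, 10, 20):
    # With r = 1.0 the SD equals k, so doubling the items doubles the SD;
    # with a modest r the SD grows far more slowly (roughly with sqrt(k)).
    print(k, total_sd(k, 1.0), round(total_sd(k, 0.2), 2))
```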

SUMMARY

Each of the issues explored in this article has solutions based on adding or deleting items. Modification of items, additional estimates of content validity, and other theoretical solutions were not discussed. The trade-off for each of the solutions discussed is that changing the number of items can alter other properties. For example, when making a scale more homogeneous by decreasing the number of items, the test may become less sensitive. When sensitivity is in jeopardy, one course of action is to increase the number of items; of course, increased length may be problematic for a particular subject population. Needless to say, the checks and balances in test construction can be quite challenging. Evaluating items based on several sources of information, and knowing the interrelated effects of certain actions, are of assistance to the instrument developer. In the final analysis, however, it is the judgment of the investigator that makes the difference.

REFERENCES

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart and Winston.

Cronbach, L. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.

Ghiselli, E., Campbell, J., & Zedeck, S. (1981). Measurement theory for the behavioral sciences. San Francisco: W. H. Freeman.

Nunnally, J. (1978). Psychometric theory. New York: McGraw-Hill.

Zeller, R., & Carmines, E. (1980). Measurement in the social sciences. New York: Cambridge University Press.
