Research Quarterly for Exercise and Sport

ISSN: 0270-1367 (Print) 2168-3824 (Online) Journal homepage: http://www.tandfonline.com/loi/urqe20

To cite this article: John M. Linacre (2000) New Approaches to Determining Reliability and Validity, Research Quarterly for Exercise and Sport, 71:sup2, 129-136, DOI: 10.1080/02701367.2000.11082796

To link to this article: http://dx.doi.org/10.1080/02701367.2000.11082796

Published online: 13 Feb 2015.



Research Quarterly for Exercise and Sport ©2000 by the American Alliance for Health, Physical Education, Recreation and Dance, Vol. 71, No. 2, pp. 129-136

New Approaches to Determining Reliability and Validity


John M. Linacre

Keywords: validity, reliability, measurement, rating scales

Why "New Approaches"?

The current approaches to measurement in the social sciences were developed in the 1930s and have remained essentially unchanged for more than 50 years. Compare these summaries: "A good test is one that does what it is supposed to do... 1. A good test must actually measure what it is supposed to measure (validity). 2. It must do this accurately and consistently (reliability). 3. It must be fair to the students (objectivity)" (Micheels & Karnes, 1950, p. 103) and "There are three characteristics essential to a sound measuring instrument: reliability, objectivity, and validity... 1. A test or measuring instrument is valid if it measures what it's supposed to measure... 2. A reliable instrument measures whatever it measures consistently. 3. Objectivity is defined in terms of the agreement of competent judges about the value of a measurement" (Baumgartner & Jackson, 1982, pp. 91-93, 139), then "We should seriously consider using test results only from tests that are valid, reliable and accurate...

John M. Linacre is with the MESA Psychometric Laboratory at the University of Chicago.


1. Validity - Does the test measure what it is supposed to measure? 2. Reliability - Does the test yield the same or similar scores (all other factors being equal) consistently? 3. Accuracy - Does the test fairly closely approximate an individual's true level of ability, skill, or aptitude?" (Kubiszyn & Borich, 2000, p. 297)

Clearly the concepts of reliability and validity have been helpful, or they would not have survived so long nor be propounded everywhere in social science as ideals. Yet the addition of further criteria such as "objectivity" and "accuracy" hints that practitioners have noticed deficiencies in at least the current implementations of reliability and validity, if not in the concepts themselves.

Doubts about the currently dominant measurement paradigm in the social sciences are raised when we consider measurement in the physical sciences. Measurement in the physical sciences has been a continuing story of success. Effective measurement of heat facilitated the development of steam engines and internal combustion engines. Effective measurement of speed motivated the theory of relativity and brought on nuclear science. There are no similar developments in the field of social science. It could be argued, "But physicists know what they are measuring; social scientists do not!" The history of thermometry belies this (Choppin, 1985). There were contradictory theories, contradictory measurement systems, and a multitude of confusing external factors. But there was a hermeneutic cycle: theory motivated better measurement; better measurement motivated better theory. Gradually, untenable theories were eliminated and external factors accounted for. Now, for all practical purposes, the molecular motion theory of thermodynamics is so nearly perfect that it can be taken for granted and applied with minimal thought. There have been no similar advances in social science theory or social science measurement.


The terms "reliability", "validity", and "scores" are curiously absent from the central corpus of physical measurement literature (e.g., Ku, 1969). Terms such as "accuracy", "precision", and "standards" occur frequently. The contrasting words reflect contrasting measurement philosophies. Social science measurements are vague, global, and distorted. Physical science measurements are specific, local, and linear. The intention of the "new approaches" is to endow social science measurement with those attributes that have proved highly productive in physical science measurement.

The New Approaches vs. the Current Approaches

What is the practical difference between the new approach of specific, local, linear measurement and the current approach of vague, global, distorted measurement? Let us demonstrate with an example. The Knox Cube Test (Knox, 1914) is a test of attention span and short-term memory. The Arthur (1947) version comprises 18 items. It requires the subject to copy a series of simple actions performed by the examiner. The exact repetition of an item is scored as a success; anything else is a failure. This test was administered to 35 students in Grades 2 through 7 (Wright & Stone, 1979, p. 33). The format of this test is similar to many tests of fitness and dexterity in which a series of tasks must be successfully completed in order for the subject to be credited with a success.
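Since each item is scored dichotomously, an item's p-value is simply the proportion of the sample succeeding on it, and ranking items by p-value orders them by difficulty. The sketch below is not the article's analysis: the p-values are invented, with only their ordering following the article's Figure 1, and the 0.05 tolerance is an arbitrary illustrative choice.

```python
# Sketch of a p-value validity check: rank items (tapping patterns) by the
# proportion of subjects succeeding on them, then flag pairs where a longer
# pattern is *easier* than a shorter one, which challenges the theory that
# longer, more complex patterns are harder. P-values here are invented.
items = {
    "2-3": 0.95, "1-2-4": 0.93, "1-4": 0.90, "1-3-4": 0.85,
    "1-4-3-2": 0.74, "2-1-4": 0.70, "1-3-2-4": 0.66, "3-4-1": 0.62,
}
ranked = sorted(items.items(), key=lambda kv: kv[1], reverse=True)
for pattern, p in ranked:
    print(f"{pattern:10s} p = {p:.2f}")

taps = lambda pat: len(pat.split("-"))  # number of taps in a pattern
for a, pa in items.items():
    for b, pb in items.items():
        if taps(a) > taps(b) and pa > pb + 0.05:  # tolerance for sampling noise
            print(f"check: {a} (p={pa}) is easier than the shorter {b} (p={pb})")
```

With these invented values, the check flags exactly the kind of disordering the article discusses: the 4-tap pattern 1-4-3-2 comes out easier than the 3-tap pattern 3-4-1.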

The Current Approach

According to the current approach, Knox (1914) establishes the validity of the test. He reports that a simple tapping pattern of 4 taps in sequence can be accomplished by 4-year-olds. A complex tapping pattern of 5 taps out of sequence can be accomplished by 11-year-olds. Since greater physical age in children is usually associated with longer attention span, it is concluded that the test is measuring what it is supposed to measure. The internal KR-20 reliability of the test for the sample reported by Wright and Stone (1979) is .76. Though not as high as the .90 preferred by test publishers, it is high enough to indicate a statistical separation between the high and low performers on the test, and so is taken to be good enough for simple decision making. Results of the test are reported as raw scores. For the sample, they range from 3 to 14, and so seem able to function as useful indicators of performance differences. The test is satisfactory according to current criteria and, indeed, has been used successfully for almost a century. According to the new approach, however, the conclusions just drawn, though not grossly incorrect, are inadequate.

The New Approach: Validity

Validity is no longer established, once for all time, for the whole test, by criteria only indirectly related to the content of the test, such as the chronological age of the subjects. Instead, validity is reevaluated every time the test is administered, for each item in the test, according to the substantive theory which the test items are intended to implement. In the case of the Knox Cube Test, this can be achieved by inspecting a map showing the items located along the variable, rank-ordered according to their p-values. In Figure 1, the items are inversely ranked according to their p-values. A theory of attention span and short-term memory would be that longer and more complex tapping patterns indicate higher functioning. In Figure 1, the digits indicate the order in which four cubes, numbered sequentially 1 to 4 from the examiner's left to right, are tapped. We see that the items with the highest p-values, the easiest items, are 1-4, 2-3, and 1-2-4. The item with the lowest p-value, the most difficult item, is 4-1-3-4-2-1-4. This is aligned with our expectation from substantive theory about brain functioning and confirms Knox (1914). Patterns 3-4-1 and 1-4-3-2, however, challenge the substantive theory. They are out of order: 3-4-1 is shorter and appears simpler than 1-4-3-2. Thus the specific validity of these items is called into question. Perhaps this disordering is merely an idiosyncrasy of this sample; perhaps it is an artifact of the manner in which subjects recall sequences; whatever the reason, both the validity of the test and the formulation of the substantive theory are suspect until the disordering is resolved. This is an example of the specific investigation of validity that is inherent in the new approach.

Figure 1. Test items ranked by p-value.

    (lower p-values)
    4-1-3-4-2-1-4
    1-3-2-4-1-3
    1-4-2-3-1-4
    1-4-3-1-2-4
    1-4-2-3-4-1
    1-3-2-4-3
    1-4-3-2-4
    1-3-1-2-4
    2-4-3-1
    1-4-2-3
    3-4-1
    1-3-2-4
    2-1-4
    1-4-3-2
    1-3-4
    1-4
    1-2-4
    2-3
    (higher p-values)
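The KR-20 internal-consistency coefficient cited for this sample (.76) is computed directly from the 0/1 response matrix. The sketch below uses an invented 5-subject, 5-item matrix, not the Knox Cube Test data, and takes the population-style variance of the total scores, one common convention for KR-20.

```python
# Minimal KR-20 sketch for a dichotomous test: rows are subjects, columns are
# items (1 = success). KR-20 = (k/(k-1)) * (1 - sum(p*q) / var(total scores)).
def kr20(responses):
    n_subjects = len(responses)
    k = len(responses[0])  # number of items
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_subjects
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_subjects
    # p = proportion of subjects passing each item, q = 1 - p
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n_subjects
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / var_total)

data = [  # invented response matrix for demonstration only
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
print(round(kr20(data), 2))  # -> 0.75
```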

The New Approach: Reliability

In the current approach, the "test reliability" is often assumed, erroneously, to be true of all samples and to apply equally to all subjects within the samples. Thus every subject who ever takes the test is regarded as having the same standard error of measurement, computed directly from the published reliability coefficient. Under the new approach, a separate standard error is computed for each subject. These can be summarized for any particular sample to produce reliability coefficients equivalent to KR-20. The key question, however, is no longer "how reproducible are these test scores overall?", but "how accurate and precise are the locations of these subjects on the variable, considering the purposes for which the test is intended this time?" As with physical measurement, accuracy relates to how close a reported location is to its actual value. Precision relates to how well the test reproduces the reported location, whether accurate or not.

Figure 2 shows the 35 subjects who took the Knox Cube Test ordered by raw score. We see that even though the 35 subjects came from 6 grades (2-7), the test has failed to differentiate between one third of the sample, 12 subjects, giving them all the same raw score of 10. For these subjects, the test has no reliability. In fact, dropping these subjects from the sample raises the KR-20 reliability of the test from .76 to .83. Thus, for this test, the least reliable decisions will be those relating to subjects for whom this test appears to be optimally targeted, i.e., those scoring near 10 out of 18. It is this local investigation of test reliability that is characteristic of the new approach.

Figure 2. Sample distribution for the Knox Cube Test (subjects ordered by raw score, 3-14).

The New Approach: Measures vs. Scores

Under the current approach, subjects are characterized by raw scores or their proxies, such as stanines, percentiles, etc. All these produce numbers that lead to distorted placements of the subjects along the variable. For instance, raw scores give the appearance of being linear measures, because they are cardinal numbers. Indeed, the numbers are linear, but their meaning as scores is not. Linearity implies that one more equal-size increment can always be made. For the numbers representing the scores, this is true. For the scores themselves it is not: there is a maximum score and a minimum score. This range-restriction in raw scores forces the scores to be nonlinear, and distorts their meaning for the score-user and for subsequent analysis. Stanines and percentiles have different, but equally distorting, properties.

It can be demonstrated in numerous ways that the necessary and sufficient means of linearizing raw scores is the Rasch model (Fischer, 1995). In Figure 3, this is applied to the Knox Cube Test data. Now the vertical scaling is linear. Subjects are aligned with items on which their predicted success is 50%. To the left are shown the subjects' raw scores. Their non-linearity is apparent: the advance from 5 to 6 is less than that from 10 to 11.

Figure 3. Map of Rasch measures for the Knox Cube Test (subjects and items located on one linear scale).
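The score-to-measure conversion behind a table like Table 1 can be sketched as follows. This is a hypothetical illustration, not the article's calibration: the 18 item difficulties below are invented (Wright & Stone, 1979, give the actual Knox Cube Test values). Each interior raw score maps to the ability whose expected score equals it, and the standard error comes from the Fisher information.

```python
import math

# Invented difficulties (in logits) for an 18-item dichotomous test.
difficulties = [-3.0, -2.5, -2.0, -1.5, -1.2, -0.9, -0.6, -0.3, 0.0,
                0.3, 0.6, 0.9, 1.2, 1.5, 2.0, 2.5, 3.0, 3.5]

def measure_for_score(score, diffs, tol=1e-8):
    """Ability (logit) whose expected raw score equals the observed score."""
    b = 0.0
    for _ in range(100):
        probs = [1 / (1 + math.exp(d - b)) for d in diffs]  # Rasch P(success)
        expected = sum(probs)
        info = sum(p * (1 - p) for p in probs)   # Fisher information
        step = (score - expected) / info         # Newton-Raphson update
        b += step
        if abs(step) < tol:
            break
    se = 1 / math.sqrt(info)  # per-score standard error of measurement
    return b, se

for score in (5, 10, 14):
    b, se = measure_for_score(score, difficulties)
    print(f"score {score:2d}: measure {b:+.2f} logits, S.E. {se:.2f}")
```

Note that each raw score gets its own standard error, the point made in the reliability section above; extreme scores (0 or 18) would need special handling, since their maximum-likelihood measures are infinite.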

Table 1. Rasch measures for the Knox Cube Test sample.

    Score   Measure   S.E.   Frequency     %    Percentile
      3       2.8     2.0        1        2.9        1
      4       4.3     1.2        0         .0        3
      5       5.2      .9        1        2.9        4
      6       6.0      .9        2        5.7        9
      7       6.7      .9        2        5.7       14
      8       7.5      .9        2        5.7       20
      9       8.4     1.1        3        8.6       27
     10       9.6     1.2       12       34.3       49
     11      10.9     1.1        5       14.3       73
     12      12.0     1.0        4       11.4       86
     13      13.0     1.0        1        2.9       93
     14      13.9     1.0        2        5.7       97

It is now seen why so many subjects achieved a score of 10: there is no tapping pattern of a difficulty aligned to differentiate between them. This test would be improved by patterns that are easier than 1-3-1-2-4, but harder than 2-4-3-1. Wright and Stone (1979) demonstrate from substantive theory and additional data that 3-1-4-2 and 1-2-3-4-2 are such patterns.

Constructing a linear measurement framework not only makes validation of a test simpler, but also gives the resultant person measures the arithmetical properties that test users imagine raw scores, stanines, or percentiles contain. Rescaling the raw scores into a convenient range of linear measures further improves the usability of the measures. Table 1 contains such linear measures, their standard errors, and descriptive statistics for this sample. In this example, the linear measures have been deliberately scaled to mimic the non-linear raw scores. This eases the transition of users from the current to the new approach.

The new approach enables every test user to verify the validity of the test for his or her own purpose and also to determine the substantive meaning of the results of the test. The operational meaning of each measure can be directly ascertained from a map of the test like Figure 3. The meaning of the test, i.e., what the test is measuring, can be determined simply by inspecting the item hierarchy on the map. The meaning of a particular measure is that the subject is expected to succeed on those items lower than the subject on the map, and fail on those higher. Further, the new approach produces linear measures for the subjects (and also the items) ideal for simple arithmetic manipulation, plotting, and analysis by conventional statistical techniques.
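The map's reading rule (succeed on items below, fail on items above) follows directly from the Rasch success probability, which depends only on the difference between subject measure and item difficulty. In the sketch below, the subject measure 9.6 is the Table 1 value for a raw score of 10, but the three item difficulties are invented for illustration.

```python
import math

# Rasch probability of success: depends only on (measure - difficulty).
def p_success(measure, difficulty):
    return 1 / (1 + math.exp(difficulty - measure))

subject = 9.6  # measure for a raw score of 10 (Table 1)
# Item difficulties below are invented: one well below, one at, one well
# above the subject's measure.
for name, d in [("easy item", 6.0), ("targeted item", 9.6), ("hard item", 13.0)]:
    print(f"{name}: P(success) = {p_success(subject, d):.2f}")
```

Items far below the subject give success probabilities near 1, items far above give probabilities near 0, and an item at the subject's own level gives exactly 0.50, which is why subjects sit opposite the items on which their predicted success is 50%.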

Table 2. Characteristics of AAHPER items.

AAHPER Youth Fitness Test (Baumgartner & Jackson, 1982)

    Item           Range      Max. no. of        Observed           No. of ordered   Ordered categories
                              different values   different values   categories
    Situps         22-100          79                 11                  3          1-39, 40-79, 80-100
    Pullups        0-16            17                 11                  6          0, 1, 2-5, 6-10, 11-15, ...
    Shuttle Run    9.7-12.4        28                 20                  6          9.5-9.9, 10.0-10.4, ...
    Broad Jump     48-84           37                 12                  8          45-49, 50-54, ...
    Ball Throw     54-170         117                 33                  7          50-69, 70-89, ...
    600-yard Run   95-156          62                 27                  7          90-99, 100-109, ...
    50-yard Dash   6.8-10.3        36                 20                  8          6.5-6.9, 7.0-7.4, ...


The application of these new approaches to attitude surveys, Likert-scales and other rating-scale instruments is well-described and documented (e.g., Wright & Masters, 1982). Relevant papers in the field of physical fitness by Looney, Safrit, Spray and Zhu are in the Reference list.


Applying the New Approaches to Physical Fitness Data

Scores of 40 students on the 7-item AAHPER Youth Fitness Test (1976) are given by Baumgartner and Jackson (1982, p. 61). Table 2 summarizes some characteristics of the items of this test. According to the current approach, it is seen that the substance of the 7 items, "Situps" etc., conforms with the notion of fitness, and so the test seems to measure what is intended. Thus it is deemed "valid". The raw score reliability of the original data is .77 (when time data are reversed and decimal fractions multiplied by 10). Because the observed ranges of values differ markedly across items, the raw score reliability of these data increases to .89 when the observations are standardized by item to T-scores. This revised reliability is generally deemed satisfactory. T-scores also seem to be convenient numbers for subsequent analysis. Consequently this test, as it stands, satisfies the criteria of the current approach.

Under the new approach, consideration of validity is more thorough and detailed. First, consideration is given to the observations themselves. Whatever the test is measuring, higher data values must indicate more of it. Of course, as with the current approach, this requires the three timed items (Shuttle Run, 600-yard Run, 50-yard Dash) to have their data reversed by subtraction from some convenient value. Then all items are positively correlated. But validation of the observations further requires close inspection of the results of analysis to determine whether higher data values for each item consistently indicate higher performance on the test overall. Rasch methodology applied to rating scale analysis supplies powerful diagnostics of this. Table 3 is an example.

In Table 3, the Broad Jump distances vary between 48 and 84 inches. Counts show the frequency of each distance in the data. The "average measure" column is an outcome of Rasch analysis. It shows averages of the linear measures of the students at each jump distance. The linear measures are inferred from the students' performance on the entire test. In general, it is seen that measures on the whole test advance with distance on this particular item. But not always. Disordering in Table 3 is flagged by "*" in the "average measure" column. Since increasing broad jump performance does not always correspond to increasing performance on the test overall, this disordering threatens the validity of the test as a measuring instrument. Of course, with so many possible observational levels, some disordering is expected by chance, but Table 3 shows a pattern. It indicates that at least a 5-inch, and perhaps a 10-inch, advance in broad jump performance is required before there is a noticeable advance in performance on the overall test. A practicable set of 5-inch advances is shown in Table 3 as "ordered categories". Recoding the observations on the item from their original inches to the ordered "5-inch" categories removes disordering and so increases the interpretability and validity of the observational basis of the test. A similar investigation is performed on each item.
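The "average measure" diagnostic can be sketched directly: group students by their observed value on one item, average their test-wide linear measures, and flag any average that drops below its predecessor. All (distance, measure) pairs below are invented; they are not the AAHPER data.

```python
from collections import defaultdict

# Invented (broad jump inches, student measure in logits) pairs.
observations = [(48, -0.8), (50, -0.5), (50, -0.7), (55, -0.2),
                (55, -0.9), (60, 0.1), (65, -0.3), (70, 0.6)]

# Accumulate the sum and count of student measures at each observed value.
sums, counts = defaultdict(float), defaultdict(int)
for value, measure in observations:
    sums[value] += measure
    counts[value] += 1

# Walk the values in ascending order; a drop in average measure is disordering.
previous = None
for value in sorted(sums):
    avg = sums[value] / counts[value]
    flag = "*" if previous is not None and avg < previous else ""
    print(f"{value} in.: average measure {avg:+.2f} {flag}")
    previous = avg
```

With these invented data, 65 inches is flagged because its average measure falls below that of 60 inches, the same kind of disordering that motivates recoding adjacent distances into wider ordered categories.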

Table 3. Validity of Broad Jump distances.

[Table 3 lists each observed Broad Jump distance (48-84 inches) with its count and the average Rasch measure of the students at that distance, followed by the ordered category (1-8) and the average measure after recoding into 5-inch categories. In the original data, several average measures (e.g., -.27, .13, .31) are flagged with "*" as out of order; after recoding, the average measures advance monotonically from -3.31 to 3.01.]

This process is repeated several times in order to maximize the cooperation among the items in generating one overall meaning and measure for each test score.

The recoding of observations into categories has empirical, practical, and theoretical aspects. Empirically, the frequency of each observational value and its average measure inform us as to the functioning of that level of the item for this sample. Practically, the recategorization must be simple to effect and easy to interpret. Theoretically, substantive theory of physical activity guides us as to what are likely to be reasonable increments in observations before differences become noticeable. An inch or a tenth of a second is not likely to constitute a noticeable difference in fitness task performance. The last two columns of Table 2 show the resulting numbers of categories and their definitions for the 7 items. The last column of Table 3 shows that the resulting average measures for the Broad Jump now exhibit unmistakable advances along the latent measurement variable.

It might be argued that reducing the original 379 observational levels shown in the third column of Table 2 to the 45 categories shown in the fifth column must throw away useful informative distinctions in the data and so reduce test reliability (Nunnally, 1967, p. 521). In fact, after this recoding, the raw score reliability has risen slightly from .89 to .90! The measure reliability is .91. Stratifying the data has removed observational noise and clarified the relationship between the item and the overall test.

Figure 4 enables us both to examine the construct validity of this test in greater detail and to perceive the value of a linear measurement system. From Rasch analysis of the ordered categorical data, a map of the items, categories, and student performances has been constructed in one frame of reference. Figure 4 depicts this with none of the vagueness typical of reports based on average ratings and p-values. On this figure's left side is the linear measurement scale. The printed section of this infinitely long equal-interval scale is set to have the range 0-100 for ease of comprehension.
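A 0-100 user scale of this kind is just a linear transformation of the underlying logit measures, so equal differences in logits stay equal on the reported scale. A minimal sketch, with invented anchor points (here -5 logits maps to 0 and +5 logits maps to 100; any convenient pair would do):

```python
# Linearly rescale a logit measure onto a convenient reporting range.
# The anchors lo/hi are invented for illustration, not the article's values.
def rescale(logit, lo=-5.0, hi=5.0):
    return 100 * (logit - lo) / (hi - lo)

for b in (-2.5, 0.0, 1.3):
    print(f"{b:+.1f} logits -> {rescale(b):.0f} units")
```

Because the transformation is linear, the rescaled numbers keep the arithmetic properties of the logit measures, unlike raw scores or percentiles.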

Figure 4. Planning and diagnostic KeyForm constructed from the AAHPER data.

[The KeyForm shows the 0-100 linear measurement scale on the left, the student distribution beside it, and, for each of the 7 items (Situps, Pullups, Shuttle Run, Broad Jump, Ball Throw, 600-yard Run, 50-yard Dash), the item's ordered categories positioned at their measures on the common scale.]

0 and 100 are not the end-points of the measurement scale but, as with 0 and 100 on the Celsius temperature scale, convenient reference markers.

The linear measurement framework facilitates the comparison of performance levels across items and within items. For instance, looking across items, it can be seen that a shuttle run of 10.5 seconds is equivalent to a dash of 7.5 seconds. On the other hand, looking within the 600-yard run item, it is seen that time is non-linear with performance. A central range is relatively insensitive to performance level. Fast and slow times are highly indicative of high and low performance levels overall.

The second column of Figure 4 gives the student distribution. In practice, we would have background information about the students, and could use this to confirm the validity and functioning of the test for our sample. Glancing over to the right from each student's "X", we can see what students at the performance level of that student can be expected to achieve (the item categories below such students) and what those students' next objectives are (the item categories next above such students). This gives a criterion reference for student measures that can be used to plan fitness programs.

Figure 4 can also provide immediate measurement of a student's performance level, as well as diagnostic insight into that student's strengths and weaknesses. This map provides the useful, even powerful, local information that raw scores and percentiles obscure. Merely circle what the student has achieved, and an estimate of the student's measure is a horizontal line through the middle of the circles. This is illustrated in Figure 5 with underlines instead of circles for Student 24 of the data set.

Figure 5. Diagnostic KeyForm for a student taking the AAHPER test.

[The same KeyForm as Figure 4, with the categories achieved by Student 24 marked; a horizontal line through the middle of the marked categories estimates that student's measure.]


Conclusion

The new approaches to determining validity and reliability are not simply methods for obtaining more defensible tests or mathematically better numbers. The new approaches enable both the test user and the test developer to obtain deeper insight into what is being measured and what those measures mean. The history of scientific development consistently manifests the paradigm that better measurement leads to better substantive theory, and better substantive theory leads to better measurement. In social science as a whole, measurement has not advanced noticeably in 50 years. Replicable advances in theory, of the order of those reported frequently in physics and biology, have also been lacking. The new approaches, with their departure from the vague, global, and distorted to the specific, local, and linear, are the road along which real progress in social science can be made.

References

AAHPER. (1976). Youth fitness test manual. Washington, DC: Author.
Arthur, G. (1947). A point scale of performance tests. New York: Psychological Corp.
Baumgartner, T. A., & Jackson, A. S. (1982). Measurement for evaluation in physical education (2nd ed.). Dubuque, IA: Wm. C. Brown.
Choppin, B. H. L. (1985). Lessons for psychometrics from thermometry. Evaluation in Education, 9(1), 9-12.
Fischer, G. H. (1995). Derivations of the Rasch model. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications. New York: Springer Verlag.
Knox, H. A. (1914). A scale based on the work at Ellis Island for estimating mental defect. Journal of the American Medical Association, 62, 741-747.
Ku, H. H. (1969). Precision measurement and calibration. Washington, DC: National Bureau of Standards.
Kubiszyn, T., & Borich, G. (2000). Educational testing and measurement (6th ed.). New York: John Wiley & Sons.
Looney, M. A. (1997). Objective measurement of figure skating performance. Journal of Outcome Measurement, 1(2), 143-163.
Micheels, W. J., & Karnes, M. R. (1950). Measuring educational achievement. New York: McGraw-Hill.
Nunnally, J. C. (1967). Psychometric theory. New York: McGraw-Hill.
Safrit, M. J., Cohen, A. S., & Costa, M. G. (1989). Item response theory and the measurement of motor behavior. Research Quarterly for Exercise and Sport, 60, 325-335.
Safrit, M. J., Zhu, W., Costa, M. G., & Zhang, L. (1992). The difficulty of sit-up tests: An empirical investigation. Research Quarterly for Exercise and Sport, 63, 277-283.
Spray, J. A. (1987). Recent developments in measurement and possible applications to the measurement of psychomotor behavior. Research Quarterly for Exercise and Sport, 58, 203-209.
Spray, J. A. (1990). One-parameter item response theory models for psychomotor tests involving repeated, independent attempts. Research Quarterly for Exercise and Sport, 61, 162-168.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
Zhu, W. (1995). Rasch modeling of motor performance. In '95 KNUPE International Symposium (Proceedings of the 1995 Korean National University of Physical Education, October 9-10, 1995, pp. 11-42). Seoul, Korea: The Research Institute of Physical Education & Sports Science, Korean National University of Physical Education.
Zhu, W. (1996). Should total scores from a rating scale be directly used? Research Quarterly for Exercise and Sport, 67, 363-372.
Zhu, W. (1997). Psychometric problems and possible solutions in assessing physical activity using questionnaires. In Better quality of sport and physical education for all: The '97 Seoul International Sport Science Congress proceedings (pp. 732-754). Seoul, Korea: Korean Alliance for Health, Physical Education, Recreation and Dance.
Zhu, W. (1998). Test equating: What, why, how? Research Quarterly for Exercise and Sport, 69, 11-23.
Zhu, W., & Cole, E. L. (1996). Many-faceted Rasch calibration of a gross-motor instrument. Research Quarterly for Exercise and Sport, 67, 24-34.
Zhu, W., Ennis, C. D., & Chen, A. (1998). Many-faceted Rasch modeling of experts' judgment in test development. Measurement in Physical Education and Exercise Science, 2, 21-39.
Zhu, W., & Kurz, K. A. (1994). Rasch Partial Credit analysis of gross motor competence. Perceptual and Motor Skills, 79, 947-961.
Zhu, W., & Kang, S.-J. (1998). Cross-cultural stability of the optimal categorization of a self-efficacy scale: A Rasch analysis. Measurement in Physical Education and Exercise Science, 2, 225-241.
Zhu, W., & Safrit, M. J. (1993). The calibration of a sit-ups task using the Rasch Poisson Counts model. The Canadian Journal of Applied Physiology, 18, 207-219.
