Advances in Health Sciences Education 1: 215-219, 1997. © 1997 Kluwer Academic Publishers. Printed in the Netherlands.


The Use of Global Ratings in OSCE Station Scores*

ARTHUR I. ROTHMAN¹, DAVID BLACKMORE², W. DALE DAUPHINEE² and RICHARD REZNICK³

¹Department of Medicine, University of Toronto, 585 University Avenue, BW4-658, Toronto, Ontario, Canada M5G 2C4; fax: +1 (416) 978 4568; e-mail: [email protected]; ²The Medical Council of Canada, 2283 St. Laurent Blvd., Ottawa, Ontario, Canada K1G 3H7; fax: +1 (613) 521 9417; ³Department of Surgery, The Toronto Hospital - General Division, 200 Elizabeth Street, E9-242, Toronto, Ontario, Canada M5G 2C4

* This research was done with the support of the Medical Council of Canada.

Abstract. The Medical Council of Canada makes use of examiners' pass/borderline/fail judgments of candidates' performances in OSCE stations in defining cutting scores for these stations. This process assumes that there is consistency in the judgments of the different examiners used in the same stations at different testing sites. This assumption was tested using the results of the fall 1994 administration of Part 2 of the Medical Council of Canada Qualifying Examination. The Council anticipated using the examiner-based global ratings as part of the OSCE station scores in the fall 1995 administration of the examination. In this study, the fall 1994 results were used to estimate to what extent test reliability would increase with the addition of the global ratings.

Key words: OSCE, global ratings, examiners

Introduction

Part 2 of the Medical Council of Canada's Qualifying Examination is an objective structured clinical examination (OSCE) utilizing standardized patients (SPs) and qualified physician examiners (Reznick et al., 1993). Since the fall 1993 administration, the judgments of the examiners, as well as Angoff estimates provided by the test committee, have been used in setting the test's cutting score. The use of examiner judgments, expressed as global ratings of candidates' performances in each station, was based on work done by Norcini et al. (1993) and Rothman et al. (1993, 1996). In the fall 1993 administration, for each station, score distributions associated with 'pass' and 'fail' global judgments were defined, and the station cutting score was set at the point of intersection of the two normalized distributions. The stability of these station cutting scores across different examiners was demonstrated. In the fall 1994 administration, examiners rated candidates' performances as 'pass', 'borderline' or 'fail', and for each station the cutting score was set as the mean of the score distribution associated with the 'borderline' ratings. In contrast to the approach used in 1993, no assumption of normality in the score distributions associated with the different levels of global ratings was required. Again, confident use of this process assumed that there was consistency in the judgments of examiners employed in the same stations at different sites. In the current study, this assumption was tested using the results of the Day 1 form of the examination and replicated using the results of the equivalent Day 2 form.

For the fall 1995 administration, the Council also anticipated using the examiner-based global ratings as part of the OSCE station scores, thus potentially increasing the effective test length and hence the test reliability. In this study, the results of the fall 1994 administration were used to estimate the extent to which test reliability could be increased, and to determine the relative weighting of the global ratings, in the calculation of station scores, that would provide maximum test reliability.
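To make the two cutting-score rules concrete, here is a minimal sketch in Python (our illustration, not the Council's code; the function names and data layout are hypothetical). The 1994 rule takes the mean checklist score of the candidates an examiner rated 'borderline'; the 1993 rule fits normal curves to the 'pass' and 'fail' score distributions and places the cut at their intersection, found by equating the two densities and solving the resulting quadratic.

    import numpy as np

    def cut_1994(scores, ratings):
        # Fall 1994 rule: cut = mean score of the 'borderline' candidates.
        # No distributional assumption is required.
        return scores[ratings == "borderline"].mean()

    def cut_1993(scores, ratings):
        # Fall 1993 rule: cut = intersection of normal curves fitted to the
        # 'fail' and 'pass' score distributions.
        f, p = scores[ratings == "fail"], scores[ratings == "pass"]
        m1, s1 = f.mean(), f.std(ddof=1)
        m2, s2 = p.mean(), p.std(ddof=1)
        # Equating the two normal densities gives a quadratic in x
        # (degenerate cases, e.g. equal variances, are not handled here).
        a = 1 / (2 * s2**2) - 1 / (2 * s1**2)
        b = m1 / s1**2 - m2 / s2**2
        c = m2**2 / (2 * s2**2) - m1**2 / (2 * s1**2) - np.log(s1 / s2)
        roots = np.roots([a, b, c])
        roots = roots[np.isreal(roots)].real
        # Keep the intersection lying between the two means.
        return roots[(roots > min(m1, m2)) & (roots < max(m1, m2))][0]

    # Illustrative data only:
    rng = np.random.default_rng(0)
    scores = np.concatenate([rng.normal(36, 8, 40),    # 'fail' candidates
                             rng.normal(58, 7, 30),    # 'borderline'
                             rng.normal(76, 6, 80)])   # 'pass'
    ratings = np.repeat(["fail", "borderline", "pass"], [40, 30, 80])
    print(cut_1994(scores, ratings))   # near 58
    print(cut_1993(scores, ratings))   # between the 'fail' and 'pass' means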


Methods

This study used the results of the fall 1994 Day 1 and Day 2 administrations of Part 2 of the Medical Council of Canada Qualifying Examination; 970 candidates were examined on Day 1 and 702 on Day 2. The two forms of the examination were assumed to be equivalent. Each form consisted of 20 ten-minute stations. Ten were couplet stations: five-minute focused encounters with SPs requiring demonstration of history-taking and physical-examination skills, each followed by a five-minute written exercise testing data-interpretation and medical-diagnosis skills. The other ten were ten-minute encounters with SPs: six tested interviewing skills or history and physical examination skills, and four assessed communication skills and employed separate content and process checklists. Physician examiners observed all patient encounters, completed the checklists and global ratings, and marked the responses to the linked written exercises.

In the fall 1994 administration, the Day 1 examination was administered in 25 half-day sessions at 9 sites and the Day 2 examination in 18 half-day sessions at 8 sites. At each session, one examiner served in each patient-encounter station and one marker was assigned to each post-encounter written exercise. The two days' test results therefore provided 25 (Day 1) or 18 (Day 2) sets of pass/borderline/fail score distributions for each station. For each set, the mean values of the three score distributions were calculated, as were the standard deviations of these means across the 25 or 18 sessions. These latter values described the extent of variation between the ratings of the different examiners assigned to each station.

On all checklists and written-exercise answer sheets, after completing the checklist items or grading the written answers, examiners rated candidates' global performances on a six-point scale (5 to 0), reflecting high and low divisions within the pass, borderline and fail categories. To test the effect of blending the global ratings into the station scores, station scores were calculated with progressively larger weighting of the global ratings, from 0 to 0.75 in 0.05 steps. Test reliability values were calculated at each step, and these values were plotted as a function of the global weightings.
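The paper does not include its scoring code; the following is a minimal sketch of the weighting sweep just described, assuming a candidates-by-station-parts matrix of checklist scores and a parallel matrix of global ratings on a common percent scale. The names (checklist, ratings, reliability_by_weight) are ours, and Cronbach's alpha is our assumption for the reliability coefficient, since the paper does not state which one was used.

    import numpy as np

    def cronbach_alpha(parts):
        # Internal-consistency reliability of a candidates x station-parts matrix.
        k = parts.shape[1]
        item_var = parts.var(axis=0, ddof=1).sum()   # sum of per-part variances
        total_var = parts.sum(axis=1).var(ddof=1)    # variance of total scores
        return k / (k - 1) * (1 - item_var / total_var)

    def reliability_by_weight(checklist, ratings,
                              weights=np.arange(0, 0.80, 0.05)):
        # Reliability of composite station scores w*rating + (1-w)*checklist
        # for each global-rating weight w in the 0-to-0.75 sweep above.
        return {round(float(w), 2): cronbach_alpha((1 - w) * checklist + w * ratings)
                for w in weights}

Plotting the returned values against w yields curves of the kind shown in Figure 1.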


Table I. Means (%) and standard deviations of score distributions associated with fail/borderline/pass classifications of candidates, by station type and total test - Day 1 and Day 2 forms

                  Fail           Borderline      Pass
          Stns    Mean    sd     Mean    sd      Mean    sd
  Day 1
  10-h/p     6    46.2   10.7    64.9    7.6     79.5    9.0
  C-cnt      4    35.4    2.0    56.9    3.1     73.1    4.0
  C-pro      4    28.6    4.3    57.1    3.0     80.5    2.9
  5-h/p     10    39.0    6.6    56.5    7.5     71.0    8.2
  pep       10    28.7   10.1    56.3    7.8     77.4    3.7
  Total     34    35.6    7.5    58.0    6.6     75.8    5.9

  Day 2
  10-h/p     6    41.6    9.3    61.7    8.3     76.7   10.0
  C-cnt      4    36.9    7.6    56.4   10.4     73.4    9.8
  C-pro      4    30.2    2.7    56.0    4.3     79.5    3.7
  5-h/p     10    34.1    9.1    54.7    8.2     69.5    9.0
  pep       10    34.3    5.4    59.2    8.7     77.1    5.4
  Total     34    35.4    7.1    57.6    8.2     74.7    7.6

Stns: number of stations; Mean: mean station scores (%); sd: between-examiner standard deviations (%); 10-h/p: ten-minute history/physical stations; C-cnt: communication skills content checklist; C-pro: communication skills process rating form; 5-h/p: five-minute history/physical stations; pep: five-minute post-encounter written exercises.

Results

Table I presents, for the two days' tests and for the pass, borderline and fail categories, the mean station (or station-part) scores and the between-examiner standard deviations for the stations grouped by type: ten-minute history and physical (10-h/p), communications content checklist (C-cnt), communications process (C-pro), five-minute history or physical (5-h/p), and the linked post-encounter written exercises (pep), and for the total test. The table also lists the correlations between the global ratings and station scores (r). For all groupings, in the results for both days, the pass/borderline/fail distinctions in the score distributions were clearly defined. And although there was considerable variation, between groupings and between the two days' results, in the magnitudes of the between-examiner standard deviations, in all cases the differences between the category mean scores exceeded the respective standard deviations. The Day 1 and Day 2 test reliability values using station scores alone were 0.72 and 0.76 respectively; total-test global score reliabilities were 0.76 and 0.78 respectively.
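As a concrete reading of this result (a check we added, using the Day 1 'Total' row of Table I):

    # Day 1 'Total' row of Table I: (mean %, between-examiner sd %).
    fail_m, fail_sd = 35.6, 7.5
    bord_m, bord_sd = 58.0, 6.6
    pass_m, pass_sd = 75.8, 5.9
    print(bord_m - fail_m)   # 22.4 -- about three times the sd's of 7.5 and 6.6
    print(pass_m - bord_m)   # 17.8 -- about three times the sd's of 6.6 and 5.9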

[Figure 1 appears here: test reliability (y-axis, 0.60 to 0.85) plotted against the weight given the global rating in the station score, with separate curves for Day 1 and Day 2.]

Figure 1. Day 1 and Day 2 test reliabilities by weight of the global rating in station scores.

Figure 1 shows the plots of the test reliability values as a function of the weights given the global ratings in the calculation of station scores. Both curves reached their maximum values (Day 1, 0.768; Day 2, 0.795) at a global rating weight of 0.6. The maximum Day 1 gain in test reliability was 0.05 and the Day 2 gain was 0.04.

Discussion and Conclusions

The first part of this study tested the assumption that there would be consistency across different examiners' definitions of pass/borderline/fail performance in individual stations. The results from the fall 1994 administration of Part 2 of the Medical Council of Canada Qualifying Examination provided the opportunity to compare the pass/borderline/fail definitions of performance, for each station or station part, provided by 25 examiners on the Day 1 form of the examination and by 18 examiners on the Day 2 form. The means of the scores associated with the pass, borderline and fail judgments represented the best estimates of those definitions, and the standard deviations of the distributions of mean scores, for each station and each category, described the extent of variation in these definitions among the examiners. Overall, the mean sd's of the different groups of stations were small relative to the differences between the respective pass/borderline/fail mean score values. These results provided evidence of consistency in the judgments of different examiners employed in the same stations at different sites.

The results of the second part of the study demonstrated that a modest increase in the reliability of the test scores could be achieved by including the global ratings in the calculation of station scores. Because the reliabilities of the aggregated total-test global ratings were somewhat higher than those of the aggregated station scores, the maximum gains in reliability (about 7% relative; e.g., 0.72 to 0.768 on Day 1) were achieved with the global ratings weighted >0.5 in the station score calculations; gains of about 5% were achieved with the global ratings weighted 0.25.

The opportunity for replication was provided by the availability of results from the administration of two forms of the examination on two successive days, and consistency was observed in the two days' results, both in the values of the mean scores of the different station groupings in the three rating categories and in the magnitudes of the mean station score vs. global rating correlations. However, candidates were not assigned at random to the two days' administrations, and the makeup of the two groups, particularly in the proportions of international medical graduates involved, differed considerably. The group differences were reflected in the relatively consistent difference in the heights of the two reliability curves in Figure 1, and probably were also the cause of the inconsistencies in the two days' between-examiner standard deviation values. Consequently, definitive results relating to the impact of station formats on the consistency of examiners' judgments in the same stations were not realized.

References

Norcini, J.J., Stillman, P.L., Sutnick, A.I., Regan, M.B., Haley, H.L., Williams, R.G. & Freedman, M. (1993). Scoring and Standard Setting with Standardized Patients. Evaluation & the Health Professions 16: 322-332.

Reznick, R.K., Blackmore, D., Cohen, R., Baumber, J., Rothman, A., Smee, S., Chalmers, A., Poldre, P., Birtwhistle, R., Walsh, P., Spady, D. & Berard, M. (1993). Large-Scale High-Stakes Testing with an OSCE: Report from the Medical Council of Canada. Academic Medicine 68 (October Supplement): S4-S6.

Rothman, A.I., Poldre, P., Cohen, R. & Ross, J. (1993, April). Standard Setting in a Multiple Station Test of Clinical Skills. Paper presented at the Annual Meeting of the American Educational Research Association (AERA), Atlanta, GA.

Rothman, A.I., Blackmore, D., Cohen, R. & Reznick, R. (1996). The Consistency and Uncertainty in Examiners' Definitions of Pass/Fail Performance on OSCE Stations. Evaluation & the Health Professions 19: 118-124.
