Advances in Health Sciences Education 1: 227-233, 1997. © 1997 Kluwer Academic Publishers. Printed in the Netherlands.


The Risks of Thoroughness: Reliability and Validity of Global Ratings and Checklists in an OSCE

J.P.W. CUNNINGTON, A.J. NEVILLE and G.R. NORMAN

McMaster University, HSC 3U2, 1200 Main Street West, Hamilton, Ontario, Canada L8N 3Z5. E-mail: [email protected]

Abstract. Objective: To compare checklists against global ratings of student performance on each station in an OSCE, without the confounder of the global rating scorer having first filled in the checklist. Method: Subjects were 96 medical students completing their pre-clinical studies, who took an eight-station clinical OSCE. Thirty-nine students were assessed with detailed performance checklists; 57 students went through the same stations but were assessed using only a single global rating per station. A subset of 39 students was assessed by two independent raters. Results: Inter-rater and inter-station reliability of the global rating was the same as for the checklist. Correlation with a concurrent multiple choice test was similar for both formats. Conclusion: The global rating was found to be as reliable as more traditional checklist scoring. A discussion of the validity of checklist and global scores suggests that global ratings may be superior.

Key words: OSCE, reliability, validity, medical student assessment, checklists, global scores

One appeal of the Objective Structured Clinical Examination (OSCE) is the apparent objectivity that results from the use of detailed yes/no checklists. Indeed, this may be a major contributor to the popularity of OSCEs in clinical performance assessment. Intuitively we feel that if the criteria for assessment are worked out beforehand, are based on detailed specifications of behaviours which can be identified as present or absent, and are applied by an impartial observer, then the test will be less susceptible to bias, more objective, and therefore fairer. But does this type of scoring truly result in fairer, more reliable, more valid evaluation than other, apparently more subjective, approaches to scoring such as global ratings? Or does the checklist simply give the appearance of objectivity by measuring thoroughness, an element easy to measure but a poor surrogate for the qualities, skills and behaviours which concern us in assessing physician and medical student performance? Harden and Gleeson (1979), in their original description of the OSCE, clearly thought it was more objective: 'the examination is not only more valid but more reliable. The use of a check-list by the examiner and the use of multiple choice questions result in a more objective examination.' But is it more objective?

John Cunnington MD and Alan Neville MD are associate professors of medicine; Geoffrey Norman PhD is professor of clinical epidemiology and biostatistics. All are at McMaster University.


Van der Vleuten et al. (1991) identify two types of objectivity in measurement. One incorporates the idea that the assessment must be free of the personal opinions and preferences of the examiner, i.e. value-free or bias-free measurement. This is synonymous with reliability. The second meaning of objectivity, which they term "objectification", is the use of strategies which appear likely to reduce subjectivity, such as the clarity of the instructions and the degree to which specific behaviours have been identified and described; elements which are associated with detailed checklists. In both cases the ultimate issue is one of reliability; however, it remains to be shown that the application of the methods of "objectification" actually leads to measurable improvements in "objectivity" (i.e. reliability). Indeed, van der Vleuten et al. drew from a number of sources in concluding that "objectified" measures did not lead to consistently higher reliability.

Since the publication of the van der Vleuten paper, there have been a number of additional investigations of ratings and checklists. Norcini et al. (1990) performed a study using 12 questions based on clinical cases requiring brief essay answers. Scoring involved either a checklist completed by non-medical personnel given 14 hours of training in its use, or global scoring done by untrained internists. Reliability for the 3-hour test was 0.36 for the checklist and 0.63 for the global score. In OSCEs, van Luijk and van der Vleuten (1992) compared the global ratings of physician raters with checklist scores obtained by the same rater prior to assigning the global rating. The inter-rater reliability of the checklist was higher (0.83) than that of the global rating (0.72), but the inter-case generalizability of the global rating was higher than that of the checklist, resulting in similar test reliability. Cohen, Rothman et al. (1991) compared a checklist with global scores obtained after completion of the checklist by the same rater. The generalizability was 0.84 for the checklist and 0.83 and 0.75 for two global scores, comparable values. MacRae et al. (1995) compared checklists completed by standardized patients with global ratings completed by physician observers. Inter-rater reliability of the global ratings ranged from 0.65 to 0.93. The reliability of the checklist scores across a two-station test was 0.34, while the reliability of the global ratings across stations was 0.59. Cohen, Colliver et al. (1996) compared a 17-point checklist with a global score obtained by the same observer after completing the checklist. The inter-case reliability of the checklist was 0.65 and of the global score 0.70. Thus, while it appears that the inter-rater reliability of global ratings may be lower than that of checklists, the inter-case reliability is usually higher, resulting in a similar overall test reliability; the generalizability expression below makes this trade-off explicit.
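The trade-off can be stated in standard generalizability-theory terms (this formulation is ours, not one given explicitly in the studies cited). For a subject measured on n_c cases, each scored by n_r raters nested within the case, the projected test reliability is

```latex
E\rho^{2} \;=\;
  \frac{\sigma^{2}_{s}}
       {\sigma^{2}_{s} \;+\; \dfrac{\sigma^{2}_{sc}}{n_{c}} \;+\; \dfrac{\sigma^{2}_{sr:c}}{n_{c}\,n_{r}}}
```

where sigma^2_s is the subject variance, sigma^2_sc the subject-by-case interaction, and sigma^2_sr:c the subject-by-rater-within-case (rater disagreement) component. Increasing the number of cases shrinks both error terms, so a format with poorer rater agreement (a larger rater component) but a smaller case interaction can arrive at the same, or higher, overall test reliability.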


None of these studies, however, provides a direct comparison of global ratings with checklist scores free of the confounder of the global rater having already filled in the checklist (except the MacRae study, where the result was confounded by differences in the types of rater). To determine whether global ratings alone yield reliability comparable to checklists alone, we undertook a direct comparison in which inter-rater and inter-case reliability were determined independently by examiners using either checklists or global ratings alone. In addition, the validity of the two scoring methods was estimated by comparison with a concurrent multiple choice test.

Methods

Ninety-six students in the second year of a three-year medical school participated in an OSCE to assess clinical skills prior to starting clerkship. The OSCE consisted of eight stations designed to assess a broad range of history taking, physical examination and communication skills. For logistical reasons the test was run concurrently in five separate circuits. Fifty-seven of the students, in three circuits, were assessed using score sheets which listed the major performance issues but allowed only a single overall rating of competence on a seven-point scale. Thirty-nine students did the same stations in separate circuits and were assessed using score sheets which contained the same performance items but in checklist form, with a three-point scale for each item (done well, done poorly, not done). Each of the 40 stations (eight stations in each of five circuits) had one faculty rater. In addition, six of the stations (three in a checklist circuit and three in a global rating circuit) were provided with a second rater, permitting an estimate of inter-rater reliability for the subset of 19 students in the checklist group and 20 students in the rating scale group.

Raters were volunteer clinical faculty and senior residents from a wide variety of clinical disciplines. None of the raters had special training in assessment, but prior to the OSCE the assessors underwent a half-hour orientation session. No formal training directed at inter-rater consistency was attempted.

Within two weeks of completing the OSCE the students completed a 180-item Progress Test based on all of medicine. As well as an overall score on the Progress Test, sub-scores related to Biology (150 questions), Behaviour (20 questions) and Population (10 questions) were calculated. Because of the small number of questions in the Population section, no further analysis of this sub-score was attempted. The measurement characteristics of the Progress Test are discussed elsewhere (Blake et al., 1996); the test-retest reliability across three months is 0.70, it demonstrates large increases in scores with time spent in the MD program, and it has been shown to have predictive validity against a national licensing examination, with a correlation of about 0.65. Because the content of the OSCE and the Progress Test is dissimilar, it would be unreasonable to expect high correlations between the two measures.

Reliability analysis was conducted by performing repeated measures analysis of variance using a standard statistical program (BMDP2V) and computing variance components and generalizability coefficients manually. Separate analyses were conducted for the checklist and global rating scale circuits. For the inter-rater subanalysis, this was a two-factor ANOVA with raters (2 levels) nested within cases (3 levels). The overall analysis was a one-factor ANOVA with cases (8 levels) as the single repeated measure. The correlation analysis was conducted with a standard statistical package (BMDP8V).
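The variance components reported below can be obtained from the ANOVA mean squares of the balanced subjects-by-(raters nested within cases) design just described. The sketch below is purely illustrative and is not the authors' code (the original analysis used BMDP output and hand calculation); the function name and array layout are our own assumptions.

```python
import numpy as np

def variance_components(x):
    """Estimate subject, subject-by-case and subject-by-rater-within-case
    variance components from a balanced score array x of shape
    (subjects, cases, raters), with raters nested within cases."""
    n_s, n_c, n_r = x.shape
    grand = x.mean()
    m_s = x.mean(axis=(1, 2))      # subject means
    m_c = x.mean(axis=(0, 2))      # case means
    m_sc = x.mean(axis=2)          # subject-by-case cell means
    m_cr = x.mean(axis=0)          # rater-within-case means

    # Mean squares for the effects that involve subjects
    ms_s = n_c * n_r * ((m_s - grand) ** 2).sum() / (n_s - 1)
    ms_sc = (n_r * ((m_sc - m_s[:, None] - m_c[None, :] + grand) ** 2).sum()
             / ((n_s - 1) * (n_c - 1)))
    ss_within_cell = ((x - m_sc[:, :, None]) ** 2).sum()
    ss_rater = n_s * ((m_cr - m_c[:, None]) ** 2).sum()
    ms_src = (ss_within_cell - ss_rater) / ((n_s - 1) * n_c * (n_r - 1))

    # Expected-mean-square solutions (negative estimates truncated at zero)
    var_src = ms_src
    var_sc = max((ms_sc - ms_src) / n_r, 0.0)
    var_s = max((ms_s - ms_sc) / (n_c * n_r), 0.0)
    return {"subjects": var_s, "sub x case": var_sc, "sub x rater:c": var_src}

# Example with simulated data: 19 students, 3 cases, 2 raters per case
rng = np.random.default_rng(0)
print(variance_components(rng.normal(size=(19, 3, 2))))
```

In the complete-sample analysis, which had a single rater per station, the rater term drops out and any rater disagreement is absorbed into the subject-by-case component.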


Table I. Means, standard deviations, variance components and G coefficients from the subsample and complete analyses

                            Checklist                 Rating scale
                            Subsample    Complete     Subsample    Complete
Descriptive statistics
  Mean                      71.5%        67.2%        4.72         5.00
  Std. dev.                 15.4%        17.0%        0.93         1.14
Case means
  Minimum                   59.5%        54.1%        4.25         4.42
  Maximum                   77.6%        86.6%        5.15         5.28
Variance components
  Subjects                  1.18         0.16         1.74         1.02
  Sub x Case                1.93         7.40         3.15         12.6
  Sub x Rater:C             3.00         N/A          4.21         N/A

Table II. Correlation between rating scale and checklist test scores and multiple choice (Progress Test) scores and subscores

Progress test score    Checklist    Rating scale
Total                  0.24         0.18
Biology                0.30*        0.05
Behaviour              0.00         0.29*
Sample size            39           57

Results

MEANS, STANDARD DEVIATIONS, AND VARIANCE COMPONENTS

As we indicated previously, the overall rating scale data were derived from 57 students and the inter-rater reliability from a subsample of 20 students; the overall checklist data were obtained from a sample of 39 students, and inter-rater reliability from a subsample of 19 students. Table I shows the grand means and average standard deviations for the four data sets. In all analyses there were significant differences between cases (that is, differences in case difficulty), and minimum and maximum case means are also shown. Table I also shows the variance components derived from the subsamples which had two raters (three stations) and from the overall test of eight stations.

RELIABILITY AND GENERALIZABILITY

Although the small samples result in uncertainty about the point estimates, there is no evidence that the use of rating scales resulted in a reduction of agreement among raters; the inter-rater G coefficients for global rating scales and checklists were virtually identical at 0.49 and 0.51. Similarly, in the sub-sample used for the inter-rater reliability study, inter-case correlations were nearly identical for the two formats at 0.20 and 0.19, resulting in very similar test reliability (projected from the three-case test to an eight-station test) of 0.66 and 0.65. The analysis based on the whole sample showed smaller subject variance in general, which reduced the inter-case correlations and test reliability. There was no indication that the use of the single global rating scale resulted in a loss of information, and there was some suggestion, based on the complete-sample analysis, that the test reliability of the rating scales was higher than that of the checklists (0.39 vs. 0.15), consistent with the findings of the van Luijk study (1992).
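The test reliabilities quoted above follow directly from the Table I variance components: the single-case relative G coefficient is projected to an eight-station test with the Spearman-Brown formula. The short sketch below simply reproduces that arithmetic; it is illustrative only, the function names are ours, and the input values are copied from Table I.

```python
def single_case_g(var_subjects, var_sub_case, var_sub_rater=0.0):
    """Relative G coefficient for one station scored by one rater."""
    return var_subjects / (var_subjects + var_sub_case + var_sub_rater)

def spearman_brown(g_one_case, n_cases=8):
    """Project single-case reliability to an n_cases-station test."""
    return n_cases * g_one_case / (1 + (n_cases - 1) * g_one_case)

# Subsample components (subjects, sub x case, sub x rater:c) from Table I:
# both formats give inter-case coefficients of roughly 0.19-0.20 and
# eight-station projections of about 0.66 (checklist) and 0.65 (rating scale).
for label, comps in [("checklist", (1.18, 1.93, 3.00)),
                     ("rating scale", (1.74, 3.15, 4.21))]:
    g = single_case_g(*comps)
    print(label, round(g, 2), round(spearman_brown(g), 2))

# Complete-sample components (one rater): projections of about 0.15 and 0.39.
for label, comps in [("checklist", (0.16, 7.40)),
                     ("rating scale", (1.02, 12.6))]:
    print(label, round(spearman_brown(single_case_g(*comps)), 2))
```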

VALIDITY

Correlations between the OSCE checklist and global rating scale scores and the concurrent multiple choice test were of similar magnitude (r = 0.18 for the rating scale; r = 0.24 for the checklist). Neither achieved statistical significance. However, examining the relation to the subtests, the rating scale correlated significantly with the behavioural science subscore (r = 0.30), while the checklist correlated with the biology subscore (r = 0.29).

Discussion

The results confirm the findings of other investigations that, in the scoring of OSCEs, the use of global rating scales does not result in any significant loss of measurement information when compared with detailed checklists. Compared to previous studies, this study has the advantage that it was designed to provide a head-to-head comparison of a single global rating score against a detailed checklist score without the confounder of the rater having first completed the checklist. The study has the shortcoming of small sample sizes, particularly for the computation of inter-rater coefficients, leading to some uncertainty in the interpretation of similarities and differences between formats. The lack of rater training may be viewed as a handicap, and the relatively low inter-rater coefficients may reflect this, although one might expect rater training to have a greater impact on improving global rating scores than checklist scores.

If global ratings perform as well as detailed checklists, why do checklists predominate? One possibility is thoroughness. Faculty frequently complain that students are superficial in their assessments; they believe that if students are going to be successful they need to become more thorough. And at least thoroughness is something which can be reliably measured. But is thoroughness really a characteristic of the competent physician? A growing body of evidence suggests not. Studies of the clinical methods of physicians and medical students have shown no correlation between expertise and thoroughness of data gathering, and no relation between thoroughness and diagnostic accuracy (Barrows et al., 1982; Neufeld et al., 1981).


Similarly, studies of PMPs showed that scores of proficiency, efficiency, and competence are all highly correlated and strongly related to thoroughness, but that final-year residents do no better than first-year residents (Miller, 1968; Marshall, 1977). Thus, while thoroughness may be next to godliness in the minds of some of our faculty, it is not a measure of the validity of our evaluation efforts. Clearly we should not measure our medical students by a standard which is irrelevant to the practising physician. While we may not be sure of all the skills which characterize the expert physician, we know that comprehensiveness and thoroughness are not central amongst them.

These observations about the reliability and validity of global scoring in OSCEs marked by expert observers have a number of implications. Global scoring makes it easier to design OSCE questions because detailed scoring keys are not required. There is also a reduction in the complexity of marking the test, as the mean score for each station does not need to be calculated by summing columns of figures.

It is well established that examinations exert a profound steering effect upon the curriculum, but the extent to which this can be the case in OSCE-based assessments is not generally appreciated. Van der Vleuten et al. (1989) compared performance on an OSCE with MCQ-assessed knowledge about the clinical skills required to succeed on the OSCE. They found a near-perfect correlation between the written and the performance tests. This finding was largely explained when van Luijk et al. (1990), working at the same medical school, surveyed student opinions and behaviour about the evaluation process. They found that a major student approach to preparation for the OSCE was not practising the requisite skills but memorizing the checklists, an activity that contributes little to the educational value of the test or to its validity as a measure of student competence in clinical skill performance. With global scoring, students will not be tempted to prepare for the exam by memorizing the checklists; the emphasis remains on skill demonstration. Nor, when taking a history in an OSCE, will students be tempted to ask as many questions as they can think of in the hope that their questions appear somewhere on the checklist. The shotgun approach to data acquisition was precisely one of the weaknesses that contributed to the demise of the PMP, and it reflects neither clinical competence nor what physicians actually do when they gather data from a patient.

Global scoring provides the opportunity to capture behaviour that does not appear on the checklist, for example, whether there is a coherent order in the gathering of information. It can also allow for weighting of the observed strengths and weaknesses. As a result it can capitalize on the experience of expert observers in arriving at a rating of overall performance on the elements central to clinical competence.

One caveat must be mentioned. The advantages of global scoring as described in this work apply only to ratings made at the time of the observation of the performance.


They cannot be extended to ratings of global performance over a protracted period of time, such as is commonly done in in-training evaluations of blocks of clinical exposure. Here the unreliability of memory (and often the paucity of observation) makes these ratings little more than a popularity contest.

In summary, there is compelling evidence of the reliability of both checklists and global scores in OSCE assessment; however, doubt exists as to the validity of checklist scoring for these exercises. A case is made that, for OSCEs scored by expert observers, global scoring is more likely than a predefined list of performance expectations to capture the elements that constitute clinical competence.

References

Barrows, H.S., Norman, G.R., Neufeld, V.R. & Feightner, J.W. (1982). The Clinical Reasoning Process of Randomly Selected Physicians in General Medical Practice. Clinical and Investigative Medicine 5: 49-56.

Blake, J.M., Norman, G.R., Keane, D.R., Mueller, C.B., Cunnington, J.P.W. & Didyk, N. (1996). Introducing Progress Testing in McMaster University's Problem-Based Medical Curriculum: Psychometric Properties and Effect on Learning. Academic Medicine 71: 1002-1007.

Cohen, D.S., Colliver, J.A., Marcy, M.S., Fried, E.D. & Swartz, M.H. (1996). Psychometric Properties of a Standardized-Patient Checklist and Rating-Scale Form Used to Assess Interpersonal and Communication Skills. Academic Medicine 71(suppl.): S87-S89.

Cohen, R., Rothman, A.I., Poldre, P. & Ross, J. (1991). Validity and Generalizability of Global Ratings in an Objective Structured Clinical Examination. Academic Medicine 66: 545-548.

Harden, R.M. & Gleeson, F.A. (1979). Assessment of Clinical Competence Using an Objective Structured Clinical Examination (OSCE). Medical Education 13: 41-54.

MacRae, H.M., Vu, N.V., Graham, B. et al. (1995). Comparing Checklists and Databases with Physicians' Ratings as Measures of Students' History and Physical Examination Skills. Academic Medicine 70: 313-317.

Marshall, J. (1977). Assessment of Problem Solving Ability. Medical Education 11: 329-334.

Miller, G.E. (1968). The Orthopedic Training Study. Journal of the American Medical Association 206: 601-606.

Neufeld, V.R., Norman, G.R., Barrows, H.S. & Feightner, J.W. (1981). Clinical Problem-Solving by Medical Students: A Longitudinal and Cross-Sectional Analysis. Medical Education 15: 315-322.

Norcini, J.J., Diserens, D., Day, S.C., Cebul, R.C., Schwartz, J.S., Beck, L.H., Webster, G.D., Schnabel, T.G. & Elstein, A.S. (1990). The Scoring and Reproducibility of an Essay Test of Clinical Judgement. Academic Medicine 65(suppl.): S41-S42.

van der Vleuten, C.P.M., Norman, G.R. & de Graaff, E.D. (1991). Pitfalls in the Pursuit of Objectivity: Issues of Reliability. Medical Education 25: 119-126.

van Luijk, S.J. & van der Vleuten, C.P.M. (1992). A Comparison of Checklists and Rating Scales in Performance-Based Testing. In Hart, I.R., Harden, R.M. & Des Marchais, J. (eds.) Current Developments in Assessing Clinical Competence, 357-382. Can-Heal Publications: Montreal.

van der Vleuten, C.P.M., van Luijk, S.J. & Beckers, H.J.M. (1989). A Written Test as an Alternative to Performance Testing. Medical Education 23: 97-107.

van Luijk, S.J., van der Vleuten, C.P.M. & van Schelven, S.M. (1990). Observer and Student Opinions About Performance-Based Tests. In Bender, W., Hiemstra, R.J., Scherpbier, A.J.J.A. & Zwierstra, R.P. (eds.) Teaching and Assessing Clinical Competence, 497-502. Boekwerk Publ.: Groningen.
