
Examiners are most lenient at the start of a two-day OSCE

DAVID HOPE & HELEN CAMERON
The University of Edinburgh, UK


Abstract

Background: OSCEs can be both reliable and valid but are subject to sources of error. Examiners become more hawkish as their experience grows, and recent research suggests that in clinical contexts, examiners are influenced by the ability of recently observed candidates. In OSCEs, where examiners test many candidates over a short space of time, this may introduce bias that does not reflect a candidate's true ability.
Aims: To test whether examiners marked more or less stringently as time elapsed in a summative OSCE, and to evaluate the practical impact of this bias.
Methods: We measured changes in examiner stringency in a 13-station OSCE sat by 278 third-year MBChB students over the course of two days.
Results: Examiners were most lenient at the start of the OSCE in the clinical section (β = 0.14, p = 0.018) but not in the online section, where student answers were machine marked (β = 0.003, p = 0.965).
Conclusions: The change in marks was likely caused by increased examiner stringency over time, derived from a combination of growing experience and exposure to an increasing number of successful candidates. The need for better training and for reviewing standards during the OSCE is discussed.

Introduction

The Objective Structured Clinical Examination (OSCE) is widely used in medical education (Harden & Gleeson 1979) and can generate reliable and valid marks (Chesser et al. 2009; Schoonheim-Klein et al. 2009; Walsh et al. 2009; Pell et al. 2010). Commonly used as a high-stakes assessment, OSCE performance contributes to progression decisions and therefore requires evaluation (Pell et al. 2010). While OSCEs can show high levels of reliability, they do not always do so. One review found that the internal consistency of OSCEs varied considerably, from 0.41 to 0.88 (Turner & Donkoski 2008). Low internal consistency indicates an inability to identify candidates' true scores; values above 0.7 are considered acceptable (Pell et al. 2010).

Many factors lower the reliability of OSCEs, thereby weakening the relationship between candidates' observed scores and their true scores. Perhaps the best known is the hawk–dove effect, where some examiners mark harshly and others generously (Harasym et al. 2008). In some cases, over 10% of the variance in marks can be explained by hawk–dove effects (McManus et al. 2006). Assessor background matters as well: experienced examiners often mark differently from those with less experience (Chesser et al. 2009), and examiners familiar with the students are more generous than those who are not (Stroud et al. 2011).

Practice points

• Examiners are most lenient at the start of the OSCE.
• While the effect is small, this may affect students on the pass–fail boundary and students in contention for awards.
• Even when using checklists, examiners are influenced by the performance of candidates they have recently observed.
• Asking examiners to regularly review standards during the examination may reduce the impact of the bias.

This list is not exhaustive and many sources of error remain to be explored. One significant source is the tendency for examiners to become more hawkish with increased exposure to candidates. In a study of MRCP (UK) examinations, McManus et al. (2006) found a small but significant association between the lifetime number of candidates assessed and the stringency of the examiner (r = 0.08). Such effects imply greater leniency at the start of an assessor's career and may be due to greater confidence with the mark scheme or a better understanding of standards. Assessors may use themselves as a reference point (Kogan et al. 2011), which can lead to them changing their ratings as they become more experienced.





Even more seriously, research on clinical evaluations has demonstrated that exposure to good candidates causes examiners to rate borderline or failing students more harshly (Yeates et al. 2012). Subsequent work has shown that viewing the performance of candidates of any level can bias subsequent judgments (Yeates et al. 2013).

This topic has not been well studied in OSCEs. To our knowledge, only one study, McLaughlin et al. (2009), has evaluated time effects in OSCEs. They found that examiners became more lenient over time and that the effect was most pronounced for difficult stations; they attributed this to examiner fatigue causing examiners to overlook key failings. This finding contradicts that of McManus et al. (2006). However, McLaughlin's study was based on a small (n = 14) sample from a formative OSCE and did not test any summative component. The present paper describes a study testing time effects in a large summative OSCE in which examiners were exposed to dozens of students in a single day.

If examiners are influenced by previous exposure to students, and behave differently with experience, this has significant implications for the fairness, reliability, and validity of pass–fail decisions in OSCEs. Examinees tested at different points in an OSCE may be marked differently. Essentially, undue lenience at the start of an OSCE caused by a lack of experience, undue lenience at the end of an OSCE due to fatigue, or rating candidates in comparison to those already examined would all represent variance that is irrelevant to the candidate's true score. Examinees who sat the assessment later would be unfairly penalised in a way irrelevant to their genuine performance – or, conversely, candidates who sat early on would be unfairly advantaged.

As OSCEs are costly and complex, it is difficult to control for potentially confounding biases which may cause examinee marks to decline over the course of an OSCE. The order of candidates may reflect academic performance, even inadvertently. Alternatively, leakage – where students reveal examination content to later participants – may play a role, although some prior research on OSCEs has suggested this is not a significant concern (Niehaus et al. 1996).

We test time effects in a large, summative third-year OSCE using a design that limits the possibility of confounding effects. We compare marks on an interactive component marked by examiners against marks on an online OSCE component. The online component is machine marked and reflects standards agreed in advance. As all candidates sat the online component at the same time, it was not possible for a candidate's time position to genuinely affect performance on that component. It nevertheless forms a useful check on the hypothesis: if time position did predict performance in the online OSCE component, it would demonstrate that some other, confounding variable (such as prior ability) causes the effect, not time position itself. While non-experimental data must be treated cautiously, we predict that examinee marks will decline with time elapsed in the clinical stations, but not in the online stations.

Methods

Participants

All participants were from the third year of a five-year MBChB programme. A total of 278 students took the OSCE, which they were required to pass in order to progress. After conversion to a percentage mark, the candidate mean mark was 75.16% (SD = 5.9%) with a pass mark of 60%. Approval for the general use of educational results in research was granted by the college ethics committee.

Structure of the OSCE

The OSCE was part of an end-of-year synoptic assessment in the 2011–2012 diet and took place at the end of the third year of the MBChB. All students had spent three years studying medicine: two years studying principles for practice – an overview of the fundamentals of medicine, health, ethics and society, problem-based learning and a student-selected project – followed by one year studying the cardiovascular, gastrointestinal & liver, locomotor, and respiratory systems along with psychiatry, communication and practical skills, and some student-selected work. Year three involved a mix of lectures, tutorials, case-based learning, and bedside teaching. Additionally, some students had spent one year between the second and third years undertaking an intercalated degree in a subject relevant to their interests before returning to the third year.

The OSCE had 13 stations in total. Eight stations tested interactive clinical competences in the following specialties: cardiovascular, gastrointestinal, locomotor, respiratory, first aid and resuscitation, and psychological aspects of medicine. The five online stations tested clinical decision making and related knowledge within the cardiovascular, gastrointestinal, locomotor, and respiratory specialties using MCQs with data, images, and photographs. The online stations had previously been administered as non-interactive OSCE stations but were transferred to an electronic format to reduce the length of the interactive OSCE, and thus the required resources and the risk of leakage. Adding MCQ-type stations has been shown to improve the reliability of OSCE assessment (Verhoeven et al. 2000).

The interactive OSCE ran over two days, while the online section was sat by all students simultaneously two days after the interactive section. Four parallel circuits operated at the same time. Some examiners were course tutors, while others were external and not involved in teaching. All examiners were required to participate in a training event.

A t-test indicated candidates performed better in the interactive element than in the online element of the examination (t = 16.35, df = 277, p = 0.001) and the effect was large (Cohen's d = 1.96). Six candidates failed.

Candidates were assigned to 36 groups. This was done alphabetically, so that when candidates were sorted in alphabetical order the first name entered group one, the second group two, and so on. Due to uneven numbers, 10 groups had only seven participants, but this did not affect the organization of the exam. Four groups took the interactive OSCE during the same time slot, which lasted one hour and 15 minutes.
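The comparison reported above can be reproduced with a paired t-test and a paired-samples Cohen's d. The sketch below, in R (the language used for the analyses), is not the authors' code: the data frame marks, its column names, and the simulated values are assumptions for illustration only.

# Minimal sketch, not the authors' code: paired comparison of the interactive
# and online marks. All values below are simulated, not the study data.
set.seed(1)
marks <- data.frame(
  interactive = rnorm(278, mean = 78, sd = 6),
  online      = rnorm(278, mean = 70, sd = 7)
)

# Each candidate sat both components, so the comparison is paired.
t.test(marks$interactive, marks$online, paired = TRUE)

# Cohen's d for paired data: mean difference divided by the SD of the differences.
with(marks, mean(interactive - online) / sd(interactive - online))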


This meant that there were nine time positions, each with 31 or 32 candidates. Candidates from the morning of each day were corralled and unable to communicate with students sitting later. Parallel-form tests were used on the second day. Examiner placement was determined by convenience, and no restrictions were placed on the length of time an examiner could examine. Examiner gender was recorded in all cases.
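As a rough illustration of the scheduling described above, the sketch below assigns an alphabetically sorted candidate list round-robin to 36 groups and maps four groups to each of nine time slots. The candidate names and the handling of the ten seven-member groups are assumptions, not the exam's actual allocation.

# Sketch of the grouping logic described above; illustrative only.
surnames <- paste0("Candidate_", sprintf("%03d", 1:278))

ordered       <- sort(surnames)                        # alphabetical order
group         <- ((seq_along(ordered) - 1) %% 36) + 1  # round-robin into 36 groups
time_position <- ((group - 1) %/% 4) + 1               # four groups per slot, nine slots

table(time_position)  # distribution of candidates across the nine time slots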


Statistical analysis

All statistics were run using R (Ihaka & Gentleman 1996). A range of diagnostics should routinely be applied to OSCEs but, for brevity, they are not discussed here (Pell et al. 2010). Cronbach's alpha was acceptable (0.75) and towards the high end of the usual range for OSCEs (Turner & Donkoski 2008). An inspection of the correlation matrix indicated that most station marks correlated positively and significantly with each other (r range = 0.02 to 0.35). Male and female examiners exhibited no significant differences in marking behaviour. Assessor variability was similar across the eight interactive stations.
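The internal-consistency and inter-station correlation checks mentioned above can be computed as in the sketch below. It assumes a data frame of per-station marks (one row per candidate, one column per station) and the availability of the psych package; the placeholder data are simulated and are not the study's actual pipeline.

# Minimal sketch of the reliability diagnostics, using placeholder data.
# install.packages("psych")  # assumed available; not confirmed by the paper
library(psych)

set.seed(1)
stations <- as.data.frame(matrix(rnorm(278 * 13, mean = 75, sd = 8), nrow = 278))
names(stations) <- paste0("station_", 1:13)

psych::alpha(stations)    # Cronbach's alpha across the 13 station marks
round(cor(stations), 2)   # inter-station correlation matrix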

Results

We correlated interactive station and online marks to see if they were comparable. Marks for the two components correlated significantly (r(276) = 0.52, p = 0.001) and the effect size was large.

To test the hypothesis that time elapsed would affect interactive station marks but not online marks, we conducted two linear regression models with time elapsed as a predictor (scored from 1 to 9, with higher numbers indicating later times) and examination mark as the predicted variable. The first model used the mark for the interactive section of the examination, while the second used the mark for the online section. For the first model only, we tested whether examiner gender had an impact on the model; the result was non-significant and is not reported further. Controlling for academic ability – by including the mark from the online stations as a predictor – did not affect the model. To test the effect of station difficulty, we reran the model for the four easiest stations and then for the four most difficult stations; no differences based on station difficulty were observed. We therefore reverted to the simplest model, with time elapsed as the sole predictor and mark as the predicted variable for each model.

Time elapsed significantly predicted performance in the interactive section of the examination (β = 0.14, p = 0.02). The model fit the data well (F(1,276) = 5.68, p = 0.02, adjusted R² = 0.17), although the effect size was small. See Figure 1 for a scatter plot of the data. By contrast, time elapsed did not significantly predict performance in the online component (β = 0.003, p = 0.97), and overall model fit statistics were poor (F(1,276) = 0.02, p = 0.97, adjusted R² = 0.003).

The values presented here are not adjusted for multiple comparisons; adjusting for the use of two models would not affect the significance of the first model.
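A minimal sketch of the two regressions described above is given below. It assumes a data frame exam with columns time_position (1–9), interactive_mark and online_mark; the column names and the simulated values are assumptions, not the study data, and the small built-in decline exists only to make the illustration visible.

# Sketch only: the simulated data merely illustrate the model structure.
set.seed(2)
exam <- data.frame(
  time_position = rep(1:9, length.out = 278),   # 1 = earliest slot, 9 = latest
  online_mark   = rnorm(278, mean = 70, sd = 7)
)
exam$interactive_mark <- 80 - 0.4 * exam$time_position + rnorm(278, sd = 6)

m_interactive <- lm(interactive_mark ~ time_position, data = exam)
m_online      <- lm(online_mark ~ time_position, data = exam)
summary(m_interactive)   # the paper reports a small but significant time effect here
summary(m_online)        # machine-marked component: no time effect expected

# Robustness check described above: control for ability via the online mark.
summary(lm(interactive_mark ~ time_position + online_mark, data = exam))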

Figure 1. The association between student mark (percent) and time elapsed in the examination.



The impact of the effect was relatively small (a 3.27% decline for candidates sitting in the last group compared with the first group) but would have altered pass–fail decisions. We modelled the effect of adjusting time elapsed on marginal students, estimating predicted marks as if each student had sat at the very start or the very end of the examination. Two late-sitting failing students were predicted to have passed had they sat in the first time period. Four early-sitting marginal (D-grade) candidates were predicted to have failed had they sat in the last time period. Finally, five A-grade candidates were predicted to have obtained B-grades had they sat in the last time period.
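Counterfactual estimates of this kind can be produced with predict() on the fitted time model, scoring every candidate as if they had sat in the first or the last slot. The self-contained sketch below reuses the same illustrative simulated structure as the regression sketch; none of it reflects the real marks.

# Sketch of the counterfactual re-scoring described above (illustrative data).
set.seed(2)
exam <- data.frame(time_position = rep(1:9, length.out = 278))
exam$interactive_mark <- 80 - 0.4 * exam$time_position + rnorm(278, sd = 6)
m_interactive <- lm(interactive_mark ~ time_position, data = exam)

pred_first <- predict(m_interactive, newdata = transform(exam, time_position = 1))
pred_last  <- predict(m_interactive, newdata = transform(exam, time_position = 9))

summary(pred_first - pred_last)   # model-implied advantage of sitting first vs last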


Discussion

Students sitting the interactive stations of an OSCE early in the examination received better marks than those sitting them later. While the effect is small – around 3.27% from first to last group – it may impact students close to the pass–fail borderline, or those in contention for awards. In the present study, a number of students would have experienced important grade changes, and significantly this is in spite of the fact that the unaffected online component contributed a large proportion of the mark. In an examination consisting purely of examiner-marked observed stations, the practical effect would be larger.

This study, uniquely, contrasts interactive and online elements of the same examination to show that leakage, examinee preparation, or genuine ability differences are unlikely to explain the change in marks, as all such factors would be expected to influence both the interactive and the online marks. This finding expands upon research showing that experience increases hawkishness (McManus et al. 2006). It also supports research showing that, even when using checklists, examiners adapt their judgments based on candidates they have recently seen (Yeates et al. 2012, 2013). Significantly, it contradicts the finding of McLaughlin et al. (2009), who found examiners were most hawkish at the start of the OSCE. Given the difference in sample size (14 vs. 278) and the fact that the McLaughlin study examined a formative, not a summative, OSCE, it is difficult to compare the two directly. Research comparing examiner performance in formative versus summative contexts would be especially helpful in clarifying these findings.

As few students failed the OSCE, and as overall performance was high, examiners will have dealt with an increasing number of successful candidates as time elapsed. This may partially explain the increased stringency as the OSCE continued. This is the first study, to our knowledge, to show this happening in a summative OSCE. The research expands upon the known sources of examiner bias in OSCEs, which include training, familiarity, experience, and hawkishness (Harasym et al. 2008; Chesser et al. 2009; Stroud et al. 2011). Such research helps to explain the broad range of internal consistency found in OSCEs (Turner & Donkoski 2008): some OSCEs will have more error than others, and some may avoid sources of error by coincidence. Critically reviewing OSCE metrics is important (Pell et al. 2010), as even seemingly minor logistical changes – such as shifting the examination to take place over several days – may introduce new sources of error.

Reviews of OSCE metrics should not be limited solely to the content of the assessment, and it is likely that other sources of error remain to be identified.

The present study had a number of advantages. It was a large summative clinical examination which covered a range of abilities and topics and had acceptable psychometric characteristics. By contrasting an online component with an interactive component, it was able to evaluate alternative explanations to increased hawkishness among the examiners.

The study also exhibited some limitations. Most importantly, these data are not from a controlled experiment but are derived from correlations within a live examination. Consequently, there were challenges in controlling for potential confounds, and strong causal inferences should not be drawn. Examiners may be more hawkish on an individual basis as time goes by (marking later examinees more severely) and prone to adapt their judgments based on recent examinees, but it is also plausible that examiners become more confident that potential problems with the examination have been removed as time passes, and so are more willing to fail students later on. There were no time restrictions on examiner participation, and better control of participation would be useful in any replication. The use of parallel-form tests on the two days may present a source of error, although the linearity of the model shows that performance does not simply drop between day one and day two. Lastly, this study presents one cohort from one institution; replication would be needed to confirm the generalizability of the decline in marks over time. The results should be treated cautiously as, although they are informative, an experimental design planned in advance would be necessary to make strong causal inferences about this trend. It must be emphasized, however, that the cost and logistical challenges of OSCEs make running or adjusting OSCEs purely for research extremely difficult, and so research on live examinations represents a key source of data on OSCEs.

The study suggests some practical solutions. Examiner training is crucial (Chesser et al. 2009), and making examiners aware of sources of bias may help. It may be necessary to standardize the amount of time examiners mark for, and to ensure that any one student meets experienced and inexperienced examiners in equal numbers. Periodically benchmarking examiners during the OSCE by revisiting training material may counteract the effect of recent observations. Alternatively, following McManus et al. (2006), it is possible to correct biased marks post hoc, although it is unclear how widely such techniques are used. Where OSCE assessment committees can run such models, they should do so routinely to provide the best possible estimates of error and to explore means of correcting it. Revisiting the marks of borderline students who sat the OSCE early on may increase confidence in pass/fail decisions.

Further research should address whether this phenomenon is consistent across non-clinical assessments – particularly essays, portfolios, and vivas. Evaluating the possibility of experience and exposure bias in assessment will ultimately lead to more accurate and useful assessment results.
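McManus et al. (2006) used multi-facet Rasch modelling for post-hoc correction; a simpler, related approach is sketched below using a linear mixed-effects model in which examiner stringency is estimated as a random intercept. The lme4 package, the long-format data frame obs, and all column names are assumptions for illustration, not the paper's method, and the data are simulated.

# Sketch of one post-hoc adjustment strategy (a mixed-effects variant, not the
# multi-facet Rasch model used by McManus et al.). All data are simulated.
# install.packages("lme4")  # assumed available
library(lme4)

set.seed(3)
obs <- data.frame(
  mark          = rnorm(278 * 8, mean = 75, sd = 8),
  candidate     = factor(rep(1:278, each = 8)),
  examiner      = factor(sample(1:32, 278 * 8, replace = TRUE)),
  station       = factor(rep(1:8, times = 278)),
  time_position = rep(rep(1:9, length.out = 278), each = 8)
)

fit <- lmer(mark ~ time_position + (1 | examiner) + (1 | station) + (1 | candidate),
            data = obs)

ranef(fit)$examiner   # estimated leniency/stringency of each examiner
fixef(fit)            # fixed time-position slope, usable to adjust marks post hoc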


Glossary

Leniency: The fact or quality of being more merciful or tolerant than expected; clemency. (Oxford English Dictionary – www.oxforddictionaries.com)

Standard: Something used as a measure, norm, or model in comparative assessments; a required or agreed level of quality or attainment. (Oxford English Dictionary – www.oxforddictionaries.com)


Notes on contributors

DAVID HOPE, MA, MSc, PhD, is a psychometrician who specializes in the study of individual differences in human cognitive ability and behaviour. His duties include investigating the academic and personal correlates of feedback satisfaction and designing feedback mechanisms that improve satisfaction and performance.

HELEN CAMERON, BSc, MBChB, is the Director of the Centre for Medical Education and an experienced medical educationalist. Her main academic interest is the investigation of assessment strategies to promote learning, ensure competence, and enhance the undergraduate curriculum.

Declaration of interest: The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the article.

References

Chesser A, Cameron H, Evans P, Cleland J, Boursicot K, Mires G. 2009. Sources of variation in performance on a shared OSCE station across four UK medical schools. Med Educ 43:526–532.

Harasym PH, Woloschuk W, Cunning L. 2008. Undesired variance due to examiner stringency/leniency effect in communication skill scores assessed in OSCEs. Adv Health Sci Educ Theory Pract 13:617–632.

Harden RM, Gleeson FA. 1979. Assessment of clinical competence using an objective structured clinical examination (OSCE). Med Educ 13:39–54.

Ihaka R, Gentleman R. 1996. R: A language for data analysis and graphics. J Comput Graph Stat 5:299–314.

Kogan JR, Conforti L, Bernabeo E, Iobst W, Holmboe E. 2011. Opening the black box of clinical skills assessment via observation: A conceptual model. Med Educ 45:1048–1060.

McLaughlin K, Ainslie M, Coderre S, Wright B, Violato C. 2009. The effect of differential rater function over time (DRIFT) on objective structured clinical examination ratings. Med Educ 43:989–992.

McManus I, Thompson M, Mollon J. 2006. Assessment of examiner leniency and stringency ('hawk-dove effect') in the MRCP (UK) clinical examination (PACES) using multi-facet Rasch modelling. BMC Med Educ 6:42.

Niehaus AH, DaRosa DA, Markwell SJ, Folse R. 1996. Is test security a concern when OSCE stations are repeated across clerkship rotations? Acad Med 71:287–289.

Pell G, Fuller R, Homer M, Roberts T. 2010. How to measure the quality of the OSCE: A review of metrics – AMEE guide no. 49. Med Teach 32:802–811.

Schoonheim-Klein M, Muijtjens A, Habets L, Manogue M, van der Vleuten C, van der Velden U. 2009. Who will pass the dental OSCE? Comparison of the Angoff and the borderline regression standard setting methods. Eur J Dent Educ 13:162–171.

Stroud L, Herold J, Tomlinson G, Cavalcanti RB. 2011. Who you know or what you know? Effect of examiner familiarity with residents on OSCE scores. Acad Med 86:S8–S11.

Turner J, Donkoski M. 2008. Objective structured clinical exams: A critical review. Fam Med 40:574–578.

Verhoeven BH, Hamers JGHC, Scherpbier AJJA, Hoogenboom RJI, van der Vleuten CPM. 2000. The effect on reliability of adding a separate written assessment component to an objective structured clinical examination. Med Educ 34:525–529.

Walsh M, Bailey PH, Koren I. 2009. Objective structured clinical evaluation of clinical competence: An integrative review. J Adv Nurs 65:1584–1595.

Yeates P, O'Neill P, Mann K, Eva K. 2012. Effect of exposure to good vs poor medical trainee performance on attending physician ratings of subsequent performances. JAMA 308:2226–2232.

Yeates P, O'Neill P, Mann K, Eva K. 2013. 'You're certainly relatively competent': Assessor bias due to recent experiences. Med Educ 47:910–922.

