Factors Affecting the Reliability of Ratings of Students' Clinical Skills in a Medicine Clerkship JAN D. CARLINE, PhD, DOUGLAS S. PAAUW, MD, KEITH W. THIEDE, MEd, PAUL G. RAMSEY, MD

Objective: To determine the overall reliability and factors that might affect the reliability of ratings of students' clinical skills in a medicine clerkship.

Design: A nine-item instrument was used to evaluate students' clinical skills. Raters were also asked to provide a grade of each student's overall clinical performance. Generalizability studies were performed to estimate the reliability of the ratings. The effects of rater experience and clerkship setting were investigated by regression analysis.

Setting: Teaching hospitals and community-based sites in three Northwestern states.

Participants: All students (328) who had completed the 12-week clerkship in internal medicine at one medical school during the academic years 1987-1989. Raters included attending physicians, chief residents, and other residents.

Results: Seven observations were needed to provide a reliable rating of the overall clinical grade. More observations were needed to obtain reliable ratings for individual items, ranging from seven observations needed for the rating of data-gathering skills to 27 observations needed for the rating of interpersonal relationships with patients. Rater experience and clerkship setting (i.e., teaching hospitals vs. community-based clinics) were found, in general, not to affect significantly the ratings received by students.

Conclusions: Reliable ratings of students' overall clinical skills, including overall clinical grades, can be achieved by collecting a minimum of seven observations. More observations are needed to measure reliably the interpersonal aspects of clinical performance. These findings support the use of performance ratings to evaluate clinical skills and knowledge of students in clerkship settings.

Key words: medical students; clinical clerkship; reliability; educational measurement. J GEN INTERN MED 1992;

7:506-510.

PERFORMANCE RATINGS are collected frequently in the evaluation of medical students during clerkships. The results of these ratings are used primarily to determine students' grades in many medical schools.1 Although the reliability of these ratings has been questioned, many faculty members favor the use of this method to evaluate students' clinical skills.2,3 Other methods of clinical skills assessment, such as a standardized patient examination or the Objective Structured Clinical Examination,4,5 have been used to evaluate the performance of medical clerks but are very expensive to mount, considering the faculty time required to design the examination and evaluate the results. In contrast, ratings of performance are relatively inexpensive and familiar to faculty and residents who have responsibility for teaching students.

Several problems with performance ratings have been described. Intrastudent variability within ratings given by multiple observers6,7 suggests either that ratings are unreliable or that raters are seeing different types of performance from the same student. Several factors may potentially affect the ratings of clerks. The interests of raters with different backgrounds and the types of interactions between clerks and different types of observers may affect ratings. A student who is a hard worker and always available on a ward may receive a higher rating from a resident who appreciates this type of student than from an attending physician who does not need the assistance of the clerk. The attending physician's rating of the clerk may be based on a more limited interaction in rounds or conferences and may be focused on the estimation of the clerk's intellectual abilities.8

Differences in the types of training environment (e.g., teaching hospital vs. private clinic) may affect how students are rated by observers. The specific skill and attitude requirements for each setting might result in different ratings for the same student. The severity of illness of patients in a hospital setting may challenge a student to exemplary performance, while the more common problems encountered in a community clinic setting may not motivate the same student to excellent work. The slow and methodical student, who takes extended histories and does thorough physical examinations and who needs a significant amount of time to organize a presentation, may appear to be a stronger student in a hospital setting than in a community clinic setting. The need for relatively quick interviews and examinations and rapid decision making in the community clinic setting may be reflected in the skills valued by faculty members. The individual contributions of patient differences, practice characteristics, and even the interests of faculty members drawn to different types of settings are difficult to quantify separately. Because these factors may be related to the type of teaching setting, they will be subsumed by this single variable.

Received from the Departments of Medical Education (JDC, KWT) and Medicine (DSP, PGR), University of Washington School of Medicine, Seattle, Washington. Dr. Ramsey is a Henry J. Kaiser Family Foundation Faculty Scholar in General Internal Medicine.

Address correspondence and reprint requests to Dr. Carline: Department of Medical Education, SC-45, University of Washington School of Medicine, Seattle, WA 98195.

JOURNAL OF GENERAL INTERNAL MEDICINE, Volume 7 (September/October), 1992

Differences in the times of year the students are encountered may also affect ratings. Increased skills on the part of students in the later rotations of a year may result in higher ratings than those received by students in the earlier part of the year. Alternatively, faculty expectations about student performance may differ as the academic year progresses.

This study was designed to address several questions about the measurement characteristics of ratings of student performance. Are ratings of students reliable? How many observations are needed to obtain reliable ratings? Do factors such as the specific time in the academic year when a clerkship is taken, the experience of the rater, and the clerkship setting affect the reliability of ratings?

METHODS

The ratings given to 328 University of Washington students who had completed the 12-week clerkship in internal medicine during the academic years 1987-1989 were included in this study. Standard rating forms (the Student Performance Evaluation form) were distributed to all attending physicians, chief residents, and residents who had had contact with a student. A total of 3,557 evaluations were collected. Each evaluation form received was counted as one observation of student performance. Most students (76%) had completed the clerkship in one of five teaching hospitals in Seattle. Another 11% had completed the clerkship in a Veterans Administration Medical Center (VAMC) in Boise, Idaho. The remaining 13% of students had completed the first six weeks of the clerkship at a teaching hospital in Seattle, and the last six weeks at a community-based clinic site (WAMI*) in one of three small cities in the states of Washington and Montana. Students had been taught by attending physicians, chief residents, and residents at Seattle and Boise hospitals, and had been taught by clinical faculty alone at the WAMI sites.

The Student Performance Evaluation form consists of nine items organized into three categories. The category of clinical knowledge and skills consists of four items: data skills, clinical problem solving, technical skills, and knowledge in subject area. Interpersonal relationships consists of three items: relationships with patients, professional relationships, and educational attitudes. Two items (initiative and interests, and attendance and dependability) make up the category of personal/professional characteristics. Evaluators were asked to rate a student on each of these items using a four-point scale, with descriptions of specific behavior expected at each point.

*WAMI = Washington-Alaska-Montana-Idaho. The acronym refers to the school's program that sends students for training outside of the immediate environs of Seattle, into these four states.

The descriptions


of the low and high scale points are included in Table 1 for each item. The scale ranged from 1 (unsatisfactory) to 4 (excellent). Raters were allowed a rating of "not observed or applicable" (NA). In addition, raters were asked to recommend a clinical grade using the following scale: 1 = not satisfactory, 2 = below expected performance for level, 3 = as expected for level, 4 = above expected performance for level, and 5 = honors. Ratings were reviewed and summarized by clerkship coordinators at the end of each rotation. Final clerkship grades were based upon these ratings and performance on a written examination of knowledge that included a multiple-choice section and a patient-management-problem section. The correlations of the summarized clinical grade, the multiple-choice test score, and the patient-management-problem score with the final grade were 0.67, 0.37, and 0.40, respectively, for data collected in this study. Grades were submitted to the dean's office along with final ratings for each item on the Student Performance Evaluation form. Grades were reported as not satisfactory, satisfactory, or honors.9

Generalizability studies10,11 were done to estimate the reliability of the ratings. Generalizability theory allows the estimation of the reliability of a measure using various numbers of observations. The generalizability coefficient, a near equivalent of the reliability coefficient, was calculated and used in this study to predict the number of observations needed to obtain a reliable measure of student performance.
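For a simple one-facet design (students crossed with observations), a generalizability coefficient of this kind can be estimated from one-way ANOVA variance components. The sketch below uses simulated data: the counts of students and observations echo the study, but the rating spreads and the balanced, fully crossed design are illustrative assumptions, not the paper's actual (unbalanced) data.

```python
import numpy as np

# One-facet G-study sketch: 328 "students", 12 simulated observations each.
rng = np.random.default_rng(0)
n_students, n_obs = 328, 12
true_skill = rng.normal(3.4, 0.4, size=n_students)            # person effect
ratings = true_skill[:, None] + rng.normal(0, 0.5, size=(n_students, n_obs))

# Mean squares from a one-way (persons) ANOVA
grand = ratings.mean()
ms_person = n_obs * ((ratings.mean(axis=1) - grand) ** 2).sum() / (n_students - 1)
ms_resid = ((ratings - ratings.mean(axis=1, keepdims=True)) ** 2).sum() / (
    n_students * (n_obs - 1))

var_person = (ms_person - ms_resid) / n_obs     # estimated person variance
g_coef = var_person / (var_person + ms_resid / n_obs)  # G for a 12-obs mean
print(round(g_coef, 2))
```

With 12 observations averaged per student, the coefficient comes out high, as in the study's 0.62-0.86 range for 12-observation means.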
Multiple regression analyses were used to calculate the unique variance, or the amount of variability attributable to one variable in a relationship between two or more variables and a dependent variable. Unique variance was calculated to estimate the effects of factors such as the specific level of the rater or the type of clerkship location on the performance ratings received by students. The effects of rater status and location of the clerkship on ratings normalized within individual students were investigated by regression analysis. Rater status was dummy-coded to represent three groups: faculty, chief residents, and all other levels of residents. Location of the clerkship was dummy-coded to represent three groups: teaching hospitals in Seattle, the Boise VAMC, and sites split between Seattle and the WAMI sites. The academic quarter in which the student had taken the clerkship was also dummy-coded. These codes were then regressed on the normalized ratings.
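In practice, a unique-variance estimate of this kind amounts to comparing R-squared for a full regression against R-squared with one dummy-coded factor dropped. A minimal sketch on simulated, normalized ratings; the group labels and the small faculty effect are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
rater = rng.integers(0, 3, n)   # 0 = resident, 1 = chief resident, 2 = faculty
site = rng.integers(0, 3, n)    # 0 = Seattle, 1 = Boise VAMC, 2 = Seattle/WAMI
# Simulated normalized ratings with a small faculty effect
rating = 0.2 * (rater == 2) + rng.normal(0, 1.0, n)

def r_squared(predictors, y):
    """R^2 from an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + predictors)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

# Dummy-code each three-level factor (one reference level dropped)
d_rater = [(rater == 1).astype(float), (rater == 2).astype(float)]
d_site = [(site == 1).astype(float), (site == 2).astype(float)]

full = r_squared(d_rater + d_site, rating)
without_rater = r_squared(d_site, rating)
unique_rater = full - without_rater   # variance uniquely due to rater status
print(round(unique_rater, 3))
```

Because the reduced model is nested in the full one, the difference in R-squared is nonnegative and isolates the contribution of the dropped factor.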

RESULTS

An average of 12 observations were received for each student, with a minimum of five and a maximum of 19. Sixty-four percent of the observations were submitted by residents, 10% by chief residents, and 26% by faculty members. Only 10% of the students received seven or fewer observations. The NA and missing ratings



accounted for only 4.5% of possible responses across all ten items. The largest percentage of missing or NA ratings was received for the rating of technical skills, 17.8%. Eleven percent of the observations submitted by residents had missing ratings for technical skills, 39% of observations submitted by chief residents had missing ratings for technical skills, and 26% of observations submitted by faculty had missing ratings for technical skills. The number of missing ratings for technical skills

TABLE 1. Low and High Scale Point Descriptions for the Student Performance Evaluation Form

Clinical knowledge and skills

  Data skills
    Low:  Unsatisfactory. Needs work on acquiring, recording, and analyzing the database.
    High: Database and assessment skills are outstanding. Excellent case presentations.

  Clinical problem solving
    Low:  Has difficulty identifying the key problems. Demonstrates little independence. Uses time inefficiently.
    High: Identifies major and minor problems in perspective. Superior grasp of information. Very efficient use of laboratory and other services.

  Technical skills
    Low:  Unable to demonstrate basic skills of interview/physical examination/bedside procedures appropriate to clerkship level.
    High: Demonstrates superior mastery of basic skills. Performs far in advance of clerkship level.

  Knowledge in subject area
    Low:  Shows inadequate knowledge of medical principles and pathophysiology related to the patient's problems.
    High: Shows superior knowledge of the basic medical principles relating to the patient's problems.

Interpersonal relationships

  Relationships with patients
    Low:  Often discourteous and/or nonempathetic with patients. Puts personal convenience above patient's needs.
    High: Consistently courteous and empathetic. Gives patient's needs priority even with unpleasant or hostile patients.

  Professional relationships
    Low:  Behavior interferes with satisfactory performance. Discourteous to nurses and/or residents. Hostile or uncooperative.
    High: Works very well with others. Consistently courteous. Has admiration and respect of co-workers.

  Educational attitudes
    Low:  Is often sullen, hostile, or argumentative. Unresponsive to suggestions. Reacts poorly to criticism.
    High: Excellent participation. Eager to learn and be evaluated. Stimulates the learning process.

Personal/professional characteristics

  Initiative and interest
    Low:  Not well motivated. Avoids "doing" when possible. Appears disinterested. Never volunteers.
    High: Works exceptionally hard. Active leader/participant. Seeks new learning experiences.

  Attendance and dependability
    Low:  Consistently absent or late to conferences and/or patient rounds. Not prepared for didactic or patient care activities.
    High: Consistently prompt and prepared at scheduled conference/rounds. Assumes added responsibilities for patient care.

TABLE 2. Generalizability of Ratings of Students' Clinical Skills in an Internal Medicine Clerkship

                                                Number of Observations Needed
                         Generalizability       to Obtain a Coefficient of:
Item                     Coefficient*           0.7      0.8      0.9

Data skills              0.86                   4.22     7.23    16.27
Clinical skills          0.85                   4.53     7.76    17.47
Technical skills         0.79                   6.79    11.64    26.20
Knowledge                0.82                   5.62     9.63    21.66
Patient relations        0.62                  15.97    27.38    61.59
Professional relations   0.74                   8.97    15.37    34.59
Educational attitudes    0.76                   8.13    13.93    31.35
Initiative               0.80                   6.47    11.09    24.95
Attendance               0.77                   7.47    12.81    28.82
Clinical grade           0.86                   4.09     7.02    15.79

*Coefficients were calculated on an average of 12 observations per student.



TABLE 3. Contribution of Rater Level on Student Ratings and Grades in an Internal Medicine Clerkship: Unique Variance Attributed to Each Level

Item                     Resident    Chief Resident    Faculty    Total R Square

Data skills              0.11*       0.01†             0.04*      0.54*
Clinical skills          0.06*       0.06*             0.03*      0.58*
Technical skills         0.07*       0.04*             0.03*      0.48*
Knowledge                0.08*       0.01†             0.05*      0.45*
Patient relations        0.16*       0.01              0.03*      0.30*
Professional relations   0.11*       0.03*             0.02*      0.39*
Educational attitudes    0.15*       0.01              0.04*      0.45*
Initiative               0.07*       0.05*             0.06*      0.51*
Attendance               0.12*       0.04*             0.04*      0.44*
Clinical grade           0.10*       0.04*             0.03*      0.70*
Final grade              0.04*       0.03*             0.02*      0.39*

*p < 0.001. †p < 0.05. ‡p < 0.01.

from faculty raters varied by location of the clerkship. Only 5% of the observations from faculty at community clerkship sites had missing ratings for technical skills, while 29% of the observations from faculty at Boise and the Seattle teaching hospitals had missing ratings for technical skills. This pattern of differences by level of rater in the percentage of technical skills ratings submitted may be related to the opportunities that the raters had had to observe the students demonstrating technical skills. The lower-level resident who had worked most closely with the student on a day-to-day basis and the faculty member in the community clinical setting who had observed the student in his or her own practice may have had the greatest opportunity to observe the student in this area. Chief residents and faculty members at teaching hospitals may have had less opportunity to observe the technical skills of the students.

The average rating for items using the four-point scale was 3.42, with a standard deviation of 0.57. Raters used all points on the scale, although the values of 1 and 2 accounted for only 9.4% of the possible ratings. The lowest average rating was for data skills, 3.21, and the highest average rating was for patient relationships, 3.73. The average rating for clinical grade was 3.18, with a standard deviation of 0.93.

Reliability of the Ratings

The generalizability coefficients, reported in Table 2, equivalent to reliability coefficients,8 ranged from a low of 0.62 for patient relationships to a high of 0.86 for data skills and clinical grade. These results indicate that 12 observations would be needed to ensure an adequately reliable rating (0.8) for clinical knowledge and skills items, 27 observations needed for interpersonal relationship items, 14 observations needed for personal/professional characteristics items, and seven observations needed for the clinical grade.
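Projections of this kind follow from the standard generalizability (Spearman-Brown style) relation: if a coefficient G was observed for the mean of n observations, the count needed for a target coefficient solves G(n) = s2p / (s2p + s2e/n) for n. A sketch; the published table was evidently computed from unrounded coefficients, so plugging in the rounded values printed in Table 2 lands close to, but not exactly on, the tabled figures.

```python
def observations_needed(g_observed: float, n_observed: int, g_target: float) -> float:
    """Project the number of observations needed to reach g_target,
    given g_observed for the mean of n_observed observations."""
    return (n_observed
            * (g_target / (1 - g_target))
            * ((1 - g_observed) / g_observed))

# Rounded data-skills coefficient (0.86 from an average of 12 observations):
print(round(observations_needed(0.86, 12, 0.8), 1))  # ~7.8; Table 2 reports 7.23
# The low patient-relations coefficient (0.62) demands far more observations:
print(round(observations_needed(0.62, 12, 0.8), 1))
```

The formula makes clear why patient relations, with the lowest coefficient, needs the largest number of observations at every target reliability.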

Effects of Rater Status, Clerkship Location, and Academic Quarter on Ratings

The unique variance related to the potential confounding factors of rater status, clerkship location, and academic quarter on student ratings was minimal. Using Bonferroni's correction for multiple comparisons,12 the only statistically significant contribution was found for rater status on the ratings for clinical grade (unique variance = 0.13, p = 0.001). Post hoc analysis indicated that faculty and chief residents had given higher clinical grades than had residents: the most diverse groups were approximately 0.25 standard deviations different from each other. Although differences were found, the absolute level of the differences for this confound was small.

Relationships of Potentially Confounding Variables to Ratings of Clinical Skills and Grades

One-way analyses of variance were completed using the nine ratings, the clinical grade, and the final grade as dependent variables, with the quarter and the location of the clerkship as independent variables. With Bonferroni's correction for multiple comparisons, only the ratings of technical skills were found to differ by academic quarter (F = 5.13, p = 0.002), with ratings given in autumn quarter lower than those given in spring quarter (3.13 vs. 3.39, Scheffé post hoc analysis). Only the ratings of data skills were found to differ by location of the clerkship (F = 6.00, p = 0.003), with ratings received at WAMI sites lower than those received from the Seattle teaching hospitals (3.02 vs. 3.32, Scheffé post hoc analysis).

The relationships of ratings from residents, chief residents, and faculty members with the final ratings and grades were determined in the following fashion. Mean scores for each student on the ratings were calculated for each level of rater. These were then regressed on the actual ratings and grades received by the students. The unique variance associated with each level for the ratings is reported in Table 3. In all cases the ratings of the residents contributed the largest unique variance to the regression of rater status on the final ratings.
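The Bonferroni threshold implied by these analyses is simple to check. The count of 22 comparisons (11 dependent variables crossed with 2 independent variables) is inferred from the "two of 22 comparisons" mentioned in the Discussion, not stated explicitly in the Methods.

```python
# Bonferroni correction: divide the nominal alpha by the number of tests.
alpha = 0.05
n_comparisons = 22                  # 11 outcome measures x 2 factors (inferred)
alpha_corrected = alpha / n_comparisons
print(round(alpha_corrected, 4))    # 0.0023

# The reported effects (p = 0.002 for quarter on technical skills, p = 0.003
# for location on data skills) sit at roughly this per-test threshold.
```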

DISCUSSION

Acceptable reliability (0.8) of ratings of students in a basic clerkship was obtained with seven to 14 observations for student evaluation items, including data skills, clinical skills, technical skills, knowledge, educational attitudes, initiative, attendance, and overall clinical grade. More observations were needed to obtain acceptable reliability for items related to relationships with other professionals and patients. Potential confounding factors, such as the academic quarter, the site for a clerkship, and the level of the rater, were found to have little effect on the ratings themselves, although residents were found to give slightly lower final clinical grades than were either chief residents or faculty members. The academic quarter and the type of clerkship location were not found to be related to differences in final ratings or grades received by students except in two of 22 comparisons. In contrast, the ratings given by residents seemed to contribute more to the final ratings and grades of the students than did the ratings given by chief medical residents and faculty members.

The reliability of the ratings found in this study stands in sharp contrast to previously reported results.6,7 Reliabilities of 0.7 to 0.8 were found in this study, while other studies found reliabilities of only 0.33 or less. Reliable ratings of data-gathering skills, clinical skills, knowledge, and technical skills can be obtained from the ratings of students contributed by faculty and residents in a 12-week medicine clerkship. The overall rating of clinical grade was also found to have an adequate reliability if at least seven observations had been collected. While it is probably inappropriate to depend on only one source of information to determine the final clerkship grade of a student, such ratings do meet the basic requirement of reliability for use in establishing a grade.
Some aspects of clinical performance, such as relationships with patients, can be reliably measured only with a very large number of observations, probably unattainable in most clerkships. If assessment of this type of skill, or other similar skills necessitating an unattainable number of observations for reliable ratings, is deemed important in the grading of a clerk, then another method of observation must be included in grading systems. Alternatively, more careful training of faculty and residents in the observation and rating of students might increase the reliability of ratings for these skills.

The average number of ratings received for each student in this study may be impossible to duplicate in shorter clerkships or in settings where students do not work with teams of residents and attending faculty physicians. This, plus the four-point scale used at the University of Washington and the restriction of range of most ratings, may limit the applicability of these results to other schools or academic situations. Despite these potential limitations, however, our data suggest that reliable ratings of students' clinical skills, including clerkship grades, can be achieved by the collection of a minimum of seven observations. These findings support the use of performance ratings to evaluate clinical skills and knowledge of students in the clerkship setting.

REFERENCES

1. Magarian GJ, Mazur DJ. Evaluation of students in medicine clerkships. Acad Med. 1990;65:341-5.
2. Tonesk X, Buchanan RG. An AAMC pilot study by 10 medical schools of clinical evaluation of students. J Med Educ. 1987;62:707-18.
3. Hunt DD, Carline JD, Tonesk X, et al. Types of problem students encountered by clinical teachers on clerkships. J Med Educ. 1989;23:14-8.
4. Harden RM, Gleeson FA. Assessment of clinical competence using an Objective Structured Clinical Examination (OSCE). Med Educ. 1979;13:41-54.
5. Ainsworth MA, Rogers LP, Markus JF, Dorsey NK, Blackwell TA, Petrusa ER. Standardized patient encounters: a method for teaching and evaluation. JAMA. 1991;266:1390-6.
6. Maxim BR, Dielman TE. Dimensionality, internal consistency and interrater reliability of clinical performance ratings. Med Educ. 1987;21:130-7.
7. Hull AL. Medical student performance: a comparison of house officer and attending staff as evaluators. Eval Health Prof. 1982;5:87-94.
8. Carline JD, Cook CE, Lennard ES, Coluccio GM, Norman NL. Resident and faculty differences in student evaluations: implications for changes in a clerkship evaluation system. Surgery. 1986;100:88-94.
9. Ramsey PG, Shannon NF, Fleming L, et al. Use of objective examinations in medicine clerkships. Am J Med. 1986;81:669-74.
10. Shavelson RJ, Webb NM. Generalizability theory: a primer. Newbury Park, CA: Sage Publications, 1991.
11. Winer BJ. Statistics: principles in experimental design (2nd ed). New York: McGraw-Hill, 1971.
12. O'Brien PC, Shampo MA. Statistical considerations for performing multiple tests in a single experiment. Mayo Clin Proc. 1988;63:816-20.
