How Well Do Internal Medicine Faculty Members Evaluate the Clinical Skills of Residents? Gordon L. Noel, MD; Jerome E. Herbers, Jr., MD; Madlen P. Caplow; Glinda S. Cooper, MS; Louis N. Pangaro, MD; and Joan Harvey, MD

• Objective: To determine the accuracy of faculty evaluations of residents' clinical skills and whether a structured form and an instructional videotape improve accuracy.
• Design: Randomized, controlled trial.
• Setting: Twelve university and community teaching hospitals.
• Participants: A total of 203 faculty internists.
• Interventions: Participants watched a videotape of one of two residents performing new patient workups. Participants were assigned to one of three groups: They used either an open-ended evaluation form or a structured form that prompted detailed observations; some participants used the structured form after seeing a videotape showing good evaluation techniques.
• Main Outcome Measures: Faculty observations of strengths and weaknesses in the residents' performance were scored. An accuracy score consisting of clinical skills of critical importance for a competent history and physical examination was calculated for each participant by raters blinded to the participants' hospital, training, subspecialty, and experience as observers.
• Results: When observations were not prompted, participants recorded only 30% of the residents' strengths and weaknesses; accuracy among participants using structured forms increased to 60% or greater. Faculty in university hospitals were more accurate than those in community hospitals, and general internists were more accurate than subspecialists; the structured form improved performance in all groups. However, participants disagreed markedly about the residents' overall clinical competence: Thirty-one percent assessed one resident's clinical skills as unsatisfactory or marginal, whereas 69% assessed them as satisfactory or superior; 48% assessed the other resident's clinical skills as unsatisfactory or marginal, whereas 52% assessed them as satisfactory or superior. Participants also disagreed about the residents' humanistic qualities. The instructional videotape did not improve accuracy.
• Conclusions: A structured form improved the accuracy of observations of clinical skills, but faculty still disagreed in their assessments of clinical competence. If program directors are to certify residents' clinical competence, better and more standardized evaluation is needed.

Annals of Internal Medicine. 1992;117:757-765. From Dartmouth Medical School, Hanover, New Hampshire; the Uniformed Services University of the Health Sciences, Bethesda, Maryland; Walter Reed Army Medical Center, Washington, DC; and the University of North Carolina, Chapel Hill, North Carolina. For current author addresses, see end of text.

Patients, physicians, and licensing bodies trust that physicians who complete approved residency programs have acquired the clinical skills needed to practice medicine competently. Certification of physicians by the American Board of Internal Medicine (ABIM) requires passing an extensive test of knowledge, data interpretation and synthesis, and clinical judgment, but the responsibility for ensuring that graduating internal medicine residents possess the clinical skills and humanistic qualities needed to practice competently rests with residency programs and their directors. Program directors rely on the observations of faculty members made during inpatient and clinic rotations, on their own observations of the residents during conferences such as morning report, and on formal exercises such as the clinical evaluation exercise.

In a clinical evaluation exercise, a faculty member observes a resident performing a comprehensive evaluation of a new patient. The resident takes a complete history, does a physical examination, discusses the patient with the faculty member, and then counsels the patient, making recommendations for diagnosis and treatment. The faculty observer is asked to provide feedback to the resident and to complete a form that reports observations and an assessment of the resident's clinical competence to the program director. Despite the ABIM's request that residents be observed during a clinical evaluation exercise, evidence exists that faculty observations of internal medicine residents' clinical skills occur infrequently and that faculty ratings of residents' clinical skills fail to discriminate among levels of clinical competence or to document problems (1-10). Few residents are reported to have clinical skills below the threshold for taking the ABIM's written examination.

We designed the present study to determine the ability of faculty internists in twelve teaching hospitals to evaluate accurately the clinical skills of residents using a simulated clinical evaluation exercise. We also conducted a randomized, controlled trial to determine whether two simple interventions would improve the accuracy of faculty observations. We tested four hypotheses: 1) faculty members in both university and community hospitals using an open-ended form would fail to note many of the strengths and weaknesses displayed by a resident during a clinical evaluation exercise; 2) compared with an open-ended form, a structured form would increase both the quantity and accuracy of faculty observations; 3) a brief videotape explaining the purposes of the clinical evaluation exercise and suggesting methods for observing and documenting the performance of the resident would further improve accuracy; and 4) in their global assessments, faculty members would vary substantially in what they regarded as an acceptable level of performance by a postgraduate year 2 (PGY-2) resident.


Methods

Selection and Allocation of Participants

Six university hospitals and six community hospitals were chosen by the investigators and the senior staff of the ABIM. The hospitals were representative of various sizes and types of internal medicine teaching programs. Participants in the study included both general and subspecialty internists who had been scheduled to serve as clinical evaluation exercise evaluators during either the previous or the current year, as well as members of the departmental clinical competence committee. Participation in the study was voluntary; informed consent was obtained from all participants.

At each hospital, participants selected the most convenient of three different times to participate. Each time constituted a different study group. Participants in group 1, the control group, used an open-ended form provided by the ABIM to evaluate the residents during a clinical evaluation exercise; group 2 participants used a structured form designed for this study; group 3 participants first viewed a videotape that explained the purpose of the clinical evaluation exercise and then used the structured form. The order of the groups in each hospital was systematically rotated to control for any effect that time of day might have on observations; participants chose times without knowledge of any differences in the groups.

Simulated Clinical Evaluation Exercises

Two clinical evaluation exercise simulations were videotaped in a professional television studio. In each, an internist recently out of training portrayed a resident, and a professional actor portrayed a patient. Although the residents performed many skills well, they were scripted to omit or perform incorrectly enough critically important interviewing and physical examination skills that participants could question whether these residents were too marginal for the program director to certify them as clinically competent.

In tape A, the patient was a man in his late fifties with adult-onset diabetes mellitus treated with oral agents and poor dietary compliance. He also had hypertension controlled with a diuretic, symptoms of impotence, lower-extremity sensory neuropathy, and the recent onset of degenerative joint disease in one knee. The resident made clear errors in the history, physical examination, and management of the patient: failure to assess the patient's diet and drug compliance despite repeated cues of noncompliance; a poor sexual history in the face of complaints of diminishing sexual function and a history of illnesses and medications often associated with impotence; and failure to pursue symptoms of peripheral neuropathy, along with a totally inadequate sensory examination. The resident examined only the cervical lymph nodes, felt for the thyroid gland 2 inches too high, used excessive jargon, and failed to respond to numerous, clear verbal and nonverbal cues suggesting underlying psychosocial issues.

In tape B, the patient was a man in his late twenties with a 1-week history of the sudden onset of bloody diarrhea and a previous episode several years earlier. He also had a past history of conjunctivitis and work-related stress. The resident's most serious errors involved problems in synthesis and data analysis, including premature closure on an unlikely diagnosis: He decided within 30 seconds that the patient's diarrhea was probably due to hyperthyroidism, which led to a lengthy search for symptoms of hyperthyroidism. Not until late in the history did he ask questions to ascertain whether the patient could have inflammatory bowel disease, and he learned only by chance that the patient's diarrhea was bloody. During the physical examination he omitted joint and lymph-node examinations, thus failing to screen for causes that could have involved both systems. Although he was preoccupied with thyrotoxicosis, his thyroid examination was grossly inadequate.

He also failed to assess the severity of the patient's blood loss: Despite positive stool blood tests, he accepted the nurse's blood pressure reading without checking for orthostatic hypotension, and he failed to check the patient's hematocrit before sending him home. He ordered tests for many causes of diarrhea usually not associated with hematochezia or improbable in this patient.

Participants were told that both residents were in their second year of residency. Both appeared to be the usual age of medical residents and were articulate, confident, and well-dressed white men. Each tape showed the entire exercise and lasted approximately 50 minutes; every aspect of the history, physical examination, and final counseling could be clearly seen and heard.

Questionnaires

Questionnaire 1: Forms Used To Record Evaluations

Participants used one of two forms to evaluate the residents. Participants in group 1 used an open-ended form suggested by the ABIM for use in the clinical evaluation exercise (11). This three-page form is divided into six sections (history, physical examination, clinical judgment and synthesis, humanistic qualities, medical care, and overall clinical competence). Each section includes a description of a group of clinical skills and asks for a global rating of those skills. Within each section, three lines are provided for written comments.

Participants in groups 2 and 3 used a structured, eight-page form that we developed in conjunction with members of the ABIM staff. The four left-hand pages are devoted to the evaluation of specific clinical skills, and the four right-hand pages provide a large amount of space for comments. A four-level scale is used to evaluate skills throughout the form: "poor, must be improved"; "fair, room for some improvement"; "good, adequate skills"; "excellent, superior skills." The history section includes 20 items on general interviewing skills (for example, use of open-ended questions, avoiding jargon) and specific details of the medical history and psychosocial history (for example, exploration of the patient's compliance with the medical regimen, alcohol history). The physical examination section includes ratings of specific skills (for example, cardiovascular, abdominal, and neurologic examinations). The humanism section asks for the evaluation of five items (for example, empathy, respect). The clinical judgment section contains four items relating to the resident's ability to analyze and present clinical data. The medical care section includes eight items that relate to the resident's formulation of a diagnostic and management plan. (A schematic encoding of this scale and these sections appears at the end of this section.)

Questionnaire 2: Detailed Follow-up Evaluation

We designed a questionnaire for each simulated clinical evaluation exercise that asked for assessment of clinical skills that were particularly relevant for the specific patient. The scale used was identical to the scale used in the structured form. By comparing responses on questionnaire 2 with those on the open-ended and structured forms, we were able to determine whether participants who failed to record strengths and weaknesses that were not prompted by the forms actually did not recognize them or simply did not write down what they had correctly observed.

Other Questionnaires

A preliminary demographic questionnaire was designed to obtain information on the participants' training, clinical and teaching responsibilities, and experience with the clinical evaluation of residents. A final questionnaire assessed the participants' perceptions of the realism and difficulty of the simulated clinical evaluation exercise they had viewed.
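For readers who want to work with the instrument programmatically, the rating scale and section structure described above lend themselves to a simple encoding. The following sketch is hypothetical: the section names, item counts, and scale labels come from the description above, but the individual item labels shown are illustrative examples, not the full instrument.

```python
# Hypothetical encoding of the structured form (illustrative items only).

# Four-level scale used to evaluate skills throughout the form.
RATING_SCALE = (
    "poor, must be improved",
    "fair, room for some improvement",
    "good, adequate skills",
    "excellent, superior skills",
)

# Sections of the eight-page structured form, with example items.
# Actual item counts: history, 20; humanism, 5; clinical judgment, 4;
# medical care, 8 (a count for physical examination is not given above).
STRUCTURED_FORM = {
    "history": [
        "use of open-ended questions",
        "avoiding jargon",
        "exploration of compliance with medical regimen",
        "alcohol history",
    ],
    "physical examination": [
        "cardiovascular examination",
        "abdominal examination",
        "neurologic examination",
    ],
    "humanism": ["empathy", "respect"],
    "clinical judgment": ["analysis of clinical data", "presentation of clinical data"],
    "medical care": ["diagnostic plan", "management plan"],
}
```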
The Explanatory Videotape

We produced a 15-minute explanatory videotape that demonstrated the use of the structured form: Participants were encouraged to make detailed observations, to rate individual skills using the full range of ratings, to make notes of particular strengths and weaknesses, and to provide clear and critical feedback on interviewing and physical examination skills. The videotape emphasized humanistic skills and stressed the usefulness of the clinical evaluation exercise as a method for observing clinical skills and providing accurate, detailed feedback to residents.


Structure of a Study Day

After completing the demographic questionnaire, participants viewed the simulated clinical evaluation exercise. Group 1 used the open-ended form, and groups 2 and 3 used the structured form to record their observations; group 3 saw the explanatory videotape before seeing the simulated clinical evaluation exercise. Participants in all groups were encouraged to record comments while watching the exercise, and the videotape was stopped for 12 minutes after the history and for 6 minutes after the physical examination so that they could record their observations. After participants viewed the closing section of the tape, in which the resident discussed his findings and recommendations for diagnosis and management with the patient, they had unlimited time to record their observations. After the forms were completed, participants filled out questionnaire 2 and the final questionnaire regarding the realism of the clinical evaluation exercise simulation.

Scoring and Statistical Analysis of Accuracy

We created scoring criteria for each of the simulated clinical evaluation exercises to determine how many of the residents' strengths and weaknesses were recorded by each participant. The scoring system was revised until raters achieved greater than 80% simple agreement in pilot readings. All 203 forms were then read and scored by three raters blinded to the participant's identity, institution, and group assignment. Two raters scored each form independently, without knowledge of the other rater's evaluation. The third rater compared the two scored forms and resolved any disagreements.

We developed an accuracy score based on a subset of items that met the following criteria: The skill was performed unequivocally incorrectly or correctly by the resident; the performance could be clearly heard or seen in the videotapes; and the skill, performed correctly, was of central importance to the evaluation of this patient (for example, thyroid examination) or an important part of the evaluation of every patient (for example, response to verbal and nonverbal cues). Twenty-three items for tape A and 18 items for tape B met all three criteria. The accuracy score was expressed as a percentage of the total possible points for each tape, that is, of the points a participant would have earned had all items included in the accuracy score been correctly evaluated.

We calculated frequencies and summary statistics for each questionnaire item and for the accuracy score, stratified by tape, study group, type of hospital (community or university), and the participants' training (general medicine or subspecialty). Statistical comparisons between subgroups were made with chi-square tests (questionnaire items) and analysis of variance (accuracy scores). The Tukey procedure was used to adjust for multiple comparisons. A weighted kappa score was calculated to assess agreement for individual items (10, 12). The kappa score for each skill was calculated for all observers using a weight of 1.0 for pairs selecting the same rating and 0.75 for pairs selecting ratings within one category (for example, poor-fair, satisfactory-superior) on the same end of the scale (see the sketch following Table 1).

Results

Distribution of Participants into Study Groups

Of 207 faculty members who volunteered for the study, only four did not appear or were called away.
Faculty members from three university hospitals and three community hospitals viewed each of the simulated clinical evaluation exercises. Eighty-six participants (42%) were in community hospitals, and 117 (58%) were in university hospitals. Fifty-seven participants (28%) were in group 1, 77 (39%) in group 2, and 69 (34%) in group 3. Similar percentages of participants in all three groups participated in a morning, noon, or afternoon session.

Ninety-six participants saw tape A, and 107 participants saw tape B. The proportions of participants from community and university hospitals in groups 1, 2, and 3 were nearly identical.

Faculty Characteristics

Almost all participants (n = 194; 96%) were certified by the ABIM. The remaining nine were either graduates of British training programs or were chief residents. There were no differences in the distribution of participants' subspecialty fields in community or university hospitals, among participants in the three study groups, or between participants who viewed tape A or tape B. Fifty-one percent of participants in community hospitals had been observed doing a history or physical examination three or more times during their training, whereas only 32% of participants in university hospitals had been observed three or more times (P = 0.03). Seventy-four percent of participants had served as clinical evaluation exercise evaluators in the last 5 years; 40% had observed five or more exercises in the last 5 years.

Participants in community and university hospitals differed with respect to the percentage of time they spent in various aspects of their professional careers: Community hospital participants spent more time in clinical practice (P < 0.001), and university hospital participants spent more time in teaching (P = 0.02) and research (P < 0.001). Forty-four percent of community hospital faculty were in private practice compared with 14% of university hospital participants.

At the conclusion of the study, 96% of all participants assessed the simulated exercise they had seen as comparable to a real clinical evaluation exercise; 55% thought the patient was less complex than, and 44% thought the patient was typical of, the patients they had used for the exercise. Ninety-three percent felt that the clinical evaluation exercise was a good way to evaluate a resident's history-taking skills, and 86% felt that the exercise was a good way to evaluate physical examination skills.

Faculty Accuracy and Variability in Assessing Clinical Skills

Analysis of Questionnaire 1 Forms: Effect of the Structured Form

The strengths and weaknesses used to develop the accuracy scores for the residents in tapes A and B are shown in Tables 1 and 2, respectively. In general, participants using the open-ended form commented on few of the residents' strengths or weaknesses; for those items mentioned, they included little description of what made the performance good or poor; a typical form contained fewer than a dozen observations.

Of the 23 skills used to create the accuracy score for the resident in tape A, 13 items were not included on the structured form: Faculty members had to make and record observations without prompting (see Table 1). For these 13 items, participants in groups 2 and 3 who used the structured form performed no better than those using the open-ended form (group 1): The average accuracy of all participants was 32% (range, 30% to 33%).


Accuracy was better than 50% for only two items: 73% of participants noted that the sensory examination was poor, and 51% noted that the diabetic history was poor. Only 16% and 19% of participants commented on the incomplete examinations of the liver and spleen, respectively; only 23% commented on the resident's skimpy evaluation of the hypertension history; and only 25% commented on the resident's failure to follow up on numerous suggestions of noncompliance.

Performance among the groups was strikingly different for the 10 items prompted by the structured form. Group 1 participants achieved an accuracy of 40% on these items, whereas participants in groups 2 and 3 were significantly more accurate (64% and 66%, respectively; P < 0.001). For example, in group 1 only 7% of participants noted the resident's failure to assess the patient's hypertension and diabetic drug compliance, compared with 53% and 65% in groups 2 and 3; only 30% noted the total omission of the lymph-node examination, compared with 63% and 48% of groups 2 and 3; and only 26% noted the resident's inadequate diet history, compared with 63% and 71% of groups 2 and 3.

Of the 18 skills used to create the accuracy score for the resident in tape B, 8 were not included on the structured form (see Table 2). For these unprompted items, participants who used the structured form performed no better than those who used the open-ended form: Overall, accuracy for these items was 30% (range, 28% to 33%).

Despite the prominence of thyroid disease in the resident's differential diagnosis, only 35% of faculty commented on the totally inadequate thyroid examination. Only 23% of participants noted the absence of a joint examination, and only 17% noted the resident's failure to check the patient's blood pressure.

Performance among the groups was again strikingly different for the 10 items prompted by the structured form. Group 1 participants achieved an accuracy of only 32%, whereas those in groups 2 and 3 achieved accuracies of 63% and 64%, respectively (P < 0.001). Participants in groups 2 and 3 commented more frequently on the resident's use of open-ended questions (54% and 50% compared with 7%) and the resident's thorough sexual history (54% and 66% compared with 3%). Participants in groups 2 and 3 also more often noted the incomplete lymph-node examination, the resident's failure to involve the patient in diagnostic decisions, his failure to explore the patient's social support systems, and his incorrectly prioritized differential diagnosis.

When all 203 participants were ranked by their accuracy scores, 19 of the 20 most accurate evaluators (accuracy, 57% to 73%) used the structured form; of the 20 least accurate evaluators (accuracy, 0% to 18%), 15 used the open-ended form and 5 used the structured form.
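To make the accuracy score and the ranking just described concrete, here is a minimal sketch in Python. The paper describes the score only as a percentage of total possible points; the equal weighting of items, the function names, and the example data below are our assumptions, not the study's actual implementation.

```python
# Minimal sketch of the accuracy-score calculation described in the Methods.
# Assumption: each scored item is worth one point, earned when the
# participant's form correctly documented the resident's strength or
# weakness. Observer names and per-item data are illustrative only.

def accuracy_score(observations: dict) -> float:
    """Percentage of total possible points earned by one participant."""
    return 100.0 * sum(observations.values()) / len(observations)

# Hypothetical per-participant records for a few of the tape A items.
participants = {
    "observer_01": {"examines the thyroid": False,
                    "assesses diabetic history": True,
                    "examines lymph nodes": True},
    "observer_02": {"examines the thyroid": True,
                    "assesses diabetic history": True,
                    "examines lymph nodes": False},
    "observer_03": {"examines the thyroid": False,
                    "assesses diabetic history": False,
                    "examines lymph nodes": False},
}

# Rank participants from most to least accurate, as in the comparison of
# the 20 most and 20 least accurate evaluators above.
ranked = sorted(participants,
                key=lambda name: accuracy_score(participants[name]),
                reverse=True)
for name in ranked:
    print(f"{name}: {accuracy_score(participants[name]):.0f}%")
```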

Table 1. Faculty Accuracy in Observing the Strengths and Weaknesses of Resident A: Effect of Structured Form

Columns report the percentage of participants noting each item, by study group: Group 1 (Open-ended Form); Group 2 (Structured Form); Group 3 (Structured Form and Educational Videotape).

Resident's Strength or Weakness*

Items not prompted by structured form:
  Asks leading questions
  Interrupts patient
  Uses transition statements
  Explores symptoms of neuropathy
  Assesses diabetic history
  Assesses compliance with diabetic regimen
  Takes an adequate hypertension history
  Examines the thyroid
  Examines the spleen
  Examines the liver
  Examines for neuropathy
  Performs cerebellar examination
  Counsels the patient about diabetic diet
  Totals

Items prompted by structured form:
  Uses jargon
  Responds to verbal and nonverbal cues
  Assesses drug compliance
  Takes adequate alcohol history
  Assesses patient's diet
  Assesses patient's sexual function
  Examines the abdomen
  Does neurologic examination
  Examines lymph nodes
  Involves the patient in diagnostic and therapeutic decisions
  Totals

[The percentage values for each group did not survive extraction and are omitted here.]
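As referenced in the Methods, the weighted kappa used to assess agreement on individual items can be sketched as follows. This is a minimal illustration, not the paper's actual implementation (references 10 and 12): the chance-agreement model (pooled marginal distribution over interchangeable observers) and the treatment of poor-fair and good-excellent as the two "ends" of the four-level form scale are our assumptions; only the weights (1.0 for identical ratings, 0.75 for adjacent ratings on the same end of the scale) come from the text.

```python
from itertools import combinations

# Four-level scale from the structured form, ordered low to high.
SCALE = ["poor", "fair", "good", "excellent"]

def pair_weight(a: str, b: str) -> float:
    """Agreement weight for one pair of ratings: 1.0 for identical ratings,
    0.75 for ratings one category apart on the same end of the scale
    (poor-fair or good-excellent), and 0 otherwise."""
    i, j = sorted((SCALE.index(a), SCALE.index(b)))
    if i == j:
        return 1.0
    if j - i == 1 and (j <= 1 or i >= 2):  # adjacent, same end of scale
        return 0.75
    return 0.0

def weighted_kappa(ratings: list) -> float:
    """Weighted kappa across all observers for one skill item.

    Chance agreement is computed from the pooled marginal distribution of
    the observed ratings, treating observers as interchangeable (an
    assumption; the paper's exact chance model may differ)."""
    pairs = list(combinations(ratings, 2))
    observed = sum(pair_weight(a, b) for a, b in pairs) / len(pairs)
    n = len(ratings)
    marginal = {s: ratings.count(s) / n for s in SCALE}
    expected = sum(marginal[a] * marginal[b] * pair_weight(a, b)
                   for a in SCALE for b in SCALE)
    if expected == 1.0:  # all observers chose the same rating
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Example: one skill (say, the thyroid examination) rated by five observers.
print(weighted_kappa(["poor", "poor", "fair", "good", "poor"]))
```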
