Teaching and Learning in Medicine, 27(2), 163–173
Copyright © 2015, Taylor & Francis Group, LLC
ISSN: 1040-1334 print / 1532-8015 online
DOI: 10.1080/10401334.2015.1011654

The IDEA Assessment Tool: Assessing the Reporting, Diagnostic Reasoning, and Decision-Making Skills Demonstrated in Medical Students’ Hospital Admission Notes

Elizabeth A. Baker
Department of Internal Medicine, Rush University, Chicago, Illinois, USA

Cynthia H. Ledford Departments of Internal Medicine and Pediatrics, The Ohio State University College of Medicine, Columbus, Ohio, USA

Louis Fogg Department of Psychology and College of Nursing, Rush University, Chicago, Illinois, USA

David P. Way Office of Evaluation, Curricular Research, and Development, The Ohio State University College of Medicine, Columbus, Ohio, USA

Yoon Soo Park Department of Medical Education, University of Illinois at Chicago, Chicago, Illinois, USA

Construct: Clinical skills are used in the care of patients, including reporting, diagnostic reasoning, and decision-making skills. Written comprehensive new patient admission notes (H&Ps) are a ubiquitous part of student education but are underutilized in the assessment of clinical skills. The interpretive summary, differential diagnosis, explanation of reasoning, and alternatives (IDEA) assessment tool was developed to assess students’ clinical skills using written comprehensive new patient admission notes. Background: The validity evidence for assessment of clinical skills using clinical documentation following authentic patient encounters has not been well documented. Diagnostic justification tools and postencounter notes are described in the literature1,2 but are based on standardized patient encounters. To our knowledge, the IDEA assessment tool is the first published tool that uses medical students’ H&Ps to rate students’ clinical skills. Approach: The IDEA assessment tool is a 15-item instrument that asks evaluators to rate students’ reporting, diagnostic reasoning, and decision-making skills based on medical students’ new patient admission notes. This study presents validity evidence in support of the IDEA assessment tool using Messick’s unified framework, including content (theoretical framework), response process (interrater reliability), internal structure (factor analysis and internal-consistency reliability), and relationship to other variables. Results: Validity evidence is based on results from four studies conducted between 2010 and 2013. First, the factor analysis (2010, n = 216) yielded a three-factor solution, measuring patient story, IDEA, and completeness, with reliabilities

of .79, .88, and .79, respectively. Second, an initial interrater reliability study (2010) involving two raters demonstrated fair to moderate consensus (k = .21–.56, r = .42–.79). Third, a second interrater reliability study (2011) with 22 trained raters also demonstrated fair to moderate agreement (intraclass correlations [ICCs] = .29–.67). There was moderate reliability for all three skill domains, including reporting skills (ICC = .53), diagnostic reasoning skills (ICC = .64), and decision-making skills (ICC = .63). Fourth, there was a significant correlation between IDEA rating scores (2010–2013) and final Internal Medicine clerkship grades (r = .24), 95% confidence interval (CI) [.15, .33]. Conclusions: The IDEA assessment tool is a novel tool with validity evidence to support its use in the assessment of students’ reporting, diagnostic reasoning, and decision-making skills. The moderate reliability achieved supports formative or lower stakes summative uses rather than high-stakes summative judgments.

Keywords: clinical reasoning, medical student, assessment, clinical documentation review

Correspondence may be sent to Elizabeth A. Baker, Office of Medical Student Programs, 5th Floor Academic Facility, Rush University Medical Center, 600 S Paulina St, Chicago, IL 60612, USA. E-mail: [email protected]

INTRODUCTION

Teaching and assessing medical students’ clinical skills are major responsibilities of medical schools. A primary method for assessing such skills is through clinical performance ratings gathered by direct observation and a review of oral presentations and written documentation. Clinical documentation review is a recommended assessment method as evidenced by its inclusion in the MedBiquitous Curriculum Inventory.3


But although clinical performance ratings are almost universally used by clerkship directors as part of the grading process,4 and clinical documentation assessment is recommended by the Association of American Medical Colleges, the validity evidence for such assessments has not been well documented. Such measurements represent the highest level of assessment, the “does” of Miller’s pyramid;5 validity evidence to support their use is needed. In this article we review validity evidence gathered in support of a practical instrument used to rate medical students’ clinical skills as reflected in their written comprehensive new patient admission notes, sometimes colloquially termed H&Ps. The supporting validity evidence is summarized based on Messick’s unified validity framework, using five sources of validity: content, response process, internal structure, relationship to other variables, and consequences.6

Medical student new patient admission notes were chosen as the source for rating clinical skills for several reasons. Having written documentation is convenient and makes assessment outside the clinical setting possible. Students commonly write independent admission notes with an assessment section in which an explanation of clinical reasoning, including diagnostic justification, should be articulated. Assessing diagnostic reasoning is an integral part of the integrated clinical encounter component of the United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills Examination, highlighting the importance of this skill for all medical students, as well as the importance of valid diagnostic reasoning assessments. Finally, establishing validated documentation standards would contribute to the body of work in clinical skills assessment and has been recognized as an outstanding need in a recent study of the required written history and physical.7

The potential for using written admission notes to rate diagnostic reasoning skills was assessed in an initial study8 (1999). The vast majority of notes in this study lacked the documentation necessary for an assessment of diagnostic reasoning skills, thus severely limiting the assessment potential of the written note. The interpretive summary, differential diagnosis, explanation of reasoning, and alternatives (IDEA) framework and assessment tool9 was developed in response to this limitation.

The IDEA Framework and Assessment Tool

The IDEA framework outlines a simple organizational strategy to guide complete documentation of a physician’s clinical assessment, with the intended purpose of encouraging better articulation of diagnostic reasoning. The IDEA framework asks students to organize the assessment section of their written notes into a simple paragraph form, including four elements (Table 1): interpretive summary, differential diagnosis, explanation of reasoning, and alternatives.

TABLE 1
Elements of the IDEA framework

I   Interpretive summary
D   Differential diagnosis with commitment to the most likely diagnosis
E   Explanation of reasoning in choosing the most likely diagnosis
A   Alternative diagnoses with explanation of reasoning

In the interpretive summary, students summarize the key features of a patient’s history, the examination findings, and initial studies through use of semantic qualifiers to offer a complete yet concise representation of the patient’s problem. This summary is followed by a short list of the most likely diagnostic possibilities and a commitment to one diagnosis as most likely. The data from the interpretive summary, as well as knowledge about diseases, are then used to defend the choice of the most likely diagnosis, with alternative diagnoses being compared to the most likely diagnosis and the patient’s problem representation. The IDEA mnemonic serves as a scaffold to encourage documentation of diagnostic reasoning, specifically the reasoning that is used to explain the choice of the most likely and alternative diagnoses, and serves as the anchor for the IDEA assessment tool.

The IDEA framework and assessment tool have been used at Rush Medical College since 2003. The current IDEA assessment tool contains 12 items and three summative skill ratings (see the Appendix). Items are rated based on the presence of described features (none to minimal, some to many, most to all). Rating of documentation of the history of present illness is based on four characteristics—level of detail, descriptiveness, chronology, and contextualization. Three items rate the comprehensiveness of the past medical, social, and family history and the relevance and completeness of the physical examination. Next, evaluators are asked to rate the diagnostic reasoning articulated in the assessment section of the note using the four components of IDEA. The last item rates the documentation of the diagnostic and therapeutic plan. Raters then provide a global rating of the reporting, diagnostic reasoning, and decision-making skills documented by the student.

Organizing Frameworks

Structural semantics and reporter, interpreter, manager, educator (RIME) serve as the organizing frameworks for the IDEA assessment tool.10,11 According to structural semantics, discourses (written or oral case presentations) can be categorized based on the organization and content of the discourse:12,13 (a) reduced, or lacking in content; (b) dispersed, or having abundant content yet disorganized information; (c) elaborated, or using semantic qualifiers (acute, constant, sharp) to actively compare pertinent diagnostic possibilities, the problem space, and arrive at the correct diagnosis; and (d) compiled, or semantically rich but highly synthesized content.


Bordage and others have found that successful diagnosticians had more thorough and relevant problem representations and did more simultaneous comparing and contrasting of diagnoses than did unsuccessful diagnosticians, thus supporting the content validity of structural semantics as representing categories of diagnostic reasoning.14–16 Durning and others more recently incorporated structural semantics into a postencounter form for 2nd-year medical students completing an Objective Structured Clinical Examination station, with the purpose of measuring clinical reasoning skills.1 Validity evidence was presented to support the use of this form for the evaluation of clinical reasoning in this setting, including feasibility, interrater reliability, and comparison to other components of the Objective Structured Clinical Examination station.

The RIME framework, proposed by Pangaro and others, is a second organizing framework that serves as a basis for the IDEA assessment tool. RIME uses four categories to describe overall student clinical performance: reporter, interpreter, manager, and educator.11 The RIME framework is used nationally, and a number of studies describe its validity and reliability for characterizing overall student performance.17,18 The global ratings of reporting, diagnostic reasoning, and decision-making skills on the IDEA assessment tool are derived from the reporter, interpreter, and manager/educator RIME categorizations.19 Grounding the instrument in two accepted organizing frameworks supports the content validity of the IDEA instrument.

METHODS

Validity evidence for the IDEA framework and assessment tool was gathered over several years as part of an ongoing effort to improve medical student educational activities and assessments. Multiple studies are presented that each contribute to the body of evidence supporting the IDEA framework and assessment tool, based on Messick’s unified validity framework.20 The studies described in this paper were reviewed and exempted by Rush University Medical Center’s Institutional Review Board (ORA # 03090304).

Factor Analysis

In 2010, an exploratory factor analysis of the IDEA assessment tool was performed to assess the validity evidence supporting its internal structure. Two hundred sixteen new patient admission notes from 128 students (private midwestern U.S. medical school, 44% female, 9% underrepresented minority) were collected during the Internal Medicine Clerkship as part of the students’ usual educational activities; one of nine evaluators rated each write-up using the IDEA assessment tool.


The exploratory factor analysis methods involved conservative alpha factor analysis for extraction, promax rotation (k = 4.0) for simplifying the factor structure, and use of eigenvalues greater than 1 for determining the factor solution. Pearson correlation coefficients were used to evaluate the relationship between the factors identified on factor analysis and the three global skills ratings (i.e., reporting, diagnostic reasoning, and decision making) as a measure of concurrent validity. Data were analyzed using SPSS (IBM SPSS Statistics for Windows, Version 19.0, IBM Corporation, Armonk, NY, USA, 2010).
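For readers who want to reproduce this kind of analysis, the sketch below shows an exploratory factor analysis in Python rather than SPSS. It is illustrative only: the file name, the DataFrame of item ratings, and the minres extraction method (the factor_analyzer package does not implement SPSS-style alpha factoring) are assumptions, not the authors' code.

```python
# Illustrative sketch, not the authors' SPSS procedure: exploratory factor analysis of
# IDEA item ratings with an oblique (promax) rotation, retaining factors with
# eigenvalues > 1.  Assumes a hypothetical CSV of item scores (e.g., 0-2 per item),
# one row per admission note.
import pandas as pd
from factor_analyzer import FactorAnalyzer

ratings = pd.read_csv("idea_item_ratings.csv")    # hypothetical file name

# Step 1: inspect eigenvalues of the correlation matrix to choose the number of factors.
fa_probe = FactorAnalyzer(rotation=None, method="minres")
fa_probe.fit(ratings)
eigenvalues, _ = fa_probe.get_eigenvalues()
n_factors = int((eigenvalues > 1).sum())          # Kaiser criterion: eigenvalues > 1

# Step 2: refit with the chosen number of factors and a promax (oblique) rotation.
fa = FactorAnalyzer(n_factors=n_factors, rotation="promax", method="minres")
fa.fit(ratings)
loadings = pd.DataFrame(fa.loadings_, index=ratings.columns)
print(loadings.round(2))                          # compare with the style of Table 2
```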

Establishing Interrater Reliability: Initial Two-Rater Study

To identify the interrater reliability of the instrument and provide validity evidence with regard to the instrument’s response process, we analyzed data collected from two independent raters (EB and CL), who rated a convenience sample of students’ deidentified new patient admission notes (N = 30) in 2010. Specifically, we were interested in whether the item and skill descriptive anchors used in the instrument represented distinct categories for raters, and whether raters agreed on the ordinal quality of item and skill ratings. Notes were written by medical students from the same class described in the previous study. EB was the author of the instrument, and CL was a clinician educator who was formally trained by the first rater.

Rater agreement was examined using the kappa and Spearman’s rho statistics.21 Kappa was chosen to provide a consensus estimate of interrater reliability. Spearman’s rho was chosen to provide a consistency estimate of interrater reliability, given that the intervals between categories in our instrument were unknown and potentially unequal, and we were limiting our analysis to pairwise comparisons between two raters. Data were analyzed using SPSS 17.0 (SPSS Inc, Chicago, IL, USA).

A qualitative analysis of the data was then conducted to identify and resolve reasons for systematic differences in ratings. The two raters read a sample of new patient admission notes with discordant ratings, then described and coded the hypothesized reason for the rating disagreements thematically. The raters then discussed each rating disagreement from the sample, came to consensus about the reason for the rating disagreement and appropriate standards for each rating, and made minor changes to the form to provide clarification on item rating and prevent future rating discordance. The knowledge gained from this review was used to inform and develop training materials for raters in the subsequent study.
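A minimal sketch of the two-rater agreement statistics follows, assuming two hypothetical arrays of ordinal item ratings; the values shown are illustrative and are not the study data.

```python
# Kappa gives a consensus (exact agreement) estimate and Spearman's rho a consistency
# (rank-order) estimate, mirroring the statistics reported in Table 3.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

rater_a = np.array([2, 1, 2, 0, 1, 2, 1, 0, 2, 1])   # illustrative ratings only
rater_b = np.array([2, 1, 1, 0, 1, 2, 2, 0, 2, 1])

kappa = cohen_kappa_score(rater_a, rater_b)          # chance-corrected agreement
rho, p_value = spearmanr(rater_a, rater_b)           # rank-order consistency
print(f"kappa = {kappa:.2f}, Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```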

Establishing Interrater Reliability: 22-Rater Study

A larger study of interrater reliability was conducted in 2011. A training manual, including a test book of 15 new patient admission notes, was composed. Admission notes were chosen by the expert raters based on the level of documentation, to include notes with poor, good, and excellent skill documentation, selected from extant student notes from two midwestern U.S. medical schools—one private and one state funded.


Twenty-two faculty members from three institutions (Rush University Medical Center, Ohio State University, and the John H. Stroger Hospital of Cook County) were then trained to use the instrument during a series of 1- to 2-hour training sessions. Each faculty member completed the test book, rating the 15 new patient admission notes using the IDEA assessment tool. Interrater reliability was determined using intraclass correlation (ICC). ICC was chosen for the simultaneous analysis of multiple raters rating multiple ordinal items, using a relative ranking model. Data were analyzed using SPSS 17.0 (SPSS Inc, Chicago, IL, USA).
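The intraclass correlation could be computed as sketched below, assuming a hypothetical long-format table of ratings. The choice of a consistency-type ICC3 model is an assumption standing in for the "relative ranking model" named above; this is not the authors' SPSS syntax.

```python
# Sketch of an ICC computation for one IDEA item across many raters and notes.
import pandas as pd
import pingouin as pg

# Hypothetical long-format data: one row per (note_id, rater_id) pair with a score column.
long_ratings = pd.read_csv("idea_22rater_ratings.csv")

icc_table = pg.intraclass_corr(
    data=long_ratings, targets="note_id", raters="rater_id", ratings="score"
)
# ICC3 = two-way mixed effects, consistency, single rater (assumed model).
print(icc_table.loc[icc_table["Type"] == "ICC3", ["Type", "ICC", "CI95%"]])
```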

Comparison of IDEA Score and Final Grades

To provide evidence of a relationship to other variables, students’ IDEA assessment tool scores were compared to final Internal Medicine Clerkship grades. Data for all students receiving a final grade for the clerkship (n = 398, private midwestern U.S. medical school, 51% female, 14% underrepresented minority) from 3 academic years (2010–2011, 2011–2012, 2012–2013) were aggregated for the analysis. This was done to determine whether IDEA ratings measure skills in a manner consistent with other assessment measures. All students completed and submitted a deidentified new patient admission note of their choice at the end of the Internal Medicine Clerkship that was rated by one of three raters (instrument author and two trained faculty) using the IDEA assessment tool, and an IDEA score was calculated (0–1.0) that contributed 5% to the final grade for the clerkship. Possible final grade categories included pass, high pass, and honors. Mean IDEA scores were compared between pass, high pass, and honors groups using a linear regression model (with Helmert contrast coding of the groups to compare effect sizes between pass and high pass and between high pass and honors). Correlations between IDEA rating scores and the final Internal Medicine Clerkship grades were calculated to examine the magnitude of association between the two scores; the IDEA score component was removed from the final clerkship grade when calculating these correlations. Data were analyzed using Stata 12 (StataCorp, College Station, TX, USA).
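A hedged sketch of the grade comparison and correlation follows, in Python rather than Stata. The file, column names, and the use of patsy's Helmert coding (each grade level contrasted with the levels below it) are assumptions approximating the analysis described, not the authors' code.

```python
# Hypothetical per-student data: idea_score (0-1.0), grade category, and the final
# numeric grade with the 5% IDEA contribution already removed.
import pandas as pd
from scipy.stats import pearsonr
from patsy.contrasts import Helmert  # makes Helmert coding available to the formula
import statsmodels.formula.api as smf

students = pd.read_csv("idea_vs_grades.csv")      # hypothetical file and columns
students["grade"] = pd.Categorical(
    students["grade"], categories=["pass", "high pass", "honors"], ordered=True
)

# Mean IDEA score compared across grade groups with Helmert-coded contrasts.
model = smf.ols("idea_score ~ C(grade, Helmert)", data=students).fit()
print(model.summary())

# Correlation between IDEA scores and final grades (IDEA component removed from the grade).
r, p = pearsonr(students["idea_score"], students["final_numeric_no_idea"])
print(f"r = {r:.2f}, p = {p:.4f}")
```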

RESULTS

Internal Structure

The factor analysis yielded a three-factor solution (patient story, IDEA, and completeness factors). Table 2 shows the factor analysis results including factor loadings and scale statistics. The original three-factor solution (factor structure) accounted for 65.9% of the variance in reconstructing the original correlation matrix. The promax (oblique) rotation allowed factors to correlate (IDEA factor with patient story factor: r = .67; IDEA factor with completeness factor: r = .63; and patient story factor with completeness factor: r = .68).

Most item-factor loadings (85.7%) were greater than .60, ranging from .65 to .90. Only two items had factor loadings lower than .60: HPI-5 (Rest of History), which loaded on the completeness factor (.55), and chief complaint, which loaded on patient story (.29). Given the insufficient loading of the chief complaint item on any of the three factors, we performed the factor analysis without this item. The resulting factor structure was similar to the original with one exception—the “I” item moved from Factor 1 (the IDEA factor) to Factor 3 (the completeness factor).

Cronbach’s alpha reliability coefficients (see Table 2) provide additional support for the internal structure of the instrument, with reliability coefficients (alpha) of .88, .79, and .85 for the IDEA, patient story, and completeness factors, respectively. Most item-total scale (factor) correlations were moderate to high (r = .56–.82) and remained consistent across Analyses 1 and 2. All three scales had slightly improved reliabilities after dropping the chief complaint item (see Table 2).

Pearson correlation coefficients were used to evaluate the relationship between the three factors identified on factor analysis and the three global skills ratings (i.e., reporting, diagnostic reasoning, and decision making) as a measure of concurrent validity. The patient story factor correlated most closely with reporting skill ratings (Pearson r = +.86, p < .001). The modified IDEA factor (without the I) correlated with both diagnostic reasoning and decision-making skill ratings (Pearson r = +.75 for diagnostic reasoning and r = +.88 for decision making, p < .001). The completeness factor correlated with all three skill ratings, suggesting that completeness contributes to each of the three skills being rated (reporting: Pearson r = +.77; diagnostic reasoning: Pearson r = +.66; and decision making: Pearson r = +.55; all ps < .001). These results support the construct that the structural semantic, RIME, and IDEA concepts embedded in the instrument behave and relate to each other in the way we expected, providing validity evidence supporting the instrument’s internal structure.
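As an illustration of how the scale reliabilities above could be recomputed, the sketch below calculates Cronbach's alpha for each factor's items; the DataFrame and column names are hypothetical, with the item groupings taken from Table 2.

```python
# Cronbach's alpha per factor scale, assuming a hypothetical wide DataFrame whose
# columns use the item labels from Table 2.
import pandas as pd
import pingouin as pg

ratings = pd.read_csv("idea_item_ratings.csv")    # hypothetical item-level ratings

factor_items = {
    "IDEA": ["A", "E", "D", "I", "PLAN"],
    "Patient Story": ["HPI1", "HPI2", "HPI3", "HPI4", "CC"],
    "Completeness": ["PE1", "PE2", "PROB", "HPI5"],
}
for factor, items in factor_items.items():
    alpha, ci = pg.cronbach_alpha(data=ratings[items])
    print(f"{factor}: alpha = {alpha:.2f}, 95% CI = {ci}")
```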

Response Process

Establishing interrater reliability: Initial two-rater study. Interrater reliabilities for the item and skill ratings for the 2010 two-rater study are listed in Table 3. The level of agreement for individual items and skills ranged from a kappa of .02 to .56, with 7/15 items (47%) achieving a kappa of more than .40. In interpreting the level of agreement indicated by kappa values, values of .41 to .60 can be considered moderate,21 so that for both individual items on the instrument and for skill levels, there were fair to moderate levels of agreement. Using Spearman’s rho, the agreement for individual items on our checklist ranged between .42 and .79. Consistency estimates were moderately high for all three skill ratings, including reporting (Spearman’s r = .63), diagnostic reasoning (Spearman’s r = .74), and decision-making (Spearman’s r = .61) skills.


TABLE 2
Results of the original (Time 1) and subsequent (Time 2) exploratory factor analyses and Cronbach’s alpha reliability analyses of the IDEA assessment tool items, including factor assignment, factor loading, item-total scale score correlation, and Cronbach’s alpha if the item is deleted from the scale (N = 216, except for Problem List [PROB], N = 82). T1 = Time 1; T2 = Time 2 (chief complaint item removed); X = item not included in the Time 2 analysis.

                                            Factor      Loading       Item-Total r   Alpha if Deleted
Item   Item Wording                         T1   T2     T1    T2      T1    T2       T1    T2
A      Alternative diagnoses                1    1      .90   .89     .80   .82      .84   .83
E      Explains reasoning                   1    1      .86   .86     .81   .81      .84   .84
D      Differential diagnosis               1    1      .76   .82     .76   .78      .85   .85
I      Interpretive summary                 1    3      .65   .68     .58   .76      .89   .85
PLAN   Plan                                 1    1      .73   .69     .66   .63      .87   .90
Factor 1 (IDEA) summary: Eigenvalues T1 = 6.77, T2 = 6.70; Cronbach's alpha T1 = 0.88, T2 = 0.89
HPI1   Detailed                             2    2      .85   .79     .70   .70      .71   .77
HPI2   Descriptive                          2    2      .76   .78     .67   .69      .72   .77
HPI3   Chronologic                          2    2      .71   .77     .67   .70      .72   .77
HPI4   Complete                             2    2      .66   .64     .56   .55      .76   .83
CC     Chief complaint                      2    X      .29   X       .27   X        .83   X
Factor 2 (Patient Story) summary: Eigenvalues T1 = 1.32, T2 = 1.28; Cronbach's alpha T1 = 0.79, T2 = 0.83
PE1    Complete                             3    3      .79   .76     .74   .75      .78   .85
PE2    Key findings                         3    3      .87   .87     .80   .79      .75   .84
PROB   Problem list                         3    3      .77   .77     .62   .72      .85   .86
HPI5   Complete comprehensive history       3    3      .55   .58     .61   .59      .84   .88
Factor 3 (Completeness) summary: Eigenvalues T1 = 1.14, T2 = 1.13; Cronbach's alpha T1 = 0.85, T2 = 0.88

Note: Data from 216 new patient admission notes from 128 students (private midwestern U.S. medical school, 44% female, 9% underrepresented minority) were collected during the internal medicine clerkship as part of the students’ usual educational activities; one of nine evaluators rated each write-up using the interpretive summary, differential diagnosis, explanation of reasoning, and alternatives (IDEA) rating instrument.

TABLE 3
Results of the initial interrater reliability study (2010), two raters, using kappa and Spearman’s rho

Item   Item Description                             Kappa   CI            Spearman’s Rho   CI
Agreement between ratings—individual items
1      History of present illness (HPI)—Detail      .02     –.22, .27     .50              .24, .76
2      HPI—Description                              n/a     n/a           .42              .14, .69
3      HPI—Chronology                               .44     .21, .67      .58              .30, .86
4      HPI—Context                                  .23     –.02, .48     .50              .23, .76
5      Rest of history complete                     .21     –.04, .46     .40              .08, .72
6      PE—complete                                  .32     .02, .62      .45              .14, .76
7      PE—pertinent                                 .50     .25, .75      .68              .47, .88
8      I—interpretive summary                       .26     .03, .49      .64              .42, .86
9      D—differential                               .56     .36, .76      .80              .63, .97
10     E—explanation                                .41     .18, .64      .70              .50, .90
11     A—alternatives                               .46     .23, .68      .77              .60, .94
12     Plan                                         .29     .06, .52      .65              .45, .85
Agreement between skills ratings
13     Reporting skills                             .44     .19, .69      .63              .40, .86
14     Diagnostic reasoning skills                  .49     .25, .73      .74              .61, .87
15     Decision-making skills                       .35     .11, .56      .61              .40, .82

Note: Results based on a convenience sample of student deidentified new patient admission notes, N = 30. All students were from a single private midwestern U.S. medical school, 44% female, 9% underrepresented minority. CI = confidence interval.


So, in the initial effort to establish interrater reliability, fair to moderate consensus between the two raters was demonstrated, as well as moderately high consistency of the skill ratings.

A qualitative analysis was conducted on a sample of patient notes and ratings in which there was rating discordance. Three categories of disagreement emerged—completeness, pertinence, and stringency. For example, there was disagreement on the level of documentation required to meet the “most or all elements” category for detailed history of present illness (HPI), as well as key physical examination findings, with one rater consistently more stringent than the other. This particular difference was coded to stringency. An example of a difference coded to “pertinence” was the rating of the item “uses high level evidence to support diagnostic testing and treatment plans,” where one rater accepted evidence for less pertinent tests and plans and the other sought the presence of evidence for the most pertinent decisions. Other differences related to how many elements were necessary to achieve completeness or pertinence at the levels of “some” or “most or all,” particularly when these two concepts were combined in a single item, where increased completeness may be associated with decreased pertinence.

As a result of this review, several changes to the rating instrument were made. Under “Written History,” the name of the item covering the past, family, and social history within the HPI was changed from “complete-pertinent,” a phrase resulting in different interpretations, to “contextualized,” with a more detailed explanation of the documentation expected. Under “Written Physical Examination Findings,” the second item, originally listed as “Pertinent positives and negatives,” was changed to “Key physical examination findings” and included a more detailed explanation of the documentation expected. For “Decision-making skills,” the definition of excellent was changed from “uses high level evidence to support diagnostic testing and treatment plans” to “uses evidence to support most important diagnostic testing and treatment plans.” Finally, because of factor analysis results and significant differences in the use of the problem list and interpretation of an acceptable chief complaint statement, a decision was made to remove these two items from the assessment tool.

Establishing interrater reliability: 22-rater study. In 2011, a larger study involving 22 raters was conducted. Intraclass correlations, using a relative ranking model, were calculated for the item and skill ratings of the 15 notes using the IDEA assessment tool. Intraclass correlations for the IDEA assessment tool items were fair to moderate, ranging from .29 (rest of history item) to .67 (reasoning for alternative diagnoses item). Intraclass correlations for the IDEA items ranged from .35 to .67; the interpretive summary (I) was the only item in IDEA with an intraclass correlation (ICC) less than .4.

There was higher but still moderate reliability for all three skill domains, including reporting skills (ICC = .53), diagnostic reasoning skills (ICC = .64), and decision-making skills (ICC = .63).

The accuracy of raters was assessed. Individual rater accuracy was calculated by comparing rater scores for each rater (N = 22) to the “correct” answers as determined by the two expert raters. Rater accuracy across all cases varied between 53% and 75%, with two raters having accuracies below 60% and eight raters having accuracies of more than 70%. The degree of accuracy for each note across raters was also assessed. Accuracy varied from 55% to 86% for each note, across all items and raters. There were five notes with an overall rating agreement over 75% for all items and raters. Four out of five patient notes with a high level of accuracy demonstrated a student performance level for written documentation and reasoning that was uniformly poor or excellent. For the three notes with overall percentage agreement less than 60%, all had variability in the quality of documentation or clinical reasoning within the note. In two of these three notes there was a long HPI that was detailed but not always relevant to the differential diagnosis; the assessment contained a long list of potential diagnoses that were not ordered from most to least likely, and plans were not listed based on the differential diagnosis. Evaluators were not in agreement as to the history or physical examination ratings for these notes, using all three rating categories. In rating skills for these notes, raters were fairly evenly split between early (1) and good (2) for reporting, diagnostic reasoning, and decision-making skills, which demonstrates that despite variable item ratings, evaluators were able to distinguish broadly between the highest level and lower level documentation of skills and suggests that further standard setting might help raters to distinguish between the early and good skill categories.

Relationship to Other Variables

Comparison of IDEA score and final grades. Table 4 shows the descriptive statistics of the IDEA rating scores and the final grades. Correlations between IDEA rating scores and final grades are also shown in Table 4. The overall correlation was .24, 95% confidence interval [.15, .33]. There were significant differences in mean IDEA rating scores by students in the honors, high pass, and pass groups, F(2, 395) = 20.54, p < .001. Students receiving high pass had significantly higher IDEA rating scores than students receiving a pass (5.4% difference, p = .013), and students receiving honors had significantly higher IDEA rating scores than students receiving high pass (6.6% difference, p < .001).


TABLE 4
IDEA rating score and final internal medicine clerkship grades: Descriptive statistics and correlations

Academic Year   n     IDEA Rating M (SD)   Final Grade M (SD)   Correlation   95% CI        Pass M (SD)   High Pass M (SD)   Honors M (SD)
2010–2011       126   .77 (.16)            .71 (.08)            .23           [.06, .39]    .72 (.17)     .80 (.14)          .85 (.15)
2011–2012       143   .76 (.16)            .72 (.09)            .27           [.11, .42]    .72 (.16)     .79 (.15)          .85 (.13)
2012–2013       129   .77 (.15)            .68 (.07)            .26           [.09, .42]    .74 (.14)     .78 (.16)          .85 (.12)
Combined        398   .77 (.16)            .70 (.08)            .24           [.15, .33]    .73 (.16)     .79 (.15)          .85 (.13)

Note: Interpretive summary, differential diagnosis, explanation of reasoning, and alternatives (IDEA) scores range from 0 to 1.0. Final grade categories include pass, high pass, and honors. Data from all students completing the internal medicine clerkship from a private midwestern U.S. medical school (51% female, 14% underrepresented minority) from 3 academic years (2010–2011, 2011–2012, 2012–2013). CI = confidence interval.

DISCUSSION

To our knowledge the IDEA assessment tool is the first published tool that uses medical students’ H&Ps to rate students’ clinical skills, specifically reporting (reporter), diagnostic reasoning (interpreter), and decision-making (manager/educator) skills. We have provided validity evidence to support the use of the IDEA assessment tool, including content, internal structure, response process, and relationship to other variables.20 This tool has the potential to be a useful addition to our current armamentarium. It is a simple and practical instrument that can be used in the context of patient care, using real patient notes. In addition, the basis of assessment is the written record, an often underutilized, readily available resource for teaching and feedback. Our discussion focuses on what we view as the key insights, including practical recommendations for use, followed by a discussion of limitations and future directions.

Diagnostic Reasoning and Justification

Use of the IDEA framework allows for the teaching and assessment of clinical reasoning linked to student documentation. The work of Cianciolo and others suggests that biomedical knowledge and clinical cognition are two distinct and complementary tools that contribute to the emergence of effective diagnostic reasoning strategies.22 The IDEA assessment tool reinforces clinical cognition (purposeful data gathering and reporting of data) and may help build diagnostic justification. Similar tools have been applied to postencounter notes written after standardized patient encounters.1,2 Williams and Klamen’s diagnostic justification tool2 focused on similar aspects of diagnostic reasoning as the IDEA items, namely, differential diagnosis, recognition and use of key findings, thought processes, and clinical knowledge utilization. They reported intraclass correlations between .64 and .75, using nine standardized patient encounters rated by two raters. These higher intraclass correlations might be explained by the limited number of raters and more controlled case environment, compared to our study using more raters and authentic patient care environments.

The IDEA assessment tool contains content related to the diagnostic justification section of the USMLE Step 2 CS integrated clinical encounter note, a new and evolving licensure measure. Student performances as measured using IDEA assessments have the potential to inform educators about student readiness for USMLE Step 2 CS. In addition, by requiring students to articulate their thoughts, IDEA assessments have the potential to give us insight into their thoughts, and thus to enhance our ability to better teach, assess, and study students’ clinical skills related to diagnostic justification in patient care settings.

Interrater Reliability and Rater Accuracy

This is one of the first studies to comprehensively examine rater characteristics, including interrater reliability, for a patient note involving diagnostic reasoning. The interrater reliability of the instrument was only moderate, which is not uncommon for a performance measure, and is felt to be acceptable in a low stakes environment.21 A higher level of reliability is needed before this instrument will be useful for higher stakes assessments. A qualitative review demonstrated that judgments about relevance and completeness of information led to many disagreements in ratings. In addition, in the 22-rater study, we found that accuracy was quite variable despite training. These results should be considered by those developing rubrics for patient notes as they make decisions about the length and type of faculty development planned. The higher level of reliability reached in Williams and Klamen’s study suggests that reliability can be improved by limiting the number of raters to a group of faculty who are well trained and agree on the goals and expectations. Whether comparable results can be attained in authentic patient care settings should be the subject of future studies.

The interrater reliability varied considerably from item to item on the form, with some individual items having greater reliability between raters than others. In particular, there was less agreement regarding the interpretive summary than for other items.


Although clerkship directors and medical educators might agree that an excellent interpretive summary is synthetic and requires a complete yet concise representation of the patient’s problem, the low interrater reliability suggests that clinicians are not in strong agreement about how best to represent patient problems in a written assessment. It is possible that more clearly defining the characteristics of an excellent problem statement and providing additional faculty development will attenuate this finding. Future studies are needed to explore this possibility.

It is notable that in both the two-rater study and the 22-rater study, with one exception, all of the global skill level items (reporting [reporter], diagnostic reasoning [interpreter], decision making [manager/educator]) achieved moderate rater agreement. Similar to our findings, Govaerts and others found that raters achieved higher reliability when rating global skills items on workplace-based assessments than when rating items related to specific behaviors.23 Our instrument, like other workplace-based assessments, serves a dual purpose—a lower stakes summative purpose, perhaps best met by the more global skills items that demonstrate a higher level of reliability, and a formative purpose, one that guides learning, that may be better met by the more specific behavioral items.

We found that rater accuracy varied depending on the quality of the patient note, with extremes of skill documentation rated much more consistently than moderate or variable skill documentation. Discrimination in these areas is a complex task, requiring raters to use more complex reasoning and decision-making strategies than when judging a consistently high or low performance.23 Qualitative studies of rater discordance are needed to analyze the cognitive processes that raters use to make these judgments, with the goal of identifying underlying reasons for discordance and educating raters to attenuate these disagreements.

We were deliberate in creating a form based on existing organizing frameworks, for use in authentic settings, with reasonable training time. This should be considered when deciding how best to integrate this and other types of assessments into the educational program. Based on the higher reliability of overall skill ratings, and support from the literature, our educators currently use the global skill ratings on the IDEA form for low-stakes summative purposes, using a limited number of raters.

Relationship of IDEA Ratings and Final Clerkship Grades

The relationship of IDEA ratings with final clerkship grades demonstrated a significant association. This finding supports the assertion that IDEA ratings measure a dimension of a student’s clinical performance. Correlation of IDEA ratings with other measures of student skills performance is needed to gain insight into the relationship between the skills demonstrated in notes and other measures of performance.

Limitations

Currently, the IDEA assessment tool is used in settings in which the IDEA framework is taught as part of clinical reasoning. Results may differ significantly in settings in which the IDEA framework is not taught; this comparison should be studied. A larger multi-institutional study is needed to assess the performance of the instrument in diverse settings with diverse groups of learners and raters.

This article reflects a series of related studies attempting to improve the teaching and assessment of clinical reasoning through review of clinical documentation gathered from the patient care activities of medical students. The instrument evolved from 2010 to 2013, informed by the results of earlier studies. The instrument used in 2013 was not identical to the form used in the factor analysis and two-rater reliability study. Specifically, the 2013 form did not have items related to the chief complaint and problem list, and other items had minor wording changes. The factor analysis was not repeated on the new form before its use in the 22-rater study, based on an assumption that the changes made were relatively minor in regard to construct and content, but this was not proven. An exploratory factor analysis using the new form would confirm or refute the assumption that the practical changes made in the form did not significantly alter the construct or content.

We recognize that a student’s note may not be a completely independent work; the work of or coaching by other members of the healthcare team may influence the quality of a student’s note without necessarily resulting in enduring advancement of the student’s skills. As a result of this limitation we might observe less agreement between measures of note quality and other measures of student performance. The significant relationship between final clerkship grades and IDEA ratings, however, supports the assumption that the student’s written documentation is, at least in part, an accurate reflection of student skill level.

Finally, although a significant relationship between IDEA rating and internal medicine (IM) clerkship grade was found, the variability in final grade rating for the IM clerkship was lower than expected. Future studies will be needed to investigate this lack of variability and to reevaluate how IDEA ratings are weighted and used in practice.

Future Directions

Of great interest to our group is to further explore the educational impact of the use of this rating instrument on learning, thus providing evidence of the positive consequences of its use. Using the IDEA assessment tool has the potential to improve the amount and type of feedback students receive regarding their write-ups. It is possible that use of this instrument will lead to an increased emphasis on teaching and learning around diagnostic reasoning and decision making, and that learners who have deficits in these areas will be identified and remediated. Whether this occurs has not yet been determined and should be the focus of future studies.


Further study is also needed to determine whether the use of the IDEA assessment tool will result in improved performance on measures such as the clinical encounter portion of the USMLE Step 2 CS examination.

We hypothesize that additional faculty training, with standard setting for item rating stringency, particularly for items involving relevance and completeness of information, will increase the reliability of the instrument ratings and rater accuracy. It has been demonstrated that disagreements in the assessment of performance at times stem from disagreements about values rather than facts and that raters use internal as well as external standards to judge performance.24 The IDEA assessment tool forces raters to be specific in their recording and documentation of performance, thus having the potential, with rater training and the internalization of rating standards, to increase accountability and consistency when compared to other clinical performance assessments (i.e., clinical ratings). Future studies are planned to study the effect of training and rater drift, as well as associations and comparisons with other assessments.

CONCLUSIONS

Valid assessments of clinical skills are needed. In this article we have presented validity evidence in support of the IDEA assessment tool and have outlined plans for additional study to support its use. We hope that medical educators will find the IDEA assessment tool to be a valuable addition to the currently available tools and that it will be used to help educators to improve both the learning and assessment of these vital clinical skills in our students.

ACKNOWLEDGMENTS

Preliminary data on the item analysis portion of this study were presented at the Central Group on Educational Affairs annual meeting in 2011. We thank the general internal medicine and hospitalist faculty from the Departments of Internal Medicine at Rush University Medical Center, the John H. Stroger Hospital of Cook County, and Ohio State University for their participation in this study.

FUNDING

This work was supported by a grant from the Central Group on Educational Affairs, Association of American Medical Colleges.

REFERENCES

1. Durning SJ, Artino A, Boulet J, La Rochelle J, Van der Vleuten C, Arze B, et al. The feasibility, reliability, and validity of a post-encounter form for evaluating clinical reasoning. Medical Teacher 2012;34:30–7.
2. Williams RG, Klamen DL. Examining the diagnostic justification abilities of fourth-year medical students. Academic Medicine 2012;87:1008–14.
3. MedBiquitous Curriculum Inventory Standardized Vocabulary Subcommittee. Curriculum inventory standardized instructional methods, assessment methods and resource types. Washington, DC: Association of American Medical Colleges. Available at: http://www.medbiq.org/curriculum_inventory. Accessed September 1, 2012.
4. Hemmer P, Papp K, Mechaber A, Durning S. Evaluation, grading, and use of the RIME vocabulary on internal medicine clerkships: Results of a national survey and comparison to other clinical clerkships. Teaching and Learning in Medicine 2008;20:118–26.
5. Miller GE. The assessment of clinical skills/competence/performance. Academic Medicine 1990;65(9 Suppl):S63–7.
6. Messick S. Standards of validity and the validity of standards in performance assessment. Educational Measurement: Issues and Practice 1995;14:5–8.
7. Ratcliffe TA, Hanson JL, Hemmer PA, Hauer KE, Papp KK, Denton GD. The required written history and physical is alive, but not entirely well, in internal medicine clerkships. Teaching and Learning in Medicine 2013;25:10–4.
8. Baker EA, Connell K, Bordage G, Sinacore J. Can diagnostic semantic competence be assessed from the medical record? Academic Medicine 1999;74(10 Suppl):S13–5.
9. Baker E. Challenging students to expose their thoughts in write-ups: The IDEA method. Journal of General Internal Medicine 2003;18(Suppl 1):235.
10. Bordage G, Lemieux M. Semantic structures and diagnostic thinking of experts and novices. Academic Medicine 1991;66(9 Suppl):S70–2.
11. Pangaro L. A new vocabulary and other innovations for improving descriptive in-training evaluations. Academic Medicine 1999;74:1203–7.
12. Lemieux M, Bordage G. Propositional versus structural semantic analysis of medical diagnostic thinking. Cognitive Science 1992;16:187–204.
13. Bordage G. Elaborated knowledge: A key to successful diagnostic thinking. Academic Medicine 1994;69:883–5.
14. Connell K, Bordage G, Gecht M, Chang R. Assessing clinicians' quality of thinking and semantic competence: A training manual. Department of Medical Education, University of Illinois at Chicago, 1998.
15. Bordage G, Connell K, Chang R, Gecht M, Sinacore J. Assessing the semantic content of clinical case presentations: Studies of reliability and concurrent validity. Academic Medicine 1997;72(10 Suppl):S37–9.
16. Chang R, Bordage G, Connell K. The importance of early problem representation during case presentations. Academic Medicine 1998;73(10 Suppl):S109–11.
17. Battistone M, Milne C, Sande MA, Pangaro LN, Hemmer PA, Shomaker TS. The feasibility and acceptability of implementing formal evaluation sessions and using descriptive vocabulary to assess student performance on a clinical clerkship. Teaching and Learning in Medicine 2002;14:5–10.
18. DeWitt D, Carline J, Paauw D, Pangaro L. Pilot study of a RIME-based tool for giving feedback in a multi-specialty longitudinal clerkship. Medical Education 2008;42:1205–9.
19. Baker E, Riddle J. IDEA in evolution: An attempt to use RIME to more accurately assess medical students' admission write-ups. Journal of General Internal Medicine 2005;20(Suppl 1):157.
20. Downing S, Haladyna T. Validity and its threats. In S. Downing, R. Yudkowsky (Eds.), Assessment in health professions education (pp. 21–55). New York, NY: Routledge, 2009.
21. Axelson R, Kreiter C. Reliability. In S. Downing, R. Yudkowsky (Eds.), Assessment in health professions education (pp. 57–73). New York, NY: Routledge, 2009.
22. Cianciolo AT, Williams RG, Klamen DL, Roberts NK. Biomedical knowledge, clinical cognition and diagnostic justification: A structural equation model. Medical Education 2013;47:309–1.
23. Govaerts MJ, Schuwirth LW, van der Vleuten CP, Muitjens AM. Workplace-based assessment: Effect of rater expertise. Advances in Health Sciences Education 2011;16:151–65.
24. Govaerts MJ, van der Vleuten CP, Schuwirth LW, Muitjens AM. Broadening perspectives on clinical performance assessment: Rethinking the nature of in-training assessment. Advances in Health Sciences Education 2007;12:239–60.

APPENDIX

[IDEA assessment tool rating form]
