Journal of Medical Systems, Vol. 16, No. 5, 1992

Expert Systems and Expert Behavior

Walton Sumner II and Edward K. Shultz

Iliad 4.0 and QMR 2.03 are computer-based diagnostic knowledge bases that can play many roles in decision support and other areas of medical practice, but neither appears ready to assume the role of an expert diagnostic consultant. In contrast to human experts, these programs have problems related to recognition of their own limitations, interpretation of continuous data, recognition of dependent findings, selection of tests, and description of the impact of certain tests. Suggestions to improve these aspects of knowledge bases are offered.

From the Program in Medical Information Science, Dartmouth Medical School, Hanover, New Hampshire 03755.
0148-5598/92/1000-0183$06.50/0 © 1992 Plenum Publishing Corporation

INTRODUCTION

Faculty at Dartmouth Medical School would like to introduce students to computerized diagnostic support tools during their internal medicine inpatient clerkship, in anticipation of a growing role for computers in decision support, diagnostic consultation, and quality assurance. Two commercially available knowledge bases designed to cover the domain of inpatient internal medicine are Iliad [1], from Applied Informatics Inc. and the University of Utah, and Quick Medical Reference, or QMR [2], from Camdat Corporation and the University of Pittsburgh. Iliad was selected for this project because it is available for the Macintosh computers installed at the Dartmouth campus, and because some faculty are interested in teaching the probabilistic concepts explicitly incorporated in Iliad. In support of this project, we attempted to introduce Iliad 3.0 to the general internal medicine faculty and residents at the Dartmouth-Hitchcock Medical Center by using it as a consultant for weekly morbidity and mortality rounds. While students appreciated Iliad's reference features, a number of diagnostic difficulties arose, generating questions about Iliad's preparedness to serve as a general purpose diagnostic consultant. These difficulties prompted closer examination of Iliad and QMR to identify the sources of problems and possible solutions. The following observations and suggestions apply to Iliad 4.0 and QMR 2.03, both released this spring.

Previous observations regarding independence of categorical findings [3,4], the mapping of clinically uncertain findings, self-explanatory features, and temporal reasoning [5] remain pertinent to these knowledge bases. Some of the current comments restate principles that were suggested by early investigations in medical artificial intelligence but that these programs do not yet fully incorporate [6]. In each problem area described, the expert system's behavior is unlike human expert behavior, and may lead to errors that clinicians will not anticipate or easily detect. We offer suggestions for improving problematic aspects and enhancing some existing features of these programs.

METAKNOWLEDGE

Physicians have rated self-knowledge among the most important design features for computer-based consultant systems [6]. Although human experts routinely excuse themselves from addressing some domains of medicine, neither Iliad nor QMR recognizes its limitations during consultation, and each proposes differentials without notification that the knowledge base is unsuited to the problem at hand. QMR's broader range of common diseases and categorical limitation on age (16 is the youngest age possible) reduce the opportunity for grossly incomplete differentials. QMR includes a list of 24 classes of diseases absent from the knowledge base, but it cannot determine when a patient might have a disease on that list.

Warning profiles could provide a relatively simple means of dealing with an incomplete knowledge base. These "disease profiles" would become likely when a case strays from the known domain. For instance, Iliad might place a warning profile in the differential when a patient's age is less than 16, when a woman has abdominal pain (Iliad lacks most obstetrical and gynecologic diseases), or when a woman has manifestations of cancer (Iliad lacks breast and gynecologic cancers). Warnings could explain their appearance, and provide hints regarding diseases absent from the knowledge base (Figure 1). A warning profile, while non-trivial to assemble, requires far less effort than expanding the knowledge base. Because an incomplete set of warning profiles might lend false reassurance to a clinician consulting the knowledge base, these profiles should encompass most known deficits at their first introduction.

Accurate Calculators

Another characteristic of human experts is the ability to explain their reasoning process. Physicians have identified this as their most important demand for consultation systems [6]. These programs can display findings recorded in a case, diseases related to a finding, or a critique of any selected disease in light of the known findings. Although helpful, none of these features explains the reasoning process that generated the differential. Iliad's disease description profiles can, however, display Bayes/Boolean calculators, where one can almost reproduce Iliad's derivation of the posterior probability of an illness. The calculator uses Bayes' theorem for Bayesian profiles. It begins with the prior probability of a disease and sequentially adjusts the posterior probability using the sensitivity and specificity of each new finding entered. When Iliad generates a differential, however, it uses this slightly adjusted value:

    p_disease = (p_Bayes - prevalence) / (1 - prevalence),

[Figure 1 (screen mock-up). Domain Alert: Cancer in women. A priori: not available. Closeness to true = 1.0. True if A and B > 0.15.
Findings: A. Sex is female (Yes); B. Signs of metastatic cancer (0.17).
Domain Alert: This version of Iliad cannot estimate the probability of several important cancers in women. Some common cancers that are not in the database are:

    Disease           Helpful diagnostic maneuvers (unordered)
    Breast Cancer     History of breast mass; previous history of breast cancer; family history of breast cancer; breast exam (axillary or breast mass); fine needle aspiration; mammogram; biopsy
    Ovarian Cancer    Pelvic exam (adnexal mass); ultrasound; CT scan; biopsy
    Cervical Cancer   Pelvic exam; Pap smear; colposcopy]

Figure 1. A warning frame would become a leading diagnosis when findings suggest diseases in an area that is not explicitly described by the knowledge base. The warning frame should provide some direction, if only to reference a more useful resource. In this case, the warning provides a plausible list of malignancies common in women, and common diagnostic maneuvers. A warning frame is easier to assemble than a disease frame, and more informative than ignoring gaps in the knowledge base.

where p_disease is Iliad's proxy for posterior probability, p_Bayes is the Bayesian posterior probability, and prevalence is the prevalence of the disease in the population. This adjustment has the useful effect of keeping common diseases out of the differential when there is no direct evidence for them (p_disease = 0 when p_Bayes = prevalence). For instance, Left Sided Heart Failure has a 5% prevalence, and on that basis alone might appear more likely than many illnesses more relevant to a case. The adjustment is also similar to Iliad's algorithm for deterministic disease profiles. As p_Bayes approaches 1.0, the values of p_disease and p_Bayes are nearly the same. When p_Bayes is close to the prevalence, p_disease and p_Bayes are markedly different, and the calculator does not predict the posterior probability that Iliad will display. Iliad's Bayes/Boolean calculators could display both the p_Bayes and the adjusted p_disease values. The first value would still allow students to check their own use of test characteristics, while the second would allow students to reproduce Iliad's behavior.

The following example illustrates why the calculator should follow the inference engine's rules. Left Sided Heart Failure (LSHF) is a Bayesian disease profile that depends upon the values of three intermediate Bayesian profiles, including Left Ventricular Enlargement (LVE). Left ventricular enlargement on chest X-ray is a finding in the LVE profile. Test characteristics for the X-ray finding are displayed, but they pertain to LVE, not LSHF. What are the implications of this X-ray finding for the disease of interest, LSHF? This seems like a question to answer using Iliad's calculators (although a feature to answer it automatically would be welcome). If a student enters a negative finding in the calculator for the intermediate profile (LVE), then enters the new, lower-than-prevalence probability of LVE into the LSHF calculator, the probability of LSHF increases! The calculator assumes that the baseline probability of the embedded profile is zero, rather than its prevalence, and does not use the adjusted probability figure.

QMR does not have a calculator, and uses an ad hoc scoring algorithm. Although it has a diagnosis critiquing function, QMR has no obvious means of recreating its scoring procedure.
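The difference between the calculator's Bayesian value and the adjusted value used in the differential can be sketched in a few lines of Python. This is only an illustration of the formula quoted above, not Iliad's code: the function names are hypothetical, and the example test characteristics are arbitrary. Note that the formula yields a negative number when p_Bayes falls below the prevalence; the text does not say how Iliad treats that case, so the sketch simply returns the raw value.

```python
def bayes_update(prior, tpr, fpr, present):
    """Posterior probability of a disease after one finding (Bayes' theorem).

    tpr = P(finding | disease), fpr = P(finding | no disease).
    """
    if present:
        numerator = tpr * prior
        denominator = tpr * prior + fpr * (1.0 - prior)
    else:
        numerator = (1.0 - tpr) * prior
        denominator = (1.0 - tpr) * prior + (1.0 - fpr) * (1.0 - prior)
    return numerator / denominator


def sequential_posterior(prevalence, findings):
    """Start at the prevalence and apply each (tpr, fpr, present) in turn."""
    probability = prevalence
    for tpr, fpr, present in findings:
        probability = bayes_update(probability, tpr, fpr, present)
    return probability


def adjusted_probability(p_bayes, prevalence):
    """The adjusted value described in the text:
    (p_Bayes - prevalence) / (1 - prevalence).

    Returns 0 when p_Bayes equals the prevalence and 1 when p_Bayes is 1.
    The result is negative if p_Bayes < prevalence (handling of that case
    is not described in the source).
    """
    return (p_bayes - prevalence) / (1.0 - prevalence)


if __name__ == "__main__":
    prevalence = 0.05                    # e.g. a common disease
    findings = [(0.50, 0.20, True)]      # illustrative TPR, FPR, finding present
    p_bayes = sequential_posterior(prevalence, findings)
    print("p_Bayes   =", round(p_bayes, 4))
    print("p_disease =", round(adjusted_probability(p_bayes, prevalence), 4))
```

A calculator that reported both numbers side by side would let a student verify the Bayesian arithmetic and also see the value that actually ranks the disease in the differential.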

Characteristics of Continuous Data

Physicians, laboratories, and medical literature generally describe the characteristics of quantitative tests in terms of cut off values, and these knowledge bases use the same approach. For example, Iliad considers a CPK value above 194 IU/L to be evidence for an acute MI, with a true positive rate (TPR) = .50 and a false positive rate (FPR) = .20. QMR accepts two CPK values: above 200 U/L, and not above 200 U/L. CPK values above 200 have an evoking strength (EV, a measure of prevalence and test specificity) of 2 and a frequency (FR, analogous to sensitivity) of 4 for Myocardial Infarction. (EV varies from 0 to 5, and FR varies from 1 to 5; the number 2 suggests a value between .06 and .35, and the number 4 suggests a value between .63 and .97.)

These test characteristics do not help to interpret borderline values. An expert interpreting CPK values of 194 and 195 would consider the proximity of the values to the cut off, which raises appropriate uncertainty. Iliad and QMR differentials respond predictably to borderline values. Acute MI may rise to the top of a differential when the CPK is 1 unit above the cutoff, and may disappear if the CPK is 1 unit below.

Interpretation of CPK values is still more complex, because very high values suggest Polymyositis, Rhabdomyolysis, and muscular dystrophies. Iliad's Polymyositis disease profile partially accommodates the wide distribution of CPK values in this disease by describing two value ranges for CPK. Values between 200 and 1,000 have different test characteristics (TPR = .30, FPR = .01) compared to CPK values over 1,000 (TPR = .60, FPR = .001). QMR includes the elevated CPK finding in a Polymyositis/Dermatomyositis profile, with EV = 1 and FR = 4.

Finally, although any value above 194 suggests Acute MI, there may be physiologic constraints on CPK levels in patients who survive an Acute MI. Since the skeletal muscle mass contains a larger reservoir of CPK, a sufficiently elevated CPK level might suggest Rhabdomyolysis, but not an isolated Acute MI. These problems are illustrated by the hypothetical frequency distribution functions in Figure 2. CPK values in normal populations form a skewed curve with a long narrow tail [7].

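[Figure 2 (graph). Hypothetical frequency distributions of CPK values. Horizontal axis: CPK value, with points A, B, and C marked. Vertical axis: frequency. Legible curve labels: Normal, Rhabdomyolysis.]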

Figure 2. The probability of different diseases changes as measured CPK values increase, as shown in this hypothetical probability distribution function. At CPK value A, only 1% of individuals have Acute MI. At value B, about 50% of individuals have MIs, and at point C, about 80%. At higher values, the likelihood of Acute MI falls, while the likelihood of polymyositis and rhabdomyolysis rises.

CPK values in Acute MI presumably have a skewed distribution, with some values below 200, and a small tail of values exceeding 1,000. CPK values in Rhabdomyolysis probably form a third skewed distribution with nearly all values above 1,000. From this graph, one can derive the likelihood ratio positive of any CPK value (x) for detecting any of these health states:

    LR+ = (frequency of this disease at or beyond value x) / (frequency of all other health states at value x).

Incorporating frequency distributions in knowledge bases may require significant re-engineering. Many findings would need sets of equations, or a large number of value ranges, to describe test characteristics of quantitative findings. Users might also need a means of visualizing the distribution of findings when they browse through disease profiles (e.g., a frequency distribution graph).
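As a rough sketch of how a knowledge base might store a distribution per health state instead of a cut off, the Python below computes a likelihood ratio for any CPK value. Everything numeric here is an assumption made for illustration: the log-normal shape, the medians and spreads, and the population shares are invented, and the ratio is taken between densities at the value x (one reasonable reading of the definition above, not necessarily the authors' exact formula).

```python
import math


def lognormal_pdf(x, median, sigma):
    """Density of a log-normal distribution with the given median and log-scale spread."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - math.log(median)) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


# Hypothetical health states: (median CPK, log-scale spread, share of tested patients).
STATES = {
    "Normal": (80.0, 0.5, 0.90),
    "Acute MI": (400.0, 0.7, 0.08),
    "Rhabdomyolysis": (5000.0, 0.8, 0.02),
}


def likelihood_ratio(x, disease, states=STATES):
    """Density of CPK value x in `disease` divided by its density in the
    weighted mixture of all other health states."""
    median, sigma, _ = states[disease]
    numerator = lognormal_pdf(x, median, sigma)
    others = [v for name, v in states.items() if name != disease]
    total_share = sum(share for _, _, share in others)
    denominator = sum(
        (share / total_share) * lognormal_pdf(x, m, s) for m, s, share in others
    )
    return numerator / denominator


if __name__ == "__main__":
    for cpk in (194, 195, 1000, 20000):
        print(cpk, {name: likelihood_ratio(cpk, name) for name in STATES})
```

With a smooth function of this kind, CPK values of 194 and 195 produce nearly identical evidence for Acute MI, and very high values shift support toward Rhabdomyolysis, which is the behavior the text asks for.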

DESCRIBING DEPENDENCE UPON CONTINUOUS VARIABLES

Early versions of Iliad and Internist-1 treated some closely related findings as independent events. This would "double count" elements of clinical constellations and would thus overestimate the probability of a specific disease. Iliad 4.0 has eliminated many of these by using logical "or" operations to choose the most influential item from a group of related findings, or by putting clusters of related findings into intermediate profiles (e.g., LVE, discussed above). Nevertheless, Iliad's definition of TPR as "the prevalence of that finding in the disease" is not necessarily the same as the empiric TPR, "the prevalence of that finding in the disease, given the current information" [8]. Two examples follow where independence assumptions probably do not hold, and in which functions may model reality more closely than other possible representations (Figure 3).

[Figure 3 (three graphs). (A) Chest Pain in AMI: TPR for Diabetes and No Diabetes. (B) Chest Pain in AMI: TPR against Years of Diabetes. (C) Stress test for CAD: LR against prior probability of CAD.]

Figure 3. These are examples where independence of findings is unlikely, and where functions may best describe the dependence of the y-axis variable on the x-axis value. (A) and (B) Diabetics have silent ischemic events more frequently than non-diabetics. In (A) this is shown as two point estimates of sensitivity, while (B) shows sensitivity of chest pain as a function of time. (C) shows likelihood ratios for positive (LR+) and negative (LR-) stress tests among patients with normal EKGs, as a function of prior probability of coronary artery disease (from CASS data).

Chest Pain and Diabetes (Figures 3a and b): Painless cardiac ischemia is more common in diabetics than in others; therefore, chest pain has a lower than average sensitivity for detecting cardiac ischemia in diabetics [9]. Iliad could represent this information by creating an intermediate profile called "Chest Pain with Diabetes Mellitus," which would be present if a patient has diabetes and chest pain. The Acute MI profile might include these lines:

    Chest pain ................................. TPR = .95 ...
    or
    • Chest pain with diabetes mellitus ........ TPR = .85 ...

(the bullet indicates the intermediate profile). Iliad could present this knowledge, without the artificial intermediate profile, by borrowing an "if ... then ... else ..." statement from procedural programming languages:

    if (diabetic) then Chest pain ...    TPR = .85 ...
    else               Chest pain ...    TPR = .95 ...

A causal network representation of the knowledge base could also capture the level of detail described so far. However, if cumulative damage to the autonomic nervous system causes painless ischemia, sensitivity will be a function of the duration of the patient's diabetes, a continuous value.

    Chest pain ...    TPR = f(duration of diabetes)
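The two representations above, a two-valued rule and a function of duration, might look like this in Python. The .95 and .85 values come from the example lines in the text; the linear decline per year and the floor are invented purely to give the function a shape, since no actual curve is specified.

```python
NON_DIABETIC_TPR = 0.95   # example value from the text
DIABETIC_TPR = 0.85       # example value from the text


def chest_pain_tpr_step(diabetic):
    """The 'if ... then ... else' form: one TPR for diabetics, another for everyone else."""
    return DIABETIC_TPR if diabetic else NON_DIABETIC_TPR


def chest_pain_tpr_by_duration(years_of_diabetes, decline_per_year=0.01, floor=0.75):
    """TPR = f(duration of diabetes).

    Hypothetical shape: sensitivity starts at the non-diabetic value and
    falls linearly with each year of diabetes until it reaches `floor`.
    Both `decline_per_year` and `floor` are illustrative assumptions.
    """
    if years_of_diabetes < 0:
        raise ValueError("duration cannot be negative")
    return max(floor, NON_DIABETIC_TPR - decline_per_year * years_of_diabetes)


if __name__ == "__main__":
    for years in (0, 5, 10, 30):
        print(years, "years:", round(chest_pain_tpr_by_duration(years), 3))
```

The step function needs only one extra stored value per finding, while the continuous form needs a fitted curve, which is the engineering cost weighed in the discussion that follows.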

Stress Tests and Prior Probability (Figure 3c): Data from the Coronary Artery Surgery Study suggest that test characteristics of exercise stress tests improve as the prior probability of coronary artery disease (CAD) increases [10]. Among patients with normal electrocardiograms, sensitivity (the fraction of the diseased who are correctly predicted) and likelihood ratios for positive test results increased, and likelihood ratios for negative results smoothly declined, by a factor of about four as the prior probability of coronary artery disease increased from 5% to 90%. However, these trends did not appear among patients who had abnormal electrocardiograms.

    Stress test ...    TPR = f(prior probability, EKG findings)


These examples suggest that in general, diagnostic knowledge bases may need to generate test characteristics for a new finding after examining current information. As previously observed, use of conditional probabilities throughout a knowledge base requires a prohibitive number of values [3,4]. Nevertheless, a few guidelines might help to select a manageable number of findings to represent with conditional descriptions. First, of course, an accurate conditional description must exist, or be discoverable, when clinicians expect dependence. Second, conditional descriptions should improve diagnostic accuracy. This implies that a conditional value should sometimes differ substantially from the corresponding value under assumed independence. For instance, in the description of Acute MI findings, chest pain test characteristics are probably conditional on the duration of diabetes mellitus, but "serial EKGs consistent with an evolving MI" probably should not have conditional test characteristics. The definition of a "substantial change" is obvious in QMR: only changes in the value of EV or FR affect the scoring algorithm, so a useful conditional description would generate at least two values for at least one of these factors. The definition of a "substantial change" in Iliad is less clear, since any change in TPR or FPR alters posterior probabilities. Nevertheless, some threshold change in likelihood ratios might serve to select useful formulas.
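One way to make the "threshold change in likelihood ratios" concrete is sketched below in Python. The two-fold threshold, the function names, and the example FPR of .30 are assumptions chosen for illustration; the text does not propose specific values.

```python
def lr_positive(tpr, fpr):
    """Likelihood ratio for a positive finding (requires fpr > 0)."""
    return tpr / fpr


def lr_negative(tpr, fpr):
    """Likelihood ratio for a negative finding (requires fpr < 1)."""
    return (1.0 - tpr) / (1.0 - fpr)


def worth_conditioning(conditional_characteristics, threshold=2.0):
    """Return True if LR+ or LR- varies at least `threshold`-fold across conditions.

    `conditional_characteristics` is a list of (tpr, fpr) pairs, one per
    condition (e.g. diabetic, non-diabetic). The threshold is hypothetical.
    Assumes 0 < fpr < 1 and tpr < 1 so that no ratio is zero or undefined.
    """
    positives = [lr_positive(t, f) for t, f in conditional_characteristics]
    negatives = [lr_negative(t, f) for t, f in conditional_characteristics]
    return (
        max(positives) / min(positives) >= threshold
        or max(negatives) / min(negatives) >= threshold
    )


if __name__ == "__main__":
    # Chest pain for Acute MI: TPR .95 without diabetes, .85 with (FPR assumed .30).
    print(worth_conditioning([(0.95, 0.30), (0.85, 0.30)]))
```

In this example the LR+ barely moves, but the LR- changes about three-fold, so absence of chest pain carries noticeably different weight in the two groups; a screen of this kind would flag the finding as a candidate for a conditional description.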

DISCRIMINATING TESTS

An insightful diagnostician can sometimes distinguish between two or more competing diagnoses using a single question or test. That is, rather than asking one question to rule out disease A, and another question to rule in disease B, the possibility exists of ruling out disease A and ruling in disease B, or vice versa, with one question. These knowledge bases have enough information to identify such discriminating questions, but neither has such a feature. Whether or not practicing physicians are interested in this ability, it would be an interesting and timely teaching feature.

Both knowledge bases allow the user to select two or more diseases, and then answer questions focused on those entities. Neither currently selects questions for maximum impact on the difference in posterior probabilities of the diseases, although Iliad's "most useful information" algorithm is currently a focus of some investigation, and may evolve to use an algorithm derived from information theory [11]. QMR can generate a list of findings pertaining to two diseases, a list of findings that rule in or rule out one disease, and a list of findings categorized by their relationship to a pair of diseases. The latter feature, "comparisons between ...," partially provides the desired function through three of the categories in the list: "findings unique to disease A," "findings unique to disease B," and "findings common to both but with different EV/FR numbers." However, QMR cannot currently select the most distinguishing items from these lists.

Iliad's "most useful information" feature currently serves a somewhat different purpose from distinguishing diseases. Given a 56-year-old male with severe chest pain, Iliad's differential includes Stable Angina and Acute MI. When these diseases are selected and the most useful information is requested, Iliad asks whether the patient has had any recent trauma. If there is a history of chest trauma, the probability of Chest Wall Pain jumps from 2% to 76%, but the probabilities of Stable Angina and Acute MI remain 15% and 12%, respectively. If there is no recent trauma, their probabilities are still 15% and 12%. Thus, although providing this "most useful information" may cause dramatic changes in the entire differential, it does not distinguish between the selected diseases at all, or even change their posterior probabilities, regardless of the response.

This question should be straightforward for Iliad, because an Acute MI is evidence against the Stable Angina profile. Therefore, a finding that raises the likelihood of Acute MI without simultaneously raising the likelihood of Stable Angina can discriminate between these diseases. For instance, an EKG pattern indicative of acute MI would raise the probability of Acute MI to 80%, and lower the probability of Stable Angina to less than 1%. If the EKG is not indicative of acute MI, the posterior probability of Acute MI falls to 1% while the probability of Stable Angina remains 15%. Either result is more discriminating than the trauma question.

In general, selecting a most discriminating finding is a computationally tedious task, potentially requiring the calculation of the posterior probability (or score) of every disease under consideration after a positive or negative response to each finding in any of the disease profiles. In practice, the number of diseases under consideration would need to be small, and they probably need to be compared in pairs, rather than simultaneously. Figure 4 illustrates a window to display tests to discriminate between pairs of diagnoses, given a group of three diagnoses.

New Views of the Database

An expert can often list several ways in which one test, such as a complete blood count or chest X-ray, may shed light on a differential. Students using Iliad observed that they had difficulty predicting the differential implications of these tests and others which generate numerous findings. The consult modes in Iliad and QMR allow the user to browse by disease, or explore the implications of one finding (e.g., thrombocytopenia). Understanding the full impact of the multiple findings generated by tests such as a CBC, CXR, CT, or SPEP is almost impossible in either program, although both contain the desired information. These students want a view of the knowledge base to answer questions of this type: "If I get an X-ray, what findings are possible, and how might they affect my differential?" This test profile could let the student view a list of findings related to a test, a list of diseases affected by the test (perhaps limited to diseases selected from the differential), and the implication of each finding on each disease. Iliad could display implications of findings, at users' discretion, as sensitivity and specificity, likelihood ratios for positive and negative findings, or as positive and negative predictive values (Figure 5). QMR could display EV and FR values, or scores that would follow positive and negative findings.

[Figure 4 (window mock-up). Most Distinguishing Test: discriminating tests for Acute MI, Stable Angina, and Unstable Angina.

                       Acute MI                    Stable Angina
    Stable Angina      EKG: pattern of Acute MI    N/A
    Unstable Angina    EKG: pattern of Acute MI    History: Chest pain with less exertion than in the recent past]

Figure 4. The most distinguishing test matrix would list diseases that the user is considering as row and column labels. Each cell in the matrix displays the one test that best discriminates between these diagnoses. The user would ask a question by selecting its cell. Obvious extensions include a matrix of the most cost-effectively discriminating tests, or a 3 × 3 matrix where each cell displays a finding that can rule in the row label and rule out the column label.
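A matrix like the one in Figure 4 could be filled by a brute-force pairwise search. The Python sketch below is one possible scoring rule, not Iliad's or QMR's algorithm: for each candidate finding it computes how far the gap between two diseases' probabilities would move after a positive or a negative answer, and keeps the finding with the largest movement. The priors of 12%, 15%, and 2% come from the chest pain example in the text; all TPR and FPR values are invented, and the sketch ignores links between profiles (such as Acute MI counting against Stable Angina).

```python
from itertools import combinations


def posterior(prior, tpr, fpr, present):
    """Bayes' theorem for a single present or absent finding."""
    if present:
        return tpr * prior / (tpr * prior + fpr * (1.0 - prior))
    return (1.0 - tpr) * prior / ((1.0 - tpr) * prior + (1.0 - fpr) * (1.0 - prior))


# Priors from the example in the text; (TPR, FPR) pairs are hypothetical.
PRIORS = {"Acute MI": 0.12, "Stable Angina": 0.15, "Chest Wall Pain": 0.02}
PROFILES = {
    "Acute MI": {"EKG: pattern of acute MI": (0.60, 0.02)},
    "Stable Angina": {"Chest pain relieved by rest": (0.80, 0.30)},
    "Chest Wall Pain": {"Recent chest trauma": (0.70, 0.005)},
}


def updated(disease, finding, present, priors=PRIORS, profiles=PROFILES):
    """Probability of `disease` after the finding; unchanged if the finding
    is not part of that disease's profile (a simplifying assumption)."""
    if finding not in profiles[disease]:
        return priors[disease]
    tpr, fpr = profiles[disease][finding]
    return posterior(priors[disease], tpr, fpr, present)


def discrimination_score(finding, a, b, priors=PRIORS, profiles=PROFILES):
    """Largest change in (P(a) - P(b)) over a positive or negative answer."""
    gap_before = priors[a] - priors[b]
    score = 0.0
    for present in (True, False):
        gap_after = (updated(a, finding, present, priors, profiles)
                     - updated(b, finding, present, priors, profiles))
        score = max(score, abs(gap_after - gap_before))
    return score


def most_discriminating(a, b, priors=PRIORS, profiles=PROFILES):
    """Search every finding in every profile for the best discriminator of a vs. b."""
    candidates = sorted({f for profile in profiles.values() for f in profile})
    return max(candidates, key=lambda f: discrimination_score(f, a, b, priors, profiles))


if __name__ == "__main__":
    for a, b in combinations(PRIORS, 2):
        print(f"{a} vs. {b}: {most_discriminating(a, b)}")
```

Under these assumptions the trauma question scores zero for the Acute MI versus Stable Angina pair, because neither probability moves, while the EKG finding scores highest. The cost of the search grows with the number of disease pairs times the number of findings, which is why the text suggests keeping the set of diseases small.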

[Figure 5 (window mock-up). Chest X-ray test view. View by: PP+/-, LR+/-, or TPR/FPR. Columns: PE, Pneumonia, and CHF, each with a prior probability and LR+ (LR-) values for the findings that bear on it. Common findings listed: unilateral pleural effusion; triangular, pleural based alveolar infiltrate without air bronchograms; localized pulmonary oligemia; alveolar infiltrate; alveolar infiltrate with lobar or segmental consolidation; alveolar infiltrate consistent with pulmonary edema; pulmonary venous congestion; interstitial infiltrate and Kerley lines; left ventricular enlargement.]

Figure 5. A test-oriented view of the knowledge base would list all of the findings that a test can generate, and display the implications of the presence or absence of each finding on diseases selected from the differential, or manually selected by the user.

Summary of Suggestions

Comparisons between behavior observed in expert systems and behavior anticipated from human experts suggest several opportunities to improve the expert systems. A program designed to fail gracefully at its limits of expertise could provide useful feedback even when the knowledge base is incomplete. Failure to recognize limits of expertise may unsettle clinicians. An explicit description of the logic used to generate differentials can help physicians reconstruct the process and help students learn it. Tools such as Iliad's calculators are useful, but if the inference engine uses a modified algorithm to interpret findings, the calculator should also reproduce that algorithm. Many findings are incompletely described by a few sensitivity and specificity values. Categorization of continuous values inevitably discards information about a particular value. Clinicians also know of or suspect dependencies between some categorical findings. Although an exact understanding of continuous and dependent test characteristics is often lacking, it would be prudent to engineer knowledge bases that are able to apply that understanding as it develops. Experts and knowledge bases have enough information to resolve some diagnostic dilemmas with a few elegant questions. Tools to do so would add to the teaching potential of these programs. Finally, experts sometimes serve specialized reference functions that knowledge bases could emulate. Summary views of test consequences would further enhance these programs.

ACKNOWLEDGMENT

The National Institutes of Health supported this work through training Grant NIH 5T15 LM07044. The Charles E. Culpeper Foundation provided additional support.

REFERENCES

1. Bergeron, B., Iliad: A diagnostic consultant and patient simulator. MD Comput. 8(1):46-53, 1991.
2. Miller, R.A., Masarie, F.E., and Myers, J.D., 'Quick Medical Reference (QMR)' for diagnostic assistance. MD Comput. 5:34-49, 1986.
3. Szolovits, P., and Pauker, S.G., Categorical and probabilistic reasoning in medical diagnosis. In (W.J. Clancey and E.H. Shortliffe, eds.), Readings in Medical Artificial Intelligence: The First Decade. Addison Wesley Publishing Company, Reading, MA, 1984, pp. 210-240. (Originally from Artificial Intelligence, 11:115-144, 1978.)
4. Shortliffe, E.H., Buchanan, B.G., and Feigenbaum, E.A., Knowledge engineering for medical decision making: A review of computer-based clinical decision aids. In (W.J. Clancey and E.H. Shortliffe, eds.), Readings in Medical Artificial Intelligence: The First Decade. Addison Wesley Publishing Company, Reading, MA, 1984, pp. 35-71. (Originally from Proceedings of the IEEE, 1979, 67:1207-1224.)
5. Miller, R.A., Pople, H.E., and Myers, J.D., Internist-I, an experimental computer based diagnostic consultant for general internal medicine. N. Engl. J. Med. 307:468-476, 1982.
6. Teach, R.L., and Shortliffe, E.H., An analysis of physician attitudes regarding computer-based clinical consultation systems. Comput. Biomed. Res. 14:542-558, 1981.
7. Miller, W.G., Chinchilli, V.M., Gruemer, H.D., and Nance, W.E., Sampling from a skewed population distribution as exemplified by estimation of the creatine kinase upper reference limit. Clin. Chem. 30(1):18-23, 1984.
8. Gammerman, A., and Thatcher, A.R., Bayesian diagnostic probabilities without assuming independence of symptoms. Methods Inf. Med. 30(1):15-22, 1991.
9. Naka, M., Hiramatsu, K., Aizawa, T., et al., Silent myocardial ischemia in patients with non-insulin dependent diabetes mellitus as judged by treadmill exercise testing and coronary angiography. Am. Heart J. 123(1):46-53, 1992.
10. Weiner, D.A., Ryan, T.J., McCabe, G.H., et al., Exercise stress testing: Correlations among history of angina, ST-segment response and prevalence of coronary artery disease in the coronary artery surgery study (CASS). N. Engl. J. Med. 301:230-235, 1979.
11. Guo, D., Lincoln, M.J., Haug, P.J., Turner, C.W., and Warner, H.R., Exploring a new best information algorithm for Iliad. Proceedings of the Fifteenth Annual Symposium on Computer Applications in Medical Care. Washington, D.C., pp. 624-628, 1991.
