Broadening Our Approach to Evaluating Medical Information Systems

Diana E. Forsythe and Bruce G. Buchanan
Department of Computer Science, University of Pittsburgh, Pittsburgh, PA 15260

Abstract

Evaluation in medical informatics tends to follow the paradigm of controlled clinical trials. This model carries with it a number of assumptions whose implications for medical informatics deserve examination. In this paper, we describe the conventional wisdom on evaluation, pointing out some of its underlying assumptions and suggesting that these assumptions are problematic when applied to some aspects of evaluation. In particular, we believe that these assumptions contribute to the problem of user acceptance. We then suggest a broader approach to evaluation, offering some conceptual and methodological distinctions that we believe will be of use to the medical informatics community in rethinking this issue.

Introduction

It is widely believed that medical computer programs, like other medical innovations, can have a positive impact on health care when evaluators show that they are safe and efficacious and users adopt them into routine practice. Although the FDA considered the idea of supervising the evaluation of medical expert systems, the agency now intends neither to regulate medical expert systems nor to supervise their evaluation for safety and efficacy as long as a person with clinical competence interprets a program's output (or advice) before it is translated into action on a patient [1]. Nevertheless, the medical informatics community still expects that developers of medical software should evaluate the level of performance of their systems so that interested potential users can assess the risk (to themselves and their patients) of using them. Following the tradition of drug trials and evaluations of other therapeutic interventions, the current model of evaluation in medical informatics is some variant of controlled clinical trials (CCT's). However, this approach reflects a narrow view of evaluation. Not only does it address only a shadow of the performance issue (that is, performance in a controlled environment), it completely fails to address the issue of whether users will adopt a system into routine practice.

We believe that the CCT model as it has been adapted to evaluation in medical informatics is useful, but mainly in relation to the evaluation of system performance. We argue that if system developers broaden their approach to evaluation to include a concern for non-technical and non-medical issues as well (e.g., users' perceptions of a system), then not only performance issues but also issues germane to acceptance into practice will be examined. In order to accommodate concerns of this sort, however, we will need to extend our methodological repertoire to include other, qualitative methods better suited than the CCT method to collecting and analyzing information on social context and subjective experience.

The conventional wisdom of medical informatics incorporates some tacit assumptions which--while not necessarily consciously embraced by its practitioners--seem to be consistently held and therefore worth examining. In previous work, we have explored the relationship between particular pieces of conventional wisdom in AI/medical informatics and certain widely acknowledged problems in the areas of knowledge acquisition [2][3], physicians' information needs [4][5], expert systems project management [6], and the fragility of knowledge-based systems [7]. In the present paper, we apply the same analytical concern to the problem of evaluation in medical informatics, and to the problem of user acceptance that motivates this concern.

Background

In all stages of system development, from initial problem definition through formal evaluation, developers of medical expert systems focus primarily on performance characteristics of their systems. This concern is clearly important: acceptable system performance is obviously a necessary condition for utility and user acceptance. However, problems arise when developers believe that measuring performance is also sufficient (see, for example, the discussion of the "better mousetrap" assumption underlying Mycin's initial development [8]), or when evaluations focus only on performance issues.

In trying to meet the difficulties with user acceptance of medical information systems, system developers have made significant improvements in problem definition, interface design, architecture, and system performance. Nevertheless, user acceptance remains a problem: systems that have been judged acceptable from the designers' standpoint have not necessarily been viewed positively by potential users. We suggest that this difficulty stems in part from some unintended side-effects of our conventional approach to evaluation.

Many criteria have been proposed for evaluating software in general, and knowledge-based systems in particular. While these do not ignore the question of user acceptance, they concentrate on technical aspects of the software. For example, a recent summary of the literature on testing and evaluating knowledge-based systems focuses on two commonly discussed aspects of software evaluation: optimization (efficiency) and performance (correctness) [9]. Several criteria are listed under each category; characteristically, nearly all of them are technical. In keeping with this pattern, evaluations of medical systems generally include only statistics on technical performance.

Schoolman [1] suggests a minimal set of characteristics for developers to evaluate before putting medical expert systems on the market. The following four questions define these operating characteristics: (1) "Given that disease X is in its knowledge base and that an expert enters the findings of a real patient with disease X, how often will the black box conclude that the patient has disease X? (Sensitivity). (2) Given that disease X is in its knowledge base and that an expert enters the findings of a real patient who does not have disease X, how often will the black box conclude that the patient does not have disease X? (Specificity). (3) How are the sensitivity and specificity altered if the input to the black box is not by an expert? (Robustness). (4) How does the system respond when given the findings of a patient with a disease not in the system's knowledge base?" [1, ms. p. 4]. In contrast to other applications software, there is no software industry willing to transfer medical expert systems to the marketplace, and to take on the risk of user acceptance. For this reason, we believe that developers of medical systems must evaluate more than the minimal characteristics proposed by Schoolman if their software is actually to be used in health care.
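As a concrete illustration of these operating characteristics, the sketch below shows how sensitivity and specificity might be tallied from labeled test cases, and how the remaining two questions would be approached. This is a minimal sketch, not from Schoolman's paper: the diagnose function and the test-case format are hypothetical stand-ins for whatever interface a given system actually provides.

    # Minimal sketch of Schoolman's operating characteristics.
    # `diagnose` is a hypothetical stand-in: it takes patient findings
    # and returns the set of diseases the system concludes are present.

    def operating_characteristics(cases, diagnose, disease):
        """cases: (findings, has_disease) pairs for a disease that is
        known to be in the system's knowledge base."""
        tp = fn = tn = fp = 0
        for findings, has_disease in cases:
            concluded = disease in diagnose(findings)
            if has_disease:
                tp += concluded
                fn += not concluded
            else:
                fp += concluded
                tn += not concluded
        sensitivity = tp / (tp + fn) if (tp + fn) else None
        specificity = tn / (tn + fp) if (tn + fp) else None
        return sensitivity, specificity

    # Robustness (question 3): run the same tally on cases entered by
    # experts and by non-experts, and compare how much the two figures
    # degrade. Question 4 requires a separate test set of cases whose
    # diseases are absent from the knowledge base, for which there is
    # no "correct" conclusion to count.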

Conventional Wisdom: Some Tacit Assumptions in Medical Informatics

Perhaps the most crucial difference between drug evaluation and software evaluation is that the users' conscious reactions to a computer system are an important determinant of whether that system will be incorporated into clinical care and thus have an impact on health care. For the purposes of medical informatics, therefore, evaluation procedures need to include methods of investigating such subjective reactions. In contrast, drug trials are set up to control for subjective perception through the use of "blinding" procedures: the CCT model explicitly excludes the possibility of collecting data on patients' subjective perception. In adapting the CCT model to the evaluation of medical information systems, the medical informatics community has thus adopted an approach that largely precludes our getting information about issues of user acceptance. For this reason, we argue that the CCT model alone is simply too narrow to provide the full range of information that should be collected when a medical information system is evaluated.

There are costs and benefits to every paradigm. In medical informatics, the very assumptions that have enabled researchers to tackle certain problems productively seem also to have made it more difficult for them to deal with other problems, or other aspects of the same problems. While these assumptions are not necessarily explicit or even conscious, we argue that they exert a significant influence on the way in which practitioners approach such issues as evaluation. In order to clarify this point, we delineate below some of the assumptions of the CCT model as applied in medical informatics, pointing out ways in which these assumptions constrain the type of evaluation we tend to undertake.

#1: Technical Bias

The approach to problem assessment, design, and evaluation characteristic of AI and medical informatics reflects a focus on technical factors. As noted above, we traditionally evaluate a system largely or entirely with respect to system performance (speed, accuracy, and (rarely) extensibility) and occasionally with respect to technical aspects of architecture (logic, complexity, exportability). While there is nothing wrong with technical evaluation, practitioners may not be aware of how much is being left out of evaluation that looks at technical factors alone. However, to a social scientist it is clear that the omission is considerable: the non-technical includes information about social and contextual issues that are crucial to the problem of user acceptance. Non-technical issues include whether systems are compatible with users' needs and desires, and the way users understand and evaluate a system; the way the system fits into users' normal work patterns and processes [10][11], as well as into the organizational structure; and the way changes caused by the system are viewed by users, designers, and managers. A successful system is more than the instantiation of a design. It is not appropriate for experts and knowledge engineers to define what users want and need; our evaluation procedures need to be broadened to allow users to speak for themselves.

#2: Deleting the Social: Decontextualized Evaluation

Human decision-making is complex. Artificial intelligence and medical informatics have accepted the need to simplify the information we represent in order to build knowledge-based systems at all. Because such simplification has become routine in the design process, however, it seems to engender habits of thought that are carried over automatically to other areas--such as evaluation. Thus, when practitioners turn to evaluation, they tend to restrict it to questions that can be addressed using quantitative models such as CCT's.

Star has commented that computer science "deletes the social" [12]. The social is one major category of information that gets bracketed out through the simplification process described above. Developers tend to think of systems as isolated technical objects; as developers, they do not necessarily consider who will work with the system when it is fielded, or how that work will be accomplished. This deletion of the social is perpetuated in the conventional wisdom on evaluation. Systems are typically evaluated only in the laboratory, out of the context in which they will eventually be expected to function. Thus, evaluation does not usually include investigation of how systems fit into the everyday social and organizational contexts in which they are to be fielded.

This is a major omission. In actual use, automated systems rarely stand alone--they need to work with people. This means that people have the power to make a system work or to sabotage it: systems are only successful if people are willing to work with them. In other words, systems need people to "hold them up" [13][14]. This is certainly true of software in a social environment as complex as health care. Developers thus have a strong motive to investigate how normal users react to a given system, that is, to extend our procedures to include evaluation in context. Introducing technological change is bound to cause changes in the way people work. Designers and managers frequently give little thought to the effects of introducing systems on workers as individuals, on work processes, or on other people in the workplace [15]. However, to be successfully fielded, medical systems need to make sense in terms of these factors. To be acceptable, the changes brought about by a system need to fit into the normal patterns of work in the office, emergency room, etc. This proposition is more complex than it may at first appear, because workplaces don't necessarily function according to managers' ideal models of work. Thus, in order to fit in successfully, a system needs to harmonize with the way things really work, not just the way they are supposed to work. Furthermore, changes must be consistent with the aspects of their work that users value [16]; if they are not, we can expect workers to try to undermine the system.

#3: Quantitative Bias

The CCT model brings with it the tacit assumption that "science" equals "laboratory science" and that "systematic study" requires an experiment. In medical informatics, this tends to be interpreted as meaning that to be both valid and useful, evaluation must be quantitative. For the technical aspects of evaluation, this bias does no appreciable harm: issues related to system performance can be investigated adequately using quantitative data. (We set aside here for the moment such questions as whose needs and perceptions should be used to set standards for measuring attributes of the system--the designer's or the user's.) However, this bias contributes to the exclusion from evaluation procedures of phenomena that do not lend themselves to quantification. For the evaluation of non-technical aspects of system functionality and acceptability, the methods of qualitative social science are more suitable. Such unobtrusive methods as participant observation and interviewing can provide systematic data on patterns of thought and behavior in natural workplace settings [17][10][11][18].

Social and psychological phenomena--especially when investigated in context--do not lend themselves to study using the model of controlled experimentation. Real-world settings are not easily controlled. Furthermore, it is precisely the uncontrolled, spontaneous user reaction that evaluation in context needs to pick up. Qualitative measures provide an understanding of subjective motive and meaning that quantitative ones may not [16]. This sort of information is precisely what is missing from conventional evaluations of medical information systems. At present we do not know why users do not use medical information systems, in part because we know too little about the contexts in which they are fielded, about the social and personal consequences of the use of such systems, or about the motives that contribute to people's desire to use or sabotage such systems.

#4: Formal Bias

Software design and implementation encourage a formal perspective on problems. Much of the computer science literature is highly formal, and successful programmers frequently have substantial training in mathematics, logic, or a physical science. Thus, it is not surprising that system designers and implementers bring a formal perspective to their task. However, while this bias may contribute to successful system building, it has costs as well--for example, in the areas of evaluation and user acceptance. Emphasis on formal rules, procedures, facts and relations seems to make it difficult for technical specialists to see the importance of events and relations that are not institutionalized or universal, such as the perceptions of individual users or social relations in a particular type of workplace. But the fact is that although a system may be designed according to generally-accepted principles, it will be fielded in a particular social and political context. While the particularities of a given local context are for the most part not amenable to characterization in formal terms, they will play a major role in determining whether or not a system is accepted in practice. Thus, we need to expand our evaluation repertoire to include investigation of the local, informal perceptions and procedures that both affect and help to illuminate the phenomenon of user acceptance.

The insufficiency of evaluation on formal grounds alone is illustrated by Gaschnig et al., who show that a formal evaluation of a medical expert system may be misleading even when statistically sound [19]. Their case study involved careful selection of the variables used to design, build and test a program in one context. However, the system ran into trouble when fielded at a second site because "real life" did not duplicate the conditions under which the system had been evaluated.

#5: One Reality

AI takes a universalistic approach to many issues, assuming that if there is a right answer to a problem, it is right in all contexts. To put it another way, designers of AI systems, like other cognitive scientists, describe problem-solving tasks in such a way as to assume that satisfactory answers exist whose correctness can be ascertained independently of the problem-solving context. But this assumption is too simple to be useful in relation to such complex problems as assessing whether or not a particular type of social or technological change (such as introducing a system into health care environments) is desirable in specific circumstances. A more realistic approach acknowledges that different individuals, and individuals in different positions, can have quite different perspectives. Thus, change can be positive, neutral, or negative from the standpoint of designers, management and users, and these people may evaluate quite differently the costs and benefits of a given change. Furthermore, they may have different views on just what the changes are that are brought about by the introduction of a given piece of technology: some effects may be invisible to non-users, others perceived only by managers.

Since different players can be expected to have different interests and points of view, there are likely to be multiple valid perspectives on whether a given system is successful or not, and these perspectives may well be inconsistent. Rather than expecting to find one reality (i.e., that a system is either acceptable or unacceptable), we need design and evaluation procedures that elicit this predictable diversity of views and analyze their implications for design. A design team that investigates just what the range of local perspectives is before undertaking to build a system for a given type of setting is much less likely to be blindsided later on by unexpected differences of opinion about particular features of the system.

III. Broadening Our Approach to Evaluation: Some General Questions

Space does not allow a full discussion of how one should evaluate medical software, nor do we believe that a single method is appropriate for all situations. However, we do believe that all systems need to be evaluated with respect to non-technical as well as technical aspects of performance if they are to be successfully transferred to use in routine health care. The following questions are intended to broaden the conventional perspective on the subject by suggesting some of the non-technical aspects that we believe should be considered during evaluation.

Question 1: What should we evaluate when we evaluate a medical information system? There are many dimensions to software and many dimensions to health care. Some involve factors that are difficult to measure objectively, such as the so-called "look and feel" of a system.

Question 2: What would an evaluation procedure include if it did not delete the social? In addition to characteristics of the software as seen by the developer, we need to evaluate its characteristics as seen by users. The latter may have a greater influence on a system's ultimate utility. Determining how actual members of the user community use a system in their routine work will provide insights on the issue of how to enhance user acceptance.

Question 3: Who should carry out evaluations? Given that everyone has a different perspective and different interests, whose perspective should be adopted for something as important as evaluation? Can evaluators be found who do not have a stake in the success of the system?

Question 4: Where should a system be evaluated? Some characteristics may be evaluated in a controlled laboratory environment; others can only be evaluated in the context of routine use.

Question 5: Against what should a system be evaluated? What is the gold standard?
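Where a gold standard does exist (for example, a hypothetical expert-panel consensus on a set of test cases), one common quantitative treatment is to report the system's agreement with that standard corrected for chance, such as Cohen's kappa. The sketch below is illustrative only; the diagnoses shown are invented, and the comparison speaks only to the quantitative side of the question.

    # Illustrative comparison of system output against a gold standard
    # (a hypothetical expert-panel consensus), using Cohen's kappa.
    from collections import Counter

    def cohens_kappa(system_labels, gold_labels):
        n = len(gold_labels)
        observed = sum(s == g for s, g in zip(system_labels, gold_labels)) / n
        sys_freq, gold_freq = Counter(system_labels), Counter(gold_labels)
        expected = sum(sys_freq[c] * gold_freq[c] for c in sys_freq) / (n * n)
        return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

    system = ["pneumonia", "pneumonia", "asthma", "asthma"]  # invented
    panel  = ["pneumonia", "asthma",    "asthma", "asthma"]  # invented
    print(cohens_kappa(system, panel))  # 0.5: agreement beyond chance

Even a statistically sound comparison of this kind, as the Gaschnig et al. case study [19] suggests, says nothing about how the gold standard itself was chosen or whether it will hold up at a second site.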

Question 6: Whose use of the system should be evaluated? Should it be a developer who knows exactly what is assumed in the system and how to use it efficiently, or a naive user who doesn't know any of this?

Some of the ideas in this paper have been suggested before by social scientists who have evaluated medical information systems [16][20][21]. These ideas and the methods of qualitative research associated with them are accepted in the social science community. However, since they have as yet had little influence on medical informatics, we have yet to reap the benefit they can offer us in the area of evaluation.

IV. Conclusions

The interdisciplinary approach to evaluating medical information systems that we have outlined is a response to a problem that is widely acknowledged: how to get knowledge-based systems into routine use. Cognitive science has helped to produce high-performance systems. Qualitative social science can help with some additional aspects of our work: thinking about ways of bringing the social, the informal, and the non-technical back into our approach to such problems as evaluation. Thus, it can help us broaden our approach to evaluation in terms of an expanded paradigm.

To include the social in our evaluation procedures, we need to understand not only how users react to a system, but also why they do so. This requires systematic investigation of the way in which a system affects both individual users and the pattern of their daily work. It also strongly implies that the design process should begin with investigation of the work patterns into which the future system will be designed to fit. Although even the best-planned system may bring about unforeseen changes (e.g., in social relations in the workplace, in the relative power of different work groups, etc.), such investigation before crucial design decisions are made should help produce systems that are more useful and more acceptable to future users.

Acknowledgement

We gratefully acknowledge partial support from National Library of Medicine contract N01-LM-8-3514.

References

[1] Schoolman, H. "Obligations of the Expert System Builder: Needs of the User." To appear in M.D. Computing, forthcoming September 1991.
[2] Forsythe, D. and B.G. Buchanan. "Knowledge Acquisition for Expert Systems: Some Pitfalls and Suggestions." IEEE Transactions on Systems, Man and Cybernetics, Vol. 19, No. 3, 1989.
[3] Forsythe, D. "Engineering Knowledge: The Construction of Knowledge in Artificial Intelligence." Technical Report No. CS-90-9, Computer Science Department, University of Pittsburgh, 1990.
[4] Forsythe, D., J.A. Osheroff, B.G. Buchanan, and R.A. Miller. "Expanding the Concept of Medical Information: An Observational Study of Physicians' Information Needs." Technical Report No. CS-90-7, Computer Science Department, University of Pittsburgh, 1990.
[5] Osheroff, J.A., D.E. Forsythe, B.G. Buchanan, R.A. Bankowitz, B.H. Blumenfeld, and R.A. Miller. "Analysis of Clinical Information Needs Using Questions Posed in a Teaching Hospital." Annals of Internal Medicine, 1 April 1991.
[6] Forsythe, D. and B.G. Buchanan. "Non-Technical Problems in Knowledge Engineering: Suggestions for Project Managers." To appear in J. Liebowitz (ed.), Proceedings of the World Congress on Expert Systems. New York: Pergamon Press, forthcoming December 1991.
[7] Forsythe, D. "The Construction of Work in Artificial Intelligence." Technical Report No. ISL-91-4, Computer Science Department, University of Pittsburgh, 1991.
[8] Buchanan, B.G. and E.H. Shortliffe (eds.). Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Reading, MA: Addison-Wesley, 1984.
[9] Harrison, P.R. "Testing and Evaluation of Knowledge-Based Systems." In J. Liebowitz and D. De Salvo (eds.), Structuring Expert Systems. Englewood Cliffs, NJ: Yourdon Press, 1989.
[10] Fafchamps, D. "Ethnographic Workflow Analysis: Specifications for Design." To appear in J.H. Bullinger (ed.), Proceedings of the 4th International Conference on Human-Computer Interaction. Amsterdam: Elsevier Science Publishers, forthcoming September 1991.
[11] Fafchamps, D., C.Y. Young, and P.C. Tang. "Modelling Work Practice: Design and Evaluation of a Physician Workstation." Hewlett-Packard Laboratories, Palo Alto, CA, 1991. (submitted for publication)
[12] Star, S.L. "The Sociology of the Invisible: The Primacy of Work in the Writings of Anselm Strauss." To appear in D. Maines (ed.), Social Organization and Social Processes: Essays in Honor of Anselm Strauss. New York: Aldine de Gruyter (in press).
[13] Gasser, L. "The Integration of Computing and Routine Work." ACM Transactions on Office Information Systems, Vol. 4, No. 3, 1986.
[14] Gasser, L. The Social Dynamics of Routine Computer Use in Complex Organizations. Computer Science Department, University of Southern California, 1984.
[15] Terkel, S. Working. New York: Ballantine Books, 1972.
[16] Kaplan, B. and D. Duchon. "Combining Qualitative and Quantitative Methods in Information Systems Research: A Case Study." MIS Quarterly, Vol. 12, No. 4, 1988.
[17] Werner, O. and G. Schoepfle. Systematic Fieldwork. (2 vols.) Newbury Park: Sage, 1987.
[18] Zuboff, S. In the Age of the Smart Machine. New York: Basic Books, 1988.
[19] Gaschnig, J., P. Klahr, H. Pople, E. Shortliffe, and A. Terry. "Evaluation of Expert Systems: Issues and Case Studies." In F. Hayes-Roth, D.A. Waterman, and D.B. Lenat (eds.), Building Expert Systems. Reading, MA: Addison-Wesley, 1983.
[20] Lundsgaarde, H.P. "Evaluating Medical Expert Systems." Social Science and Medicine, Vol. 24, No. 10, 1987.
[21] Lundsgaarde, H.P., P.J. Fischer, and D.J. Steele. Human Problems in Computerized Medicine. Lawrence, Kansas: University of Kansas Publications in Anthropology No. 13, 1981.
