Automated integration of external databases: A knowledge-based approach to enhancing rule-based expert systems

Lewis Berman, M.D. 1,2

1 Center for Medical Informatics, 2 Section of Pulmonary and Critical Care Medicine and 3 Occupational and Environmental Medicine Program, Yale University School of Medicine, New Haven, CT, 06510

Expert system applications in the biomedical domain have long been hampered by the difficulty inherent in maintaining and extending large knowledge bases. We have developed a knowledge-based method for automatically augmenting such knowledge bases. The method consists of automatically integrating data contained in commercially available, external, on-line databases with data contained in an expert system's knowledge base. We have built a prototype system, named DBX, using this technique to augment an expert system's knowledge base as a decision support aid and as a bibliographic retrieval tool. In this paper, we describe this prototype system in detail, illustrate its use and discuss the lessons we have learned in its implementation.

0195-4210/92/$5.00 © 1993 AMIA, Inc.

INTRODUCTION

Expert systems are computer programs that attempt to emulate certain behaviors of human experts, usually in a well bounded domain. They have been successfully used in a variety of settings both biomedical and nonbiomedical [1]. However, their application to the biomedical domain has been beset by a number of difficulties. One of the most commonly cited difficulties is the large and constantly changing base of knowledge required for medical reasoning.

Several general categories of expert system have been applied to the biomedical domain. Two types in particular have seen the most frequent use, rule-based and table-based expert systems. Rule-based expert systems such as Mycin [2] and HT-Attending [3] encode knowledge and the directions for using that knowledge in a single integrated packet, the rule. Attempts to modify one rule can have unintentional repercussions on other related rules. New rules added to the system to supplement the knowledge base may also unexpectedly interact with other rules leading to unpredictable results.

The second major type of expert system, the table-based system, is organized differently. In these systems, exemplified by INTERNIST-I/Quick Medical Reference [4] and AI/Rheum [5], a knowledge structure is defined a priori in which each data element is categorized, scored and associated with other elements in a predetermined way. The instructions for traversing and integrating this knowledge structure are fixed as an immutable part of the software. Although the knowledge in these systems is independent from control of the reasoning process, adding new knowledge can be arduous since it is often difficult to determine relative scores and associative links.

During their development, expert systems are validated by being run over a large set of sample cases, comparing their performance on these cases to that of human experts. Because modifying and supplementing the knowledge base of an expert system has the potential to alter its inferencing behavior over both old and new knowledge, one must rechallenge the system with old cases as well as new cases to exhaustively validate its behavior. The work entailed in enhancing a preexisting knowledge base thus grows with the size of the knowledge base.

In contrast to an expert system's knowledge base which stores data and inferencing information together, a database contains data only. To maintain a clear distinction in this paper, "knowledge base" will always refer to the data and inferencing information typically used by an expert system. "Database" will only be used to refer to data stored in an organized fashion without inferencing information. Since in databases there is no need to store inferencing pathways or conceptual links, they are in general much easier to maintain than knowledge bases. In addition, a large and rapidly growing number of commercial, on-line databases are now available to any computer equipped with either a modem or direct access to national computer networks. In this paper, we hypothesize that the information available within these databases can serve as a dynamic, structured resource allowing us to separate specific knowledge of a biomedical domain from an expert system's knowledge base.
The goal of this research was to identify a range of such databases containing information relevant to occupational asthma, to design a prototype rule-based expert system that accesses and intelligently integrates the information available in these databases, and to use this integrated information to answer clinical questions and guide directed bibliographic database queries.

BACKGROUND

The difficulties inherent in maintaining and augmenting knowledge bases have been well recognized by many investigators [6]. Research efforts aimed at overcoming these difficulties have focused on two major approaches. In the first approach, exemplified by the TEIRESIAS [7] and QMR-Kat [8] programs, tools have been created to help extract knowledge from human experts while simultaneously ensuring structural and factual consistency within the knowledge base. While these are clearly valuable adjuncts to knowledge acquisition, these tools can only reason over (and thus maintain consistency with) knowledge already in the knowledge base. They cannot guarantee that new knowledge added to the system is correct, nor that the system will generate correct conclusions when sample cases are analyzed.

Rennels [9] pursued a different approach with the Roundsman system. Roundsman consists of a core set of basic domain knowledge which reasons over context specific data provided by abstracted clinical journal articles. This system proved to be a very powerful knowledge base augmentation tool. In practice, its use was limited by the required human abstraction and interpretation of the clinical articles. The process of abstracting the articles introduces problems similar to those found in the general knowledge acquisition process. In the case of Roundsman, the results are influenced by the biases and interpretations of the abstractor; in the case of an expert system, the outcome is influenced by the knowledge engineer and domain expert.

In contrast to these approaches, we sought to explore a different paradigm for knowledge base augmentation: the intelligent integration of external, on-line databases with the knowledge base of a traditional rule-based expert system.

DESIGN CONSIDERATIONS

Domain Selection

This paper describes a project based on the premise that a rule-based expert system's performance could be enhanced by integrating its knowledge base at runtime with data stored in external databases. To simplify construction of this prototype, called DBX, we selected a domain where the required external knowledge is already easily available and structured in a useful manner. Occupational and environmental medicine is a subfield of internal medicine where great emphasis is placed on identifying job-related and environmental causes of disease. In order to successfully pursue this identification, it is crucial to have 1) a thorough understanding of the relationship between different chemical compounds and 2) the ability to understand and access both the occupational medicine and chemistry literatures. By selecting occupational asthma as our prototype domain, we were able to take advantage of the wide range of physical and organic chemistry databases available commercially.

Database Identification and Selection

The task of identifying suitable databases for a project such as this is not trivial. We undertook a four-pronged approach to the delineation of databases potentially applicable to occupational and environmental medicine:

1. Extensive interviews with members of the Section of Occupational and Environmental Medicine at the Yale School of Medicine.
2. Discussions with the librarians at both the Yale medical and chemistry libraries.
3. Inspection of literature provided by commercial on-line database providers.
4. Review of a compilation of many commercially available databases and a brief synopsis of their content [10].

We separated the databases we encountered into two categories, "reasoning databases" and "retrieval databases". Reasoning databases are databases structured in a fashion amenable to integration with the rule-based expert system's inferencing process. Retrieval databases are those databases which possess useful information not easily incorporated into the traditional rule-based model. These two categories are described in more detail below.

Reasoning Databases. Other than applicability to the problem domain, we established only two criteria for including a particular database as a reasoning database in DBX. The data in the database had to be structured (not bibliographic or free text) in nature and, ideally, each data record had to be accessible by a unique key. For this project, we used the Chemical Abstract Service (CAS) Registry Number (RN) as the key. The RN is a number assigned by CAS upon the first publication or description of a new chemical structure. Although the same chemical structure can have many different commercial and chemical names, they are all identified by a single CAS RN. Utilizing these search methods, we identified two databases that met our criteria, CHEMID [11] and CAS Registry [12]. For the purposes of developing our prototype, pertinent records from each of these databases were abstracted and stored locally in a flat-file text format.

CHEMID is an on-line database maintained by the National Library of Medicine to facilitate the use of its other bibliographic databases. This database associates all of the names by which a compound can be identified, structurally or commercially, with its appropriate, unique CAS RN. The following example shows an abridged CHEMID entry for the chemical compound azodicarbonamide. In this example, the heading RN designates the compound's unique CAS RN and the headings N1, NM and SY denote the various names used to refer to this particular compound in a variety of databases and printed literature.

AZODICARBONAMIDE
  RN - 123-77-3
  N1 - Diazenedicarboxamide (9CI) [TSCA]
  N1 - Formamide, 1,1'-azobis- (8CI) [RTECS]
  NM - Azodicarbonamide [CCRIS]
  NM - 1,1-azobisformamide [MESH]
  NM - 1,1'-AZOBIS(FORMAMIDE) [HSDB]
  SY - ABFA [HSDB]
  SY - A13-52516 [NLM]
  SY - AZ [HSDB]
  SY - Azobiscarbonamide [HSDB:RTECS]
  SY - Azobiscarboxamide [HSDB:RTECS]
  SY - Azobisformamide [NLM]
  SY - Azodicarbamide [HSDB:RTECS]
  SY - Azodicarboamide [HSDB:RTECS]
  SY - Azodicarbonamide [HSDB:MESH:RTECS]
  SY - Azodicarboxamide [HSDB:RTECS]
  SY - Azodicarboxylic acid diamide [HSDB:RTECS]

CAS Registry is one of a large number of on-line databases supported by the Chemical Abstract Service (CAS) which are known collectively as CAS Online. These databases are augmented copies of CAS's extensive collection of printed abstracts. Each record in CAS Registry uniquely identifies a compound by its RN and contains extensive information as to the compound's alternate names, molecular formula, structure and stereochemistry. Of particular note, this database is organized hierarchically by chemical structure. An example entry from this database, illustrating this hierarchical structure, is shown below for the compound 1-Propanolol.

1-Propanolol
  1-Propanolol, 2-methyl-2-nitro-
  1-Propanolol, 2-methyl-2-[(2-phenylethyl)amino]-
  1-Propanolol, 3-[(6-methyl-2-pyridinyl)methoxy]-
  1-Propanolol, 3-(methylthio)-
  1-Propanolol, 2-nitro-

We defined three relationships to classify the structural meaning encoded in this hierarchy. Compound A is a child of compound B implies that one can find compound B as a core component in the structure of compound A. Compound A is a sibling of compound B means that both A and B are structural modifications of the same parent compound. The parent relationship is the inverse of the child relationship. In this example, 1-Propanolol is a parent to 1-Propanolol, 2-methyl-2-nitro-. 1-Propanolol, 2-methyl-2-nitro- is a sibling of 1-Propanolol, 2-methyl-2-[(2-phenylethyl)amino]-. 1-Propanolol, 2-methyl-2-nitro- is a child of 1-Propanolol.
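The three relationships can be illustrated with a minimal Python sketch. It assumes the hierarchy has been loaded into a simple child-to-parent mapping; the mapping `PARENT_OF` and the function names are hypothetical and are not part of DBX or of the CAS Registry interface. The entries are the 1-Propanolol compounds from the example.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory copy of the example hierarchy:
# each compound maps to its structural parent (None for the root).
PARENT_OF: Dict[str, Optional[str]] = {
    "1-Propanolol": None,
    "1-Propanolol, 2-methyl-2-nitro-": "1-Propanolol",
    "1-Propanolol, 2-methyl-2-[(2-phenylethyl)amino]-": "1-Propanolol",
    "1-Propanolol, 3-[(6-methyl-2-pyridinyl)methoxy]-": "1-Propanolol",
    "1-Propanolol, 3-(methylthio)-": "1-Propanolol",
    "1-Propanolol, 2-nitro-": "1-Propanolol",
}


def parent(compound: str) -> Optional[str]:
    """Return the structural parent of a compound, or None if it has none."""
    return PARENT_OF.get(compound)


def children(compound: str) -> List[str]:
    """Return every compound whose parent is the given compound."""
    return [c for c, p in PARENT_OF.items() if p == compound]


def siblings(compound: str) -> List[str]:
    """Return the other structural modifications of the same parent."""
    p = parent(compound)
    if p is None:
        return []
    return [c for c in children(p) if c != compound]
```

Deriving siblings from the parent link, rather than storing them separately, keeps the three relationships consistent with one another by construction.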

Retrieval Databases. Retrieval databases represent the bulk of databases currently available. This type of database encompasses a large amount of pertinent data that is too ill-structured for knowledge-based inferencing purposes. Many retrieval databases are available with contents that overlap our area of interest, occupational asthma. Since bibliographic databases in many ways represent the archetypical retrieval database, we chose to include TOXLINE [13], the National Library of Medicine's on-line toxicology information service, as a component of DBX. TOXLINE accumulates bibliographic information about the pharmacological and toxicological effects of drugs and other chemicals from a melange of secondary source databases. As with the structured databases, for this demonstration prototype we abstracted records of interest from TOXLINE and stored them locally in a flat, text file format.
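As an illustration of how locally stored flat-file records might be used, the sketch below parses a record in the "TAG - value [sources]" layout of the abridged CHEMID entry shown earlier and builds a name-to-RN index. The exact file layout used by DBX is not described in this paper, so the layout, the `SAMPLE_RECORD` text and the function names are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, List

# Assumed layout: one "TAG - value [sources]" field per line.
SAMPLE_RECORD = """\
RN - 123-77-3
NM - Azodicarbonamide [CCRIS]
SY - ABFA [HSDB]
SY - Azobisformamide [NLM]
SY - Azodicarboxylic acid diamide [HSDB:RTECS]
"""


def parse_record(text: str) -> Dict[str, List[str]]:
    """Split a record into {tag: [values]}, dropping the trailing source list."""
    fields: Dict[str, List[str]] = defaultdict(list)
    for line in text.splitlines():
        tag, sep, value = line.partition(" - ")
        if not sep:
            continue  # skip lines that are not "TAG - value"
        value = value.strip()
        if value.endswith("]") and "[" in value:
            value = value[: value.rindex("[")].strip()
        fields[tag.strip()].append(value)
    return dict(fields)


def index_names(records: List[str]) -> Dict[str, str]:
    """Map every lower-cased name in each record to that record's RN."""
    index: Dict[str, str] = {}
    for text in records:
        fields = parse_record(text)
        rn = fields["RN"][0]
        for tag, values in fields.items():
            if tag == "RN":
                continue  # every non-RN field in the example is a name
            for name in values:
                index[name.lower()] = rn
    return index
```

With such an index, any of a compound's names resolves to the same key, for example `index_names([SAMPLE_RECORD])["abfa"]` gives `"123-77-3"`.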

Expert System Implementation

We chose to write our rule-based expert system using a commercially available expert system shell. To simplify linking our expert system to external databases, we chose a shell that provided automatic database access and also supported externally produced database bridges. After investigating several products, we selected the NEXPERT™ (Neuron Data, Palo Alto, CA) object-oriented expert system shell.

SYSTEM DESCRIPTION

The overall DBX system design is illustrated in Figure 1. The inferencing process is outlined schematically in Figure 2. DBX's interaction with the clinician is initiated by the system's request for the name of a suspected asthmogenic compound. The system looks for this substance in an internal knowledge base of known asthma causing agents. If unsuccessful, DBX sequentially accesses the CHEMID and CAS Registry databases to find an RN for the agent and then reexplores its own knowledge base of known asthmogens for a match. If still unsuccessful at classifying the agent, DBX again turns to the CAS Registry database. There it extracts the RNs of compounds structurally related (as parents or siblings) to the putative asthmogen. The system then looks to see if any of these related compounds can be found in its private knowledge base of confirmed asthmogens. If at any point in this process DBX matches the putative asthmogen's or a related compound's RN to the RN of a known asthmogen, inferencing is interrupted and the results are presented to the user.

The user is then offered the option of running a directed TOXLINE search based on the results of the inferencing session. The query is customized by utilizing the name or RN of the originally entered compound and the RN of any structurally related asthmogenic compound(s) that DBX has identified.
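The classification sequence described above can be summarized in a short Python sketch. It is not the NEXPERT rule base used in DBX; the dictionaries are toy stand-ins for the knowledge base and the external databases, populated only with values that appear in the examples of this paper, and all identifiers (`classify`, `Finding`, the dictionary names) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Toy stand-ins for the data sources (values from the paper's examples).
KNOWN_BY_NAME: Dict[str, str] = {"nickel": "Nickel"}           # KB, by name
KNOWN_BY_RN: Dict[str, str] = {                                # KB, by CAS RN
    "101-68-9": "Diphenylmethane diisocyanate",
    "85-44-9": "Phthalic anhydride",
}
CHEMID_NAME_TO_RN: Dict[str, str] = {"mdi": "101-68-9", "niagathal": "117-08-8"}
REGISTRY_NAME_TO_RN: Dict[str, str] = {}                       # empty in this toy
REGISTRY_RELATIVES: Dict[str, Dict[str, List[str]]] = {
    "117-08-8": {"parent": ["85-44-9"], "sibling": []},
}


@dataclass
class Finding:
    status: str                     # "known", "related" or "unclassified"
    rn: Optional[str] = None        # RN resolved for the entered compound
    matched: Optional[str] = None   # known asthmogen that was matched
    relation: Optional[str] = None  # "parent" or "sibling" when status is "related"


def classify(name: str) -> Finding:
    key = name.strip().lower()
    # Step 1: direct name match in the known asthmogen knowledge base.
    if key in KNOWN_BY_NAME:
        return Finding("known", matched=KNOWN_BY_NAME[key])
    # Step 2: resolve an RN via CHEMID, then CAS Registry, and retry by RN.
    rn = CHEMID_NAME_TO_RN.get(key) or REGISTRY_NAME_TO_RN.get(key)
    if rn is None:
        return Finding("unclassified")
    if rn in KNOWN_BY_RN:
        return Finding("known", rn=rn, matched=KNOWN_BY_RN[rn])
    # Step 3: look for known asthmogens among structural parents or siblings.
    relatives = REGISTRY_RELATIVES.get(rn, {})
    for relation in ("parent", "sibling"):
        for relative_rn in relatives.get(relation, []):
            if relative_rn in KNOWN_BY_RN:
                return Finding("related", rn=rn,
                               matched=KNOWN_BY_RN[relative_rn], relation=relation)
    return Finding("unclassified", rn=rn)
```

The function returns as soon as a match is found, mirroring the way inferencing is interrupted once a known asthmogen is identified.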

Figure 2. An outline of DBX's expert system inferencing process with augmentation by external databases. For each inferencing step, the source of data is indicated in parentheses. KB = expert system's knowledge base; Reg = CAS Registry; CID = CHEMID.

RESULTS

Below, we present a series of sample cases which illustrate DBX's behavior while traversing all of its potential inferencing pathways.

Figure 1. The conceptual interrelationship amongst the rule-based expert system, expert system knowledge base and external databases.

Example 1. Nickel

The compound "Nickel" is recorded in the known asthmogen knowledge base. Therefore, when presented with the putative asthmogen "nickel", DBX finds a direct match in its known asthmogen knowledge base and declares, "Nickel is a known asthma causing agent".

This example shows the functionality of the knowledge base without any enhancement. Under these circumstances, the system is restricted to alphanumeric matching of chemical names. Since many chemicals are known by a variety of names, pure alphanumeric matching can be expected to perform poorly in practice. In a production system, the lexical matching could be made more sophisticated.

Example 2. MDI

When the user enters "MDI" as the putative asthmogen, DBX immediately looks for a name match in the known asthmogen knowledge base. It does not find a match. It then checks the CHEMID database where it finds an entry for "MDI" with a CAS RN of 101-68-9. By querying its knowledge base of known asthmogens with this RN, it can alert the user that, "MDI better known as Diphenylmethane diisocyanate is a known asthma causing agent".

In contrast to the last example, utilizing the CHEMID database has significantly improved performance. We have gone from a state of uncertainty as to the asthmogenicity of MDI to a state of absolute certainty. In addition, our yield on TOXLINE searches has improved from no references to 6 references.

Example 3. Niagathal

Here again, when "Niagathal" is entered into DBX as a putative asthmogen, the known asthmogen knowledge base is queried for a name match. When "Niagathal" is not found, it is searched for in CHEMID where it is found to have RN 117-08-8. This RN is still not found in the known asthmogen list, so CAS Registry is searched for a parent compound. In CAS Registry, 1,3-Isobenzofurandione (phthalic anhydride, RN 85-44-9) is found to be the structural parent of niagathal. Querying the known asthmogen list, we find that this parent compound is a known asthmogen. DBX informs the user that, "Although there is no direct evidence that niagathal is an asthma causing agent, it is a structural modification of Phthalic anhydride, a known asthmogen".

In reality, niagathal is a known asthmogen purposely omitted (for demonstration purposes) from the system's known asthmogen list. It is usually referred to as tetrachloro-phthalic anhydride. The ability of the system to correctly point us in the appropriate direction when processing this compound exemplifies the power of this inferencing approach.
Bibliographic retrieval is not significantly enhanced in this example. Searching for niagathal alone (by RN) yields 24 references. Searching for the parent compound, phthalic anhydride, leads to 31 references. Of these references, 12 are common to both compounds.

Example 4. Araldite EPN 1138

As a last example, we look at "Araldite EPN 1138". This compound is unclassifiable by direct name search of the known asthmogen list. It is also not found on query of the CHEMID database. On searching the CAS Registry database, it is found to have RN 136162-33-9. Neither it nor its parent compound, 5-Isobenzofurancarboxylic acid, is found on the known asthmogen list. However, its structural sibling trimellitic anhydride is found on the known asthmogen list. As such, "Araldite EPN 1138" is identified to the user as a structural relative of trimellitic anhydride with the following message, "Although there is no direct evidence that Araldite EPN 1138 is an asthma causing substance, both it and Trimellitic anhydride, a known asthmogen, are structurally derived from 5-Isobenzofurancarboxylic acid."

A TOXLINE search for the putative asthmogen, "Araldite EPN 1138", is unsuccessful. Searching for its structural relative, trimellitic anhydride, yields 48 references.
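The way a directed search is customized from the inferencing results can be sketched as follows. The sketch only assembles a list of search terms (the entered compound's RN, or its name when no RN was found, followed by the RNs of related known asthmogens); it does not reproduce TOXLINE's actual command syntax, and the function names and the generic "OR" string are assumptions made for illustration.

```python
from typing import Iterable, List, Optional


def build_search_terms(entered_name: str,
                       rn: Optional[str] = None,
                       related_rns: Iterable[str] = ()) -> List[str]:
    """Collect search terms: the compound's RN (or name), then related RNs."""
    terms: List[str] = [rn if rn else entered_name]
    for related in related_rns:
        if related not in terms:  # avoid repeating a term
            terms.append(related)
    return terms


def as_boolean_query(terms: List[str]) -> str:
    """Join terms into a generic boolean string (not TOXLINE syntax)."""
    return " OR ".join(f'"{term}"' for term in terms)
```

For the niagathal example, `build_search_terms("Niagathal", "117-08-8", ["85-44-9"])` yields the compound's own RN followed by that of its known asthmogenic parent.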

DISCUSSION

This project explores the process and utility of integrating the knowledge base of a rule-based expert system with the data contained in external databases in order to augment the expert system's inferencing capability and to enhance information retrieval capabilities. Although expert systems currently use external databases, these databases serve as static repositories of data used passively in the inferencing process. Usually a query is directed at the database, one or two salient pieces of information are returned and processing continues.

In contrast to this passive view of data in a database, we adopted a more dynamic perspective. We analyzed the process that human experts use to integrate information when they turn to outside sources of data such as reference books or manuals. We hypothesized that human experts use their own domain knowledge to impose a conceptual framework of rules and common sense over this raw data and then apply the integrated whole to the problem at hand. In other words, the expert temporarily integrates the external data with his/her preexisting domain specific knowledge and structure for the purpose of addressing a specific problem. In this model, the database actually represents an extension of the knowledge base, not just packets of easily retrievable data.

For this model to work, one must be able to separate domain knowledge into three categories. The first category is data that is so integral to solving problems in the domain that it becomes the core of any expert's or expert system's activity. The knowledge embodied in this data covers areas such as common sense and the basic paradigms used to solve problems in the domain. The second category of knowledge is domain specific "data mapping" knowledge. This knowledge allows the expert to take the raw data contained in a database or reference materials and integrate it with the first type of core data to allow an expansion in the scope of what can be done with the problem solving algorithms. The third category of data is simply the raw data in the database. In this project, we encoded the first two categories of knowledge as rules in an expert system and left the last category as an external database.

DBX illustrates one potentially successful approach to integrating a rule-based expert system with external databases. We were able to augment the classification and bibliographic retrieval capabilities of our expert system by reasoning over the knowledge available in on-line, external databases.

Incorporating this model of database integration into an expert system has multiple practical advantages as compared to traditional methods of knowledge base construction. The first advantage is that the expense and validation effort incurred in constructing these databases are shifted to their commercial providers. Second, since the database providers are not being asked to make subtle and difficult decisions by including inferencing directives with the data in the database, the maintenance cost is presumably lower than that of maintaining an expert system knowledge base. While these two advantages are relatively obvious, mechanical consequences of the integration technique, their importance should not be overlooked. If knowledge base construction requires fewer resources, more research effort can be directed at better elucidating the actual inferencing steps utilized by human experts to solve clinical problems [14].

In addition to these advantages, there are two other potential benefits. We hypothesize that by maintaining the inferencing rules completely separate from the external data, changes to the databases should not have the far reaching consequences on system performance that modification of an expert system knowledge base can entail.
If true, this would significantly enhance our abilities to actively maintain and enhance knowledge bases. We further believe that "common sense" knowledge and "rules of thumb" may sometimes be better implemented using an integrated, external database. For example, in the domain of occupational medicine one can say, "If a suspected asthmogen is not known to be directly asthmogenic but it is a metal commonly used with machining oils, check the asthmogenic potential of the particular machining oil". Comprehensively representing this type of "rule of thumb" in a knowledge base can require a huge amount of knowledge. By integrating the data available in external databases (i.e., which compounds are metals, which compounds are machining oils, which metals are machined) with a simple rule, this difficult concept can be simplified both in terms of size and complexity. We did not attempt to explore either of these putative benefits in the current prototype.

Based on our experience with this prototype, we have identified several properties that are useful and possibly critical for the successful integration of an expert system knowledge base with an external reasoning database. First, it is very helpful if some of the databases chosen for integration are searchable by a common key. This circumvents problems with alternate spellings and names that can plague such database integration efforts. In this system, the Chemical Abstract Service RN filled this role. Second, it is useful for at least one of the integrated, reasoning databases to possess a logical secondary structure. The ability to traverse this structure in an intelligent fashion can facilitate intelligent database utilization by the expert system. For our purposes, the hierarchical arrangement of chemical compounds in the CAS Registry database served as a key to the success of this prototype.

Expert systems such as our prototype can be applied in a variety of fashions. Classically, these systems are used for decision support as we have discussed above. However, these expert systems can also be used to facilitate the clinician's effort to gather information pertinent to his/her needs. Used in this way, even if the integrated knowledge base fails as a decision support tool, it serves a useful purpose if it successfully guides the user's search for information. To grade this type of interaction, one cannot simply classify the system's performance as correct or incorrect. Rather, one must compare the user's ability to gather data with and without the aid of the system. To illustrate DBX's ability to augment clinician information gathering, we included the TOXLINE search capability.
In general, when no references could be found relating to the putative asthmogen, we initiated a TOXLINE search for the closest asthmogenic structural relative the system could derive. By utilizing this technique, we significantly improved the yield of our TOXLINE searches. No efforts were made to judge the quality or applicability of the recovered references.

The qualitative assessment of this prototyping effort has been very promising; however, formal validation of this concept requires evaluation in a more comprehensive fashion. By introducing true communication links to each of the databases rather than to abstracted local files and by augmenting the locally maintained list of known asthmogens, one could undertake objective trials to measure the utility of both the decision support and bibliographic retrieval modules. Since the goal of the present project has been to explore and define basic concepts in prototype fashion, we did not embark on a more comprehensive assessment.

One might question whether this work is generally applicable to other biomedical domains or if its utility is restricted to the field of occupational medicine. Since there are numerous databases subserving the quantitative sciences such as chemistry, physics and engineering, it is logical to argue that this technique should be transferable to any of the biomedical domains which significantly overlap these areas, such as radiology, radiation therapy and molecular biology research. In addition, there are a significant number of biomedical databases available which overlap almost all of the fields of biomedicine, as well as many printed tables of information which could be placed on-line to facilitate this type of extended reasoning capability.
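The machining-oil "rule of thumb" discussed earlier in this section gives a sense of how a single rule might lean on external lookups. The sketch below is purely illustrative: this benefit was not explored in the prototype, the lookup functions stand in for external databases that are not specified in the paper, and the compound names are placeholders rather than real data.

```python
from typing import Callable, Iterable, List, Set


def machining_oils_to_check(suspect: str,
                            known_asthmogens: Set[str],
                            is_metal: Callable[[str], bool],
                            oils_used_with: Callable[[str], Iterable[str]]) -> List[str]:
    """Apply the rule of thumb: for a metal that is not itself a known
    asthmogen, return the machining oils whose asthmogenicity should be checked."""
    if suspect in known_asthmogens:
        return []  # already classified directly; the rule does not fire
    if not is_metal(suspect):
        return []  # the rule only applies to metals
    return sorted(oils_used_with(suspect))


# Placeholder "external database" contents (hypothetical, not real data).
METALS = {"metal-A"}
OILS_FOR = {"metal-A": {"oil-Y", "oil-X"}}

oils = machining_oils_to_check(
    "metal-A",
    known_asthmogens=set(),
    is_metal=lambda compound: compound in METALS,
    oils_used_with=lambda metal: OILS_FOR.get(metal, set()),
)
```

The rule itself stays a few lines long; the bulk of the knowledge (which compounds are metals, which oils are used with them) lives in the lookups, which is the separation argued for above.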

CONCLUSION

We have constructed a prototype system in which a rule-based expert system integrates its knowledge base with two external, on-line databases at runtime. The resulting integrated system has the potential to perform more robustly than a non-integrated system both as a clinical decision aid and as a bibliographic retrieval tool in the domain of occupational asthma. Databases which are accessible through common key fields and which possess a logical secondary structure seem to be particularly well suited for this type of integration. This model of runtime incorporation of external databases into expert system knowledge bases is a first step in exploring how expert reasoning can be enhanced by taking advantage of existing on-line resources.

ACKNOWLEDGEMENT

This work was supported by NIH Grants T15 LM07056 and R01 LM04336 and NIH Contract N01 LM13537 from the National Library of Medicine. Dr. Cullen is a recipient of NIEHS Academic Award in Environmental/Occupational Medicine ES00227. We wish to thank Majlen Helenius of the Yale Medical Library for her help in locating the external databases used in this project.

REFERENCES

1. Shortliffe EH. Medical expert systems: Knowledge tools for physicians. West Jour Med. 1986; 145:830-9.
2. Shortliffe EH. Computer Programs to Support Clinical Decision Making. JAMA. 1987; 258:61-66.
3. Miller PL. "Expert Critiquing Systems." Springer-Verlag, New York, 1986.
4. Miller RA, Pople HE, Myers JD. INTERNIST-I, An experimental computer-based diagnostic consultant for general internal medicine. N Engl J Med. 1982; 307:468-76.
5. Kingsland LC III, Lindberg DAB, Sharp GC. Anatomy of a knowledge-based consultant system: AI/RHEUM. MD Computing. 1986; 3(5):18-26.
6. Miller RA. INTERNIST-I/CADUCEUS: Problems Facing Expert Consultant Programs. Meth Inform Med. 1984; 23:9-14.
7. Davis R. Interactive Transfer of Expertise. In: Rule Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project, eds. Buchanan BG, Shortliffe EH, Addison-Wesley, 1984: 171-205.
8. Giuse DA, Giuse NB, Miller RA. Towards computer-assisted maintenance of medical knowledge bases. Artificial Intelligence in Medicine. 1990; 2:21-33.
9. Rennels GD, Shortliffe EH, Stockdale FE, Miller PL. A computational model of reasoning from the clinical literature. Computer Methods and Programs in Biomedicine. 1987; 24:139-49.
10. Datapro directory of on-line services. Datapro Research Corp., Delran, N.J., May 1988.
11. CHEMID [database online]. National Library of Medicine, Bethesda (MD). Available from the National Library of Medicine.
12. CAS REGISTRY [database online]. Chemical Abstracts Service, Columbus, OH. Available from STN International, Columbus, OH.
13. TOXLINE [database online]. National Library of Medicine, Bethesda (MD). Available from the National Library of Medicine.
14. Berman L, Miller RA. Problem area formation as an element of computer aided diagnosis: a comparison of two strategies within Quick Medical Reference (QMR). Meth Inform Med. 1991; 30:90-5.
