The development of a virtual database to provide on-line access to a large archive of clinical data.

Lane E. Stevens, M.S.¹, Stanley M. Huff, M.D., Peter J. Haug, M.D.

Division of Clinical Epidemiology, LDS Hospital¹ and Department of Medical Informatics, University of Utah

ABSTRACT

The archival database of the HELP Hospital Information System at LDS Hospital is too large to be stored on line. The archival data are important for clinical and research applications. Demountable disk packs have been used to store the archival database. This method of storage has four significant disadvantages. A virtual database was developed to overcome the limitations of this data-management scheme. This virtual database enables the transparent use of appropriate low-cost network-based storage technology to provide on-line availability of the entire archive of clinical data. The virtual database successfully resolves the problems associated with disk packs, and opens the door to enhanced use of the data for clinical and research applications.

INTRODUCTION

The HELP Hospital Information System (HELP) has been under development and in use at LDS Hospital for over 20 years [1,2]. The database of HELP is broad in terms of the types of data that are stored, and deep in terms of the number of hospital admissions that are represented. As an example, there are 750,000 records of admission in the LDS Hospital database from 1983 through the end of 1991. Much of each patient's medical record is stored in an electronic format. There are significant challenges associated with the ongoing management of such a large volume of data.

The clinical computer at LDS Hospital provides on-line storage for the data of all current patients and the data of all patients who have been discharged from the hospital during the most recent six-month period. Patient data are not stored on line beyond a six-month period because the quantity of data captured by HELP exceeds the storage capacity of the clinical computer. These data are a valuable resource for clinical and research applications [3,4,5]; therefore, it is important that the data be archived in a manner that facilitates accessibility as they are removed from the clinical computer system and placed in an archival database. Through the end of 1991, demountable disk packs were used as the primary storage for the archive of clinical data. Figure 1 illustrates the migration of patient data from the point of capture to the archival database. Based on the rate of data capture in 1991, a single 250 megabyte disk pack stores approximately the amount of data captured for all patients discharged from the hospital during a three-month period.

0195-4210/92/$5.00 ©1993 AMIA, Inc.

[Figure 1 labels: Current Patient Files; Eleven days after discharge; Recent Patient Files; Six months after discharge; Demountable Disk Files]

Figure 1. Migration scheme for patient data.

One benefit of this scheme of data management is that the data are stored in a format that can be used directly by existing applications of HELP. It is not generally necessary to develop new tools to access archival data; however, there are four significant problems associated with this approach to the management of the archival data. These problems are general, and should apply in a broad sense to any system that uses a similar scheme for the management of archival data.

First, access to archival data is limited. At LDS Hospital only two demountable disk drives are available, so at most two of the packs of archival data can be used concurrently. As researchers have realized the value of retrospective analysis of data, contention for access to the data has increased. Access to these two drives has been further limited by the high rate of failure of the drives.

Second, the archival database is arbitrarily partitioned across the demountable disk packs based on the date of discharge of the patients. A search of the archival data may require several days depending on the number of packs to be searched and the availability of the drives. In addition, the fragmentation of the database often requires that intermediate results be maintained during the interval between disk packs to provide continuity in analysis of the data. For example, intermediate results would be required in order to compute the mean age of a population of patients that spans more than one disk pack. This adds complexity to the search.

Third, each disk pack, or partition, is both logically and physically independent of the other partitions of the archive of data. This independence makes possible the incorrect placement and the duplication of patient data. Enforcement of a policy that data for a particular patient and admission must exist in exactly one set of files is very difficult, particularly because at most two archive packs can be on line simultaneously. If the data of a patient are replicated, then the two sets of data will begin to diverge as updates to the data of that patient are made.

Fourth, the inaccessibility of archival data and the lack of a method to locate all the data for a particular patient hinder the generation of longitudinal records. A longitudinal record is the complete collection of data from all hospital admissions for a given patient. The grouping of data in the HELP database by admission is done primarily for the purpose of accounting and represents an artificial organization of the data. A more appropriate organization of the data is obtained by grouping the data by individual patient in a longitudinal record. Longitudinal records are important at the time a patient is readmitted to the hospital and for research.

These four problems are compounded by the fact that the archive continues to grow. Not only does the size of the archive grow, but the rate at which the archive grows is increasing. The development of new applications and the modification of existing applications result in an increase in the amount of data that is collected on each patient. The increased use of sophisticated medical devices also causes growth in the rate of data collection. Data from these devices are entered into the database without human intervention [6].
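The need for intermediate results when a population spans several disk packs can be illustrated with a small sketch. It assumes, purely for illustration, that each pack yields a list of patient ages; the `PartialMean` class, its method names, and the sample ages are hypothetical and are not part of HELP.

```python
from dataclasses import dataclass


@dataclass
class PartialMean:
    """Intermediate result carried from one disk pack to the next."""
    total: float = 0.0
    count: int = 0

    def add_pack(self, ages):
        # Fold the ages found on one mounted pack into the running state.
        for age in ages:
            self.total += age
            self.count += 1

    def mean(self):
        if self.count == 0:
            raise ValueError("no patients seen yet")
        return self.total / self.count


# Hypothetical ages found on three separately mounted packs.
packs = [[34, 71, 58], [45, 62], [80]]

state = PartialMean()
for ages in packs:
    # In practice this state would have to be saved between mounts,
    # possibly for days, until the next pack becomes available.
    state.add_pack(ages)

overall_mean = state.mean()

# Averaging the per-pack means instead would weight the packs incorrectly.
naive_mean = sum(sum(a) / len(a) for a in packs) / len(packs)
```

The running sum and count must survive the interval between packs; with all data on line in a single database, no such bookkeeping is needed.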
The experience of the past five years at LDS Hospital has shown that the amount of data that is collected in a given year exceeds the amount of data collected in the previous year by 25%. Demountable disk packs are not an appropriate primary storage medium for this type of archive. The amount of clinical data in the database and the rate at which new data are being captured urgently required a different data-management scheme.

Other institutions have encountered similar problems. Researchers at the Lovelace Medical Foundation have cited inaccessibility of data and the inability to create a longitudinal patient record easily as factors that motivated their decision to create a research-oriented health services database [7]. The strategy of creating a separate research database has at least two significant disadvantages. First, in order to access and analyze the data, new tools must be developed. This is a costly and time-consuming process. Second, the flow of data from the hospital information system is unidirectional. In other words, data from the auxiliary database are not accessible to the hospital information system.

The four problems that have been cited could be solved if the entire archive of clinical data were available on line, and if these data were stored in a single database. The cost of expanding the storage on the hospital's Tandem mainframe computer is prohibitive. At the time of this writing, Tandem storage devices were priced nearly nine times higher per gigabyte than network or personal-computer storage devices.

Although these four problems are significant, a more pressing need for a solution developed at LDS Hospital. The two demountable drives were eliminated from the hospital computer system due to a hardware upgrade that does not support demountable drives.

It is important that any solution transparently support all the existing applications and tools of the HELP system. The cost of a large-scale redevelopment of tools and applications to interact with an incompatible archival database would certainly outweigh any benefit that could be derived from the archival database [8]. A virtual database was developed to provide on-line access to the entire archive of clinical data and solve the problems associated with the storage of archival data. The virtual database transparently allows the use of appropriate inexpensive network hardware and storage devices, while continuing to support existing tools and applications of HELP.

VIRTUAL DATABASE TECHNOLOGY

In its most basic form, a virtual database consists of a physical file and a virtual file. The physical file stores all the data of the database. The virtual file appears to store all the data of the database, but actually contains only the data that are being used at a given time. This concept is similar to virtual memory [9,10,11].

Techniques to manage virtual memory first appeared shortly after programs began to exceed the capacity of physical memory. Before these techniques became available, it was the responsibility of the programmer to divide programs into pieces such that each piece would fit in memory. These pieces were collected into overlays, and the overlays were loaded and unloaded during the execution of the program [12,13]. Just as programmers divided a large program into parts if it exceeded the available memory, a database administrator must divide a database into mutually exclusive partitions if it grows to exceed the capacity of the storage system that contains it. These partitions are loaded and unloaded by users during the analysis of data much like the overlays of a program were loaded and unloaded during the execution of the program.

Virtual-memory techniques allow one type of storage to substitute transparently for another type of storage. In traditional virtual memory systems, disk storage is used to extend the main memory of a computer. A secondary storage system may be required when there is a limit on the amount of primary storage or when primary storage is substantially more expensive than secondary storage. The virtual file of a virtual database system fills a role that is similar to the role played by main memory in a traditional virtual memory system. The physical file logically extends the capacity of the virtual file. This is analogous to the use of disk storage to logically extend the capacity of main memory.

In a virtual database system, all requests for data are made through the virtual file. A request for data from the virtual file completes successfully if the request can be satisfied from the physical file. The request is allowed to complete immediately if the required data are physically located in the virtual file. If the data are not physically located in the virtual file, the request is suspended temporarily while an attempt to transfer the data from the physical file to the virtual file is made. After this attempt completes, the request is allowed to resume (Figure 2). The placement of data into the virtual file occurs in a manner that is transparent to the requestor.
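The request handling described above can be sketched in a few lines. This is a minimal illustration only, assuming in-memory dictionaries stand in for the physical and virtual files; the `VirtualFile` class and its attribute names are hypothetical and do not describe the HELP implementation.

```python
class VirtualFile:
    """Appears to hold every record, but holds only those in use."""

    def __init__(self, physical):
        self.physical = physical   # stand-in for the complete physical file
        self.resident = {}         # records physically present in the virtual file
        self.faults = 0            # number of transfers from the physical file

    def read(self, key):
        # Complete immediately if the data are already resident.
        if key in self.resident:
            return self.resident[key]
        # Otherwise suspend the request and attempt a transfer.
        self.faults += 1
        if key not in self.physical:
            raise KeyError(key)    # the request fails: no such data exist
        self.resident[key] = self.physical[key]
        # Resume the request now that the data are in the virtual file.
        return self.resident[key]


# Hypothetical archive with a single admission.
archive = VirtualFile(physical={"adm-1": "laboratory results"})
first = archive.read("adm-1")    # triggers a transfer
second = archive.read("adm-1")   # served directly from the virtual file
```

The caller uses `read` the same way in both cases, which is the transparency the text describes.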

Figure 2. Overview of a request for data in a virtual database system.

This algorithm describes the fundamental functionality that must be present in a virtual database system. This emulates the behavior of traditional virtual memory systems. In addition to supporting random requests for data, the virtual database must provide any access paths that are defined in the standard database. For example, the standard database may allow records to be accessed either by primary key or alphabetically by name.

The ability to monitor database requests, and to suspend these requests until the required data are transferred to the virtual file, is crucial. This is a key feature used in the construction of a virtual database. The use of this feature creates the illusion that all the data are physically present in the virtual file. In fact, the data are only present in the virtual file when needed to satisfy a request for data.
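One way to picture access paths over a virtual file is sketched below. It assumes, as an illustration only, that an ordering over all records is available for each path even when the records themselves are not resident; the record layout, the `fetch` and `scan` helpers, and the sample names are hypothetical.

```python
# Stand-in for the complete physical file, keyed by primary key.
physical = {
    1003: {"name": "Young", "data": "admission 1003"},
    1001: {"name": "Baker", "data": "admission 1001"},
    1002: {"name": "Moore", "data": "admission 1002"},
}

# Two access paths over the same records.
by_key = sorted(physical)
by_name = sorted(physical, key=lambda k: physical[k]["name"])

resident = {}  # records currently present in the virtual file


def fetch(key):
    # Transfer the record on first use; later requests find it resident.
    if key not in resident:
        resident[key] = physical[key]
    return resident[key]


def scan(path):
    # Sequential scan along one access path; transfers happen as needed.
    for key in path:
        yield fetch(key)


names_in_order = [record["name"] for record in scan(by_name)]
keys_in_order = list(by_key)
```

A scan along either path looks the same to the caller whether or not the records were resident beforehand.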

PREVIOUS WORK

The term virtual database has been defined by other researchers to indicate a unifying database [14]. This type of virtual database integrates multiple heterogeneous databases. The virtual database extends the breadth of the underlying databases by transparently enabling access to an expanded database schema. The schema of the virtual database is constructed by integrating the schemata of the underlying databases. This type of virtual database differs from the virtual database that is presented in this manuscript in two significant ways.

First, the virtual database discussed in this manuscript expands the depth of a standard database by transparently extending the memory hierarchy to include storage devices that would not ordinarily be available to the database management system. The database schema of the virtual database is identical to the schema of the standard database. The virtual database discussed in this paper enables the storage of more of the same type of data, while the virtual database described by other researchers makes additional types of data available.

Second, the virtual database discussed in this paper supports all existing applications that utilize the database. Redevelopment of existing applications is not necessary if the applications exclusively use the proper database-access primitives to manipulate the data. In order to access the type of database described by other researchers, modification of existing applications or the development of new applications may be required because the heterogeneous databases are unified by creating a new database.

THE LDS HOSPITAL VIRTUAL DATABASE

In the LDS Hospital archive, the virtual files of the virtual database reside on the main hospital computer system. The virtual database is supported by a physical database that uses low-cost storage devices on a remote computer system. The main hospital computer system and the remote computer system communicate through a high-speed Ethernet network.

Any application that opens the set of virtual files on the clinical computer is able to transparently access any of the data in the archival database if HELP database-access primitives are used. The use of lower-level (Tandem Enscribe) file-access primitives will circumvent the virtual database management routines, and therefore will only provide access to records that are physically located in the virtual files of the virtual database.

The virtual database system consists of three processes: the virtual archive manager (VAM), the physical archive manager (PAM), and the virtual archive zap process (ARZAP). The VAM and ARZAP processes reside on the clinical computer system. The PAM process resides on a remote personal computer. In addition to these three processes, the virtual database system requires a physical database management system, a set of virtual files, and a specialized set of HELP database-access primitives (Figure 3).

Figure 3. Overview of virtual database management system.

The VAM process controls the flow of data in the virtual database system. VAM accepts messages that describe the type of query being executed by the HELP database-access primitives, makes requests of PAM, receives data from PAM, and executes the patient data replacement algorithms.

The state of the virtual database is maintained by the VAM process. A table contains information necessary to implement the data replacement algorithms. This information is composed of the primary key into the virtual database, the size of the set of data, and the time of the most recent use of the data. This information is maintained for each set of data that is physically located in the virtual database files. The primary key is also used as the primary access path into this table. The time of most recent use of the data is a secondary access path for the table. This is used to determine the least-recently-used set of data.

The PAM process accepts requests from VAM, accesses the physical database, translates the data into HELP database format, and sends data to the VAM process. PAM also provides both read and write access to the physical database at the hospital admission level. It serves as the gateway through which data are added to
the physical database. This process supports data transfer at the level of data class for a specific hospital admission. PAM also supports data transfer at the level of all data for a specific hospital admission.

The ARZAP process removes sets of data from the virtual files to accommodate incoming data. The sets of data to be removed from the files are determined by the VAM process. The ARZAP process receives messages from the VAM process that specify the sets of data that are to be deleted. ARZAP then removes the specified data from the files so that the file space can be reclaimed.

The virtual database also supports different access paths. The set of supported access paths is the same as the set of paths that are supported by the standard HELP database. This allows users to sequentially scan through the archive of patients based on any one of the available orderings of patients.

In order to support transparent access to the virtual database for users of the HELP database-access primitives, two key primitives (IDDATA, RETRIEVE) were modified to recognize the virtual database. These two primitives were programmed to detect when the virtual database is being accessed and to communicate with the VAM process to indicate the set of data that is required to satisfy the request.

The physical database emulates the files of the HELP database at the block level. The physical database is managed using a relational database management system that conforms to the SQL standard. The Oracle Server for NetWare was selected to manage the physical database. The PAM process is an Oracle client that was developed with Oracle Pro*C, a precompiler for C. One of the reasons for the selection of relational database technology was to facilitate portability of the system and the use of appropriate storage technologies as they become available. Another reason is that tables that emulate the files of the HELP database are created easily. A third reason is that it is likely that the HELP database will be managed by an SQL-compliant database management system in the future.

The logical unit of transfer between the physical database and the virtual database is the complete set of data for a specific hospital admission. A hospital admission in this context is a specific admission to the hospital for a particular patient. This unit of data is analogous to a page of data in a virtual memory system. The transfer of data from the physical files to the virtual files models a page fault in a traditional virtual memory system. A page fault in the virtual database occurs when the data necessary to satisfy a query are not physically located in the virtual files, and must be transferred from the physical database.
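The idea of a relational table that emulates file blocks, with the admission as the unit of transfer, can be sketched as follows. This is an illustration under stated assumptions: Python's built-in `sqlite3` module stands in for the Oracle server used at LDS Hospital, and the table name, column names, and helper functions are hypothetical rather than the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE patient_blocks (
        admission_id INTEGER NOT NULL,
        block_no     INTEGER NOT NULL,
        block_data   BLOB    NOT NULL,
        PRIMARY KEY (admission_id, block_no)  -- one copy of each block
    )
""")


def store_admission(conn, admission_id, blocks):
    """Write all blocks for one admission, replacing any earlier copy."""
    with conn:  # commit on success, roll back on error
        conn.execute(
            "DELETE FROM patient_blocks WHERE admission_id = ?",
            (admission_id,),
        )
        conn.executemany(
            "INSERT INTO patient_blocks VALUES (?, ?, ?)",
            [(admission_id, n, b) for n, b in enumerate(blocks)],
        )


def fetch_admission(conn, admission_id):
    """Return every block for one admission: the 'page' that is transferred."""
    rows = conn.execute(
        "SELECT block_data FROM patient_blocks "
        "WHERE admission_id = ? ORDER BY block_no",
        (admission_id,),
    ).fetchall()
    return [row[0] for row in rows]


store_admission(conn, 42, [b"demographics", b"lab results"])
page = fetch_admission(conn, 42)
```

Because the whole archive sits in one table, a key constraint such as the primary key above can guard against duplicated admission data, which was difficult to enforce across independent disk packs.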

A standard page replacement algorithm was adapted for use in the virtual database. This algorithm causes the least-recently-used pages of data to be discarded from the virtual files as space for a new page of data is required.
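A least-recently-used replacement policy over variable-sized admissions can be sketched as below. The sketch is an assumption-laden illustration, not the VAM implementation: an `OrderedDict` stands in for the table with its primary and secondary access paths, sizes are in arbitrary units, and the `ReplacementTable` class and its methods are hypothetical.

```python
from collections import OrderedDict


class ReplacementTable:
    """Tracks resident admissions by key, size, and recency of use."""

    def __init__(self, capacity):
        self.capacity = capacity       # space available in the virtual files
        self.used = 0                  # space currently occupied
        self.entries = OrderedDict()   # key -> size, least recently used first

    def touch(self, key):
        """Mark a resident admission as the most recently used."""
        self.entries.move_to_end(key)

    def admit(self, key, size):
        """Make room for an incoming admission; return the keys to delete."""
        if size > self.capacity:
            raise ValueError("admission larger than the virtual files")
        if key in self.entries:
            self.touch(key)
            return []
        to_delete = []
        while self.used + size > self.capacity:
            # Discard the least-recently-used admission first.
            old_key, old_size = self.entries.popitem(last=False)
            self.used -= old_size
            to_delete.append(old_key)
        self.entries[key] = size
        self.used += size
        return to_delete


table = ReplacementTable(capacity=10)
table.admit("A", 4)
table.admit("B", 4)
table.touch("A")                 # A is now more recently used than B
evicted = table.admit("C", 4)    # needs 4 units, but only 2 are free
```

In the system described here, the list returned by `admit` corresponds to the set of data that VAM would ask ARZAP to delete.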

CONCLUSIONS

There are many substantial benefits that arise from the use of a virtual database for the archival data at LDS Hospital. The use of this virtual database solves all four problems associated with the scheme that was previously used to manage the archive of clinical data.

The problems associated with the use of demountable disk packs will be solved because these disk packs will no longer be required. A more appropriate medium can be selected to contain the physical files of the virtual database system. All the data will be on line, and in a single database. This not only provides accessibility to all the data, but it eliminates the problems that were associated with the partitioning of the database. It will be possible to apply database-wide constraints to the physical database. This will prevent the replication of data. The on-line availability of the data also facilitates the generation of a longitudinal record.

Although performance of the virtual database has not been discussed in this paper, the performance of the virtual database was compared with a standard database [15]. For interactive transactions, the delays caused by the virtual database were negligible when compared to the other delays inherent in interactive computing. In the case of sequential access, the virtual database was two to three times slower; however, the fact that all the data are available on line results in a net reduction in the time required to search the virtual database.

Finally, more significant than the problems solved by the virtual database are the doors that are opened by this technology. For example, this technology will facilitate the longitudinal view of patient care, rather than the artificial billing-oriented view. This will allow important clinical data from previous encounters to be available in electronic form at the time of readmission to the hospital. Another example is the use of these data for research. This virtual database establishes a very large database that can be used to answer important medical questions. In addition, researchers can use the data to answer ad hoc questions and to guide them in their research because the data are on line.

The virtual database at LDS Hospital lays the foundation for collaboration among researchers and developers of clinical applications. The availability of data is a key element of collaboration. These data can be combined with other sets of research data, or can be used to validate the models upon which new applications depend.

References

1. Pryor TA, Gardner RM, Clayton PD, Warner HR. The HELP System. SCAMC 1982;6:19-27.
2. Kuperman GJ, Gardner RM, and Pryor TA. HELP: A Dynamic Hospital Information System. Springer-Verlag, New York, N.Y., 1991.
3. Ellis LBM and Krogh C. Noise and Validity in a Practice-Derived Database. SCAMC 1990;14:271-5.
4. Pryor DB, Califf RM, Harrell FE, Hlatky MA, Lee KL, Mark DB and Rosati RA. Clinical Databases. Medical Care 23 (1985), 623-47.
5. Zwetsloot-Schonk JHM, Snitker P, Vandenbroucke JP and Bakker AR. Using Hospital Information Systems for Clinical Epidemiological Research. Medical Informatics 14, no. 1 (1989), 53-62.
6. Gardner RM, Hawley WL, East TD, Oniki TA and Young HW. Real-Time Data Acquisition: Experience with the Medical Information Bus (MIB). SCAMC 1991;15:813-7.
7. Thompson BD, Piland NF, Hoy WE, Watkins M and Montgomery KA. Standard Information Content and Procedures Used in the Formation of a Research Oriented Health Services Database. SCAMC 1990;14:359-63.
8. Pfrenzinger S. The Importance of Being Separate. Database Programming and Design 4 (August 1991), 47-52.
9. Fotheringham J. Dynamic Storage Allocation in the Atlas Computer, Including an Automatic Use of a Backing Store. Commun. ACM 4 (1961), 435-6.
10. Carr RW. Virtual Memory Management. UMI Research Press, Ann Arbor, MI, 1984.
11. Hamacher VC, Vranesic ZG and Zaky SG. Computer Organization. McGraw-Hill, Inc., New York, NY, 2nd edition, 1984.
12. Tanenbaum AS. Operating Systems: Design and Implementation. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1987.
13. Hennessy JL and Patterson DA. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc., San Mateo, CA, uncorrected preliminary edition, 1989.
14. Motro A. Superviews: Virtual Integration of Multiple Databases. IEEE Trans. on Software Engineering SE-13 (1987), 785-98.
15. Stevens LE. The Development of a Virtual Database and an Evaluation of Techniques to Improve the Performance of a Virtual Database. Master's Thesis, University of Utah, 1992.
