General purpose information handling techniques for pathological data.

Comput. Biol. Med. Pergamon Press 1975. Vol. 5, pp. 221-233. Printed in Great Britain

GENERAL PURPOSE INFORMATION HANDLING TECHNIQUES FOR PATHOLOGICAL DATA

T. C. SHARPE and D. E. CLARK

Medical Computing Unit, University of Manchester, Coupland Building No. 1, Manchester M13 9PL

(Received 17 August 1973 and in revised form 26 December 1974)

Abstract: The paper describes a set of programs which can be made available to the pathologist as a simple but powerful tool in retrieval from free-text reports. The pathologist may specify a combination of words, or a numerical quantity in a given range, as the condition for selection; the identification numbers and, if required, the texts of the selected reports are printed back for his perusal. There is no restriction on the terminology which may be used in the reports, but a cumulative dictionary of terms is produced for quality control. The programs are written almost entirely in standard FORTRAN.

Pathology data    Retrieval    Free text    FORTRAN    Dictionary    Concordance

INTRODUCTION

Modern medicine is continually taking up new physical techniques. The feeling is that experience and intuition should go together with a scientific approach to achieve results in this complex field. The study of pathology has been fundamental in improving medical practice. However, pathological information has traditionally been recorded in narrative form, and for a number of reasons this still appears to be the best way to do it. Unfortunately there are difficulties in retrieving information from this type of record which to some extent prevent its use as a source of information for teaching and research. This applies to conventional computer techniques as well as manual methods. Therefore it has been necessary to devise special systems for handling this type of information.

The earliest attempts at using data processing devices for medical information handling involved coded data [1], using techniques similar to those for stock control and wages and salary computation. These required highly stylised input. Next came a check-list [2] or multiple-choice question with the facility for inserting comments (variable length, variable field). This occurred at about the same time as Korein [3] succeeded in the production of a stylised case abstract with both input and output in the English language. In Britain, the check-list technique was first used at Glasgow, in the SWITCH [4] system. The same ideas were used by Anderson [5] in a system involving on-line data input from terminals in various parts of a hospital.

A special facility called MUMPS [6] has been developed at the Massachusetts General Hospital. It possesses the ability to handle strings of letters and numbers, although the scientific arithmetic facility is very limited. It is an interpreter, running interactively between terminals, and is therefore well suited to medical use.
Another system which has been developed is the QPL [7] program at U.C.L.A.; this is a very large and comprehensive system for processing English medical text. There are a number of other systems available, but the two last mentioned (although they work very well) illustrate some of the disadvantages from which most of them suffer. The MUMPS system is written in assembler language, which means that it is tied to a particular type of computer; the QPL system is so large that it would be difficult to run it on any but the most powerful machines. These are not generally available to the pathologist.

There is a need for a system written in a way which makes it generally available. The programs described in this article are written in FORTRAN IV, a language which is available on an estimated 75% of machines. A medical information system written in this language, provided it used only a modest-sized computer system, would enable a pathologist starting a project in Manchester to continue it in Munich or Minneapolis.

The U.K. Atomic Energy Authority at their Culham Laboratories have developed a text handling system for retrieving information on any desired topic under Atomic Energy Law [8]. This system was implemented mainly in FORTRAN II but has been re-written almost entirely in FORTRAN IV with a minimal amount of low level programming for character and bit manipulation. Facilities for handling numbers and dates interspersed throughout the text have been added.

THE CONCORD SYSTEM

Introduction

The present program is a compromise between the simpler key-word searching programs and the linguistic-analysis programs. It was originally implemented on a 32k tape based computer system, which imposed severe restrictions on the size of dictionary used and the type of retrieval employed. Instead of generating a cumulative concordance of all words encountered and searching this for a given word or phrase, CONCORD produces a separate dictionary and concordance for each block of text input to the system. The concordances are then searched one by one, the search time on each concordance being very short. There is no theoretical limit on the amount of text which can be entered to the system, only on the size of each block, which is governed by the space available in core store. However, a combined dictionary of terms is produced for “quality control” purposes, and the size of this is limited.

External characteristics

The package consists of three separate programs: (a) CONCORD, a concordance and dictionary-producing program; (b) CONQUEST, a retrieval program; (c) CONDIC, a thesaurus generating program. The input to CONCORD consists of blocks of text each of which is preceded by an identification record and followed by a terminator. The text consists of ordinary English containing words and punctuation. If any numeric quantities appear, they have to be associated with an alphameric identifier enclosed in brackets, e.g. (WEIGHT) 50 Kg (see Fig. 1). CONCORD generates a separate dictionary and concordance for each block of text (Fig. 2). CONQUEST employs a simple “language” in which the user can define his search requests (see Fig. 3). He can specify several different searches to be carried out on one pass through the concordance data and then either create a subset to be used for further searches or re-scan the whole of the data with a different set of requests.
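As an illustration of the input convention described above, the following is a minimal sketch (in Python rather than the FORTRAN of the original programs) of how one block of text might be separated into tagged numeric quantities and ordinary words. The pattern, the function name `parse_block` and the choice to fold words to upper case are assumptions made for this example, not details taken from CONCORD.

```python
import re

# Assumed pattern: an identifier in brackets followed by a number, e.g. "(WEIGHT) 50".
TAGGED_NUMBER = re.compile(r"\(([A-Za-z0-9]+)\)\s*([0-9]+(?:\.[0-9]+)?)")


def parse_block(text):
    """Return (numbers, words) for one block of report text.

    numbers maps each bracketed identifier to its numeric value;
    words is the list of remaining alphabetic words, in order.
    """
    numbers = {tag.upper(): float(value) for tag, value in TAGGED_NUMBER.findall(text)}
    # Remove the tagged quantities, then keep alphabetic words (hyphens allowed).
    remainder = TAGGED_NUMBER.sub(" ", text)
    words = [w.upper() for w in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", remainder)]
    return numbers, words


numbers, words = parse_block(
    "(M) 24 (R) 1234.70 Old penetrating corneal injury with total retinal detachment."
)
```

A bracketed item that is not followed by a number, such as the "(1)" and "(2)" used for enumeration in Fig. 1, is not treated as a numeric quantity by this sketch.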


Report Number 9 (M) 24 (0) 9.70 (R) 1234.70
Old penetrating corneal injury with, (1) partial loss of intraocular contents and traumatic cataract. (2) total retinal detachment.

Report Number 10 (M) 32 (0) 10.70 (R) 1235.70
Recent penetrating cornea-scleral injury with major loss on intraocular contents, extensive anterior dialysis + total detachment of retina.

Fig. 1. Text of two biopsy reports.

Fig. 2. Concordance for report No. 9.

Fig. 3. Search instructions: pick off first 10 records, then search for “PENETRATING INJURY”.

[Output listing: 76 documents processed; search 1 satisfied by 10 documents (Report Numbers 1 to 10); 10 records in subset; search on the subset satisfied by 2 documents.]

Fig. 4. Output from CONQUEST for the search in Fig. 3. The text of the two reports is also printed out, as in Fig. 1.


The identification record and (if required) the text of each block satisfying the search is reproduced (Fig. 4). CONDIC merges the dictionaries of a number of blocks of text to produce a listing of all the words used. The principal purpose of this is for detection of spelling mistakes, etc. and also as a guide to terminology (Fig. 5).

Fig. 5. Part of the dictionary of terms produced by CONDIC.
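The merging of per-block dictionaries into one combined listing can be sketched as a two-way merge of alphabetically ordered word lists. The version below is written in Python for illustration only; the function name `merge_dictionaries` and the per-word count of blocks are assumptions added for this example (a word seen in only one block is a natural candidate for a spelling check), and nothing here is taken from the CONDIC source.

```python
def merge_dictionaries(combined, block_words):
    """Merge one block's sorted word list into the sorted combined list.

    combined is a list of (word, block_count) pairs in alphabetical order;
    block_words is the block's dictionary, also in alphabetical order.
    """
    merged = []
    i = j = 0
    while i < len(combined) and j < len(block_words):
        word, count = combined[i]
        if word < block_words[j]:        # combined word is earlier: copy it across
            merged.append((word, count))
            i += 1
        elif word > block_words[j]:      # block word is earlier: enter it as new
            merged.append((block_words[j], 1))
            j += 1
        else:                            # same word: it occurs in one more block
            merged.append((word, count + 1))
            i += 1
            j += 1
    merged.extend(combined[i:])                      # rest of the combined list
    merged.extend((w, 1) for w in block_words[j:])   # rest of the block's words
    return merged


combined = []
for block in (["CORNEAL", "INJURY", "PENETRATING"], ["CORNEA", "INJURY", "RETINA"]):
    combined = merge_dictionaries(combined, block)
```

Because both inputs are already in alphabetical order, each merge needs only one pass over the combined list and the block's list.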

INTERNAL STRUCTURE

(a) CONCORD

The main purpose of this program is to compile a dictionary of all the words used in a block of text and to produce a parallel concordance. The dictionary has one entry for each different word of the text, the entries being in alphabetical order. The concordance gives a list of the occurrences of each dictionary word in the original text, stating precisely where they are (e.g. 3rd word in the first sentence and 2nd word in the 4th sentence).

The dictionary consists of a set of “subchains”. At the beginning of each subchain is a marker word. In order to locate a given word the markers are scanned to find which is alphabetically nearest to the word, then the relevant subchain is scanned to find the word itself. The optimum number of markers is a function of the dictionary size, but it is convenient to use the 26 letters of the alphabet as markers, so that all the words in the first subchain begin with the letter A, all those in the second subchain with the letter B, etc. The order of the dictionary words in core store is arbitrary; a set of pointers is used to indicate the alphabetical order of the words. Thus only the location of the first word in each subchain is known. This will have an entry in a parallel array containing the core location of the next word in the dictionary, and so on. The dictionary entry also has a pointer to the concordance chain for that particular word and a pointer to the “tail” of the word. The “tail” exists because in a fixed word-length machine it may be necessary to use several machine words to store one dictionary word. Before being stored on tape or disc for subsequent searching, the dictionary and all its pointers are put into true alphabetical order.

In addition to the dictionary and concordance-producing routine, the program contains input routines which scan the text character by character and break it up into separate words, and a number conversion routine which converts strings of digits into binary numbers by multiplying them by successive powers of ten (Fig. 6). In order to make economical use of core storage, the pointers referred to previously are packed 2, 3 or 4 to a machine word and unpacked as required.

Fig. 6. CONCORD.
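The subchain arrangement described above can be sketched with parallel arrays: words are stored in arrival order, and a pointer array threads each of the 26 subchains in alphabetical order. The sketch below is in Python and is illustrative only; the class name `BlockDictionary`, the use of -1 as an end-of-chain marker and the (sentence, position) form of a concordance reference are assumptions for this example. The word "tail" and the packing of pointers are omitted.

```python
class BlockDictionary:
    """Dictionary for one block: 26 subchains threaded through parallel arrays."""

    def __init__(self):
        self.words = []          # dictionary words, stored in arrival order
        self.next = []           # parallel array: next word in the subchain, or -1
        self.refs = []           # parallel array: concordance references per word
        self.heads = [-1] * 26   # first word of each subchain (A..Z), or -1 if empty

    def add(self, word, sentence, position):
        """Record one occurrence of word (upper case, assumed to start with A-Z)."""
        chain = ord(word[0]) - ord("A")
        prev, cur = -1, self.heads[chain]
        # Walk the subchain until the alphabetical slot for the word is reached.
        while cur != -1 and self.words[cur] < word:
            prev, cur = cur, self.next[cur]
        if cur != -1 and self.words[cur] == word:
            self.refs[cur].append((sentence, position))
            return
        # New word: store it at the end of the arrays and relink the pointers.
        new = len(self.words)
        self.words.append(word)
        self.next.append(cur)
        self.refs.append([(sentence, position)])
        if prev == -1:
            self.heads[chain] = new
        else:
            self.next[prev] = new

    def in_order(self):
        """Yield (word, references) in true alphabetical order."""
        for head in self.heads:
            cur = head
            while cur != -1:
                yield self.words[cur], self.refs[cur]
                cur = self.next[cur]


def build(sentences):
    """Build the dictionary and concordance for a list of sentences."""
    d = BlockDictionary()
    for s, sentence in enumerate(sentences, start=1):
        for p, word in enumerate(sentence.split(), start=1):
            d.add(word.upper(), s, p)
    return d


d = build(["total retinal detachment", "total detachment of retina"])
ordered = list(d.in_order())
```

Following the chains with `in_order` corresponds to the final step of putting the dictionary into true alphabetical order before it is written out.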

(b) CONQUEST

The search requests are read in at the start of a run and retained in core store. A numeric code is associated with each keyword (OR = 1, AND = 2, etc.). The dictionary and concordance are read in one at a time. A binary search is carried out on the dictionary to see if it contains the word specified in the first search request. (A binary search consists of a comparison between the search-word and the midpoint of the dictionary to see which half of the dictionary must contain the word; then a repetition of the process on that particular half of the dictionary; and so on until either the word is found or the search length reduces to zero. This requires a maximum of log₂ N repetitions for a dictionary containing N words.) If the word exists, the concordance references are extracted.

The second search instruction is processed in the same way. If it imposes a condition on the first instruction (i.e. it is an AND or NOT instruction) and a plus or minus range is specified, the concordance references are compared with those relating to the word in the first instruction and deleted if they do not satisfy the condition. If any references remain after processing a string of search instructions (terminated by a STOP instruction), the search is said to be satisfied.

Numeric instructions contain a number and an identifier tag. If the tag can be found in the current block of text, the associated number is extracted and compared algebraically with the specified number to see whether it satisfies the greater than, less than or equal to condition.

When processing is complete for one block of text, the search results are written to tape or disc (Fig. 7).
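The three kinds of test applied to one block (dictionary look-up by binary search, a word condition with a plus or minus range, and a numeric comparison) can be sketched as follows. This is a Python illustration under stated assumptions, not the CONQUEST code: the function names, the relation codes "LT", "EQ" and "GT", and the reading of the range as a word offset within the same sentence are all hypothetical.

```python
def binary_search(dictionary, word):
    """Return the index of word in the sorted list dictionary, or -1."""
    low, high = 0, len(dictionary) - 1
    while low <= high:
        mid = (low + high) // 2
        if dictionary[mid] == word:
            return mid
        if dictionary[mid] < word:
            low = mid + 1       # word can only be in the upper half
        else:
            high = mid - 1      # word can only be in the lower half
    return -1


def references(dictionary, concordance, word):
    """Concordance references (sentence, position) for word, or an empty list."""
    index = binary_search(dictionary, word)
    return concordance[index] if index >= 0 else []


def and_within(first_refs, second_refs, minus, plus):
    """Keep references to the second word that lie within range of the first.

    Assumed semantics: same sentence, and the offset (second minus first)
    lies between -minus and +plus words.
    """
    return [
        (s2, p2)
        for (s2, p2) in second_refs
        if any(s1 == s2 and -minus <= p2 - p1 <= plus for (s1, p1) in first_refs)
    ]


def numeric_ok(numbers, tag, relation, value):
    """Compare a tagged number with value; relation is 'LT', 'EQ' or 'GT'."""
    if tag not in numbers:
        return False
    found = numbers[tag]
    return {"LT": found < value, "EQ": found == value, "GT": found > value}[relation]


# First four words of report No. 9 (Fig. 1), as a small illustrative block.
dictionary = ["CORNEAL", "INJURY", "OLD", "PENETRATING"]
concordance = [[(1, 3)], [(1, 4)], [(1, 1)], [(1, 2)]]
hits = and_within(
    references(dictionary, concordance, "PENETRATING"),
    references(dictionary, concordance, "INJURY"),
    minus=0,
    plus=2,
)
satisfied = bool(hits)  # the search is satisfied if any references remain
```

In this example INJURY occurs two words after PENETRATING in the same sentence, so one reference survives and the search is satisfied for this block.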

Fig. 7. CONQUEST.

Fig. 8. CONDIC.

When all the blocks of text have been processed, the results are read back and the program proceeds sequentially through the text file, printing out the title and, if required, the contents of each block of text which has satisfied the various searches. The dictionary, concordance and text may also be selectively written to a sub-file for further processing.

(c) CONDIC

CONDIC uses a simple core-bound merge technique to produce a combined dictionary of all the words used in a series of blocks of text (Fig. 8). Separate areas of core are reserved for words which will fit into 1, 2 or 3 machine words.

HARDWARE AND SOFTWARE REQUIREMENTS

CONCORD has been implemented on IBM 7090 and IBM 370/165 computers. The search program CONQUEST can run interactively under TSO or in batch mode, whilst the other programs run in batch mode only. The majority of the subroutines which make up the programs are written in standard FORTRAN IV and should be transferable to most other machines. However, certain additional routines are required, (a) for character and bit manipulation and (b) for comparison of alphanumeric strings. In the 370 version these are implemented (a) in extended FORTRAN and (b) in assembler language, whilst in the 7090 version library routines are used. The routines consist of 10-20 machine instructions each.

The core store requirements for the three programs in the 7090 and 370 versions are shown in Fig. 9. The working storage required for CONCORD is a function of the size of the largest block of text which can be handled at a time and also of the word-length of the computer. The working area for CONDIC is set by the size of the dictionary of terms and therefore depends on the number of blocks processed as well as their size and variety of expression.

By reducing the block size it should be possible to run a version of the program on a mini-computer provided it had at least 16k of core (12 or 16 bit word) and either two tape drives or disc. However, there may be problems in compiling CONCORD due to the large number of source statements (nearly 4000, 50% of which are comments + blank common).

(a) 7090 (36-bit word = 6 characters)

                      CONCORD   CONQUEST   CONDIC
Program                  4.5       3.9       1.3
Working arrays          19.4      15.2      24.0
Library routines         4.5       6.6       5.6
Total                   28.4      25.7

(b) 370 (32-bit word = 4 bytes)

                      CONCORD   CONQUEST   CONDIC
Program                  6.1       5.6       2.9
Working arrays          20.8      17.4      25.0
Library routines         5.1       4.5       4.4
Total                   32.0      27.7      32.3

Total size of partition (Kbytes): 142

(c) CONDIC (graph; axis labels: No. of words in dictionary, 0 to 2500; No. of words per block)

Fig. 9. Allocation of core storage (K words).

LIMITATIONS AND TIMINGS

The largest block of text which can be handled by the present system is 2000 words or 1000 different words or 200 lines and 100 numbers. CONQUEST can accept up to 500 search instructions on a single pass, and can create sub-sets to a depth of 4. CONDIC can generate a dictionary of up to 10000 words. These limitations apply to a system comparable to the 7090. Figure 9 shows the amount of core store required in the 7090 and 370 versions of the programs. Most of the storage space is allocated to the working arrays; the size of these arrays is proportional to the maximum dictionary and concordance size (in CONCORD and CONQUEST), the maximum number of search instructions (in CONQUEST) and the size of the combined dictionary (in CONDIC).

Fig. 10. Typical timings (sec) for processing one block of text: (a) CONCORD; (b) CONQUEST. Horizontal axis in both panels: No. of words per block (0 to 1000).


The timings for processing various block sizes on the 7090 are shown in Fig. 10. It will be seen that the initial concordance generation (a) is relatively slow. The intercept on the time axis represents the time taken to initialise the various arrays. The search program (b) is considerably faster, and fairly complex searches may be run without seriously affecting its performance. For very short records and simple searches, the speed is limited by the data transfer rate from magnetic tape. The time taken for CONDIC (c) to merge one block into the combined dictionary is proportional to the size of the combined dictionary and the size of the block, and is once again I/O limited for short blocks. The 370 is approximately 10 times as fast as the 7090 on these programs.

SYSTEM EVALUATION

Unfortunately it is not possible at this stage to give an account of the clinicians’ reactions to the system, as the amount of data accumulated so far is insufficient for them to start obtaining any useful results. One can infer from this that although there has been a certain amount of interest in using the system, there has been an unwillingness to spend large amounts of time putting data into the system. This is understandable in the purely clinical departments but is rather surprising in the departments which have a research and teaching commitment, because the potential return should make the effort worthwhile.

Our experience is that the problems are purely practical and have to do with data acquisition because: (i) it requires money; (ii) it requires staff. Clearly the data we are considering are confidential and cannot be sent off to a commercial agency. Therefore unless the medical computing unit can furnish equipment for data acquisition, the user is faced with a considerable capital cost. In addition, he will either have to persuade one of his staff to prepare the data or pay for a new member of staff.

At present there is an on-going project in the Department of Pathology in the Manchester Royal Eye Hospital. For this, a surplus 5-hole paper-tape punch was used and special software was written to convert the data into BCD card images. The secretarial staff were fortunately co-operative, and the hard-copy obtained was thought to be a useful by-product.

DISCUSSION

Although the possibility of generating an Esperanto for pathological data handling might at first sight seem remote, our results show that it is certainly possible to run a generalised processing system for such data, at least in a batch-mode (this is analogous to sending a number of samples to the laboratory and receiving the results after a certain period of time). The main use of the system would be for feeding back information to pathologists on the results of their findings.
Other uses may be, for instance: (1) detecting shifts in the incidence of certain diseases; (2) identification of problem areas; (3) death audit; (4) a fund of past experience for prognosis. No hospital information system will be of practical value unless the pathological system is improving medical precision and accuracy all the time; we all go on making the same mistakes unless we have some way of finding and correcting the errors. The same goes for undergraduate and post-graduate teaching, where access to real examples of pathological data could be of immense benefit in highlighting difficulties and illustrating trends.

Although batch-processing has been mentioned, the system would be much more useful for clinical and teaching purposes if it were possible to retrieve information “straight away”, i.e. in real-time, using terminals such as teleprinters or t/v display units. Most university computer systems are not designed to accept heavy terminal usage involving large scale data handling, so there would be a need for a separate medical (school) computer; this would be medium sized rather than small because of the demands upon it. The program has been written so that it would fit into an environment where input was from teleprinter or display, and output was to be done at high speed; the concordance generation is relatively slow but has to be done only once on each new item of data, whilst the search process, which may be repeated many times, is extremely fast.

The question of file maintenance (which would enable data to be added to and corrected when required) has not been considered in detail for two reasons: partly because the problem has already been solved adequately by systems now in operation, and partly because it depends largely on the particular computer in use, and its operating system. However, the basic storage and retrieval system should be readily transferred from one medical institution to another.

SUMMARY

The need has become apparent for a medical information system which can handle English text and can be transferred from one medical institution to another. A set of computer programs has been written for this purpose and they have been tested on clinical data from various departments of a hospital. The elements of this tool have been kept deliberately simple and consist of: (a) CONCORD, an input and concordance generating program; (b) CONQUEST, a search and retrieval program; (c) CONDIC, a program to generate a dictionary of terms.
The data processing system is written in FORTRAN IV and is thus as machine independent as possible. It requires a medium scale computer and for most effective use in teaching institutions should be operated in real-time mode using terminals.

REFERENCES

1. J. E. Schenthal, J. W. Sweeney and W. Nettleton, Clinical application of large-scale electronic data processing apparatus. JAMA 173, 611 (1960).
2. Van Brunt, Collen, Davis, Besag and Singer, A pilot data system for a medical centre.
3. Korein et al., Computer processing of medical data by variable-field-length format. JAMA 196 (11) (1966).
4. F. Kennedy, A. G. Cox, A. I. M. Glen, A. D. Roy and C. E. Sundt, A computer based system for handling clinical data. Computers in the Service of Medicine, vol. 1.
5. J. Anderson, The development of medical recording. Working Conf. on The Information Processing of Medical Records. Lyons, North-Holland, Amsterdam (1970).
6. G. O. Barnett and P. A. Castleman, Comput. Biomed. Res. 1, 41 (1967).
7. B. G. Lamson, Storage and retrieval of uncoded tissue pathology diagnosis in free text. (Paper given at 7th IBM Medical Symp. Poughkeepsie, October 1965).
8. Niblett and Price, STATUS, a Concordance Generating Program. Leaflet published by H.M.S.O.

About the Author: DAVID E. CLARK began his career as a Telephone Engineer in 1950. From 1952 to 1954 he worked on Centimetric Radar in the R.A.F. He then took a Medical Degree at Leeds University, qualifying in 1960, and continued his studies there as May and Baker Research Fellow, 1961 to 1965. During this time he also received the D.I.C. in Engineering in Medicine from Imperial College, being the first English person to do so. Dr. Clark became Director of the Medical Computing Unit at the University of Manchester in 1965. Since then he has visited many medical centres in Western Europe and North America, and lectured and written extensively on the subject of Medical Computing. He is a member of the British Computer Society, and an editor on two international journals.

About the Author: THOMAS C. SHARPE received the B.Sc. in Physics from Imperial College, London, in 1969. He was then employed by the Plessey Company, participating in research and development of computer memories and optical character recognition equipment. Mr. Sharpe took up his present post as Systems Programmer in the Medical Computing Unit of the University of Manchester in 1970. The set of programs described in the current paper formed the basis of his Master’s Degree, which was obtained in 1973. His present work consists of development of the software for data collection and analysis in a variety of fields related to medical research.
