AUTOMATED QUALITY CONTROL OF BIOMEDICAL DATA PART I: SYSTEM DESCRIPTION


Authors:
Leif H. Larsen, Richard D. B. Williams, and Geoffrey R. Nicol
Shepherd Foundation, Melbourne, Australia

Referee:
Peter Wilding
Central Birmingham Health District and Wolfson Research Laboratories, Birmingham, England

I. INTRODUCTION
Technological advances have resulted in the widespread establishment of laboratories with large-scale use of automated analyzers and computerized data processing facilities. Because of the reduction in the human contribution to laboratory measurement, it is becoming increasingly difficult to maintain a constant vigil against occasional malfunctions of the testing equipment or the laboratory methodology. Consequently, there is an increasing awareness of the need for efficient control procedures. Many of the potential benefits of automated procedures may be nullified if the integrity of the data is not preserved. In an automated environment, it is essential to build into the system a set of stringent safeguards, with the object of minimizing the initial entry of errors into the system, and, as a second line of defense, to maintain an effective control system for detecting the residual errors. This paper describes the control procedures in operation at the Shepherd Foundation Centre in Melbourne, Australia. This is a nonprofit, automated, multiphasic health-testing laboratory and clinic. All the patients are referred to the Foundation

by medical practitioners either for the elucidation of nonspecific symptoms or for periodic health evaluations. The routine biomedical profile consists of about 60 physiological and laboratory measurements and an automated medical history questionnaire. Existing computer files contain data from over 60,000 patients, currently increasing at a rate of over 100 patients per day. Computer procedures are used to analyze these data on a routine basis. The Foundation, therefore, represents a good example of the type of automated environment that needs to have efficient control procedures. The main topic of this paper is an automated statistical quality control system based on the cumulative sum (cusum) principle. Several disciplines have contributed to its development. Therefore the paper is subdivided into sections which describe the system from the respective viewpoints of the laboratory director, the quality control specialist, the biomedical statistician, and the computer systems analyst. This description of the control system is presented as the first of two companion papers. The second paper, by Larsen, Williams, and Nicol,³ will provide a complementary discussion of the results obtained from the system.



II. LABORATORY PERSPECTIVES
The demand by the community and medical profession for more reliable and comprehensive diagnostic tools has created a situation where automation of test procedures is necessary and commonplace. Unfortunately, the means of assessing and interpreting the data produced have not kept pace with the rate of increase of test results. Modern laboratory instruments have a high capacity for generating results, but the reliability of the data may often be variable. This is not surprising when one considers the number of factors that affect a test result. There are many sources of variation between the starting point of the patient’s physiological state and the end point of transmitting the result to the physician. Some of these sources of variation will affect the accuracy of the observation (accurate observations will be close to the “true” value). Other sources of variation will affect the precision of the observation (precise observations will be capable of being reproduced consistently, while not necessarily being accurate). There are several sources of variation which affect both accuracy and precision. All the different sources of variation that occur in biomedical data can be classified into four different groups: intrinsic errors, systematic errors, rogue errors, and random variation. We have found this classification to be helpful as a basis for control procedures. The notes below clarify the distinction between the four groups, and the subsequent discussion describes the primary and secondary control procedures.

A. Sources of Variation

1. Intrinsic Errors
An intrinsic error is a consistent bias that exists in the test system. It cannot easily be eliminated, although in some instances correction factors can be applied. It affects the accuracy but not the precision of the test. The factors that determine the magnitude of an intrinsic error relate to the choice of the instrumentation and methodology. An example of this type of error occurs in the estimation of glucose by a reducing method where the intrinsic error arises from poor specificity.

2. Systematic Errors
Like intrinsic errors, systematic errors reduce the accuracy of the data, but do not affect


precision. Unlike intrinsic errors, they are not built into the system. They are invariably related to technician expertise and can be eliminated by careful attention to standardization of technique. A typical example of this type of error is in sample collection, where prolonged or excessive peripheral venous stasis can elevate many biochemical parameters such as serum potassium, calcium, albumin, and cholesterol.

3. Rogue Errors
Rogue errors occur only occasionally in a well-controlled system. They are typically well outside the usual range of variability and affect both the precision and accuracy. They commonly derive from areas related to sample and result administration, especially where transcription of results is required. This type of rogue error can be minimized by reducing the human contact with data by means such as on-line data collection and machine-generated labels. Another common source of rogue error lies within some automated laboratory instruments. For example, the temporary occlusion of an autoanalyzer tube can drastically affect a result without obvious indication to the operator. This type of error is very difficult to detect and control on a manual basis, but secondary automated control procedures provide an effective safeguard.

4. Random Variations
Random variation affects the precision of a test without affecting the accuracy. It reflects the degree of control or stability of the testing system and can generally be minimized by replication of the test and averaging the results. An example is the variation in the reaction rate of an enzyme caused by temperature fluctuations in the heating bath. Random variation determines the sensitivity of a test, and it is of utmost importance that it is reduced to a level consistent with the desired clinical sensitivity. Both in vivo and in vitro variables should be considered in the control of random errors. The effect of variation due to physiological causes is also important in this context. All biomedical measurements are subject to variability both between and within patients. Between-patient variations affect the sensitivity of the control procedures (see the discussion of our standardization procedures in Section III.C). Within-patient variability affects the precision of

the observations, and the value actually recorded may not be the most typical value for the patient.


B. Primary Control Procedures (Error Prevention)
The term “primary control” is used to describe procedures in which the objective is to prevent the initial entry of errors in the data. The major objective in the establishment of a test procedure must be to minimize the occurrence and magnitude of all types of error. It is beyond the scope of this paper to give a full review of the primary control procedures at the Shepherd Foundation. In summary, there are two main aspects, the equipment and the staff.
Equipment - The trend towards greater automation of testing equipment makes it increasingly important to emphasize the correct initial selection of equipment, its routine maintenance, and the procedures for continually reviewing its performance.
Staff - The selection, training, and psychological motivation of the staff are of paramount importance. In our experience, the attention paid to staff motivation has had particularly beneficial results in terms of data quality. This includes a general improvement in accuracy and precision as well as a reduction in the incidence of rogue errors. We regard staff training as an ongoing procedure, incorporating continuous monitoring of performance and methodology with a periodic review of proficiency.

C. Secondary Control Procedures (Error Detection)
The objective of secondary control procedures is to detect the errors that are present in the data. The detection of errors has an obvious immediate benefit (it enables the error to be corrected), but there is an additional benefit which in the long run is more substantial: it triggers an investigation to determine why the error occurred. The consequent feedback to the primary control function of error prevention will, hopefully, reduce the incidence of that type of error in the future. We operate three types of detection procedure, as outlined below.
Routine review points - The data for each patient are checked manually at four different points during the course of their progress; these points are the initial reading of values by the technician, the preliminary print-out of data for physical tests, the final print-out of data, and the preparation of the report for the referring practi-

tioner. In addition, the real-time data-entry system alerts the technician to abnormal or illegal data entries.
Control samples - This procedure is based on the standard practice of introducing known and unknown control samples into each batch of measurements. The frequency of introduction of secondary controls is generally determined by the stability of the test and the economics of the analytical procedures. While the need for secondary controls is well established, the analysis of the information provided by this technique is rarely used efficiently. We believe that its main role is to supplement the analysis of patient test results as provided by the statistical quality control system (see next paragraph).
Automated statistical quality control - This procedure monitors the test results at the final stage of operation, prior to presentation to the referring practitioner. It uses the “cusum” technique as described in Section III. The effectiveness of the control system is due to its ability to recognize changes or abnormalities in the pattern of the test results and is independent of the extrapolation of quality from secondary controls. This system is the main topic of the present paper.

D. An Outline of the Quality Control (QC) System
The QC system represents a technological solution to a technological problem. The problem was threefold. In the first place, computerization of the data processing and storage facilities made it possible to generate large volumes of data for input to statistical and epidemiological investigations. This highlighted the need to ensure that the quality of the data was demonstrably beyond question. In the second place, the trend towards increasing automation of the testing equipment made it more difficult to maintain a constant human vigil against occasional malfunctions. A third consideration was the desire to replace the subjective assessment of quality by fully objective methods. The solution was developed over a period of 18 months (late 1973 to early 1975) and is now well established. It represents a fully integrated computer system which operates on several different time scales. This range is necessary in order to provide comprehensive protection against all the different types of error that can occur.
On-line control - The ultimate degree of protection by a statistical QC system would be



obtained by monitoring the results in real time, with the system operating on-line from all instruments. This is not yet operational at the Shepherd Foundation but is in the planning stage as an imminent extension of the system. We anticipate that on-line control will reduce the number of secondary control samples, while at the same time giving greater confidence in the reliability of the results. The technician operating the test will have access to more complete and up-to-date information about the stability of the test and thus will avoid the production of invalid and expensive data.
Short-term control - On a daily basis, the computer monitors the pattern of test results for successive patients. This embraces a total of about 60 physiological and laboratory observations per patient. The results are summarized separately for each test, and the quality status of each test can be defined at any time in terms of three alternatives:
1. Status “green” - This indicates that there is no suspicion of any change in the mean level of the test results.
2. Status “amber” - There is some suspicion of a change, but not enough evidence to justify a firm conclusion.
3. Status “red” - There is enough evidence of abnormality to indicate that a change in the mean level has definitely occurred.
Three computer reports are provided at the end of each day. A daily data-file listing gives a convenient summary of the day’s results for each patient for each test. A daily summary report gives the end-of-day status for each test. It quotes the current cusum values and indicates whether any abnormalities were detected. A daily exception report gives patient-by-patient results and cusum values for each test having abnormal results. This level of control provides excellent protection against rogue errors and is also efficient in the detection of short-term systematic errors.
Medium-term control - On a weekly basis, we monitor the daily mean and the daily standard deviation for each of the quantitative tests. The computer gives the same status classification (green, amber, red) as for the daily results. It also analyzes the variability of the test results between and within days, to provide additional insight. The information is presented in a weekly analysis report, which has a set of three tables for each


test. This level of control gives good protection against moderate but persistent systematic errors (e.g., slow drift in one of the testing machines) where the errors may be too small to be detected on a daily basis. In addition, it provides very clear indications of changes in the precision of the test results so that we can monitor random variation.
Long-term control - On a monthly basis, the computer gives a comprehensive statistical summary, by patient sex, of all the results obtained for each test during the previous month. It automatically provides the frequency distributions, with summary statistics including the monthly mean, the within-month standard deviation, coefficients of skewness and kurtosis, and a range of percentile values. It also examines the effect of age on the test result; calculates formulae for the constant, linear, and quadratic models; and estimates the goodness of fit in each case. The information is presented in a monthly analysis report, which has a set of six tables for each test. Month-by-month comparisons are facilitated by the presentation of the key statistics for each test, in the form of summary tables, in past chronological monthly sequence. This level of control has enabled us to detect variations in batches of reagents and batches of calibration sera, as well as in operator techniques. It is relevant to both random and systematic variability. In addition, by checking the effect of changes in the test equipment, we can monitor intrinsic errors. The monthly summaries also serve a very useful purpose in providing an overview of laboratory performance. Further applications of the information from the monthly analysis (such as the estimation of reference ranges) are outside the scope of this paper but will be discussed in Part II of this paper (Larsen, Williams, and Nicol³).

E. Conclusions
To summarize our experience from the laboratory viewpoint, we have found that the automated QC system has provided a number of major benefits:
1. Direct improvement in data quality due to the detection of all significant errors
2. Indirect improvement in data quality due to the feedback of information to the staff and the incentive to eliminate errors at source
3. Reduction in the need for secondary control samples
4. Removal of the subjective element in the assessment of quality standards
5. Assurance that our measurements are of a known and demonstrable standard of accuracy and precision.

III. QUALITY CONTROL PERSPECTIVES

A. Introduction
The preceding section of this paper gave a general description of the statistical quality control system in operation at the Shepherd Foundation. We will now discuss this system in greater detail from the quality control viewpoint. The control procedure is based on the cusum technique. This technique has been applied successfully in recent years to the control of industrial and biomedical processes (see, for example, Woodward and Goldsmith⁶ and Van Dobben de Bruyn⁵). It may be helpful to summarize the principles of this technique before discussing some of the special features of the Shepherd Foundation application. In broad terms, the function of a cusum procedure is to keep a running total of the amount by which the individual results exceed a predetermined reference value and to use this running total to decide whether there has been a change in the mean value of the results. In most typical cases, a shift in the mean level is not immediately apparent. It takes some time before the effect of the change is clear enough to be detected by visual inspection of the results on a conventional graph. Indeed, it is possible with conventional procedures that such a change would pass unnoticed. Figures 1 and 2 illustrate the clarity with which this technique succeeds in pinpointing the onset of a change in mean level. Figure 1 plots the serum calcium levels for each of 50 patients, and Figure 2 plots the corresponding cusum values. In fact, there was a change in mean level after 20 patients had been tested. This is not obvious from the conventional graph of the test results (Figure 1), but the cusum graph in Figure 2 makes it quite clear that a change has occurred. The remainder of this section assumes a basic familiarity with the principles of cusum operations.

B. Cusum Control Procedure
Our cusum control procedure has seven stages: selection, transformation, standardization, scaling, comparison with reference limits, calculation of cusum value, and comparison with decision limits. Figure 3 gives a flow-chart of the overall procedure.

1. Selection
It is our practice to specify truncation limits for every test. These limits define an acceptable range of observations which is much wider than the usual range of variations for the test concerned. Values which fall outside this range are either rogue results or extreme values recorded from symptomatic patients for tests such as serum glucose or triglycerides. The presence of such outliers will be recorded by the computer so that the value can be verified, but the outliers will be excluded from the normal analysis.

FIGURE 1. Test results of serum calcium determinations are plotted for 50 successive patients. A true change in mean value occurred between the first 20 patients and the last 30 patients. This is not readily apparent in this type of graph.

FIGURE 2. Cusum values corresponding to the test results plotted in Figure 1 clearly show the change in mean value which occurred after the 20th patient was tested.

2. Transformation
It is commonly believed that most biomedical observations have a Gaussian distribution. In fact, the only test in our experience which consistently gives a true Gaussian distribution is height. A few other variables have a distribution which is approximately Gaussian, but the vast majority of the tests in our biomedical profile have markedly skewed distributions. Such tests typically have a much greater spread of observations above the mean than below it, i.e., a positive skewness. Examples are hearing threshold, cardiovascular measurements, erythrocyte sedimentation rate, and some serum analytes such as total bilirubin, alkaline phosphatase, and triglycerides. Tests with a negative skewness are less common; examples are visual acuity and mean corpuscular hemoglobin. The effect of skewness on the performance of the control system is that positive skew increases the incidence of Type I errors (false alarms, i.e., the system indicates abnormality when in fact there has been no change in the mean level), while negative skew reduces the incidence of Type I errors. It appears that skewness has relatively little effect on the incidence of Type II errors (failure to detect a genuine shift in the mean level) (see Bissell² for details).


To overcome this problem, the system provides the facility to normalize the distribution of the test observations by means of a power transformation of the form y = (x + a)^b, where y is the transformed variable, x is the original variable, and a and b are constants specified by the user for the test in question. We have developed an ancillary computer program which receives as input the distribution of the original variable and calculates the optimum values of the constants a and b, such that the distribution of the transformed variables has zero skewness and minimum kurtosis. Note that this transformation is only relevant to daily control. Weekly and monthly quality control are performed at the level of the daily and monthly means, which by virtue of the Central Limit Theorem will be normally distributed irrespective of the shape of the parent distribution. Distributions which are consistently shown to be non-Normal (see Section IV.C.3) are transformed as described above, to correct the performance of the control system. In other cases, an indication of non-normality may be due to the presence of outliers. An additional test to detect such outliers is presently under development; this facility has not been given a high priority hitherto, since the truncation procedure (Section III.B.1) has proved to be quite effective in this context.
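The ancillary program itself is not listed in this paper, so the following Python sketch is only an illustration of the idea: it searches a grid of candidate constants a and b and keeps the pair whose transformed values have sample skewness closest to zero. The search ranges, the grid approach, and the function names are assumptions, and the kurtosis criterion is ignored for brevity.

```python
import numpy as np

def sample_skewness(y):
    """Moment coefficient of skewness, M3 / M2**1.5."""
    d = y - y.mean()
    return np.mean(d ** 3) / np.mean(d ** 2) ** 1.5

def fit_power_transform(x, shifts=None, exponents=None):
    """Grid search for a, b in y = (x + a)**b giving skewness nearest zero.

    Hypothetical stand-in for the ancillary program described in the text.
    """
    x = np.asarray(x, dtype=float)
    if shifts is None:                       # assumed search ranges
        shifts = np.linspace(0.0, np.ptp(x), 21)
    if exponents is None:
        exponents = np.linspace(0.1, 1.5, 29)
    best = (None, None, np.inf)
    for a in shifts:
        z = x + a
        if np.any(z <= 0):                   # power transform needs positive values
            continue
        for b in exponents:
            g1 = abs(sample_skewness(z ** b))
            if g1 < best[2]:
                best = (a, b, g1)
    return best                              # (a, b, |skewness| achieved)

# Example with positively skewed data, as for triglycerides or bilirubin
rng = np.random.default_rng(1)
raw = rng.lognormal(mean=0.5, sigma=0.6, size=500)
a, b, g1 = fit_power_transform(raw)
print(f"a = {a:.2f}, b = {b:.2f}, residual skewness = {g1:.3f}")
```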

FIGURE 3. A simplified flow chart of the automated quality control procedure.

3. Standardization
The object of standardizing the data is to reduce or eliminate random variability attributable to differences in patient age and sex. There are a number of tests where the residual variability of the observations is significantly affected by interpatient differences. It is useful to distinguish between three components: the variability due to the daily mix of patients of different ages and sex, the variability between patients with the same age/sex classification, and the variability within the individual patient. The relevance of these interpatient differences to the control system is that they all contribute to the overall random variability, or the background “noise” level. The task of detecting abnormalities in the data is like listening for an alarm signal against this background noise. If the background noise can be reduced, then the sensitivity of the control system will be improved. After examining several alternatives, it was decided to use age-specific and sex-specific coefficients to adjust all observations to an equivalent level for a standard age/sex classification. Each month, three alternative regression models are automatically fitted to each test variable according to sex. These models assume that age has no effect, a linear effect, and a quadratic effect, respectively, on the test observations. The goodness of fit of the model is automatically calculated in each case. This enables an appropriate model to be selected, and the standardization coefficients can be updated as appropriate (see Sections IV.B.4 and IV.B.5 for further details).

4. Scaling
All values are converted to a common scale of measurement with a mean of 50 and a SD of ten, giving an effective range in the region 0 to 100. Implicit in this procedure is the specification of the target values, representing the quality standards for each test. The procedure occupies a key position in the design of the QC system and it is described in some detail in Section III.C.

5. Comparison With Reference Limits
Once the original observations have been selected, transformed, standardized, and scaled, they are in the final form for analysis by the cusum procedure. The first step is to decide whether the value is within the normal range of variability around the target value which the user is prepared to tolerate before the cusum mechanism should be triggered. At this point, the scaled value is compared with the reference limits as specified by the user for the particular test concerned. The choice of reference limits is made concurrently with the choice of decision limits. The combination of the two pairs of values defines a particular cusum scheme. We use the tabulations of average run length (ARL), as presented in Table 1, as a guide in the selection of a scheme. The figures in Table 1 represent an extension of the data in Table 5 of Van Dobben de Bruyn.⁵ They are calculated for a two-sided scheme, whereas Van Dobben de Bruyn’s figures are for a one-sided scheme, and they are presented in a format which


enables the ARL to be read off as a function of the process average, in contrast to the previous publication of these values. The selection of a specific scheme for a particular test is governed by two opposing considerations. On the one hand, it is desirable for the scheme to be robust to random fluctuations, so that it should have a high ARL when the process average is acceptable (m = 0 in Table 1). On the other hand, the scheme should be sensitive to abnormalities, so that it should have a low ARL when the process average is rejectable. In practice, for most tests, we find that the scheme k = ±0.8, h = ±5.0 gives a good compromise between robustness and sensitivity. Translated into the scaled units, this gives reference limits of 42 to 58 and decision limits of ±50. If the test status is currently “green,” then the scaled values are allowed to fluctuate between the reference limits without any detriment to the test status. However, once a value which crosses either of these two thresholds is recorded, a cusum count begins.

6. Calculation of Cusum Value
Separate calculations take place for the upper and lower cumulations. If the sum of the cusum plus the scaled observation exceeds the upper reference limit, then the upper score is the (positive) difference; otherwise, the upper score is defined to be zero. Similarly, if the sum of the cusum plus the scaled observation is less than the lower reference limit, then the lower score is the (negative) difference, otherwise zero. The updated cusum value is the sum of the two scores.

7. Comparison With Decision Limits
The final stage in the control procedure is to decide from the updated cusum value whether the new status is green, amber, or red. If the cusum value is zero, then the test has status “green,” i.e., the observations are in control. However, if the value is nonzero, then it is compared with the decision limits to decide whether the status is “amber” or “red.” Nonzero cusum values which fall within the decision limits are regarded as of intermediate significance - not sufficiently extreme to justify the “red” status, but high enough to require close scrutiny. These are given the status “amber.” Values outside the decision limits are regarded as definitely out of control and are given the status “red.” In this way, the system is able to maintain a detailed, objective assessment of the relative status of control of each test at all times. Its ability to do this automatically, on a patient-by-patient basis, has proved in practice to represent an effective solution to our control requirements.
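A minimal sketch of stages 5 to 7, assuming the observation has already been transformed, standardized, and scaled to a mean of 50 and SD of 10; the limits correspond to the k = ±0.8, h = ±5.0 scheme quoted above (reference limits 42 and 58, decision limits ±50). Function and variable names are illustrative, not those of the Foundation's programs.

```python
from dataclasses import dataclass

@dataclass
class CusumState:
    value: float = 0.0      # running cusum, in scaled units
    status: str = "green"

def update_cusum(state, scaled_obs,
                 lower_ref=42.0, upper_ref=58.0, decision=50.0):
    """One step of the two-sided cusum described in stages 5 to 7."""
    total = state.value + scaled_obs
    upper_score = max(0.0, total - upper_ref)   # positive excess over upper reference
    lower_score = min(0.0, total - lower_ref)   # negative excess below lower reference
    state.value = upper_score + lower_score

    if state.value == 0.0:
        state.status = "green"                  # in control
    elif abs(state.value) < decision:
        state.status = "amber"                  # suspicious, not yet conclusive
    else:
        state.status = "red"                    # change in mean level signalled
    return state

# Example: a run of patients whose scaled results drift upwards
state = CusumState()
for y in [49, 52, 61, 63, 66, 64, 67, 65, 68, 70, 66]:
    update_cusum(state, y)
    print(f"scaled={y:3d}  cusum={state.value:6.1f}  status={state.status}")
```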

TABLE 1
Average Run Length

Columns give the process average (m); rows give the decision limits (h); each panel corresponds to one value of the reference limits (k).

Reference limits (k) = ±0
  h          m=0    ±0.4    ±0.8    ±1.2    ±1.6    ±2.0    ±2.4    ±2.8    ±3.2
  ±2.0      5.00    4.29    3.15    2.37    1.89    1.58    1.37    1.22    1.11
  ±3.0      8.65    6.76    4.46    3.22    2.54    2.12    1.83    1.61    1.42
  ±4.0      13.3    9.36    5.73    4.06    3.16    2.62    2.26    2.02    1.81
  ±5.0      19.0    12.0    6.99    4.89    3.79    3.11    2.66    2.35    2.13
  ±6.0      25.8    14.7    8.24    5.72    4.41    3.62    3.08    2.70    2.42
  ±8.0      42.3    19.8    10.7    7.39    5.66    4.62    3.92    3.42    3.06
  ±10.0     63.0    24.9    13.2    9.06    6.91    5.62    4.75    4.14    3.68

Reference limits (k) = ±0.2
  h          m=0    ±0.4    ±0.8    ±1.2    ±1.6    ±2.0    ±2.4    ±2.8    ±3.2
  ±2.0      7.95    6.09    3.90    2.74    2.10    1.72    1.46    1.28    1.16
  ±3.0      16.4    10.1    5.60    3.75    2.84    2.31    1.96    1.72    1.52
  ±4.0      30.2    14.6    7.28    4.75    3.55    2.86    2.42    2.13    1.92
  ±5.0      52.0    19.2    8.94    5.75    4.26    3.41    2.87    2.49    2.23
  ±6.0      85.5    23.9    10.6    6.75    4.98    3.97    3.33    2.88    2.54
  ±8.0       214    33.6    13.9    8.75    6.41    5.11    4.24    3.65    3.23
  ±10.0      410    43.2    17.3    10.7    7.84    6.19    5.14    4.42    3.89

Reference limits (k) = ±0.4
  h          m=0    ±0.4    ±0.8    ±1.2    ±1.6    ±2.0    ±2.4    ±2.8    ±3.2
  ±2.0      14.0    9.19    5.02    3.24    2.38    1.89    1.58    1.37    1.22
  ±3.0      36.8    16.8    7.43    4.49    3.22    2.54    2.12    1.83    1.61
  ±4.0      89.0    26.4    9.88    5.74    4.06    3.16    2.62    2.26    2.02
  ±5.0       207    38.0    12.4    6.99    4.89    3.79    3.11    2.66    2.35
  ±6.0       470    51.6    14.9    8.24    5.72    4.41    3.62    3.08    2.70
  ±8.0     2,400    84.6    19.9    10.7    7.39    5.66    4.62    3.92    3.42
  ±10.0        —     126    24.9    13.2    9.06    6.91    5.62    4.75    4.14

Reference limits (k) = ±0.6
  h          m=0    ±0.4    ±0.8    ±1.2    ±1.6    ±2.0    ±2.4    ±2.8    ±3.2
  ±2.0      27.0    15.0    6.83    3.96    2.74    2.10    1.72    1.46    1.28
  ±3.0      97.5    32.3    10.7    5.62    3.75    2.84    2.31    1.96    1.72
  ±4.0       330    60.0    14.9    7.28    4.75    3.55    2.86    2.42    2.13
  ±5.0     1,100     104    19.4    8.94    5.75    4.26    3.41    2.87    2.49
  ±6.0     3,700     171    24.0    10.6    6.75    4.98    3.97    3.33    2.88
  ±8.0    41,000     428    33.6    13.9    8.75    6.41    5.11    4.24    3.65
  ±10.0        *     820    43.2    17.3    10.7    7.84    6.19    5.14    4.42

Reference limits (k) = ±0.8
  h          m=0    ±0.4    ±0.8    ±1.2    ±1.6    ±2.0    ±2.4    ±2.8    ±3.2
  ±2.0      57.0    26.8    9.97    5.06    3.24    2.38    1.89    1.58    1.37
  ±3.0       295    72.8    17.3    7.44    4.49    3.22    2.54    2.12    1.83
  ±4.0     1,400     178    26.6    9.88    5.74    4.06    3.16    2.62    2.26
  ±5.0     7,000     412    38.1    12.4    6.99    4.89    3.79    3.11    2.66
  ±6.0    35,000     939    51.6    14.9    8.24    5.72    4.41    3.62    3.08
  ±8.0         *   4,700    84.6    19.9    10.7    7.39    5.66    4.62    3.92
  ±10.0        *       —     126    24.9    13.2    9.06    6.91    5.62    4.75

Reference limits (k) = ±1.0
  h          m=0    ±0.4    ±0.8    ±1.2    ±1.6    ±2.0    ±2.4    ±2.8    ±3.2
  ±2.0       130    52.1    15.9    6.86    3.96    2.74    2.10    1.72    1.46
  ±3.0     1,000     194    32.8    10.7    5.62    3.75    2.84    2.31    1.96
  ±4.0     7,000     659    60.3    14.9    7.28    4.75    3.55    2.86    2.42
  ±5.0         —   2,200     104    19.4    8.94    5.75    4.26    3.41    2.87
  ±6.0         *   7,400     171    24.0    10.6    6.75    4.98    3.97    3.33
  ±8.0         *       —     428    33.6    13.9    8.75    6.41    5.11    4.24
  ±10.0        *       *     820    43.2    17.3    10.7    7.84    6.19    5.14

Note: The average run length (ARL) at a given quality level is defined as the average number of observations which would be required to yield a cumulation to a decision limit, when the process average quality is stable at the specified level. The table gives the values of the ARL when the observation is a standard Normal variable (zero mean, unit variance) for a symmetrical two-sided cusum scheme and for different combinations of the reference limits, decision limits, and process average. The reference limits (k) and the process average (m) are measured in standard deviations from the target value, and the decision limits (h) are measured in standard deviations. Values less than 1000 are quoted to three significant figures; values in the range 1000 to 50,000 are quoted to two significant figures; values above 50,000 are indicated by an asterisk.


C. Special Features

1. Target Values

In terms of the design of the system, perhaps the most difficult problem was the question of target values. When a cusum scheme is used for the control of an industrial process, such as the manufacture of ball bearings, there is normally an obvious quality standard, the design specification. The objective of the quality control procedure would be to control the process average to that standard. In contrast, there is no quality standard for the control of biomedical data. It would be unacceptable, for example, to operate an automated analyzer on the basis that the hemoglobin reading must be controlled to give an average of, say, 15.00 g/dl. Such a dictum would represent an arbitrary standard and would be inappropriate to a fluctuating population of patients with intrinsic physiological variability and a changing demographic profile with different mixes of age, sex, and socioeconomic attributes, all of which are likely to affect both the mean and the standard deviation of the observations. Unfortunately, the cusum procedure - and indeed any form of statistical quality control - requires the specification of a quality standard. Therefore, it would appear at first sight that the cusum procedure, while being an admirable method in many industrial situations, is unsuitable for biomedical applications. The problem was overcome by using adaptive target values, automatically updated on the basis of the current observations. For each test, the smoothed mean and smoothed standard deviation (SD) are calculated by means of an exponentially weighted moving average as follows. For the daily scheme, where the individual observations correspond to successive patients, the smoothed mean is calculated from the preceding daily averages, and the smoothed SD is calculated from the squares of the differences between successive patient values. For the weekly scheme, where the individual observations are daily means and within-day SDs, the smoothed mean is calculated from the preceding weekly averages and the smoothed SD is


calculated from the squares of the differences between successive daily results. In other words, the target value for the individual observations is the short-term mean, an average over the previous few days. The target value for the daily means is the medium-term mean, an average over the previous few weeks. See Section IV.B.3 for details of the calculation procedure. The use of an adaptive target value for the individual observations would tend to inhibit the detection of short-term trends or within-week cycles. However, such movements would be picked up at the level of the weekly analysis. Similarly, long-term trends may be muffled at the weekly level but would be detected at the monthly level of control.

2. Scaling
All values are scaled to give a mean of 50 and a SD of ten. This is achieved by the transformation y = 50 + 10(x - m)/s, where y is the scaled value, x is the original value, m is the current smoothed mean, and s is the current smoothed SD. This procedure has two major benefits. First, it means that reference limits and decision limits are always measured in the same units for all tests; thus, there is complete consistency of specification. For example, when setting the reference and decision limits for constituents which exhibit wide biological variation (e.g., triglycerides, cholesterol) there is no need to take this variation specifically into account, since it will have been automatically allowed for in the calculation of the scaled value. The second major benefit of scaling is that it makes the results more intelligible. The cusum reports quote the scaled value of every test result. In practice, these results fall conveniently on to a simple 0 to 100 scale, with 50 as the average. For a Gaussian distribution of results, about 95% of the observations will fall in the range of 30 to 70. It is not immediately obvious whether a value of, say, 9.5 mg% for serum calcium is high, medium, or low. However, when this result is scaled to give a value of, say, 56, then it is clear at a glance that the result is just slightly above average.
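The scaled value in the example above can be reproduced directly from the transformation; the smoothed mean and SD used here (9.2 and 0.5 mg%) are assumed values chosen only to make the arithmetic match the illustration in the text.

```python
def scale(x, smoothed_mean, smoothed_sd):
    """Scale a raw result to the common 0-100 scale (mean 50, SD 10)."""
    return 50.0 + 10.0 * (x - smoothed_mean) / smoothed_sd

# Assumed smoothed mean and SD for serum calcium; 9.5 mg% then scales to 56.
print(scale(9.5, smoothed_mean=9.2, smoothed_sd=0.5))   # 56.0
```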

3. Other Features
As described in Section III.B, the system has several rather unusual features which were necessitated by the requirements of the data environment. For example, it provides facilities for operating within different time frames, for cen-


soring extreme observations, for data transformation, and for standardization to eliminate the effect of age and sex on the test observations. It is noteworthy that the system is fully computerized. The majority of cusum schemes, even in the industrial environment, are still operated on a manual basis. In our case, the user has the option of specifying particular values for truncation limits; transformation coefficients; and coefficients for age/sex standardization, reference limits, and decision limits. Default values are provided in each case if the option is not exercised. Once the appropriate values have been specified, the user takes no further action in terms of the detailed control procedure.

IV. STATISTICAL PERSPECTIVES

A. Introduction

Several of the facilities provided by the automated QC system depend on the routine computation of statistical algorithms. The extent to which

the various control procedures make use of statistical techniques is relatively low in the case of the daily control system, moderate in the case of the weekly control system, and high in the case of the monthly control system which produces several statistical summaries. In line with the general concept of this paper, the notes in this section have the objective of describing the various techniques that have been used. For discussion purposes, they are divided into two groups: estimation procedures and significance testing.

B. Estimation Procedures

1. Distribution Statistics
The monthly analysis program calculates the within-month mean (m), standard deviation (s), skewness coefficient (g₁), and kurtosis coefficient (g₂) for each test. The conventional unbiased estimates are used in each case. Let Mⱼ = Σ(x - m)ʲ/n, where x denotes the observed value, m is the sample mean, and n is the number of observations. Then the relevant formulae are as follows:

Mean:                   m = Σx/n
Standard deviation:     s = (c₁M₂)^½, where c₁ = n/(n - 1)
Skewness coefficient:   g₁ = c₂M₃/s³, where c₂ = n²/{(n - 1)(n - 2)}
Kurtosis coefficient:   g₂ = c₃M₄/s⁴ - 3c₄, where c₃ = n²(n + 1)/{(n - 1)(n - 2)(n - 3)} and c₄ = (n - 1)²/{(n - 2)(n - 3)}
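A direct transcription of these estimators into Python (the original programs were written in BASIC); the simulated column of observations is illustrative only.

```python
import numpy as np

def distribution_statistics(x):
    """Mean, SD, skewness (g1) and kurtosis (g2) using the unbiased
    estimators quoted above; x is one month of results for one test."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    m = x.mean()
    M2 = np.mean((x - m) ** 2)
    M3 = np.mean((x - m) ** 3)
    M4 = np.mean((x - m) ** 4)

    c1 = n / (n - 1)
    s = np.sqrt(c1 * M2)

    c2 = n ** 2 / ((n - 1) * (n - 2))
    g1 = c2 * M3 / s ** 3

    c3 = n ** 2 * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    c4 = (n - 1) ** 2 / ((n - 2) * (n - 3))
    g2 = c3 * M4 / s ** 4 - 3 * c4

    return m, s, g1, g2

# Illustrative data only: for a Normal sample, g1 and g2 should be near zero
rng = np.random.default_rng(0)
print(distribution_statistics(rng.normal(50, 10, size=400)))
```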

2. Percentiles
The monthly analysis report tabulates the estimated values of the following percentiles: 1, 2.5, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 97.5, and 99. These are interpolated by the program from the cumulative frequencies above each cell boundary value in a frequency table, making use of the minimum and maximum data values.
The frequency table, which shows the distribution of observations within the current month, is calculated automatically for each test. Cell boundaries are selected according to the range of the data to give a maximum of 15 cells. The selection procedure is based on the principle that the class interval must take the value kp, where p is 1, 2, or 5, and k is an appropriate power of ten. The boundary values are also constrained to be multiples of the class interval. These constraints ensure that the choice of classes is consistent with the choice that would have been made on an intuitive basis.

3. Exponential Smoothing
As discussed in Section III.C.1, adaptive target values are calculated for each test using the technique of exponential smoothing.
The procedure for estimating the smoothed mean is conventional, with the new forecast being the weighted average of the current value and the previous forecast. This, of course, is equivalent to taking a weighted average of the complete time series, with the weights decaying geometrically, as follows:

x̄ₜ = αxₜ + (1 - α)x̄ₜ₋₁

where x̄ₜ is the smoothed mean at time t, xₜ is the actual data value at time t, and α is the smoothing constant. The QC system uses a constant value of 0.05 for the smoothing constant. This has proved to be adequate for all tests.
For random x, the standard error of the forecast mean is the same as would be obtained from the conventional unweighted average of a sample of 2/α - 1 observations. With α = 0.05, this corresponds to 39 observations.
To initialize a new test, the smoothing constant



for the ith run is given the value 2/(i + 1), so that α takes successive values of 1, 2/3, 1/2, 2/5, ... before stabilizing at 0.05 for the 39th and successive observations. This procedure gives straight-line depreciation of the weights over the initialization period; for example, x̄₄ = (4x₄ + 3x₃ + 2x₂ + x₁)/10. To calculate the SD of the noise level, bearing in mind that the data may be drifting, we compute the smoothed value of the square of the difference between successive observations. The smoothed SD can then be estimated as the square root of half the smoothed square difference.
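The following sketch combines the smoothing formula, the 2/(i + 1) initialization schedule, and the estimate of the smoothed SD from half the smoothed squared successive difference. The class name and interface are assumptions; only the arithmetic follows the text.

```python
class AdaptiveTarget:
    """Smoothed mean and SD for one test, as in Sections III.C.1 and IV.B.3."""

    def __init__(self, alpha=0.05):
        self.alpha_final = alpha
        self.i = 0
        self.mean = None
        self.sq_diff = None      # smoothed squared successive difference
        self.previous = None

    def update(self, x):
        self.i += 1
        a = max(self.alpha_final, 2.0 / (self.i + 1))   # initialization schedule
        if self.mean is None:
            self.mean = x
        else:
            self.mean = a * x + (1.0 - a) * self.mean
            d2 = (x - self.previous) ** 2
            self.sq_diff = d2 if self.sq_diff is None else a * d2 + (1.0 - a) * self.sq_diff
        self.previous = x
        return self.mean

    @property
    def sd(self):
        # Var(x_t - x_{t-1}) = 2 Var(x) for uncorrelated noise, hence the half
        return None if self.sq_diff is None else (self.sq_diff / 2.0) ** 0.5

# Illustrative usage with a few successive serum calcium results
target = AdaptiveTarget()
for x in [9.4, 9.6, 9.3, 9.8, 9.5]:
    target.update(x)
print(round(target.mean, 2), round(target.sd, 2))
```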

4. Regression Analysis
The monthly analysis report presents the coefficients of three alternative regression models which estimate the effect of age on the test results:

Constant model:    y = a
Linear model:      y = a + bx
Quadratic model:   y = a + bx + cx²

In each case, y is the observed test result and x is the patient’s age. The estimation procedure is by the standard least-squares method. In contrast to the conventional situation where a single study might be carried out to assess the effect of age on a single biomedical attribute, we are in the situation where the study is replicated automatically, month after month, for each of the 60 tests in our profile. Therefore, the tabulation of comparative monthly results in chronological sequence for each test is unusually fertile as an information source. It is of intrinsic clinical value as an empirical record of the effect of the aging process. It is also of considerable statistical value because it enables the standard errors of the regression coefficients to be calculated from their actual distribution, thus enabling a direct comparison with the values predicted by regression theory. This facility for replicated regression analysis is still rare in statistical applications. Section IV.C describes the method of assessing the fit of the alternative regression models.
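As an illustration of the monthly fit, the sketch below uses ordinary least squares (via numpy.polyfit, a stand-in for the original routine) to fit the constant, linear, and quadratic models and compares their residual sums of squares; the simulated ages and results are invented.

```python
import numpy as np

def fit_age_models(age, y):
    """Fit the constant, linear and quadratic models by least squares and
    return their coefficients and residual sums of squares."""
    age = np.asarray(age, dtype=float)
    y = np.asarray(y, dtype=float)
    fits = {}
    for name, degree in (("constant", 0), ("linear", 1), ("quadratic", 2)):
        coeffs = np.polyfit(age, y, degree)          # highest power first
        residuals = y - np.polyval(coeffs, age)
        fits[name] = (coeffs, float(np.sum(residuals ** 2)))
    return fits

# Illustrative data: a weak linear age effect plus noise
rng = np.random.default_rng(2)
age = rng.uniform(15, 90, size=300)
y = 100 + 0.4 * age + rng.normal(0, 8, size=300)
for name, (coeffs, rss) in fit_age_models(age, y).items():
    print(f"{name:9s}  RSS = {rss:9.1f}  coefficients = {np.round(coeffs, 3)}")
```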

5. Standardization
As mentioned in Section III.C, the regression analyses are used to update the values of the standardization coefficients for particular tests where the effects of age and/or sex are pronounced. Standardization coefficients are supplied for each test, either explicitly or by default values


of zero. These coefficients are for the constant, linear, and quadratic terms of the selected model (if the linear model is selected, for example, then the quadratic coefficient would be zero). Separate values are supplied for males and females. The computer uses these coefficients to calculate the predicted values for a standard 40-year-old male and also for each patient, taking into account his or her actual age and sex. The residual for each patient (actual value minus age/sex-specific predicted value) is added to the predicted value for the 40-year-old male, which serves as a standard reference level. This procedure reduces the random noise level of the data prior to the execution of the cusum test. The presence of a significant quadratic term does not necessarily indicate that a quadratic model is the best possible choice; it merely indicates that this model is better than a linear model. Further research would be required in such cases to establish whether some additional alternative (such as an asymptotic relationship) would be the most suitable selection. Such refinement is unnecessary for control purposes, where the choice between the constant, linear, and quadratic models has proved in practice to be simple but effective. An alternative approach to standardization would have been the use of age-group means for each sex rather than the use of regression analysis. However, this would have had the disadvantage of increasing the number of parameters to be calculated, stored, and updated for each test. For example, if five-year age-groups are used, then 15 age-specific means would be needed to cover the range from age 15 to age 90. In contrast, a quadratic model requires only three parameters.
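A sketch of this standardization step: the patient's residual from his or her own age/sex prediction is added to the prediction for a standard 40-year-old male. The coefficient values below are invented purely for illustration.

```python
def predict(coeffs, age):
    """Constant, linear and quadratic terms: y = a + b*age + c*age**2."""
    a, b, c = coeffs
    return a + b * age + c * age ** 2

def standardize(value, age, sex, coeffs_by_sex, reference_age=40):
    """Adjust an observation to the level of a standard 40-year-old male."""
    residual = value - predict(coeffs_by_sex[sex], age)
    return predict(coeffs_by_sex["M"], reference_age) + residual

# Invented coefficients for illustration only (a cholesterol-like analyte);
# the linear model is selected, so the quadratic term is zero.
coeffs = {"M": (150.0, 1.2, 0.0),
          "F": (140.0, 1.5, 0.0)}
print(standardize(230.0, age=62, sex="F", coeffs_by_sex=coeffs))
```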

C. Significance Testing
The second group of statistical techniques that are used in the QC system are the significance tests. In all cases, the significance levels are indicated by quoting the percentile value of the relevant test statistic on the basis of the appropriate null hypothesis. For example, if a chi-squared test on 5 degrees of freedom gave a test statistic of 3.00, the entry “P = 30” would be printed to denote that 3.00 is the 30th percentile of the χ²(5) distribution. This presentation is readily assimilated by the layman and has practical advantages as compared with the more conventional approach of quoting significance at particular threshold levels. The approach we have adopted


necessitates the use of some special algorithms to compute the percentiles in each case. Details of the various significance tests are as follows.

1. Between-day Analysis of Variance
The first of two applications of the analysis of variance in the QC system occurs in the weekly control procedure. The total sum of squares of all unadjusted observations during the week is partitioned for each test into two components, between days and within days. This would typically give an F-ratio based on four and several hundred degrees of freedom. The percentile value of the F-ratio is obtained by a look-up procedure. This routine tabulation is useful for detecting anomalies due to small transient systematic errors.
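A sketch of this weekly partition, using scipy's one-way analysis of variance as a stand-in for the original partition and look-up procedure; the simulated days of results are illustrative only.

```python
import numpy as np
from scipy import stats

def between_day_percentile(days):
    """One-way analysis of variance between days; returns the F-ratio and
    the percentile of F under the null hypothesis of equal daily means."""
    f_ratio, p_value = stats.f_oneway(*days)
    return f_ratio, 100.0 * (1.0 - p_value)     # percentile, as printed by the system

# Five simulated days of results for one test
rng = np.random.default_rng(3)
week = [rng.normal(50, 10, size=80) for _ in range(5)]
f, pct = between_day_percentile(week)
print(f"F = {f:.2f}, percentile = {pct:.0f}")
```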

2. Homogeneity of Variance
Bartlett's test is also used in the weekly control procedure to establish whether the within-day variances are significantly different from each other. The test statistic, under the null hypothesis, is distributed as chi-squared on k - 1 degrees of freedom, where k is the number of days. To determine the corresponding percentile, the algorithm expresses the tail area as an incomplete gamma integral and calculates its value using the method of Bhattacharjee.¹

3. Test for Normality
As discussed in Section III.B.2, the performance of the control system is affected by skewness in the distribution of the test results. To overcome this problem, the system provides the facility to normalize the distribution, if appropriate, by means of a suitable transformation. In order to establish whether each empirical frequency distribution can be reasonably regarded as having been drawn from a Normal parent population, the monthly analysis program tests whether the skewness and kurtosis coefficients are significantly different from zero. Under the hypothesis that the parent population is Normal, it follows that the coefficients g₁ and g₂ represent unbiased estimates of the population coefficients and would be normally distributed (for large samples) with zero mean and known standard errors. The algorithm used in this analysis employs the exact formulae for the standard errors, which depend only on n, the sample size, as follows:

S.E.(g₁) = [6n(n - 1)/{(n - 2)(n + 1)(n + 3)}]^½
S.E.(g₂) = [24n(n - 1)²/{(n - 3)(n - 2)(n + 3)(n + 5)}]^½

For large n, these approximate (6/n)^½ and (24/n)^½, respectively. This enables g₁ and g₂ to be expressed as standard Normal variables, and the corresponding percentiles are calculated by a conversion from the deviate to the tail area, using a look-up procedure. As stated in Section III.B.2, the test for normality is sensitive to the presence of outliers, and a supplementary test for outliers is currently under development. This will have the effect of censoring extreme observations, and may eventually replace the present truncation procedures.
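The standard errors quoted above, and the conversion of the resulting deviates to percentiles, can be sketched as follows; the Normal tail area is computed here with math.erf rather than the original look-up procedure, and the example values are illustrative.

```python
import math

def skewness_kurtosis_test(g1, g2, n):
    """Exact standard errors of g1 and g2 under a Normal parent, the
    corresponding standard Normal deviates, and their percentiles."""
    se_g1 = math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_g2 = math.sqrt(24.0 * n * (n - 1) ** 2 /
                      ((n - 3) * (n - 2) * (n + 3) * (n + 5)))
    z1, z2 = g1 / se_g1, g2 / se_g2

    def percentile(z):
        # standard Normal cumulative probability, expressed as a percentile
        return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))

    return (z1, percentile(z1)), (z2, percentile(z2))

# Example: modest positive skewness in a sample of 400 observations
print(skewness_kurtosis_test(g1=0.25, g2=0.10, n=400))
```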

4. Regression Analysis
The second application of the analysis of variance is ancillary to the monthly regression analysis (Section IV.B), which assesses the sex-specific effect of age on the test result. The total sum of squares of all observations for the month is partitioned into three components: the linear effect (one degree of freedom), the additional quadratic effect (also with one degree of freedom), and the residual component (typically with several hundred degrees of freedom). To establish the significance levels for the linear and additional quadratic components, we make use of the fact that an F-ratio on 1 and n degrees of freedom is distributed as the square of a Student's t variate on n degrees of freedom, and we obtain the appropriate percentile by calculating the tail area of the latter distribution, using an algorithm supplied by Taylor.⁴ Non-normality has not been found to be a problem in this context. The values of the two percentiles usually make it quite clear whether the null, the linear, or the quadratic model should be selected.

D. Conclusions
Most statistical applications are of an ad hoc nature in a research environment. The examples given in this section are of a routine nature in a control environment. Because these techniques are used on a routine automated basis, the algorithms must be rigorous and comprehensive. Because they are used in a control environment, the end product must be helpful to the laboratory staff responsible for the successful operation of the control procedures. The comments in Section II, based on about 18 months of experience using the system, confirm that these statistical applications make an important contribution to its successful operation.



V. COMPUTER SYSTEM PERSPECTIVES

A. Introduction
For the purposes described in this paper, the Shepherd Foundation has developed an integrated system for the analysis of patient data. This system has the acronym SOCRATES, for Shepherd On-line Control, Retrieval, and Test Examination System. In principle, the system may be used with any machine-readable patient data files as input. The input source for the present application is the patient file created by the “Medidata” system. This is a proprietary system developed by Searle Medidata, Inc. for the entry, storage, and retrieval of test results. The SOCRATES system is written in the BASIC language and operates on the PDP-8/I computer under the OS/8 operating system. It comprises six subsystems as outlined below.
Data access subsystem - This is the interface between the Medidata system and the SOCRATES system. It enables the user to specify on a master file the particular tests that he is interested in and instruct the computer to extract the results of those tests from each current patient record in the Medidata files. The subsystem invokes several assembly language functions for this purpose. It creates a daily patient-data file, which is the main input to all other subsystems. No information regarding individual patients can be extracted or disseminated without the express permission of the patient. The Foundation is highly conscious of the need to maintain complete confidentiality of patient information and does not permit any breach of personal privacy.
Quality control subsystem - All relevant data are extracted from the daily patient-data file and are written to a daily test-data file. This latter file stores the test results for each successive patient and each controlled test in turn and is the input to the daily, weekly, and monthly QC programs. The relevant information for all the tests that are subject to routine control is stored on the QC master file, which is known as the “parameter file.”
Association analysis subsystem - The user maintains a third master file in order to direct the collection of data for specific research studies. He can nominate a set of tests which are believed to be relevant to a particular investigation (e.g., menstrual disorders or respiratory distress), and



the computer will build up a file of data which contains the results of each of the specified tests for each relevant patient. When sufficient data have accumulated, the data file is analyzed statistically.
Profile analysis subsystem - A fourth master file is maintained for directing the data collection for matched comparisons. The computer will check each patient’s data to see whether the results are consistent with a specified biomedical profile. If so, the data are stored for future reference, together with the data for a control patient who has been selected by the computer on the basis of a matching control profile. As with the association analysis system, the user will request an analysis of the data once a sufficient volume has accumulated. In this way, it becomes possible to compare the biomedical profile for symptomatic and nonsymptomatic patients of matched backgrounds.
Statistical analysis subsystem - This is an integrated suite of statistical programs which can be called in by the user on an ad hoc basis to assist in the analysis of data. The sequence of execution is controlled by a monitor program in response to teletype commands. It provides comprehensive facilities for file manipulation, univariate analysis, bivariate analysis, multivariate analysis, and qualitative analysis.
Tape register subsystem - This is a housekeeping system which automates the routine task of managing the selection and rotation of the magnetic tapes that are required by the various components of the SOCRATES system.

B. Quality Control Subsystem
Having placed the QC subsystem in its context, the remainder of this section will give a brief description of the seven programs in this subsystem. They are numbered QC01 to QC07. The information flow between these programs is summarized diagrammatically in Figure 4.
Program QC01 is used to maintain the QC master file or “parameter file.” Two sets of data items are specified (explicitly or by default) for each test which is to be monitored. The first set of data gives references which enable the relevant observations to be extracted from the daily patient-data file. The second set gives the control specification for QC purposes (e.g., transformation coefficients, standardization coefficients, reference and decision limits).


FIGURE 4. A simplified block diagram of the interaction between the computer programs in the automated quality control system.

A third set of data for each test is also present on the parameter file. This comprises the QC control statistics (current cusum value, status, run length, smoothed mean, smoothed square difference, etc.). They are automatically updated during each run.
Program QC02 extracts the relevant data from the daily patient-data file and creates a daily test-data file which is read each day by Programs QC03, QC04, and QC05. Program QC03 is responsible for the control of data quality at the patient level and is run each day. For each test, it analyzes the results

for successive patients, summarizes the control status at the end of the day, and prints a detailed tabulation of the results if abnormalities have been detected. Program QC04 updates a weekly data file, which is used as input to program QC06. This file contains daily summary statistics for each test. Program QC05 updates an historical data file which is used as input to program QC07. Since the latter program calculates monthly frequency tables and percentiles, information on the historical data file is at patient level. Program QC06 is responsible



for the control of data quality at the daily level and is run each week. For each test, it performs a cusum analysis for the daily means and the daily standard deviations, an analysis of variance between and within days, and a test for homogeneity of variance. Program QC07 provides the monthly summary of the results for each test. Separate analyses are performed for male and female patients, and the analysis in each case may be repeated with and without the outlying observations that may have been censored by virtue of the initial selection procedure. The monthly analysis report includes the frequency distribution of the observations, a set of percentile values, various summary statistics, an analysis of variance to show the significance of alternative models relating the test result to the patient’s age, the coefficients of the alternative models, and a summary of the key statistics for each of the previous 20 months. Further details are given in Section IV.

C. Operating Procedure
All of the programs in the SOCRATES system are run under the control of an overall monitor program. When this program is executed, it reads the current tape register, which is stored on a master file. This register holds the current status of each of the tapes in the SOCRATES tape library. The monitor program prints the date when the register was last updated and asks which job is to be executed. There are ten alternatives; five are for master file maintenance and five are for routine production runs. In each case, the selected job is indicated by a three-letter acronym, as listed below.

Job   Title                                        Acronym   Programs executed
1     Tape Register - update master file           TRU       TR01
2     Data Access - update master file             DAU       DA01
3     Quality Control - update master file         QCU       QC01
4     Association Analysis - update master file    AAU       AA01
5     Profile Analysis - update master file        PAU       PA01
6     Quality Control - daily run                  QCD       See below
7     Quality Control - weekly run                 QCW       QC06
8     Quality Control - monthly run                QCM       QC07
9     Association Analysis - monthly run           AAM       AA03
10    Profile Analysis - monthly run               PAM       PA03

Job 1 enables the user to change the status of the tapes in the current library, e.g., from a scratch file to a monthly summary file or vice versa. It also provides recovery facilities. Jobs 2 to 5 enable the user to modify the master files which control the selection and specification of tests and studies. Job 6, the QC daily run, will successively execute programs DA02, QC02, QC03, QC04, QC05, AA02, and PA02; hence, it is responsible not only for daily control, but also for the daily collection of new data for subsequent analysis. Jobs 7 and 8 are responsible for quality control at the weekly and monthly levels. Jobs 9 and 10, the Association Analysis and Profile Analysis monthly runs, create files which are suitable for input to the Statistical Analysis programs. These programs are used to analyze the data on an ad hoc basis. All routine runs are designed to simplify the operating procedure as much as possible and to remove from the operator the responsibility for tape selection. For example, when the operator specifies the selected job, the monitor program lists all the tapes required for that job, together with their visible reel numbers. This type of control is maintained throughout the system. The operating procedures are now well established, and the system is running well on a routine basis.


REFERENCES

1. Bhattacharjee, G. P., Algorithm AS 32: The incomplete gamma integral, Appl. Stat., 19, 285, 1970.
2. Bissell, A. F., Cusum techniques for quality control, Appl. Stat., 18, 1, 1970.
3. Larsen, L. H., Williams, R. D. B., and Nicol, G. R., Automated quality control of biomedical data. II. Data analysis, to be submitted.
4. Taylor, G. A. R., Algorithm AS 27: The integral of Student's t-distribution, Appl. Stat., 19, 113, 1970.
5. Van Dobben de Bruyn, C. S., Cumulative Sum Tests: Theory and Practice, Griffin's Statistical Monographs and Courses, No. 24, Charles Griffin, London, 1968.
6. Woodward, R. H. and Goldsmith, P. L., Cumulative Sum Techniques, Imperial Chemical Industries Monograph No. 3, Oliver and Boyd, London, 1964.


